WO2021217987A1 - Text summary generation method and apparatus, computer device, and readable storage medium - Google Patents

Text summary generation method and apparatus, computer device, and readable storage medium Download PDF

Info

Publication number
WO2021217987A1
Authority
WO
WIPO (PCT)
Prior art keywords
candidate
candidate abstract
abstract
score
redundancy
Prior art date
Application number
PCT/CN2020/112349
Other languages
English (en)
French (fr)
Inventor
郑立颖
徐亮
阮晓雯
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2021217987A1 publication Critical patent/WO2021217987A1/zh

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/34 Browsing; Visualisation therefor
    • G06F 16/345 Summarisation for human users
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Definitions

  • This application relates to the technical field of natural language processing in artificial intelligence, and in particular to a text summary generation method and apparatus, a computer device, and a readable storage medium.
  • Current methods of automatic text summarization fall into two main categories: extractive summarization and abstractive (generative) summarization.
  • extractive summarization mainly extracts important sentences from the original text and assembles them into an abstract subject to constraints such as the number of sentences and the number of words.
  • commonly used methods include TextRank and its extensions.
  • their advantage is that sentences are taken directly from the text, so the fluency of the abstract is generally good, but its ability to generalize the content is poorer.
  • abstractive summarization, in contrast, distills and rewrites the original content into a new piece of summary text, which is closer to how a human summarizes. However, abstractive summarization requires a Seq2Seq model and relies on annotated data for model training, which is generally difficult. When the amount of training data is small and the model is insufficiently trained, the resulting abstract does not meet expectations; for example, redundant words may appear, which harms the readability of the automatically generated abstract.
  • the first aspect of the present application provides a text summary generation method, the method including:
  • acquiring text information to be processed; converting the text information into word vectors; inputting the word vectors into a pre-trained preset neural network model and decoding with a beam search algorithm to obtain a candidate abstract set of the text information and the log-likelihood probability value of each candidate abstract in the candidate abstract set; acquiring a target redundancy score of each candidate abstract, where the target redundancy score indicates the degree of redundancy of the words in the candidate abstract; acquiring a reference score of each candidate abstract according to its target redundancy score and log-likelihood probability value; and selecting, from the candidate abstracts, an abstract whose reference score is greater than a preset reference score as the abstract corresponding to the text information.
  • a second aspect of the present application provides a text abstract generation device, the device includes:
  • the first obtaining module is used to obtain the text information to be processed
  • a conversion module for converting the text information into word vectors
  • the training module is used to input the word vector into the pre-trained preset neural network model and decode with a beam search algorithm to obtain the candidate abstract set of the text information and the log-likelihood probability value of each candidate abstract in the candidate abstract set;
  • the second obtaining module is configured to obtain a target redundancy score of each candidate abstract in the candidate abstract set, where the target redundancy score indicates the degree of redundancy of words in the candidate abstract;
  • the third obtaining module is configured to obtain the reference score of each candidate abstract according to the target redundancy score and the log-likelihood probability value of each candidate abstract;
  • the abstract selection module is configured to select abstracts with a reference score greater than a preset reference score from the candidate abstracts as the abstract corresponding to the text information.
  • a third aspect of the present application provides a computer device, including a memory, a processor, and a computer program stored in the memory and runnable on the processor, where the processor, when executing the computer program, implements the steps of the above text summary generation method, culminating in selecting, from the candidate abstracts, an abstract whose reference score is greater than a preset reference score as the abstract corresponding to the text information.
  • a fourth aspect of the present application provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the same steps, including selecting, from the candidate abstracts, an abstract whose reference score is greater than a preset reference score as the abstract corresponding to the text information.
  • FIG. 1 is a schematic flowchart of a method for generating a text abstract in an embodiment of the present application
  • FIG. 2 is a schematic diagram of a process for obtaining a first redundancy score in an embodiment of the present application
  • FIG. 3 is a schematic diagram of a process for obtaining a second redundancy score in an embodiment of the present application
  • FIG. 4 is a schematic diagram of a process for obtaining a third redundancy score in an embodiment of the present application
  • FIG. 5 is a schematic diagram of a process for obtaining the reference score of each candidate abstract in an embodiment of the present application
  • FIG. 6 is a schematic diagram of another process for obtaining the reference score of each candidate abstract in an embodiment of the present application.
  • FIG. 7 is a schematic diagram of a structure of a text summary generating apparatus in an embodiment of the present application.
  • FIG. 8 is a schematic diagram of a structure of a computer device in an embodiment of the present application.
  • the embodiment of the present application provides a text summary generation method. Specifically, as shown in FIG. 1, the method may include the following steps:
  • S10: Obtain the text information to be processed.
  • in one embodiment, the corresponding text information can be collected for scenarios in which a summary is needed.
  • the text information can include, but is not limited to, text such as Internet news, blogs, reports, and papers.
  • for example, the text information may be news-related text, blog-related text, or a combination of the two, such as news together with blog posts.
  • S20: Convert the text information into word vectors.
  • in one embodiment, the currently acquired text information to be processed is converted into word vectors.
  • specifically, the Word2Vec method may be used to convert the to-be-processed text information into word vectors.
  • in this embodiment, the acquired text information is converted into word-vector form through Word2Vec; of course, other methods such as one-hot encoding can also be used for the conversion, which is not limited here.
  • in addition, the word vectors can also be chosen according to the scale of the corpus.
  • if the corpus is large, the word vectors can be trained from scratch, for example by directly calling Word2Vec in the GenSim library, as sketched below; if the corpus is not large enough, an existing word-vector file can be used, such as word vectors already trained on a public corpus obtained online, with the choice depending on how well it fits the actual scenario.
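  • As an illustration of the self-training path mentioned above, the following is a minimal sketch, not taken from the patent itself, of training word vectors with GenSim's Word2Vec; the tiny tokenized corpus and the hyperparameters are placeholder assumptions.

```python
# Minimal sketch: train word vectors with GenSim (assumes gensim >= 4.0, where the
# dimensionality argument is named `vector_size`; older versions use `size`).
from gensim.models import Word2Vec

# Placeholder corpus: each document is pre-tokenized into a list of words.
corpus = [
    ["text", "summarization", "reduces", "reading", "time"],
    ["beam", "search", "keeps", "the", "top", "k", "candidates"],
]

model = Word2Vec(sentences=corpus, vector_size=100, window=5, min_count=1, epochs=10)

# Look up the vector for a word and convert a document into a list of word vectors.
vec = model.wv["summarization"]
doc_vectors = [model.wv[w] for w in corpus[0] if w in model.wv]
```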
  • S30: Input the word vectors into a pre-trained preset neural network model and decode with a beam search algorithm to obtain the candidate abstract set of the text information and the log-likelihood probability value of each candidate abstract in the candidate abstract set.
  • in one embodiment, the preset neural network model is a Seq2Seq model with an attention mechanism.
  • specifically, the Seq2Seq model structure can be built first; the Seq2Seq model can include two parts, an encoder and a decoder, which can specifically be a recurrent neural network (RNN) structure or a long short-term memory (LSTM) structure. The converted word vectors are fed into the pre-trained preset neural network model through an embedding layer, and the training objective can be maximum likelihood estimation (MLE).
  • during decoding, the candidate abstract set can be output, and the log-likelihood probability value logP of each candidate abstract can be obtained.
  • to this end, the beam search algorithm (Beam Search) is introduced.
  • Beam Search is a heuristic search algorithm, usually used when the search space is relatively large. It prunes lower-quality nodes and keeps higher-quality ones, mainly in order to reduce the space and time occupied by the search. With Beam Search, a collection of candidate abstracts can be obtained for the text information.
  • in one embodiment, the probabilities corresponding to the candidate abstracts can be sorted and the top-k abstracts by probability selected as the candidate abstract set, for example the top 10; the value of k can be set according to the actual situation and is not limited here.
  • the core of using maximum likelihood estimation (MLE) as the training objective of the Seq2Seq model is to obtain the candidate abstracts with the largest occurrence probability, which can improve the accuracy of the candidate abstract set.
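  • To make the decoding step concrete, below is a minimal, framework-agnostic sketch, illustrative rather than the patent's code: `step_log_probs(prefix)` is a hypothetical function returning the log-probability of each vocabulary token given the current decoder prefix, and each finished hypothesis carries its accumulated log-likelihood logP.

```python
from typing import Callable, List, Tuple

def beam_search(step_log_probs: Callable[[List[int]], List[float]],
                bos: int, eos: int, beam_width: int = 10,
                max_len: int = 50) -> List[Tuple[List[int], float]]:
    """Return up to `beam_width` candidate token sequences with their accumulated log-likelihood logP."""
    beams: List[Tuple[List[int], float]] = [([bos], 0.0)]   # (prefix, accumulated logP)
    finished: List[Tuple[List[int], float]] = []
    for _ in range(max_len):
        expanded: List[Tuple[List[int], float]] = []
        for prefix, score in beams:
            log_probs = step_log_probs(prefix)               # log P(token | prefix) for every vocabulary token
            # keep only the top `beam_width` continuations of this prefix
            top = sorted(range(len(log_probs)), key=lambda t: log_probs[t], reverse=True)[:beam_width]
            for tok in top:
                expanded.append((prefix + [tok], score + log_probs[tok]))
        # prune: keep the globally best `beam_width` partial hypotheses
        expanded.sort(key=lambda b: b[1], reverse=True)
        beams = []
        for prefix, score in expanded[:beam_width]:
            (finished if prefix[-1] == eos else beams).append((prefix, score))
        if not beams:
            break
    return sorted(finished + beams, key=lambda b: b[1], reverse=True)[:beam_width]
```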
  • S40: Obtain a target redundancy score of each candidate abstract in the candidate abstract set, where the target redundancy score represents the degree of redundancy of words in the candidate abstract.
  • based on the acquired candidate abstract set, the target redundancy score of each candidate abstract can be obtained; the target redundancy score indicates the degree of redundancy of the words in the candidate abstract.
  • the target redundancy score of each candidate abstract may be obtained through steps S41A-S43A, specifically, as shown in FIG. 2, it may include:
  • S41A: For the words of each candidate abstract, calculate the similarity between each word and the remaining words, count how many of these similarities exceed a preset value, and sum these counts to obtain the total number m of similar words in the candidate abstract.
  • in one embodiment, assuming a candidate abstract contains n words, the similarity between each word and the other remaining words is calculated; specifically, the cosine similarity between each of the n words and the other n-1 words can be computed.
  • for example, for the first word, the similarity scores between the first word and the remaining words are computed, giving n-1 similarity scores, and the number m1 of scores greater than the preset value is counted; for the second word, its n-1 similarity scores are computed and the number m2 of scores greater than the preset value is counted; and so on, until for the n-th word the number mn is counted. Repeating this for all n words and summing the counts gives the total number of similar words in the candidate abstract, m = m1 + m2 + ... + mn.
  • in one embodiment, the preset value of the similarity score is set to 0.9; for instance, for the first word, the number of similarity scores greater than 0.9 is counted as m1, i.e. the number of similar words for the first word; likewise m2 is obtained for the second word and mn for the n-th word.
  • it should be noted that the preset value here is only an example and is not a limitation; other preset values can also be set.
  • S42A: For each candidate abstract, divide the total number m corresponding to the candidate abstract by n*(n-1) and normalize the result to obtain the first redundancy score of the candidate abstract, where n represents the total number of words in the candidate abstract.
  • in this embodiment, the quotient can be expressed as m/(n*(n-1)), and normalizing m/(n*(n-1)) yields the first redundancy score score_dup1 of each candidate abstract.
  • S43A Use the corresponding first redundancy score of each candidate abstract as the target redundancy score of each candidate abstract.
  • the first redundancy score of each candidate abstract may be correspondingly used as the target redundancy score of each candidate abstract.
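  • A minimal sketch of this first redundancy score follows; it is an illustration rather than the patent's code, and it assumes each word of a candidate abstract has already been mapped to a word vector (for example via the Word2Vec model above).

```python
import numpy as np

def first_redundancy_score(word_vectors: np.ndarray, threshold: float = 0.9) -> float:
    """score_dup1: fraction of ordered word pairs whose cosine similarity exceeds `threshold`.

    `word_vectors` is an (n, d) array, one row per word of the candidate abstract.
    """
    n = len(word_vectors)
    if n < 2:
        return 0.0
    # Normalize rows so that dot products are cosine similarities.
    norms = np.linalg.norm(word_vectors, axis=1, keepdims=True)
    unit = word_vectors / np.clip(norms, 1e-12, None)
    sim = unit @ unit.T
    np.fill_diagonal(sim, 0.0)            # ignore each word's similarity with itself
    m = int((sim > threshold).sum())      # total count over all ordered pairs: m1 + ... + mn
    return m / (n * (n - 1))              # division by n*(n-1) keeps the score in [0, 1]
```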
  • the target redundancy score of each candidate abstract may also be obtained through steps S41B-S44B. Specifically, as shown in FIG. 3, it may include:
  • S41B: For each candidate abstract, obtain the length of the characters shared between its sentences.
  • in one embodiment, for each sentence of each candidate abstract, the length of the same characters, length_same_chars, is determined; it can be understood that the candidate abstract set may include multiple candidate abstracts, each candidate abstract may include multiple sentences, and the length of the same characters is counted for each pair of sentences.
  • S42B: For the same characters in each candidate abstract, obtain the lengths of the first sentence and of the second sentence in which they occur.
  • in one embodiment, for the same characters in a candidate abstract, the length length_sentence1 of the first sentence and the length length_sentence2 of the second sentence are obtained; for example, if sentence 1 and sentence 2 of a candidate abstract share the same characters, the lengths length_sentence1 and length_sentence2 of the two sentences can be obtained.
  • S43B: According to the length of the same characters of each candidate abstract and the lengths of the first and second sentences, obtain the second redundancy score of each candidate abstract.
  • in one embodiment, according to length_same_chars, length_sentence1 and length_sentence2, the second redundancy score score_dup2 of each candidate abstract can be obtained and used as its target redundancy score. Specifically, it can be expressed by the following formula:
  • score_dup2 = 2 * length_same_chars / (length_sentence1 + length_sentence2)
  • where length_same_chars indicates the length of the same characters in the candidate abstract, length_sentence1 indicates the length of the first sentence containing the same characters, and length_sentence2 indicates the length of the second sentence containing the same characters.
  • S44B: Correspondingly use the second redundancy score of each candidate abstract as the target redundancy score of the candidate abstract.
  • in this example, based on the principle of character similarity within a candidate abstract, by determining the length of the same characters and the lengths of the sentences containing them, the second redundancy score score_dup2 of each candidate abstract is obtained and used as the target redundancy score of the candidate abstract.
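  • The formula above leaves open how the length of the "same characters" between two sentences is measured; the sketch below is an assumption for illustration: it uses the longest common substring of each sentence pair and applies score_dup2 = 2*length_same_chars/(length_sentence1 + length_sentence2), keeping the maximum over all pairs as the abstract's score.

```python
from typing import List

def longest_common_substring_len(a: str, b: str) -> int:
    """Length of the longest run of identical characters shared by `a` and `b`."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best

def second_redundancy_score(sentences: List[str]) -> float:
    """score_dup2: largest pairwise character-overlap ratio among the abstract's sentences."""
    score = 0.0
    for i in range(len(sentences)):
        for j in range(i + 1, len(sentences)):
            s1, s2 = sentences[i], sentences[j]
            if not s1 or not s2:
                continue
            same = longest_common_substring_len(s1, s2)
            score = max(score, 2 * same / (len(s1) + len(s2)))
    return score
```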
  • the target redundancy score of each candidate abstract may also be obtained through steps S41C-S43C. Specifically, as shown in FIG. 4, it may include:
  • S41C Use the Bert model to encode each candidate abstract to obtain a sentence vector of each candidate abstract.
  • in one embodiment, the BERT (Bidirectional Encoder Representations from Transformers) model is used to encode each candidate abstract in the candidate abstract set, and the sentence vector of each candidate abstract can be obtained.
  • S42C: Obtain the similarity of any two sentence vectors in each candidate abstract according to the sentence vectors of each candidate abstract, and obtain the third redundancy score of each candidate abstract.
  • in one embodiment, based on the sentence vectors obtained for each candidate abstract, the cosine similarity of any two sentence vectors can be calculated, and this cosine similarity is taken as the third redundancy score, so the third redundancy score of each candidate abstract can be obtained.
  • S43C: Correspondingly use the third redundancy score of each candidate abstract as the target redundancy score of the candidate abstract.
  • in one embodiment, according to the sentence vectors of each candidate abstract, the similarity of any two sentence vectors in the candidate abstract is obtained, giving the third redundancy score score_dup3 of each candidate abstract, and this third redundancy score is used as the target redundancy score of the candidate abstract.
  • in this embodiment, after each candidate abstract is encoded with the BERT model, its sentence vectors are obtained, and the similarity of any two sentence vectors can be computed with the cosine-similarity formula, yielding the third redundancy score score_dup3 of each candidate abstract; the higher the similarity of two sentence vectors, the more similar the two sentences, which is not elaborated further here.
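  • The sketch below illustrates one way this step could be computed; it is an assumption, since the patent does not name a specific library: sentences are encoded with a Hugging Face BERT checkpoint using mean pooling over token embeddings, and the maximum pairwise cosine similarity is taken as score_dup3.

```python
import numpy as np
import torch
from transformers import AutoModel, AutoTokenizer

# Hypothetical checkpoint choice; any BERT-style encoder would do.
MODEL_NAME = "bert-base-chinese"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
encoder = AutoModel.from_pretrained(MODEL_NAME)

def sentence_vectors(sentences):
    """Encode sentences with BERT and mean-pool token embeddings into one vector per sentence."""
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        out = encoder(**batch).last_hidden_state            # (batch, seq_len, hidden)
    mask = batch["attention_mask"].unsqueeze(-1)             # ignore padding positions
    pooled = (out * mask).sum(dim=1) / mask.sum(dim=1)
    return pooled.numpy()

def third_redundancy_score(sentences):
    """score_dup3: maximum cosine similarity between any two sentence vectors of the abstract."""
    if len(sentences) < 2:
        return 0.0
    vecs = sentence_vectors(sentences)
    vecs = vecs / np.clip(np.linalg.norm(vecs, axis=1, keepdims=True), 1e-12, None)
    sims = vecs @ vecs.T
    np.fill_diagonal(sims, -1.0)
    return float(sims.max())
```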
  • in one embodiment, step S50, that is, obtaining the reference score of each candidate abstract according to its target redundancy score and log-likelihood probability value, may specifically include, as shown in FIG. 5:
  • S51A: For each candidate abstract, obtain the weight coefficient corresponding to the candidate abstract.
  • in one embodiment, the weight coefficient corresponding to each candidate abstract is obtained; the weight coefficient can be set to a value between 0 and 1 and can be configured according to actual conditions.
  • for example, for the top 3 abstracts in the candidate set, different weight coefficients can be set according to their ranking: the weight coefficient of the top-1 abstract can be set to 0.5, that of the top-2 abstract to 0.3, and that of the top-3 abstract to 0.2. It should be noted that this is only an example and is not a limitation.
  • S52A Determine the product of the weight coefficient corresponding to each candidate abstract and the target redundancy score.
  • the product of the weight coefficient corresponding to each candidate abstract and the target redundancy score is determined.
  • specifically, the product can be represented by the formula α*score_dup.
  • S53A: Use the difference between the log-likelihood probability value of each candidate abstract and the product as the reference score of the candidate abstract, to obtain the reference score of each candidate abstract; the reference score is stored in the blockchain.
  • by taking the difference between the log-likelihood probability value of each candidate abstract and the product, the reference score of each candidate abstract can be obtained.
  • specifically, the reference score can be expressed by the formula logP - α*score_dup, where α represents the weight coefficient, which can be set to a value between 0 and 1 and determined according to actual conditions and is not limited here, and score_dup represents the target redundancy score.
  • it can be seen that in this embodiment the reference score of each candidate abstract is calculated from the obtained target redundancy score and log-likelihood probability value through the formula logP - α*score_dup; the resulting reference score can then be used to select the target candidate abstract.
  • in the above embodiments, it can be understood that, based on steps S41A-S43A, S41B-S44B and S41C-S43C, the first redundancy score, the second redundancy score and the third redundancy score can be obtained respectively, and each can be used as the target redundancy score.
  • in this way, different target redundancy scores can be obtained, namely the first redundancy score score_dup1, the second redundancy score score_dup2 and the third redundancy score score_dup3, and based on the reference score formula logP - α*score_dup, different reference scores logP - α*score_dup1, logP - α*score_dup2 and logP - α*score_dup3 can be obtained.
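  • Putting these pieces together, here is a small sketch of computing the reference score logP - α*score_dup for each candidate abstract; it is illustrative only, and the weight α, the choice of redundancy score, and the example numbers are assumptions.

```python
def reference_score(log_p: float, score_dup: float, alpha: float = 0.5) -> float:
    """Reference score of one candidate abstract: logP - alpha * score_dup."""
    return log_p - alpha * score_dup

# Example with placeholder values: candidates as (text, log-likelihood, redundancy score).
candidates = [("abstract A", -2.1, 0.10), ("abstract B", -1.9, 0.45), ("abstract C", -2.4, 0.05)]
scored = [(text, reference_score(lp, dup)) for text, lp, dup in candidates]
```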
  • in addition, this application also relates to blockchain technology.
  • in one embodiment, the different reference scores obtained with the aforementioned reference score formula logP - α*score_dup can also be stored in the blockchain.
  • blockchain technology is a technology that can store, verify, transmit and exchange network data through its own distributed nodes without relying on a third party; it is characterized as unforgeable, leaving a complete trace, traceable, open and transparent, and collectively maintained. In this embodiment, it can be understood that storing the reference scores in the blockchain can improve the privacy and security of the reference scores.
  • in one embodiment, step S50, that is, obtaining the reference score of each candidate abstract according to its redundancy scores and log-likelihood probability value, can also be performed through steps S51B-S53B, which, as shown in FIG. 6, specifically include:
  • S51B: For the first redundancy score, the second redundancy score and the third redundancy score of each candidate abstract, respectively obtain the corresponding first weight coefficient, second weight coefficient and third weight coefficient.
  • in one embodiment, for the first redundancy score score_dup1, the second redundancy score score_dup2 and the third redundancy score score_dup3 of each candidate abstract, the corresponding first weight coefficient α1, second weight coefficient α2 and third weight coefficient α3 are obtained respectively.
  • S52B: Respectively determine the first product of the first redundancy score and the first weight coefficient, the second product of the second weight coefficient and the second redundancy score, and the third product of the third weight coefficient and the third redundancy score for each candidate abstract.
  • in one embodiment, the first, second and third products of each candidate abstract are determined respectively; specifically, the first product can be expressed as α1*score_dup1, the second product as α2*score_dup2, and the third product as α3*score_dup3.
  • S53B: Use the difference between the log-likelihood probability value of each candidate abstract and the first product, the second product and the third product as the reference score of the candidate abstract, to obtain the reference score of each candidate abstract.
  • in one embodiment, taking the difference between the log-likelihood probability value of each candidate abstract and the first, second and third products gives the reference score of each candidate abstract. Specifically, the following formula can be used:
  • reference score = logP - α1*score_dup1 - α2*score_dup2 - α3*score_dup3
  • where α1 represents the first weight coefficient, α2 represents the second weight coefficient, α3 represents the third weight coefficient, score_dup1 represents the first redundancy score, score_dup2 represents the second redundancy score, and score_dup3 represents the third redundancy score.
  • in this embodiment, it can be understood that setting different weight coefficients for the different target redundancy scores can make the obtained reference score more reasonable and more accurate.
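  • A short sketch of this weighted variant is given below; the default weight values are placeholder assumptions, and it simply combines the three redundancy scores defined earlier into one reference score per candidate abstract.

```python
def combined_reference_score(log_p: float,
                             score_dup1: float, score_dup2: float, score_dup3: float,
                             alpha1: float = 0.5, alpha2: float = 0.3, alpha3: float = 0.2) -> float:
    """Reference score = logP - a1*score_dup1 - a2*score_dup2 - a3*score_dup3."""
    return log_p - alpha1 * score_dup1 - alpha2 * score_dup2 - alpha3 * score_dup3
```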
  • S60: Select an abstract with a reference score greater than a preset reference score from the candidate abstracts as the abstract corresponding to the text information.
  • based on the reference score obtained for each candidate abstract, an abstract whose reference score is greater than the preset reference score can be selected as the abstract corresponding to the text information.
  • in one application scenario, for example, after obtaining the multiple reference scores of the candidate abstracts, the reference scores can be sorted and a reference score threshold preset.
  • for example, if the preset reference score is 0.9, the abstracts whose reference scores are greater than 0.9 are selected as the abstracts corresponding to the text information.
  • of course, other values such as 0.95 or 0.85 can also be preset; these are only examples and are not limitations.
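  • As a final illustration, with the threshold value assumed to match the 0.9 example above, the selection step can be sketched as follows:

```python
def select_abstracts(scored_candidates, preset_reference_score: float = 0.9):
    """Keep the candidate abstracts whose reference score exceeds the preset reference score."""
    kept = [(text, score) for text, score in scored_candidates if score > preset_reference_score]
    # Sort the surviving abstracts so the highest-scoring one comes first.
    return sorted(kept, key=lambda item: item[1], reverse=True)
```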
  • compared with a conventional Seq2Seq pipeline, the text summary generation method of the foregoing embodiments adds multiple ways of computing redundancy scores, which allows the automatically generated summaries to be optimized; by configuring different weight coefficients for redundancy scores obtained in different ways, candidate abstracts containing many repeated words receive lower scores, i.e. the probability of selecting abstracts with many repeated words is reduced, so the selected target abstract is less likely to contain many repeated words, which improves the credibility and readability of the automatically generated abstract.
  • in one embodiment, a text summary generating device is provided, whose functions correspond one-to-one to the steps of the text summary generation method in the foregoing embodiments.
  • the text summary generating device includes a first acquisition module 10, a conversion module 20, a training module 30, a second acquisition module 40, a third acquisition module 50 and a summary selection module 60.
  • the detailed description of each functional module is as follows:
  • the first obtaining module 10 is used to obtain the text information to be processed
  • the conversion module 20 is used to convert the text information into word vectors
  • the training module 30 is used to input the word vector into a pre-trained preset neural network model and decode with a beam search algorithm to obtain the candidate abstract set of the text information and the log-likelihood probability value of each candidate abstract in the candidate abstract set;
  • the second obtaining module 40 is configured to obtain a target redundancy score of each candidate abstract in the candidate abstract set, where the target redundancy score indicates the degree of redundancy of words in the candidate abstract;
  • the third obtaining module 50 is configured to obtain the reference score of each candidate abstract according to the target redundancy score and the log-likelihood probability value of each candidate abstract;
  • the abstract selection module 60 is configured to select abstracts with a reference score greater than a preset reference score from the candidate abstracts as the abstract corresponding to the text information.
  • preferably, the second acquisition module 40 is further configured to: for the words of each candidate abstract, calculate the similarity between each word and the remaining words, count how many similarities exceed a preset value to obtain the total number m of similar words in the candidate abstract; divide m by n*(n-1), where n is the total number of words in the candidate abstract, and normalize the result to obtain the first redundancy score of the candidate abstract; and use the first redundancy score of each candidate abstract as its target redundancy score.
  • preferably, the second acquisition module 40 is further configured to: obtain the length of the same characters in each candidate abstract; obtain the lengths of the first sentence and the second sentence containing the same characters; obtain the second redundancy score of each candidate abstract accordingly; and use the second redundancy score of each candidate abstract as its target redundancy score.
  • preferably, the second acquisition module 40 is further configured to: encode each candidate abstract with the BERT model to obtain its sentence vectors; obtain the similarity of any two sentence vectors in each candidate abstract to obtain the third redundancy score; and use the third redundancy score of each candidate abstract as its target redundancy score.
  • preferably, the third acquisition module 50 is further configured to: obtain the weight coefficient corresponding to each candidate abstract; determine the product of the weight coefficient and the target redundancy score; and use the difference between the log-likelihood probability value of each candidate abstract and the product as the reference score of the candidate abstract, to obtain the reference score of each candidate abstract.
  • Each module in the above-mentioned text summary generating device can be implemented in whole or in part by software, hardware and a combination thereof.
  • the above-mentioned modules may be embedded, in hardware form, in or independent of the processor of the computer device, or may be stored in software form in the memory of the computer device, so that the processor can call and execute the operations corresponding to the above-mentioned modules.
  • a computer-readable storage medium is provided.
  • the computer-readable storage medium may be volatile or non-volatile.
  • a computer program is stored on the computer-readable storage medium.
  • when the computer program is executed by a processor, the text summary generation method of the embodiments is implemented, for example steps S10-S60 shown in FIG. 1, steps S41A-S43A shown in FIG. 2, steps S41B-S44B shown in FIG. 3, steps S41C-S43C shown in FIG. 4, and steps S51A-S53A shown in FIG. 5 or steps S51B-S53B shown in FIG. 6, and so on. To avoid repetition, details are not repeated here.
  • the computer-readable storage medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disc, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunication signal, and the like.
  • a computer device is provided.
  • the computer device 60 of this embodiment includes a processor 61, a memory 62, and a computer program 63 that is stored in the memory 62 and can run on the processor 61.
  • when the processor 61 executes the computer program 63, the text summary generation method of the foregoing embodiments is implemented, such as steps S10-S60 shown in FIG. 1, steps S41A-S43A shown in FIG. 2, steps S41B-S44B shown in FIG. 3, steps S41C-S43C shown in FIG. 4, and steps S51A-S53A shown in FIG. 5 or steps S51B-S53B shown in FIG. 6, and so on.
  • alternatively, when the processor 61 executes the computer program 63, the functions of the modules in the text summary generating apparatus of the above embodiments are realized, such as the first acquisition module 10, the conversion module 20, the training module 30, the second acquisition module 40, the third acquisition module 50 and the abstract selection module 60 shown in FIG. 7.
  • Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory.
  • Volatile memory may include random access memory (RAM) or external cache memory.
  • RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM), etc.
  • the blockchain referred to in this application is a new application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, and encryption algorithms.
  • a blockchain is essentially a decentralized database, a chain of data blocks generated in association with one another by cryptographic methods; each data block contains a batch of network transaction information, which is used to verify the validity of the information (anti-counterfeiting) and to generate the next block.
  • the blockchain can include an underlying blockchain platform, a platform product service layer, and an application service layer.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Databases & Information Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A text summary generation method and apparatus, a computer device, and a readable storage medium, relating to artificial intelligence. The method includes: acquiring text information to be processed (S10); converting the text information into word vectors (S20); inputting the word vectors into a pre-trained preset neural network model and decoding with a beam search algorithm to obtain a candidate abstract set of the text information and the log-likelihood probability value of each candidate abstract in the candidate abstract set (S30); acquiring a target redundancy score of each candidate abstract in the candidate abstract set (S40); acquiring a reference score of each candidate abstract according to its target redundancy score and log-likelihood probability value (S50); and selecting, from the candidate abstracts, an abstract whose reference score is greater than a preset reference score as the abstract corresponding to the text information (S60). The method also involves blockchain technology, and the reference scores can be stored in a blockchain. The method can reduce redundant words in automatically generated abstracts, thereby improving the readability of automatically generated text summaries.

Description

文本摘要生成方法、装置、计算机设备及可读存储介质
本申请要求于2020年4月30日提交中国专利局、申请号为CN202010367822.9、名称为“文本摘要生成方法、装置、计算机设备及可读存储介质”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请涉及人工智能自然语言处理的技术领域,尤其涉及一种文本摘要生成方法、装置、计算机设备及可读存储介质。
背景技术
随着文本信息爆发式的增长,每天都会有海量的文本信息产生,其中包括但不局限于互联网新闻、博客、报告和论文等。摘要是能够反映文本信息的一段文本,从大量的文本信息中提取需要的内容,不仅能够帮助人们在阅读长篇的文章时缩短阅读时间,还能使人们大幅地提高对信息阅读效率,从而使人们可以更加高效地利用信息来生活和工作。基于上述的需求,自动文本摘要生成技术是知识管理系统核心功能之一,近年来得到了迅速的发展;而且自动文本摘要有非常多的应用场景,例如报告自动生成、新闻标题生成、搜索结果预览等。
技术问题
目前的自动文本摘要生成的方法主要分为两类,抽取式摘要和生成式摘要。发明人意识到抽取式摘要主要是从原文中抽取重要的句子,结合句子数量及字数要求等拼凑形成摘要,常用的方法有textrank及其延伸算法,其好处在于可以直接从文中抽取句子,一般句子的通顺度会更好,但概括性较差。而生成式摘要是根据原文内容进行提炼总结形成一段新的汇总文字,更接近人进行摘要的过程,但生成式摘要必须要采用Seq2Seq模型并依赖于标注数据进行模型训练,一般的难度较大,当训练数据量较小,模型训练不充分的话得到的摘要不符合预期,例如会导致冗余词的出现,则会影响自动生成摘要可读性。
技术解决方案
本申请第一方面提供一种文本摘要生成方法,所述方法包括:
获取待处理的文本信息;
将所述文本信息转化为词向量;
通过集束搜索算法,将所述词向量输入预先训练好的预设神经网络模型,以得到所述文本信息的候选摘要集合以及所述候选摘要集合中各候选摘要的对数似然概率值;
获取所述候选摘要集合中各候选摘要的目标冗余性分值,所述目标冗余性分值表示所述候选摘要中的词的冗余程度;
根据各候选摘要的所述对数似然概率值和所述目标冗余性分值获取所述各候选摘要的参考分值;
从所述各候选摘要中选取参考分值大于预设参考分值的摘要作为所述文本信息对应的摘要。
本申请第二方面提供一种文本摘要生成装置,所述装置包括:
第一获取模块,用于获取待处理的文本信息;
转化模块,用于将所述文本信息转化为词向量;
训练模块,用于通过集束搜索算法,将所述词向量输入预先训练好的预设神经网络模型,以得到所述文本信息的候选摘要集合以及所述候选摘要集合中各候选摘要的对数似然概率值;
第二获取模块,用于获取所述候选摘要集合中各候选摘要的目标冗余性分值,所述目标冗余性分值表示所述候选摘要中的词的冗余程度;
第三获取模块,用于根据各候选摘要的所述目标冗余性分值和对数似然概率值获取所述各候选摘要的参考分值;
摘要选取模块,用于从所述各候选摘要中选取参考分值大于预设参考分值的摘要作为所述文本信息对应的摘要。
本申请第三方面提供一种计算机设备,包括存储器、处理器以及存储在所述存储器中并可在所述处理器上运行的计算机程序,所述处理器执行所述计算机程序时实现如下步骤:
获取待处理的文本信息;
将所述文本信息转化为词向量;
通过集束搜索算法,将所述词向量输入预先训练好的预设神经网络模型,以得到所述文本信息的候选摘要集合以及所述候选摘要集合中各候选摘要的对数似然概率值;
获取所述候选摘要集合中各候选摘要的目标冗余性分值,所述目标冗余性分值表示所述候选摘要中的词的冗余程度;
根据各候选摘要的所述对数似然概率值和所述目标冗余性分值获取所述各候选摘要的参考分值;
从所述各候选摘要中选取参考分值大于预设参考分值的摘要作为所述文本信息对应的摘要。
本申请第四方面提供一种计算机可读存储介质,所述计算机可读存储介质存储有计算机程序,所述计算机程序被处理器执行时实现如下步骤:
获取待处理的文本信息;
将所述文本信息转化为词向量;
通过集束搜索算法,将所述词向量输入预先训练好的预设神经网络模型,以得到所述文本信息的候选摘要集合以及所述候选摘要集合中各候选摘要的对数似然概率值;
获取所述候选摘要集合中各候选摘要的目标冗余性分值,所述目标冗余性分值表示所述候选摘要中的词的冗余程度;
根据各候选摘要的所述对数似然概率值和所述目标冗余性分值获取所述各候选摘要的参考分值;
从所述各候选摘要中选取参考分值大于预设参考分值的摘要作为所述文本信息对应的摘要。
附图说明
为了更清楚地说明本申请实施例的技术方案,下面将对本申请实施例的描述中所需要使用的附图作简单地介绍,显而易见地,下面描述中的附图仅仅是本申请的一些实施例,对于本领域普通技术人员来讲,在不付出创造性劳动性的前提下,还可以根据这些附图获得其他的附图。
图1是本申请一实施例中文本摘要生成方法的一流程示意图;
图2是本申请一实施例中获取第一冗余性分值的一流程示意图;
图3是本申请一实施例中获取第二冗余性分值的一流程示意图;
图4是本申请一实施例中获取第三冗余性分值的一流程示意图;
图5是本申请一实施例中获取各候选摘要参考分值的一流程示意图;
图6是本申请一实施例中获取各候选摘要参考分值的另一流程示意图;
图7是本申请一实施例中文本摘要生成装置的一架构示意图;
图8是本申请一实施例中计算机设备的一架构示意图。
本发明的实施方式
下面将结合本申请实施例中的附图,对本申请实施例中的技术方案进行清楚、完整地描述,显然,所描述的实施例是本申请一部分实施例,而不是全部的实施例。基于本申请中的实施例,本领域普通技术人员在没有作出创造性劳动前提下所获得的所有其他实施例,都属于本申请保护的范围。
本申请实施例提供一种文本摘要生成方法,具体地,如图1所示,可以包括如下步骤:
S10:获取待处理的文本信息。
在一个实施例中,可以对需要进行摘要总结的场景进行相应的文本信息收集,其中的文本信息可以包括但不局限于例如互联网新闻、博客、报告和论文等相关的文本信息。示例性地,文本信息可以是涉及新闻,或者可以是博客相关的文本信息等,或者可以是新闻和博客组合的文本信息等。获取待处理的文本信息,具体可以获取大量的新闻、博客等文本信息。
S20:将所述文本信息转化为词向量。
在一个实施例中,将当前获取的待处理的文本信息转化为词向量。具体可以采用Word2Vec方式将所述待处理的文本信息转化为词向量。该实施例中,通过Word2Vec方式将获取的待处理的文本信息转化为词向量的形式,当然还可以通过其他的方式例如One-Hot将待处理的文本信息转化为词向量的形式,此处并不限定。
另外,还可以根据语料级别选择对应的词向量。示例性地,如果语料量级较大可以自行训练词向量,例如通过直接调用GenSim库中的Word2Vec函数进行训练;如果语料量级不够大则可以使用现有的词向量结果文件,例如获取基于网上公开语料训练好的词向量数据,具体可以根据实际场景的贴合程度选择对应的词向量。
S30:通过集束搜索算法,将所述词向量输入预先训练好的预设神经网络模型,以得到所述文本信息的候选摘要集合以及所述候选摘要集合中各候选摘要的对数似然概率值。
在一个实施例中,基于人工智能领域的神经网络和深度学习等技术,预设神经网络模型为带有注意力机制的Seq2Seq模型,具体地,可以先搭建Seq2Seq模型结构,该Seq2Seq模型可以包括encoder编码器以及decoder解码器两个部分,具体可以由循环神经卷积RNN(Recurrent neural network)结构或者长短期记忆LSTM(Long Short-Term Memory)结构组成,通过embedding嵌入方式将转化的词向量输入预先训练好的预设神经网络模型,其中的训练目标可以为最大似然估计MLE(maximum likelihood estimation)。根据集束搜索算法Beam Search,可以得到基于文本信息的候选摘要集合,也即,可以理解,基于引入了集束搜索算法Beam Search,通过Seq2Seq模型可以解码输出候选摘要集合,则可以获取其中各候选摘要的对数似然概率值logP。
该实施例中,通过引入集束搜索算法Beam Search,该Beam Search算法是一种启发式的搜索算法,通常用在数据集比较大的情况,可以剔除质量比较差的结点,筛选出质量较高的结点,其作用主要在于减少搜索所占用的空间和时间,通过Beam Search算法可以获取基于文本信息的候选摘要集合。
基于对获取候选摘要集合数据量的考量,在一个实施例中,基于上述步骤S30获取的候选摘要集合,可以将获取候选摘要集合对应的概率进行排序,并选取其中概率排名topk的k条摘要作为候选摘要集合,例如选择其中的top10的10条候选摘要集合,具体地,可以根据实际进行设置,此处并不限定。另外,基于将最大似然估计MLE作为Seq2Seq模型训练目标的优化方法,其核心在于实现获取出现概率最大的候选摘要集合,则可以提高获取候选摘要集合的准确度。
S40:获取所述候选摘要集合中各候选摘要的目标冗余性分值,所述目标冗余性分值表示所述候选摘要中的词的冗余程度。
基于获取的候选摘要集合,可以获取候选摘要集合中各候选摘要的目标冗余性分值,目标冗余性分值体可以表示候选摘要中的词的冗余程度。
在一个实施例中,各候选摘要的目标冗余性分值可以通过步骤S41A-S43A获取,具体地,如图2所示,可以包括:
S41A:针对所述各候选摘要的词,分别计算每个词与其他剩余词的相似度,并选取相似度大于预设值的数量,统计得到各候选摘要的相似词的总个数m。
在一个实施例中,假设候选摘要集合中包括n个词,针对各候选摘要的词,可以分别计算各候选摘要中每个词与其他剩余词的相似度。具体地,可以分别计算每n个词与其他n-1个剩余词的余弦相似度,示例性地,例如针对第一个词,可以计算第一个词与剩余其他词的相似度得分,可以得到n-1个相似度得分,并统计其中相似度得分大于预设值的数量m1;针对第二个词,可以计算第二个词与剩余其他词的相似度得分,可以得到n-1个相似度得分,并统计其中相似度得分大于预设值的数量m2......;则针对第n个词,可以计算第n个词与剩余其他词的相似度得分,可以得到n-1个相似度得分,并统计其中相似度得分大于预设值的数量mn;则重复计算每n个词与其他n-1个剩余词的余弦相似度n次,并选取相似度大于预设值的数量,可以得到各候选摘要的相似词的总个数m,其中m=m1+m2+...+mn。其中,针对每一个词,获取相似度得分大于预设得分的数量,在一实施例中,相似度得分的预设值预设为0.9,示例性地,例如针对第一个词,则可统计并获取其中相似度得分大于预设得分0.9的相似度得分的个数为m1,也即可以得到针对第一个词的相似词的个数m1;则针对第二个词对应可得到m2,针对第n个词对应可得到mn。需要说明的是,此处的预设值仅用于举例,并不表示限定,还可以设置其他的预设值。
S42A:针对所述各候选摘要,将所述候选摘要对应的总个数m除以n*(n-1),并进行归一化处理,得到所述各候选摘要的第一冗余性分值,其中,n表示所述候选摘要的词的总数量。
该实施例中,针对各候选摘要,将所述候选摘要对应的总个数m除以n*(n-1),具体地,可以通过公式m/(n*(n-1))进行表示,并将m/(n*(n-1))进行归一化处理,则可以得到各候选摘要的第一冗余性分值score dup1
S43A:将各所述各候选摘要的第一冗余性分值对应作为所述各候选摘要的目标冗余性分值。
在一个实施例中,可以将各候选摘要的第一冗余性分值对应作为各候选摘要的目标冗余性分值。
在一个实施例中,各候选摘要的目标冗余性分值可以还可以通过步骤S41B-S44B获取,具体地,如图3所示,可以包括:
S41B:针对所述各候选摘要,分别获取其中相同字符的长度。
在一个实施例中,针对各候选摘要的每个句子,分别确定并获取各候选摘要中的每个句子其中的相同字符的长度length 相同字符
该实施例中,可以理解,基于获取的候选摘要集合,该候选摘要集合可以包括多条候选摘要,各候选摘要中可以包括多个句子,针对各候选摘要中的每个句子,分别可以统计并获取各候选摘要中的每个句子其中的相同字符的长度length 相同字符
S42B:针对所述各候选摘要中的相同字符,分别获取所述相同字符对应的第一句子的长度和第二句子的长度。
在一个实施例中,针对各候选摘要中的相同字符,分别获取相同字符对应的第一句子的长度length 句子 1和第二句子的长度length 句子 2,示例性地,假设其中一条候选摘要中的句子1和句子2有相同字符,则可以分别获取该相同字符对应句子1的长度length 句子 1和句子2的长度length 句子 2
S43B:根据所述各候选摘要的相同字符的长度、以及所述第一句子的长度和第二句子的长度,对应获取所述各候选摘要的第二冗余性分值。
在一个实施例中,根据相同字符的长度length 相同字符、以及第一句子的长度length 句子 1和第二句子的长度length 句子 2,可以获取各候选摘要的第二冗余性分值score2,将各候选摘要的第二冗余性分值作为各候选摘要的目标冗余性分值。具体地,目标冗余性分值则可以通过以下公式表示:
score dup2=2*length 相同字符/(length 句子 1+length 句子 2)
其中,length 相同字符表示各候选摘要中相同字符的长度,length 句子 1表示相同字符对应第一句子的长度,length 句子 2表示相同字符对应的第二句子的长度。
S44B:将各所述各候选摘要的第二冗余性分值对应作为各所述候选摘要的目标冗余性分值。
该实例中,基于候选摘要中的字符相似原则,通过确定相同字符的长度以及相同字符对应句子的长度,针对各字符相似度的计算可以获取候选摘要集合的第二冗余性分值score dup,并将各候选摘要的第二冗余性分值对应作为各候选摘要的目标冗余性分值。
在一个实施例中,各候选摘要的目标冗余性分值还可以通过步骤S41C-S43C获取,具体地,如图4所示,可以包括:
S41C:采用Bert模型对所述各候选摘要进行编码,得到各候选摘要的句子向量。
在一个实施例中,采用Bert(Bidirectional Encoder Representations from Transformers)模型对候选摘要集合中的各候选摘要进行编码,可以得到各候选摘要的句子向量。
S42C:根据所述各候选摘要的句子向量,获取所述各候选摘要中任意两个句子向量的相似度,得到所述各候选摘要的第三冗余性分值。
在一个实施例中,可以理解,基于获取到各候选摘要的句子向量,可以计算任意两个句子向量的余弦相似度,将任意两个句子向量的余弦相似度的计算结果作为第三冗余性分值,则可以得到各候选摘要的第三冗余性分值。
S43C:将各所述各候选摘要的第三冗余性分值对应作为各所述候选摘要的目标冗余性分值。
在一个实施例中,根据各候选摘要的句子向量,获取各候选摘要中任意两个句子向量的相似度,可以得到各候选摘要的第三冗余性分值score dup3,并将各候选摘要的第三冗余性分值作为各候选摘要的目标冗余性分值,则可根据第三冗余性分值score dup3得到第目标冗余性分值。该实施例中,基于获取的各候选摘要,在采用Bert模型对各候选摘要进行编码,可以得到各候选摘要的句子向量,并根据句子向量获取各候选摘要中任意两个句子向量的相似度,具体可通过余弦相似度的公式计算,则可以得出各候选摘要的第三冗余性分值score dup3。其中,获取两个句子向量的相似度分值越高,则说明两个句子相似度越高,为避免累赘,此处不展开描述。
S50:根据各候选摘要的所述目标冗余性分值和所述对数似然概率值获取所述各候选摘要的参考分值。
在一实施例中,步骤S50,也即根据各候选摘要的目标冗余性分值和对数似然概率值获取各候选摘要的参考分值,具体地,如图5所示,可以包括:
S51A:针对各候选摘要,分别获取各候选摘要对应的权重系数。
在一个实施例中,针对各候选摘要,分别获取各候选摘要对应的权重系数,该权重系数可以设置0-1之间的数值,可以根据实际情况进行配置。示例性,例如基于获取摘要集合中的top3条摘要,可以根据排名不同设置不同的权重系数,示例性地,例如将排名top1摘要的权重系数设为0.5,排名top2摘要的权重系数设为0.3,排名top3摘要的权重系数设为0.2,需要说明的是,此处仅用于举例,并不限定。
S52A:确定所述各候选摘要对应的权重系数与所述目标冗余性分值的乘积。
在一个实施例中,确定各候选摘要对应的权重系数与目标冗余性分值的乘积,具体地,该乘积可以通过公式α*score dup表示。
S53A:将各候选摘要的对数似然概率值与所述乘积的差值作为各候选摘要的参考分值,以得到各候选摘要的参考分值,该参考分值存储于区块链中。
将各候选摘要的对数似然概率值与乘积的差值作为各候选摘要的参考分值,则可以得到各候选摘要的参考分值。在一个实施例中,具体地,参考分值可以通过公式:logP-α*score dup表示,其中,α表示权重系数,该权重系数可以设置为0-1之间的值,具体可以根据实际情况确定相应的权重系数,此处并不限定;score dup表示目标冗余性分值。可见,本实施例中获取各候选摘要的参考分值可以通过公式:进行计算和获取,也即可根据获取的目标冗余性分值和对数似然概率值,并通过公式logP-α*score dup可以得到参考分值,则可以根据该参考分值的进行参考,实现获取目标候选摘要。上述的实施例中,可以理解,基于上述步骤S41A-S43A,步骤S41B-S44B以及步骤S41C-S43C中,分别可以获取第一冗余性分值,第二冗余性分值和第三冗余性分值,并将第一冗余性分值,第二冗余性分值和第三冗余性分值分别作为目标冗余性分值,如此,可以得到不同的目标冗余性分值,也即第一冗余性分值score dup1,第二冗余性分值score dup2和第三冗余性分值score dup3,则基于参考分值公式logP-α*score dup,可以获取不同的参考分值logP-α*score dup1、logP-α*score dup2和logP-α*score dup3。此外,本申请还涉及了区块链技术,在一个实施例中,具体地,基于上述参考分值公式logP-α*score dup获取不同的参考分值,该参考分值还可以存储于区块链中。其中,区块链技术是一种可以不依赖第三方,通过自身分布式节点进行网络数据的存储、验证、传递和交流的技术,具有“不可伪造”“全程留痕”“可以追溯”“公开透明”“集体维护”等特点,该实施例中,可以理解,通过将参考分值存储于区块链中,可以实现提高该参考分值的私密性以及安全性。
在一个实施例中,根据不同的目标冗余性分值获取不同的参考分值。具体地,还可以对不同的目标冗余性分值设置不同的权重系数进行优化,以使得到的参考分值更加接近合理和真实。需要强调的是,为进一步保证上述参考分值的私密和安全性,上述参考分值还可以存储于一区块链的节点中。在一个实施例中,步骤S50,也即根据各候选摘要的冗余性数值和对数似然概率值获取各候选摘要的参考分值,其中各候选摘要的参考分值还可以通过步骤S51B-S53B进行获取,具体地,如图6所示,包括:
S51B:针对所述各候选摘要中的第一冗余性分值、所述第二冗余性分值和第三冗余性分值,分别获取各候选摘要中对应的第一权重系数、第二权重系数和第三权重系数。
在一个实施例中,针对各候选摘要中第一冗余性分值score dup1、第二冗余性分值score dup2和第三冗余性分值score dup3,分别获取各候选摘要中对应的第一权重系数α 1、第二权重系数α 2和第三权重系数α 3。具体地,分别确定和获取各候选摘要中第一冗余性分值对应的第一权重系数、各候选摘要中第二冗余性分值对应的第二权重系数和各候选摘要中第三冗余性分值对应的第三权重系数系数。
S52B:并分别确定各候选摘要的第一冗余性分值与第一权重系数的第一乘积、所述第二权重系数与第二冗余性分值的第二乘积,和所述第三权重系数与第三冗余性分值的第三乘积。
在一个实施例中,分别确定各候选摘要的第一冗余性分值与第一权重系数的第一乘积、第二权重系数与第二冗余性分值的第二乘积,和第三权重系数与第三冗余性分值的第三乘积。具体地,第一乘积可以通过公式α 1*score dup1表示,第二乘积可以通过公式α 2*score dup2表示,第三乘积可以通过公式α 3*score dup3表示。
S53B:将各候选摘要的对数似然概率值与所述第一乘积、第二乘积和第三乘积的差值作为各候选摘要的参考分值,以得到各候选摘要的参考分值。
在一个实施例中,将各候选摘要的对数似然概率值与所述第一乘积、第二乘积和第三乘积的差值作为各候选摘要的参考分值,则可以得到各候选摘要的参考分值。具体可以通过以下公式:
参考分值=logP-α 1*score dup12*score dup23*score dup3
其中,α 1表示第一权重系数、α 2表示第二权重系数、α 3表示第三权重系数,score dup1表示第一冗余性分值、score dup2表示第二冗余性分值、score dup3表示第三冗余性分值。
该实施例中,可以理解,针对不同的目标冗余性分值设置不同的权重系数,可以使获取的参考分值更加合理和准确。
S60:从所述各候选摘要中选取参考分值大于预设参考分值的摘要作为所述文本信息对应的摘要。
基于获取各候选摘要的参考分值score,则可以根据参考分值选取大于预设分值的摘要作为文本信息对应的摘要。在一个应用场景中,例如在获取各候选摘要的多个参考分值后,可以将该多个参考分值进行排序,并预设一个参考分值,例如参考分值预设为0.9,则可以选择将获取的参考分值大于0.9对应的摘要作为文本信息对应的摘要,当然还可以预设其他的数字例如0.95、0.85等,此处仅用于举例,并不限定。
上述的实施例中,通过获取基于文本信息的候选摘要集合和候选摘要集合中各候选摘要的对数似然概率值,并获取候选摘要集合中各候选摘要的目标冗余性分值,则可以根据各候选摘要的目标冗余性分值和对数似然概率值获取各候选摘要的参考分值,并从各候选摘要中选取参考分值大于预设参考分值的摘要作为该文本信息对应的摘要。可以理解,上述实施例的一种文本摘要的生成方法,相比于传统的Seq2Seq模型结构,增加了多种获取冗余性分值的运算方式,可以实现对自动生成摘要的结果进行优化,并针对不同方式获取的冗余性分值配置不同的权重系数,则可使得获取的候选摘要集合中出现重复性较多的词得分降低,即重复性较多的词被选择的概率降低,从而使得获取的目标候选摘要中出现较多重复词的可能性降低,从而提高自动生成摘要的可信度和可读性。
应理解,上述实施例中各步骤的序号的大小并不意味着执行顺序的先后,各过程的执行顺序应以其功能和内在逻辑确定,而不应对本申请实施例的实施过程构成任何限定。
在一个实施例中,提供一种文本摘要生成装置,实现功能与上述实施例中文本摘要生成方法对应的步骤一一对应。具体地,如图7所示,该文本摘要生成装置包括第一获取模块10、转化模块20、训练模块30、第二获取模块40、第三获取模块50和摘要选取模块60。各功能模块详细说明如下:
第一获取模块10,用于获取待处理的文本信息;
转化模块20,用于将所述文本信息转化为词向量;
训练模块30,用于通过集束搜索算法,将所述词向量输入预先训练好的预设神经网络模型,以得到所述文本信息的候选摘要集合以及所述候选摘要集合中各候选摘要的对数似然概率值;
第二获取模块40,用于获取所述候选摘要集合中各候选摘要的目标冗余性分值,所述目标冗余性分值表示所述候选摘要中的词的冗余程度;
第三获取模块50,用于根据各候选摘要的所述目标冗余性分值和对数似然概率值获取所述各候选摘要的参考分值;
摘要选取模块60,用于从所述各候选摘要中选取参考分值大于预设参考分值的摘要作为所述文本信息对应的摘要。
优选地,第二获取模块40还用于:
针对所述各候选摘要,分别计算每个词与剩余词的相似度,并选取相似度大于预设值的数量,得到各候选摘要的相似词的总个数m;
针对所述各候选摘要,将所述候选摘要对应的总个数m除以n*(n-1),并进行归一化处理,得到所述各候选摘要的第一冗余性分值,其中,n表示所述候选摘要的词的总数量;
将各所述候选摘要的第一冗余性分值对应作为各所述候选摘要的目标冗余性分值。
优选地,第二获取模块40还用于:
针对所述各候选摘要,分别获取其中相同字符的长度;
针对所述各候选摘要中的相同字符,分别获取所述相同字符对应的第一句子的长度和第二句子的长度;
根据所述各候选摘要的相同字符的长度、以及所述第一句子的长度和第二句子的长度,对应获取所述各候选摘要的第二冗余性分值;
将各所述候选摘要的第二冗余性分值对应作为各所述候选摘要的目标冗余性分值。
优选地,第二获取模块40还用于:
采用Bert模型对所述各候选摘要进行编码,得到各候选摘要的句子向量;
根据所述各候选摘要的句子向量,获取所述各候选摘要中任意两个句子向量的相似度,得到所述各候选摘要的第三冗余性分值;
将各所述候选摘要的第三冗余性分值对应作为各所述候选摘要的目标冗余性分值。
优选地,所示第三获取模块50还用于:
针对各候选摘要,分别获取其中各候选摘要对应的权重系数;
确定所述各候选摘要对应的权重系数与所述目标冗余性分值的乘积;
将各候选摘要的对数似然概率值与所述乘积的差值作为各候选摘要的参考分值,以得到各候选摘要的参考分值。
关于文本摘要生成装置的具体限定可以参见上文中对于文本摘要生成方法的限定,在此不再赘述。上述文本摘要生成装置中的各个模块可全部或部分通过软件、硬件及其组合来实现。上述各模块可以硬件形式内嵌于或独立于计算机设备中的处理器中,也可以以软件形式存储于计算机设备中的存储器中,以便于处理器调用执行以上各个模块对应的操作。
在一个实施例中,提供了一种计算机可读存储介质,所述计算机可读存储介质可以是易失性,也可以是非易失性,该计算机可读存储介质上存储有计算机程序,该计算机程序被处理器执行时实现实施例文本摘要生成方法,例如图1所示的步骤S10-S60或者图2所示的步骤S41A-S43A,图3所示步骤S41B-S44B、图4所示步骤S41C-S43C以及图5所示的步骤S51A-S53A或者图6所示的步骤S51B-S53B等。为避免重复,这里不再赘述。或者,该计算机程序被处理器执行时实现实施例2中文本摘要生成装置中各模块的功能,例如图7所示的第一获取模块10、转化模块20、训练模块30、第二获取模块40、第三获取模块50和摘要选取模块60等模块的功能,为避免重复,这里不再赘述。可以理解地,所述计算机可读存储介质可以包括:能够携带所述计算机程序代码的任何实体或装置、记录介质、U盘、移动硬盘、磁碟、光盘、计算机存储器、只读存储器(ROM,Read-Only Memory)、随机存取存储器(RAM,Random Access Memory)、电载波信号和电信信号等。
在一个实施例中,如图8所示,提供一种计算机设备。具体地,该实施例的计算机设备60包括:处理器61、存储器62以及存储在存储器62中并可在处理器61上运行的计算机程序63。处理器61执行计算机程序63时实现上述实施例文本摘要生成方法,例如图1所示的步骤S10-S60或者图2所示的步骤S41A-S43A,图3所示步骤S41B-S44B、图4所示步骤S41C-S43C以及图5所示的步骤S51A-S53A或者图6所示的步骤S51B-S53B等等。或者,处理器61执行计算机程序63时实现上述实施例文本摘要生成装置中各模块的功能,例如图7所示的第一获取模块10、转化模块20、训练模块30、第二获取模块40、第三获取模块50和摘要选取模块60等模块的功能。
本领域普通技术人员可以理解实现上述实施例方法中的全部或部分流程,是可以通过计算机程序来指令相关的硬件来完成,所述的计算机程序可存储于一非易失性或易失性计算机可读取存储介质中,该计算机程序在执行时,可包括如上述各方法的实施例的流程。其中,本申请所提供的各实施例中所使用的对存储器、存储、数据库或其它介质的任何引用,均可包括非易失性和/或易失性存储器。非易失性存储器可包括只读存储器(ROM)、可编程ROM(PROM)、电可编程ROM(EPROM)、电可擦除可编程ROM(EEPROM)或闪存。易失性存储器可包括随机存取存储器(RAM)或者外部高速缓冲存储器。作为说明而非局限,RAM以多种形式可得,诸如静态RAM(SRAM)、动态RAM(DRAM)、同步DRAM(SDRAM)、双数据率SDRAM(DDRSDRAM)、增强型SDRAM(ESDRAM)、同步链路(SynchlinK) DRAM(SLDRAM)、存储器总线(Rambus)直接RAM(RDRAM)、直接存储器总线动态RAM(DRDRAM)、以及存储器总线动态RAM(RDRAM)等。
本申请所指区块链是分布式数据存储、点对点传输、共识机制、加密算法等计算机技术的新型应用模式。区块链(Blockchain),本质上是一个去中心化的数据库,是一串使用密码学方法相关联产生的数据块,每一个数据块中包含了一批次网络交易的信息,用于验证其信息的有效性(防伪)和生成下一个区块。区块链可以包括区块链底层平台、平台产品服务层以及应用服务层等。
所属领域的技术人员可以清楚地了解到,为了描述的方便和简洁,仅以上述各功能单元、模块的划分进行举例说明,实际应用中,可以根据需要而将上述功能分配由不同的功能模块、子模块和单元完成,即将所述装置的内部结构划分成不同的功能单元或模块,以完成以上描述的全部或者部分功能。
以上所述实施例仅用以说明本申请的技术方案,而非对其限制;尽管参照前述实施例对本申请进行了详细的说明,本领域的普通技术人员应当理解:其依然可以对前述各实施例所记载的技术方案进行修改,或者对其中部分技术特征进行等同替换;而这些修改或者替换,并不使相应技术方案的本质脱离本申请各实施例技术方案的精神和范围,均应包含在本申请的保护范围之内。

Claims (20)

  1. 一种文本摘要生成方法,其中,所述方法包括:
    获取待处理的文本信息,将所述文本信息转化为词向量;
    通过集束搜索算法,将所述词向量输入预先训练好的预设神经网络模型,以得到所述文本信息的候选摘要集合以及所述候选摘要集合中各候选摘要的对数似然概率值;
    获取所述候选摘要集合中各候选摘要的目标冗余性分值,所述目标冗余性分值表示所述候选摘要中的词的冗余程度;
    根据各候选摘要的所述对数似然概率值和所述目标冗余性分值获取所述各候选摘要的参考分值;
    从所述各候选摘要中选取参考分值大于预设参考分值的摘要作为所述文本信息对应的摘要。
  2. 如权利要求1所述的一种文本摘要生成方法,其中,所述获取所述候选摘要集合中各候选摘要的目标冗余性分值,包括:
    针对所述各候选摘要的词,分别计算每个词与其他剩余词的相似度,并选取相似度大于预设值的数量,统计得到各候选摘要的相似词的总个数m;
    将所述候选摘要对应的总个数m除以n*(n-1),并进行归一化处理,以得到所述各候选摘要的第一冗余性分值,其中,n表示所述候选摘要的词的总数量;
    将各所述候选摘要的第一冗余性分值对应作为各所述候选摘要的目标冗余性分值。
  3. 如权利要求1所述的一种文本摘要生成方法,其中,所述获取所述候选摘要集合中各候选摘要的目标冗余性分值,包括:
    针对所述各候选摘要,分别获取其中相同字符的长度;
    针对所述各候选摘要中的相同字符,分别获取所述相同字符对应的第一句子的长度和第二句子的长度;
    根据所述各候选摘要的相同字符的长度、以及所述第一句子的长度和第二句子的长度,对应获取所述各候选摘要的第二冗余性分值;
    将各所述候选摘要的第二冗余性分值对应作为各所述候选摘要的目标冗余性分值。
  4. 如权利要求1所述的一种文本摘要生成方法,其中,所述获取所述候选摘要集合中各候选摘要的目标冗余性分值,包括:
    采用Bert模型对所述各候选摘要进行编码,得到各候选摘要的句子向量;
    根据所述各候选摘要的句子向量,获取所述各候选摘要中任意两个句子向量的相似度,得到所述各候选摘要的第三冗余性分值;
    将各所述候选摘要的第三冗余性分值对应作为各所述候选摘要的目标冗余性分值。
  5. 如权利要求1-4任一项所述的一种文本摘要生成方法,其中,所述根据各候选摘要的所述对数似然概率值和所述目标冗余性分值获取所述各候选摘要的参考分值,包括:
    针对各候选摘要,分别获取其中各候选摘要对应的权重系数;
    确定所述各候选摘要对应的权重系数与所述目标冗余性分值的乘积;
    将各候选摘要的对数似然概率值与所述乘积的差值作为各候选摘要的参考分值,以得到各候选摘要的参考分值,所述参考分值存储于区块链中。
  6. 如权利要求1-4任一项所述的一种文本摘要生成方法,其中,所述根据各候选摘要的所述对数似然概率值和所述目标冗余性分值获取所述各候选摘要的参考分值,包括:
    针对所述各候选摘要中的第一冗余性分值、所述第二冗余性分值和第三冗余性分值,分别获取各候选摘要中对应的第一权重系数、第二权重系数和第三权重系数;
    并分别确定各候选摘要的第一冗余性分值与第一权重系数的第一乘积、所述第二权重系数与第二冗余性分值的第二乘积,和所述第三权重系数与第三冗余性分值的第三乘积;
    将各候选摘要的对数似然概率值与所述第一乘积、第二乘积和第三乘积的差值作为各候选摘要的参考分值,以得到各候选摘要的参考分值。
  7. 一种文本摘要生成装置,其中,所述装置包括:
    第一获取模块,用于获取待处理的文本信息;
    转化模块,用于将所述文本信息转化为词向量;
    训练模块,用于通过集束搜索算法,将所述词向量输入预先训练好的预设神经网络模型,以得到所述文本信息的候选摘要集合以及所述候选摘要集合中各候选摘要的对数似然概率值;
    第二获取模块,用于获取所述候选摘要集合中各候选摘要的目标冗余性分值,所述目标冗余性分值表示所述候选摘要中的词的冗余程度;
    第三获取模块,用于根据各候选摘要的所述目标冗余性分值和对数似然概率值获取所述各候选摘要的参考分值;
    摘要选取模块,用于从所述各候选摘要中选取参考分值大于预设参考分值的摘要作为所述文本信息对应的摘要。
  8. 如权利要求7所述的一种文本摘要生成装置,其中,所述第二获取模块还用于:
    针对所述各候选摘要的词,分别计算每个词与其他剩余词的相似度,并选取相似度大于预设值的数量,统计得到各候选摘要的相似词的总个数m;
    针对所述各候选摘要,将所述候选摘要对应的总个数m除以n*(n-1),并进行归一化处理,得到所述各候选摘要的第一冗余性分值,其中,n表示所述候选摘要的词的总数量;
    将各所述候选摘要的第一冗余性分值对应作为各所述候选摘要的目标冗余性分值。
  9. 如权利要求7所述的一种文本摘要生成装置,其中,所述第二获取模块还用于:
    针对所述各候选摘要,分别获取其中相同字符的长度;
    针对所述各候选摘要中的相同字符,分别获取所述相同字符对应的第一句子的长度和第二句子的长度;
    根据所述各候选摘要的相同字符的长度、以及所述第一句子的长度和第二句子的长度,对应获取所述各候选摘要的第二冗余性分值;
    将各所述各候选摘要的第二冗余性分值对应作为各所述候选摘要的目标冗余性分值。
  10. 一种计算机设备,包括存储器、处理器以及存储在所述存储器中并可在所述处理器上运行的计算机程序,其中,所述处理器执行所述计算机程序时实现如下步骤:
    获取待处理的文本信息,将所述文本信息转化为词向量;
    通过集束搜索算法,将所述词向量输入预先训练好的预设神经网络模型,以得到所述文本信息的候选摘要集合以及所述候选摘要集合中各候选摘要的对数似然概率值;
    获取所述候选摘要集合中各候选摘要的目标冗余性分值,所述目标冗余性分值表示所述候选摘要中的词的冗余程度;
    根据各候选摘要的所述对数似然概率值和所述目标冗余性分值获取所述各候选摘要的参考分值;
    从所述各候选摘要中选取参考分值大于预设参考分值的摘要作为所述文本信息对应的摘要。
  11. 如权利要求10所述的一种计算机设备,其中,所述获取所述候选摘要集合中各候选摘要的目标冗余性分值,包括:
    针对所述各候选摘要的词,分别计算每个词与其他剩余词的相似度,并选取相似度大于预设值的数量,统计得到各候选摘要的相似词的总个数m;
    将所述候选摘要对应的总个数m除以n*(n-1),并进行归一化处理,以得到所述各候选摘要的第一冗余性分值,其中,n表示所述候选摘要的词的总数量;
    将各所述候选摘要的第一冗余性分值对应作为各所述候选摘要的目标冗余性分值。
  12. 如权利要求10所述的一种计算机设备,其中,所述获取所述候选摘要集合中各候选摘要的目标冗余性分值,包括:
    针对所述各候选摘要,分别获取其中相同字符的长度;
    针对所述各候选摘要中的相同字符,分别获取所述相同字符对应的第一句子的长度和第二句子的长度;
    根据所述各候选摘要的相同字符的长度、以及所述第一句子的长度和第二句子的长度,对应获取所述各候选摘要的第二冗余性分值;
    将各所述候选摘要的第二冗余性分值对应作为各所述候选摘要的目标冗余性分值。
  13. 如权利要求10所述的一种计算机设备,其中,所述获取所述候选摘要集合中各候选摘要的目标冗余性分值,包括:
    采用Bert模型对所述各候选摘要进行编码,得到各候选摘要的句子向量;
    根据所述各候选摘要的句子向量,获取所述各候选摘要中任意两个句子向量的相似度,得到所述各候选摘要的第三冗余性分值;
    将各所述候选摘要的第三冗余性分值对应作为各所述候选摘要的目标冗余性分值。
  14. 如权利要求10-13任一项所述的一种计算机设备,其中,所述根据各候选摘要的所述对数似然概率值和所述目标冗余性分值获取所述各候选摘要的参考分值,包括:
    针对各候选摘要,分别获取其中各候选摘要对应的权重系数;
    确定所述各候选摘要对应的权重系数与所述目标冗余性分值的乘积;
    将各候选摘要的对数似然概率值与所述乘积的差值作为各候选摘要的参考分值,以得到各候选摘要的参考分值,所述参考分值存储于区块链中。
  15. 如权利要求10-13任一项所述的一种计算机设备,其中,所述根据各候选摘要的所述对数似然概率值和所述目标冗余性分值获取所述各候选摘要的参考分值,包括:
    针对所述各候选摘要中的第一冗余性分值、所述第二冗余性分值和第三冗余性分值,分别获取各候选摘要中对应的第一权重系数、第二权重系数和第三权重系数;
    并分别确定各候选摘要的第一冗余性分值与第一权重系数的第一乘积、所述第二权重系数与第二冗余性分值的第二乘积,和所述第三权重系数与第三冗余性分值的第三乘积;
    将各候选摘要的对数似然概率值与所述第一乘积、第二乘积和第三乘积的差值作为各候选摘要的参考分值,以得到各候选摘要的参考分值。
  16. 一种计算机可读存储介质,所述计算机可读存储介质存储有计算机程序,其中,所述计算机程序被处理器执行时实现如下步骤:
    获取待处理的文本信息,将所述文本信息转化为词向量;
    通过集束搜索算法,将所述词向量输入预先训练好的预设神经网络模型,以得到所述文本信息的候选摘要集合以及所述候选摘要集合中各候选摘要的对数似然概率值;
    获取所述候选摘要集合中各候选摘要的目标冗余性分值,所述目标冗余性分值表示所述候选摘要中的词的冗余程度;
    根据各候选摘要的所述对数似然概率值和所述目标冗余性分值获取所述各候选摘要的参考分值;
    从所述各候选摘要中选取参考分值大于预设参考分值的摘要作为所述文本信息对应的摘要。
  17. 如权利要求16所述的一种计算机可读存储介质,其中,所述获取所述候选摘要集合中各候选摘要的目标冗余性分值,包括:
    针对所述各候选摘要的词,分别计算每个词与其他剩余词的相似度,并选取相似度大于预设值的数量,统计得到各候选摘要的相似词的总个数m;
    将所述候选摘要对应的总个数m除以n*(n-1),并进行归一化处理,以得到所述各候选摘要的第一冗余性分值,其中,n表示所述候选摘要的词的总数量;
    将各所述候选摘要的第一冗余性分值对应作为各所述候选摘要的目标冗余性分值。
  18. 如权利要求16所述的一种计算机可读存储介质,其中,所述获取所述候选摘要集合中各候选摘要的目标冗余性分值,包括:
    针对所述各候选摘要,分别获取其中相同字符的长度;
    针对所述各候选摘要中的相同字符,分别获取所述相同字符对应的第一句子的长度和第二句子的长度;
    根据所述各候选摘要的相同字符的长度、以及所述第一句子的长度和第二句子的长度,对应获取所述各候选摘要的第二冗余性分值;
    将各所述候选摘要的第二冗余性分值对应作为各所述候选摘要的目标冗余性分值。
  19. 如权利要求16所述的一种计算机可读存储介质,其中,所述获取所述候选摘要集合中各候选摘要的目标冗余性分值,包括:
    采用Bert模型对所述各候选摘要进行编码,得到各候选摘要的句子向量;
    根据所述各候选摘要的句子向量,获取所述各候选摘要中任意两个句子向量的相似度,得到所述各候选摘要的第三冗余性分值;
    将各所述候选摘要的第三冗余性分值对应作为各所述候选摘要的目标冗余性分值。
  20. 如权利要求16-19任一项所述的一种计算机可读存储介质,其中,所述根据各候选摘要的所述对数似然概率值和所述目标冗余性分值获取所述各候选摘要的参考分值,包括:
    针对各候选摘要,分别获取其中各候选摘要对应的权重系数;
    确定所述各候选摘要对应的权重系数与所述目标冗余性分值的乘积;
    将各候选摘要的对数似然概率值与所述乘积的差值作为各候选摘要的参考分值,以得到各候选摘要的参考分值,所述参考分值存储于区块链中。
PCT/CN2020/112349 2020-04-30 2020-08-31 文本摘要生成方法、装置、计算机设备及可读存储介质 WO2021217987A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010367822.9A CN111666402B (zh) 2020-04-30 2020-04-30 文本摘要生成方法、装置、计算机设备及可读存储介质
CN202010367822.9 2020-04-30

Publications (1)

Publication Number Publication Date
WO2021217987A1 true WO2021217987A1 (zh) 2021-11-04

Family

ID=72383200

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/112349 WO2021217987A1 (zh) 2020-04-30 2020-08-31 文本摘要生成方法、装置、计算机设备及可读存储介质

Country Status (2)

Country Link
CN (1) CN111666402B (zh)
WO (1) WO2021217987A1 (zh)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115600586A (zh) * 2022-12-15 2023-01-13 阿里巴巴(中国)有限公司(Cn) 摘要文本生成方法、计算设备及存储介质
CN115965033A (zh) * 2023-03-16 2023-04-14 安徽大学 基于序列级前缀提示的生成式文本摘要方法和装置
CN116595164A (zh) * 2023-07-17 2023-08-15 浪潮通用软件有限公司 一种生成单据摘要信息的方法、***、设备和存储介质
CN117610548A (zh) * 2024-01-22 2024-02-27 中国科学技术大学 一种基于多模态的自动化论文图表标题生成方法

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112328783A (zh) * 2020-11-24 2021-02-05 腾讯科技(深圳)有限公司 一种摘要确定方法和相关装置
CN112861543A (zh) * 2021-02-04 2021-05-28 吴俊� 一种面向研发供需描述文本撮合的深层语义匹配方法和***
CN113660541B (zh) * 2021-07-16 2023-10-13 北京百度网讯科技有限公司 新闻视频的摘要生成方法及装置
CN114398478A (zh) * 2022-01-17 2022-04-26 重庆邮电大学 一种基于bert和外部知识的生成式自动文摘方法
CN114996441B (zh) * 2022-04-27 2024-01-12 京东科技信息技术有限公司 文档处理方法、装置、电子设备和存储介质
CN115374884B (zh) * 2022-10-26 2023-01-31 北京智源人工智能研究院 基于对比学习的摘要生成模型的训练方法和摘要生成方法

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104915335A (zh) * 2015-06-12 2015-09-16 百度在线网络技术(北京)有限公司 为主题文档集生成摘要的方法和装置
CN106407178A (zh) * 2016-08-25 2017-02-15 中国科学院计算技术研究所 一种会话摘要生成方法及装置
CN107688652A (zh) * 2017-08-31 2018-02-13 苏州大学 面向互联网新闻事件的演化式摘要生成方法
CN109344391A (zh) * 2018-08-23 2019-02-15 昆明理工大学 基于神经网络的多特征融合中文新闻文本摘要生成方法
US20190362020A1 (en) * 2018-05-22 2019-11-28 Salesforce.Com, Inc. Abstraction of text summarizaton

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100287162A1 (en) * 2008-03-28 2010-11-11 Sanika Shirwadkar method and system for text summarization and summary based query answering
CN104216875B (zh) * 2014-09-26 2017-05-03 中国科学院自动化研究所 基于非监督关键二元词串提取的微博文本自动摘要方法
CN108182247A (zh) * 2017-12-28 2018-06-19 东软集团股份有限公司 文摘生成方法和装置

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104915335A (zh) * 2015-06-12 2015-09-16 百度在线网络技术(北京)有限公司 为主题文档集生成摘要的方法和装置
CN106407178A (zh) * 2016-08-25 2017-02-15 中国科学院计算技术研究所 一种会话摘要生成方法及装置
CN107688652A (zh) * 2017-08-31 2018-02-13 苏州大学 面向互联网新闻事件的演化式摘要生成方法
US20190362020A1 (en) * 2018-05-22 2019-11-28 Salesforce.Com, Inc. Abstraction of text summarizaton
CN109344391A (zh) * 2018-08-23 2019-02-15 昆明理工大学 基于神经网络的多特征融合中文新闻文本摘要生成方法

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115600586A (zh) * 2022-12-15 2023-01-13 阿里巴巴(中国)有限公司(Cn) 摘要文本生成方法、计算设备及存储介质
CN115600586B (zh) * 2022-12-15 2023-04-11 阿里巴巴(中国)有限公司 摘要文本生成方法、计算设备及存储介质
CN115965033A (zh) * 2023-03-16 2023-04-14 安徽大学 基于序列级前缀提示的生成式文本摘要方法和装置
CN115965033B (zh) * 2023-03-16 2023-07-11 安徽大学 基于序列级前缀提示的生成式文本摘要方法和装置
CN116595164A (zh) * 2023-07-17 2023-08-15 浪潮通用软件有限公司 一种生成单据摘要信息的方法、***、设备和存储介质
CN116595164B (zh) * 2023-07-17 2023-10-31 浪潮通用软件有限公司 一种生成单据摘要信息的方法、***、设备和存储介质
CN117610548A (zh) * 2024-01-22 2024-02-27 中国科学技术大学 一种基于多模态的自动化论文图表标题生成方法
CN117610548B (zh) * 2024-01-22 2024-05-03 中国科学技术大学 一种基于多模态的自动化论文图表标题生成方法

Also Published As

Publication number Publication date
CN111666402B (zh) 2024-05-28
CN111666402A (zh) 2020-09-15

Similar Documents

Publication Publication Date Title
WO2021217987A1 (zh) 文本摘要生成方法、装置、计算机设备及可读存储介质
US20230205610A1 (en) Systems and methods for removing identifiable information
US10515155B2 (en) Conversational agent
US11816442B2 (en) Multi-turn dialogue response generation with autoregressive transformer models
CN110929515B (zh) 基于协同注意力和自适应调整的阅读理解方法及***
US11301637B2 (en) Methods, devices, and systems for constructing intelligent knowledge base
CN110347799B (zh) 语言模型训练方法、装置和计算机设备
US20200293714A1 (en) Method, system and computer program product for generating artificial documents
US9940355B2 (en) Providing answers to questions having both rankable and probabilistic components
CN109902290B (zh) 一种基于文本信息的术语提取方法、***和设备
US20230098398A1 (en) Molecular structure reconstruction method and apparatus, device, storage medium, and program product
Yan et al. A knowledge-driven generative model for multi-implication chinese medical procedure entity normalization
CN109902273B (zh) 关键词生成模型的建模方法和装置
Jiang et al. A GDPR-compliant ecosystem for speech recognition with transfer, federated, and evolutionary learning
Xu et al. Multi-level matching networks for text matching
JP2020166735A (ja) 生成方法、学習方法、生成プログラム、及び生成装置
EP3525107A1 (en) Conversational agent
Ding et al. Joint linguistic steganography with bert masked language model and graph attention network
CN112348041B (zh) 日志分类、日志分类训练方法及装置、设备、存储介质
JP7211011B2 (ja) 学習方法、学習プログラム及び生成方法
CN112463161B (zh) 基于联邦学习的代码注释生成方法、***及装置
CN112329933B (zh) 数据处理方法、装置、服务器及存储介质
US11550777B2 (en) Determining metadata of a dataset
Flocon-Cholet et al. An investigation of temporal feature integration for a low-latency classification with application to speech/music/mix classification
CN113220841B (zh) 确定鉴别信息的方法、装置、电子设备和存储介质

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20933553

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20933553

Country of ref document: EP

Kind code of ref document: A1