CN114547287B - Generative text summarization method - Google Patents

Generative text summarization method

Info

Publication number
CN114547287B
CN114547287B (application CN202111373234.7A)
Authority
CN
China
Prior art keywords
vector
word
text
sentence
news
Prior art date
Legal status
Active
Application number
CN202111373234.7A
Other languages
Chinese (zh)
Other versions
CN114547287A (en)
Inventor
田玲
康昭
惠孛
孙麟
罗光春
袁铭潮
陈仙莹
Current Assignee
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China
Priority to CN202111373234.7A
Publication of CN114547287A
Application granted
Publication of CN114547287B
Legal status: Active (Current)
Anticipated expiration

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/34 - Browsing; Visualisation therefor
    • G06F 16/345 - Summarisation for human users
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/90 - Details of database functions independent of the retrieved data types
    • G06F 16/95 - Retrieval from the web
    • G06F 16/951 - Indexing; Web crawling techniques
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 - Handling natural language data
    • G06F 40/10 - Text processing
    • G06F 40/103 - Formatting, i.e. changing of presentation of documents
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 - Handling natural language data
    • G06F 40/10 - Text processing
    • G06F 40/12 - Use of codes for handling textual entities
    • G06F 40/126 - Character encoding
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 - Handling natural language data
    • G06F 40/10 - Text processing
    • G06F 40/166 - Editing, e.g. inserting or deleting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 - Handling natural language data
    • G06F 40/20 - Natural language analysis
    • G06F 40/279 - Recognition of textual entities
    • G06F 40/284 - Lexical analysis, e.g. tokenisation or collocates
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/044 - Recurrent networks, e.g. Hopfield networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/08 - Learning methods
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Machine Translation (AREA)

Abstract

A generative text summarization method belonging to the technical field of natural language processing. The invention improves on the CBOW model of Word2Vec by incorporating syllable annotation information to strengthen the feature representation of the text; it adopts an LSTM-based Encoder-Decoder framework to generate news summaries and focuses on handling unknown words during generation, thereby effectively improving the quality of the generated news summaries.

Description

Generative text summarization method
Technical Field
The invention belongs to the technical field of natural language processing, and particularly relates to a generative text summarization method.
Background
With the advancement of computer hardware in recent decades, computing performance has improved rapidly and the internet industry has flourished. The popularity of personal computers and the rapid growth of the internet have filled everyday life with text on a wide variety of carriers. The sheer volume of information creates an unavoidable and challenging information-overload problem and also makes information retrieval difficult. How to cope with the data deluge caused by information overload, and how to help people extract information from texts efficiently, has therefore become one of the hot topics worldwide. Text summarization, which converts a text or a collection of texts into a short summary containing the key information, emerged to address this problem.
Early automatic text summarization research used rule-based methods and traditional machine learning, but the resulting summaries were unsatisfactory because such methods struggle to understand articles the way humans do. With the development of deep learning, recurrent neural network models, whose output at each step depends on previous computations, can capture contextual dependencies in language and model texts of various lengths. However, the traditional recurrent-neural-network framework has a latent problem: the vocabulary available at prediction time is fixed, so if the input text contains words that are not in the vocabulary of generatable words, the model cannot process or generate them. This is the Out-Of-Vocabulary (OOV) problem. Rare words in the source text may carry information that is important for the summary, but because of their low frequency they are not added to the vocabulary during training, and since modern models keep growing in size, retraining after adding new words is very expensive, so traditional methods cannot solve the OOV problem well.
Disclosure of Invention
The invention aims to provide a generative text summarization method that addresses the defects described in the background above.
To achieve this aim, the invention adopts the following technical solution:
A generative text summarization method, comprising the steps of:
Step 1, data crawling:
crawling the original corpus of Uyghur news texts from a data-source website, and parsing it to obtain the Uyghur news texts;
Step 2, data preprocessing:
S21, data cleaning: cleaning the Uyghur news texts obtained in step 1 to obtain cleaned Uyghur news texts;
S22, data format processing: applying data-format processing to the cleaned Uyghur news texts to obtain processed Uyghur news texts;
S23, word segmentation: segmenting the processed Uyghur news texts with a grammar-analysis word-segmentation algorithm to obtain segmented Uyghur news texts;
S24, syllable labeling: labeling the syllables of the segmented Uyghur news texts with a Uyghur phonetic-harmony rule processing algorithm, using 1 for vowels and 0 for consonants, and constructing Uyghur syllable vectors of the same dimension as the segmented texts to obtain Uyghur news-text syllable data;
Step 3, text feature representation:
S31, initialization: first traverse the segmented Uyghur news text obtained in step S23 to obtain the number of words V and the frequency of each word, sort the V words in descending order of frequency, and construct the vocabulary Vocab: {w_1, w_2, …, w_i, …, w_V}, where w_i denotes the i-th word in the vocabulary; generate a V-dimensional One-Hot code for each word according to its position in the vocabulary, and for the i-th word w_i denote the generated One-Hot code as one_hot_i;
S32, word-vector generation and iterative training: generate word vectors from the One-Hot codes produced in step S31; for the word w_i the generation process comprises:
a. define the word-vector length as N and the window size as c;
b. randomly initialize a weight matrix W_{V×N} and compute the hidden vector h_i of the intermediate layer:
h_i = (1/(2c)) · Σ_{k=i−c, k≠i}^{i+c} one_hot_k · W_{V×N}
c. randomly initialize a weight matrix W′_{N×V} and compute the probability distribution y of the word w_i:
y = softmax(h_i · W′_{N×V})
d. iterative training: train iteratively with gradient descent; when one_hot_i − y falls below a preset threshold, stop the iteration and obtain the trained hidden vector h_i′ of the intermediate layer, which is the trained word vector h_i′ of the word w_i;
S33, syllable information fusion: concatenate the Uyghur syllable vector obtained in step S24 with the trained word vector h_i′ obtained in step S32 to obtain the word vector h_i″ fused with syllable information;
S34, word-vector adjustment based on a neural network: randomly extract from the segmented Uyghur news text a sentence containing the word w_i; assuming the sentence W consists of m words and the word w_i occupies the j-th position in W, denote it as W_j, with W = {w_1, w_2, …, w_m}; the sentence corresponds to a sentence vector fused with syllable information, H = {h_1″, h_2″, …, h_j″, …, h_m″}, where h_j″ denotes the syllable-fused word vector of the word W_j at the j-th position in the sentence W; then input each word vector in H into a neural network to obtain the hidden-layer vectors G = {g_1, g_2, …, g_j, …, g_m}, where g_j is the hidden-layer vector of the word vector h_j″;
S35, word-vector adjustment based on an attention mechanism:
a. for the hidden-layer vectors G = {g_1, g_2, …, g_j, …, g_m}, compute the attention weights A = [a_1, a_2, …, a_j, …, a_m] (the scoring formula is given as an image in the original publication), where V′ and M′ are randomly initialized matrices, V′ is a matrix of 1 row and x columns, M′ is a matrix of x rows and 1 column, x is a preset value, and b is a randomly initialized value;
b. train V′, M′ and b with gradient descent to obtain the trained attention weights A′ = [a_1′, a_2′, …, a_j′, …, a_m′];
c. update the hidden-layer vector g_j with the trained attention weights to obtain the updated hidden-layer vector g_j′ (the update formula is given as an image in the original publication);
Step 4, news summary generation:
S41, word-vector representation: suppose the news vector S consists of k sentence vectors, S = {s_1, …, s_p, …, s_k}, where the sentence vector s_p consists of m′ word vectors, s_p = {g_1′, …, g_q′, …, g_{m′}′}, and g_q′ denotes the word vector at position q in the sentence vector s_p;
S42, encoding: input the news vector S of step S41 into an LSTM model for encoding to obtain the semantic vector T:
T = LSTM(S)
S43, decoding: input the semantic vector T obtained in step S42 into another LSTM model for decoding to generate the text summary vector S′:
S′ = LSTM(T)
the text summary vector S′ consists of k′ sentence vectors, S′ = {s_1′, …, s_{p′}′, …, s_{k′}′}, where the sentence vector s_{p′}′ consists of m″ word vectors;
S44, unknown-word copying:
a. compute the probability distribution of the word vector g_q′ in the sentence vector s_p:
P_vocab(g_q′) = softmax(V″(V‴[s_p, s_{p′}′] + b′) + b″)
where [s_p, s_{p′}′] denotes the concatenation of the sentence vector s_p obtained in step S41 and the sentence vector s_{p′}′ obtained in step S43; V″ and V‴ are randomly initialized matrices, V″ has dimension 1 × x′, V‴ has dimension x′ × 1, and x′ is a preset value; b′ and b″ are randomly initialized values;
b. compute the generation probability P_gen of the word vector g_q′:
P_gen = sigmoid(s_p · M_1 + s_{p′}′ · M_2 + A′ · M_3 + b_gen)
where M_1, M_2 and M_3 are randomly initialized matrices whose dimensions are m′, m″ and m′·m″ respectively, and b_gen is a randomly initialized value;
c. obtain the final generation probability of the word vector g_q′:
P(g_q′) = P_gen · P_vocab(g_q′) + (1 − P_gen) · Σ_{j=1}^{m′} a_j
where a_j denotes the attention weight of step S35;
d. if P_vocab(g_q′) is the zero vector, overwrite the word vector with the highest attention weight in S′ with the word vector g_q′ taken directly from the news vector S; if P_vocab(g_q′) is a non-zero vector, update the word vector with the highest attention weight in the generated text summary vector S′ to the word vector with the highest final generation probability;
S45, mapping: for the generated text summary vector S′ updated in step S44, map each word vector back to a word to obtain the final text summary.
The invention has the beneficial effects that:
the invention provides a method for generating a text abstract, which is improved on the basis of a CBOW model of Word2Vec, and syllable marking information is integrated to enhance the feature representation capability of the text; the method adopts an Encoder-Decoder framework based on LSTM to realize news abstract generation, and focuses on solving the problem of unknown words in the generation process, thereby effectively improving the effect of news abstract generation.
Drawings
Fig. 1 is a flowchart of the generative text summarization method according to the present invention.
Detailed Description
The embodiments of the present invention will be described in detail with reference to the accompanying drawings.
A generative text summarization method specifically comprises the following steps:
step 1, data crawling:
according to the embodiment of the invention, a news text on a Wei language news website is crawled as basic data for subsequent data preprocessing, such as a Wei language news text on a central Wei language broadcast network; the method comprises the following specific steps:
s11, data acquisition: inputting a URL address of a target data source website in a script crawler frame to obtain original corpus of a dimension language news text in a Json character string format;
s12, data analysis: performing regular expression analysis on the original corpus of the dimension language news text obtained in the step S11 to obtain a dimension language news text; the Uygur news text consists of a plurality of sentences, and each sentence consists of a plurality of words;
step 2, data preprocessing:
the step mainly involves preprocessing the wiki news text obtained in step S12 to improve the data analysis processing capability of the downstream model. The data preprocessing process comprises the following steps: data cleaning, data format processing, word segmentation and syllable marking. The method specifically comprises the following steps:
s21, data cleaning: data cleaning is carried out on the dimensional Language news text obtained in the step S12 by adopting a Structured Query Language (SQL) or Excel-based manual proofreading method, and the cleaned dimensional Language news text can be obtained by means of integrity check, spelling check correction, non-text information removal, invalid data discarding and the like;
s22, data format processing: carrying out data format processing on the cleaned dimension language news text by adopting an SQL or Excel-based manual proofreading method, specifically comprising case and case conversion, numerical format unification and the like, so as to obtain a processed dimension language news text;
s23, word segmentation: performing word segmentation on the processed dimension language news text by adopting a grammar analysis word segmentation algorithm to obtain a dimension language news text after word segmentation; the segmented Uygur news text consists of sentences, and the sentences consist of words; the word segmentation processing of the step is to perform word segmentation on the processed dimension language news text, and identifiers are added among certain characters in the dimension language news text to indicate which characters in the news text form a word, and the word is not changed into a vocabulary list after the word segmentation processing; for example [ i/like/eat/apple ] is the word segmentation processing result of the text [ i like eating apple ].
S24, syllable labeling: the vowels and consonants of the dimensional language are distinguished obviously, and the expression meanings of the dimensional language of the vowel and the dimensional language of the consonant are different to a certain extent. And (4) carrying out syllable labeling on the segmented dimensional language news text by adopting a dimensional language voice harmony rule processing algorithm, and constructing a dimensional voice syllable vector with the same dimensionality as the segmented dimensional language news text to obtain dimensional language news text syllable data.
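A minimal sketch of the 0/1 syllable labeling of S24 is given below; the Latin-script vowel set is an assumption for illustration, since the patent's phonetic-harmony rules operate on the actual Uyghur alphabet.

```python
# Mark vowels as 1 and consonants as 0 for each character of a segmented word
# (step S24). The vowel set below is an illustrative assumption.
UYGHUR_VOWELS = set("aeëiouöü")

def syllable_vector(word: str) -> list[int]:
    return [1 if ch.lower() in UYGHUR_VOWELS else 0 for ch in word]

def label_sentence(words: list[str]) -> list[list[int]]:
    # One 0/1 vector per word, so the labeling has the same dimension as the text.
    return [syllable_vector(w) for w in words]
```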
Step 3, text feature representation:
the method mainly aims at the problem that the text features generated by the traditional text feature representation method are discrete and sparse, and is improved on the basis Of a CBOW (Continuous Bag-Of-Words Model) Model Of Word2Vec, syllable marking information is blended to enhance the text feature representation capability Of the Model, and the Bi-LSTM and attention mechanism are utilized to improve the text representation capability.
S31, initialization: and generating One-Hot codes by adopting the segmented wiki news text obtained in the step S23. The specific process is as follows: firstly, traversing the segmented dimensional language news text to obtain the number V of words and the word frequency of each word in the segmented dimensional language news text, arranging the V words according to the sequence of the word frequencies from large to small, and constructing a vocabulary table Vocab: { w 1 ,w 2 ,…,w i ,…,w V },w i Represents the ith word in the vocabulary; generating One-Hot code of V dimension according to the position of each word in the vocabulary, and generating w for the ith word i Indicating that it is ranked in the ith position in the vocabulary Vocab, and the generated One-Hot code is marked as One _ Hot i The specific generation process is as follows:
for the word w i When it is ranked in the ith position in the vocabulary Vocab, its corresponding One-Hot code One _ Hot i Comprises the following steps: [0, 8230;, 1,0, 8230;, 0]The dimension of the code is V, the ith bit is 1, and all the other bits are 0.
S32, word-vector generation and iterative training: generate word vectors from the One-Hot codes produced in step S31; for the word w_i the generation process comprises the following steps:
a. define the word-vector length as N and the window size as c;
b. randomly initialize a weight matrix W_{V×N} according to a Gaussian distribution, where the number of rows V is the One-Hot dimension and the number of columns N is the defined word-vector length; take the One-Hot codes of the c words preceding w_i and the c words following w_i, i.e. one_hot_{i−c}, one_hot_{i−c+1}, …, one_hot_{i−1}, one_hot_{i+1}, one_hot_{i+2}, …, one_hot_{i+c}, multiply each of them by W_{V×N} and average the results to obtain the hidden vector h_i of the intermediate layer:
h_i = (1/(2c)) · Σ_{k=i−c, k≠i}^{i+c} one_hot_k · W_{V×N}
c. randomly initialize a weight matrix W′_{N×V} according to a Gaussian distribution, where the number of rows N is the defined word-vector length and the number of columns V is the One-Hot dimension; right-multiply the hidden vector h_i by W′_{N×V} and apply the softmax activation function to obtain the probability distribution y of the word w_i:
y = softmax(h_i · W′_{N×V})
d. iterative training: the goal of the iterative training is to make the probability distribution of the word w_i approach its true probability distribution, i.e. the One-Hot code of w_i. Specifically: with gradient descent, the gradient of one_hot_i − y is back-propagated to W_{V×N} and W′_{N×V}, and the two matrices are corrected continuously so that one_hot_i − y gradually decreases; when one_hot_i − y falls below a preset threshold (the threshold is user-defined; a value close to 0, such as 0.001, is usually chosen), the iteration stops and the trained hidden vector h_i′ of the intermediate layer is obtained; this hidden vector is the trained word vector h_i′ of the word w_i;
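A compact numpy sketch of one CBOW training step of S32 is shown below; the learning rate and the residual measure are assumptions made for this illustration, and only the matrix shapes follow the patent.

```python
# One CBOW update for a single target word (step S32). W has shape (V, N) and
# W_prime has shape (N, V), as in the patent; lr is an assumed learning rate.
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def cbow_step(context_one_hots, target_one_hot, W, W_prime, lr=0.05):
    h = np.mean([oh @ W for oh in context_one_hots], axis=0)   # hidden vector h_i
    y = softmax(h @ W_prime)                                    # predicted distribution
    err = y - target_one_hot                                    # negative of one_hot_i - y
    grad_h = err @ W_prime.T                                    # gradient reaching the hidden layer
    W_prime -= lr * np.outer(h, err)                            # gradient-descent updates
    for oh in context_one_hots:
        W -= lr * np.outer(oh, grad_h) / len(context_one_hots)
    return h, float(np.abs(err).sum())   # trained hidden vector and residual size
```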
S33, syllable information fusion: concatenate the Uyghur syllable vector obtained in step S24 with the trained word vector h_i′ obtained in step S32 to obtain the word vector h_i″ of the word w_i fused with syllable information;
S34, word-vector adjustment based on a Bi-LSTM (bidirectional long short-term memory network): a Bi-LSTM lets the syllable-fused word vector h_i″ of the word w_i obtained in step S33 contain more context information. The specific process is as follows:
For the syllable-fused word vector h_i″ obtained in step S33, first randomly extract from the segmented Uyghur news text a sentence containing the word w_i; assuming this sentence W consists of m words and the word w_i is ranked at the j-th position in W, denote it as W_j, so the sentence can be represented as the word set W = {w_1, w_2, …, w_j, …, w_m} (the word w_i mentioned in step S31 refers to the i-th word of the vocabulary Vocab, whereas w_1, w_2, …, w_m here refer to the words at positions 1, 2, …, m of the sentence W). The sentence corresponds to a sentence vector fused with syllable information, H = {h_1″, h_2″, …, h_j″, …, h_m″}, where h_j″ denotes the syllable-fused word vector of the word W_j at the j-th position in the sentence W. Then input each syllable-fused word vector h_j″ in the sentence vector H sequentially into a neural network composed of Bi-LSTM units to obtain the corresponding hidden-layer vectors G = {g_1, g_2, …, g_j, …, g_m}, where g_j is the hidden-layer vector of the syllable-fused word vector h_j″, and G is the set of the hidden-layer vectors g_1, g_2, …, g_m corresponding to the syllable-fused word vectors of the m words in the sentence W;
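The Bi-LSTM adjustment of S34 can be sketched in PyTorch as below; the embedding and hidden sizes are illustrative assumptions, not values prescribed by the patent.

```python
# Bi-LSTM over the syllable-fused word vectors of one sentence (step S34).
import torch
import torch.nn as nn

class ContextEncoder(nn.Module):
    def __init__(self, fused_dim: int = 128, hidden_dim: int = 64):
        super().__init__()
        self.bilstm = nn.LSTM(fused_dim, hidden_dim,
                              batch_first=True, bidirectional=True)

    def forward(self, H: torch.Tensor) -> torch.Tensor:
        # H: (1, m, fused_dim), the sentence vector fused with syllable information.
        G, _ = self.bilstm(H)        # G: (1, m, 2 * hidden_dim), one g_j per word
        return G
```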
s35, adjusting word vectors based on an attention mechanism: the influence degree of different words on other words is different, and the attention mechanism is utilized to apply to the word w obtained in step S34 j Hidden layer vector g of j And adjusting to receive the influence of other words in different degrees. The method specifically comprises the following steps:
a. hidden layer vector G = { G) for m words 1 ,g 2 ,···,g j ,···,g m }, calculating the attention weight [ a ] 1 ,a 2 …,a j …,a m ]The formula is as follows:
Figure RE-GDA0003585028030000088
wherein A represents the attention weight a 1 ,a 2 …,a j …,a m Vector of composition, a j Is a numerical value, the softmax function will result in a vector of dimension m, a j The value of the j th bit in the vector output by the softmax function is obtained; v 'and M' are two matrices randomly initialized according to a gaussian distribution,v 'is a matrix of 1 row and x columns, M' is a matrix of x rows and 1 column (where x is a predetermined value, preferably approaching vector g) j B) is a value that is randomly initialized according to a gaussian distribution;
b. training V ', M ' and b in the formula by adopting a gradient descent method to obtain a trained attention weight A ' = [ a ] 1 ′,a 2 ′…,a j ′…,a m ′];
c. Using a trained attention weight A' = [ a ] 1 ′,a 2 ′…,a j ′…,a m ′]Vector g of hidden layer j Updating:
Figure RE-GDA0003585028030000091
get the word w j Updated hidden layer vector g' j
Step 4, news summary generation:
the step mainly aims at the problem that the traditional news abstract generating method is poor in effect, the news abstract is generated by adopting an Encoder-Decoder framework based on LSTM, and the problem Of Out-Of-Vocabulary (OOV) is solved in an oriented mode in the generating process, so that the effect Of generating the news abstract is improved.
S41, word vector representation: and summarizing the segmented dimensional language news text obtained in the step S23 to generate a summary. Suppose that the news vector S is composed of k sentence vectors, i.e., S = { S = { S } 1 ,···,s p ,···,s k Where, sentence vector s p Consisting of m' word vectors, s p ={g′ 1 ,···,g′ q ,···,g′ m′ }, wherein g' q Expression in sentence vector s p A word vector with the median position of q;
s42, encoding:
inputting the news vector S of the step S41 into a unidirectional LSTM model for encoding, and generating a semantic vector T by the LSTM based on the news vector S:
T=LSTM(S)
the semantic vector T contains all the information of the news.
S43, decoding: inputting the semantic vector T obtained in the step S42 into another different unidirectional LSTM model for decoding to generate a text abstract vector S'; the generated text digest vector S ' is composed of k ' sentence vectors, S ' = { S = { (S) } 1 ′,···,s p′ ′···,s k′ ' }, where sentence vector s p′ 'is composed of m' word vectors, s p′ ′={g′ 1 ,···g′ q′ ,··· g′ m″ },g′ q′ As a vector s of sentences p′ 'the word vector representation with position q' in (LSTM when used for decoding, one vector can be expanded into multiple vectors):
s'=LSTM(T)
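A minimal PyTorch sketch of the LSTM Encoder-Decoder of S42-S43 follows; the dimensions, the start-of-summary input and the fixed-length decoding loop are assumptions, since the patent only states T = LSTM(S) and S′ = LSTM(T).

```python
# LSTM Encoder-Decoder for summary generation (steps S42-S43, assumed setup).
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, dim: int = 128, hidden: int = 256):
        super().__init__()
        self.encoder = nn.LSTM(dim, hidden, batch_first=True)   # unidirectional LSTM
        self.decoder = nn.LSTM(dim, hidden, batch_first=True)   # a second, different LSTM
        self.project = nn.Linear(hidden, dim)

    def forward(self, S: torch.Tensor, summary_len: int) -> torch.Tensor:
        # S: (1, n_words, dim), the news vector; T is the final encoder state.
        _, T = self.encoder(S)
        out, state = [], T
        step = torch.zeros(1, 1, S.size(-1))     # assumed start-of-summary input
        for _ in range(summary_len):
            o, state = self.decoder(step, state) # expand T into several output vectors
            step = self.project(o)
            out.append(step)
        return torch.cat(out, dim=1)             # S': (1, summary_len, dim)
```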
s44, copying the unknown words: after the text abstract vector S ' is obtained in step S43, it is determined whether each word vector in S ' is a vector of an unknown word (i.e., it is determined whether a word vector corresponding to a word in the Vocab vocabulary is consistent with the word vector in S ', and if so, a word copy operation is required). The specific process is as follows:
a. compute the probability distribution P_vocab of the word vector g_q′ in the sentence vector s_p; the formula is:
P_vocab(g_q′) = softmax(V″(V‴[s_p, s_{p′}′] + b′) + b″)
where [s_p, s_{p′}′] denotes the concatenation of the sentence vector s_p obtained in step S41 and the sentence vector s_{p′}′ obtained in step S43; V″ and V‴ are two matrices randomly initialized according to a Gaussian distribution, V″ has dimension 1 × x′ and V‴ has dimension x′ × 1 (x′ is a preset value, about 1000); b′ and b″ are two values randomly initialized according to a Gaussian distribution; the parameters of V″, V‴, b′ and b″ are all corrected continuously by gradient descent to improve the accuracy of P_vocab(g_q′);
b. compute the generation probability P_gen of the word vector g_q′:
P_gen = sigmoid(s_p · M_1 + s_{p′}′ · M_2 + A′ · M_3 + b_gen)
where M_1, M_2 and M_3 are matrices randomly initialized according to a Gaussian distribution, s_p is the sentence vector obtained in step S41, s_{p′}′ is the sentence vector obtained in step S43, A′ is the set of trained attention weights obtained in step S35, and b_gen is a value randomly initialized according to a Gaussian distribution; the dimensions of M_1, M_2 and M_3 are m′, m″ and m′·m″ respectively; the parameters of M_1, M_2, M_3 and b_gen are all corrected continuously by gradient descent to improve the accuracy of P_gen;
c. combine the above probability distribution P_vocab and generation probability P_gen to obtain the final generation probability of the word vector g_q′:
P(g_q′) = P_gen · P_vocab(g_q′) + (1 − P_gen) · Σ_{j=1}^{m′} a_j
where a_j denotes the attention weight of step S35 and m′ denotes the sentence-vector length of step S41;
d. if P_vocab(g_q′) is computed to be the zero vector, the word vectors of all the words in the vocabulary Vocab of step S31 differ from g_q′; in that case the word vector g_q′ is taken directly from S and overwrites the word vector with the highest attention weight A′ in S′; if P_vocab(g_q′) is computed to be a non-zero vector, the word corresponding to the word vector g_q′ exists in the vocabulary Vocab of step S31, and according to P(g_q′) the word vector with the highest generation probability is selected, i.e. the word vector with the highest attention weight A′ in the generated text summary vector S′ is updated to the word vector with the highest final generation probability, thereby solving the OOV problem;
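The copy decision of S44 can be sketched as follows for a single decoder position; P_vocab, P_gen and the attention weights are taken as given, and the combination rule mirrors the pointer-generator style reconstruction above.

```python
# Decide, for one summary position, whether to copy a source word or emit a
# vocabulary word (step S44, item d). Inputs are assumed to be precomputed.
import numpy as np

def resolve_word(p_vocab: np.ndarray, p_gen: float, attn: np.ndarray,
                 source_vectors: np.ndarray, vocab_vectors: np.ndarray) -> np.ndarray:
    if not p_vocab.any():
        # Zero distribution: the word is out-of-vocabulary, so copy the source
        # word vector with the highest attention weight.
        return source_vectors[int(attn.argmax())]
    # Otherwise combine generation and copy probabilities and emit the
    # vocabulary word vector with the highest final generation probability.
    final = p_gen * p_vocab + (1.0 - p_gen) * attn.sum()
    return vocab_vectors[int(final.argmax())]
```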
s45, mapping: for the updated generation of step S44Text abstract vector S', and vector S of each sentence in S p′ 'the word vector g' q′ Mapping into words to obtain the final text abstract S final ={W 1 final ,···,W i final ,···,W k′ final In which the sentence W i final Consisting of m' words, W i final ={w 1 ,w 2 ,···w m′ In which w 1 ,w 2 ,···w m′ Is a word.
The invention thus realizes a generative text summarization method.

Claims (1)

1. A generative text summarization method, comprising the steps of:
Step 1, data crawling:
crawling the original corpus of news texts from a data-source website, and parsing it to obtain the news texts;
Step 2, data preprocessing:
S21, data cleaning: cleaning the news texts obtained in step 1 to obtain cleaned news texts;
S22, data format processing: applying data-format processing to the cleaned news texts to obtain processed news texts;
S23, word segmentation: segmenting the processed news texts with a grammar-analysis word-segmentation algorithm to obtain segmented news texts;
S24, syllable labeling: labeling the syllables of the segmented news texts with a phonetic-harmony rule processing algorithm, using 1 for vowels and 0 for consonants, and constructing syllable vectors of the same dimension as the segmented texts to obtain news-text syllable data;
Step 3, text feature representation:
S31, initialization: first traversing the segmented news text obtained in step S23 to obtain the number of words V and the frequency of each word, sorting the V words in descending order of frequency, and constructing the vocabulary Vocab: {w_1, w_2, …, w_i, …, w_V}, where w_i denotes the i-th word in the vocabulary; generating a V-dimensional One-Hot code for each word according to its position in the vocabulary, and for the i-th word w_i denoting the generated One-Hot code as one_hot_i;
S32, word-vector generation and iterative training: generating word vectors from the One-Hot codes produced in step S31; for the word w_i the generation process comprises:
a. defining the word-vector length as N and the window size as c;
b. randomly initializing a weight matrix W_{V×N} and computing the hidden vector h_i of the intermediate layer:
h_i = (1/(2c)) · Σ_{k=i−c, k≠i}^{i+c} one_hot_k · W_{V×N}
c. randomly initializing a weight matrix W′_{N×V} and computing the probability distribution y of the word w_i:
y = softmax(h_i · W′_{N×V})
d. iterative training: training iteratively with gradient descent; when one_hot_i − y falls below a preset threshold, stopping the iteration to obtain the trained hidden vector h_i′ of the intermediate layer, which is the trained word vector h_i′ of the word w_i;
S33, syllable information fusion: concatenating the syllable vector obtained in step S24 with the trained word vector h_i′ obtained in step S32 to obtain the word vector h_i″ fused with syllable information;
S34, word-vector adjustment based on a neural network: randomly extracting from the segmented news text a sentence containing the word w_i; assuming the sentence W consists of m words and the word w_i occupies the j-th position in W, denoting it as W_j, with W = {w_1, w_2, …, w_m}; the sentence corresponds to a sentence vector fused with syllable information, H = {h_1″, h_2″, …, h_j″, …, h_m″}, where h_j″ denotes the syllable-fused word vector of the word W_j at the j-th position in the sentence W; then inputting each word vector in H into a neural network to obtain the hidden-layer vectors G = {g_1, g_2, …, g_j, …, g_m}, where g_j is the hidden-layer vector of the word vector h_j″;
S35, word-vector adjustment based on an attention mechanism:
a. for the hidden-layer vectors G = {g_1, g_2, …, g_j, …, g_m}, computing the attention weights A = [a_1, a_2, …, a_j, …, a_m] (the scoring formula is given as an image in the original publication), where V′ and M′ are randomly initialized matrices, V′ is a matrix of 1 row and x columns, M′ is a matrix of x rows and 1 column, x is a preset value, and b is a randomly initialized value;
b. training V′, M′ and b with gradient descent to obtain the trained attention weights A′ = [a_1′, a_2′, …, a_j′, …, a_m′];
c. updating the hidden-layer vector g_j with the trained attention weights to obtain the updated hidden-layer vector g_j′ (the update formula is given as an image in the original publication);
Step 4, news summary generation:
S41, word-vector representation: supposing the news vector S consists of k sentence vectors, S = {s_1, …, s_p, …, s_k}, where the sentence vector s_p consists of m′ word vectors, s_p = {g_1′, …, g_q′, …, g_{m′}′}, and g_q′ denotes the word vector at position q in the sentence vector s_p;
S42, encoding: inputting the news vector S of step S41 into an LSTM model for encoding to obtain the semantic vector T:
T = LSTM(S)
S43, decoding: inputting the semantic vector T obtained in step S42 into another LSTM model for decoding to generate the text summary vector S′:
S′ = LSTM(T)
the text summary vector S′ consists of k′ sentence vectors, S′ = {s_1′, …, s_{p′}′, …, s_{k′}′}, where the sentence vector s_{p′}′ consists of m″ word vectors;
S44, unknown-word copying:
a. computing the probability distribution of the word vector g_q′ in the sentence vector s_p:
P_vocab(g_q′) = softmax(V″(V‴[s_p, s_{p′}′] + b′) + b″)
where [s_p, s_{p′}′] denotes the concatenation of the sentence vector s_p obtained in step S41 and the sentence vector s_{p′}′ obtained in step S43; V″ and V‴ are randomly initialized matrices, V″ has dimension 1 × x′, V‴ has dimension x′ × 1, and x′ is a preset value; b′ and b″ are randomly initialized values;
b. computing the generation probability P_gen of the word vector g_q′:
P_gen = sigmoid(s_p · M_1 + s_{p′}′ · M_2 + A′ · M_3 + b_gen)
where M_1, M_2 and M_3 are randomly initialized matrices whose dimensions are m′, m″ and m′·m″ respectively, and b_gen is a randomly initialized value;
c. obtaining the final generation probability of the word vector g_q′:
P(g_q′) = P_gen · P_vocab(g_q′) + (1 − P_gen) · Σ_{j=1}^{m′} a_j
where a_j denotes the attention weight of step S35;
d. if P_vocab(g_q′) is the zero vector, overwriting the word vector with the highest attention weight in S′ with the word vector g_q′ taken directly from the news vector S; if P_vocab(g_q′) is a non-zero vector, updating the word vector with the highest attention weight in the generated text summary vector S′ to the word vector with the highest final generation probability;
S45, mapping: for the generated text summary vector S′ updated in step S44, mapping each word vector back to a word to obtain the final text summary.
CN202111373234.7A 2021-11-18 2021-11-18 Generative text summarization method Active CN114547287B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111373234.7A CN114547287B (en) 2021-11-18 2021-11-18 Generative text summarization method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111373234.7A CN114547287B (en) 2021-11-18 2021-11-18 Generative text summarization method

Publications (2)

Publication Number Publication Date
CN114547287A CN114547287A (en) 2022-05-27
CN114547287B true CN114547287B (en) 2023-04-07

Family

ID=81668710

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111373234.7A Active CN114547287B (en) 2021-11-18 2021-11-18 Generative text summarization method

Country Status (1)

Country Link
CN (1) CN114547287B (en)

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11138378B2 (en) * 2019-02-28 2021-10-05 Qualtrics, Llc Intelligently summarizing and presenting textual responses with machine learning

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018135723A1 (en) * 2017-01-17 2018-07-26 Kyungpook National University Industry-Academic Cooperation Foundation Device and method for generating abstract summary of multiple-paragraph text, and recording medium for performing same method
CN109344391A (en) * 2018-08-23 2019-02-15 Kunming University of Science and Technology Neural-network-based Chinese news text summary generation method with multi-feature fusion
CN109635109A (en) * 2018-11-28 2019-04-16 South China University of Technology Sentence classification method based on LSTM combining part of speech and multiple attention mechanisms
CN109657051A (en) * 2018-11-30 2019-04-19 Ping An Technology (Shenzhen) Co., Ltd. Text summary generation method, device, computer equipment and storage medium
CN110209801A (en) * 2019-05-15 2019-09-06 South China University of Technology Automatic text summary generation method based on a self-attention network
CN110378409A (en) * 2019-07-15 2019-10-25 Kunming University of Science and Technology Chinese-Vietnamese news document summary generation method based on an element-association attention mechanism
JP2021033995A (en) * 2019-08-16 2021-03-01 NTT Docomo, Inc. Text processing apparatus, method, device, and computer-readable storage medium
CN110619043A (en) * 2019-08-30 2019-12-27 Southwest China Institute of Electronic Technology (No. 10 Research Institute of China Electronics Technology Group Corporation) Automatic text summary generation method based on dynamic word vectors
CN111782810A (en) * 2020-06-30 2020-10-16 Hunan University Text summary generation method based on topic enhancement
CN113127631A (en) * 2021-04-23 2021-07-16 Chongqing University of Posts and Telecommunications Text summarization method based on multi-head self-attention mechanism and pointer network
CN113254610A (en) * 2021-05-14 2021-08-13 Liao Weizhi Multi-round dialogue generation method for patent consultation

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
Automatic paper writing based on a RNN and the TextRank algorithm; Hei-Chia Wang et al.; Applied Soft Computing; pp. 1-12 *
Variational neural decoder for abstractive text summarization; Zhao Huan et al.; Computer Science and Information Systems; Vol. 17, No. 2; pp. 537-552 *
Semantic understanding model based on global and local attention interaction mechanisms; Hou Zhenzhen et al.; Journal of Guilin University of Electronic Technology; No. 02; pp. 55-59 *
News summary generation method based on an improved Encoder-Decoder model; Li Chenbin et al.; Journal of Computer Applications; pp. 25-28 *
Sentence ordering method for automatic summarization based on deep learning; He Kailin et al.; Computer Engineering and Design; No. 12; pp. 275-278 *
Automatic short-text summarization method based on a part-of-speech soft-template attention mechanism; Zhang Yafei et al.; Pattern Recognition and Artificial Intelligence; No. 06; pp. 76-83 *

Also Published As

Publication number Publication date
CN114547287A (en) 2022-05-27

Similar Documents

Publication Publication Date Title
CN109783657B (en) Multi-step self-attention cross-media retrieval method and system based on limited text space
CN106980683B (en) Blog text abstract generating method based on deep learning
Faruqui et al. Morphological inflection generation using character sequence to sequence learning
Su et al. A two-stage transformer-based approach for variable-length abstractive summarization
US8386240B2 (en) Domain dictionary creation by detection of new topic words using divergence value comparison
CN110929030A (en) Text abstract and emotion classification combined training method
CN108446271A (en) The text emotion analysis method of convolutional neural networks based on Hanzi component feature
CN110032638B (en) Encoder-decoder-based generative abstract extraction method
CN113704416B (en) Word sense disambiguation method and device, electronic equipment and computer-readable storage medium
WO2023134083A1 (en) Text-based sentiment classification method and apparatus, and computer device and storage medium
CN112905736A (en) Unsupervised text emotion analysis method based on quantum theory
Habib et al. An exploratory approach to find a novel metric based optimum language model for automatic bangla word prediction
CN111145914B (en) Method and device for determining text entity of lung cancer clinical disease seed bank
Hifny Hybrid LSTM/MaxEnt networks for Arabic syntactic diacritics restoration
CN113961706A (en) Accurate text representation method based on neural network self-attention mechanism
ELAffendi et al. A simple Galois Power-of-Two real time embedding scheme for performing Arabic morphology deep learning tasks
CN112445887B (en) Method and device for realizing machine reading understanding system based on retrieval
CN114547287B (en) Generation type text abstract method
CN113743113A (en) Emotion abstract extraction method based on TextRank and deep neural network
CN113449517A (en) Entity relationship extraction method based on BERT (belief propagation) gating multi-window attention network model
CN112131859A (en) Tibetan composition plagiarism detection prototype system
WO2019163752A1 (en) Morpheme analysis learning device, morpheme analysis device, method, and program
Yadav et al. Different Models of Transliteration-A Comprehensive Review
Xu et al. Neural dialogue model with retrieval attention for personalized response generation
Wai et al. Myanmar (Burmese) String Similarity Measures based on Phoneme Similarity

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant