CN110929022A

CN110929022A - Text abstract generation method and system

Info

Publication number: CN110929022A
Application number: CN201811088867.1A
Authority: CN
Inventors: 李卫; 白子龙
Original assignee: Archimedes (shanghai) Media Co Ltd
Current assignee: Archimedes (shanghai) Media Co Ltd
Priority date: 2018-09-18
Filing date: 2018-09-18
Publication date: 2020-03-27

Abstract

The invention discloses a text abstract generation method and system. The method extracts key sentences from a text using the TextRank algorithm, performs word embedding on the segmented words of the extracted key sentences using a word2vec model, and, for any two key sentences with similar semantics, removes the one with the lower criticality according to the semantic relevance of their segmented words. It then extracts, one by one, the semantic units in which the remaining key sentences are located, de-duplicates them to serve as new key sentences, and finally organizes the text abstract according to the order in which the new key sentences appear in the original text. The method and system compute semantic similarity only for the key sentences extracted by TextRank, so the amount of computation is small; key sentences with repeated semantics are removed using semantic similarity, so semantic diversity is ensured for a given abstract length; and the overall readability and semantic consistency of the generated abstract are improved by extracting complete semantic units as key sentences.

Description

Text abstract generation method and system

Technical Field

The invention discloses a text abstract generation method and system. It relates to the field of natural language processing, in particular to automatic text abstract generation, and more particularly to content abstracts of internet text.

Background

At present, methods for automatically generating text abstracts from natural language fall into two main categories: extraction type and generation type. The core idea of the extraction-type method is that the central idea of a document can be summarized by certain sentences in the document itself, so the abstract task becomes a problem of ranking sentences by importance; the most representative algorithms are the TF-IDF algorithm and the TextRank algorithm. The TF-IDF algorithm measures the importance of words by word frequency and inverse document frequency, and then treats sentences containing many keywords as more important. The TextRank algorithm is based on a graph model: it uses sentences as nodes in a graph and the relations between sentences as edges, generally measures those relations by literal similarity, and ranks the importance of sentences using a voting mechanism. Unlike a neural network, the TextRank algorithm does not need a specific model to be learned and trained in advance; it is simple and effective and is therefore widely applied. The generation-type method generates abstract sentences directly from the meaning expressed by the text, similar to the way a person writes an abstract. Its representative algorithm is the seq2seq + attention model based on deep learning. Training this model requires a large amount of labeled text as training samples and places high computational demands on the system, and the generated abstracts still suffer from problems such as grammatical errors and poor readability. The technology is therefore not yet mature and its range of application is limited.

At present, the most widely applied and most mainstream text abstract generation method is still the extraction-type abstract. Its most representative algorithms, TF-IDF and TextRank, measure the importance of sentences based on word frequency statistics and literal similarity respectively. Both lack any measurement at the semantic level, so that under a limited abstract length the generated abstract lacks semantic diversity; moreover, an abstract composed of isolated extracted sentences lacks semantic integrity and coherence, and its readability is poor.

Disclosure of Invention

In order to overcome the defects of the existing extraction type text abstract generating method, the invention provides a text abstract generating method, which comprises the following steps:

a. preprocessing the original text from which the abstract is to be extracted;

b. extracting key sentences in the original text and corresponding key degrees based on a standard TextRank algorithm;

c. adopting a word segmentation tool to segment the extracted key sentences and filter preset stop words, carrying out word embedding processing on the filtered segments based on a word2vec model, and calculating semantic correlation between the segments in every two key sentences according to word embedding processing results so as to obtain semantic similarity between every two key sentences;

d. filtering out, among key sentences with highly similar semantics, those with lower criticality, according to the semantic similarity between every two key sentences;

e. extracting the complete semantic unit of each remaining key sentence to serve as a new key sentence, ordering the new key sentences, and then generating the text abstract.

Further, the preprocessing of the original text in step a includes removing non-text content such as space characters and line feed characters and then text-encoding the remaining text content. The filtering of key sentences in step d specifically comprises: presetting a similarity threshold and, for any two key sentences whose semantic similarity exceeds the threshold, removing the key sentence with lower criticality and keeping only the one with higher criticality.
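
The threshold filtering of step d could be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the invention: the function name `filter_similar`, the default threshold of 0.8, the pluggable `similarity` callable and the greedy most-critical-first visiting order are all hypothetical choices made for the example. The text only specifies that, of two key sentences whose similarity exceeds the threshold, the less critical one is removed.

```python
from typing import Callable, List, Sequence


def filter_similar(sentences: Sequence[str],
                   criticality: Sequence[float],
                   similarity: Callable[[str, str], float],
                   threshold: float = 0.8) -> List[int]:
    """Return the indices of the key sentences that survive filtering.

    Assumption: sentences are visited from most to least critical, and a
    sentence is dropped if its similarity to any already-kept sentence
    exceeds the threshold. This keeps the more critical sentence of every
    too-similar pair.
    """
    order = sorted(range(len(sentences)),
                   key=lambda i: criticality[i], reverse=True)
    kept: List[int] = []
    for i in order:
        if all(similarity(sentences[i], sentences[k]) <= threshold
               for k in kept):
            kept.append(i)
    # Report survivors in their original positions.
    return sorted(kept)
```

Passing the similarity measure in as a callable keeps the filter independent of how sentence similarity is actually computed.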

Further, in order to solve the problems that abstracts produced by existing extraction-type methods lack semantic integrity and coherence and have poor readability, the abstract generated from the filtered key sentences needs to be further optimized. To this end, step e of the text abstract generation method is specifically implemented as follows: the original text is divided into a plurality of semantic units according to the Chinese ending identifiers; the semantic units containing the filtered key sentences are extracted and de-duplicated to serve as new key sentences; and the new key sentences are ordered according to the order in which they appear in the original text to form the final abstract.
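
Step e might look like the sketch below. It assumes that a key sentence can be matched to its semantic unit by substring containment and that the ending identifiers are the full-width marks `。`, `？` and `！`; both are assumptions made for illustration, and the names `END_MARKS`, `split_semantic_units` and `expand_to_units` are hypothetical.

```python
import re
from typing import List

# Assumed set of Chinese ending identifiers (hypothetical; adjust as needed).
END_MARKS = "。？！"


def split_semantic_units(text: str, end_marks: str = END_MARKS) -> List[str]:
    """Split text into semantic units, each keeping its ending mark."""
    marks = re.escape(end_marks)
    pattern = "[^" + marks + "]+[" + marks + "]*"
    return [unit for unit in re.findall(pattern, text) if unit.strip()]


def expand_to_units(text: str, key_sentences: List[str]) -> List[str]:
    """Replace key sentences by the de-duplicated units that contain them.

    Units are scanned in document order, so the result is already ordered
    by position in the original text.
    """
    summary: List[str] = []
    for unit in split_semantic_units(text):
        if any(key in unit for key in key_sentences) and unit not in summary:
            summary.append(unit)
    return summary
```

Two key sentences that fall inside the same unit contribute that unit only once, which is the de-duplication the step describes.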

Corresponding to the method, the invention also provides a text abstract generating system, which comprises:

the text preprocessing module is used for preprocessing the original text from which the abstract is to be extracted;

the key sentence extraction module is used for extracting key sentences in the original text and corresponding key degrees by adopting a standard TextRank algorithm;

the key sentence semantic similarity calculation module is used for filtering preset stop words after word segmentation is carried out on the extracted key sentences by adopting a word segmentation tool, then carrying out word embedding processing on the filtered word segments on the basis of a word2vec model, calculating semantic correlation between the words in every two key sentences according to word embedding processing results, and further obtaining semantic similarity between every two key sentences;

the key sentence filtering module is used for filtering out key sentences with high semantic similarity and lower criticality according to the semantic similarity between every two key sentences and the corresponding criticality;

and the text abstract completing module is used for taking the semantic unit where the filtered key sentence is located as a new key sentence and sequentially linking the semantic unit according to the sequence of the new key sentence appearing in the text to generate a final text abstract.

Further, the text preprocessing module preprocesses the original text by removing non-text content such as space characters and line feed characters and then text-encoding the remaining text content. The encoding used may include UTF-8, GB2312, GBK, ASCII and other Chinese and English encodings. The key sentence filtering module filters as follows: a similarity threshold is preset and, for each pair of key sentences whose semantic similarity reaches the threshold, the key sentence with lower criticality is removed and only the one with higher criticality is kept. In addition, the text abstract finishing module is implemented as follows: the original text is divided into a plurality of semantic units according to Chinese ending identifiers, which at least comprise: ". ","? ","! ","? ". The semantic units containing the filtered key sentences are extracted and de-duplicated to serve as new key sentences, which are linked in the order in which they appear in the text to generate the final text abstract.

Drawings

Fig. 1 is a flowchart of a text summary generation method according to the present invention.

Detailed Description

In order to make the technical problems, technical solutions and advantages solved by the present invention more apparent, the present invention is further described in detail below with reference to the accompanying drawings. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

As shown in fig. 1 of the specification, the present invention provides a text summary generating method, which includes the following steps:

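a. preprocessing the text: removing space characters, line feed characters and other non-text content from the original text, and UTF-8 encoding the remaining text content;

b. extracting key sentences: extracting the key sentences in the original text and their corresponding key degrees based on the standard TextRank algorithm;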

c. calculating the similarity between the key sentences: adopting a Jieba word segmentation tool to segment the extracted key sentences, filtering preset stop words, carrying out word embedding processing on the filtered segments based on a word2vec model, calculating semantic correlation between the segments in every two key sentences according to word embedding processing results, and further obtaining semantic similarity between every two key sentences;

d. filtering the key sentences: of any two key sentences with similar semantics, filtering out the one with the lower key degree, according to the semantic similarity between every two key sentences and their corresponding key degrees;

e. extracting the complete semantic units in which the filtered key sentences are located, and ordering them as new key sentences to generate the text abstract.
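
The preprocessing of step a can be illustrated with the short sketch below. It assumes Chinese input, where deleting every whitespace character does not merge words, and it treats "non-text content" as whitespace only; the function name `preprocess` and the `source_encoding` parameter are hypothetical.

```python
import re


def preprocess(raw: bytes, source_encoding: str = "utf-8") -> bytes:
    """Decode the raw text, strip whitespace, and re-encode as UTF-8.

    Assumption: the text is Chinese, so spaces carry no word boundaries
    and can be deleted outright. English text would need gentler handling.
    """
    text = raw.decode(source_encoding)
    # \s matches Unicode whitespace on str patterns, including line feeds,
    # tabs and the full-width space U+3000.
    text = re.sub(r"\s+", "", text)
    return text.encode("utf-8")
```

For example, a GBK-encoded page would be passed as `preprocess(data, "gbk")` and come back as whitespace-free UTF-8 bytes.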

The TextRank algorithm involved in step b is a representative algorithm of extraction-type automatic text abstract generation and a widely adopted, mature key sentence extraction algorithm. It is a graph-based ranking algorithm for text. Its basic idea is derived from Google's PageRank algorithm: a text is divided into several constituent units (words or sentences), a graph model is built, and the important components of the text are ranked using a voting mechanism, so that keyword extraction and extraction-type abstracts can be achieved using only the information of a single document. The general model of the TextRank algorithm can be represented as a directed weighted graph $G = (V, E)$, consisting of a set of points $V$ and a set of edges $E$, where $E$ is a subset of $V \times V$. The weight of the edge between any two points $V_i$, $V_j$ in the graph is $w_{ji}$. For a given point $V_i$, $In(V_i)$ is the set of points pointing to that point and $Out(V_i)$ is the set of points that $V_i$ points to. The score of point $V_i$ is defined as follows:

$$WS(V_i) = (1 - d) + d \sum_{V_j \in In(V_i)} \frac{w_{ji}}{\sum_{V_k \in Out(V_j)} w_{jk}} \, WS(V_j) \qquad (1)$$

wherein d is a damping coefficient with a value range of 0 to 1, representing the probability of pointing from a given point in the graph to any other point; it is generally set to 0.85. When the score of each point in the graph is calculated with the TextRank algorithm, an arbitrary initial value is assigned to each point and the scores are computed iteratively until convergence, that is, until the error of every point in the graph is less than a given limit value, which is generally 0.0001.
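
The iteration described above can be sketched in a few lines. The sketch assumes the standard weighted TextRank update, in which each point's new score is (1 - d) plus d times the sum, over the points pointing to it, of that point's score multiplied by the share of its outgoing weight carried by the connecting edge. The function name `textrank_scores`, the dense weight-matrix representation and the `max_iter` safety cap are hypothetical choices for the example.

```python
from typing import List


def textrank_scores(w: List[List[float]], d: float = 0.85,
                    tol: float = 1e-4, max_iter: int = 200) -> List[float]:
    """Iterate the weighted TextRank update until scores stabilise.

    w[j][i] is the weight of the edge from point j to point i
    (0.0 means no edge).
    """
    n = len(w)
    if n == 0:
        return []
    out_sum = [sum(row) for row in w]   # total outgoing weight per point
    scores = [1.0] * n                  # arbitrary initial value
    for _ in range(max_iter):
        new_scores = []
        for i in range(n):
            incoming = sum(w[j][i] / out_sum[j] * scores[j]
                           for j in range(n)
                           if w[j][i] > 0 and out_sum[j] > 0)
            new_scores.append((1 - d) + d * incoming)
        # Converged when no point changed by more than the limit value.
        change = max(abs(a - b) for a, b in zip(new_scores, scores))
        scores = new_scores
        if change < tol:
            break
    return scores
```

With a symmetric sentence-similarity matrix, the graph is effectively undirected and every edge acts as both an incoming and an outgoing edge.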

When extracting key sentences and key degrees based on the TextRank algorithm, the method mainly comprises the following steps:

1. Preprocessing: the input text or text set $T$ is divided into sentences, namely $T = [S_1, S_2, S_3, \ldots, S_m]$; each sentence $S_i$ is segmented into words and stop words are removed to obtain:

$$S_i = [t_{i,1}, t_{i,2}, \ldots, t_{i,n}]$$

where $n$ is the number of candidate keywords of the sentence and $t_{i,j} \in S_i$ are the candidate keywords that are retained.

2. Sentence similarity calculation: the edge set $E$ of graph $G$ is constructed, with the similarity of two sentences given by the content overlap between them and calculated by the following formula:

$$Similarity(S_i, S_j) = \frac{|\{w_k \mid w_k \in S_i \wedge w_k \in S_j\}|}{\log|S_i| + \log|S_j|}$$

where $S_i$ and $S_j$ denote the two sentences and $w_k$ denotes a word in a sentence. The numerator is the number of words that appear in both sentences, and the denominator is the sum of the logarithms of the numbers of words in the two sentences. The denominator is designed this way to suppress the advantage that long sentences would otherwise have in the similarity calculation. If the similarity between two sentences is greater than a given threshold, the two sentences are considered semantically related and are connected, the weight of the edge being $w_{ji} = Similarity(S_i, S_j)$;

3. Calculating sentence weights: according to formula 1, the weights are propagated iteratively to calculate the score of each sentence;

4. Extracting abstract sentences: the sentences are sorted in descending order of the scores (namely the criticality) obtained in the previous step, and the T sentences with the highest criticality are extracted as key sentences.
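
Steps 2 and 4 above can be sketched as follows, assuming each sentence has already been segmented into a list of candidate keywords (step 1) and that sentence scores from step 3 are supplied by the caller. The similarity is the shared-word count divided by the sum of the logarithms of the two sentence lengths, as described in step 2. The function names, the default edge threshold of 0.0 and the handling of one-word sentences (similarity 0 when the denominator is not positive) are assumptions for the example.

```python
import math
from typing import List, Sequence


def overlap_similarity(s_i: Sequence[str], s_j: Sequence[str]) -> float:
    """Number of shared words divided by log|S_i| + log|S_j|."""
    if not s_i or not s_j:
        return 0.0
    denominator = math.log(len(s_i)) + math.log(len(s_j))
    if denominator <= 0:      # both sentences contain a single word
        return 0.0
    return len(set(s_i) & set(s_j)) / denominator


def build_edge_weights(sentences: Sequence[Sequence[str]],
                       threshold: float = 0.0) -> List[List[float]]:
    """Symmetric weight matrix; an edge exists only above the threshold."""
    n = len(sentences)
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            sim = overlap_similarity(sentences[i], sentences[j])
            if sim > threshold:
                w[i][j] = w[j][i] = sim
    return w


def top_key_sentences(scores: Sequence[float], t: int) -> List[int]:
    """Indices of the t highest-scoring sentences, best first."""
    ranked = sorted(range(len(scores)),
                    key=lambda i: scores[i], reverse=True)
    return ranked[:t]
```

The weight matrix returned by `build_edge_weights` is the input that an iterative scoring routine for step 3 would consume.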

The word2vec model in step c is an NLP tool released by *** in 2013. It vectorizes all segmented words (namely, word embedding) so that the relationships between words can be described quantitatively, and the semantic correlation between two segmented words can be measured by the result of a matrix operation on their two word vectors. The similarity between sentences can therefore be described quantitatively through a weighted operation over the semantic correlations of the words of the two sentences. The detailed implementation principle of the word2vec model is not described here.
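
One possible reading of this word-to-sentence aggregation is sketched below. The text does not fix the exact weighting, so the choices here are assumptions: word correlation is taken to be cosine similarity, each word is matched to its most similar word in the other sentence, and the two directed averages are combined with equal weight. The embeddings are passed in as a plain dictionary `emb` standing in for a trained word2vec model, and the function names are hypothetical.

```python
import math
from typing import Dict, Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity of two word vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def sentence_similarity(words_a: Sequence[str], words_b: Sequence[str],
                        emb: Dict[str, Vector]) -> float:
    """Symmetric average of best-match word similarities.

    Words missing from the embedding table are skipped; if either
    sentence has no known words the similarity is 0.0.
    """
    vecs_a = [emb[w] for w in words_a if w in emb]
    vecs_b = [emb[w] for w in words_b if w in emb]
    if not vecs_a or not vecs_b:
        return 0.0

    def directed(xs, ys):
        # Average, over words in xs, of the best match found in ys.
        return sum(max(cosine(x, y) for y in ys) for x in xs) / len(xs)

    return 0.5 * (directed(vecs_a, vecs_b) + directed(vecs_b, vecs_a))
```

The inputs are lists of segmented words with stop words already removed, as step c requires.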

Further, the filtering of key sentences in step d specifically includes: presetting a similarity threshold and, according to the similarity between key sentences calculated in step c, removing the key sentence with lower criticality from any two key sentences whose semantic similarity exceeds the threshold, keeping only the one with higher criticality. In addition, step e is specifically implemented as follows: the original text is divided into a plurality of semantic units according to the Chinese ending identifiers; the semantic units containing the filtered key sentences are extracted and de-duplicated to serve as new key sentences; and the new key sentences are ordered according to the order in which they appear in the original text to form the final abstract. The abstract generated in this way preserves semantic integrity and coherence to a certain extent, and its readability is improved.

Corresponding to the method, the invention also provides a text abstract generation system comprising a text preprocessing module, a key sentence extraction module, a key sentence semantic similarity calculation module, a key sentence filtering module and a text abstract finishing module. The text preprocessing module is used for preprocessing the original text from which the abstract is to be extracted; the key sentence extraction module is used for extracting the key sentences in the original text and their corresponding key degrees using the standard TextRank algorithm; the key sentence semantic similarity calculation module is used for segmenting the extracted key sentences with a word segmentation tool, filtering preset stop words, performing word embedding on the filtered segments based on a word2vec model, calculating the semantic correlation between the segments of every two key sentences from the word embedding results, and thereby obtaining the semantic similarity between every two key sentences; the key sentence filtering module is used for filtering key sentences according to the semantic similarity between every two key sentences and their corresponding key degrees; and the text abstract finishing module is used for extracting the complete semantic units in which the filtered key sentences are located as new key sentences and generating the text abstract after ordering them.

Further, the text preprocessing module preprocesses the original text by removing non-text content such as space characters and line feed characters and then text-encoding the remaining text content. The key sentence filtering module filters the key sentences as follows: a similarity threshold is preset and, for any two key sentences whose semantic similarity reaches the threshold, the key sentence with lower criticality is removed and only the one with higher criticality is kept. In addition, the text abstract finishing module of the system is implemented as follows: the original text is divided into a plurality of semantic units according to the Chinese ending identifiers; the semantic units containing the filtered key sentences are extracted and de-duplicated to serve as new key sentences; and the new key sentences are ordered according to the order in which they appear in the original text to form the final abstract.

Compared with the prior art, the text abstract generating method and the text abstract generating system provided by the invention have the following advantages:

1. semantic similarity is calculated only for the extracted key sentences, and the calculation amount is small;

2. key sentences are filtered based on a semantic similarity measure, which ensures the semantic diversity of the generated abstract for a given abstract length;

3. each key sentence is expanded to a complete semantic unit in the original text, so that the overall readability of the generated abstract is improved, the key sentences are reordered according to the sequence of the key sentences in the original text, and the semantic consistency of the abstract is improved.

Claims

1. A text abstract generating method comprises the following steps:

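a. preprocessing the original text from which the abstract is to be extracted;

b. extracting key sentences in the original text and corresponding key degrees based on a standard TextRank algorithm;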

c. adopting a word segmentation tool to segment the extracted key sentences, filtering preset stop words, carrying out word embedding processing on the filtered segments based on a word2vec model, and calculating semantic similarity between every two key sentences according to the similarity of the segments in the key sentences;

d. deleting one sentence with lower key degree in the two key sentences with the similarity exceeding the similarity threshold value, and realizing the filtering of the key sentences;

e. taking the semantic units in which the filtered key sentences are located as new key sentences, and ordering them to generate the abstract.

2. The method according to claim 1, wherein the preprocessing in step a is specifically: removing space characters, line feed characters and other non-text content from the original text, and then text-encoding the remaining text content.

3. The method of claim 1, wherein step d is embodied as: setting a similarity threshold and, for two similar key sentences whose semantic similarity exceeds the threshold, deleting the key sentence with lower criticality.

4. The method of claim 2, wherein the text encoding employs an encoding method comprising: UTF-8, GB2312, GBK and ASCII.

5. The method of claim 1, wherein step e is embodied as: dividing the original text into a plurality of semantic units according to the Chinese end symbol identifiers; and extracting semantic units containing the filtered key sentences, removing duplicates to serve as new key sentences, and sequencing the new key sentences according to the sequence of the new key sentences appearing in the original text to form a final abstract.

6. The method of claim 5, wherein the chinese ending identifier comprises at least: ". ","? ","! ","? ".

7. A text summary generation system, the system comprising: the system comprises a text preprocessing module, a key sentence extraction module, a key sentence semantic similarity calculation module, a key sentence filtering module and a text abstract finishing module; wherein

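the text preprocessing module is used for preprocessing the original text from which the abstract is to be extracted;

the key sentence extraction module is used for extracting key sentences in the original text and corresponding key degrees by adopting a standard TextRank algorithm;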

the key sentence semantic similarity calculation module is used for filtering preset stop words after word segmentation is carried out on the extracted key sentences by adopting a word segmentation tool, then carrying out word embedding processing on the filtered word segments on the basis of a word2vec model, and calculating semantic correlation between every two key sentence segments according to word embedding processing results so as to obtain semantic similarity between every two key sentences;

the key sentence filtering module is used for removing key sentences with lower key degrees in two key sentences with semantic similarity exceeding a preset threshold value according to the semantic similarity between every two key sentences and the corresponding key degrees;

and the text abstract completing module is used for taking the semantic unit where the filtered key sentence is located as a new key sentence and sequentially linking the semantic unit according to the sequence of the new key sentence appearing in the text to generate the text abstract.

8. The system of claim 7, wherein the text preprocessing module is embodied as: removing non-text content such as space characters and line feed characters from the original text, and then text-encoding the remaining text content.

9. The system of claim 7, wherein the key sentence filtering module is embodied as: setting a similarity threshold and, for two similar key sentences whose semantic similarity exceeds the threshold, deleting the key sentence with the lower key degree.

10. The system of claim 7, wherein the text summarization completion module is embodied as: dividing the original text into a plurality of semantic units according to the Chinese end symbol identifiers; and extracting semantic units containing the filtered key sentences, removing duplicates to serve as new key sentences, and sequencing the new key sentences according to the sequence of the new key sentences appearing in the original text to form a final abstract.