CN111209752A

CN111209752A - Chinese extractive ensemble unsupervised summarization method based on auxiliary information

Info

Publication number: CN111209752A
Application number: CN202010211550.3A
Authority: CN
Inventors: 马帅; 蒋浩谊; 华轶名
Original assignee: Beihang University
Current assignee: Beihang University
Priority date: 2019-11-13
Filing date: 2020-03-24
Publication date: 2020-05-29

Abstract

The invention provides a Chinese extractive ensemble unsupervised summarization method based on auxiliary information, which comprises the following steps: step 1, text preprocessing, wherein the preprocessing comprises segmenting the paragraphs into words and removing stop words; step 2, providing the news text data set to a graph-based and a clustering-based extractive automatic summarization algorithm; and step 3, obtaining the final news summary as follows:

where Q represents the title of the news; s represents the news summary that has already been selected; Sim() calculates the similarity between sentences; S refers to a sentence in the original article; Sim(S, Q) calculates the similarity between a sentence in the original article and the news title; Sim(S, s) calculates the similarity between a sentence in the original article and the summary obtained so far; and ArgMax gives the index set of the largest elements in the set.

Description

Chinese extractive ensemble unsupervised summarization method based on auxiliary information

Technical Field

The invention relates to a method for generating a summary, in particular to a Chinese extractive ensemble unsupervised summarization method based on auxiliary information.

Background

As text information has grown explosively, readers need faster and more efficient ways to learn the main content of articles. Automatic summarization is a branch of natural language processing: a technology for generating a short text from one or more long texts. It can be applied in a variety of scenarios, such as news texts, meeting records, medical records, and social media texts. Automatic summarization has been widely studied, and the prior art falls into two categories: extractive summarization and generative (abstractive) summarization. Extractive summarization selects important content from the original text and splices it together to form the final summary; generative summarization captures key information through learned rules and generates sentences that do not appear in the original article. Generative techniques have evolved rapidly in recent years, but they require a large amount of training data and consequently generalize poorly. Industry therefore typically uses extractive rather than generative summarization. Traditional extractive summarization places no requirement on the language. Unlike Western languages, however, Chinese is very challenging to process. The biggest difference is that Chinese text must first be processed with a word segmentation tool, and the quality of that tool directly or indirectly influences the quality of the final summary.

In the prior art, the graph-based automatic summarization method TextRank is in essence still the PageRank algorithm proposed by Google. The biggest problem of PageRank is rank sink, and TextRank can place overlapping sentences in the final summary; the centroid-based automatic summarization method depends heavily on the quality of the clustering algorithm, so its robustness is questionable; the automatic summarization method based on submodular functions does not consider the similarity between words; and the automatic summarization method based on deep learning relies heavily on large amounts of labeled training data, and its results are not stable enough.

The extractive summarization algorithms of the prior art can be improved in many respects, chiefly three. First, most extractive algorithms are based on a single mathematical model, and many of them, such as TextRank, have inherent defects that are difficult to remedy within the model itself. Second, with the popularization of the internet and mobile phones, more and more news text is spread online, yet no algorithm currently on the market makes use of news headlines. Third, traditional unsupervised algorithms are highly robust, but many of them lack the semantic understanding that deep learning algorithms offer; most deep learning algorithms can capture semantics but depend largely on high-quality labeled data.

Disclosure of Invention

Therefore, the invention provides a Chinese extractive ensemble unsupervised summarization method based on auxiliary information. The data set is modeled with two different unsupervised learning methods, and the final summary is selected from their extracted results using an improved version of MMR (maximal marginal relevance) together with the news title.

The method comprises the following specific steps:

Step 1, text preprocessing, including word segmentation, stop-word removal, and other processing;

Step 2, the news text data set is provided to the TextRank algorithm and the Affinity Propagation algorithm. Because the Affinity Propagation algorithm cannot process text data directly, the method first vectorizes the Chinese words using the 8 million Chinese word vectors provided by Tencent and then converts the resulting word vectors into Chinese sentence vectors. The advantage of using pre-trained word vectors is that the results are relatively stable while semantic information between words is captured. Finally, the sentences corresponding to the cluster centers found by Affinity Propagation are taken as the news summary computed by Affinity Propagation.

Step 3, the final news summary is obtained as follows:

where Q represents the title of the news; s represents the news summary already selected; and S denotes a paragraph in the news. The algorithm takes the order of paragraphs in the news into account, since in news writing the earlier paragraphs contain more important information. λ controls the trade-off between diversity and accuracy in the MMR algorithm: the larger λ is, the more accurate the summary selected by MMR; the smaller λ is, the more diverse the summary selected by MMR. The ultimate goal is to balance the diversity and accuracy of the summary. λ in MMR is preset to 0.7, and the input of the MMR algorithm is the output of the TextRank algorithm and the Affinity Propagation algorithm, that is, the summaries that each of them generates for the news text.
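The formula itself is not present in this text. For orientation only, the standard (unmodified) MMR criterion can be written in the symbols defined above. Here R, the candidate set formed from the TextRank and Affinity Propagation outputs, and s_j, an individual sentence of the selected summary s, are names assumed for illustration. The invention describes an improved MMR that also accounts for paragraph order; that additional term cannot be recovered from the text and is not shown.

```latex
% Standard MMR, not the patent's improved variant (assumed notation: R, s_j)
\mathrm{MMR} = \operatorname*{ArgMax}_{S \in R \setminus s}
  \Bigl[\, \lambda \,\mathrm{Sim}(S, Q)
        \;-\; (1 - \lambda) \max_{s_j \in s} \mathrm{Sim}(S, s_j) \Bigr]
```

With λ = 0.7, relevance to the title Q carries more weight than the penalty for overlapping with sentences already chosen.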

The extractive automatic summarization algorithm of the invention, based on ensemble learning and assisted by Chinese news headlines, improves on the robustness of existing algorithms and overcomes the inherent defects of a single mathematical model through ensemble learning; it makes reasonable use of news headline information to improve performance, because news headlines generally capture the main content of the news with high quality; and finally, by using pre-trained word vectors, it combines semantic information with traditional algorithms to obtain better results.

Drawings

FIG. 1 is a flow chart embodying the present invention;

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.

The invention provides a Chinese extractive ensemble unsupervised summarization method based on auxiliary information. The data set is modeled with two different unsupervised learning methods, and the final summary is selected from their extracted results using an improved version of MMR (maximal marginal relevance) together with the news title.

The method comprises the following specific steps:

Step 1, text preprocessing, which includes word segmentation, stop-word removal, and the like. In this step, the text is first split into n paragraphs by a sentence-splitting operation. The paragraphs are then segmented into words with a word segmentation tool (Jieba), and at the same time the stop words in each paragraph are filtered out. Stop words are words that appear in large numbers in general text, such as "I", "you", and "he". The filtered result consists of paragraphs that contain the main information of the article.
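The preprocessing step can be sketched roughly as follows. This is a minimal illustration, not the patent's implementation: the punctuation sets, the function names, and the pluggable `tokenize` callable are assumptions (in practice `tokenize` could be a Chinese word segmenter; here any callable that maps a string to a list of tokens works).

```python
import re
from typing import Callable, Iterable, List

# Sentence-ending punctuation used for splitting (an assumption, not from the patent).
_SENTENCE = re.compile(r"[^。！？!?]+[。！？!?]?")
# Punctuation tokens dropped after segmentation (also an assumption).
_PUNCT = set("。！？!?，,、；;：:")


def split_sentences(text: str) -> List[str]:
    """Split text into sentences, keeping the closing punctuation mark."""
    return [m.strip() for m in _SENTENCE.findall(text) if m.strip()]


def preprocess(
    text: str,
    tokenize: Callable[[str], Iterable[str]],
    stopwords: Iterable[str],
) -> List[List[str]]:
    """Return one filtered token list per sentence.

    tokenize  -- hypothetical word segmenter supplied by the caller
    stopwords -- words to discard, e.g. pronouns and particles
    """
    stop = set(stopwords)
    result = []
    for sentence in split_sentences(text):
        tokens = [
            t for t in tokenize(sentence)
            if t.strip() and t not in stop and t not in _PUNCT
        ]
        result.append(tokens)
    return result
```

Passing the segmenter in as a parameter keeps the sketch independent of any particular tool; the stop-word list is likewise left to the caller.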

Step 2, the news text data set is provided to the TextRank algorithm and the Affinity Propagation algorithm. TextRank proceeds as follows: first, each sentence is represented as a vector using the preprocessed result; then the similarities between sentence vectors are calculated and stored in a matrix; next, the similarity matrix is converted into a graph with sentences as nodes and similarities as edges, and the graph is ranked with the PageRank algorithm; finally, a group of the highest-ranked sentences is selected to form the final summary.
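The TextRank procedure described above can be sketched in plain Python as follows. The choice of cosine similarity, the damping factor 0.85, the iteration limits, and all function names are assumptions for illustration; the text does not specify them.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def similarity_matrix(vectors: Sequence[Sequence[float]]) -> List[List[float]]:
    """Pairwise similarities; no self-loops, negative values clamped to 0."""
    n = len(vectors)
    return [
        [0.0 if i == j else max(0.0, cosine(vectors[i], vectors[j]))
         for j in range(n)]
        for i in range(n)
    ]


def pagerank(sim: List[List[float]], damping: float = 0.85,
             iters: int = 100, tol: float = 1e-8) -> List[float]:
    """Weighted PageRank by power iteration over the similarity graph."""
    n = len(sim)
    if n == 0:
        return []
    out_weight = [sum(row) for row in sim]
    scores = [1.0 / n] * n
    for _ in range(iters):
        new = []
        for i in range(n):
            rank = 0.0
            for j in range(n):
                if out_weight[j] > 0:
                    rank += sim[j][i] / out_weight[j] * scores[j]
                else:
                    rank += scores[j] / n  # isolated node: spread its score evenly
            new.append((1.0 - damping) / n + damping * rank)
        delta = sum(abs(a - b) for a, b in zip(new, scores))
        scores = new
        if delta < tol:
            break
    return scores


def textrank_top_k(vectors: Sequence[Sequence[float]], k: int) -> List[int]:
    """Indices of the k highest-ranked sentences, in document order."""
    scores = pagerank(similarity_matrix(vectors))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])
```

Returning indices in document order keeps the extracted sentences readable when they are joined into a summary.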

The idea of the Affinity Propagation algorithm is to regard all samples as nodes of a network and then compute the cluster center of each sample through messages passed along the edges of the network. During clustering, two kinds of messages are passed between nodes: responsibility and availability. The input to the Affinity Propagation algorithm is a similarity matrix between samples. After preprocessing, the Chinese sentence vector of each sentence is constructed from the Chinese word vectors with the SIF algorithm, and the similarity between sentences is then calculated. In Affinity Propagation, the responsibility r(i, j) describes how well suited sample j is to serve as the class representative of sample i, and the availability a(i, j) describes how appropriate it is for sample i to choose sample j as its class representative. The larger the sum of r(i, j) and a(i, j), the more likely point j is to be a cluster center.
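For reference, the two messages are usually updated with the rules below, taken from the standard formulation of Affinity Propagation rather than from this text. The symbol sim(i, j) for the input similarity between samples i and j is an assumed name, chosen to avoid confusion with the summary symbol s used elsewhere.

```latex
% Responsibility: how strongly j is favoured over i's other candidate representatives
r(i,j) \leftarrow \mathrm{sim}(i,j) - \max_{j' \neq j} \bigl[\, a(i,j') + \mathrm{sim}(i,j') \,\bigr]

% Availability: accumulated support for j from the other samples
a(i,j) \leftarrow \min\Bigl(0,\; r(j,j) + \sum_{i' \notin \{i,j\}} \max\bigl(0,\, r(i',j)\bigr)\Bigr), \quad i \neq j

a(j,j) \leftarrow \sum_{i' \neq j} \max\bigl(0,\, r(i',j)\bigr)

% Class representative chosen for sample i
\operatorname*{ArgMax}_{j} \bigl[\, a(i,j) + r(i,j) \,\bigr]
```

In practice the updates are typically damped, mixing each new message with its previous value, to avoid oscillation.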

The sentences corresponding to the centers obtained by Affinity Propagation are extracted as the final summary. The invention first vectorizes the Chinese words using the 8 million Chinese word vectors provided by Tencent and then uses SIF [10] to convert the resulting word vectors into Chinese sentence vectors; the advantage of pre-trained word vectors is that the results are stable and semantic information between words is captured. The Tencent Chinese word vectors learn the association relationships between words in a high-dimensional space. With a trained English word2vec word vector model, for example, semantic relations between words can be obtained, such as king - man + woman = queen. Another representation of words is the one-hot form, but it cannot learn the association relationships between words. The invention converts the Tencent word vectors into a sentence vector in two steps: step 2.1, each word vector in the sentence is multiplied by a weight a/(a + P_w), where a is a constant (set to 0.001) and P_w is the word frequency of the word, so the more frequent a word is, the smaller its weight; step 2.2, the first principal component u of the sentence vector matrix is calculated, and each sentence vector has its projection onto u subtracted.

Step 3, the final news summary is obtained as follows:

where Q represents the title of the news; s represents the news summary that has already been selected; and Sim() calculates the similarity between sentences. S is a sentence in the original article; Sim(S, Q) calculates the similarity between a sentence in the original article and the title, and Sim(S, s) calculates the similarity between a sentence in the original article and the summary obtained so far. ArgMax gives the index of the largest element in the set. The algorithm takes the order of paragraphs in the news into account, since in news writing the earlier paragraphs contain more important information. λ controls the trade-off between diversity and accuracy in the MMR algorithm: the larger λ is, the more accurate the summary selected by MMR; the smaller λ is, the more diverse the summary selected by MMR. The ultimate goal is to balance the diversity and accuracy of the summary. λ in MMR is preset to 0.7.
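Steps 2.1 and 2.2 can be sketched in plain Python as follows. This is an illustration under stated assumptions: the function names are hypothetical, the weighted word vectors are averaged over the sentence as in the usual SIF formulation, and the first principal direction u is taken from the uncentered sentence matrix by power iteration.

```python
import math
from typing import Dict, List, Sequence


def sif_weight(word_prob: float, a: float = 0.001) -> float:
    """Step 2.1 weight a / (a + P_w): frequent words get smaller weights."""
    return a / (a + word_prob)


def weighted_average(tokens: Sequence[str],
                     word_vecs: Dict[str, Sequence[float]],
                     word_prob: Dict[str, float],
                     a: float = 0.001) -> List[float]:
    """Average of the weighted word vectors of one sentence.

    Tokens without a vector are skipped; a missing frequency counts as 0.
    """
    dim = len(next(iter(word_vecs.values())))
    known = [t for t in tokens if t in word_vecs]
    total = [0.0] * dim
    for t in known:
        w = sif_weight(word_prob.get(t, 0.0), a)
        for d, x in enumerate(word_vecs[t]):
            total[d] += w * x
    return [x / len(known) for x in total] if known else total


def first_principal_direction(rows: Sequence[Sequence[float]],
                              iters: int = 100) -> List[float]:
    """Unit vector u approximating the top right singular vector of rows."""
    dim = len(rows[0])
    u = [1.0 / math.sqrt(dim)] * dim
    for _ in range(iters):
        proj = [sum(r[d] * u[d] for d in range(dim)) for r in rows]      # X u
        nxt = [sum(p * r[d] for p, r in zip(proj, rows)) for d in range(dim)]  # X^T X u
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0.0:
            break
        u = [x / norm for x in nxt]
    return u


def remove_first_component(rows: Sequence[Sequence[float]]) -> List[List[float]]:
    """Step 2.2: subtract from each sentence vector its projection onto u."""
    u = first_principal_direction(rows)
    cleaned = []
    for r in rows:
        c = sum(x * y for x, y in zip(r, u))
        cleaned.append([x - c * y for x, y in zip(r, u)])
    return cleaned
```

Removing the shared direction u discounts what all sentences have in common, so the remaining components emphasize what distinguishes one sentence from another.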
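The role of λ can be illustrated with a greedy selection loop. The sketch below implements standard MMR only, with cosine similarity assumed for Sim() and sentence vectors assumed as input; it does not include the paragraph-order adjustment of the improved MMR, whose exact form is not given in the text. All function names are hypothetical.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def mmr_select(candidates: Sequence[Sequence[float]],
               title_vec: Sequence[float],
               k: int,
               lam: float = 0.7) -> List[int]:
    """Greedily pick up to k candidate indices by standard MMR.

    candidates -- vectors of sentences proposed by the base summarizers
    title_vec  -- vector of the news title Q
    lam        -- trade-off: larger favours relevance, smaller favours diversity
    """
    selected: List[int] = []
    remaining = list(range(len(candidates)))

    def score(i: int) -> float:
        relevance = cosine(candidates[i], title_vec)
        # Highest similarity to any sentence already in the summary.
        redundancy = max(
            (cosine(candidates[i], candidates[j]) for j in selected),
            default=0.0,
        )
        return lam * relevance - (1.0 - lam) * redundancy

    while remaining and len(selected) < k:
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

For example, with candidates `[[1.0, 0.0], [0.99, 0.1], [0.6, 0.8]]` and title vector `[1.0, 0.0]`, λ = 0.7 selects the two sentences closest to the title, while λ = 0.3 replaces the near-duplicate second sentence with the more distinct third one.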

Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims

1. A Chinese extractive ensemble unsupervised summarization method based on auxiliary information, characterized by comprising the following steps: step 1, text preprocessing, wherein the preprocessing comprises word segmentation and stop-word removal on paragraphs; step 2, providing the news text data set to a graph-based and a clustering-based extractive automatic summarization algorithm; and step 3, obtaining the final news summary by using an improved MMR algorithm as follows:

where Q represents the title of the news; s represents the selected news summary; Sim() calculates the similarity between sentences; S refers to a sentence in the original article; Sim(S, Q) calculates the similarity between a sentence in the original article and the title; Sim(S, s) calculates the similarity between a sentence in the original article and the summary obtained so far; and ArgMax gives an index set of the largest element in the set.

2. The method as claimed in claim 1, wherein the preprocessing first performs a sentence-splitting operation on the text to divide it into a plurality of paragraphs, then segments the paragraphs into words with a word segmentation tool and filters out the stop words in the paragraphs, wherein the stop words are words that appear in large numbers in the text and carry no actual meaning, and the filtered result consists of paragraphs containing the main information of the text.

3. The method of claim 2, wherein the unsupervised extractive automatic summarization algorithm first represents each sentence as a vector using the preprocessed result; then calculates the similarities between sentence vectors and stores them in a matrix; next converts the similarity matrix into a graph with sentences as nodes and similarities as edges and ranks the graph with the PageRank algorithm; and finally selects a group of the highest-ranked sentences to form the final summary.

4. The method of claim 3, wherein in the clustering algorithm, r(i, j) is the degree to which sample j is suited to serve as the class representative of sample i; a(i, j) is the appropriateness of sample i choosing sample j as its class representative; the larger the sum of r(i, j) and a(i, j), the more likely point j is to be a cluster center; and finally, the sentences corresponding to the centers obtained by the clustering algorithm are extracted as the final summary.