CN111209752A - Chinese extraction integrated unsupervised abstract method based on auxiliary information - Google Patents

Info

Publication number
CN111209752A
Authority
CN
China
Prior art keywords
news
abstract
similarity
algorithm
sentence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010211550.3A
Other languages
Chinese (zh)
Inventor
马帅
蒋浩谊
华轶名
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beihang University
Original Assignee
Beihang University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2019-11-13
Filing date
2020-03-24
Publication date
2020-05-29
Application filed by Beihang University

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/355 Class or cluster creation or modification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a Chinese extraction integrated unsupervised abstract method based on auxiliary information, comprising the following steps: step 1, text preprocessing, wherein the preprocessing comprises word segmentation of the paragraphs and stop-word removal; step 2, providing the news text data set to a graph-based and a clustering-based extractive automatic summarization algorithm for processing; and step 3, obtaining the final news abstract in the following manner:
Figure DDA0002422997770000011
Figure DDA0002422997770000012
wherein Q represents the title of the news; s represents the news abstract that has already been selected; Sim() is used to calculate the similarity between sentences; S refers to a sentence in the original article; Sim(S, Q) calculates the similarity between a sentence in the original article and the news title; Sim(S, s) calculates the similarity between a sentence in the original article and the abstract obtained so far; and ArgMax gives the index set of the largest elements in the set.

Description

Chinese extraction integrated unsupervised abstract method based on auxiliary information
Technical Field
The invention relates to a method for generating an abstract, in particular to a Chinese extraction integrated unsupervised abstract method based on auxiliary information.
Background
With the explosive growth of text information, readers need faster and more efficient ways to grasp the main content of articles. Automatic summarization is a branch of natural language processing: a technology for generating a short text from one or more long texts. It can be applied in a variety of scenarios, such as news texts, meeting records, medical records, and social media texts. Automatic summarization has been widely studied, and the prior art falls into two categories: extractive automatic summarization and generative automatic summarization. Extractive summarization selects important sentences from the original text and splices them together to form the final abstract; generative summarization captures key information through learned rules and generates sentences that do not appear in the original article. Generative techniques have evolved rapidly in recent years, but they require a large amount of training data and consequently generalize poorly. Industry therefore typically uses extractive rather than generative summarization. Traditional extractive summarization places no requirement on the language. Unlike Western languages, however, Chinese is very challenging to process. The biggest difference is that Chinese text must first be processed by a word segmentation tool, whose quality directly or indirectly affects the quality of the final abstract.
In the prior art, the graph-based automatic summarization method TextRank is in essence still the PageRank algorithm proposed by Google. The largest problem of PageRank is the rank sink, and TextRank may place overlapping sentences in the final abstract; the centroid-based automatic summarization method depends heavily on the quality of the clustering algorithm, so its robustness is problematic; the automatic summarization method based on submodular functions does not consider the similarity between words; and the deep-learning-based automatic summarization method relies heavily on a large amount of labeled training data, and its results are not stable enough.
Prior-art extractive automatic summarization algorithms can be improved in many respects, mainly in three aspects. First, most extractive algorithms are based on a single mathematical model, and many of them, such as TextRank, have inherent defects that are difficult to overcome on their own. Second, with the popularization of the internet and mobile phones, more and more news text is disseminated online, yet no algorithm currently on the market makes use of news headlines. Third, traditional unsupervised algorithms are highly robust, but many of them lack the semantic understanding provided by deep learning algorithms; most deep learning algorithms can understand semantics but depend largely on high-quality labeled data.
Disclosure of Invention
Therefore, the invention provides a Chinese extraction integrated unsupervised summarization method based on auxiliary information. The data set is modeled by two different unsupervised learning methods, and the final abstract is then extracted from their results by an improved version of MMR together with the news title.
The method comprises the following specific steps:
Step 1, text preprocessing, including word segmentation, stop-word removal, and other processes;
Step 2, providing the news text data set to the TextRank algorithm and the Affinity Propagation algorithm. Because the Affinity Propagation algorithm cannot process text data directly, the method first vectorizes the Chinese words using the 8 million Chinese word vectors provided by Tencent, and then converts the resulting word vectors into Chinese sentence vectors. The advantage of pre-trained word vectors is that the results are relatively stable while semantic information between words is captured. Finally, the sentences corresponding to the cluster centers found by Affinity Propagation are taken as the news abstract computed by Affinity Propagation.
Step 3, the final news abstract is obtained as follows:
Figure BDA0002422997750000021
wherein Q represents the title of the news; s represents the news abstract that has already been selected; and S denotes a paragraph in the news. The algorithm takes the order of paragraphs in the news into account: when news is written, earlier paragraphs contain more important information. λ controls the trade-off between diversity and accuracy in the MMR algorithm. When λ is larger, the abstract selected by MMR is more accurate; when λ is smaller, it is more diverse. The ultimate goal is to balance the diversity and accuracy of the abstract. λ in MMR is preset to 0.7, and the input of the MMR algorithm is the output of the TextRank algorithm and the Affinity Propagation algorithm, that is, the abstracts each of them generates for the news text.
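For orientation only, the classic maximal marginal relevance criterion from the literature can be written with the symbols used here: Q is the news title, S_i a candidate sentence, and s the abstract selected so far. This is the standard MMR form, not necessarily the improved variant of the invention, whose paragraph-order term is not reproduced in the text; the candidate set R is a symbol introduced purely for illustration.

```latex
\[
\operatorname*{arg\,max}_{S_i \in R \setminus s}
\Bigl[ \lambda \,\mathrm{Sim}(S_i, Q) \;-\; (1-\lambda) \max_{s_j \in s} \mathrm{Sim}(S_i, s_j) \Bigr]
\]
```

With λ = 0.7 the first term (relevance to the title) dominates, which matches the statement that a larger λ favors accuracy and a smaller λ favors diversity.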
By using ensemble learning, this extractive automatic summarization algorithm, based on ensemble learning and assisted by Chinese news headlines, improves on the robustness of existing algorithms and overcomes the inherent defects of a single mathematical model; it makes reasonable use of news headline information to improve performance, because news headlines generally capture the main content of the news with high quality; and finally, by using pre-trained word vectors, it combines semantic information with traditional algorithms to obtain better results.
Drawings
FIG. 1 is a flow chart of an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
The invention provides a Chinese extraction integrated unsupervised abstract method based on auxiliary information. The data set is modeled by two different unsupervised learning methods, and the final abstract is then extracted from their results by an improved version of MMR together with the news title.
The method comprises the following specific steps:
Step 1, text preprocessing, which comprises word segmentation, stop-word removal, and similar processes. In this first step, a sentence splitting operation is first performed on the text to split it into n paragraphs. After the split, a word segmentation tool (end segmentation) is used to segment each paragraph into words. At the same time, the stop words in each paragraph are filtered out. Stop words are words that appear in large numbers in general text, such as "I", "you", and "he". The filtered result is a set of paragraphs that retain the main information of the article.
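The preprocessing step can be sketched roughly as below. This is a minimal illustration, not the invention's implementation: the tiny `STOP_WORDS` set, the function names, and the character-level fallback tokenizer are all assumptions made so the example runs without external tools. In practice a real Chinese word segmenter would be passed in through the `tokenize` parameter.

```python
import re

# Hypothetical, tiny stop-word list for illustration only.
STOP_WORDS = {"我", "你", "他", "的", "了", "是"}

def split_sentences(text):
    """Split text into sentences at Chinese or Western end punctuation."""
    parts = re.findall(r"[^。！？!?]+[。！？!?]?", text)
    return [p.strip() for p in parts if p.strip()]

def char_tokenize(sentence):
    """Fallback tokenizer (assumption): one token per CJK character or
    alphanumeric run. A real word segmenter should replace this."""
    return re.findall(r"[\u4e00-\u9fff]|[A-Za-z0-9]+", sentence)

def preprocess(text, tokenize=char_tokenize, stop_words=STOP_WORDS):
    """Return (sentence, filtered_tokens) pairs, dropping empty results."""
    result = []
    for sent in split_sentences(text):
        tokens = [t for t in tokenize(sent) if t not in stop_words]
        if tokens:
            result.append((sent, tokens))
    return result
```

Calling `preprocess(article_text, tokenize=my_segmenter)` with any callable that maps a sentence to a list of words keeps the rest of the pipeline independent of the segmentation tool.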
Step 2, providing the news text data set to the TextRank algorithm and the Affinity Propagation algorithm. TextRank proceeds as follows: first, each sentence is represented as a vector using the preprocessed result; then, the similarity between sentence vectors is calculated and stored in a matrix; next, the similarity matrix is converted into a graph with sentences as nodes and similarities as edges, and the graph is scored using the PageRank algorithm; finally, the group of highest-ranked sentences is selected to form the final abstract.
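The TextRank procedure just described can be sketched in plain Python as follows. The choice of cosine similarity, the damping factor 0.85, and all function names are assumptions for illustration; the text does not fix them.

```python
import math

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def similarity_matrix(vectors):
    """Pairwise similarities with a zero diagonal (no self-loops)."""
    n = len(vectors)
    return [[0.0 if i == j else max(cosine(vectors[i], vectors[j]), 0.0)
             for j in range(n)] for i in range(n)]

def pagerank(sim, damping=0.85, iters=100, tol=1e-8):
    """Weighted PageRank on the sentence graph defined by `sim`."""
    n = len(sim)
    out_weight = [sum(row) for row in sim]
    scores = [1.0 / n] * n
    for _ in range(iters):
        new = []
        for i in range(n):
            rank = 0.0
            for j in range(n):
                if out_weight[j] > 0:
                    rank += sim[j][i] / out_weight[j] * scores[j]
                else:
                    rank += scores[j] / n  # isolated node: spread evenly
            new.append((1 - damping) / n + damping * rank)
        delta = sum(abs(a - b) for a, b in zip(new, scores))
        scores = new
        if delta < tol:
            break
    return scores

def textrank_summary(sentences, vectors, k=2):
    """Pick the k highest-ranked sentences, returned in document order."""
    scores = pagerank(similarity_matrix(vectors))
    top = sorted(range(len(sentences)), key=lambda i: -scores[i])[:k]
    return [sentences[i] for i in sorted(top)]
```

Returning the selected sentences in their original order is a design choice of this sketch; it keeps the extracted abstract readable.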
The idea of the Affinity Propagation algorithm is to treat all samples as nodes of a network and then determine the cluster center of each sample by passing messages along the edges of the network. During clustering, two kinds of messages are passed between nodes: responsibility and availability. The input to the Affinity Propagation algorithm is a similarity matrix between samples. After preprocessing, the Chinese sentence vector of each sentence is constructed from the Chinese word vectors with the SIF algorithm, and the similarity between sentences is then calculated. In Affinity Propagation, r(i, j) describes how well suited sample j is to serve as the class representative of sample i, and a(i, j) describes how appropriate it is for sample i to choose sample j as its class representative. The larger the sum of r(i, j) and a(i, j), the more likely point j is to be a cluster center.
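The message passing can be sketched with the standard responsibility and availability updates of Affinity Propagation from the literature. This is a bare-bones illustration under several assumptions: the damping value, the fixed iteration count, and the convention that the diagonal of `S` holds each sample's preference are choices made here, and the function expects at least two samples.

```python
def affinity_propagation(S, damping=0.5, iters=200):
    """Return the indices of cluster centers (exemplars).

    S is an n x n similarity matrix (n >= 2); S[k][k] is the preference
    of sample k for becoming a center (an assumption of this sketch).
    """
    n = len(S)
    R = [[0.0] * n for _ in range(n)]  # responsibilities r(i, k)
    A = [[0.0] * n for _ in range(n)]  # availabilities a(i, k)
    for _ in range(iters):
        # r(i, k) = s(i, k) - max over k' != k of [a(i, k') + s(i, k')]
        for i in range(n):
            vals = [A[i][k] + S[i][k] for k in range(n)]
            for k in range(n):
                best_other = max(vals[j] for j in range(n) if j != k)
                target = S[i][k] - best_other
                R[i][k] = damping * R[i][k] + (1 - damping) * target
        # a(i, k) = min(0, r(k, k) + sum of positive r(i', k), i' not in {i, k})
        # a(k, k) = sum of positive r(i', k), i' != k
        for k in range(n):
            pos = [max(0.0, R[i][k]) for i in range(n)]
            total = sum(pos) - pos[k]
            for i in range(n):
                if i == k:
                    target = total
                else:
                    target = min(0.0, R[k][k] + total - pos[i])
                A[i][k] = damping * A[i][k] + (1 - damping) * target
    # Sample k is a center when r(k, k) + a(k, k) is positive.
    return [k for k in range(n) if R[k][k] + A[k][k] > 0]
```

For summarization, `S` would hold the similarities between sentence vectors, and the sentences at the returned indices would be taken as the clustering-based abstract.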
The sentences corresponding to the centers obtained by Affinity Propagation are extracted as the final abstract. The invention first vectorizes the Chinese words using the 8 million Chinese word vectors provided by Tencent and then uses SIF [10] to convert the resulting word vectors into Chinese sentence vectors. The advantage of pre-trained word vectors is that the results are stable and semantic information between words is captured. The Tencent Chinese word vectors learn the association between words in a high-dimensional space. A trained English word2vec model, for example, captures semantic associations between words such as king − man + woman ≈ queen. Another representation of words is the one-hot form, but it cannot learn the association between words. The invention converts the Tencent word vectors obtained above into sentence vectors in two steps: step 2.1, each word vector in the sentence is multiplied by a weight a/(a + P_w), where a is a constant (taken as 0.001) and P_w is the frequency of the word, so that the more frequent a word is, the smaller its weight; step 2.2, the first principal component u of the sentence vector matrix is calculated, and the projection of each sentence vector onto u is subtracted from that vector.
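Steps 2.1 and 2.2 can be sketched as below. Several details are assumptions of this illustration rather than statements of the text: the weighted word vectors are averaged per sentence, P_w is read as a unigram probability supplied in `word_prob`, and the first principal direction u is approximated by power iteration on the uncentered sentence matrix. All names are hypothetical.

```python
import math

def sif_weight(p_w, a=0.001):
    """Weight a / (a + P_w): frequent words get smaller weights."""
    return a / (a + p_w)

def sif_sentence_vectors(tokenized, word_vecs, word_prob, a=0.001, power_iters=50):
    """Build sentence vectors from word vectors (dict: word -> list of floats)."""
    dim = len(next(iter(word_vecs.values())))
    # Step 2.1: weighted average of the word vectors in each sentence.
    vecs = []
    for tokens in tokenized:
        known = [t for t in tokens if t in word_vecs]
        v = [0.0] * dim
        for t in known:
            w = sif_weight(word_prob.get(t, 0.0), a)
            for d in range(dim):
                v[d] += w * word_vecs[t][d]
        if known:
            v = [x / len(known) for x in v]
        vecs.append(v)
    # Step 2.2: estimate the first principal direction u by power iteration.
    u = [1.0 / math.sqrt(dim)] * dim
    for _ in range(power_iters):
        proj = [sum(x * y for x, y in zip(v, u)) for v in vecs]
        nxt = [sum(p * v[d] for p, v in zip(proj, vecs)) for d in range(dim)]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0.0:
            break
        u = [x / norm for x in nxt]
    # Subtract from each sentence vector its projection onto u.
    out = []
    for v in vecs:
        p = sum(x * y for x, y in zip(v, u))
        out.append([x - p * y for x, y in zip(v, u)])
    return out
```

Removing the shared direction u suppresses the component that is common to all sentences, so the remaining vectors emphasize what distinguishes one sentence from another.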
Step 3, the final news abstract is obtained in the following manner:
Figure BDA0002422997750000051
wherein Q represents the title of the news; s represents the news abstract that has already been selected; and Sim() is used to calculate the similarity between sentences. S is a sentence in the original article; Sim(S, Q) calculates the similarity between a sentence in the original article and the title, and Sim(S, s) calculates the similarity between a sentence in the original article and the abstract obtained so far. ArgMax gives the index of the largest element in the set. The order of paragraphs in the news is taken into account: when news is written, earlier paragraphs contain more important information. λ controls the trade-off between diversity and accuracy in the MMR algorithm: when λ is larger, the abstract selected by MMR is more accurate; when λ is smaller, it is more diverse. The ultimate goal is to balance the diversity and accuracy of the abstract. λ in MMR is preset to 0.7.
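A greedy selection loop in the spirit of this step can be sketched as follows. It implements the standard MMR criterion with the news title as the query and cosine similarity as Sim(); the paragraph-order adjustment of the improved variant is not included because its exact form is not given in the text. Function and parameter names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def mmr_select(candidates, title_vec, k=2, lam=0.7):
    """Greedily pick k sentences from `candidates`.

    candidates: list of (sentence, vector) pairs, e.g. the union of the
    sentences proposed by TextRank and by Affinity Propagation.
    """
    selected = []
    remaining = list(range(len(candidates)))

    def score(i):
        # Relevance to the title minus redundancy with the chosen sentences.
        relevance = cosine(candidates[i][1], title_vec)
        redundancy = max(
            (cosine(candidates[i][1], candidates[j][1]) for j in selected),
            default=0.0,
        )
        return lam * relevance - (1 - lam) * redundancy

    while remaining and len(selected) < k:
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return [candidates[i][0] for i in selected]
```

In a small example with three candidates, the default λ = 0.7 picks the two sentences closest to the title, while λ = 0.3 replaces the second one with a less similar, more diverse sentence.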
By using ensemble learning, this extractive automatic summarization algorithm, based on ensemble learning and assisted by Chinese news headlines, improves on the robustness of existing algorithms and overcomes the inherent defects of a single mathematical model; it makes reasonable use of news headline information to improve performance, because news headlines generally capture the main content of the news with high quality; and finally, by using pre-trained word vectors, it combines semantic information with traditional algorithms to obtain better results.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (4)

1. A Chinese extraction integrated unsupervised abstract method based on auxiliary information, characterized by comprising the following steps: step 1, text preprocessing, wherein the preprocessing comprises word segmentation and stop-word removal for paragraphs; step 2, providing the news text data set to a graph-based and a clustering-based extractive automatic summarization algorithm for processing; and step 3, obtaining the final news abstract by using an improved MMR algorithm in the following manner:
Figure FDA0002422997740000011
wherein Q represents the title of the news; s represents the news abstract that has already been selected; Sim() is used to calculate the similarity between sentences; S refers to a sentence in the original article; Sim(S, Q) calculates the similarity between a sentence in the original article and the title; Sim(S, s) calculates the similarity between a sentence in the original article and the abstract obtained so far; and ArgMax gives the index set of the largest element in the set.
2. The method as claimed in claim 1, wherein the preprocessing first performs a sentence splitting operation on the text to divide it into a plurality of paragraphs, then segments the paragraphs into words using a word segmentation tool, and filters out the stop words in the paragraphs, wherein stop words are words that appear in large numbers in text and carry no actual meaning, and the filtered result is a set of paragraphs containing the main information of the text.
3. The method of claim 2, wherein the unsupervised extractive automatic summarization algorithm first represents each sentence as a vector using the preprocessed result; then calculates the similarity between sentence vectors and stores it in a matrix; next converts the similarity matrix into a graph with sentences as nodes and similarities as edges, and scores the graph using the PageRank algorithm; and finally selects the group of highest-ranked sentences to form the final abstract.
4. The method of claim 3, wherein in the clustering algorithm, r(i, j) is the degree to which sample j is suited to serve as the class representative of sample i, and a(i, j) is the appropriateness of sample i choosing sample j as its class representative; the larger the sum of r(i, j) and a(i, j), the more likely point j is to be a cluster center; and finally, the sentences corresponding to the centers obtained by the clustering algorithm are extracted as the final abstract.
CN202010211550.3A 2019-11-13 2020-03-24 Chinese extraction integrated unsupervised abstract method based on auxiliary information Pending CN111209752A (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN2019111045687 2019-11-13
CN201911104568 2019-11-13

Publications (1)

Publication Number Publication Date
CN111209752A true CN111209752A (en) 2020-05-29

Family

ID=70788950

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010211550.3A Pending CN111209752A (en) 2019-11-13 2020-03-24 Chinese extraction integrated unsupervised abstract method based on auxiliary information

Country Status (1)

Country Link
CN (1) CN111209752A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114201600A (en) * 2021-12-10 2022-03-18 北京金堤科技有限公司 Public opinion text abstract extraction method, device, equipment and computer storage medium

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107844609A (en) * 2017-12-14 2018-03-27 武汉理工大学 A kind of emergency information abstracting method and system based on style and vocabulary

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107844609A (en) * 2017-12-14 2018-03-27 武汉理工大学 A kind of emergency information abstracting method and system based on style and vocabulary

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
石元兵 等: ""一种基于TextRank的中文自动摘要方法"", 《通信技术》 *

Similar Documents

Publication Publication Date Title
CN107480143B (en) Method and system for segmenting conversation topics based on context correlation
CN108255813B (en) Text matching method based on word frequency-inverse document and CRF
CN110929038B (en) Knowledge graph-based entity linking method, device, equipment and storage medium
JP6335898B2 (en) Information classification based on product recognition
CN111191022B (en) Commodity short header generation method and device
CN108538286A (en) A kind of method and computer of speech recognition
CN108027814B (en) Stop word recognition method and device
CN109002473A (en) A kind of sentiment analysis method based on term vector and part of speech
CN102279890A (en) Sentiment word extracting and collecting method based on micro blog
CN110196910B (en) Corpus classification method and apparatus
CN112100365A (en) Two-stage text summarization method
CN110287314A (en) Long text credibility evaluation method and system based on Unsupervised clustering
CN111475608B (en) Mashup service characteristic representation method based on functional semantic correlation calculation
CN112860896A (en) Corpus generalization method and man-machine conversation emotion analysis method for industrial field
CN113392305A (en) Keyword extraction method and device, electronic equipment and computer storage medium
CN112434533A (en) Entity disambiguation method, apparatus, electronic device, and computer-readable storage medium
CN116628173B (en) Intelligent customer service information generation system and method based on keyword extraction
CN113486143A (en) User portrait generation method based on multi-level text representation and model fusion
CN112765977A (en) Word segmentation method and device based on cross-language data enhancement
CN117216214A (en) Question and answer extraction generation method, device, equipment and medium
CN111209752A (en) Chinese extraction integrated unsupervised abstract method based on auxiliary information
CN108427769B (en) Character interest tag extraction method based on social network
CN110705285A (en) Government affair text subject word bank construction method, device, server and readable storage medium
CN110874408A (en) Model training method, text recognition device and computing equipment
CN111949781B (en) Intelligent interaction method and device based on natural sentence syntactic analysis

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20200529)