CN107688652B - Evolution type abstract generation method facing internet news events - Google Patents

Evolution type abstract generation method facing internet news events

Info

Publication number
CN107688652B
CN107688652B (application CN201710775894.5A)
Authority
CN
China
Prior art keywords
document
topic
documents
abstract
scores
Prior art date
Legal status
Active
Application number
CN201710775894.5A
Other languages
Chinese (zh)
Other versions
CN107688652A (en)
Inventor
吴仁守
王红玲
Current Assignee
Suzhou University
Original Assignee
Suzhou University
Priority date
Filing date
Publication date
Application filed by Suzhou University
Priority to CN201710775894.5A
Publication of CN107688652A
Application granted
Publication of CN107688652B
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
    • G06F16/986 Document structures and storage, e.g. HTML extensions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/258 Heading extraction; Automatic titling; Numbering

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to an evolutionary abstract generation method for internet news events, which comprises the following steps: inputting a set of related news documents; representing each document as a topic feature vector through an LDA topic model, the dimension of the topic feature vector being a first preset value; clustering the documents represented as topic feature vectors, where each cluster represents a topic; calculating the local score of each document within each topic; calculating the global score of each document within each topic; calculating the final score of each document within each topic; extracting the highest-scoring document titles from each topic and ordering them by time to form the abstract; and outputting the abstract. The evolutionary abstract generation method for internet news events ensures that the extracted abstract evolves dynamically, is coherent from beginning to end, and is highly readable. Experimental results show that, compared with traditional multi-document summarization systems, the system is greatly improved in redundancy, coherence, dynamic evolution and other respects.

Description

Evolution type abstract generation method facing internet news events
Technical Field
The invention relates to a summary generation method, and in particular to an evolutionary summary generation method for internet news events.
Background
Because the evolutionary abstract for internet news events is a novel form of multi-document automatic summarization, the research status of multi-document automatic summarization and of the two characteristics specific to evolutionary abstracts, redundancy control and topic evolution, is introduced below.
Research status of multi-document automatic summarization
Multi-document Summarization (MDS) is a natural language processing technique (Radev et al. 2002) that compresses the main information of multiple texts on the same topic into a single textual summary. Multi-document automatic summarization generally comprises three steps: text analysis, text content selection and summary generation. According to how the summary content is selected, the methods can be divided into two types: a summary formed by extraction is called an extractive (Extraction) summary, and a summary formed by understanding is called an abstractive (Abstraction) summary.
Extractive summarization extracts ready-made sentences from the text, processes them little or not at all, and reorders them to form the summary. A limitation of this approach is that its performance depends heavily on the quality of the sentences in the source documents. In addition, for multi-document automatic summarization, since the sentences come from different documents, the order and organization of the sentences greatly influence the readability and coherence of the summary. Its advantages are that the generated summary helps people browse and make judgments (Hirao et al. 2002) and that the grammaticality of the summary sentences is guaranteed; it is currently the main direction of theoretical research, with representative work including: Bhandari et al. (2008), Wong et al. (2008), Hachey (2009), Celikyilmaz & Hakkani-Tur (2010), Lin et al. (2012), Almeida & Martins (2013), Chen et al. (2013).
Abstractive summarization (Barzilay 2005) generally extracts important language units reflecting the subject content, such as words, phrases and sentences, from the documents, and then generates the summary using language generation technologies such as information fusion and compression; the summary sentences are not limited to sentences in the source documents. Its advantages are that the result is not constrained by the source sentences, redundancy can be handled better, and topical coherence is emphasized. However, the abstractive approach places high demands on language generation technology; lacking reliable theoretical support and techniques, the summaries it produces are difficult to put into practical use and remain at an experimental stage.
As one kind of multi-document automatic summarization, evolutionary summarization attaches a time stamp to each document and then constructs the summary in time order. An evolutionary abstract for internet news events is extracted, in chronological order, from the report documents of an internet news event and presents the whole course of the event to the user.
As a novel multi-document automatic summarization technique, time-stamped evolutionary summarization has received relatively little research. Judging from published papers, the time-dependent summarization technique was first proposed by Allan et al. (2001), who extracted key noun phrases and named entities. Chieu et al. (2004) created a similar system using sentences as units. These methods, however, do not take into account the evolutionary nature specific to news events. More recently, Yan et al. (2011) used a graph-based approach, first mapping sentences onto the same plane as a function of time and then creating a generalized summary. On this basis, they mapped time-stamped evolutionary summarization to an optimization problem considering relevance, coverage, coherence and diversity (Yan et al. 2011b). Li & Li (2013) implemented time-stamped evolutionary summarization by building an evolutionary hierarchical Dirichlet process topic model (EHDP).
The current domestic and international research status of multi-document automatic summarization is introduced below from two aspects: summary extraction methods and formal text representation methods.
Abstract extraction method
The main methods for extracting summaries include: methods based on linguistic analysis, methods based on statistics, methods based on clustering, graph-based methods, and the like.
Method based on language analysis
Methods based on linguistic analysis use natural language analysis to identify key paragraphs, relationships between words, and discourse relations. Determining the key paragraphs relies mainly on analyzing lexical relatedness and discourse structure, for example the degree of lexical cohesion between candidate paragraphs and the rest of the article (Barzilay & Elhadad 1999; Radev et al. 2000). This generally requires the system to compute the discourse structure reliably. Another approach uses discourse analysis techniques to model the global structure of the document and to mine its intrinsic information, such as document format and rhetorical structure (Zhu 2002; Teufel & Moens 2002; Taboada 2006). Among these, Zhu (2002) uses CST (Cross-document Structure Theory) to analyze the rhetorical structure formed by a related document set and generates a domain-specific multi-document summary based on that structure. Wan et al. (2010) use machine translation to produce Chinese automatic summaries from English automatic summaries. Although methods based on linguistic analysis contribute greatly to improving summarization performance, current linguistic analysis technology is not mature enough, so their effectiveness is limited to a certain extent.
Statistical-based method
Early statistics-based methods computed a score for each sentence mainly from features such as the position of the sentence in the text, the word frequency of words and phrases, and key phrases. More sophisticated techniques have since been used to determine which sentences to extract, typically relying on machine learning to identify important features. Automatic summarization with machine learning began with Kupiec et al. (1995), who trained a Bayesian classifier on features extracted from a corpus of scientific papers and their summaries. Models ranging from Bayesian classifiers, naive Bayes models and HMMs (hidden Markov models) to the more recently developed conditional random fields (CRFs) have been used in automatic summarization. Machine learning has also been applied to learning individual features; for example, Lin & Hovy (1997) used machine learning to decide how sentence position affects sentence selection, and Witbrock & Mittal (1999) used statistical methods to select important words together with their syntactic contexts.
Clustering-based method
Clustering-based methods mainly exploit information across the multi-document set: the set is treated as a whole, similarity is measured between all sentence pairs, various clustering algorithms (K-Means, K-Medoids, AP, etc.) are then used to identify topics of shared information, and a central sentence is extracted from each cluster as part of the document summary, e.g., McKeown et al. (1999), Radev et al. (2000), Wan & Yang (2006), Qazvinian & Radev (2008). Some researchers have also proposed the concept of sub-events (Boros et al. 2001; Daniel et al. 2003; Fung & Ngai 2003), treating the logically meaningful sub-collections formed by clustering within the multi-document set as sub-events, which are then extracted to generate the main content of the summary.
Graph-based method
Graph-based methods first build a graph from the sentences in the documents and the similarities between them, and then score the sentences with graph-based ranking algorithms such as HITS (Kleinberg, 1998) and PageRank (Brin & Page, 1998). The essence of these algorithms is to find the principal eigenvectors of the matrix representing the graph, which is equivalent to partitioning the graph by topic. For example, Erkan & Radev (2004) proposed the LexPageRank algorithm, similar to PageRank, to evaluate the nodes (i.e., sentences) of the graph; by scoring the nodes with LexPageRank, the important sentences that make up the summary are found. A similar summarization method was proposed by Mihalcea (2005).
Text formalization representation method
Formal representation of text is fundamental work in natural language processing tasks. One commonly used representation is the bag of words, which represents the text mainly through words, word n-grams, or word weights (such as tf-idf). Although this method is quite common in multi-document summarization research, it has a significant weakness: it considers only the lexical level and neglects sentence-level and discourse-level structure. Because the nature of automatic summarization is discourse-level processing, this representation has inherent limitations. Another representation is the vector space model, in which text content is represented as a vector in a vector space and its processing is reduced to vector operations; semantic similarity, for example, can then be expressed as similarity in the vector space. Vector space representations can take many forms, but once converted to matrix operations the computation becomes large and complex.
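As a small illustration of the bag-of-words and vector space representations discussed above, the sketch below builds tf-idf vectors for a few toy documents and compares them by cosine similarity; the toy documents and the use of scikit-learn are assumptions for illustration only.

```python
# A minimal sketch of the tf-idf vector space representation discussed above.
# The toy documents and the scikit-learn calls are illustrative assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "bird flu case reported in the eastern province",
    "health officials confirm a new bird flu infection",
    "stock market closes higher after central bank decision",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)   # one row per document, one column per term

# Semantic relatedness is approximated by cosine similarity in the vector space;
# the two bird-flu documents score higher with each other than with the third.
print(cosine_similarity(tfidf).round(2))
```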
In recent years, with the development of topic models, the use of topic models to represent texts has been explored; related studies include Arora et al. (2008), Bhandari et al. (2008) and Haghighi & Vanderwende (2009). For example, Arora et al. (2008) model documents using LDA (latent Dirichlet allocation), map each sentence to a topic, represent the topics as a weight matrix over words, and use SVD (singular value decomposition) to obtain an orthogonal representation of the sentence set as the basis for selecting sentences, thereby reducing redundancy in the summary. The main problem with such methods is that the topic model itself has limitations, so the topic analysis results are limited and the automatic summary is affected; for example, the LDA model performs topic analysis only at the lexical level and does not consider the sentence level, the text level or the hierarchical relations among them, so the resulting summary still suffers from topic dispersion among sentences.
Research status of redundancy control
Redundancy control runs through the entire process of multi-document automatic summarization. In the text analysis stage, analyzing the topics and structures in the text is, in essence, the basis for identifying redundant information; in the content selection stage, identifying important information in the text essentially means discarding redundant information; in the summary generation stage, redundant information in the summary can be further controlled by various dedicated redundancy control strategies.
Current dedicated redundancy control strategies can be classified into three types: maximal marginal relevance (MMR), methods using inter-document information such as CSIS (Cross-Sentence Informational Subsumption), and statistical methods.
The MMR method (Carbonell and Goldstein 1998) measures the similarity between candidate excerpts and those already selected, and selects a candidate only when it contains enough new information, based primarily on a weighted combination of the relevance of the sentence to the document and its redundancy with respect to the already selected sentences. The CSIS method (Radev et al. 2004) decides whether to select a sentence as a summary sentence according to whether it is subsumed by another sentence already in the summary. The main differences between MMR and CSIS are: CSIS decides whether to select a sentence according to whether one of two sentences subsumes the other, whereas MMR considers the redundancy of the current sentence with respect to the user query and the current summary. Statistical methods (Larkey 2003) use statistical features to judge whether the content of a sentence is already present in the summary, thereby achieving redundancy control.
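As a small illustration of the MMR strategy described above, the sketch below greedily selects sentences by trading relevance against redundancy; the interface (precomputed relevance and similarity scores) and the value lambda = 0.7 are assumptions for illustration, not details fixed by the cited work.

```python
# A minimal sketch of Maximal Marginal Relevance (MMR) selection as described above.
# `relevance[c]` and `similarity[c][s]` are assumed to be precomputed scores
# (e.g., cosine similarities); the lambda value is an illustrative choice.
def mmr_select(candidates, relevance, similarity, k, lam=0.7):
    """Greedily pick k candidates, balancing relevance against redundancy
    with the sentences that have already been selected."""
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < k:
        def mmr_score(c):
            redundancy = max((similarity[c][s] for s in selected), default=0.0)
            return lam * relevance[c] - (1.0 - lam) * redundancy
        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected
```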
For the evaluation of extractive summaries, besides the common evaluation metrics, the following question must also be considered: which attributes of the document should be preserved in the summary. Katz (1996) found that the number of occurrences of a content word in a document is independent of the document length, while the number of distinct content words in the text increases with the document length. This finding suggests three important attributes a text summary should have: representativeness, informativeness, and diversity. If the effective information in the text is taken to be its content words, representativeness means that the important content words of the text should be contained in the summary; informativeness means that as many content words as possible should be included in the summary; diversity means that the summary should contain as many distinct content words as possible.
Seen from the research trends in redundancy control, the MMR method in fact considers the informativeness and diversity of the summary; CSIS controls redundancy from the viewpoint of diversity; statistical methods, similar to CSIS, also consider redundancy control from the viewpoint of diversity; common clustering methods likewise start from diversity. In addition, ranking methods without extra redundancy control start from representativeness, and the traditional approach of scoring sentences by features mainly considers informativeness. As a representative example, Haghighi et al. (2009) decide the choice of summary sentences by judging the topic similarity between the candidate summary and the source documents, i.e., the more similar the topics a sentence describes, the more important it is and the more it should be selected for the summary; this mainly takes representativeness into account.
Research status of topic evolution
An important attribute of evolutionary summarization for internet news events is dynamic evolution. As time progresses, the content of a news topic changes. How to organize large-scale documents effectively and capture the evolution of topics in the text collection in chronological order, so as to better reflect the development of events, is an important research question for evolutionary summarization. This is similar to the topic evolution research in the TDT task, which is measured by the dynamics, development and differences exhibited by the same topic over time.
At present there are many methods and results in topic evolution research; commonly used methods include those based on probabilistic topic models (such as LDA-based methods) and clustering methods (Yu et al. 2006; Yookyung et al. 2011). LDA-based methods are representative, and there are generally three evolutionary strategies: the first incorporates time as an observable variable into the topic model, as in the TOT model (Wang & McCallum 2006); the second generates topics over the document set with a topic model and then examines the distribution of topics over discrete time according to the time information of the texts to measure evolution, as in Griffiths et al. (2004); the third first divides the document set into time windows at a certain time granularity and applies a topic model within each window to obtain the evolution, as in Wang et al. (2008). However, because the number of topics is assumed to be fixed, LDA-based methods cannot detect the emergence of new topics.
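As a small illustration of the third strategy above (discretizing the corpus into time windows and applying a topic model within each window), the sketch below fits one LDA model per window; the window size, topic count and scikit-learn calls are assumptions for illustration, and word segmentation of Chinese text is assumed to have been done beforehand.

```python
# A minimal sketch of the window-based topic evolution strategy described above.
# Window size, topic count and the scikit-learn LDA are illustrative assumptions.
from collections import defaultdict
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

def topics_per_window(docs_with_dates, window_days=7, n_topics=10):
    """docs_with_dates: list of (date, text) pairs.
    Returns {window_index: fitted LDA model}, one model per time window."""
    windows = defaultdict(list)
    start = min(date for date, _ in docs_with_dates)
    for date, text in docs_with_dates:
        windows[(date - start).days // window_days].append(text)
    models = {}
    for w, texts in sorted(windows.items()):
        counts = CountVectorizer().fit_transform(texts)
        models[w] = LatentDirichletAllocation(n_components=n_topics).fit(counts)
    return models
```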
Prior art documents:
[1]. Arora, R. and B. Ravindran. 2008. Latent Dirichlet allocation based multi-document summarization. AND 2008:91-97.
[2]. Almeida, M. and A. Martins. 2013. Fast and Robust Compressive Summarization with Dual Decomposition and Multi-Task Learning. ACL 2013:196-206.
[3]. Barzilay, R. and McKeown, K.R. 2005. Sentence fusion for multi-document news summarization. Computational Linguistics, 31(3):297-328.
[4]. Barzilay, R., Elhadad, N., and McKeown, K. 2002. Inferring Strategies for Sentence Ordering in Multidocument News Summarization. Journal of Artificial Intelligence Research, 17:35-55.
[5]. Barzilay, R. and Lapata, M. 2005. Modeling Local Coherence: An Entity-based Approach. ACL 2005:141-148.
[6]. Barzilay, R. and Lapata, M. 2008. Modeling Local Coherence: An Entity-Based Approach. Computational Linguistics, 34:1-34.
[7]. Bhandari, H., M. Shimbo, T. Ito and Y. Matsumoto. 2008. Generic text summarization using probabilistic latent semantic indexing. IJCNLP 2008:133-140.
[8]. Blei, D.M., A.Y. Ng and M.I. Jordan. 2003. Latent Dirichlet Allocation. Journal of Machine Learning Research, 3:993-1022.
[9]. Boros, E., P.B. Kantor, and D.J. Neu. 2001. A Clustering Based Approach to Creating Multi-Document Summaries. SIGIR 2001.
[10]. Bollegala, D., N. Okazaki, and M. Ishizuka. 2006. A Bottom-Up Approach to Sentence Ordering for Multi-Document Summarization. ACL 2006:385-392.
[11]. Carbonell, J. and Goldstein, J. 1998. The use of MMR, diversity-based reranking for reordering documents and producing summaries. SIGIR 1998:335-336.
[12]. Celikyilmaz, A. and Hakkani-Tur, D. 2010. A Hybrid Hierarchical Model for Multi-Document Summarization. ACL 2010:815-824.
[13]. Li, C., X. Qian and Y. Liu. 2013. Using Supervised Bigram-based ILP for Extractive Summarization. ACL 2013:1233-1242.
[14]. Daniel, N., D. Radev, and T. Allison. 2003. Sub-event based multi-document summarization. NAACL 2003:9-16.
[15]. Daume III, H. and Marcu, D. 2006. Bayesian query-focused summarization. ACL 2006:305-312.
[16]. Deerwester, S., S. Dumais, T. Landauer, G. Furnas, and R. Harshman. 1990. Indexing by latent semantic analysis. Journal of the American Society of Information Science, 41(6):391-407.
[17]. Erkan, G. and Radev, D.R. 2004. LexPageRank: Prestige in Multi-Document Text Summarization. EMNLP 2004.
[18]. Fung, P. and G. Ngai. 2003. Combining Optimal Clustering and Hidden Markov Models for Extractive Summarization. ACL 2003:21-28.
[19]. Gong, Y.H. and X. Liu. 2001. Generic text summarization using relevance measure and latent semantic analysis. SIGIR 2001:19-25.
[20]. Haghighi, A. and Vanderwende, L. 2009. Exploring Content Models for Multi-Document Summarization. NAACL 2009:362-370.
[21]. Hirao, T., Isozaki, H., Maeda, E., et al. 2002. Extracting Important Sentences with Support Vector Machines. COLING 2002:1-7.
[22]. Kleinberg, J. 1999. Authoritative sources in a hyperlinked environment. Journal of the ACM (JACM), 46(5):604-632.
[23]. Kong, F., G.D. Zhou, Q.M. Zhu. 2009. Employing the Centering Theory in Pronoun Resolution from the Semantic Perspective. Proceedings of EMNLP 2009:986-996.
[24]. Li, J.W. and S. Li. 2013. Evolutionary Hierarchical Dirichlet Process for Timeline Summarization. ACL 2013:556-560.
[25]. Lin, C.Y. 1999. Training a selection function for extraction. CIKM 1999:55-62.
[26]. Lin, Z.H., Hwee Tou Ng, and Min-Yen Kan. 2011. Automatically Evaluating Text Coherence Using Discourse Relations. ACL 2011:997-1006.
[27]. Louis, Annie and Ani Nenkova. 2012. A coherence model based on syntactic patterns. EMNLP 2012:1157-1168.
[28]. Mani, I. and E. Bloedorn. 1997. Multi-document summarization by graph search and matching. AAAI 1997:622-628.
[29]. Marcu, Daniel. 1997. The Rhetorical Parsing, Summarization, and Generation of Natural Language Texts. Ph.D. thesis, University of Toronto, Toronto.
[30]. Mihalcea, R. 2005. Language Independent Extractive Summarization. AAAI 2005:49-52.
[31]. Qazvinian, V. and Radev, D.R. 2008. Scientific paper summarization using citation summary networks. COLING 2008:689-696.
[32]. Radev, D.R., E. Hovy and K. McKeown. 2002. Introduction to the Special Issue on Summarization. Computational Linguistics, 28(4):399-408.
[33]. Radev, D.R., H. Jing and M. Budzikowska. 2000. Centroid-based summarization of multiple documents: sentence extraction, utility-based evaluation, and user studies. NAACL 2000:21-29.
[34]. Radev, D.R. 2000. A common theory of information fusion from multiple text sources step one: Cross-document structure. ACL 2000:74-83.
[35]. Radev, D., Jing, H., Stys, M., and Tam, D. 2004. Centroid-based summarization of multiple documents. Information Processing & Management, 40:919-938.
[36]. Wan, X. and Yang, J. 2006. Improved affinity graph based multi-document summarization. Proceedings of HLT-NAACL 2006:181-184.
[37]. Wan, X., H. Li, and J. Xiao. 2010. Cross-language document summarization based on machine translation quality prediction. Proceedings of ACL 2010:917-926.
[38]. Wang Hongling and Zhou Guodong. 2010. Topic-driven Multi-document Summarization. Proceedings of IALP 2010:195-198.
[39]. Wang Hongling, Zhu Qiaoming and Zhou Guodong. 2012. Towards a Unified Framework for Standard and Update Multi-document Summarization. ACM Transactions on Asian Language Information Processing, 11(2): Article 5.
[40]. Wong, K.F., M. Wu, and W. Li. 2008. Extractive summarization using supervised and semi-supervised learning. Proceedings of COLING 2008:985-992.
[41]. Yeh, J.Y., H.R. Ke, W.P. Yang, and I.H. Meng. 2005. Text summarization using trainable summarizer and latent semantic analysis. Information Processing & Management, 41(1):75-95.
[42]. Yan, R., L. Kong, C. Huang, X.J. Wan, X.M. Li, Y. Zhang. 2011. Timeline Generation through Evolutionary Trans-Temporal Summarization. EMNLP 2011:433-443.
[43]. Yan, R., X.J. Wan, J. Otterbacher. 2011b. Evolutionary Timeline Summarization: a Balanced Optimization Framework via Iterative Substitution. SIGIR 2011:745-754.
[44]. Yookyung Jo, J.E. Hopcroft, C. Lagoze. 2011. The web of topics: discovering the topology of topic evolution in a corpus. WWW 2011:257-266.
[45]. Zhang, Z., Blair-Goldensohn, S., and Radev, D.R. 2002. Towards CST-Enhanced Summarization. AAAI/IAAI 2002:439-445.
[46]. Yu Manquan, Luo Weihua, Xu Hongbo, et al. 2006. Research on topic detection technology in topic detection and tracking. Journal of Computer Research and Development, 43(3):489-.
[47]. Wang Hongling, Zhou Guodong, Zhu Qiaoming. 2012. Redundancy-control-oriented Chinese multi-document automatic summarization. Journal of Chinese Information Processing, 26(2):92-96.
The traditional technology has the following technical problems:
Statistics-based methods are simple and easy to implement, but in essence they only consider lexical features within sentences and ignore deeper linguistic features, so summarization performance hits a bottleneck and is difficult to improve further. At the same time, these methods do not consider the structural relations between sentences or the topic distribution of the documents, and they ignore cross-document information within the multi-document set, so the generated summaries are highly redundant and their topic distribution is unbalanced; an effective redundancy control mechanism is therefore needed.
Clustering-based methods have theoretical advantages such as low redundancy and high information coverage, and are currently popular. However, they have a problem when the required summary is short: for example, the TAC 2008-2009 summarization task requires a summary of no more than 100 words for each multi-document set. Because the summary length is limited, not all sentences selected from the topic clusters can be included, so the generated summary can hardly cover the topics of all documents and its representativeness is weak. In addition, such methods cannot effectively identify the structural relations between topics.
The main problem of graph-based methods is that when a document set contains multiple topics, the graph-based ranking algorithm can only find the most central topic and neglects the importance of the others, so the generated summary cannot cover all the topics of the documents.
With the continuing development of the internet and the growing awareness of its users, the internet has become a main channel through which people obtain and publish information, and reports of hot news events on the internet are increasing day by day. Typically, a hot news event lasts for a long period of time; for example, "H7N9 avian flu" was a hot event that lasted for several months. To help users quickly and comprehensively understand the development of an event, automatic summarization becomes an effective means. Traditional multi-document summarization merges all documents about a hot event into one static document set and generates a summary from it; as a result, it is difficult for the user to learn how the event evolved.
Disclosure of Invention
Therefore, it is necessary to provide an evolutionary abstract generation method for internet news events that, while reducing redundancy in the abstract, ensures that the extracted abstract evolves dynamically, is coherent, and is highly readable.
An evolutionary abstract generation method for internet news events comprises the following steps: inputting a set of related news documents; representing each document as a topic feature vector through an LDA topic model, the dimension of the topic feature vector being a first preset value; clustering the documents represented as topic feature vectors, where each cluster represents a topic; calculating the local score of each document within each topic; calculating the global score of each document within each topic; calculating the final score of each document within each topic; extracting the highest-scoring document titles from each topic and ordering them by time to form the abstract; and outputting the abstract.
This evolutionary abstract generation method for internet news events ensures that the extracted abstract evolves dynamically, is coherent from beginning to end, and is highly readable. Experimental results show that, compared with traditional multi-document summarization systems, the system is greatly improved in redundancy, coherence, dynamic evolution and other respects.
In another embodiment, the documents represented as topic feature vectors are clustered with a K-means clustering algorithm, where the number of clusters is a second preset value.
In another embodiment, the second preset value is 7.
In another embodiment, the clustering of the documents represented as topic feature vectors uses the affinity propagation clustering algorithm.
In another embodiment, in the step of calculating the local score of each document within each topic, only the temporal relations and similarities among the documents are considered, and the scores are obtained with a greedy vertex selection algorithm.
In another embodiment, in the step of calculating the global score of each document within each topic, the other topics are mapped onto the current topic, and the scores are obtained with a greedy vertex selection algorithm that considers the temporal relations and similarities among the documents.
In another embodiment, the greedy vertex selection algorithm is the maximal marginal relevance (MMR) algorithm.
In another embodiment, the greedy vertex selection algorithm is the DivRank algorithm.
A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the method when executing the program.
A computer-readable storage medium, on which a computer program is stored, characterized in that the program realizes the steps of the above-mentioned method when executed by a processor.
Drawings
Fig. 1 is a flowchart of an evolutionary abstract generation method for internet news events according to an embodiment of the present application.
Fig. 2 is a first schematic diagram of the input related news document set in the evolutionary abstract generation method for internet news events according to an embodiment of the present application.
Fig. 3 is a second schematic diagram of the input related news document set in the evolutionary abstract generation method for internet news events according to an embodiment of the present application.
Fig. 4 is a third schematic diagram of the input related news document set in the evolutionary abstract generation method for internet news events according to an embodiment of the present application.
Fig. 5 is a schematic diagram of the PageRank algorithm and the DivRank algorithm in the evolutionary abstract generation method for internet news events according to an embodiment of the present application.
Fig. 6 is a schematic diagram of the calculation of the global and local scores of the documents in each topic in the evolutionary abstract generation method for internet news events according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Referring to fig. 1, an evolutionary abstract generation method for internet news events includes:
and S110, inputting a related news document set.
The related documents of the news event for which an evolutionary abstract is to be generated are stored, by publication time, in folders named after the corresponding dates, one document per file, as shown in fig. 2; the files in the folder for one date are shown in fig. 3. The first line of each file is the news headline, followed by the body text, as shown in fig. 4.
S120, representing each document as a topic feature vector through the LDA topic model, where the dimension of the topic feature vector is a first preset value.
Through an LDA (latent Dirichlet allocation) model with a first preset number of topics, a document can be compressed into a vector of the first preset dimension: each dimension is the probability that the document belongs to the corresponding topic, and this vector is also called the topic distribution. As can be seen from the document clustering process, similarity calculation is a very important step in document clustering and directly affects the quality of the clustering result. However, traditional similarity calculation models represent documents only by word frequency statistics, losing a large amount of semantic information between documents and thus degrading similarity calculation. Therefore, the LDA model is used to model the document set and obtain the topic distribution vector of each document, mining latent semantic knowledge and, to a certain extent, making up for the information loss caused by representing documents with word frequency information alone.
The method works better when the first preset value is 50.
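An illustrative sketch of step S120 follows: each document is mapped to a 50-dimensional topic distribution with an LDA model. The scikit-learn implementation, the vectorizer settings, and the assumption that Chinese text has already been word-segmented are illustrative choices, not details fixed by the disclosure.

```python
# A minimal sketch of step S120: represent each document as a topic feature vector
# whose dimension is the first preset value (50). scikit-learn is an assumed toolkit;
# Chinese documents are assumed to be word-segmented (space-separated) beforehand.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

def documents_to_topic_vectors(documents, n_topics=50):
    """documents: list of raw text strings.
    Returns an (n_docs, n_topics) array; each row is the document's
    probability distribution over the LDA topics."""
    counts = CountVectorizer(max_df=0.95, min_df=2).fit_transform(documents)
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
    return lda.fit_transform(counts)   # rows are normalized topic probabilities
```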
S130, clustering the documents represented as topic feature vectors, where each cluster represents a topic.
The clustering algorithm may be the Affinity Propagation clustering algorithm (AP for short). In another embodiment, the documents represented as topic feature vectors are clustered with the K-means clustering algorithm, where the number of clusters is a second preset value; the K-means clustering algorithm works better.
In another embodiment, the second preset value is 7; with this value and the K-means clustering algorithm the method works better.
It is to be understood that the type of clustering algorithm is not limited in this example.
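An illustrative sketch of step S130 follows, clustering the topic vectors into the second preset number of clusters (7) with K-means; the scikit-learn call and its parameters are assumptions for illustration.

```python
# A minimal sketch of step S130: cluster the topic feature vectors into 7 topics.
# The scikit-learn K-means call and its parameters are illustrative assumptions.
from sklearn.cluster import KMeans

def cluster_documents(topic_vectors, n_clusters=7):
    """topic_vectors: (n_docs, n_topics) array from the LDA step.
    Returns one cluster label per document; each cluster represents a topic."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
    return km.fit_predict(topic_vectors)
```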
S140, calculating the local score of each document within each topic.
S150, calculating the global score of each document within each topic.
Referring to fig. 6, each topic is divided into a global part and a local part. In the global part, the other topics are mapped onto the current topic, and the score of each document is obtained with a greedy vertex selection algorithm that considers the temporal relations and similarities among the documents; in the local part, only the temporal relations and similarities among the documents within the topic are considered, and the scores are again obtained with a greedy vertex selection algorithm. Finally, the global and local scores are combined to obtain the final score of each document.
In another embodiment, the greedy vertex selection algorithm is the maximal marginal relevance (MMR) algorithm.
In another embodiment, the greedy vertex selection algorithm is the DivRank algorithm (see Mei Q., Guo J., Radev D. DivRank: the interplay of prestige and diversity in information networks. ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2010:1009-1018). The DivRank algorithm works better.
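For concreteness, a sketch of a point-wise DivRank iteration over a document graph is given below. The disclosure names DivRank but does not fix its parameterization, so the similarity matrix W (for example, cosine similarity of topic vectors, optionally weighted by temporal closeness), the damping and reinforcement parameters, and the point-wise approximation of visit counts are all assumptions for illustration.

```python
# A minimal sketch of point-wise DivRank (Mei et al. 2010) over a document graph.
# W is an assumed symmetric, non-negative document-similarity matrix; alpha, lam
# and the uniform prior are illustrative parameter choices.
import numpy as np

def divrank(W, alpha=0.25, lam=0.9, prior=None, n_iter=100, tol=1e-8):
    """Reinforced random walk: frequently visited vertices attract more probability
    mass, so a few central documents absorb their neighbours' weight and redundancy
    drops. Returns one score per document (vertex)."""
    n = W.shape[0]
    if prior is None:
        prior = np.full(n, 1.0 / n)
    row_sums = W.sum(axis=1, keepdims=True)
    row_sums[row_sums == 0] = 1.0
    # Organic transition matrix with a self-loop, as in the DivRank formulation.
    p0 = alpha * (W / row_sums) + (1.0 - alpha) * np.eye(n)
    pi = np.full(n, 1.0 / n)                  # current visit distribution
    for _ in range(n_iter):
        reinforced = p0 * pi                   # scale column j by the current score of j
        reinforced /= reinforced.sum(axis=1, keepdims=True)
        new_pi = lam * (pi @ reinforced) + (1.0 - lam) * prior
        if np.abs(new_pi - pi).sum() < tol:
            return new_pi
        pi = new_pi
    return pi
```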
It is understood that, in another embodiment, S150 may be performed before S140. That is, this embodiment does not limit the order in which the local scores and the global scores of the documents within each topic are calculated.
S160, calculating the final score of each document within each topic.
The final score of each document within each topic is calculated by combining its global score and local score.
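A small sketch of step S160 follows. The disclosure states that the global and local scores are combined but does not specify how, so the linear weighting used here is an assumption for illustration.

```python
# A minimal sketch of step S160. The linear weight `beta` is an illustrative
# assumption; the disclosure only states that global and local scores are combined.
def final_scores(global_scores, local_scores, beta=0.5):
    """global_scores, local_scores: dicts mapping document id -> score within a topic."""
    return {d: beta * global_scores[d] + (1.0 - beta) * local_scores.get(d, 0.0)
            for d in global_scores}
```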
S170, extracting the highest-scoring document titles from each topic and ordering them by time to form the abstract.
For each topic, the number of document titles to extract is determined by the ratio of the number of documents in the topic to the total number of documents, together with the number of sentences of the abstract. For example, if there are 2000 documents in total, 200 documents under the topic, and the abstract contains 100 sentences, then the top 100 × 200 / 2000 = 10 highest-scoring document titles are extracted under that topic. After titles have been extracted for all topics, they are ordered by time to form the abstract. Here the score of a document is its final score.
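A sketch of step S170 follows, reproducing the worked example above (100 × 200 / 2000 = 10 titles for that topic); the document fields used (final score, publication date, title) are illustrative assumptions about how the data is held.

```python
# A minimal sketch of step S170: extract a quota of top-scoring titles per topic
# proportional to the topic's share of documents, then order the titles by time.
# The (score, date, title) tuples are an assumed data layout for illustration.
def extract_timeline(topics, total_docs, summary_len):
    """topics: list of topics, each a list of (final_score, publish_date, title).
    Returns the selected titles ordered by publication time."""
    selected = []
    for docs in topics:
        quota = round(summary_len * len(docs) / total_docs)   # e.g. 100 * 200 / 2000 = 10
        top = sorted(docs, key=lambda d: d[0], reverse=True)[:quota]
        selected.extend(top)
    selected.sort(key=lambda d: d[1])        # order the abstract chronologically
    return [title for _, _, title in selected]
```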
S180, outputting the abstract.
This evolutionary abstract generation method for internet news events ensures that the extracted abstract evolves dynamically, is coherent from beginning to end, and is highly readable. Experimental results show that, compared with traditional multi-document summarization systems, the system is greatly improved in redundancy, coherence, dynamic evolution and other respects.
The inventive concept of the present invention is briefly introduced:
Judging from the current state of research, evolutionary summarization for internet news events is an emerging task that has appeared in the natural language processing field in recent years, and related research is still in its infancy. Although it is essentially still multi-document automatic summarization, internet documents bring characteristics such as high information redundancy and dynamic evolution, and the coherence of the abstract must additionally be considered when generating an evolutionary abstract.
Therefore, on the basis of multi-document automatic summarization, this project considers the problems of redundancy control, dynamic evolution and coherence in time-stamped evolutionary summarization, and implements a time-stamped evolutionary automatic summarization system that mainly solves the key problems of evolutionary summarization and guarantees the generality, relevance and coherence of the abstract.
In terms of document representation: through an LDA (latent Dirichlet allocation) model with a first preset number of topics, a document can be compressed into a vector of the first preset dimension: each dimension is the probability that the document belongs to the corresponding topic, and this vector is also called the topic distribution. As can be seen from the document clustering process, similarity calculation is a very important step in document clustering and directly affects the quality of the clustering result. However, traditional similarity calculation models represent documents only by word frequency statistics, losing a large amount of semantic information between documents and thus degrading similarity calculation. Therefore, the LDA model is used to model the document set and obtain the topic distribution vector of each document, mining latent semantic knowledge and, to a certain extent, making up for the information loss caused by representing documents with word frequency information alone.
In terms of redundancy control: diversity reflects richness of information and novelty of sentences, and its aim is to reduce information redundancy. However, standard PageRank does not yield diversity: its overall effect is to confer high importance on tightly connected communities of nodes. A greedy vertex selection algorithm can achieve diversity by iteratively selecting the most important vertex and then "covering" the already selected vertices, as the maximal marginal relevance algorithm does. DivRank is also a greedy vertex selection algorithm; it lets important vertices absorb the weights of neighbouring vertices, thereby reducing redundancy. DivRank is therefore used in place of the standard PageRank algorithm to reduce information redundancy (refer to fig. 5).
In terms of coherence: traditional multi-document summarization generally focuses on extracting relevant sentences from the documents. The main drawback of this approach is that it guarantees neither good intelligibility nor high relevance of the summary. Low relevance is often caused by the nature of the document data, since it is difficult to select the right sentences from a large number of candidates; low intelligibility is often caused by inconsistency and lack of continuity between the selected sentences. The headlines of web news articles have proven to be a reliable source that provides a high-level overview of news events. A headline is easily understood by the reader without requiring much reading time. News headlines typically provide information that is central to the body text as well as timely and complete, and are therefore suitable for creating coherent evolutionary summaries. We therefore choose news headlines to generate the evolutionary abstract.
In terms of dynamic evolution: considering both topic evolution and time evolution, the documents are first clustered into a second preset number of classes, corresponding to a second preset number of topics, and each topic is divided into a global part and a local part. In the global part, the other topics are mapped onto the current topic and the score of each document is obtained with DivRank, considering the temporal relations and similarities among the documents; in the local part, only the temporal relations and similarities among the documents are considered and the score of each document is obtained with DivRank. Finally, the global and local scores are combined to obtain the final score of each document.
A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the method when executing the program.
A computer-readable storage medium, on which a computer program is stored, characterized in that the program realizes the steps of the above-mentioned method when executed by a processor.
The technical features of the embodiments described above may be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the embodiments described above are not described, but should be considered as being within the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The above-mentioned embodiments only express several embodiments of the present invention, and the description thereof is more specific and detailed, but not construed as limiting the scope of the invention. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the inventive concept, which falls within the scope of the present invention. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims (3)

1. An evolutionary abstract generation method for internet news events, characterized by comprising the following steps:
S110, inputting a set of related news documents;
storing the related documents of the news event for which the evolutionary abstract is to be generated in folders named after the corresponding dates, according to their publication time, one document per file; the first line of each file is the news headline, and the text after the first line is the body;
S120, representing each document as a topic feature vector through an LDA topic model, wherein the dimension of the topic feature vector is a first preset value;
compressing a document into a vector of the first preset dimension through an LDA (latent Dirichlet allocation) model with a first preset number of topics: each dimension is the probability that the document belongs to the corresponding topic, and the vector is also called the topic distribution; as seen from the document clustering process, similarity calculation is a very important step in document clustering and directly affects the quality of the clustering result; the LDA model is used to model the document set and obtain the topic distribution vector of each document, mining latent semantic knowledge and, to a certain extent, making up for the information loss caused by representing documents with word frequency information alone; wherein the first preset value is 50;
S130, clustering the documents represented as topic feature vectors, wherein each cluster represents a topic; the documents represented as topic feature vectors are clustered with a K-means clustering algorithm, wherein the number of clusters is a second preset value; the second preset value is 7;
S140, calculating the local score of each document within each topic;
S150, calculating the global score of each document within each topic;
in the global part, the other topics are mapped onto the current topic, and the score of each document is obtained with a greedy vertex selection algorithm that considers the temporal relations and similarities among the documents; in the local part, only the temporal relations and similarities among the documents are considered, and the scores are obtained with a greedy vertex selection algorithm; wherein the greedy vertex selection algorithm is the maximal marginal relevance algorithm or the DivRank algorithm;
S160, calculating the final score of each document within each topic;
the final score of each document within each topic is calculated by combining its global score and local score;
S170, extracting the highest-scoring document titles from each topic and ordering them by time to form the abstract;
the number of document titles extracted from each topic is determined by the ratio of the number of documents in the topic to the total number of documents and by the number of sentences of the abstract; wherein the score of a document is its final score;
and S180, outputting the abstract.
2. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the steps of the method of claim 1 are performed when the program is executed by the processor.
3. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method as claimed in claim 1.
CN201710775894.5A 2017-08-31 2017-08-31 Evolution type abstract generation method facing internet news events Active CN107688652B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710775894.5A CN107688652B (en) 2017-08-31 2017-08-31 Evolution type abstract generation method facing internet news events

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710775894.5A CN107688652B (en) 2017-08-31 2017-08-31 Evolution type abstract generation method facing internet news events

Publications (2)

Publication Number Publication Date
CN107688652A CN107688652A (en) 2018-02-13
CN107688652B true CN107688652B (en) 2020-12-29

Family

ID=61155076

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710775894.5A Active CN107688652B (en) 2017-08-31 2017-08-31 Evolution type abstract generation method facing internet news events

Country Status (1)

Country Link
CN (1) CN107688652B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110019814B (en) * 2018-07-09 2021-07-27 暨南大学 News information aggregation method based on data mining and deep learning
CN109582796A (en) * 2018-12-05 2019-04-05 深圳前海微众银行股份有限公司 Generation method, device, equipment and the storage medium of enterprise's public sentiment event network
CN111666402B (en) * 2020-04-30 2024-05-28 平安科技(深圳)有限公司 Text abstract generation method, device, computer equipment and readable storage medium
CN111461266B (en) * 2020-06-18 2020-12-01 爱保科技有限公司 Vehicle damage assessment abnormity identification method, device, server and storage medium
CN112182228B (en) * 2020-10-26 2022-06-21 城云科技(中国)有限公司 Method and device for mining and summarizing short text hot spot topics
CN112650852A (en) * 2021-01-06 2021-04-13 广东泰迪智能科技股份有限公司 Event merging method based on named entity and AP clustering
CN113268598A (en) * 2021-05-26 2021-08-17 平安科技(深圳)有限公司 Event context generation method and device, terminal equipment and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2009026433A1 (en) * 2007-08-21 2009-02-26 Cortica, Ltd. Signature generation for multimedia deep-content-classification by a large-scale matching system and method thereof
CN101446940A (en) * 2007-11-27 2009-06-03 北京大学 Method and device of automatically generating a summary for document set
CN102646114A (en) * 2012-02-17 2012-08-22 清华大学 News topic timeline abstract generating method based on breakthrough point
CN103984681A (en) * 2014-03-31 2014-08-13 同济大学 News event evolution analysis method based on time sequence distribution information and topic model
CN104182504A (en) * 2014-08-18 2014-12-03 合肥工业大学 Algorithm for dynamically tracking and summarizing news events
CN104778157A (en) * 2015-03-02 2015-07-15 华南理工大学 Multi-document abstract sentence generating method

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1710563A (en) * 2005-07-18 2005-12-21 北大方正集团有限公司 Method for detecting and abstracting importent new case
CN101739426A (en) * 2008-11-13 2010-06-16 北京大学 Method and device for generating multi-document summary
CN102495872B (en) * 2011-11-30 2013-07-24 中国科学技术大学 Method and device for conducting personalized news recommendation to mobile device users
CN102411638B (en) * 2011-12-30 2013-06-19 中国科学院自动化研究所 Method for generating multimedia summary of news search result
CN103020159A (en) * 2012-11-26 2013-04-03 百度在线网络技术(北京)有限公司 Method and device for news presentation facing events
CN106815310B (en) * 2016-12-20 2020-04-21 华南师范大学 Hierarchical clustering method and system for massive document sets


Also Published As

Publication number Publication date
CN107688652A (en) 2018-02-13

Similar Documents

Publication Publication Date Title
CN107688652B (en) Evolution type abstract generation method facing internet news events
Zhu et al. Exploiting semantic similarity for named entity disambiguation in knowledge graphs
Andhale et al. An overview of text summarization techniques
Paul et al. Summarizing contrastive viewpoints in opinionated text
Bansal et al. Hybrid attribute based sentiment classification of online reviews for consumer intelligence
Basheer et al. Efficient text summarization method for blind people using text mining techniques
Ming et al. Prototype hierarchy based clustering for the categorization and navigation of web collections
Rusu et al. Semantic graphs derived from triplets with application in document summarization
Rajagopal et al. Commonsense-based topic modeling
King et al. Mining world knowledge for analysis of search engine content
Li et al. Computational linguistics literature and citations oriented citation linkage, classification and summarization
Singh et al. Sentiment and mood analysis of weblogs using POS tagging based approach
Mahesh Hypertext summary extraction for fast document browsing
Treeratpituk et al. Graph-based approach to automatic taxonomy generation (grabtax)
Kit et al. Information retrieval and text mining
Yirdaw et al. Topic-based Amharic text summarization with probabilistic latent semantic analysis
Séaghdha Annotating and learning compound noun semantics
Ullah et al. Pattern and semantic analysis to improve unsupervised techniques for opinion target identification
Bashar et al. A framework for automatic personalised ontology learning
Amini et al. Learning-based summarisation of XML documents
Ouyang et al. Intertopic information mining for query‐based summarization
Kim et al. Improving patent search by search result diversification
Ramezani et al. Automated text summarization: An overview
Huynh et al. A method for designing domain-specific document retrieval systems using semantic indexing
Elmenshawy et al. Automatic arabic text summarization (AATS): A survey

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant