CN110929022A - Text abstract generation method and system - Google Patents

Text abstract generation method and system

Info

Publication number
CN110929022A
Authority
CN
China
Prior art keywords
key
text
sentences
semantic
key sentences
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201811088867.1A
Other languages
Chinese (zh)
Inventor
李卫
白子龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Archimedes (shanghai) Media Co Ltd
Original Assignee
Archimedes (shanghai) Media Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Archimedes (shanghai) Media Co Ltd
Priority to CN201811088867.1A
Publication of CN110929022A

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a text abstract generation method and system. The method extracts key sentences from a text using the TextRank algorithm, performs word embedding on the segmented words of the extracted key sentences with a word2vec model, and, based on the semantic relevance of the segmented words, removes the less critical sentence from each pair of semantically similar key sentences. It then extracts, one by one, the semantic units containing the remaining key sentences, de-duplicates them to obtain new key sentences, and finally assembles the text abstract in the order in which the new key sentences appear in the original text. Because semantic similarity is computed only for the key sentences extracted by TextRank, the amount of computation is small; removing semantically redundant key sentences preserves semantic diversity for a given abstract length; and extracting complete semantic units as key sentences improves the overall readability and semantic coherence of the generated abstract.

Description

Text abstract generation method and system
Technical Field
The invention discloses a text abstract generation method and system. It relates to the field of natural language processing, in particular to automatic text abstract generation, and more particularly to content summarization of internet text.
Background
At present, methods for automatically generating text abstracts from natural language fall into two main categories: extractive (extraction-type) and generative (generation-type). The core idea of the extractive approach is that the central idea of a document can be summarized by certain sentences in the document, so the abstract task becomes a problem of ranking sentences by importance; the most representative algorithms are TF-IDF and TextRank. The TF-IDF algorithm measures the importance of words by term frequency and inverse document frequency, and then treats sentences containing many keywords as more important. The TextRank algorithm is based on a graph model: sentences are the nodes of the graph and the relations between sentences are its edges, the relation between sentences is generally measured by literal similarity, and sentence importance is ranked through a voting mechanism. Unlike a neural network, TextRank does not need a specific model to be trained in advance; being simple and effective, it is widely used. The generative approach produces abstract sentences directly from a representation of the meaning of the text, similar to the way a person writes an abstract. Its representative algorithm is the seq2seq + attention model based on deep learning. Training such a model requires a large amount of labeled text as training samples and places high computational demands on the system, and the generated abstracts still suffer from grammatical errors, poor readability and similar problems. The technology is therefore still immature and its range of application is limited.
At present, the most widely applied, mainstream text abstract generation method is still the extractive approach. Its most representative algorithms, TF-IDF and TextRank, measure sentence importance based on word-frequency statistics and literal similarity respectively; both lack any semantic-level measure. As a result, under a limited abstract length the generated abstract lacks semantic diversity, and an abstract composed of isolated extracted sentences also lacks semantic integrity and coherence and reads poorly.
Disclosure of Invention
In order to overcome the defects of the existing extraction type text abstract generating method, the invention provides a text abstract generating method, which comprises the following steps:
a. preprocessing the original text from which the abstract is to be extracted;
b. extracting key sentences in the original text and their corresponding criticality based on a standard TextRank algorithm;
c. segmenting the extracted key sentences with a word segmentation tool and filtering out preset stop words, performing word embedding on the remaining word segments based on a word2vec model, and calculating the semantic relevance between the word segments of every two key sentences from the word embedding results, so as to obtain the semantic similarity between every two key sentences;
d. filtering out, according to the semantic similarity between every two key sentences, the key sentence with lower criticality among key sentences whose semantics are highly similar;
e. extracting the complete semantic units containing the remaining key sentences as new key sentences, ordering them, and generating the text abstract.
Further, the preprocessing of the original text in step a comprises removing non-text content such as space characters and line feed characters, and text-encoding the remaining text content. The filtering of key sentences in step d is specifically: a similarity threshold is preset; for every two key sentences whose semantic similarity exceeds the threshold, the key sentence with lower criticality is removed and only the one with higher criticality is kept.
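The preprocessing in step a can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `preprocess` is hypothetical, and treating every Unicode whitespace character as "non-text content" is an assumption that suits Chinese text (it would merge words in space-delimited languages).

```python
import re


def preprocess(raw_text: str, encoding: str = "utf-8") -> bytes:
    """Strip whitespace-like non-text content, then encode the remainder.

    Hypothetical helper: the patent only says that space characters, line
    feeds and similar content are removed and the rest is text-encoded.
    """
    # \s matches Unicode whitespace: spaces, tabs, line feeds, U+3000, etc.
    cleaned = re.sub(r"\s+", "", raw_text)
    return cleaned.encode(encoding)
```

For example, `preprocess("今天 天气\n很好。").decode("utf-8")` yields the text with the space and line feed removed. GB2312, GBK or ASCII could be passed as `encoding` in the same way, provided the text is representable in that encoding.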
Further, to address the lack of semantic integrity and coherence and the poor readability of abstracts produced by existing extractive methods, the abstract formed from the filtered key sentences needs to be further optimized. Step e of the method is therefore specifically implemented as follows: the original text is divided into a plurality of semantic units according to Chinese end-of-sentence identifiers; the semantic units containing the filtered key sentences are extracted and de-duplicated to serve as new key sentences; and the new key sentences are ordered by their order of appearance in the original text to form the final abstract.
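A minimal sketch of step e, under stated assumptions: the set of end-of-sentence identifiers in `END_MARKS` is a guess at the intended characters (full-width period, question mark and exclamation mark, plus the half-width question mark), and a key sentence is assumed to be a substring of exactly the semantic unit it came from. The names `split_semantic_units` and `build_abstract` are hypothetical.

```python
import re

# Assumed end-of-sentence identifiers; the source lists them only in garbled form.
END_MARKS = "。？！?"


def split_semantic_units(text):
    """Split text into units that each end with one or more end marks."""
    marks = re.escape(END_MARKS)
    # A unit is a run of non-terminator characters plus any trailing terminators.
    return re.findall(f"[^{marks}]+[{marks}]*", text)


def build_abstract(text, key_sentences):
    """Keep each semantic unit that contains a key sentence, once, in text order."""
    seen = set()
    selected = []
    for unit in split_semantic_units(text):
        if unit in seen:
            continue  # de-duplicate identical units
        if any(key in unit for key in key_sentences):
            seen.add(unit)
            selected.append(unit)
    # Units are visited in original order, so no separate sort is needed.
    return "".join(selected)
```

Iterating over the units rather than over the key sentences gives the de-duplication and the original-order ranking in a single pass: two key sentences that fall in the same unit contribute that unit only once.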
Corresponding to the method, the invention also provides a text abstract generating system, which comprises:
a text preprocessing module, used for preprocessing the original text from which the abstract is to be extracted;
a key sentence extraction module, used for extracting key sentences in the original text and their corresponding criticality with a standard TextRank algorithm;
a key sentence semantic similarity calculation module, used for segmenting the extracted key sentences with a word segmentation tool and filtering out preset stop words, then performing word embedding on the remaining word segments based on a word2vec model, calculating the semantic relevance between the word segments of every two key sentences from the word embedding results, and thereby obtaining the semantic similarity between every two key sentences;
a key sentence filtering module, used for filtering out key sentences with high semantic similarity and lower criticality according to the semantic similarity between every two key sentences and their corresponding criticality;
a text abstract completion module, used for taking the semantic units containing the filtered key sentences as new key sentences and linking them in the order in which they appear in the text to generate the final text abstract.
Further, the preprocessing performed by the text preprocessing module comprises removing non-text content such as space characters and line feed characters, and text-encoding the remaining text content. The encoding used may include Chinese and English encodings such as UTF-8, GB2312, GBK and ASCII. The key sentence filtering module filters as follows: a similarity threshold is preset; for each pair of key sentences whose semantic similarity reaches the threshold, the key sentence with lower criticality is removed and only the one with higher criticality is kept. In addition, the text abstract completion module is specifically implemented as follows: the original text is divided into a plurality of semantic units according to Chinese end-of-sentence identifiers, which at least comprise: ". ","? ","! ","? "; the semantic units containing the filtered key sentences are extracted and de-duplicated to serve as new key sentences, which are then linked in their order of appearance in the text to generate the final text abstract.
Drawings
Fig. 1 is a flowchart of a text summary generation method according to the present invention.
Detailed Description
In order to make the technical problems, technical solutions and advantages solved by the present invention more apparent, the present invention is further described in detail below with reference to the accompanying drawings. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
As shown in fig. 1 of the specification, the present invention provides a text summary generating method, which includes the following steps:
a. preprocessing the text: removing space characters, line feed characters and other non-text content from the original text, and UTF-8 encoding the remaining text content;
b. extracting key sentences in the original text and their corresponding criticality based on a standard TextRank algorithm;
c. calculating the similarity between key sentences: segmenting the extracted key sentences with the Jieba word segmentation tool, filtering out preset stop words, performing word embedding on the remaining word segments based on a word2vec model, calculating the semantic relevance between the word segments of every two key sentences from the word embedding results, and thereby obtaining the semantic similarity between every two key sentences;
d. filtering the key sentences: according to the semantic similarity between every two key sentences and their corresponding criticality, filtering out the key sentence with lower criticality from every two key sentences with similar semantics;
e. extracting the complete semantic units containing the remaining key sentences, taking them as new key sentences, and ordering them to generate the text abstract.
The TextRank algorithm used in step b is a representative algorithm for extractive text abstract generation and a mature, widely adopted key sentence extraction algorithm. TextRank is a graph-based ranking algorithm for text. Its basic idea derives from Google's PageRank algorithm: a text is divided into a number of constituent units (words or sentences), a graph model is built, and the important components of the text are ranked by a voting mechanism, so that keyword extraction and extractive summarization can be carried out using only the information in a single document. The general model of the TextRank algorithm can be represented as a directed weighted graph G = (V, E), consisting of a set of points V and a set of edges E, where E is a subset of V × V. The weight of the edge between any two points V_i and V_j in the graph is w_{ji}. For a given point V_i, In(V_i) is the set of points pointing to that point, and Out(V_i) is the set of points that V_i points to. The score of point V_i is defined as follows:
$$WS(V_i) = (1 - d) + d \sum_{V_j \in In(V_i)} \frac{w_{ji}}{\sum_{V_k \in Out(V_j)} w_{jk}} \, WS(V_j) \tag{1}$$
where d is the damping coefficient, with a value between 0 and 1, which represents the probability of pointing from a given point in the graph to any other point and is generally set to 0.85. When the TextRank algorithm is used to calculate the score of each point in the graph, an arbitrary initial value is assigned to every point and the scores are computed recursively until convergence, that is, until the error of every point in the graph is less than a given limit value, which is generally 0.0001.
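The iteration described above can be sketched in plain Python. This is an illustrative sketch only: the function name `textrank_scores`, the adjacency-matrix representation and the `max_iter` safety cap are assumptions, and "error" is interpreted here as the largest change in any score between two successive iterations.

```python
def textrank_scores(weights, d=0.85, tol=1e-4, max_iter=200):
    """Iterate the TextRank score formula until the scores stop changing.

    weights[j][i] holds w_ji, the non-negative weight of the edge from
    point j to point i; 0 means there is no edge.
    """
    n = len(weights)
    # Denominator of the formula: total outgoing weight of each point.
    out_sum = [sum(row) for row in weights]
    scores = [1.0] * n  # arbitrary initial value
    for _ in range(max_iter):
        new_scores = []
        for i in range(n):
            incoming = sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n)
                if weights[j][i] > 0
            )
            new_scores.append((1 - d) + d * incoming)
        change = max(abs(a - b) for a, b in zip(new_scores, scores))
        scores = new_scores
        if change < tol:
            break
    return scores
```

With a small graph in which point 0 is linked to points 1 and 2, `textrank_scores([[0, 1, 1], [1, 0, 0], [1, 0, 0]])` gives point 0 the highest score and equal scores to the other two, which is the voting behaviour the formula is meant to capture.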
When extracting key sentences and key degrees based on the TextRank algorithm, the method mainly comprises the following steps:
1. Preprocessing: the input text or text set T is divided into sentences, i.e. T = [S_1, S_2, S_3, …, S_m]; each sentence S_i is segmented into words and stop words are removed, giving:
$$S_i = [t_{i,1}, t_{i,2}, \ldots, t_{i,n}]$$
where n is the number of candidate keywords of the sentence and t_{i,j} ∈ S_i are the retained candidate keywords.
2. Sentence similarity calculation: the edge set E of graph G is constructed. The similarity of two sentences is based on the content overlap between them and is calculated with the following formula:
$$Similarity(S_i, S_j) = \frac{|\{w_k \mid w_k \in S_i \wedge w_k \in S_j\}|}{\log|S_i| + \log|S_j|}$$
where S_i and S_j denote the two sentences and w_k denotes a word in a sentence. The numerator is the number of words that appear in both sentences, and the denominator is the sum of the logarithms of the numbers of words in the two sentences. The denominator is designed this way to curb the advantage that long sentences would otherwise have in the similarity calculation. If the similarity between two sentences is greater than a given threshold, the two sentences are considered semantically related and are connected, with edge weight w_{ji} = Similarity(S_i, S_j).
3. Sentence weight calculation: according to formula 1, the weights are propagated iteratively to calculate the score of each sentence.
4. Abstract sentence extraction: the sentences are sorted in descending order of the scores (that is, the criticality) obtained in the previous step, and the T sentences with the highest criticality are extracted as key sentences.
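Steps 2 and 4 above can be sketched as follows. This is a hedged illustration: the function names, the default edge threshold of 0 and the handling of sentences with fewer than two words (where the logarithmic denominator is zero or undefined) are assumptions not stated in the text. Sentences are assumed to be already segmented into word lists.

```python
import math


def overlap_similarity(words_i, words_j):
    """Shared-word count divided by the sum of the log sentence lengths."""
    if not words_i or not words_j:
        return 0.0
    denom = math.log(len(words_i)) + math.log(len(words_j))
    if denom <= 0:  # both sentences have a single word: treat as unrelated
        return 0.0
    common = len(set(words_i) & set(words_j))
    return common / denom


def build_edge_weights(sentences, threshold=0.0):
    """Connect two sentences only when their similarity exceeds the threshold."""
    n = len(sentences)
    weights = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            sim = overlap_similarity(sentences[i], sentences[j])
            if sim > threshold:
                weights[i][j] = weights[j][i] = sim
    return weights


def top_key_sentences(scores, t):
    """Indices of the t highest-scoring sentences, best first."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:t]
```

The matrix returned by `build_edge_weights` is symmetric because the overlap measure is symmetric, so every edge effectively runs in both directions when it is fed into the score iteration of step 3.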
The word2vec model in step c is an NLP tool released by Google in 2013. It vectorizes all word segments (word embedding) so that the relations between words can be described quantitatively, and the semantic relevance between two word segments can be measured from the result of a matrix operation on their two word vectors. The similarity between sentences can thus be described quantitatively by a weighted combination of the semantic relevance of the words in the two sentences. The implementation principle of the word2vec model is not described in detail here.
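The text does not specify how word-level relevance is weighted into a sentence-level similarity, so the sketch below picks one plausible scheme as an assumption: cosine similarity between word vectors, with each word matched to its most similar word in the other sentence and the two directions averaged with equal weights. The `vectors` argument stands in for a trained word2vec lookup and is a plain dictionary here; all names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def sentence_similarity(words_a, words_b, vectors):
    """Symmetric best-match average of word-to-word cosine similarities.

    vectors maps a word to its embedding; words missing from it are skipped.
    """
    vecs_a = [vectors[w] for w in words_a if w in vectors]
    vecs_b = [vectors[w] for w in words_b if w in vectors]
    if not vecs_a or not vecs_b:
        return 0.0

    def directed(src, dst):
        # Each source word votes with its closest word in the other sentence.
        return sum(max(cosine(u, v) for v in dst) for u in src) / len(src)

    return 0.5 * (directed(vecs_a, vecs_b) + directed(vecs_b, vecs_a))
```

Other weightings, such as comparing the mean vectors of the two sentences or weighting words by TF-IDF, would fit the description equally well; the choice affects where a sensible similarity threshold lies.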
Further, the filtering of key sentences in step d is specifically: a similarity threshold is preset and, based on the similarity between key sentences calculated in step c, for every two key sentences whose semantic similarity exceeds the threshold the key sentence with lower criticality is removed and only the one with higher criticality is kept. In addition, step e is specifically implemented as follows: the original text is divided into a plurality of semantic units according to Chinese end-of-sentence identifiers; the semantic units containing the filtered key sentences are extracted and de-duplicated to serve as new key sentences; and the new key sentences are ordered by their order of appearance in the original text to form the final abstract. The abstract generated in this way takes semantic integrity and coherence into account to a certain extent, which improves its readability.
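The filtering rule of step d can be sketched as a direct pairwise pass. The function name `filter_key_sentences`, the threshold value 0.8 and the tie-breaking rule (on equal criticality the later sentence is dropped) are illustrative assumptions; the text only requires that the less critical sentence of each over-threshold pair be removed.

```python
def filter_key_sentences(criticality, similarity, threshold=0.8):
    """Return the indices of key sentences that survive redundancy filtering.

    criticality[i] is the TextRank score of key sentence i and
    similarity[i][j] is the semantic similarity of sentences i and j.
    """
    n = len(criticality)
    removed = set()
    for i in range(n):
        for j in range(i + 1, n):
            if similarity[i][j] > threshold:
                # Drop the less critical sentence of the similar pair.
                removed.add(i if criticality[i] < criticality[j] else j)
    return [i for i in range(n) if i not in removed]
```

This literal reading checks every pair, including pairs in which one sentence has already been removed. A greedy variant that walks the sentences in descending criticality and compares each only against sentences already kept would retain slightly more sentences in chains of similar sentences.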
Corresponding to the method, the invention also provides a text abstract generation system, which comprises a text preprocessing module, a key sentence extraction module, a key sentence semantic similarity calculation module, a key sentence filtering module and a text abstract completion module. The text preprocessing module is used for preprocessing the original text from which the abstract is to be extracted. The key sentence extraction module is used for extracting key sentences in the original text and their corresponding criticality with a standard TextRank algorithm. The key sentence semantic similarity calculation module is used for segmenting the extracted key sentences with a word segmentation tool and filtering out preset stop words, then performing word embedding on the remaining word segments based on a word2vec model, calculating the semantic relevance between the word segments of every two key sentences from the word embedding results, and thereby obtaining the semantic similarity between every two key sentences. The key sentence filtering module is used for filtering key sentences according to the semantic similarity between every two key sentences and their corresponding criticality. The text abstract completion module is used for extracting the complete semantic units containing the filtered key sentences as new key sentences and generating the text abstract after ordering them.
Further, the preprocessing performed by the text preprocessing module comprises removing non-text content such as space characters and line feed characters, and text-encoding the remaining text content. The key sentence filtering module filters the key sentences as follows: a similarity threshold is preset; for every two key sentences whose semantic similarity reaches the threshold, the key sentence with lower criticality is removed and only the one with higher criticality is kept. In addition, the text abstract completion module of the system is specifically implemented as follows: the original text is divided into a plurality of semantic units according to Chinese end-of-sentence identifiers; the semantic units containing the filtered key sentences are extracted and de-duplicated to serve as new key sentences; and the new key sentences are ordered by their order of appearance in the original text to form the final abstract.
Compared with the prior art, the text abstract generating method and the text abstract generating system provided by the invention have the following advantages:
1. semantic similarity is calculated only for the extracted key sentences, and the calculation amount is small;
2. key sentences based on semantic similarity measurement are filtered, and semantic diversity of generated summaries is guaranteed under the same summary length;
3. each key sentence is expanded to a complete semantic unit in the original text, so that the overall readability of the generated abstract is improved, the key sentences are reordered according to the sequence of the key sentences in the original text, and the semantic consistency of the abstract is improved.

Claims (10)

1. A text abstract generation method, comprising the following steps:
a. preprocessing the original text from which the abstract is to be extracted;
b. extracting key sentences in the original text and their corresponding criticality based on a standard TextRank algorithm;
c. segmenting the extracted key sentences with a word segmentation tool, filtering out preset stop words, performing word embedding on the remaining word segments based on a word2vec model, and calculating the semantic similarity between every two key sentences according to the similarity of the word segments in the key sentences;
d. deleting the sentence with lower criticality from every two key sentences whose similarity exceeds a similarity threshold, thereby filtering the key sentences;
e. taking the semantic units containing the filtered key sentences as new key sentences, and ordering them to generate the abstract.
2. The method according to claim 1, wherein the preprocessing in step a is specifically: after removing space characters, line feed characters and other non-text content from the original text, text-encoding the remaining text content.
3. The method of claim 1, wherein step d is embodied as: setting a similarity threshold, and deleting the key sentence with lower criticality from every two similar key sentences whose semantic similarity exceeds the threshold.
4. The method of claim 2, wherein the text encoding employs an encoding method comprising: UTF-8, GB2312, GBK and ASCII.
5. The method of claim 1, wherein step e is embodied as: dividing the original text into a plurality of semantic units according to Chinese end-of-sentence identifiers; extracting the semantic units containing the filtered key sentences, de-duplicating them to serve as new key sentences, and ordering the new key sentences by their order of appearance in the original text to form the final abstract.
6. The method of claim 5, wherein the chinese ending identifier comprises at least: ". ","? ","! ","? ".
7. A text abstract generation system, the system comprising: a text preprocessing module, a key sentence extraction module, a key sentence semantic similarity calculation module, a key sentence filtering module and a text abstract completion module; wherein
the text preprocessing module is used for preprocessing the original text from which the abstract is to be extracted;
the key sentence extraction module is used for extracting key sentences in the original text and their corresponding criticality with a standard TextRank algorithm;
the key sentence semantic similarity calculation module is used for segmenting the extracted key sentences with a word segmentation tool and filtering out preset stop words, then performing word embedding on the remaining word segments based on a word2vec model, and calculating the semantic relevance between the word segments of every two key sentences from the word embedding results, so as to obtain the semantic similarity between every two key sentences;
the key sentence filtering module is used for removing, according to the semantic similarity between every two key sentences and their corresponding criticality, the key sentence with lower criticality from every two key sentences whose semantic similarity exceeds a preset threshold;
the text abstract completion module is used for taking the semantic units containing the filtered key sentences as new key sentences and linking them in the order in which they appear in the text to generate the text abstract.
8. The system of claim 7, wherein the text preprocessing module is embodied as: removing non-text content such as space characters and line feed characters from the original text, and then text-encoding the remaining text content.
9. The system of claim 7, wherein the key sentence filtering module is embodied as: setting a similarity threshold, and deleting the key sentence with lower criticality from every two similar key sentences whose semantic similarity exceeds the threshold.
10. The system of claim 7, wherein the text abstract completion module is embodied as: dividing the original text into a plurality of semantic units according to Chinese end-of-sentence identifiers; extracting the semantic units containing the filtered key sentences, de-duplicating them to serve as new key sentences, and ordering the new key sentences by their order of appearance in the original text to form the final abstract.
CN201811088867.1A 2018-09-18 2018-09-18 Text abstract generation method and system Pending CN110929022A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811088867.1A CN110929022A (en) 2018-09-18 2018-09-18 Text abstract generation method and system

Publications (1)

Publication Number Publication Date
CN110929022A true CN110929022A (en) 2020-03-27

Family

ID=69855772

Country Status (1)

Country Link
CN (1) CN110929022A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017084267A1 (en) * 2015-11-18 2017-05-26 乐视控股(北京)有限公司 Method and device for keyphrase extraction
CN107133213A (en) * 2017-05-06 2017-09-05 广东药科大学 A kind of text snippet extraction method and system based on algorithm
CN108197111A (en) * 2018-01-10 2018-06-22 华南理工大学 A kind of text automatic abstracting method based on fusion Semantic Clustering
CN108549634A (en) * 2018-04-09 2018-09-18 北京信息科技大学 A kind of Chinese patent text similarity calculating method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
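王志宏;过弋: "基于词句重要性的中文专利关键词自动抽取研究" [Research on automatic keyword extraction from Chinese patents based on word and sentence importance], no. 09 *
高雪霞: "基于语句类似度优化计算的改进自动摘要算法研究" [Research on an improved automatic summarization algorithm based on optimized sentence similarity calculation], 计算机应用与软件 [Computer Applications and Software], pages 2 - 3 *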

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112101005A (en) * 2020-04-02 2020-12-18 上海迷因网络科技有限公司 Method for generating and dynamically adjusting quick expressive force test questions
CN112101005B (en) * 2020-04-02 2022-08-30 上海迷因网络科技有限公司 Method for generating and dynamically adjusting quick expressive force test questions
CN113590809A (en) * 2021-07-02 2021-11-02 华南师范大学 Method and device for automatically generating referee document abstract
CN113535942A (en) * 2021-07-21 2021-10-22 北京海泰方圆科技股份有限公司 Text abstract generation method, device, equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination