CN109815328B - Abstract generation method and device - Google Patents

Abstract generation method and device

Info

Publication number
CN109815328B
CN109815328B
Authority
CN
China
Prior art keywords
abstract
sentence
sets
sentences
target
Prior art date
Legal status
Active
Application number
CN201811626213.XA
Other languages
Chinese (zh)
Other versions
CN109815328A (en)
Inventor
董超
崔朝辉
赵立军
Current Assignee
Neusoft Corp
Original Assignee
Neusoft Corp
Priority date
Filing date
Publication date
Application filed by Neusoft Corp
Priority to CN201811626213.XA
Publication of CN109815328A
Application granted
Publication of CN109815328B
Legal status: Active
Anticipated expiration

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Document Processing Apparatus (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses an abstract generation method and device, wherein the method comprises the following steps: obtaining a plurality of abstract sets, where each abstract set comprises abstract sentences of the expected texts in a corresponding time segment, and the time segments corresponding to any two abstract sets are different; obtaining the difference degree of a first abstract set with respect to a second abstract set according to the abstract sentences included in the abstract sets, so as to obtain the abstract difference degree corresponding to the first abstract set; selecting, based on each obtained abstract difference degree, abstract sets that meet a first screening condition from the plurality of abstract sets, thereby removing abstract sets that largely repeat or are redundant with other abstract sets; and combining the abstract sentences included in the selected abstract sets to generate the abstract. In this way, repetitive and redundant abstract content can be reduced while abstract coverage is preserved, so that readers can accurately grasp the central idea of the expected texts and the development of events.

Description

Abstract generation method and device
Technical Field
The present application relates to the field of natural language processing, and in particular, to a method and an apparatus for generating an abstract.
Background
Automatic summarization technology automatically extracts keywords (or key sentences) from a given long text and then organizes them, through certain rules or means, into a short passage that summarizes the central idea of the original text. Today, however, people are exposed to huge amounts of text information every day; for example, a large amount of news text from different media and different channels is generated daily. Against this background, traditional abstract generation for a single long text is of little help to people in grasping key information timely and accurately, and is difficult to put to use. A summarization technique for multiple long texts is therefore needed.
Multi-document summarization (MDS) technology was developed for this purpose: it takes a plurality of long texts as input and automatically generates a summary text of a specified length as required. Multi-document summaries can be divided into non-chronological static automatic text summaries and chronological dynamic automatic text summaries.
Taking news summarization as an example: because the emergence and subsequent development of news on a topic do not fall within a single time segment, and new developments may occur every day, what people often really need is a chronological summary that follows the progress of the same topic or event, i.e., dynamic automatic text summarization. Conventional dynamic multi-text automatic summarization, however, first performs static summarization on the data in each time slice and then simply splices the resulting static summaries in time order, which produces a great deal of repetitive and redundant summary content.
Disclosure of Invention
In view of this, embodiments of the present application provide a method and an apparatus for generating an abstract, which can solve the problem in the prior art that an extracted abstract has repetitive and redundant abstract contents.
The abstract generation method provided in the first aspect of the embodiments of the present application includes:
obtaining a plurality of abstract sets; each abstract set comprises abstract sentences of the expected texts in a corresponding time segment, and the time segments corresponding to any two abstract sets are different;
obtaining the difference degree of a first abstract set with respect to a second abstract set according to the abstract sentences included in the abstract sets, so as to obtain the abstract difference degree corresponding to the first abstract set; the first abstract set and the second abstract set are any two of the plurality of abstract sets;
selecting an abstract set which meets a first screening condition from the plurality of abstract sets based on each obtained abstract difference degree;
and combining the abstract sentences in the selected abstract set to generate the abstract.
Optionally, the determining, according to the abstract sentences included in the abstract sets, the difference degree of the first abstract set with respect to the second abstract set to obtain the abstract difference degree corresponding to the first abstract set specifically includes:
obtaining the difference degree of each abstract sentence in the first abstract set with respect to the second abstract set based on the reproduction state, in the second abstract set, of the characters or character strings in each abstract sentence in the first abstract set;
and synthesizing the difference degree of each abstract sentence in the first abstract set with respect to the second abstract set to obtain the abstract difference degree corresponding to the first abstract set.
Optionally, the obtaining the difference degree of each abstract sentence in the first abstract set with respect to the second abstract set based on the reproduction state, in the second abstract set, of the characters or character strings in each abstract sentence in the first abstract set specifically includes:
extracting a plurality of character strings from the target abstract sentence according to a preset rule; the target abstract sentence is any one abstract sentence in the first abstract set;
counting the number of character strings which are not reproduced in the second abstract set in the plurality of character strings to obtain a statistical value;
and acquiring the difference degree of the target abstract sentence to the second abstract set according to the statistical value and the number of the character strings.
Optionally, the obtaining a plurality of abstract sets specifically includes:
obtaining a sentence dividing result of an expected text in a first time segment to obtain a first sentence set; the first time segment is a time segment corresponding to the first abstract set;
extracting a subject word from the first sentence set by using a subject model to obtain a first subject word set;
obtaining article relevance corresponding to each sentence in the first sentence set according to the word frequency of the sentence co-occurrence words in the expected text in the first time segment and the word frequency of the topic co-occurrence words in the expected text in the first time segment; the sentence co-occurrence words are characters or character strings which appear in any two sentences in the first sentence set at the same time, the topic co-occurrence words are characters or character strings which appear in the first sentence set and the topic words at the same time, and the article relevance represents the possibility of reflecting the central thought of the expected text in the first time segment;
and selecting sentences which accord with a second screening condition from the first sentence set according to the article relevance to obtain the first abstract set.
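As a rough, self-contained illustration of the first two steps of this pipeline (the patent calls for a topic model to extract topic words; the frequency-based stand-in, the naive sentence splitter, the stop-word list, and the cut-off below are all illustrative assumptions, not the patent's method):

```python
import re
from collections import Counter

def split_sentences(text):
    # Naive sentence splitting on end punctuation; a real system needs more care.
    return [s.strip() for s in re.split(r"[.!?。！？]+", text) if s.strip()]

def topic_words(sentences, top_k=5, stop_words=frozenset({"the", "a", "of", "in"})):
    # Stand-in for a topic model: simply the most frequent non-stop-words.
    counts = Counter(w.lower() for s in sentences for w in s.split()
                     if w.lower() not in stop_words)
    return [w for w, _ in counts.most_common(top_k)]
```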
Optionally, obtaining an article relevancy corresponding to each sentence in the first sentence set according to the word frequency of the sentence co-occurrence word in the expected text in the first time segment and the word frequency of the topic co-occurrence word in the expected text in the first time segment, specifically including:
iterating by using an optimization algorithm to obtain the article relevance; for the ith iteration:
the article relevance of a target sentence is obtained according to the word frequency, in the expected text in the first time segment, of the sentence co-occurrence words between the target sentence and each sentence in a sentence subset, the article relevance of each sentence in the sentence subset, the word frequency, in the expected text in the first time segment, of the topic co-occurrence words between the target sentence and each topic word in the first topic word set, and the article relevance of each topic word in the first topic word set;
the target sentence is any sentence in the first sentence set, the sentence subset includes the sentences other than the target sentence in the first sentence set, and the article relevance of a topic word is obtained according to the word frequency of the topic word in the expected text in the first time segment and the article relevance of the sentences in the first sentence set that include the topic word.
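One way such an iteration could be sketched is a power-iteration over a sentence graph whose edge weights come from the document frequencies of co-occurrence words, with topic words contributing an extra pull on the sentences that contain them; the weighting scheme, damping factor, iteration count, and normalization below are illustrative assumptions rather than the patent's exact optimization algorithm:

```python
def article_relevance(sentences, topic_word_set, doc_word_freq, iters=50, damping=0.85):
    """Iteratively score sentences; weights come from shared-word frequencies.

    sentences: list of token lists; topic_word_set: set of tokens;
    doc_word_freq: mapping of token -> frequency in the expected text.
    All weighting choices below are illustrative assumptions.
    """
    n = len(sentences)
    token_sets = [set(s) for s in sentences]

    # Edge weight between two sentences: total document frequency of co-occurring words.
    def sent_weight(i, j):
        return sum(doc_word_freq[w] for w in token_sets[i] & token_sets[j])

    # Edge weight between a sentence and the topic-word set.
    def topic_weight(i):
        return sum(doc_word_freq[w] for w in token_sets[i] & topic_word_set)

    scores = [1.0 / n] * n
    for _ in range(iters):
        new = []
        for i in range(n):
            s = sum(sent_weight(i, j) * scores[j] for j in range(n) if j != i)
            s += topic_weight(i)            # topic words pull related sentences up
            new.append((1 - damping) + damping * s)
        total = sum(new) or 1.0
        scores = [v / total for v in new]   # normalize to keep values comparable
    return scores
```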
Optionally, the selecting, based on each obtained abstract difference degree, an abstract set meeting a first screening condition from the plurality of abstract sets specifically includes:
synthesizing the article relevance of each abstract sentence in the first abstract set to obtain the article relevance corresponding to the first abstract set;
and selecting an abstract set which meets the first screening condition from the plurality of abstract sets based on each obtained abstract difference degree and each article relevance.
Optionally, the obtaining a plurality of abstract sets further includes:
obtaining the novelty of a target abstract sentence based on the reproduction state of the characters or character strings in the target abstract sentence in each abstract sentence in a first abstract subset; the target abstract sentence is any one abstract sentence in the first abstract set, and the first abstract subset comprises the abstract sentences other than the target abstract sentence in the first abstract set;
synthesizing the novelty of each abstract sentence in the first abstract set to obtain the novelty of the first abstract set;
accordingly, the selecting, from the plurality of abstract sets, an abstract set that meets the first screening condition based on each obtained abstract difference degree specifically includes:
selecting an abstract set meeting the first screening condition from the plurality of abstract sets based on each obtained abstract difference degree and the novelty of each abstract set.
Optionally, after the novelty of each abstract sentence in the first abstract set is obtained, the method further includes:
eliminating, from the first abstract set, abstract sentences which do not meet a third screening condition, based on the novelty of each abstract sentence in the first abstract set.
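A minimal sketch of such a novelty computation and within-set elimination, assuming each sentence is represented by its set of n-grams and that the "third screening condition" is a fixed novelty threshold (both assumptions; the patent does not fix a concrete threshold here):

```python
def sentence_novelty(sentence_ngrams, other_sentences_ngrams):
    """Fraction of the sentence's n-grams that do not recur in any other
    sentence of the same abstract set (higher = more novel)."""
    seen = set().union(*other_sentences_ngrams) if other_sentences_ngrams else set()
    if not sentence_ngrams:
        return 0.0
    fresh = sum(1 for g in sentence_ngrams if g not in seen)
    return fresh / len(sentence_ngrams)

def drop_redundant(sentences_ngrams, threshold=0.5):
    """Keep the indices of sentences whose novelty meets an assumed threshold."""
    kept = []
    for i, grams in enumerate(sentences_ngrams):
        others = sentences_ngrams[:i] + sentences_ngrams[i + 1:]
        if sentence_novelty(grams, others) >= threshold:
            kept.append(i)
    return kept
```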
The abstract generation apparatus provided in the second aspect of the embodiments of the present application includes: a first obtaining unit, a second obtaining unit, a screening unit and a combining unit;
the first obtaining unit is used for obtaining a plurality of abstract sets; each abstract set comprises abstract sentences of the expected texts in a corresponding time segment, and the time segments corresponding to any two abstract sets are different;
the second obtaining unit is configured to obtain the difference degree of the first abstract set with respect to the second abstract set according to the abstract sentences included in the abstract sets, so as to obtain the abstract difference degree corresponding to the first abstract set; the first abstract set and the second abstract set are any two of the plurality of abstract sets;
the screening unit is used for selecting an abstract set meeting the first screening condition from the plurality of abstract sets based on each obtained abstract difference degree;
and the combining unit is used for combining the abstract sentences in the selected abstract set to generate the abstract.
The third aspect of the embodiments of the present application further provides a computer-readable storage medium on which a computer program is stored; when the computer program is executed by a processor, it implements any one of the abstract generation methods provided in the first aspect.
A fourth aspect of the embodiments of the present application further provides an abstract generation device, including: a processor and a memory;
the memory is used for storing program code and transmitting the program code to the processor;
and the processor is configured to execute, according to instructions in the program code, any one of the abstract generation methods provided in the first aspect.
Compared with the prior art, the method has the advantages that:
In the embodiments of the present application, the abstract sets of the expected texts in a plurality of different time segments are obtained first, yielding a plurality of abstract sets. Then, according to the abstract sentences included in the abstract sets, the difference degree between any two of the plurality of abstract sets is obtained, giving the abstract difference degree corresponding to each abstract set; the abstract difference degree reflects the degree of repetition between each abstract set and the other abstract sets. Next, based on the abstract difference degree corresponding to each abstract set, the abstract sets meeting the first screening condition are selected from the plurality of abstract sets. Because the progressiveness and variability of the abstract subject across different time segments are considered and the repetition between the abstract sets corresponding to different time segments is judged, abstract sets that largely repeat or are redundant with other abstract sets are removed before the abstract sentences in the selected abstract sets are combined to generate the abstract. Repetitive and redundant abstract content can thus be reduced while abstract coverage is preserved, allowing readers to accurately grasp the central idea of the expected texts and the development of events.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description are only some embodiments described in the present application, and those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic flowchart of an abstract generation method according to an embodiment of the present application;
fig. 2 is a schematic flowchart of another abstract generation method according to an embodiment of the present application;
fig. 3 is a schematic flowchart of another abstract generation method according to an embodiment of the present application;
fig. 4 is a schematic flowchart of another abstract generation method according to an embodiment of the present application;
fig. 5 is a schematic flowchart of another abstract generation method according to an embodiment of the present application;
fig. 6 is a schematic structural diagram of an abstract generation apparatus according to an embodiment of the present application.
Detailed Description
In order to make the technical solutions of the present application better understood, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
It should be understood that in the present application, "at least one" means one or more, "a plurality" means two or more. "and/or" for describing an association relationship of associated objects, indicating that there may be three relationships, e.g., "a and/or B" may indicate: only A, only B and both A and B are present, wherein A and B may be singular or plural. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship. "at least one of the following" or similar expressions refer to any combination of these items, including any combination of single item(s) or plural items. For example, at least one (one) of a, b, or c, may represent: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", wherein a, b, c may be single or plural.
Today, people are exposed to huge amounts of text information every day; for example, a large amount of news text from different media and different channels is generated daily. Multi-text abstracts are generated so that the central idea and the development context of an article's subject can be acquired quickly and effectively. When obtaining an abstract capable of representing the development context of an event, however, the existing multi-text summarization technology simply splices the abstract sentences extracted from the texts in different time segments in time order. According to the way events generally develop, content described in the expected text of a previous time segment may be repeated in the expected text of the current time segment; for example, news of the current day will briefly review news of the same subject from the previous day. Because the existing multi-text summarization technology does not consider the progressiveness and variability of the abstract subject across different time segments, a great deal of repetitive and redundant content exists in the abstracts of texts from different time segments, and the reader cannot timely and effectively grasp the development context of the abstract subject.
Therefore, the embodiments of the present application provide an abstract generation method and device. In consideration of the progressiveness and variability of abstract subjects across different time segments, the repetition between the abstract sets corresponding to two different time segments is judged, abstract sets that largely repeat or are redundant with other abstract sets are removed, and the abstract sentences in the remaining abstract sets are combined to generate the abstract. Repetitive and redundant abstract content can thus be reduced while coverage of the abstract subject is preserved, so that readers can effectively grasp the development context of the subject.
Based on the above-mentioned ideas, in order to make the above-mentioned objects, features and advantages of the present application more comprehensible, specific embodiments of the present application are described in detail below with reference to the accompanying drawings.
Please refer to fig. 1, which is a flowchart illustrating a summary generation method according to an embodiment of the present disclosure.
The summary generation method provided by the embodiment of the application comprises the following steps:
s101: a plurality of digest sets are obtained.
Each abstract set comprises abstract sentences of expected texts in corresponding time segments, and the time segments corresponding to any two abstract sets are different.
It should be noted that the time segments roughly divide the development stages of the abstract theme, and in specific implementations they can be set according to specific needs; for example, a time segment may be several hours, one day, one week, or one month, which is not limited in the embodiments of the present application. In one example, the time segments may be set according to how quickly the abstract theme develops. When the abstract theme develops and changes quickly (for example, the progress of a conference), the time slice may be set short, such as several hours or one day; when it develops and changes slowly (for example, the development of train technology), the time slice may be set longer, for example one year.
In the embodiments of the present application, the expected text is text related to the abstract topic and may be retrieved from a database (e.g., an internet database) or a knowledge base. It should be noted that the embodiments of the present application do not limit the language of the expected text, which may be Chinese, English, Japanese, and the like. Each expected text carries corresponding time attributes, such as writing time, publishing time, or the occurrence time of the recorded content; from these time attributes, the time segment in which the expected text falls can be determined. In a specific implementation, according to a query term (i.e., an abstract topic) carried in a received query request, a search engine component may search a database or knowledge base for texts related to (e.g., containing) the query term, so as to obtain at least one expected text.
It is understood that each abstract set includes abstract sentences of the expected text in the corresponding time segment, so that the abstract sentences in the abstract sets are combined to obtain an abstract representing the central idea or development context of the abstract theme. The following will explain in detail how to obtain the abstract sentence of the desired text, which is not described herein again.
S102: and determining the difference degree of the first abstract set to the second abstract set according to the abstract sentences included in the abstract sets to obtain the abstract difference degree corresponding to the first abstract set.
It is to be understood that the first and second digest sets are any two of the plurality of digest sets. In specific implementation, in order to reduce the repeated content or redundant content in the summary, a corresponding summary difference degree may be obtained for each summary set.
In the embodiment of the application, the abstract difference degree reflects the difference of two abstract sets on the expression content. Because the content expressed by each abstract set can be expressed by the abstract sentences included in the abstract set, the difference degree of the first abstract set to the second abstract set can be determined according to the abstract sentences included in the abstract sets. If the content expressed by the first abstract set and the content expressed by the second abstract set are different greatly, namely the first abstract set and the second abstract set comprise less repeated content or redundant content in the abstract sentence, or no repeated content or redundant content, the abstract difference degree of the first abstract set is large; if the content expressed by the first abstract set and the content expressed by the second abstract set are less different, that is, the first abstract set and the second abstract set comprise more repeated content or redundant content in the abstract sentences, the abstract difference degree of the first abstract set is small.
It should be noted that, due to the content described in the expected text in the previous time segment, the content may be repeated in the expected text in the current time segment, for example, a brief previous review of news events with the same topic in the previous day may be performed in the news of the current day, so that the duplicate content or the redundant content appears in the summary set corresponding to the previous time segment and the summary set corresponding to the current time segment. Therefore, in order to reduce the repeatability and/or redundancy of generating the digests, in some possible implementations of the embodiments of the present application, the start time of the time segment corresponding to the first digest set may be later than the start time of the time segment corresponding to the second digest set. As one example, the first digest set may correspond to a current time segment (e.g., the current day) and the second digest set may correspond to a previous time segment (e.g., the previous day).
S103: and selecting an abstract set meeting the first screening condition from the plurality of abstract sets based on the obtained difference degree of each abstract.
In the embodiment of the application, the abstract difference degree reflects the difference situation of the two abstract sets on the expression content, the difference situation of the content expressed by the abstract set and the content expressed by other abstract sets can be determined by utilizing the abstract difference degree of each abstract set, so that the abstract set with higher abstract difference degree (meeting a first screening condition) can be selected from the abstract sets, repeated content or redundant content in the abstract sets is eliminated, the abstract sets with low repeatability and redundancy are obtained in a combined mode, the central thought and the development context of the abstract subject are simplified and summarized, and readers can understand conveniently.
In practical application, the first screening condition may be set according to specific needs, for example, the digest difference needs to be greater than a certain threshold, and the like, which is not limited herein.
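A threshold-based first screening condition of this kind could be sketched as follows (the threshold value is an illustrative assumption):

```python
def select_summary_sets(diff_degrees, threshold=0.3):
    """Keep the indices of abstract sets whose abstract difference degree
    exceeds an assumed threshold (one possible 'first screening condition')."""
    return [i for i, d in enumerate(diff_degrees) if d > threshold]
```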
S104: and combining the abstract sentences in the selected abstract set to generate the abstract.
In the embodiments of the present application, the selected abstract set differs more in content from the other abstract sets (such as the abstract set corresponding to the previous time segment) and contains less repeated or redundant content, so the abstract generated by combining the abstract sentences included in the selected abstract sets correspondingly contains less repeated or redundant content. The repetition and redundancy of the abstract are thus reduced, and a reader can accurately grasp, from the generated abstract, the central idea of the expected texts and the development of events.
In practical applications, the selected abstract sets can be combined directly in the time order of their corresponding time segments, the abstract sentences within an abstract set can be arranged in the order in which they appear in the expected text, and identical or similar abstract sentences in the same abstract set can be merged to reduce repeated content.
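The combination step described above might be sketched as follows, assuming each selected set is tagged with the start of its time segment and that exact duplicate sentences are emitted only once, keeping the first occurrence (both choices are illustrative, not the patent's prescription):

```python
def combine_abstract(selected_sets):
    """selected_sets: list of (time_segment_start, [abstract sentences]),
    assumed already screened. Sets are ordered chronologically; exact
    repeats within the final abstract are dropped, keeping the first."""
    ordered = sorted(selected_sets, key=lambda pair: pair[0])
    seen, lines = set(), []
    for _, sentences in ordered:
        for s in sentences:
            if s not in seen:       # drop exact repeats across/within sets
                seen.add(s)
                lines.append(s)
    return " ".join(lines)
```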
In the embodiments of the present application, the abstract sets of the expected texts in a plurality of different time segments are obtained first, yielding a plurality of abstract sets. Then, according to the abstract sentences included in the abstract sets, the difference degree between any two of the plurality of abstract sets is obtained, giving the abstract difference degree corresponding to each abstract set; the abstract difference degree reflects the degree of repetition between each abstract set and the other abstract sets. Next, based on the abstract difference degree corresponding to each abstract set, the abstract sets meeting the first screening condition are selected from the plurality of abstract sets. Because the progressiveness and variability of the abstract subject across different time segments are considered and the repetition between the abstract sets corresponding to different time segments is judged, abstract sets that largely repeat or are redundant with other abstract sets are removed before the abstract sentences in the selected abstract sets are combined to generate the abstract. Repetitive and redundant abstract content can thus be reduced while abstract coverage is preserved, allowing readers to accurately grasp the central idea of the expected texts and the development of events.
The above description explains how an abstract set combination meeting the first screening condition is selected from a plurality of abstract sets according to their abstract difference degrees to generate the abstract. The following takes the first abstract set (i.e., any one of the plurality of abstract sets) as an example to illustrate how the abstract difference degree of an abstract set is specifically obtained; the abstract difference degrees of the other abstract sets are obtained similarly and are not described again.
Referring to fig. 2, the figure is a schematic flow chart of another digest generation method provided in the embodiment of the present application.
In some possible implementation manners of the embodiment of the present application, step S102 may specifically include:
s201: and obtaining the difference degree of each abstract sentence in the first abstract set to the second abstract set based on the reappearance state of the characters or the character strings in each abstract sentence in the first abstract set in the second abstract set.
In the embodiments of the present application, the reproduction state specifically means presence or absence. When a character or a character string (such as a word or several consecutive words) in an abstract sentence included in the first abstract set appears in an abstract sentence of the second abstract set, the reproduction state is "present"; when it does not appear in any abstract sentence of the second abstract set, the reproduction state is "absent".
It is understood that the difference degree reflects the difference between the two in content. When characters or character strings in the abstract sentences of the first abstract set appear in the abstract sentences of the second abstract set, repeated or redundant content exists between those abstract sentences and the second abstract set, and the difference degree is small; conversely, when they do not appear in the abstract sentences of the second abstract set, no repeated or redundant content exists between them, and the difference degree is large. Therefore, the difference degree of each abstract sentence in the first abstract set with respect to the second abstract set can be obtained according to the reproduction state, in the second abstract set, of the characters or character strings in the abstract sentences of the first abstract set.
In some possible implementation manners of the embodiment of the present application, as shown in fig. 3, step S201 may specifically include:
S2011: and extracting a plurality of character strings from the target abstract sentence according to a preset rule.
It is understood that the target abstract sentence is any one of the abstract sentences in the first abstract set. In practical application, a preset rule may be specifically set to determine the length of the extracted character string and the extraction method.
As an example, in order to ensure the accuracy and comprehensiveness of the statistics on the difference degree, the preset rule may be an N-gram rule, and the character string is extracted from the target abstract sentence by using the N-gram rule.
N-gram: a sequence of N items from a piece of text, as used in Natural Language Processing (NLP). An item may be a character or a word, etc. When N is 1, it may be referred to as a unigram; when N is 2, a bigram; when N is 3, a trigram; and so on. Taking the trigram as an example, starting from the first character of the target abstract sentence, the 1st to 3rd characters form one character string, the 2nd to 4th characters form another, the 3rd to 5th characters another, and so on, until the (n-2)-th to n-th characters are taken to form the last character string, so that the plurality of character strings extracted from the target abstract sentence are obtained.
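As a minimal sketch of the N-gram extraction described above (Python is used only for illustration; the description does not prescribe a language), the overlapping character n-grams of a sentence can be collected like this:

```python
def extract_ngrams(sentence, n=3):
    """Slide a window of n characters over the sentence (trigram when n=3)."""
    return [sentence[i:i + n] for i in range(len(sentence) - n + 1)]

# For "abcdef" with n=3 this yields ["abc", "bcd", "cde", "def"].
```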
S2012: and counting the number of character strings which are not reproduced in the second abstract set in the plurality of character strings corresponding to the target abstract sentence to obtain a statistical value.
In practical application, the intersection of the character string of the target abstract sentence and the character string included in the abstract sentence in the second abstract set can be counted first, and the number of the character strings which do not belong to the intersection in the plurality of character strings corresponding to the target abstract sentence is counted to obtain a statistical value.
It is understood that the more character strings of the target abstract sentence fail to reappear in the second abstract set, the greater the difference degree between the target abstract sentence and the second abstract set.
S2013: and acquiring the difference degree of the target abstract sentence to the second abstract set according to the statistical value and the number of the plurality of character strings corresponding to the target abstract sentence.
In the embodiment of the present application, the statistical value and the difference of the target abstract sentence from the second abstract set are in a positive correlation. The larger the statistical value is, the smaller the number of the character strings in the target abstract sentence reappearing in the second abstract set is, and the larger the difference degree between the target abstract sentence and the second abstract set is; on the contrary, the smaller the statistical value is, the larger the number of character strings reproduced in the target abstract sentence in the second abstract set is, and the smaller the difference between the target abstract sentence and the second abstract set is.
As an example, the difference between the target abstract sentence and the second abstract set may be obtained according to a ratio of the number of character strings, which are not reproduced in the second abstract set, of the character strings corresponding to the target abstract sentence to the number of character strings corresponding to the target abstract sentence.
Then, the difference degree con of the target abstract sentence with respect to the second abstract set can be obtained according to the following formula (1):

$$con(s_{i,j}, DS_{i-1}) = \frac{\left|\{\text{n-gram} \mid \text{n-gram} \in s_{i,j},\ \text{n-gram} \notin DS_{i-1}\}\right|}{\left|\{\text{n-gram} \mid \text{n-gram} \in s_{i,j}\}\right|} \qquad (1)$$

in the formula, s_{i,j} is the j-th abstract sentence in the abstract set corresponding to the i-th time segment, DS_{i-1} is the abstract set corresponding to the (i-1)-th time segment, n-gram denotes a character or character string, and |·| denotes the number of elements.
S202: and integrating the difference degree of each abstract sentence in the first abstract set to the second abstract set to obtain the abstract difference degree corresponding to the first abstract set.
It can be understood that the difference between the first abstract set and the second abstract set (i.e. the abstract difference) is related to the recurrence of the character strings in the abstract sentences included in the first abstract set to the second abstract set, so that the abstract difference corresponding to the first abstract set can be obtained by integrating the difference between each abstract sentence in the first abstract set and the second abstract set.
As an example, the abstract difference degree C corresponding to the first abstract set can be obtained according to the following formula (2):

$$C(DS_i) = \frac{1}{N} \sum_{a=1}^{N} con(s_{i,a}, DS_{i-1}) \qquad (2)$$

in the formula, DS_i is the abstract set corresponding to the i-th time segment, s_{i,a} is the a-th abstract sentence in the abstract set corresponding to the i-th time segment, and N is the total number of abstract sentences included in DS_i.
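The computation of steps S2011 through S202 can be sketched as follows. This is a hedged illustration: n-grams are collected as sets, and the per-sentence difference degrees are averaged over the set, which is one plausible reading of "integrating" in step S202.

```python
def ngrams(text, n=3):
    """All overlapping character n-grams of a string, as a set."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def sentence_difference(sentence, prev_set, n=3):
    """Formula (1): the share of the sentence's n-grams that never appear in
    prev_set, the previous time segment's abstract set (a list of sentences)."""
    grams = ngrams(sentence, n)
    if not grams:
        return 0.0
    seen = set().union(*(ngrams(s, n) for s in prev_set)) if prev_set else set()
    return len(grams - seen) / len(grams)

def set_difference(cur_set, prev_set, n=3):
    """Formula (2): average difference degree over the current set's sentences."""
    return sum(sentence_difference(s, prev_set, n) for s in cur_set) / len(cur_set)
```

For example, `sentence_difference("abcdef", ["abcxyz"])` is 0.75: three of the four trigrams of "abcdef" do not reappear in "abcxyz".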
The above description explains how to select, from a plurality of summary sets, a combination of summary sets meeting the first screening condition according to their summary difference degrees so as to generate the summary, and how to obtain the summary difference degree. The following takes the first summary set (i.e., any one of the plurality of summary sets) as an example to illustrate how a summary set is specifically obtained; the other summary sets are obtained similarly and are not described again.
Referring to fig. 4, this figure is a schematic flow chart of another digest generation method provided in the embodiment of the present application.
In some possible implementation manners of the embodiment of the present application, step S101 may specifically include:
S401: and obtaining a sentence segmentation result of the expected text in the first time segment to obtain a first sentence set.
It can be understood that the first time segment is a time segment corresponding to the first summary set. After sentence separation processing is carried out on the expected text in the first time segment, a first sentence set can be obtained.
In practical application, the sentence-splitting processing may be performed according to punctuation marks in the expected text; that is, the splitting logic takes punctuation marks such as the period, comma, exclamation mark, question mark, semicolon and colon as separators, and each expected text is divided into sentences that each independently convey a meaning (i.e., the sentence segmentation result), so as to obtain the first sentence set.
It should be noted that, when the number of words included in a sentence is too small, the expressed information is correspondingly small, and the central idea of the desired text cannot be summarized effectively and accurately. Therefore, in some possible implementations of the embodiments of the present application, the sentences whose number of words in the sentences obtained by splitting is less than the threshold number of words are deleted from the first sentence set. The word number threshold may be set according to actual needs, such as 10, and is not limited herein.
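The splitting and short-sentence filtering of step S401 can be sketched as below. The punctuation set and the word-count threshold of 10 follow the description above; tokenizing by whitespace is an illustrative assumption, since for Chinese text one would count characters or segmented words instead.

```python
import re

PUNCT = r"[。，！？；：.,!?;:]"  # the separators listed in the description

def split_sentences(text, min_words=10):
    """Split the expected text into clauses and drop those too short to
    summarize the central idea of the text."""
    clauses = [c.strip() for c in re.split(PUNCT, text) if c.strip()]
    return [c for c in clauses if len(c.split()) >= min_words]
```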
In some possible implementation manners, segmentation processing may be performed on the expected text in the first time segment to obtain a first sentence set, and the specific principle is similar to that of sentence segmentation processing and is not described herein again.
S402: and extracting the subject words from the first sentence set by using the subject model to obtain a first subject word set corresponding to the first time segment.
The subject term is an artificial language used for expressing text topics in indexing and retrieval, and has conceptualized and normalized features. In the embodiment of the present application, any one of the subject extraction methods may be adopted to extract a subject term from the first sentence set, for example, a Latent Dirichlet Allocation (LDA) subject model and the like. The extracted subject words form a first subject word set corresponding to the first time segment.
It should be noted that, taking the LDA topic model as an example, a general topic model needs to set the number of topic words first and perform clustering to obtain the topic words. However, in practical applications, the number of topic words in each time segment is not constant, and errors are likely to occur by using the LDA topic model. Therefore, in some possible implementations of the embodiment of the present application, a Hierarchical Dirichlet Process (HDP) may be used to generate a topic word set corresponding to each time slice. The HDP is a non-parametric model and is a variant of the LDA topic model, and the HDP has the advantages that the number of topics is not required to be preset, and the accuracy of topic word extraction is guaranteed.
When the HDP is used to generate the topic word set corresponding to each time segment, the first sentence set may be segmented first, and stop words in the obtained segmentation result may be filtered out. The stop word means that in the information retrieval, in order to save storage space and improve search efficiency, some characters or words can be automatically filtered before or after processing natural language data (or text), and the stop word can include a mood auxiliary word, an adverb, a preposition word, a connection word and the like, and can be set manually according to actual needs. Then, topic modeling is carried out through HDP to form a topic model, and the obtained topic model is used for calculating the word segmentation result after the stop words are filtered out to obtain a first topic word set.
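The pre-processing described here (word segmentation plus stop-word filtering) can be sketched as below. The stop-word list and the whitespace tokenizer are illustrative assumptions, and the HDP modeling step itself would typically be delegated to a library implementation (for example, gensim's HdpModel) rather than written by hand.

```python
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is"}  # illustrative list

def tokenize_for_topics(sentences):
    """Segment each sentence and drop stop words, producing the token lists
    that a topic model (LDA/HDP) would consume."""
    return [[w for w in s.lower().split() if w not in STOPWORDS]
            for s in sentences]
```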
In some possible implementation manners, in order to improve the accuracy and the processing efficiency of the abstract, no more than a certain number of subject words may be selected from the extracted subject words according to the order of the weights from high to low, so as to obtain a first subject word set, where the number of selected subject words is not limited.
S403: and obtaining the article relevance corresponding to each sentence in the first sentence set according to the word frequency of the sentence co-occurrence words in the expected text in the first time segment and the word frequency of the topic co-occurrence words in the expected text in the first time segment.
In the embodiment of the present application, a sentence co-occurrence word is a character or character string appearing simultaneously in any two sentences of the first sentence set, a topic co-occurrence word is a character or character string appearing simultaneously in the first sentence set and among the subject words, and the article relevance represents the possibility that a sentence reflects the central idea of the expected text in the first time segment. The abstract sentences of the expected text in the first time segment are determined according to the article relevance.
It can be understood that the character or the character string in step S403 may be directly extracted, or may be obtained according to a certain rule (e.g., an N-gram rule), and is not described herein again. As the number of occurrences of a character or character string in the desired text within the first time segment increases, the likelihood that a sentence comprising the character or character string will express the idea of the center of the desired text increases.
Therefore, the degree of correlation between the target sentence and the other sentences in the first sentence set can be obtained according to the word frequency of the sentence co-occurrence words included in the target sentence (i.e. any one sentence in the first sentence set) in the first sentence set in the expected text in the first time segment, and the capability of the target sentence to summarize the expression content of the other sentences in the first sentence set is reflected. And according to the word frequency of the topic co-occurrence words included in the target sentence in the expected text in the first time segment, obtaining the degree of correlation between the target sentence and the topic words in the first sentence set, and reflecting the capability of the target sentence for expressing the central thought of the expected text in the first time segment.
It should be noted that, when there is no sentence co-occurrence word between the target sentence and another sentence in the sentence subset, the degree of correlation between the target sentence and that sentence may be set to 0; when the target sentence does not include a subject word in the first subject word set, the degree of correlation between the target sentence and that subject word may be set to 0.
Then, the ability of the target sentence to summarize the meaning of other sentences in the first sentence set and the ability of the target sentence to express the idea of the center of the expected text in the first time segment are integrated, so that the possibility that the target sentence reflects the idea of the center of the expected text in the first time segment, namely the article relevance of the target sentence can be obtained.
In some possible implementation manners of the embodiment of the application, the article relevancy can be obtained by utilizing an optimization algorithm through iteration according to the word frequency of the sentence co-occurrence word in the expected text in the first time segment and the word frequency of the topic co-occurrence word in the expected text in the first time segment. For example, an optimization solution is performed using genetic algorithms with the goal of maximizing the overall quality of the summary (which may include coverage, repeatability, and redundancy, etc.). As the name suggests, the genetic algorithm is a self-adaptive global optimization search algorithm for simulating the heredity and evolution processes of organisms in the natural environment, and individuals with higher adaptability are screened out through action mechanisms such as natural selection, heredity and mutation by means of the principle of genetics.
Then, for the ith iteration of the optimization algorithm, step S403 may specifically include:
and obtaining the article relevance of the target sentence according to: the word frequency, in the expected text in the first time segment, of the sentence co-occurrence words between the target sentence and each sentence in the sentence subset; the article relevance of each sentence in the sentence subset; the word frequency, in the expected text in the first time segment, of the topic co-occurrence words between the target sentence and each subject word in the first subject word set; and the article relevance of each subject word in the first subject word set.
In the embodiment of the application, the target sentence is any one of the abstract sentences in the first sentence set, and the sentence subset includes the rest abstract sentences except the target sentence in the first sentence set. Similar to the article relevance of the sentences, the article relevance of the subject term, which represents the likelihood that it reflects the central idea of the desired text in the first time segment, can be derived from the word frequency of the subject term in the desired text in the first time segment and the article relevance of the sentences including the subject term in the first sentence set.
In one example, the article relevance of the target sentence can be obtained according to the following formula (3):

$$rel(s_{\alpha,\beta}) = \sum_{k=1}^{M} W_{ST}(s_{\alpha,\beta}, t_{\alpha,k})\, rel(t_{\alpha,k}) + \sum_{\gamma=1}^{P} W_{SS}(s_{\alpha,\beta}, s_{\alpha,\gamma})\, rel(s_{\alpha,\gamma}) \qquad (3)$$

in the formula, s_{α,β} is the β-th sentence in the sentence set corresponding to the α-th time segment, s_{α,γ} is the γ-th sentence in the sentence subset corresponding to the α-th time segment, rel(s_{α,β}) is the article relevance of sentence s_{α,β}, rel(s_{α,γ}) is the article relevance of sentence s_{α,γ}, t_{α,k} is the k-th subject word in the subject word set corresponding to the α-th time segment, rel(t_{α,k}) is the article relevance of subject word t_{α,k}, W_ST(s_{α,β}, t_{α,k}) is obtained according to the word frequency, in the expected text in the first time segment, of the topic co-occurrence words of sentence s_{α,β} and subject word t_{α,k}, W_SS(s_{α,β}, s_{α,γ}) is obtained according to the word frequency, in the expected text in the first time segment, of the sentence co-occurrence words of sentences s_{α,β} and s_{α,γ}, M is the number of subject words in the subject word set corresponding to the α-th time segment, and P is the number of sentences in the sentence subset.
Wherein the article relevance rel(t_{α,k}) of subject word t_{α,k} can be obtained according to the following formula (4):

$$rel(t_{\alpha,k}) = \sum_{x=1}^{Q} W_{TS}(t_{\alpha,k}, s_{\alpha,x})\, rel(s_{\alpha,x}) \qquad (4)$$

in the formula, s_{α,x} is the x-th sentence in the sentence set corresponding to the α-th time segment, rel(s_{α,x}) is the article relevance of sentence s_{α,x}, W_TS(t_{α,k}, s_{α,x}) is obtained according to the word frequency, in the expected text in the first time segment, of the topic co-occurrence words of subject word t_{α,k} and sentence s_{α,x}, W_TS(t_{α,k}, s_{α,x}) is equal to W_ST(s_{α,x}, t_{α,k}), and Q is the number of sentences in the sentence set corresponding to the α-th time segment.
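Formulas (3) and (4) define a mutually recursive scoring of sentences and subject words, which can be solved by fixed-point iteration. The sketch below makes several stated assumptions: the weight matrices W_SS and W_ST are taken as given inputs, scores are initialized uniformly, and both score vectors are normalized each round (the description later notes that normalization may be performed).

```python
def corank(W_ss, W_st, iters=100):
    """Iterate formulas (3)-(4). W_ss[b][g] is the sentence-sentence weight,
    W_st[b][k] the sentence-topic weight (W_ts is its transpose)."""
    P, M = len(W_ss), len(W_st[0])
    rel_s = [1.0 / P] * P
    rel_t = [1.0 / M] * M
    for _ in range(iters):
        # formula (4): a subject word is relevant if relevant sentences contain it
        rel_t = [sum(W_st[x][k] * rel_s[x] for x in range(P)) for k in range(M)]
        # formula (3): topic-word support plus support from the other sentences
        rel_s = [sum(W_st[b][k] * rel_t[k] for k in range(M)) +
                 sum(W_ss[b][g] * rel_s[g] for g in range(P) if g != b)
                 for b in range(P)]
        zt, zs = sum(rel_t) or 1.0, sum(rel_s) or 1.0  # normalization
        rel_t = [v / zt for v in rel_t]
        rel_s = [v / zs for v in rel_s]
    return rel_s, rel_t
```

A sentence with no co-occurrence weight to any subject word or sibling sentence ends up with relevance 0, matching the zero-setting rules above.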
As an example, W_ST(s_{α,β}, t_{α,k}) can be obtained according to the following formula (5):

$$W_{ST}(s_{\alpha,\beta}, t_{\alpha,k}) = \sum_{t} w_t(s_{\alpha,\beta}, t_{\alpha,k}) \qquad (5)$$

in the formula, w_t(s_{α,β}, t_{α,k}) is obtained according to the word frequency, in the expected text in the first time segment, of the t-th topic co-occurrence word of sentence s_{α,β} and subject word t_{α,k}, and may be the TF-IDF weight of that topic co-occurrence word.

It should be noted that when sentence s_{α,β} and subject word t_{α,k} have no topic co-occurrence word (i.e., sentence s_{α,β} does not include subject word t_{α,k}), W_ST(s_{α,β}, t_{α,k}) is set to 0.
As an example, W_SS(s_{α,β}, s_{α,γ}) can be obtained according to the following formula (6):

$$W_{SS}(s_{\alpha,\beta}, s_{\alpha,\gamma}) = \sum_{t} w_t(s_{\alpha,\beta}, s_{\alpha,\gamma}) \qquad (6)$$

in the formula, w_t(s_{α,β}, s_{α,γ}) is obtained according to the word frequency, in the expected text in the first time segment, of the t-th sentence co-occurrence word of sentences s_{α,β} and s_{α,γ}, and may be the TF-IDF weight of that sentence co-occurrence word.

It should be noted that when sentences s_{α,β} and s_{α,γ} have no sentence co-occurrence word (i.e., sentences s_{α,β} and s_{α,γ} share no identical character string), W_SS(s_{α,β}, s_{α,γ}) may be set to 0.
TF-IDF, i.e., term frequency-inverse document frequency, is a weighting technique commonly used in information retrieval and data mining to evaluate how important a word is to a document within a document set or corpus. The importance of a word increases in proportion to the number of times it appears in the document, but decreases in inverse proportion to the frequency with which it appears across the corpus.
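A minimal TF-IDF sketch matching this description (the +1 smoothing in the IDF denominator is an illustrative convention; libraries differ in the exact variant they implement):

```python
import math

def tf_idf(term, doc, corpus):
    """doc is a token list; corpus is a list of token lists."""
    tf = doc.count(term) / len(doc)           # frequency within the document
    df = sum(1 for d in corpus if term in d)  # documents containing the term
    idf = math.log(len(corpus) / (1 + df))    # decreases as the term gets common
    return tf * idf
```

A term appearing in every document is down-weighted relative to one confined to a single document, as the paragraph above states.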
In practical application, normalization processing may be performed on the article relevance of each sentence in the first sentence set, and on the article relevance of each subject word in the first subject word set.
S404: and selecting sentences which accord with the second screening condition from the first sentence set according to the article relevance of the sentences to obtain a first abstract set.
It can be understood that the article relevance of the sentence represents the possibility that the sentence reflects the central idea of the desired text in the first time segment, and the abstract sentences (i.e. the first abstract set) of the desired text in the first time segment can be obtained by screening the sentences meeting the second screening condition in the first sentence set by using the article relevance of the sentence.
In practical application, the sentences in the first sentence set can be sorted in descending order of article relevance, and the first J sentences selected to form the first abstract set; J can be set according to practical conditions and is not limited here.
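The top-J selection just described is a simple sort (J = 3 below is an arbitrary illustrative value):

```python
def top_j(sentences, relevance, j=3):
    """Keep the j sentences with the highest article relevance."""
    order = sorted(range(len(sentences)), key=lambda i: relevance[i], reverse=True)
    return [sentences[i] for i in order[:j]]
```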
It should be noted that the above description illustrates how to obtain the abstract set. Because the article relevance of each abstract sentence is obtained when the abstract set is obtained, in some possible implementation manners of the embodiment of the application, in order to improve the accuracy of the generated abstract and remove abstract sentences which cannot accurately represent the idea of the center of the expected text, the abstract set can be screened according to the article relevance of the abstract sentences. Then, the step S103 may specifically include:
the article relevance of each abstract sentence in the first abstract set is synthesized to obtain the article relevance corresponding to the first abstract set; and selecting a summary set which meets the first screening condition from the plurality of summary sets based on the obtained difference degree of each summary and the relevance degree of each article.
As an example, the article relevance corresponding to an abstract set can be obtained according to the following formula:

$$R(DS_i) = \frac{1}{V} \sum_{y=1}^{V} rel(s_{i,y})$$

in the formula, R(DS_i) is the article relevance of the abstract set DS_i corresponding to the i-th time segment, rel(s_{i,y}) is the article relevance of the y-th abstract sentence in DS_i, and V is the number of abstract sentences in DS_i.
In specific implementation, the abstract difference degree and the article relevance degree can be integrated to judge whether each abstract set meets the first screening condition. For example, it is determined whether the sum of the summarization difference degree and the article relevance degree corresponding to the summarization set is greater than a certain threshold value. In addition, corresponding weights can be set for the abstract difference degree and the article relevance degree corresponding to the abstract set according to actual conditions.
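The screening described here can be sketched as a weighted-sum threshold test; the weights and the threshold below are assumptions to be tuned, not values prescribed by the description:

```python
def select_summary_sets(difference, relevance, w=(0.5, 0.5), threshold=0.6):
    """Return indices of summary sets whose weighted score passes the bar."""
    return [i for i, (c, r) in enumerate(zip(difference, relevance))
            if w[0] * c + w[1] * r > threshold]
```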
The above details have described how to obtain the abstract sets and the abstract difference and article relevance of the abstract sets, so that the abstract sets which are duplicated or redundant with other abstract sets in a plurality of abstract sets and abstract sets which cannot accurately express the idea of the desired text center can be removed. In practical applications, however, for a certain abstract set, there may be repeated content or redundant content of the abstract sentence included therein. Therefore, in some possible implementation manners of the embodiment of the application, whether summary sentences included in a single summary set are repeated or redundant can be judged, so that the repeatability and the redundancy of the generated summary are further reduced.
Specifically, refer to fig. 5, which is a schematic flow chart of another digest generation method provided in the embodiment of the present application.
Based on the summary generation method provided in the foregoing, taking the method shown in fig. 1 as an example, in some possible implementations, after obtaining a plurality of summary sets (i.e., step S101), the method may further include:
S501: and obtaining the novelty of the target abstract sentence based on the recurrence state of the characters or character strings in the target abstract sentence in each abstract sentence in the first abstract subset.
It is understood that the target abstract sentence is any one of the abstract sentences in the first abstract set, and the first abstract subset includes the remaining abstract sentences in the first abstract set except the target abstract sentence.
In the embodiment of the present application, the recurrence state refers to whether a character or character string in the target abstract sentence appears in the other sentences in the first abstract subset. In practical application, the intersection of the characters or character strings of the target abstract sentence with those included in the abstract sentences of the first abstract subset may be counted first, and the novelty of the target abstract sentence may be obtained according to the number of its characters or character strings that do not belong to the intersection.
The novelty reflects how much new content is introduced by the target abstract sentence for other abstract sentences in the first abstract set, and the greater the number of characters or character strings in the target abstract sentence which are not reproduced in the first abstract subset, the greater the novelty of the target abstract sentence. When the characters or character strings in the target abstract sentence appear in other sentences in the first abstract subset (namely the same character strings exist between the target abstract sentence and other sentences in the first abstract set), the two sentences comprise the same information, repeated content or redundant content may exist, the new content introduced by the target abstract sentence is less, and the novelty is lower; on the contrary, when the character or character string in the target abstract sentence does not appear in other sentences in the first abstract subset (i.e. the same character string does not exist between the target abstract sentence and other sentences in the first abstract set), the two sentences do not include the same information, and the target abstract sentence introduces more new content and has higher novelty. Therefore, the novelty of the target abstract sentence can be obtained according to the reproduction state of the characters or character strings in the target abstract sentence in each abstract sentence in the first abstract subset.
It should be noted that the characters or character strings described herein may be extracted from the target abstract sentence at will, or extracted according to a certain rule (e.g., N-gram), and are not described again.
As an example, the novelty of the target abstract sentence may be obtained according to a sum of ratios of the number of character strings, which are not reproduced in the abstract sentence in the first abstract subset, among the character strings corresponding to the target abstract sentence to the number of character strings corresponding to the target abstract sentence.
Then, the novelty of the target abstract sentence can be obtained according to the following formula (7):

$$nov(s_{i,a}) = \sum_{b=1}^{H} nov(s_{i,a}, s_{i,b}) = \sum_{b=1}^{H} \frac{\left|\{\text{n-gram} \mid \text{n-gram} \in s_{i,a},\ \text{n-gram} \notin s_{i,b}\}\right|}{\left|\{\text{n-gram} \mid \text{n-gram} \in s_{i,a}\}\right|} \qquad (7)$$

in the formula, s_{i,a} is the a-th abstract sentence in the abstract set corresponding to the i-th time segment, s_{i,b} is the b-th abstract sentence in the abstract subset corresponding to the i-th time segment, nov(s_{i,a}, s_{i,b}) is the novelty of abstract sentence s_{i,a} with respect to abstract sentence s_{i,b}, H is the number of abstract sentences in the abstract subset corresponding to the i-th time segment, n-gram denotes a character or character string, and |·| denotes the number of elements.
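The novelty computation of formula (7) can be sketched as below (n-grams are collected as sets, matching the sum-of-ratios description above):

```python
def ngrams(text, n=3):
    """All overlapping character n-grams of a string, as a set."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def novelty(target, subset, n=3):
    """Sum, over the other abstract sentences, of the share of the target's
    n-grams that each other sentence does not contain (formula (7))."""
    grams = ngrams(target, n)
    if not grams:
        return 0.0
    return sum(len(grams - ngrams(other, n)) / len(grams) for other in subset)
```

A sentence identical to one already in the subset contributes 0 to the sum, so duplicated content drives novelty down, as the discussion above requires.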
In some possible designs, the abstract sentences in the abstract set can be screened according to the novelty of the abstract sentences, the abstract sentences with the novelty lower than a set threshold value are removed, and repeated contents or redundant contents in the abstract set are eliminated. Then, after step S501, the method may further include: and based on the novelty of each abstract sentence in the first abstract set, eliminating abstract sentences which do not meet the third screening condition from the first abstract set. In practical applications, the third screening condition may be adaptively set, which is not described herein again.
S502: and integrating the novelty of each abstract sentence in the first abstract set to obtain the novelty of the first abstract set.
It is understood that the novelty of the first abstract set is related to the novelty of the abstract sentences it includes, so the novelty of the first abstract set can be obtained by integrating the novelty of each abstract sentence in the first abstract set.
As an example, the novelty of the first abstract set can be obtained according to the following formula (8):

$$Nov(DS_i) = \frac{1}{N} \sum_{c=1}^{N} nov(s_{i,c}) \qquad (8)$$

in the formula, DS_i is the abstract set corresponding to the i-th time segment, s_{i,c} is the c-th abstract sentence in the abstract set corresponding to the i-th time segment, and N is the total number of abstract sentences included in DS_i.
Then, in order to reduce the repeatability and redundancy of the generated summary, the step S103 may specifically include:
and selecting the abstract set meeting the first screening condition from the plurality of abstract sets based on the obtained difference degree of each abstract and the novelty degree of each abstract set.
It can be understood that the summary sets which are duplicated or redundant with other summary sets in a plurality of summary sets can be removed based on the summary difference degree, and the summary sets which are duplicated or redundant in content can be removed based on the novelty degree, so that the repeatability and the redundancy of the generated summary are further reduced.
In practical application, the first screening condition may be set according to specific needs, for example, the sum of the summary difference and the novelty needs to be greater than a certain threshold, and the like, which is not limited herein. In addition, corresponding weights can be set for the abstract difference degree and the novelty degree respectively for judgment.
In some possible implementations, in order to reduce the repeatability and redundancy of the generated summary, the step S103 may further include:
and selecting the abstract sets meeting the first screening condition from the plurality of abstract sets based on the obtained difference degree and article relevance of each abstract and the novelty of each abstract set.
It can be understood that the abstract sets which are repeated or redundant with other abstract sets in a plurality of abstract sets can be removed based on the abstract difference degree, the abstract sets which cannot accurately express the central idea of the expected text can be removed based on the article relevance degree, the abstract sets which have repeated or redundant contents can be removed based on the novelty degree, and the central idea of the expected text can be accurately and briefly described according to the selected abstract sets.
Based on the summary generation method provided by the embodiment, the embodiment of the application also provides a summary generation device.
Referring to fig. 6, this figure is a schematic structural diagram of a summary generation apparatus according to an embodiment of the present application.
The abstract generation device provided by the embodiment of the application comprises: a first obtaining unit 100, a second obtaining unit 200, a screening unit 300, and a combining unit 400;
the first obtaining unit 100 is configured to obtain a plurality of abstract sets, where each abstract set comprises abstract sentences of the expected text in a corresponding time segment, and the time segments corresponding to any two abstract sets are different;
the second obtaining unit 200 is configured to obtain, according to the abstract sentences included in the abstract sets, the difference degree of the first abstract set to the second abstract set, so as to obtain the abstract difference degree corresponding to the first abstract set, where the first abstract set and the second abstract set are any two of the plurality of abstract sets;
the screening unit 300 is configured to select, from the plurality of abstract sets, an abstract set that meets the first screening condition based on each obtained abstract difference degree; and
the combining unit 400 is configured to combine the abstract sentences included in the selected abstract set to generate an abstract.
In some possible implementations of the embodiment of the present application, the second obtaining unit 200 may specifically include: a difference degree obtaining subunit and a synthesizing subunit;
the difference degree obtaining subunit is configured to obtain the difference degree of each abstract sentence in the first abstract set to the second abstract set based on the reproduction state, in the second abstract set, of the characters or character strings in each abstract sentence in the first abstract set; and
the synthesizing subunit is configured to synthesize the difference degrees of the abstract sentences in the first abstract set to the second abstract set, to obtain the abstract difference degree corresponding to the first abstract set.
In some possible implementations of the embodiment of the present application, the difference degree obtaining subunit may specifically include: an extraction subunit, a statistics subunit, and an acquisition subunit;
the extraction subunit is configured to extract a plurality of character strings from a target abstract sentence according to a preset rule, where the target abstract sentence is any abstract sentence in the first abstract set;
the statistics subunit is configured to count, among the plurality of character strings, the number of character strings that are not reproduced in the second abstract set, to obtain a statistical value; and
the acquisition subunit is configured to obtain the difference degree of the target abstract sentence to the second abstract set according to the statistical value and the number of the plurality of character strings.
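A minimal sketch of these extraction, statistics, and acquisition steps follows. The "preset rule" is assumed here to be character-bigram extraction, and the difference degree is assumed to be the ratio of the statistical value to the total number of extracted strings; the patent specifies neither choice, so both are illustrative.

```python
def sentence_difference(target_sentence, second_set, n=2):
    # extraction step: character n-grams as the assumed 'preset rule'
    grams = [target_sentence[i:i + n] for i in range(len(target_sentence) - n + 1)]
    if not grams:
        return 0.0
    # statistics step: count strings that never reappear in the second set
    pool = "\n".join(second_set)
    statistical_value = sum(1 for g in grams if g not in pool)
    # acquisition step: ratio of the statistical value to the string count (assumed)
    return statistical_value / len(grams)

def abstract_difference(first_set, second_set, n=2):
    # synthesis: average the per-sentence difference degrees (assumed combination)
    return sum(sentence_difference(s, second_set, n) for s in first_set) / len(first_set)
```

A difference degree of 1.0 means no string of the sentence reappears in the second set; 0.0 means every string does.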
In some possible implementations of the embodiment of the present application, the first obtaining unit 100 may specifically include: a sentence dividing subunit, a topic obtaining subunit, a relevance obtaining subunit, and an abstract sentence selecting subunit;
the sentence dividing subunit is configured to obtain a sentence dividing result of the expected text in a first time segment to obtain a first sentence set, where the first time segment is the time segment corresponding to the first abstract set;
the topic obtaining subunit is configured to extract topic words from the first sentence set by using a topic model to obtain a first topic word set;
the relevance obtaining subunit is configured to obtain the article relevance corresponding to each sentence in the first sentence set according to the word frequency of sentence co-occurrence words in the expected text in the first time segment and the word frequency of topic co-occurrence words in the expected text in the first time segment; here, a sentence co-occurrence word is a character or character string that appears in two sentences of the first sentence set at the same time, a topic co-occurrence word is a character or character string that appears in both the first sentence set and the topic words, and the article relevance represents the likelihood that a sentence reflects the central idea of the expected text in the first time segment; and
the abstract sentence selecting subunit is configured to select, from the first sentence set according to the article relevance, sentences that meet the second screening condition, to obtain the first abstract set.
Optionally, the relevance obtaining subunit may be specifically configured to:
iterate by using an optimization algorithm to obtain the article relevance, where, for the ith iteration:
the article relevance of the target sentence is obtained according to: the word frequency, in the expected text in the first time segment, of the sentence co-occurrence words between the target sentence and each sentence in the sentence subset; the article relevance of each sentence in the sentence subset; the word frequency, in the expected text in the first time segment, of the topic co-occurrence words between the target sentence and each topic word in the target topic word set; and the article relevance of each topic word in the first topic word set;
the target sentence is any sentence in the first sentence set, the sentence subset comprises the sentences in the first sentence set other than the target sentence, and the article relevance of a topic word is obtained according to the word frequency of the topic word in the expected text in the first time segment and the article relevance of the sentences in the first sentence set that include the topic word.
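The mutual-reinforcement iteration described above can be sketched as follows: sentences are scored by relevant sentences they share co-occurrence words with and by relevant topic words they contain, each weighted by word frequency in the expected text. The concrete update rule, the uniform initialisation, the per-iteration normalisation, and the iteration count are all illustrative assumptions; the patent names only "an optimization algorithm" and the quantities that feed each iteration.

```python
from collections import Counter

def article_relevance(sentences, topic_words, text, iters=20):
    freq = Counter(text.split())                 # word frequency in the expected text
    tokens = [set(s.split()) for s in sentences]
    s_rel = [1.0] * len(sentences)               # sentence relevance, uniform start
    t_rel = {w: 1.0 for w in topic_words}        # topic-word relevance
    for _ in range(iters):
        updated = []
        for i, ti in enumerate(tokens):
            # support from other sentences sharing co-occurrence words,
            # weighted by those words' frequencies
            from_sents = sum(s_rel[j] * sum(freq[w] for w in ti & tj)
                             for j, tj in enumerate(tokens) if j != i)
            # support from topic words appearing in the sentence
            from_topics = sum(t_rel[w] * freq[w] for w in ti if w in t_rel)
            updated.append(from_sents + from_topics)
        total = sum(updated) or 1.0
        s_rel = [v / total for v in updated]     # normalise to keep scores bounded
        # a topic word's relevance: its frequency times the relevance of
        # the sentences that contain it
        t_rel = {w: freq[w] * sum(s_rel[i] for i, ti in enumerate(tokens) if w in ti)
                 for w in topic_words}
    return s_rel
```

The sentences whose scores survive the second screening condition (e.g. a top-m cutoff) would then form the first abstract set.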
In some possible implementations of the embodiment of the present application, the screening unit 300 may be specifically configured to:
synthesize the article relevance of each abstract sentence in the first abstract set to obtain the article relevance corresponding to the first abstract set; and select, from the plurality of abstract sets, an abstract set that meets the first screening condition based on each obtained abstract difference degree and each obtained article relevance.
In some possible implementations of the embodiment of the present application, the apparatus may further include: a third obtaining unit and an integrating unit;
the third obtaining unit is configured to obtain the novelty of a target abstract sentence based on the reproduction state, in each abstract sentence in a first abstract subset, of the characters or character strings in the target abstract sentence, where the target abstract sentence is any abstract sentence in the first abstract set, and the first abstract subset comprises the abstract sentences in the first abstract set other than the target abstract sentence; and
the integrating unit is configured to integrate the novelty of each abstract sentence in the first abstract set to obtain the novelty of the first abstract set.
In this case, the screening unit 300 may be specifically configured to:
select, from the plurality of abstract sets, an abstract set that meets the first screening condition based on each obtained abstract difference degree and the novelty of each abstract set.
In some possible implementations of the embodiment of the present application, the screening unit 300 may be further configured to remove, from the first abstract set, abstract sentences that do not meet the third screening condition, based on the novelty of each abstract sentence in the first abstract set.
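The novelty computation and the third screening can be sketched by reusing the reproduction-state idea within a single abstract set. The bigram rule, the ratio, the mean as the integration step, and the minimum-novelty threshold standing in for the "third screening condition" are all assumptions made for illustration:

```python
def novelty(sentence, siblings, n=2):
    # fraction of the sentence's character n-grams that do not reappear
    # in the other sentences of the same abstract set
    grams = [sentence[i:i + n] for i in range(len(sentence) - n + 1)]
    if not grams:
        return 0.0
    pool = "\n".join(siblings)
    return sum(1 for g in grams if g not in pool) / len(grams)

def prune_low_novelty(abstract_set, threshold=0.3):
    # third screening condition, assumed here to be a minimum novelty
    return [s for i, s in enumerate(abstract_set)
            if novelty(s, abstract_set[:i] + abstract_set[i + 1:]) >= threshold]

def set_novelty(abstract_set, n=2):
    # integration step: mean per-sentence novelty as the set's novelty (assumed)
    return sum(novelty(s, abstract_set[:i] + abstract_set[i + 1:], n)
               for i, s in enumerate(abstract_set)) / len(abstract_set)
```

A sentence that merely restates its siblings scores near 0 and is pruned, which is the in-set analogue of the cross-set difference degree.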
In the embodiment of the present application, abstract sets of the expected text in a plurality of different time segments are first obtained, yielding a plurality of abstract sets. Then, according to the abstract sentences included in the abstract sets, the difference degree between any two of the plurality of abstract sets is obtained, giving the abstract difference degree corresponding to each abstract set; the abstract difference degree reflects the degree of repetition between an abstract set and the other abstract sets. Next, based on the abstract difference degree corresponding to each abstract set, the abstract sets that meet the first screening condition are selected from the plurality of abstract sets. Because the progression and variability of the abstract topics across different time segments are considered, and the repetition between the abstract sets of different time segments is assessed, the abstract sets that are largely repeated or redundant with respect to the others are removed before the abstract sentences of the selected abstract sets are combined into the abstract. Repetitive and redundant abstract content can thus be reduced while the coverage of the abstract is preserved, enabling readers to accurately grasp the central idea of the expected text and the development of events.
Based on the abstract generation method and apparatus provided in the foregoing embodiments, an embodiment of the present application further provides a computer-readable storage medium on which a computer program is stored; when the computer program is executed by a processor, any one of the abstract generation methods provided in the foregoing embodiments is implemented.
Based on the abstract generation method and apparatus provided in the foregoing embodiments, an embodiment of the present application further provides an abstract generation device, comprising: a processor and a memory; the memory is configured to store program code and transmit the program code to the processor; and the processor is configured to execute, according to instructions in the program code, any one of the abstract generation methods provided in the foregoing embodiments.
It should be noted that, in the present specification, the embodiments are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments may be referred to each other. The system or the device disclosed by the embodiment corresponds to the method disclosed by the embodiment, so that the description is simple, and the relevant points can be referred to the method part for description.
It is further noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in Random Access Memory (RAM), memory, Read Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The foregoing is merely a preferred embodiment of the present application and is not intended to limit the present application in any way. Although the present application has been described with reference to the preferred embodiments, it is not intended to limit the present application. Those skilled in the art can now make numerous possible variations and modifications to the disclosed embodiments, or modify equivalent embodiments, using the methods and techniques disclosed above, without departing from the scope of the claimed embodiments. Therefore, any simple modification, equivalent change and modification made to the above embodiments according to the technical essence of the present application still fall within the protection scope of the technical solution of the present application without departing from the content of the technical solution of the present application.

Claims (10)

1. A method for generating a summary, the method comprising:
obtaining a plurality of abstract sets; each abstract set comprises abstract sentences of expected texts in corresponding time segments, and the time segments corresponding to any two abstract sets are different;
acquiring the difference degree of a first abstract set to a second abstract set according to abstract sentences included in the abstract sets to obtain the abstract difference degree corresponding to the first abstract set; the first and second digest sets are any two of the plurality of digest sets;
selecting a summary set which meets a first screening condition from the plurality of summary sets based on each obtained summary difference degree;
combining the abstract sentences in the selected abstract set to generate an abstract;
wherein:
selecting a summary set meeting a first screening condition from the plurality of summary sets based on each obtained summary difference, specifically comprising:
the article relevance of each abstract sentence in the first abstract set is synthesized to obtain the article relevance corresponding to the first abstract set; selecting a summary set which meets a first screening condition from the plurality of summary sets based on the obtained difference degree of each summary and the relevance degree of each article; or,
and selecting a summary set meeting a first screening condition from the plurality of summary sets based on the obtained difference degree of each summary and the novelty degree of each summary set.
2. The method according to claim 1, wherein the acquiring, according to the abstract sentences included in the abstract sets, the difference degree of the first abstract set to the second abstract set to obtain the abstract difference degree corresponding to the first abstract set specifically comprises:
obtaining the difference degree of each abstract sentence in the first abstract set to the second abstract set based on the reappearance state of the characters or character strings in each abstract sentence in the first abstract set in the second abstract set;
and integrating the difference degree of each abstract sentence in the first abstract set to the second abstract set to obtain the abstract difference degree corresponding to the first abstract set.
3. The method according to claim 2, wherein the obtaining the difference degree of each abstract sentence in the first abstract set to the second abstract set based on the reproduction state, in the second abstract set, of the character or the character string in each abstract sentence in the first abstract set specifically comprises:
extracting a plurality of character strings from the target abstract sentence according to a preset rule; the target abstract sentence is any one abstract sentence in the first abstract set;
counting the number of character strings which are not reproduced in the second abstract set in the plurality of character strings to obtain a statistical value;
and acquiring the difference degree of the target abstract sentence to the second abstract set according to the statistical value and the number of the character strings.
4. The method according to any one of claims 1 to 3, wherein the obtaining a plurality of digest sets specifically comprises:
obtaining a sentence dividing result of an expected text in a first time segment to obtain a first sentence set; the first time segment is a time segment corresponding to the first abstract set;
extracting a subject word from the first sentence set by using a subject model to obtain a first subject word set;
obtaining article relevance corresponding to each sentence in the first sentence set according to the word frequency of the sentence co-occurrence words in the expected text in the first time segment and the word frequency of the topic co-occurrence words in the expected text in the first time segment; the sentence co-occurrence words are characters or character strings which appear in any two sentences in the first sentence set at the same time, the topic co-occurrence words are characters or character strings which appear in the first sentence set and the topic words at the same time, and the article relevance represents the possibility of reflecting the central thought of the expected text in the first time segment;
and selecting sentences which accord with a second screening condition from the first sentence set according to the article relevance to obtain the first abstract set.
5. The method according to claim 4, wherein obtaining article relevance corresponding to each sentence in the first sentence set according to the word frequency of the sentence co-occurrence word in the expected text in the first time segment and the word frequency of the topic co-occurrence word in the expected text in the first time segment specifically comprises:
and iterating by using an optimization algorithm to obtain the article relevancy, and aiming at the ith iteration:
obtaining article relevance of the target sentence according to the word frequency of the sentence co-occurrence words of each sentence in the target sentence and the sentence subset in the expected text in the first time segment, the article relevance of each sentence in the sentence subset, the word frequency of the topic co-occurrence words of each topic word in the target sentence and the target topic word set in the expected text in the first time segment and the article relevance of each topic word in the first topic word set;
the target sentence is any abstract sentence in the first sentence set, the sentence subset includes other abstract sentences except the target sentence in the first sentence set, and the article relevance of the subject word is obtained according to the word frequency of the desired text of the subject word in the first time segment and the article relevance of the sentence including the subject word in the first sentence set.
6. The method of any of claims 1-3, wherein obtaining the plurality of digest sets further comprises:
obtaining the novelty of a target abstract sentence based on the reproduction state of characters or character strings in the target abstract sentence in each abstract sentence in a first abstract subset; the target abstract sentence is any one abstract sentence in the first abstract set, and the first abstract subset comprises the other abstract sentences except the target abstract sentence in the first abstract set;
and integrating the novelty of each abstract sentence in the first abstract set to obtain the novelty of the first abstract set.
7. The method of claim 6, wherein obtaining the novelty of the abstract sentence in the first abstract set further comprises:
and based on the novelty of each abstract sentence in the first abstract set, eliminating abstract sentences which do not meet a third screening condition from the first abstract set.
8. An apparatus for generating a summary, the apparatus comprising: the device comprises a first obtaining unit, a second obtaining unit, a screening unit and a combining unit;
the first obtaining unit is used for obtaining a plurality of summary sets; each abstract set comprises abstract sentences of expected texts in corresponding time segments, and the time segments corresponding to any two abstract sets are different;
the second obtaining unit is configured to obtain a difference degree of the first abstract set to the second abstract set according to the abstract sentences included in the abstract sets, so as to obtain an abstract difference degree corresponding to the first abstract set; the first and second digest sets are any two of the plurality of digest sets;
the screening unit is used for selecting an abstract set meeting a first screening condition from the plurality of abstract sets based on each obtained abstract difference degree;
the combination unit is used for combining the abstract sentences in the selected abstract set to generate an abstract;
wherein: the screening unit is specifically configured to:
the article relevance of each abstract sentence in the first abstract set is synthesized to obtain the article relevance corresponding to the first abstract set; selecting a summary set which meets a first screening condition from the plurality of summary sets based on the obtained difference degree of each summary and the relevance degree of each article; or,
and selecting a summary set meeting a first screening condition from the plurality of summary sets based on the obtained difference degree of each summary and the novelty degree of each summary set.
9. A computer-readable storage medium, having stored thereon a computer program which, when executed by a processor, implements the digest generation method according to any one of claims 1 to 7.
10. A digest generation apparatus characterized by comprising: a processor and a memory;
the memory is used for storing program codes and transmitting the program codes to the processor;
the processor, configured to execute the digest generation method according to any one of claims 1 to 7 according to instructions in the program code.
CN201811626213.XA 2018-12-28 2018-12-28 Abstract generation method and device Active CN109815328B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811626213.XA CN109815328B (en) 2018-12-28 2018-12-28 Abstract generation method and device


Publications (2)

Publication Number Publication Date
CN109815328A CN109815328A (en) 2019-05-28
CN109815328B true CN109815328B (en) 2021-05-25

Family

ID=66602705

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811626213.XA Active CN109815328B (en) 2018-12-28 2018-12-28 Abstract generation method and device

Country Status (1)

Country Link
CN (1) CN109815328B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111813925A (en) * 2020-07-14 2020-10-23 混沌时代(北京)教育科技有限公司 Semantic-based unsupervised automatic summarization method and system
CN112328783A (en) * 2020-11-24 2021-02-05 腾讯科技(深圳)有限公司 Abstract determining method and related device
CN114722194B (en) * 2022-03-15 2023-05-09 电子科技大学 Automatic construction method for emergency time sequence based on abstract generation algorithm

Citations (5)

Publication number Priority date Publication date Assignee Title
CN101231634A (en) * 2007-12-29 2008-07-30 中国科学院计算技术研究所 Autoabstract method for multi-document
CN101446940A (en) * 2007-11-27 2009-06-03 北京大学 Method and device of automatically generating a summary for document set
CN102043851A (en) * 2010-12-22 2011-05-04 四川大学 Multiple-document automatic abstracting method based on frequent itemset
CN106874469A (en) * 2017-02-16 2017-06-20 北京大学 A kind of news roundup generation method and system
CN107526841A (en) * 2017-09-19 2017-12-29 中央民族大学 A kind of Tibetan language text summarization generation method based on Web


Non-Patent Citations (3)

Title
"Geographic Location Query System Based on Google Map"; Sheng Yadong; China Master's Theses Full-text Database, Information Science and Technology; 2012-08-15; pp. 10-11 *
"Research and Application of Timeline-based Event Organization and Summarization Technology"; Li Hui; China Master's Theses Full-text Database, Information Science and Technology; 2012-07-15; Chapter 4 *

Also Published As

Publication number Publication date
CN109815328A (en) 2019-05-28


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant