CN113918708A - Abstract extraction method - Google Patents

Abstract extraction method

Info

Publication number
CN113918708A
Authority
CN
China
Prior art keywords
words
word
level
semantic
text
Prior art date
Legal status
Granted
Application number
CN202111532196.5A
Other languages
Chinese (zh)
Other versions
CN113918708B (en)
Inventor
胡为民
郑喜
Current Assignee
Shenzhen Dib Enterprise Risk Management Technology Co., Ltd.
Original Assignee
Shenzhen Dib Enterprise Risk Management Technology Co., Ltd.
Priority date
Filing date
Publication date
Application filed by Shenzhen Dib Enterprise Risk Management Technology Co., Ltd.
Priority to CN202111532196.5A
Publication of CN113918708A
Application granted
Publication of CN113918708B
Active legal status (current)
Anticipated expiration

Classifications

    • G06F16/345 Information retrieval of unstructured textual data: summarisation for human users
    • G06F16/374 Creation of semantic tools: thesaurus
    • G06F18/2135 Pattern recognition, feature extraction based on approximation criteria, e.g. principal component analysis
    • G06F40/30 Handling natural language data: semantic analysis
    • G06N3/044 Neural network architectures: recurrent networks, e.g. Hopfield networks
    • G06N3/048 Neural network architectures: activation functions

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to the technical field of natural language processing, and in particular to an abstract extraction method comprising the following steps: S1, preprocessing: generalizing the numeric and time-type data in the bulletin text; S2, constructing a first vocabulary; S3, constructing word co-occurrence matrices for the first vocabulary; S4, reducing the dimensionality of the word co-occurrence matrices and extracting semantic representations of all words in the first vocabulary; S5, repeating S2 to S4 to extract semantic representations of all words in the bulletin text; S6, accumulating the word representations sentence by sentence to form sentence-context semantic representations; S7, receiving a key phrase input by the user and extracting its semantic representation; S8, computing the similarity between the key-phrase semantic representation and each sentence-context semantic representation, and, where the similarity exceeds a set value, extracting the bulletin text sentence containing the key phrase into the bulletin text abstract. The extracted abstract correlates strongly with the keywords input by the user.

Description

Abstract extraction method
Technical Field
The invention relates to the technical field of natural language processing, and in particular to an abstract extraction method.
Background
At present, the number of listed companies grows by the day, and their bulletins, which disclose temporary or annual financial and business conditions, contain a large amount of information. However, bulletin texts lack standard writing conventions and run long, which hinders reading and data analysis; data analysts and auditors must manually extract key sentences and other information from them, so working efficiency is low. It is therefore necessary to provide an abstract extraction method for listed-company bulletin texts that compresses the text, removes the 'redundant' information analysts and auditors do not care about, and improves their working efficiency.
Related abstract extraction methods exist, but they mainly search the full text for sentences containing the keywords, or words semantically similar to the keywords, and assemble the extracted sentences into an abstract, relying chiefly on word-vector similarity calculation. Applied to listed-company bulletin texts, these methods have shortcomings: they consider only the semantic association among keywords, not the association between the keywords and paragraphs or chapters, yet some keywords run through the entire bulletin text, so the extracted abstract is not accurate enough.
Disclosure of Invention
The invention provides an abstract extraction method to solve the problem that abstracts extracted by existing methods are not accurate enough; the abstract it extracts correlates strongly with the keywords input by the user.
An abstract extraction method comprises the following steps:
S1, preprocessing: generalizing the numeric and time-type data in the bulletin text;
S2, constructing a first vocabulary;
S3, constructing word co-occurrence matrices for the first vocabulary;
S4, reducing the dimensionality of the word co-occurrence matrices and extracting semantic representations of all words in the first vocabulary;
S5, repeating S2 to S4 to extract semantic representations of all words in the bulletin text;
S6, accumulating the word representations sentence by sentence to form sentence-context semantic representations;
S7, receiving a key phrase input by the user and extracting its semantic representation;
S8, computing the similarity between the key-phrase semantic representation and each sentence-context semantic representation; where the similarity exceeds a set value, extracting the bulletin text sentence containing the key phrase into the bulletin text abstract.
In this method, semantic representations of words are extracted, the similarity between each sentence-context semantic representation and the key-phrase semantic representation is judged, and the sentences whose similarity exceeds the set value are extracted to form the bulletin text abstract, so the abstract correlates strongly with the keywords input by the user.
Further, the numerals in the bulletin text Text are replaced with Chinese-character numerals, and the times in Text are replaced with Chinese-character times;
the marks among the punctuation, and the enumeration comma and the colon among the stops, are removed, and the remaining stops serve as delimiters to split the bulletin text into sentences; the bulletin text Text is segmented into Chinese words with the jieba segmenter, the stop words are removed, the words are weighted with TF-IDF, and the words are sorted by weight in descending order.
further, the step of constructing the first vocabulary in S2 includes obtaining words of 2000 words before the weight arrangement to construct the first vocabularyWords
Figure 37095DEST_PATH_IMAGE001
Whereinw i Is shown asiThe number of the words is one,w j is shown asjThe number of the words is one,nis the number of words
Figure 106682DEST_PATH_IMAGE002
Further, S3 comprises:
for any two words w_i and w_j appearing in the same sentence, the same paragraph, or the same chapter, establishing an association and constructing the word co-occurrence matrices
M = {M_sent, M_para, M_chap}
where M_sent is the sentence-level word co-occurrence matrix, M_para is the paragraph-level word co-occurrence matrix, and M_chap is the chapter-level word co-occurrence matrix; the matrix row index i and column index j are the indices of the two co-occurring words w_i and w_j; each matrix element is the joint probability of the two words addressed by its row and column indices:
M[i, j] = p(w_i, w_j)
Further, S4 comprises reducing the dimensionality of the sentence-level, paragraph-level, and chapter-level word co-occurrence matrices with principal component analysis, the reduced dimension being 2000 x 100, where 2000 is the number of words and 100 is the dimension of each word semantic vector; after dimension reduction, the vectors obtained from the three word co-occurrence matrices form each word's three-level semantic representation, i.e. its sentence-level, paragraph-level, and chapter-level semantic representations; and extracting the three-level semantic representations of all words in the first vocabulary.
Further, the dimension reduction is calculated as:
v_k = z_k · U_100, with z_k = (M_k - mean(M_k)) / σ_k and C = cov(Z)
where σ_k is the standard deviation of the k-th row vector; M_k is the k-th row vector of the word co-occurrence matrix M; Z is the matrix of standardized rows z_k; C is the covariance matrix of Z; U_100 is the matrix of the first 100 feature column vectors (eigenvectors) of C; and v_k is the three-level semantic representation of the k-th word in the word co-occurrence matrix.
Further, in S5, at each repetition S2 constructs a new vocabulary, taking the next 2000 highest-weighted words in turn, until all words in the bulletin text are covered.
further, the statement context three-level semantic representation is
Figure 58851DEST_PATH_IMAGE016
Wherein t is the t-th word in the sentence.
S7 comprises: the user inputs a key phrase, the three-level semantic representations of all keywords of the key phrase are extracted, and these are accumulated to form the key-phrase three-level semantic representation
V_keywords = Σ_t v_t
where t indexes the t-th word in the key phrase.
Furthermore, a semantic similarity calculation model based on a twin neural network is constructed; it comprises two groups of isomorphic feedforward neural networks, takes as input a sentence-context three-level semantic representation and the user's key-phrase three-level semantic representation, and outputs their similarity;
a sentence-context three-level semantic representation is input together with the user's key-phrase three-level semantic representation, and when the similarity exceeds the set value, the sentence corresponding to the input sentence-context representation is extracted;
the above is repeated, inputting all sentence-context three-level semantic representations of the bulletin text in turn, until every sentence whose similarity with the user's key-phrase three-level semantic representation exceeds the set value has been extracted to form the bulletin text abstract.
Beneficial effects: by extracting semantic representations of words, judging the similarity between the sentence-context and key-phrase semantic representations, and extracting the sentences whose similarity exceeds the set value to form the bulletin text abstract, the method produces an abstract that correlates strongly with the keywords input by the user; it effectively removes the 'redundant' information the user does not care about and improves the user's working efficiency.
Drawings
The invention is described in further detail below with reference to the figures and specific embodiments.
Fig. 1 is a flowchart of the present embodiment.
Fig. 2 is an architecture diagram of the twin neural network of the present embodiment.
Detailed Description
The technical solutions of the present invention will be described clearly and completely with reference to the accompanying drawings, and it should be understood that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In the description of the present invention, it should be noted that the terms "center", "upper", "lower", "left", "right", "vertical", "horizontal", "inner", "outer", etc., indicate orientations or positional relationships based on the orientations or positional relationships shown in the drawings, and are only for convenience of description and simplicity of description, but do not indicate or imply that the device or element being referred to must have a particular orientation, be constructed and operated in a particular orientation, and thus, should not be construed as limiting the present invention. Furthermore, the terms "first," "second," and "third" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.
In the description of the present invention, it should be noted that, unless otherwise explicitly specified or limited, the terms "mounted," "connected," and "connected" are to be construed broadly, e.g., as meaning either a fixed connection, a removable connection, or an integral connection; can be mechanically or electrically connected; they may be connected directly or indirectly through intervening media, or they may be interconnected between two elements. The specific meanings of the above terms in the present invention can be understood in specific cases to those skilled in the art. In addition, the technical features involved in the different embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
As shown in fig. 1, this embodiment provides an abstract extraction method, illustrated on a listed-company bulletin text, comprising the following steps.
S1, preprocessing: generalizing the numeric and time-type data in the bulletin text. This comprises:
replacing the numerals in the bulletin text Text with Chinese-character numerals, and replacing the times in Text with Chinese-character times;
removing the marks among the punctuation, and the enumeration comma and the colon among the stops, and splitting the bulletin text into sentences using the remaining stops as delimiters; segmenting the bulletin text Text into Chinese words with the jieba segmenter and removing the stop words to obtain the words of Text;
weighting the words with TF-IDF and sorting them by weight in descending order.
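For illustration only, this preprocessing and weighting could be sketched in Python as below; this is a minimal sketch in which jieba and scikit-learn stand in for tooling the patent does not name, the generalization regexes are simplified assumptions, and all helper names are hypothetical. The second helper also yields the 2000 highest-weighted words used to build the first vocabulary in S2.

```python
# Minimal sketch of S1 (and the vocabulary selection feeding S2); the
# generalization rules and helper names are illustrative assumptions.
import re
import jieba
from sklearn.feature_extraction.text import TfidfVectorizer

def preprocess(text: str, stop_words: set) -> list:
    # Generalize time-type data first, then numeric data (placeholder tokens
    # here, rather than the Chinese-character replacement described above).
    text = re.sub(r"\d{4}年\d{1,2}月\d{1,2}日", "某年某月某日", text)
    text = re.sub(r"\d+(?:\.\d+)?", "某数", text)
    # Remove the enumeration comma and colon; split on the remaining stops.
    text = re.sub(r"[、：:]", "", text)
    sentences = [s for s in re.split(r"[。！？；!?;]", text) if s.strip()]
    # jieba Chinese word segmentation with stop-word removal.
    return [[w for w in jieba.lcut(s) if w.strip() and w not in stop_words]
            for s in sentences]

def top_weighted_words(token_sents: list, k: int = 2000) -> list:
    # TF-IDF weighting, then words sorted by weight in descending order.
    docs = [" ".join(s) for s in token_sents]
    vec = TfidfVectorizer(token_pattern=r"\S+")
    tfidf = vec.fit_transform(docs)
    weights = tfidf.sum(axis=0).A1          # aggregate weight per word
    vocab = vec.get_feature_names_out()
    return [vocab[i] for i in weights.argsort()[::-1][:k]]
```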
S2, constructing a first vocabulary. This comprises taking the 2000 highest-weighted words to build the first vocabulary Words:
Words = {w_1, w_2, ..., w_n}
where w_i denotes the i-th word, w_j denotes the j-th word, and n = 2000 is the number of words.
S3, constructing the word co-occurrence matrices of the first vocabulary:
for any two words w_i and w_j appearing in the same sentence, the same paragraph, or the same chapter, an association is established and the word co-occurrence matrices are constructed:
M = {M_sent, M_para, M_chap}
where M_sent is the sentence-level word co-occurrence matrix, M_para is the paragraph-level word co-occurrence matrix, and M_chap is the chapter-level word co-occurrence matrix; the matrix row index i and column index j are the indices of the two co-occurring words w_i and w_j, and each matrix element is the joint probability of the two words addressed by its row and column indices:
M[i, j] = p(w_i, w_j)
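One way to realize this step is sketched below, under the assumption (consistent with the description above) that an element M[i, j] is estimated as the fraction of units in which both words appear; calling the function with sentence, paragraph, or chapter token lists yields the matrix of the corresponding level. The function name and the frequency-based probability estimate are assumptions.

```python
# Illustrative construction of one level's word co-occurrence matrix; the
# joint probability is estimated as the fraction of units containing both words.
import numpy as np

def cooccurrence_matrix(units: list, vocab: list) -> np.ndarray:
    """units: token lists, one per sentence, paragraph, or chapter."""
    index = {w: i for i, w in enumerate(vocab)}
    m = np.zeros((len(vocab), len(vocab)))
    for unit in units:
        present = {index[w] for w in unit if w in index}
        for i in present:
            for j in present:
                m[i, j] += 1.0          # w_i and w_j co-occur in this unit
    return m / max(len(units), 1)       # estimate of p(w_i, w_j)
```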
S4, reducing the dimensionality of the word co-occurrence matrices and extracting the semantic representations of all words in the first vocabulary. This comprises:
reducing the dimensionality of the sentence-level, paragraph-level, and chapter-level word co-occurrence matrices with principal component analysis, the reduced dimension being 2000 x 100, where 2000 is the number of words and 100 is the dimension of each word semantic vector; after dimension reduction, the vectors obtained from the three word co-occurrence matrices form each word's three-level semantic representation, i.e. its sentence-level, paragraph-level, and chapter-level semantic representations. The dimension reduction is calculated as:
v_k = z_k · U_100, with z_k = (M_k - mean(M_k)) / σ_k and C = cov(Z)
where σ_k is the standard deviation of the k-th row vector; M_k is the k-th row vector of the word co-occurrence matrix M; Z is the matrix of standardized rows z_k; C is the covariance matrix of Z; U_100 is the matrix of the first 100 feature column vectors (eigenvectors) of C; and v_k is the three-level semantic representation of the k-th word in the word co-occurrence matrix.
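The dimension reduction might be sketched as below; the row standardization and the projection onto the covariance matrix's first 100 eigenvectors follow the variable descriptions above, but since the original formula is reconstructed, the exact normalization is an assumption.

```python
# Sketch of S4: standardize each word's co-occurrence row, then project onto
# the covariance matrix's first 100 eigenvectors (feature column vectors).
import numpy as np

def semantic_vectors(M: np.ndarray, dim: int = 100) -> np.ndarray:
    mu = M.mean(axis=1, keepdims=True)
    sigma = M.std(axis=1, keepdims=True) + 1e-12     # k-th row standard deviation
    Z = (M - mu) / sigma                             # standardized rows z_k
    C = np.cov(Z, rowvar=False)                      # covariance matrix C
    eigvals, eigvecs = np.linalg.eigh(C)
    U = eigvecs[:, np.argsort(eigvals)[::-1][:dim]]  # U_100: top-100 eigenvectors
    return Z @ U                                     # 2000 x 100 matrix of v_k
```

Applying the function to the sentence-level, paragraph-level, and chapter-level matrices gives each word three 100-dimensional vectors, i.e. its three-level semantic representation.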
S2 to S4 constitute a three-level semantic coding method, by which the three-level semantic representations of all words in the bulletin text are extracted. The text is then split sentence by sentence, and the three-level semantic representations of the words of a sentence are accumulated to form the sentence-context three-level semantic representation.
S5, repeating S2 to S4 to extract the semantic representations of all words in the bulletin text: at each repetition, S2 constructs a new vocabulary, taking the next 2000 highest-weighted words in turn, until all words in the bulletin text are covered.
S6, accumulating the word representations sentence by sentence to form the sentence-context semantic representations. The sentence-context three-level semantic representation is:
V_sentence = Σ_t v_t
where t indexes the t-th word in the sentence and v_t is that word's three-level semantic representation.
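A minimal sketch of this accumulation follows (S7 below reuses it for the key phrase); the `vectors` map from words to their three-level semantic vectors is assumed to come from the dimension-reduction step, and the function name is hypothetical.

```python
# Sketch of S6/S7: sum the three-level semantic vectors of the unit's words.
import numpy as np

def context_vector(tokens: list, vectors: dict) -> np.ndarray:
    dim = len(next(iter(vectors.values())))
    v = np.zeros(dim)
    for t in tokens:          # t-th word of the sentence or key phrase
        if t in vectors:
            v += vectors[t]   # accumulate v_t
    return v                  # V_sentence or V_keywords
```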
S7, the user inputs a key phrase and its semantic representation is extracted. This comprises:
extracting the three-level semantic representations of all keywords of the key phrase input by the user, and accumulating them to form the key-phrase three-level semantic representation:
V_keywords = Σ_t v_t
where t indexes the t-th word in the key phrase.
S8, judging the similarity between the sentence-context semantic representations and the key-phrase semantic representation, and extracting the sentences whose similarity exceeds a set value to form the bulletin text abstract. This comprises:
constructing a semantic similarity calculation model based on a twin neural network:
Similarity(Text, keywords) = f(V_sentence, V_keywords)
As shown in fig. 2, the semantic similarity calculation model based on the twin neural network comprises two groups of isomorphic feedforward neural networks; its inputs are a sentence-context three-level semantic representation and the user's key-phrase three-level semantic representation, and its output is the similarity. Specifically, the model consists of two independent parallel input layers, two independent parallel hidden layers, and one output layer; the input layer dimension is 1 x 100 and the hidden layer dimension is 1 x 10; each input layer is connected to its hidden layer through a Sigmoid activation function, and the two hidden layers are jointly connected to the output layer through a Sigmoid activation function; the output layer is trained with a cross-entropy loss function and outputs the similarity.
The sentence-context three-level semantic representations and the user's key-phrase three-level semantic representation are used as inputs to train the model, and the trained model calculates the similarity Similarity(Text, keywords) between a sentence-context representation and the key-phrase representation. Specifically, the sentence-context three-level semantic representation feeds one of the two independent parallel input layers, and the key-phrase three-level semantic representation feeds the other.
The similarity between the key-phrase and sentence-context semantic representations is then judged against the set value of 0.7: when
Similarity(Text, keywords) > 0.7
the bulletin text sentence containing the key phrase is extracted into the bulletin text abstract.
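A PyTorch sketch of the twin network as described (two independent parallel 1 x 100 input layers, two 1 x 10 hidden layers, Sigmoid activations, one shared output unit, cross-entropy training) might look as follows; the optimizer, the training pairs, and the concatenation of the two hidden layers before the output layer are assumptions not fixed by the text.

```python
# Illustrative twin-network similarity model; layer sizes follow the text,
# the remaining design choices are assumed.
import torch
import torch.nn as nn

class TwinSimilarity(nn.Module):
    def __init__(self):
        super().__init__()
        self.hidden_text = nn.Linear(100, 10)   # branch for the sentence context
        self.hidden_keys = nn.Linear(100, 10)   # branch for the key phrase
        self.out = nn.Linear(20, 1)             # shared output layer

    def forward(self, v_text: torch.Tensor, v_keys: torch.Tensor) -> torch.Tensor:
        h_text = torch.sigmoid(self.hidden_text(v_text))   # Sigmoid activation
        h_keys = torch.sigmoid(self.hidden_keys(v_keys))
        joint = torch.cat([h_text, h_keys], dim=-1)
        return torch.sigmoid(self.out(joint))              # similarity in (0, 1)

model = TwinSimilarity()
loss_fn = nn.BCELoss()   # binary cross-entropy over similar/dissimilar pairs

# Extraction rule with the set value of 0.7:
# keep a sentence if model(v_sentence, v_keywords).item() > 0.7
```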
S6 to S8 constitute an abstract extraction method based on context semantic similarity calculation, which extracts the sentences carrying key information in the bulletin text to form the abstract.
The abstract extraction method provided by this embodiment extracts sentence-level, paragraph-level, and chapter-level semantic representations of words through the three-level semantic coding method, judges the similarity between the sentence-context and key-phrase semantic representations through the context-semantic-similarity abstract extraction method, and extracts the sentences whose similarity exceeds the set value to form the bulletin text abstract.
The method thus considers the association of the keywords input by the user with sentences, paragraphs, and chapters, and accurately extracts the sentences strongly associated with those keywords to form the abstract text; it effectively removes the 'redundant' information the user does not care about and improves the user's working efficiency.
It should be understood that the above examples are given only for clarity of illustration and do not limit the embodiments. Other variations and modifications will be apparent to persons skilled in the art in light of the above description; it is neither necessary nor possible to exhaust all embodiments here, and obvious variations or modifications derived therefrom remain within the protection scope of the invention.

Claims (10)

1. An abstract extraction method, characterized by comprising the following steps:
S1, preprocessing: generalizing the numeric and time-type data in the bulletin text;
S2, constructing a first vocabulary;
S3, constructing word co-occurrence matrices for the first vocabulary;
S4, reducing the dimensionality of the word co-occurrence matrices and extracting semantic representations of all words in the first vocabulary;
S5, repeating S2 to S4 to extract semantic representations of all words in the bulletin text;
S6, accumulating the word representations sentence by sentence to form sentence-context semantic representations;
S7, receiving a key phrase input by the user and extracting its semantic representation;
S8, computing the similarity between the key-phrase semantic representation and each sentence-context semantic representation; where the similarity exceeds a set value, extracting the bulletin text sentence containing the key phrase into the bulletin text abstract.
2. The abstract extraction method according to claim 1, wherein S1 comprises:
replacing the numerals in the bulletin text Text with Chinese-character numerals, and replacing the times in Text with Chinese-character times;
removing the marks among the punctuation, and the enumeration comma and the colon among the stops, and splitting the bulletin text into sentences using the remaining stops as delimiters; segmenting the bulletin text Text into Chinese words with the jieba segmenter, removing the stop words, weighting the words with TF-IDF, and sorting the words by weight in descending order.
3. The abstract extraction method according to claim 2, wherein constructing the first vocabulary in S2 comprises taking the 2000 highest-weighted words to build the first vocabulary Words:
Words = {w_1, w_2, ..., w_n}
where w_i denotes the i-th word, w_j denotes the j-th word, and n = 2000 is the number of words.
4. The abstract extraction method according to claim 3, wherein S3 comprises:
for any two words w_i and w_j appearing in the same sentence, the same paragraph, or the same chapter, establishing an association and constructing the word co-occurrence matrices
M = {M_sent, M_para, M_chap}
where M_sent is the sentence-level word co-occurrence matrix, M_para is the paragraph-level word co-occurrence matrix, and M_chap is the chapter-level word co-occurrence matrix; the matrix row index i and column index j are the indices of the two co-occurring words w_i and w_j, and each matrix element is the joint probability of the two words addressed by its row and column indices:
M[i, j] = p(w_i, w_j)
5. The abstract extraction method according to claim 4, wherein S4 comprises reducing the dimensionality of the sentence-level, paragraph-level, and chapter-level word co-occurrence matrices with principal component analysis, the reduced dimension being 2000 x 100, where 2000 is the number of words and 100 is the dimension of each word semantic vector; after dimension reduction, the vectors obtained from the three word co-occurrence matrices form each word's three-level semantic representation, i.e. its sentence-level, paragraph-level, and chapter-level semantic representations; and extracting the three-level semantic representations of all words in the first vocabulary.
6. The abstract extraction method according to claim 5, wherein the dimension reduction is calculated as:
v_k = z_k · U_100, with z_k = (M_k - mean(M_k)) / σ_k and C = cov(Z)
where σ_k is the standard deviation of the k-th row vector; M_k is the k-th row vector of the word co-occurrence matrix M; Z is the matrix of standardized rows z_k; C is the covariance matrix of Z; U_100 is the matrix of the first 100 feature column vectors (eigenvectors) of C; and v_k is the three-level semantic representation of the k-th word in the word co-occurrence matrix.
7. The abstract extraction method according to claim 6, wherein in S5, at each repetition S2 constructs a new vocabulary, taking the next 2000 highest-weighted words in turn, until all words in the bulletin text are covered.
8. The abstract extraction method according to claim 7, wherein the sentence-context three-level semantic representation is
V_sentence = Σ_t v_t
where t indexes the t-th word in the sentence and v_t is that word's three-level semantic representation.
9. The abstract extraction method according to claim 8, wherein S7 comprises: receiving a key phrase input by the user, extracting the three-level semantic representations of all keywords of the key phrase, and accumulating them to form the key-phrase three-level semantic representation
V_keywords = Σ_t v_t
where t indexes the t-th word in the key phrase.
10. The abstract extraction method according to claim 9, wherein S8 comprises:
constructing a semantic similarity calculation model based on a twin neural network, the model comprising two groups of isomorphic feedforward neural networks, taking as input a sentence-context three-level semantic representation and the user's key-phrase three-level semantic representation, and outputting their similarity;
inputting a sentence-context three-level semantic representation together with the user's key-phrase three-level semantic representation, and extracting the sentence corresponding to the input sentence-context representation when the similarity exceeds the set value;
and repeating the above, inputting all sentence-context three-level semantic representations of the bulletin text in turn, until every sentence whose similarity with the user's key-phrase three-level semantic representation exceeds the set value has been extracted to form the bulletin text abstract.
CN202111532196.5A 2021-12-15 2021-12-15 Abstract extraction method Active CN113918708B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111532196.5A CN113918708B (en) 2021-12-15 2021-12-15 Abstract extraction method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111532196.5A CN113918708B (en) 2021-12-15 2021-12-15 Abstract extraction method

Publications (2)

Publication Number Publication Date
CN113918708A true CN113918708A (en) 2022-01-11
CN113918708B CN113918708B (en) 2022-03-22

Family

ID=79248937

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111532196.5A Active CN113918708B (en) 2021-12-15 2021-12-15 Abstract extraction method

Country Status (1)

Country Link
CN (1) CN113918708B (en)


Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102646114A (en) * 2012-02-17 2012-08-22 清华大学 News topic timeline abstract generating method based on breakthrough point
CN104679730A (en) * 2015-02-13 2015-06-03 刘秀磊 Webpage summarization extraction method and device thereof
CN110069622A (en) * 2017-08-01 2019-07-30 武汉楚鼎信息技术有限公司 A kind of personal share bulletin abstract intelligent extract method
CN110188349A (en) * 2019-05-21 2019-08-30 清华大学深圳研究生院 A kind of automation writing method based on extraction-type multiple file summarization method
CN110851598A (en) * 2019-10-30 2020-02-28 深圳价值在线信息科技股份有限公司 Text classification method and device, terminal equipment and storage medium
CN111259136A (en) * 2020-01-09 2020-06-09 信阳师范学院 Method for automatically generating theme evaluation abstract based on user preference
WO2021164231A1 (en) * 2020-02-18 2021-08-26 平安科技(深圳)有限公司 Official document abstract extraction method and apparatus, and device and computer readable storage medium
US20210342552A1 (en) * 2020-05-01 2021-11-04 International Business Machines Corporation Natural language text generation from a set of keywords using machine learning and templates

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
李峰 et al.: "Research on a Domain-Corpus-Driven Sentence Relevance Calculation Method", 《计算机科学》 (Computer Science) *
黄亚明 et al.: "A Development Attempt at an SKR/MetaMap Output Concept Co-occurrence Analysis *** for Semantic Mining of Web Texts", 《现代图书情报技术》 (New Technology of Library and Information Service) *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US12008332B1 (en) 2023-08-18 2024-06-11 Anzer, Inc. Systems for controllable summarization of content

Also Published As

Publication number Publication date
CN113918708B (en) 2022-03-22

Similar Documents

Publication Publication Date Title
Daud et al. Urdu language processing: a survey
Oh et al. Why-question answering using intra-and inter-sentential causal relations
CN113704451B (en) Power user appeal screening method and system, electronic device and storage medium
US20080027893A1 (en) Reference resolution for text enrichment and normalization in mining mixed data
Murthy et al. Language identification from small text samples
CN108319583B (en) Method and system for extracting knowledge from Chinese language material library
Petersen et al. Natural Language Processing Tools for Reading Level Assessment and Text Simplication for Bilingual Education
Golpar-Rabooki et al. Feature extraction in opinion mining through Persian reviews
CN113918708B (en) Abstract extraction method
Yan et al. Chemical name extraction based on automatic training data generation and rich feature set
Melero et al. Holaaa!! writin like u talk is kewl but kinda hard 4 NLP
Saleh et al. TxLASM: A novel language agnostic summarization model for text documents
JP6168057B2 (en) Failure occurrence cause extraction device, failure occurrence cause extraction method, and failure occurrence cause extraction program
CN115952794A (en) Chinese-Tai cross-language sensitive information recognition method fusing bilingual sensitive dictionary and heterogeneous graph
Liu et al. Keyword extraction using PageRank on synonym networks
Ali et al. Word embedding based new corpus for low-resourced language: Sindhi
Cui Converting taxonomic descriptions to new digital formats
Das et al. An improvement of Bengali factoid question answering system using unsupervised statistical methods
Hamza et al. Text mining: A survey of Arabic root extraction algorithms
Saneifar et al. From terminology extraction to terminology validation: an approach adapted to log files
Worke INFORMATION EXTRACTION MODEL FROM GE’EZ TEXTS
Temesgen Afaan Oromo News Text Summarization Using Sentence Scoring Method
Modrzejewski Improvement of the Translation of Named Entities in Neural Machine Translation
Saleh et al. TxLASM: A Novel Language Agnostic Summarization Model for Text Documents
Dias Information digestion

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
EE01 Entry into force of recordation of patent licensing contract

Application publication date: 20220111

Assignee: Shenzhen Mingji Agricultural Development Co.,Ltd.

Assignor: SHENZHEN DIB ENTERPRISE RISK MANAGEMENT TECHNOLOGY CO.,LTD.

Contract record no.: X2023980049635

Denomination of invention: A Summary Extraction Method

Granted publication date: 20220322

License type: Common License

Record date: 20231204