CN108090049B - Multi-document abstract automatic extraction method and system based on sentence vectors


Info

Publication number
CN108090049B
Authority
CN
China
Prior art keywords
sentence
document
sentences
vectors
steps
Prior art date
Legal status
Expired - Fee Related
Application number
CN201810045090.4A
Other languages
Chinese (zh)
Other versions
CN108090049A (en)
Inventor
窦全胜
朱翔
Current Assignee
Shandong Technology and Business University
Original Assignee
Shandong Technology and Business University
Priority date
Filing date
Publication date
Application filed by Shandong Technology and Business University
Priority to CN201810045090.4A
Publication of CN108090049A
Application granted
Publication of CN108090049B

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/258Heading extraction; Automatic titling; Numbering


Abstract

The invention discloses a method and system for automatically extracting a multi-document abstract based on sentence vectors, comprising the following steps: S1, preprocessing the document set; S2, training a doc2vec model to generate sentence vectors; S3, clustering the sentences into sub-topic documents; S4, establishing a sentence relation graph model within each sub-topic document; S5, calculating sentence weights; and S6, extracting and ordering sentences to form the abstract. The method represents every sentence in the target document set as a vector through a doc2vec model trained on a large corpus; spectral clustering groups the sentences into sub-topics and one sentence is extracted from each sub-topic, which avoids sentence redundancy; and the abstract is assembled in the sentences' original document order, improving its front-to-back coherence.

Description

Multi-document abstract automatic extraction method and system based on sentence vectors
Technical Field
The invention relates to the field of computer text mining, in particular to a method and a system for automatically extracting multiple document abstracts based on sentence vectors.
Background
Automatic document summarization uses a computer to condense and refine a text, providing the user with its gist. By briefly reading the abstract, a user gains an initial view of the key content of the full text, which greatly improves the efficiency of acquiring and understanding information. In single-document automatic summarization, a computer generates an abstract of the main content of one document through an algorithm; since Luhn proposed a method for automatically generating document abstracts in 1958, research on single-document summarization has developed actively, and its results have by now reached a generally accepted level. Multi-document automatic summarization, by contrast, generates a comprehensive abstract of the main content of several different documents. To date, multi-document summarization has been closely combined with artificial intelligence and related algorithms, most recently with evolutionary algorithms and deep learning.
Yan et al will use deep learning for text summarization for the first time, with the input layer being word frequency vectors, the hidden layer consisting of a constrained boltzmann machine, and the summary consisting of dynamically planning and selecting important sentences. Rush uses deep learning to summarize the original document, uses convolutional networks to encode the original document, and uses context attention feedforward neural networks to generate the summary. Google in 2016 originated the deep learning based automatic digest module Textsum in the deep learning framework, tensoflow. The multi-document automatic digest may be divided into a decimated digest and an abstract digest according to whether or not a sentence forming the digest is derived from an original text. The abstract type abstract is mainly used for evaluating the importance of sentences of an original document and selecting key sentences from the sentences to form the abstract. The abstract is mainly used for extracting word information from an original document and then organizing word tandem sentences to form the abstract.
At present, abstractive implementations are overly complex: machines still understand natural language poorly, considerable manual involvement is required, and development is slow; the approach remains in its infancy. Extractive summarization is therefore the commonly used method. The maximum common subgraph method and the edge-weight similarity method from graph-model-based text classification are commonly used similarity measures. Similarity can also be measured with the eigenvectors corresponding to the left singular values of the text map matrix, which is essentially PCA dimensionality reduction under the assumption that the sample mean is 0. Conventional extractive methods mainly suffer from sentence redundancy and poor sentence-to-sentence flow.
Disclosure of Invention
Aiming at defects of the prior art such as redundant extracted sentences and disordered sentence order, the invention provides a sentence-vector-based multi-document abstract automatic extraction method that produces document abstracts with higher accuracy and readability.
The technical scheme adopted by the invention is as follows:
the method for automatically extracting the multi-document abstract based on the sentence vector comprises the following steps:
s1: preprocessing a document set of the abstract to be extracted;
s2: training by adopting a doc2vec model to generate a sentence vector;
s3: clustering sentence vectors and storing corresponding sentences as sub-topic documents;
s4: establishing a sentence relation graph model in each subtopic document;
s5: calculating sentence weight in each subtopic document according to the relation graph model established in the step S4;
s6: the sentences are extracted and sorted to form the abstract.
Further, S1 includes the steps of:
step S101: dividing each document of the document set into sentences at sentence-end punctuation and recording the sentences line by line, one sentence per line;
step S102: recording the corresponding position of each sentence;
step S103: copying the content of every sentence-divided document in the document set into a single document to merge the document set, one sentence per line in the merged document;
step S104: performing word segmentation on each line of the merged document and removing stop words.
Further, the position of a sentence recorded in step S102 of step S1 is expressed as:

h_{n,i} = i / len(text_n)

where h_{n,i} denotes the position of the i-th sentence in the n-th document, text_n denotes the n-th document, and len(text_n) denotes the number of sentences contained in the n-th document.
Further, step S2 includes the following steps:
step S201: preprocessing all documents in the large corpus through steps S101 to S104 of step S1, inputting the preprocessed documents into the sentence-vector distributed memory model PV-DM in doc2vec, and training the model;
step S202: inputting the target documents preprocessed through steps S101 to S104 of step S1 into the trained sentence-vector distributed memory model PV-DM to obtain the sentence vectors.
In step S201, the training of the sentence-vector distributed memory model PV-DM comprises the following steps:
step S2011: initializing every sentence (one per line) and every word in the preprocessed large-corpus documents as k-dimensional vectors, and inputting the word vectors of the context of a word w, together with the sentence vector of the sentence containing w, into the deep neural network model;
step S2012: summing the input vectors in the hidden layer of the deep neural network model, the accumulated vector serving as the input of the output layer;
step S2013: the output layer of the deep neural network model corresponds to a binary tree, namely a Huffman tree built with the words of the corpus as leaf nodes and each word's number of occurrences in the corpus as its weight; each word corresponds to one leaf node, each branch in the tree is treated as one binary classification, and the Label of the j-th tree node on the path from the root to the leaf node of word w is 1 - p_j, where p_j is the Huffman code of that node; apart from the root and leaf nodes, each tree node carries an auxiliary vector of the same length as the sentence vector, used to assist in training the model;
step S2014: continuously correcting the sentence vectors, word vectors, and auxiliary vectors by gradient ascent, finally obtaining the trained distributed memory model PV-DM of the sentence vectors.
The context of the word w is the C words before and after the word w.
The objective function of the neural network training is

Σ_{doc} Σ_{sentence ∈ doc} Σ_{w ∈ sentence} log p(w | Context(w), sentence)

where sentence is a sentence, doc is a preprocessed document, w is a word, and Context(w) denotes the context words of w together with the sentence containing w.
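As an illustrative sketch only (not the claimed implementation; the helper name, learning rate, and data layout are assumptions), the following Python fragment shows one gradient-ascent step of this hierarchical-softmax training for a single target word w:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def pv_dm_step(context_vecs, sent_vec, path_nodes, path_codes, lr=0.025):
    """One hierarchical-softmax gradient-ascent step of PV-DM for one target
    word w. context_vecs: word vectors of the C words around w; sent_vec:
    vector of the sentence containing w; path_nodes: auxiliary vectors of the
    inner Huffman-tree nodes on the path from the root to w's leaf;
    path_codes: the Huffman code p_j of each such node (0 or 1)."""
    h = sent_vec + sum(context_vecs)            # hidden layer: sum of inputs (S2012)
    e = np.zeros_like(h)                        # accumulated gradient for the inputs
    for theta, p in zip(path_nodes, path_codes):
        label = 1 - p                           # Label = 1 - p_j (S2013)
        g = lr * (label - sigmoid(h @ theta))   # binary-classification gradient
        e += g * theta                          # propagate back toward the inputs
        theta += g * h                          # ascend: correct the auxiliary vector
    for v in context_vecs:
        v += e                                  # correct the context word vectors
    sent_vec += e                               # correct the sentence vector (S2014)
```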
Further, the clustering of sentence vectors in step S3 adopts spectral clustering;
further, step S3 includes the following steps:
step S301: constructing the similarity matrix W between all sentence vectors, using a Gaussian kernel as the kernel function,

W_{i,j} = exp(-||x_i - x_j||^2 / (2σ^2))

where W_{i,j} is the similarity between sentences x_i and x_j, and σ is the Gaussian radius;
step S302: calculating a Laplace matrix L;
L=D-W
where D is a diagonal matrix whose n-th diagonal entry is the sum of the n-th row of W;
step S303: constructing the normalized Laplacian matrix

L_norm = D^{-1/2} L D^{-1/2}

step S304: computing the k smallest eigenvalues of L_norm and the corresponding eigenvectors V;
step S305: arranging the eigenvectors as columns to form a feature matrix and normalizing each row of the feature matrix to unit length to form the matrix F, i.e., the vector formed by each row of F has modulus 1;
step S306: treating each row of the matrix F as a k-dimensional sample and clustering with the K-means algorithm into C classes;
step S307: storing the sentences corresponding to the vectors of the C classes as C sub-topic documents.
Further, step S4 specifically includes:
in each sub-topic document, a sentence relation graph model is established with sentences as nodes and the similarity between sentences as edge weights;
further, the similarity between sentences is calculated by cosine value of included angle between sentence vectors, cosine value sim (x)i,xj) The calculation formula of (2) is as follows:
Figure GDA0002849190890000041
wherein x isi,xjTwo sentence vectors.
Further, the step S5 specifically includes:
initializing the weight of every sentence and iteratively updating the weights according to the relation graph model established in step S4:

S(i) = (1 - d) + d · Σ_{j ∈ δ(i)} S(j) / |δ(j)|

where S(i) is the weight of sentence i, δ(i) is the set of all sentences in the same sub-topic document whose similarity to sentence i exceeds the set threshold, |δ(j)| is the number of sentences in the same sub-topic document whose similarity to sentence j exceeds the set threshold, S(j) is the weight of sentence j, and d is the damping coefficient, conventionally set to 0.85.
Further, in step S5 the sentence weights are initialized to 1, and the similarity threshold used in δ(i) is set to 0.05.
Further, step S6 specifically comprises: extracting the sentence with the largest weight from each sub-topic document, sorting the sentences according to their document positions recorded in step S102 of step S1, and combining them into the abstract.
A sentence-vector-based multi-document abstract automatic extraction system, comprising: a memory, a processor, and computer instructions stored on the memory and executed on the processor, wherein the computer instructions, when executed by the processor, perform the steps of any of the above methods.
A computer-readable storage medium having a computer program thereon, which, when executed by a processor, performs the steps of any of the methods described above.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this application, illustrate embodiments of the application and, together with the description, serve to explain the application and are not intended to limit the application.
FIG. 1 is a flow chart of the present invention.
FIG. 2 is a flow chart of the preprocessing steps of the present invention.
Detailed Description
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs.
FIG. 1 is a flow chart of the present invention, comprising the steps of:
s1: preprocessing a document set;
s2: training by adopting a doc2vec model to generate a sentence vector;
s3: clustering sentence vectors and storing corresponding sentences as sub-topic documents;
s4: establishing a sentence relation graph model in each subtopic document;
s5: calculating sentence weights in the sub-topic documents;
s6: the sentences are extracted and sorted to form the abstract.
Specifically, the implementation of step S1 is shown in FIG. 2 and comprises the following steps:
step S101: dividing each document in the document set into sentences at sentence-end punctuation, one sentence per line;
step S102: recording the corresponding position of each sentence;
step S103: copying the content of every sentence-divided document in the document set into a single document to merge the document set, one sentence per line in the merged document;
step S104: performing word segmentation on each line of the merged document and removing stop words.
Further, the sentence positions in step S102 are expressed as:

h_{n,i} = i / len(text_n)

where h_{n,i} denotes the position of the i-th sentence in the n-th document, text_n denotes the n-th document, and len(text_n) denotes the number of sentences it contains.
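A minimal sketch of steps S101 to S104 follows, under the assumptions that sentences end at Chinese sentence-end punctuation, that jieba is used for word segmentation (any segmenter would do), and that the stopword list is supplied by the caller:

```python
import re
import jieba  # assumed segmenter; not mandated by the invention

SENT_END = re.compile(r"(?<=[。！？])")  # split after sentence-end punctuation

def preprocess(docs, stopwords):
    """docs: list of raw document strings -> (sentences, positions, tokens)."""
    sentences, positions = [], []
    for n, text in enumerate(docs):
        sents = [s.strip() for s in SENT_END.split(text) if s.strip()]
        for i, s in enumerate(sents):
            sentences.append(s)                    # one sentence per line (S101/S103)
            positions.append((n, i / len(sents)))  # h_{n,i} = i / len(text_n) (S102)
    tokens = [[w for w in jieba.cut(s) if w not in stopwords]
              for s in sentences]                  # segmentation + stopword removal (S104)
    return sentences, positions, tokens
```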
Step S2 specifically includes the following steps:
step S201: preprocessing all documents in the large corpus through steps S101 to S104 of step S1, and training the sentence-vector distributed memory model PV-DM in doc2vec on the preprocessed documents;
step S202: feeding the target documents preprocessed through steps S101 to S104 of step S1 into the trained model to obtain the sentence vectors.
The training of the PV-DM model in step S201 specifically includes the following steps:
(1) initializing every sentence (one per line) and every word in the preprocessed large-corpus documents as k-dimensional vectors, and inputting the word vectors of the context of a word w (the C words before and after it) and the sentence vector of the sentence containing w into the deep neural network model;
(2) summing the input vectors in the hidden layer, the accumulated vector serving as the input of the output layer;
(3) the output layer corresponds to a binary tree, namely a Huffman tree built with the words appearing in the corpus as leaf nodes and each word's number of occurrences in the corpus as its weight; each word corresponds to a leaf node in the tree, each branch in the tree performs one binary classification, and the Label of the j-th tree node on the path from the root to the leaf node of word w is 1 - p_j, where p_j is the Huffman code of that node; apart from the root and leaf nodes, each tree node carries a vector of the same length as the sentence vector, called an auxiliary vector, used to assist in training the model;
(4) continuously correcting the sentence vectors, word vectors, and auxiliary vectors by gradient ascent to train the model, finally obtaining the trained distributed memory model of the sentence vectors.
The objective function of the neural network training is

Σ_{doc} Σ_{sentence ∈ doc} Σ_{w ∈ sentence} log p(w | Context(w), sentence)

where sentence is a sentence, doc is a preprocessed document, w is a word, and Context(w) denotes the context words of w together with the sentence containing w.
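A minimal training sketch with gensim's Doc2Vec, whose dm=1 setting selects the PV-DM variant described above (the file name and hyperparameter values are illustrative assumptions, not values fixed by the invention):

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# one preprocessed (segmented, stopword-free) sentence per line
with open("corpus_merged.txt", encoding="utf-8") as f:
    tagged = [TaggedDocument(words=line.split(), tags=[i])
              for i, line in enumerate(f)]

model = Doc2Vec(tagged, dm=1, vector_size=100, window=5,  # k = 100, C = 5
                hs=1, min_count=2, epochs=20)             # hs=1: Huffman-tree softmax

# infer the vector of one preprocessed target-document sentence
vec = model.infer_vector("已 分词 的 目标 句子".split())
```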
Clustering into sub-topic documents in step S3 uses spectral clustering and specifically comprises the following steps:
step S301: constructing the similarity matrix W between all sentences, using a Gaussian kernel as the kernel function,

W_{i,j} = exp(-||x_i - x_j||^2 / (2σ^2))

where W_{i,j} is the similarity between sentences x_i and x_j, and σ is the Gaussian radius;
step S302: the laplacian matrix L is calculated,
L=D-W
where D is a diagonal matrix whose n-th diagonal entry is the sum of the n-th row of W;
step S303: constructing the normalized Laplacian matrix

L_norm = D^{-1/2} L D^{-1/2}

step S304: computing the k smallest eigenvalues of L_norm and the corresponding eigenvectors V;
step S305: arranging the eigenvectors as columns to form a feature matrix and normalizing each row of the feature matrix to unit length to form the matrix F, i.e., the vector formed by each row of F has modulus 1;
step S306: treating each row of F as a k-dimensional sample and clustering with the K-means algorithm into C classes;
step S307: storing the sentences corresponding to the vectors of the C classes as C sub-topic documents.
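A minimal numpy/scikit-learn sketch of steps S301 to S307, under the assumptions that k = C and that σ is chosen by the caller:

```python
import numpy as np
from scipy.linalg import eigh
from sklearn.cluster import KMeans

def spectral_subtopics(X, C, sigma=1.0):
    """X: (n, k) array of sentence vectors -> a sub-topic label per sentence."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-sq / (2 * sigma ** 2))                    # S301: Gaussian kernel
    d = W.sum(axis=1)
    L = np.diag(d) - W                                    # S302: L = D - W
    Dm = np.diag(1.0 / np.sqrt(d))
    Ln = Dm @ L @ Dm                                      # S303: normalized Laplacian
    _, V = eigh(Ln)                                       # S304: eigenvalues ascending
    F = V[:, :C]                                          # k = C smallest eigenvectors
    F /= np.linalg.norm(F, axis=1, keepdims=True)         # S305: unit-length rows
    return KMeans(n_clusters=C, n_init=10).fit_predict(F) # S306; the caller groups the
                                                          # labels into C documents (S307)
```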
Step S4 specifically includes:
in each sub-topic document, a sentence relation graph model is established with sentences as nodes and the similarity between sentences as edge weights;
further, the cosine similarity sim (x)i,xj) The calculation formula of (2) is as follows:
Figure GDA0002849190890000071
wherein x isi,xjTwo sentence vectors.
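A minimal sketch of building the sentence relation graph inside one sub-topic document (the 0.05 threshold is the value given for step S5 below):

```python
import numpy as np

def cosine(xi, xj):
    """sim(x_i, x_j) = x_i · x_j / (||x_i|| ||x_j||)."""
    return float(xi @ xj / (np.linalg.norm(xi) * np.linalg.norm(xj)))

def build_graph(X, threshold=0.05):
    """X: list of sentence vectors of one sub-topic document.
    Returns {i: delta(i)}: the sentences whose similarity to i exceeds the threshold."""
    n = len(X)
    return {i: [j for j in range(n)
                if j != i and cosine(X[i], X[j]) > threshold]
            for i in range(n)}
```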
The step S5 includes the following steps:
each sentence weight is initialized, and the weights are iteratively updated according to the relation graph model established in step S4 using the following formula:

S(i) = (1 - d) + d · Σ_{j ∈ δ(i)} S(j) / |δ(j)|

where S(i) is the weight of sentence i, δ(i) is the set of all sentences in the same sub-topic document whose similarity to sentence i exceeds the set threshold, |δ(j)| is the number of sentences in the same sub-topic document whose similarity to sentence j exceeds the set threshold, S(j) is the weight of sentence j, and d is the damping coefficient, conventionally set to 0.85.
Further, in step S5 the sentence weights are initialized to 1, and the similarity threshold used in δ(i) is set to 0.05.
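A minimal sketch of the weight iteration (the iteration cap and convergence tolerance are illustrative assumptions; the formula and constants follow the description above):

```python
def sentence_weights(graph, d=0.85, iters=100, tol=1e-6):
    """graph: {i: delta(i)} from build_graph -> {i: S(i)}; weights start at 1."""
    s = {i: 1.0 for i in graph}
    for _ in range(iters):
        new = {i: (1 - d) + d * sum(s[j] / len(graph[j])
                                    for j in graph[i] if graph[j])
               for i in graph}
        if max(abs(new[i] - s[i]) for i in graph) < tol:  # stop once converged
            return new
        s = new
    return s
```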
Further, step S6 specifically comprises:
the sentence with the largest weight is extracted from each sub-topic document, and the extracted sentences are sorted according to their positions in the original documents recorded in step S102 of step S1 and combined into the abstract.
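A minimal sketch of step S6, under the assumption that sentence ids are global and that positions holds the (document, position) records from step S102:

```python
def extract_summary(subtopics, weights, positions, sentences):
    """subtopics: list of sentence-id lists, one per sub-topic document;
    weights: {id: S(i)}; positions: {id: (n, h_n_i)} from step S102."""
    picks = [max(ids, key=lambda i: weights[i]) for ids in subtopics]
    picks.sort(key=lambda i: positions[i])   # restore original document order
    return "".join(sentences[i] for i in picks)
```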
To further illustrate the multi-document abstract automatic extraction method of the present invention, the following is the abstract produced for three documents about "the death of Wu Qingyuan":
Sina Sports news: the hundred-year-old Wu Qingyuan has passed away, leaving the kingdom of Go an eternal legend. Wu Qingyuan, the third son in his family, was born in Fujian province, China, on June 12, 1914. He was awarded the third dan by the Japanese Go institute in his second year there and reached the ninth dan in 1950. Wu Qingyuan pioneered new opening theory in Go and tirelessly guided the players who came after him. He was the leading figure of the golden age of Japanese Go and was called the "Showa Kisei" (Go saint of the Showa era) in Japan. In 1961 Wu Qingyuan was injured in a traffic accident and gradually faded from the front line, formally retiring in 1984. Before Wu Qingyuan, no player in the kingdom of Go had ever reached his height. In 2014 the Chinese Go world held a grand centenary celebration for Mr. Wu; among Go players, such honors could be attained by Wu alone. Wu Qingyuan's family planned to hold a memorial ceremony for him on December 3. Wu Qingyuan once said: "After one hundred I still want to play Go, and after two hundred I want to play Go in the universe." In pursuing the way of Go, Mr. Wu saw through life and death.
The above description is only a preferred embodiment of the present application and is not intended to limit the present application, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (7)

1. A method for automatically extracting a multi-document abstract based on sentence vectors, characterized by comprising the following steps:
s1: preprocessing a document set of the abstract to be extracted;
the step S1 includes the steps of:
step S101: dividing each document of the document set into sentences at sentence-end punctuation and recording the sentences line by line, one sentence per line;
step S102: recording the corresponding position of each sentence;
step S103: copying the content of every sentence-divided document in the document set into a single document to merge the document set, one sentence per line in the merged document;
step S104: performing word segmentation on each line of the merged document and removing stop words;
s2: training by adopting a doc2vec model to generate a sentence vector;
the step S2 includes the steps of:
step S201: preprocessing all documents in the large corpus through steps S101 to S104 of step S1, inputting the preprocessed documents into the sentence-vector distributed memory model PV-DM in doc2vec, and training the model;
step S202: inputting the target documents preprocessed through steps S101 to S104 of step S1 into the trained sentence-vector distributed memory model PV-DM to obtain the sentence vectors;
in step S201, training the sentence-vector distributed memory model PV-DM comprises the following steps:
step S2011: initializing every sentence (one per line) and every word in the preprocessed large-corpus documents as k-dimensional vectors, and inputting the word vectors of the context of a word w, together with the sentence vector of the sentence containing w, into the deep neural network model;
step S2012: summing the input vectors in the hidden layer of the deep neural network model, the accumulated vector serving as the input of the output layer;
step S2013: the output layer of the deep neural network model corresponds to a binary tree, namely a Huffman tree built with the words of the corpus as leaf nodes and each word's number of occurrences in the corpus as its weight; each word corresponds to one leaf node, each branch in the tree is treated as one binary classification, and the Label of the j-th tree node on the path from the root to the leaf node of word w is 1 - p_j, where p_j is the Huffman code of that node; apart from the root and leaf nodes, each tree node carries an auxiliary vector of the same length as the sentence vector, used to assist in training the model;
step S2014: continuously correcting the sentence vectors, word vectors, and auxiliary vectors by gradient ascent, finally obtaining the trained distributed memory model PV-DM of the sentence vectors;
s3: clustering sentence vectors and storing corresponding sentences as sub-topic documents;
s4: establishing a sentence relation graph model in each subtopic document;
s5, according to the relation graph model established in the step S4, sentence weight is calculated in each sub-topic document;
s6: the sentences are extracted and sorted to form the abstract.
2. The method for automatically extracting a multiple document abstract based on sentence vectors as claimed in claim 1, wherein the clustering generation of sentence vectors in step S3 adopts a spectral clustering method.
3. The sentence vector-based multi-document digest automatic extraction method of claim 2, wherein the step S3 comprises the steps of:
step S301: constructing a similarity matrix W among all sentence vectors, using a Gaussian kernel function as a kernel function,
step S302: calculating a Laplace matrix L;
step S303: constructing a standardized Laplace matrix;
step S304: calculating k minimum eigenvalues of the Laplace matrix and corresponding eigenvectors V;
step S305: arranging the characteristic vectors according to columns to form a characteristic matrix, unitizing each row in the characteristic matrix to form a matrix F, namely the modulus value of the vector formed by each row of the matrix F is 1;
step S306: treating each row of the matrix F as a k-dimensional sample and clustering with the K-means algorithm into C classes;
step S307: storing the sentences corresponding to the vectors of the C classes as C sub-topic documents.
4. The method for automatically extracting a multiple document abstract based on sentence vectors as claimed in claim 1, wherein the step S4 specifically comprises:
in each sub-topic document, a sentence relation graph model is established with sentences as nodes and the similarity between sentences as edge weights.
5. The method for automatically extracting a multi-document abstract based on sentence vectors as claimed in claim 1, wherein step S6 specifically comprises: extracting the sentence with the largest weight from each sub-topic document, sorting the extracted sentences according to their document positions recorded in step S102 of step S1, and combining them into the abstract.
6. A system for automatically extracting a multi-document abstract based on sentence vectors, characterized by comprising: a memory, a processor, and computer instructions stored on the memory and executed on the processor, wherein the computer instructions, when executed by the processor, perform the steps of the method of any one of claims 1-5.
7. A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, performs the steps of the method of any one of claims 1 to 5.
CN201810045090.4A 2018-01-17 2018-01-17 Multi-document abstract automatic extraction method and system based on sentence vectors Expired - Fee Related CN108090049B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810045090.4A CN108090049B (en) 2018-01-17 2018-01-17 Multi-document abstract automatic extraction method and system based on sentence vectors


Publications (2)

Publication Number Publication Date
CN108090049A CN108090049A (en) 2018-05-29
CN108090049B (en) 2021-02-05

Family

ID=62181661

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810045090.4A Expired - Fee Related CN108090049B (en) 2018-01-17 2018-01-17 Multi-document abstract automatic extraction method and system based on sentence vectors

Country Status (1)

Country Link
CN (1) CN108090049B (en)

Families Citing this family (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108897896B (en) * 2018-07-13 2020-06-02 深圳追一科技有限公司 Keyword extraction method based on reinforcement learning
CN108959269B (en) * 2018-07-27 2019-07-05 首都师范大学 A kind of sentence auto ordering method and device
CN109325109B (en) * 2018-08-27 2021-11-19 中国人民解放军国防科技大学 Attention encoder-based extraction type news abstract generating device
CN109582967B (en) * 2018-12-03 2023-08-18 深圳前海微众银行股份有限公司 Public opinion abstract extraction method, device, equipment and computer readable storage medium
CN110399606B (en) * 2018-12-06 2023-04-07 国网信息通信产业集团有限公司 Unsupervised electric power document theme generation method and system
CN109902284A (en) * 2018-12-30 2019-06-18 中国科学院软件研究所 A kind of unsupervised argument extracting method excavated based on debate
CN109902168B (en) * 2019-01-25 2022-02-11 北京创新者信息技术有限公司 Patent evaluation method and system
CN109885683B (en) * 2019-01-29 2022-12-02 桂林远望智能通信科技有限公司 Method for generating text abstract based on K-means model and neural network model
CN109829161B (en) * 2019-01-30 2023-08-04 延边大学 Method for automatically abstracting multiple languages
CN109977196A (en) * 2019-03-29 2019-07-05 云南电网有限责任公司电力科学研究院 A kind of detection method and device of magnanimity document similarity
CN110162778B (en) * 2019-04-02 2023-05-26 创新先进技术有限公司 Text abstract generation method and device
CN111914083B (en) * 2019-05-10 2024-07-09 腾讯科技(深圳)有限公司 Statement processing method, device and storage medium
CN110362823B (en) * 2019-06-21 2023-07-28 北京百度网讯科技有限公司 Training method and device for descriptive text generation model
US10902191B1 (en) * 2019-08-05 2021-01-26 International Business Machines Corporation Natural language processing techniques for generating a document summary
CN110717333B (en) * 2019-09-02 2024-01-16 平安科技(深圳)有限公司 Automatic generation method and device for article abstract and computer readable storage medium
CN110737768B (en) * 2019-10-16 2022-04-08 信雅达科技股份有限公司 Text abstract automatic generation method and device based on deep learning and storage medium
CN111813925A (en) * 2020-07-14 2020-10-23 混沌时代(北京)教育科技有限公司 Semantic-based unsupervised automatic summarization method and system
CN112784043B (en) * 2021-01-18 2024-05-10 辽宁工程技术大学 Aspect-level emotion classification method based on gating convolutional neural network
CN112949299A (en) * 2021-02-26 2021-06-11 深圳市北科瑞讯信息技术有限公司 Method and device for generating news manuscript, storage medium and electronic device
CN113220853B (en) * 2021-05-12 2022-10-04 燕山大学 Automatic generation method and system for legal questions
CN113836295B (en) * 2021-09-28 2024-07-19 平安科技(深圳)有限公司 Text abstract extraction method, system, terminal and storage medium


Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7398196B1 (en) * 2000-09-07 2008-07-08 Intel Corporation Method and apparatus for summarizing multiple documents using a subsumption model
CN101187919A (en) * 2006-11-16 2008-05-28 北大方正集团有限公司 Method and system for abstracting batch single document for document set
CN101231634A (en) * 2007-12-29 2008-07-30 中国科学院计算技术研究所 Autoabstract method for multi-document
CN104778157A (en) * 2015-03-02 2015-07-15 华南理工大学 Multi-document abstract sentence generating method
CN104834735A (en) * 2015-05-18 2015-08-12 大连理工大学 Automatic document summarization extraction method based on term vectors
CN107357899A (en) * 2017-07-14 2017-11-17 吉林大学 Based on the short text sentiment analysis method with product network depth autocoder

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Multi-document summarization method based on the PV_DM model; Liu Xin et al.; Computer Applications and Software; 2016-10-31; vol. 33, no. 10; p. 2 *

Also Published As

Publication number Publication date
CN108090049A (en) 2018-05-29

Similar Documents

Publication Publication Date Title
CN108090049B (en) Multi-document abstract automatic extraction method and system based on sentence vectors
CN113239181B (en) Scientific and technological literature citation recommendation method based on deep learning
CN108628823B (en) Named entity recognition method combining attention mechanism and multi-task collaborative training
CN109635280A (en) A kind of event extraction method based on mark
CN106599032B (en) Text event extraction method combining sparse coding and structure sensing machine
CN106980609A (en) A kind of name entity recognition method of the condition random field of word-based vector representation
CN110263325B (en) Chinese word segmentation system
CN106202010A (en) The method and apparatus building Law Text syntax tree based on deep neural network
CN110413768B (en) Automatic generation method of article titles
CN106844349A (en) Comment spam recognition methods based on coorinated training
CN111967258B (en) Method for constructing coreference resolution model, coreference resolution method and medium
CN112966525B (en) Law field event extraction method based on pre-training model and convolutional neural network algorithm
CN112818698B (en) Fine-grained user comment sentiment analysis method based on dual-channel model
CN113869053A (en) Method and system for recognizing named entities oriented to judicial texts
CN107506377A (en) This generation system is painted in interaction based on commending system
CN109344403A (en) A kind of document representation method of enhancing semantic feature insertion
CN111859961A (en) Text keyword extraction method based on improved TopicRank algorithm
CN117236338B (en) Named entity recognition model of dense entity text and training method thereof
CN114818717A (en) Chinese named entity recognition method and system fusing vocabulary and syntax information
CN113505583A (en) Sentiment reason clause pair extraction method based on semantic decision diagram neural network
CN115238026A (en) Medical text subject segmentation method and device based on deep learning
CN106886565A (en) A kind of basic house type auto-polymerization method
CN113934835B (en) Retrieval type reply dialogue method and system combining keywords and semantic understanding representation
CN115759119A (en) Financial text emotion analysis method, system, medium and equipment
CN116522165B (en) Public opinion text matching system and method based on twin structure

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee (granted publication date: 20210205)