CN113127632A - Text summarization method and device based on heterogeneous graph, storage medium and terminal - Google Patents
- Publication number
- CN113127632A (application CN202110533278.5A)
- Authority
- CN
- China
- Prior art keywords
- sentence
- text
- word
- heterogeneous
- vectors
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/34—Browsing; Visualisation therefor
- G06F16/345—Summarisation for human users
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/211—Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Machine Translation (AREA)
Abstract
The invention discloses a text summarization method and device based on heterogeneous graphs, a storage medium and a terminal, wherein the method comprises the following steps: performing knowledge fusion on a preset knowledge base and a target text, acquiring word features and sentence features of the target text, and constructing a text heterogeneous graph of the target text based on the word features and sentence features; updating the text heterogeneous graph through a graph attention network based on the edge weights and attention weights to obtain an updated text heterogeneous graph; calculating multi-class abstract indexes of the sentence vectors in the updated text heterogeneous graph, and calculating the classification weight of each sentence vector according to its corresponding multi-class abstract indexes; and weighting the sentence features in the updated text heterogeneous graph based on the classification weights of the sentence vectors, acquiring the corresponding sentence labels based on the weighted sentence features, and generating a text abstract according to the acquired sentence labels. In a more direct manner, the invention constructs a heterogeneous graph with sentences and words as two types of nodes, using word nodes as intermediaries between sentences, which enriches the associations among sentences and indirectly transfers information.
Description
Technical Field
The invention relates to the technical field of text generation, in particular to a method and a device for text summarization based on heterogeneous graphs, a storage medium and a terminal.
Background
The automatic generation of text summaries is an important task in the field of natural language processing; it aims to compress an original text into a short description containing its main content. Research falls into two main approaches: abstractive and extractive. Abstractive methods generate the abstract word by word after encoding the whole document, while extractive methods directly select sentences from the document and combine them into the abstract. Compared with abstractive methods, extractive summarization is more efficient and the generated abstract is more readable.
The key step in the extractive summarization task is to establish the relation between each sentence and the article. Most existing methods acquire sentence relations with a recurrent neural network (RNN), but such methods cannot capture the long-distance dependency relationships between sentences. Using a graph structure to represent the text is a more effective way to solve this problem, but how to reasonably model the text as a graph remains to be studied. Recently, graph neural networks (GNNs) have shown powerful feature-extraction capability for graph data, and text summarization methods based on graph neural networks have been proposed: some works use Rhetorical Structure Theory (RST) to decompose sentences into elementary discourse units (EDUs) and construct an RST structure tree, then use a graph convolution network (GCN) to complete the aggregation and update of graph information. Although the EDU-based approach achieves better results, the process of generating EDUs is complex, and only one type of node is used to construct the graph. The strength of association between sentences is particularly important for extractive summarization, but in current heterogeneous-graph work, edges are only added between nodes of different types, so sentences are not directly related.
Disclosure of Invention
The technical problem to be solved by the invention is that, in existing graph-neural-network-based abstract generation, the process of generating elementary discourse units is complex, only one type of node is used to construct the graph, and the association between sentences is weak, which is unfavorable for generating an extractive abstract.
In order to solve the technical problem, the invention provides a text summarization method based on heterogeneous graphs, which comprises the following steps:
performing knowledge fusion on a preset knowledge base and a target text, acquiring word features and sentence features of the target text, and constructing a text heterogeneous graph of the target text based on the word features and the sentence features;
updating the text heterogeneous graph through a graph attention network based on the edge weight and the attention weight to obtain an updated version of the text heterogeneous graph;
calculating multi-class abstract indexes of sentence vectors in the updated text heterogeneous graph, and calculating classification weights of corresponding sentence vectors according to the multi-class abstract indexes corresponding to each sentence vector;
and respectively weighting sentence features in the updated version text heterogeneous graph based on the classification weight of the sentence vector, acquiring corresponding sentence labels based on the weighted sentence features, and generating a text abstract according to the acquired sentence labels.
Preferably, the knowledge fusion of a preset knowledge base and a target text, and the acquiring of the word features and sentence features of the target text comprises:
respectively encoding and vectorizing knowledge in a preset knowledge base and content in a target text to acquire a knowledge vector in the preset knowledge base and a word vector in the target text;
respectively calculating, for each word vector in the target text, its attention weights over the knowledge vectors in the preset knowledge base, so as to obtain the attention weight of each word vector in the target text;
sequentially taking the attention weight of the word vector in the target text as a weight, and respectively weighting and combining the knowledge vectors in the preset knowledge base to obtain the knowledge weight of each word vector in the target text;
acquiring word features of corresponding word vectors based on knowledge weight of each word vector in the target text;
and respectively performing local feature capture and global feature capture on the word features of the word vectors contained in each sentence vector in the target text to obtain the local features and the global features of each sentence vector, and respectively obtaining the sentence features of the corresponding sentence vectors according to the local features and the global features of each sentence vector.
Preferably, constructing the text heterogeneous map of the target text based on the word features and the sentence features comprises:
based on sentence characteristics of sentence vectors in the target text, calculating the homogeneous edge weight between every two sentence vectors of all sentence vectors in the target text in a cosine similarity calculation mode;
calculating heterogeneous edge weights among all word vectors and sentence vectors to which the word vectors belong in the target text through a TF-IDF algorithm based on the word features of the word vectors in the target text and the sentence features of the sentence vectors to which the word vectors belong;
and constructing a text heterogeneous graph of the target text based on word features of the word vectors, sentence features of the sentence vectors, homogeneous edge weights among the sentence vectors and heterogeneous edge weights among the word vectors and the sentence vectors to which the word vectors belong.
Preferably, the step of updating the text heterogeneous map through the graph attention network based on the edge weight and the attention weight, and acquiring an updated text heterogeneous map includes:
calculating attention weights between every two sentence vectors of all sentence vectors in the target text, calculating attention weights between all word vectors in the target text and the sentence vectors to which the word vectors belong, and acquiring all homogeneous edge weights and all heterogeneous edge weights in the target text;
and updating all word nodes and all sentence nodes in the text heterogeneous graph through a graph attention network based on the attention weight between every two sentence vectors of all sentence vectors in the target text, the attention weight between all word vectors and the sentence vectors to which the word vectors belong, all homogeneous edge weights and all heterogeneous edge weights to obtain an updated version of the text heterogeneous graph.
Preferably, updating all word nodes and all sentence nodes in the text heterogeneous graph through the graph attention network based on the attention weight between every two sentence vectors of all sentence vectors in the target text, the attention weight between all word vectors and sentence vectors to which the word vectors belong, all homogeneous edge weights and all heterogeneous edge weights comprises:
taking word nodes as central nodes, taking the product of attention weight between sentence nodes connected with the central nodes and heterogeneous edge weight between sentence nodes connected with the central nodes as weight, and carrying out weighted aggregation on sentence characteristics of the sentence nodes connected with the central nodes to realize the updating of the word nodes;
taking sentence nodes as central nodes, taking the product of attention weight between word nodes connected with the central nodes and heterogeneous edge weight between the word nodes connected with the central nodes as weight, and carrying out weighted aggregation on word characteristics of the word nodes connected with the central nodes to realize the updating of the sentence nodes;
and taking sentence nodes as central nodes, taking the product of the attention weight between the word nodes connected with the central nodes and the homogeneous edge weight between the sentence nodes connected with the central nodes as a weight, and carrying out weighted aggregation on the sentence characteristics of the sentence nodes connected with the central nodes to realize the update of the sentence nodes.
Preferably, the step of calculating multiple types of abstract indexes of sentence vectors in the updated version text heterogeneous graph and calculating classification weights of corresponding sentence vectors according to the multiple types of abstract indexes corresponding to each sentence vector comprises:
calculating a relevance score, a redundancy score, a new information score and a recall-oriented evaluation metric (ROUGE) score of each sentence vector in the updated text heterogeneous graph;
and calculating the classification weight of the corresponding sentence vector through a Sigmoid function based on the relevance score, redundancy score, new information score and recall-oriented evaluation metric score of each sentence vector.
Preferably, the step of calculating the relevance score, the redundancy score, the new information score and the recall-oriented evaluation metric score of a single sentence vector in the updated text heterogeneous graph comprises the following steps:
calculating the relevance score of the sentence vector through a bilinear function based on the text feature of the updated text heterogeneous graph and the sentence feature of the sentence vector;
calculating the redundancy score of the sentence vector through a bilinear function based on the sentence feature of the sentence vector in the updated text heterogeneous graph;
calculating the new information score of the sentence vector through a bilinear function based on the sentence feature of the sentence vector in the updated text heterogeneous graph and the knowledge vectors in the preset knowledge base;
and calculating the recall-oriented evaluation metric score of the sentence vector through the recall-oriented evaluation metric function based on the target text before encoding and vectorization and the text content of the sentence vector.
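A hedged sketch of the multi-index scoring and classification-weight step might look as follows. The hidden size, random features, the self-bilinear form of the redundancy term, the placeholder recall-oriented (ROUGE-like) value, and the way the four scores are combined before the sigmoid are all illustrative assumptions; the patent does not specify the exact combination.

```python
import numpy as np

rng = np.random.default_rng(0)

def bilinear(a, W, b):
    """Bilinear score a^T W b, as used for the relevance/redundancy/novelty terms."""
    return float(a @ W @ b)

d = 8                          # hidden size (illustrative)
D = rng.normal(size=d)         # document (text) feature of the updated graph
K = rng.normal(size=d)         # pooled knowledge-base vector (assumption)
s = rng.normal(size=d)         # one updated sentence feature

W_rel, W_red, W_new = (rng.normal(size=(d, d)) for _ in range(3))

rel = bilinear(s, W_rel, D)    # relevance of the sentence to the whole document
red = bilinear(s, W_red, s)    # redundancy (self-similarity proxy, an assumption)
new = bilinear(s, W_new, K)    # new-information score against the knowledge base
rouge = 0.35                   # placeholder recall-oriented score (assumption)

# Classification weight: squash the combined indexes through a sigmoid,
# rewarding relevance/novelty/recall and penalizing redundancy (assumed form).
weight = 1.0 / (1.0 + np.exp(-(rel + new + rouge - red)))
assert 0.0 < weight < 1.0
```

The sigmoid guarantees a weight in (0, 1), which can then scale the sentence feature before the final classification, as the method describes.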
In order to solve the above technical problem, the present invention further provides a text summarization apparatus based on heterogeneous graphs, including:
the text heterogeneous graph building module is used for carrying out knowledge fusion on a preset knowledge base and a target text, acquiring word characteristics and sentence characteristics of the target text, and building a text heterogeneous graph of the target text based on the word characteristics and the sentence characteristics;
the updating module is used for updating the text heterogeneous graph through a graph attention network based on the edge weight and the attention weight to obtain an updated version of the text heterogeneous graph;
the classification weight acquisition module is used for calculating multi-class abstract indexes of the sentence vectors in the updated text heterogeneous graph and calculating the classification weight of each sentence vector according to its corresponding multi-class abstract indexes;
and the abstract generating module is used for weighting the sentence characteristics in the updated version text heterogeneous graph respectively based on the classification weight of the sentence vector, acquiring corresponding sentence labels based on the weighted sentence characteristics, and generating a text abstract according to the acquired sentence labels.
In order to solve the above technical problem, the present invention also provides a computer-readable storage medium having a computer program stored thereon, which when executed by a processor implements the heterogeneous graph-based text summarization method.
In order to solve the above technical problem, the present invention further provides a terminal, including: a processor and a memory;
the memory is used for storing computer programs, and the processor is used for executing the computer programs stored by the memory so as to enable the terminal to execute the text summarization method based on the heterogeneous graph.
Compared with the prior art, one or more embodiments in the above scheme can have the following advantages or beneficial effects:
the text summarization method based on the heterogeneous graph provided by the embodiment of the invention is applied to connect texts according to semantic and syntactic relations to construct the text heterogeneous graph, updates two types of node characteristics of words and sentences by combining a graph attention network, and designs a plurality of measure indexes related to summaries to perform weighted evaluation on the sentences for final summarization extraction, thereby not only considering information transfer between the words and the sentences, but also considering mutual influence between the sentences. The further added external knowledge base can better help the model to understand the text, corresponding weights can be added to sentences before classification aiming at multi-angle indexes designed by the abstract task, the utilization capacity of the model on text features is effectively improved, and then the abstract which is more accurate and higher in readability is generated.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by the practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the principles of the invention and not to limit the invention. In the drawings:
FIG. 1 is a flow chart illustrating a method for text summarization based on heterogeneous graphs according to an embodiment of the present invention;
FIG. 2 is a process diagram of a text summarization method based on heterogeneous graphs according to an embodiment of the present invention;
FIG. 3 is a schematic diagram illustrating a process of constructing a text heterogeneous graph in a text summarization method based on heterogeneous graphs according to an embodiment of the present invention;
FIG. 4 is a diagram illustrating a single-layer update process of a heterogeneous graph in a text summarization method based on heterogeneous graphs according to an embodiment of the present invention;
FIG. 5 is a diagram illustrating the results of an ablation experiment of a text summarization method based on heterogeneous graphs according to an embodiment of the present invention;
FIG. 6 is a schematic diagram illustrating experimental results on the CNN & DailyMail data set and comparative results against other summarization methods in the first embodiment of the present invention;
FIG. 7 is a diagram illustrating the influence of the multi-angle indicators on the abstract in a text summarization method based on heterogeneous graphs according to an embodiment of the present invention;
FIG. 8 is a diagram illustrating a multi-angle index quantization sample according to an embodiment of the present invention;
FIG. 9 is a schematic structural diagram of a text summarization apparatus based on heterogeneous graphs according to a second embodiment of the present invention;
FIG. 10 shows a schematic structural diagram of a terminal according to a fourth embodiment of the present invention.
Detailed Description
The following detailed description of the embodiments of the present invention will be provided with reference to the drawings and examples, so that how to apply the technical means to solve the technical problems and achieve the technical effects can be fully understood and implemented. It should be noted that, as long as there is no conflict, the embodiments and the features of the embodiments of the present invention may be combined with each other, and the technical solutions formed are within the scope of the present invention.
The automatic generation of text summaries is an important task in the field of natural language processing, and existing approaches are mainly of two kinds: abstractive and extractive. Abstractive methods generate the abstract word by word after encoding the whole document, while extractive methods directly select sentences from the document and combine them into the abstract. Compared with abstractive methods, extractive summarization is more efficient and more readable. The key step in the extractive summarization task is to establish the relation between each sentence and the article, and existing methods generally cannot capture the long-distance dependencies between sentences. Recently, graph neural networks (GNNs) have shown powerful feature-extraction capability for graph data, and graph-neural-network-based text summarization methods have been proposed. Although such abstract generation achieves good results, the process of generating elementary discourse units is complex, only one type of node is used to construct the graph, and edges are only added between different types of nodes, so the association between sentences is weak.
Example one
In order to solve the technical problems in the prior art, embodiments of the present invention provide a text summarization method based on heterogeneous graphs.
FIG. 1 is a flow chart illustrating a method for text summarization based on heterogeneous graphs according to an embodiment of the present invention; FIG. 2 is a process diagram of a text summarization method based on heterogeneous graphs according to an embodiment of the present invention. Referring to FIG. 1 and FIG. 2, the text summarization method based on heterogeneous graphs according to the embodiment of the present invention includes the following steps.
Further, in order to more clearly illustrate the specific implementation method of the method for abstracting text abstract based on heterogeneous graph of the present invention, the following definitions are made in advance:
definition of sentence sets and word sets: given a target text d containing m sentences and n words, S = {s_1, s_2, ..., s_m} is the sentence set of d, and W_i is the word set of sentence i.
The definition of the text graph is: G = {V, E} represents a graph, V represents the set of nodes, and E represents the set of edges. Since the heterogeneous graph used in the present invention contains two types of nodes, V can be divided into a word-node set and a sentence-node set. Therefore, the text graph TG = {V_TG, E_TG} is designed as a heterogeneous graph, wherein:
(1) V_TG = W ∪ S contains two types of nodes, where W = {W_1, W_2, ..., W_m} represents the collection of word sets and S = {s_1, s_2, ..., s_m} represents the sentence set.
(2) E_TG = E_heter ∪ E_homo, wherein E_heter = {(w_ij, s_i) | w_ij ∈ W_i} represents the heterogeneous edges and E_homo = {(s_i, s_j) | s_i, s_j ∈ S} represents the homogeneous edges.
(3) e_ij denotes the weight of heterogeneous edge (w_ij, s_i), and e'_ij denotes the weight of homogeneous edge (s_i, s_j).
Step S101, performing knowledge fusion on a preset knowledge base and a target text, acquiring word features and sentence features of the target text, and constructing a text heterogeneous graph of the target text based on the word features and the sentence features.
The method first performs knowledge fusion, i.e., an external knowledge base is used to enrich the word features with knowledge, so that the feature representation of the text is both semantics-aware and knowledge-aware. Specifically, a preset knowledge base is selected whose language must be the same as that of the target text: when the target text is Chinese, a Chinese knowledge base is selected; when the target text is English, an English knowledge base is selected. Secondly, in order to integrate the knowledge of the preset knowledge base into the word features of the target text, the knowledge in the selected knowledge base and the content of the target text are separately encoded and vectorized to obtain all knowledge vectors of the preset knowledge base and all word vectors of the target text. To simplify the description, d directly denotes the target text after encoding and vectorization, W denotes its set of word vectors, and w_i denotes a word vector in W; K denotes the preset knowledge base after encoding and vectorization, and k denotes a knowledge vector in K.
Next, the word feature of each word vector in the target text is obtained. The word feature of a word vector is computed as follows: the attention weights between word vector w_i and all knowledge vectors k in the preset knowledge base are calculated through a bilinear operation to obtain the attention weight β_i of word vector w_i:

β_i = BiLinear(K, W_KB, w_i)   (1)

where W_KB is a trainable weight parameter.
After the attention weight of each word vector in the target text is calculated, the attention weight of each word vector is taken in turn as the weight, and the knowledge vectors in the preset knowledge base are weighted and combined to obtain the knowledge weight of each word vector in the target text. The knowledge weight of a single word vector is obtained as:

knowledge = β_i K   (2)

At this point, the knowledge weight of the word vector already contains word-related knowledge.
The word feature of the corresponding word vector is then acquired based on the knowledge weight of each word vector in the target text, i.e., the knowledge weight of each word vector is concatenated with the corresponding word vector, w_k = [w, knowledge], giving a word feature that has both semantic perception and knowledge perception.
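The knowledge-fusion step above can be sketched as follows. This is a minimal NumPy illustration, assuming the bilinear attention K·W_KB·w followed by a softmax normalization; the dimensions, random vectors and the softmax are illustrative assumptions rather than the patent's exact parameterization.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8                                     # embedding size (illustrative)
n_words, n_know = 5, 10
W_words = rng.normal(size=(n_words, d))   # word vectors w_i of the target text
K = rng.normal(size=(n_know, d))          # knowledge vectors k of the knowledge base
W_KB = rng.normal(size=(d, d))            # trainable bilinear parameter (eq. 1)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

fused = []
for w in W_words:
    beta = softmax(K @ W_KB @ w)          # eq. (1): attention of w over knowledge vectors
    knowledge = beta @ K                  # eq. (2): attention-weighted combination
    fused.append(np.concatenate([w, knowledge]))  # w_k = [w, knowledge]
fused = np.stack(fused)
assert fused.shape == (n_words, 2 * d)
```

Each fused row doubles the embedding size because the word vector and its knowledge summary are concatenated, matching w_k = [w, knowledge].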
After the word features of all word vectors of the target text are obtained, the sentence features of all sentence vectors in the target text can be obtained. Specifically, local feature capture and global feature capture are respectively carried out on word features of word vectors contained in each sentence vector in the target text after encoding and vectorization so as to obtain the local features and the global features of each sentence vector, and then the sentence features of each sentence vector are respectively obtained according to the local features and the global features of each sentence vector. Wherein the capturing of local features is extracted by Convolutional Neural Network (CNN) and the capturing of global features is extracted by BiLSTM. Meanwhile, after the sentence characteristics of all sentence vectors in the target text after the encoding vectorization are obtained, the text characteristics of the target text after the encoding vectorization can also be obtained. The specific sentence features and text features are calculated as follows:
D = BiLSTM([s_1, ..., s_m])   (4).
after word features and sentence features in the target text after encoding vectorization are obtained, a text heterogeneous graph of the target text can be constructed through semantic grammar, and word-sentence heterogeneous edge weights and sentence-sentence homogeneous edge weights of the target text are obtained before the text heterogeneous graph. The construction process of the text heterogeneous graph is shown in fig. 3, and referring to fig. 3, the embodiment of the present invention regards the text abstract as a classification problem, and takes the sentences as the minimum units to be classified, so that the association relationship between the sentences is particularly important when generating the abstract. The homogeneous edge expression quantity request mode is as follows: based on sentence characteristics of sentence vectors in the target text after the code vectorization, calculating the homogeneous edge weight between every two sentence vectors of all the sentence vectors in the target text after the code vectorization in a cosine similarity calculation mode. In order to add more information related to the text, the embodiment calculates all word vectors in the target text and the heterogeneous edge weights among the sentence vectors to which the word vectors belong by the TF-IDF algorithm based on the word features of the word vectors in the target text and the sentence features of the sentence vectors to which the word vectors belong after encoding and vectorization.
And constructing a text heterogeneous graph of the target text based on word features of the word vectors, sentence features of the sentence vectors, homogeneous edge weights among the sentence vectors and heterogeneous edge weights among the word vectors and the sentence vectors to which the word vectors belong.
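The edge-weight construction just described can be sketched on a toy text. In the snippet below, bag-of-words vectors stand in for the learned sentence features when computing cosine similarities, and a simple TF-IDF variant (raw term frequency times log inverse document frequency) is assumed for the word-sentence edges; both choices are illustrative.

```python
import numpy as np
from collections import Counter
from math import log

sentences = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "stock prices rose sharply today".split(),
]
m = len(sentences)

# Sentence-sentence homogeneous edge weights e'_ij: cosine similarity of
# bag-of-words sentence vectors (stand-ins for the learned sentence features).
vocab = sorted({w for s in sentences for w in s})
def bow(s):
    c = Counter(s)
    return np.array([c[w] for w in vocab], dtype=float)
S = np.stack([bow(s) for s in sentences])
unit = S / np.linalg.norm(S, axis=1, keepdims=True)
homo = unit @ unit.T                        # e'_ij matrix

# Word-sentence heterogeneous edge weights e_ij via TF-IDF.
df = Counter(w for s in sentences for w in set(s))
hetero = {}                                 # (word, sentence index) -> e_ij
for i, s in enumerate(sentences):
    tf = Counter(s)
    for w in tf:
        hetero[(w, i)] = (tf[w] / len(s)) * log(m / df[w])

assert homo[0][1] > homo[0][2]  # sentences sharing words are more similar
assert hetero[("cat", 0)] > 0
```

Sentences 0 and 1 share words ("the", "cat"), so their homogeneous edge is heavier than the edge to the unrelated sentence 2, which is exactly the association the graph is meant to encode.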
Step S102, updating the text heterogeneous graph through the graph attention network based on the edge weights and the attention weights, and acquiring an updated text heterogeneous graph.
To further explain the updating process of the graph attention network on the text heterogeneous graph, the word features and sentence features of the word vectors in the encoded target text are expressed respectively as the hidden states H_w of the word nodes and H_s of the sentence nodes, and the text feature is expressed as H_D.
Fig. 4 is a schematic diagram illustrating a single-layer updating process of a text heterogeneous graph in a text summarization method based on a heterogeneous graph according to an embodiment of the present invention. Referring to fig. 4, the attention weight between every two sentence vectors of all sentence vectors in the target text is calculated, the attention weight between all word vectors and sentences to which the word vectors belong in the target text is calculated, and all homogeneous edge weights and all heterogeneous edge weights in the target text are obtained. The attention weight calculation method among sentence vectors is as follows, and the attention weight calculation method among the sentences to which the word vectors belong can also refer to the following formula:
the attention weight calculation method between two sentence vectors is as follows:
wherein h isiAnd hjRepresenting hidden states of two sentence nodes, Wa,Wq,Wk,WvFor trainable parameters, αijIs hiAnd hjAttention weight in between.
The update increment u_i is then calculated from the attention weights; it can be computed by equation (6):

u_i = σ( Σ_{j∈N_i} α_ij W_v h_j )   (6)

where u_i can represent either a word-node increment or a sentence-node increment, and N_i represents the set of neighbor nodes of node i.
In order to make the semantic associated information participate in the updating, the heterogeneous edge weight e is usedijAnd homogenous side weight e'ijAnd introducing, controlling the updating degree of the nodes from two aspects of semantic and attention models. The homogeneous edge weight and heterogeneous edge weight calculation method is as follows:
With these edge weights, equation (6) can be modified as follows:
The above is the process by which the graph attention network computes the update increment from the attention weights between sentence vectors, the attention weights between word vectors and the sentences they belong to, the homogeneous edge weights and the heterogeneous edge weights.
Updating the text heterogeneous graph through the graph attention network in fact means updating the word nodes and sentence nodes in the graph. More specifically, the graph attention network performs three update processes: sentence nodes updating word nodes, word nodes updating sentence nodes, and sentence nodes updating each other.
The update of word nodes by sentence nodes: taking a word node as the centre node, the product of the attention weight of each connected sentence node and the heterogeneous edge weight between them is used as the weight, and the sentence features of the connected sentence nodes are weighted and aggregated to update the word node. The update of sentence nodes by word nodes: taking a sentence node as the centre node, the product of the attention weight of each connected word node and the heterogeneous edge weight between them is used as the weight, and the word features of the connected word nodes are weighted and aggregated to update the sentence node. The mutual update between sentence nodes: taking a sentence node as the centre node, the product of the attention weight of each connected sentence node and the homogeneous edge weight between them is used as the weight, and the sentence features of the connected sentence nodes are weighted and aggregated to update the sentence node. At the sentence-node level, an LSTM can also be applied to update the text feature H_D. In every case the aggregation is a weighted sum using the corresponding attention weight and edge weight. To illustrate the above, the update procedure of the graph attention network at the t-th step is given below:
where GAT(G, H_s, H_w) denotes a graph attention update layer, G is the text graph, H_s is the sentence feature, used as the query matrix in the attention mechanism, and H_w is the word feature, used as the key and value matrices. The message passed from words to sentences is updated by a multi-layer perceptron (MLP). Preferably, the multi-layer perceptron contains two linear hidden layers.
After each update iteration, the text feature H_D is updated:
Through iterative updates of this homogeneous-heterogeneous graph structure based on the graph attention network (GAT), sentences acquire more cross-sentence information via the indirect connections provided by words, and the homogeneous edges between sentence vectors give sentences long-distance associations, providing more information for summary extraction.
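The three node-update processes described above share one pattern: the centre node aggregates neighbour features weighted by the product of attention weight and edge weight. A minimal sketch follows; how the increment is combined with the old state is an assumption (residual addition), since the patent only defines the weighted aggregation:

```python
import numpy as np

def update_node(h_center, neighbor_feats, attn, edge_w, W_v):
    """Update one centre node: the increment is the sum over connected
    neighbours of (attention weight x edge weight) x transformed feature."""
    increment = sum(a * e * (W_v @ h)
                    for h, a, e in zip(neighbor_feats, attn, edge_w))
    # Residual-style combination with the old state is an assumption;
    # the patent only specifies the weighted aggregation itself.
    return h_center + increment
```

For a word-node update, neighbor_feats are sentence features and edge_w the heterogeneous edge weights; for the mutual update between sentence nodes, edge_w are the homogeneous weights.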
Step S103, calculating multi-class abstract indexes of the sentence vectors in the updated text heterogeneous graph, and calculating the classification weight of each corresponding sentence vector according to its multi-class abstract indexes.
Specifically, a relevance score, a redundancy score, a new information score and a recall-oriented evaluation metric (Rouge) score are calculated for each sentence vector in the updated text heterogeneous graph; the classification weight of the corresponding sentence vector is then calculated through a Sigmoid function based on these four scores. In order to extract suitable sentences as the summary, the invention sets multi-angle sentence evaluation indexes, scoring each sentence from four angles: relevance (Rel), redundancy (Red), new information (Info) and the Rouge-F1 score. The scores are used to weight the sentence features, and the best N sentences are selected as the extraction result.
Relevance is a very intuitive metric: it represents how relevant a sentence is to the full text, and the higher its value, the better the sentence represents the topic of the article. The relevance score of a single sentence in the updated text heterogeneous graph is calculated through a bilinear function based on the text feature of the updated text heterogeneous graph and the sentence feature of the sentence vector.

Redundancy is a concept relative to relevance: a good summary not only matches the topic of the original text but also stays as concise as possible, i.e. its own redundancy is as low as possible. The redundancy score of a single sentence vector is calculated through a bilinear function based on the sentence features of the sentence vectors in the updated text heterogeneous graph.

Relevance is a standard that ignores background knowledge and other information sources, whereas the amount of new information is evaluated with background knowledge taken into account: a reader reads a summary hoping to learn something not known before, and this new knowledge is the new information. The new information score of a single sentence vector is calculated through a bilinear function based on the sentence feature of the sentence vector in the updated text heterogeneous graph and the knowledge vectors in the preset knowledge base.
Rouge is a machine scoring method commonly used in text summarization, and taking it as an evaluation index can further improve summary accuracy. The recall-oriented evaluation metric (Rouge) score of a single sentence vector in the updated text heterogeneous graph is calculated by the Rouge function based on the original target text (before encoding and vectorization) and the text content of the sentence vector.
The sentence s is scored by combining the four abstract indexes; the specific calculation formulas are as follows:
Rel = h_s W_rel H_D (15)
Red = h_s W_red A_s (16)
Info = h_s W_info H_k (17)
Rouge = R(s, ref) (18)
Score = Sigmoid(Rel - Red + Info + Rouge) (19)
where h_s is the feature vector of sentence s, H_D is the feature vector of the text it belongs to, H_k is the knowledge base encoding, s is the sentence itself, ref is the reference summary, W_rel, W_red and W_info are learnable parameters, and R is the Rouge calculation function; the calculation result of each index is processed by a Sigmoid function and used as the classification weight of the current sentence.
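Equations (15)-(19) can be sketched directly. The matrices W_rel, W_red, W_info, the sentence-set term A_s of equation (16) and the Rouge value of equation (18) are taken as given inputs here, since their construction lies outside this passage:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sentence_score(h_s, H_D, A_s, H_k, W_rel, W_red, W_info, rouge):
    """Combine the four indexes of equations (15)-(19) into one weight."""
    rel = h_s @ W_rel @ H_D     # relevance to the whole text, eq. (15)
    red = h_s @ W_red @ A_s     # redundancy term, eq. (16)
    info = h_s @ W_info @ H_k   # new information w.r.t. knowledge base, eq. (17)
    # rouge = R(s, ref) is computed externally on raw text, eq. (18)
    return sigmoid(rel - red + info + rouge)  # eq. (19)
```

Note that redundancy enters with a negative sign, so a highly redundant sentence receives a lower classification weight.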
Step S104, weighting the corresponding updated sentence features based on the classification weight of each sentence vector, acquiring corresponding sentence labels based on the weighted sentence features, and generating a text abstract according to the acquired sentence labels.
Specifically, after the classification weight of each sentence vector has been obtained in the steps above, the corresponding updated sentence features are weighted to obtain the weighted sentence features; the corresponding sentence labels are then obtained from the weighted sentence features through a perceptron classifier, and the text abstract is generated according to the obtained sentence labels. When selecting abstract sentences with the perceptron classifier, the best N sentences are chosen as the abstract, and a triple-blocking (trigram-blocking) strategy is used to reduce the redundancy of the abstract.
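The top-N selection with triple (trigram) blocking can be sketched as follows; the exact blocking rule is not spelled out in the text, so it is assumed here that overlap on any trigram already in the summary disqualifies a candidate sentence:

```python
def trigrams(tokens):
    """All consecutive 3-token tuples of a token list."""
    return {tuple(tokens[i:i + 3]) for i in range(len(tokens) - 2)}

def select_summary(sentences, scores, n=3):
    """Pick the top-n sentences by classification weight, skipping any
    sentence that repeats a trigram already present in the summary."""
    chosen, seen = [], set()
    for idx in sorted(range(len(sentences)), key=lambda i: -scores[i]):
        tg = trigrams(sentences[idx].split())
        if tg & seen:
            continue  # overlapping trigram: skip to reduce redundancy
        chosen.append(idx)
        seen |= tg
        if len(chosen) == n:
            break
    return [sentences[i] for i in sorted(chosen)]  # original document order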
In order to verify the effectiveness of the invention, the text summarization method based on heterogeneous graphs was compared experimentally with other methods; the effect of each summarization method was evaluated with the Rouge evaluation method, and the Rouge scores are shown in fig. 5. KHHGS denotes the text summarization method based on heterogeneous graphs.
In the comparative experiments, the CNN & Daily Mail dataset is used in the present invention, and comparison with other summarization methods gives the results shown in fig. 5. From these results it can be seen that the KHHGS model (the method of the present application) has obvious advantages over the RNN model BiGRU and is also superior to the Transformer model. From the data in the table, the Transformer score is close to that of the homogeneous graph method, indicating that the Transformer can be regarded as a fully connected graph at the sentence level. The KHHGS effect is also superior to the earlier heterogeneous graph text summarization model HSG, with the three Rouge indexes improved by 0.14/0.46/0.97 respectively, showing that the relevant strategies provided by the invention can effectively improve the effect of heterogeneous graph summarization models on the text summarization task.
In order to better explain the effectiveness of the relevant strategies on the text summarization task, ablation studies were performed on the model: the external knowledge base, the homogeneous graph and the abstract-index calculation module were removed in turn for experimental analysis. As shown in fig. 6, each entry in the table gives the experimental result after removing the corresponding module. Adding the knowledge base improves Rouge-2 and Rouge-L to some extent but does not obviously improve Rouge-1; the knowledge base used by the method is a general-purpose one, so its effect on the news text corpus is weak, and no mature news knowledge base is currently known. Adding the homogeneous graph clearly increases all Rouge indexes, especially the Rouge-L value, probably because it strengthens the links between sentences, letting the model exploit inter-sentence relations better, which affects the number of longest overlapping substrings in the final abstract extraction result. In addition, the multi-angle abstract indexes can effectively improve the effect. Further experiments examined the influence of the individual indexes on the abstract, as shown in fig. 7: the horizontal axis represents the range of a given index score, and the vertical axis represents the probability that a sentence whose score falls in that range belongs to the reference abstract. The figure shows that relevance and the Rouge score have a large influence on the abstract result, with sentences scoring high on these two indexes very likely to belong to the reference abstract. The amount of new information also has some influence on the abstract, though less pronounced: because the invention uses a general knowledge base as the background for computing new information, low-distinctiveness general knowledge is effectively filtered out of the content, similar to removing stop words during data processing, letting the model focus on other key sentences. Redundancy has no obvious influence: since the invention proposes a sentence-level abstract method while the Rouge score used for model evaluation is an abstract-level evaluation, evaluating the redundancy of a single sentence cannot effectively improve the final effect of the abstract method.
In order to express the role of the proposed indexes more intuitively, a test sample was selected and the indexes were quantified, as shown in fig. 8: each row of the table is a sentence of the original text, with the normalized score of each index and the total score, calculated by the formulas, on the right. Since the analysis above showed that redundancy plays no critical role in the model, it was not added to the quantification list. The data in the table show a certain relationship between sentence length and the relevance index: the longer a sentence, the higher its relevance to the original text and the more information it contains, so the longest sentence in the table, sentence 2, obtains the highest relevance score. At the same time, the content of sentence 2 is easy to judge: its description is very similar to that of the reference abstract, so sentence 2 also obtains a higher Rouge score and finally the highest overall score.
In summary, compared with commonly used existing summarization methods, the text summarization method based on heterogeneous graphs provided by the invention has great advantages in capturing long-distance dependency relationships.
The text summarization method based on heterogeneous graphs provided by this embodiment of the invention connects the text according to semantic and syntactic relations to construct a text heterogeneous graph, updates the two types of node features (words and sentences) with a graph attention network, and designs several abstract-related metric indexes to weight and evaluate sentences for final abstract extraction, thereby considering not only the information transfer between words and sentences but also the mutual influence between sentences. The additional external knowledge base helps the model better understand the text; the multi-angle indexes designed for the abstract task add corresponding weights to sentences before classification, effectively improving the model's use of text features and producing a more accurate and more readable abstract. More directly, the embodiment constructs a heterogeneous graph with sentences and words as two types of nodes and uses word nodes as intermediaries between sentences, enriching the associations between sentences and transferring information indirectly.
Example two
In order to solve the above technical problems in the prior art, an embodiment of the present invention provides a text summarization apparatus based on heterogeneous graphs.
FIG. 9 is a schematic structural diagram of a text summarization apparatus based on heterogeneous graphs according to the second embodiment of the present invention. Referring to fig. 9, the text summarization apparatus based on heterogeneous graphs of the present invention includes a text heterogeneous graph construction module, an update module, a classification weight acquisition module and an abstract generation module, which are connected in sequence.
The text heterogeneous graph building module is used for carrying out knowledge fusion on a preset knowledge base and the target text, obtaining word features and sentence features of the target text, and building a text heterogeneous graph of the target text based on the word features and the sentence features.
The update module is used for updating the text heterogeneous graph through the graph attention network based on the edge weights and the attention weights, to obtain an updated text heterogeneous graph.
The classification weight acquisition module is used for calculating multi-class abstract indexes of sentence vectors in the updated text heterogeneous graph and calculating the classification weight of each corresponding sentence vector according to its multi-class abstract indexes.
The abstract generating module is used for weighting sentence features in the updated text heterogeneous graph based on the classification weights of the sentence vectors, acquiring corresponding sentence labels based on the weighted sentence features, and generating the text abstract according to the acquired sentence labels.
The text summarization device based on heterogeneous graphs provided by this embodiment of the invention connects the text according to semantic and syntactic relations to construct a text heterogeneous graph, updates the two types of node features (words and sentences) with a graph attention network, and designs several abstract-related metric indexes to weight and evaluate sentences for final abstract extraction, thereby considering not only the information transfer between words and sentences but also the mutual influence between sentences. The additional external knowledge base helps the model better understand the text; the multi-angle indexes designed for the abstract task add corresponding weights to sentences before classification, effectively improving the model's use of text features and producing a more accurate and more readable abstract. More directly, the embodiment constructs a heterogeneous graph with sentences and words as two types of nodes and uses word nodes as intermediaries between sentences, enriching the associations between sentences and transferring information indirectly.
EXAMPLE III
To solve the above technical problems in the prior art, an embodiment of the present invention further provides a storage medium storing a computer program; when executed by a processor, the computer program can implement all the steps of the text summarization method based on heterogeneous graphs in the first embodiment.
The specific steps of the text summarization method based on heterogeneous graphs, and the beneficial effects obtained by applying the readable storage medium provided by this embodiment of the invention, are the same as those of the first embodiment and are not described here again.
It should be noted that: the storage medium includes: various media that can store program codes, such as ROM, RAM, magnetic or optical disks.
Example four
In order to solve the technical problems in the prior art, the embodiment of the invention also provides a terminal.
Fig. 10 is a schematic structural diagram of a terminal according to the fourth embodiment of the present invention. Referring to fig. 10, the terminal of this embodiment includes a processor and a memory connected to each other; the memory is used for storing a computer program, and the processor is used for executing the computer program stored in the memory, so that when the program is executed the terminal can implement all the steps of the text summarization method based on heterogeneous graphs in the first embodiment.
The specific steps of the text summarization method based on the heterogeneous graph and the beneficial effects obtained by the terminal applying the embodiment of the invention are the same as those of the first embodiment, and are not described herein again.
It should be noted that the memory may include a Random Access Memory (RAM) and may also include a non-volatile memory, such as at least one disk memory. Similarly, the processor may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP) and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components.
Although the embodiments of the present invention have been described above, the above description is only for the convenience of understanding the present invention, and is not intended to limit the present invention. It will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.
Claims (10)
1. A text summarization method based on heterogeneous graphs comprises the following steps:
performing knowledge fusion on a preset knowledge base and a target text, acquiring word features and sentence features of the target text, and constructing a text heterogeneous graph of the target text based on the word features and the sentence features;
updating the text heterogeneous graph through a graph attention network based on the edge weight and the attention weight to obtain an updated version of the text heterogeneous graph;
calculating multi-class abstract indexes of sentence vectors in the updated text heterogeneous graph, and calculating classification weights of corresponding sentence vectors according to the multi-class abstract indexes corresponding to each sentence vector;
and respectively weighting sentence features in the updated version text heterogeneous graph based on the classification weight of the sentence vector, acquiring corresponding sentence labels based on the weighted sentence features, and generating a text abstract according to the acquired sentence labels.
2. The method of claim 1, wherein the knowledge fusion of a preset knowledge base and a target text, and the obtaining of the word features and sentence features of the target text comprises:
respectively encoding and vectorizing knowledge in a preset knowledge base and content in a target text to acquire a knowledge vector in the preset knowledge base and a word vector in the target text;
respectively calculating the attention weight of each word vector in the target text and the attention weight of the knowledge vector in the preset knowledge base to obtain the attention weight of each word vector in the target text;
sequentially taking the attention weight of the word vector in the target text as a weight, and respectively weighting and combining the knowledge vectors in the preset knowledge base to obtain the knowledge weight of each word vector in the target text;
acquiring word features of corresponding word vectors based on knowledge weight of each word vector in the target text;
and respectively performing local feature capture and global feature capture on the word features of the word vectors contained in each sentence vector in the target text to obtain the local features and the global features of each sentence vector, and respectively obtaining the sentence features of the corresponding sentences according to the local features and the global features of each sentence vector.
3. The method of claim 2, wherein constructing a text heterogeneous graph of the target text based on the word features and sentence features comprises:
based on sentence characteristics of sentence vectors in the target text, calculating the homogeneous edge weight between every two sentence vectors of all sentence vectors in the target text in a cosine similarity calculation mode;
calculating heterogeneous edge weights among all word vectors and sentence vectors to which the word vectors belong in the target text through a TF-IDF algorithm based on the word features of the word vectors in the target text and the sentence features of the sentence vectors to which the word vectors belong;
and constructing a text heterogeneous graph of the target text based on word features of the word vectors, sentence features of the sentence vectors, homogeneous edge weights among the sentence vectors and heterogeneous edge weights among the word vectors and the sentence vectors to which the word vectors belong.
4. The method according to claim 3, wherein the step of updating the text heterogeneous graph through a graph attention network based on the edge weight and the attention weight comprises:
calculating attention weights between every two sentence vectors of all sentence vectors in the target text, calculating attention weights between all word vectors in the target text and the sentence vectors to which the word vectors belong, and acquiring all homogeneous edge weights and all heterogeneous edge weights in the target text;
and updating all word nodes and all sentence nodes in the text heterogeneous graph through a graph attention network based on the attention weight between every two sentence vectors of all sentence vectors in the target text, the attention weight between all word vectors and the sentence vectors to which the word vectors belong, all homogeneous edge weights and all heterogeneous edge weights to obtain an updated version of the text heterogeneous graph.
5. The method of claim 4, wherein updating all word nodes and all sentence nodes in the text heterogeneous graph through a graph attention network based on attention weights between every two sentence vectors of all sentence vectors in the target text, attention weights between all word vectors and sentence vectors to which the word vectors belong, all homogeneous edge weights and all heterogeneous edge weights comprises:
taking word nodes as central nodes, taking the product of attention weight between sentence nodes connected with the central nodes and heterogeneous edge weight between sentence nodes connected with the central nodes as weight, and carrying out weighted aggregation on sentence characteristics of the sentence nodes connected with the central nodes to realize the updating of the word nodes;
taking sentence nodes as central nodes, taking the product of attention weight between word nodes connected with the central nodes and heterogeneous edge weight between the word nodes connected with the central nodes as weight, and carrying out weighted aggregation on word characteristics of the word nodes connected with the central nodes to realize the updating of the sentence nodes;
and taking sentence nodes as central nodes, taking the product of the attention weight between the sentence nodes connected with the central nodes and the homogeneous edge weight between the sentence nodes connected with the central nodes as a weight, and carrying out weighted aggregation on the sentence characteristics of the sentence nodes connected with the central nodes to realize the update of the sentence nodes.
6. The method of claim 1, wherein calculating a plurality of classes of abstract indicators of sentence vectors in the updated text heterogeneous graph, and calculating classification weights of corresponding sentence vectors according to the plurality of classes of abstract indicators corresponding to each sentence vector respectively comprises:
calculating a relevance score, a redundancy score, a new information score and a recall rate evaluation-oriented metric score of each sentence vector in the updated text heterogeneous graph;
and calculating the classification weight of the corresponding sentence vector through a Sigmoid function based on the relevance score, the redundancy score, the new information score and the recall ratio evaluation-oriented metric score of each sentence vector.
7. The method of claim 6, wherein the step of calculating relevance scores, redundancy scores, new information scores, and recall-assessment-oriented metric scores for individual sentence vectors in the updated textual heterogeneous graph comprises:
calculating the relevance score of the sentence vector through a bilinear function based on the text features of the updated text heterogeneous graph and the sentence features of the sentence vector;
calculating a redundancy score of the sentence vector through a bilinear function based on the sentence characteristics of the sentence vector in the updated text heterogeneous graph;
calculating a new information quantity score of the sentence vector through a bilinear function based on the sentence characteristics of the sentence vector in the updated text heterogeneous graph and the knowledge vector in the preset knowledge base;
and calculating the recall-rate evaluation-oriented metric score of the sentence vector through the recall-rate evaluation-oriented metric function based on the target text which is not coded and vectorized and the text content of the sentence vector.
8. A text summarization device based on heterogeneous graphs is characterized by comprising:
the text heterogeneous graph building module is used for carrying out knowledge fusion on a preset knowledge base and a target text, acquiring word characteristics and sentence characteristics of the target text, and building a text heterogeneous graph of the target text based on the word characteristics and the sentence characteristics;
the updating module is used for updating the text heterogeneous graph through a graph attention network based on the edge weight and the attention weight to obtain an updated version of the text heterogeneous graph;
the classification weight acquisition module is used for calculating multi-class abstract indexes of sentence vectors in the updated text heterogeneous graph and calculating classification weights of corresponding sentence vectors according to the multi-class abstract indexes corresponding to each sentence vector;
and the abstract generating module is used for weighting the sentence characteristics in the updated version text heterogeneous graph respectively based on the classification weight of the sentence vector, acquiring corresponding sentence labels based on the weighted sentence characteristics, and generating a text abstract according to the acquired sentence labels.
9. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, implements the method for text summarization based on heterogeneous maps according to any one of claims 1 to 7.
10. A terminal, comprising: a processor and a memory;
the memory is used for storing computer programs, and the processor is used for executing the computer programs stored by the memory so as to enable the terminal to execute the text summarization method based on heterogeneous graphs according to any one of claims 1 to 7.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110533278.5A CN113127632B (en) | 2021-05-17 | 2021-05-17 | Text summarization method and device based on heterogeneous graph, storage medium and terminal |
PCT/CN2021/103504 WO2022241913A1 (en) | 2021-05-17 | 2021-06-30 | Heterogeneous graph-based text summarization method and apparatus, storage medium, and terminal |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113127632A true CN113127632A (en) | 2021-07-16 |
CN113127632B CN113127632B (en) | 2022-07-26 |
Family
ID=76783109
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110533278.5A Active CN113127632B (en) | 2021-05-17 | 2021-05-17 | Text summarization method and device based on heterogeneous graph, storage medium and terminal |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN113127632B (en) |
WO (1) | WO2022241913A1 (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113672706A (en) * | 2021-08-31 | 2021-11-19 | 清华大学苏州汽车研究院(相城) | Text abstract extraction method based on attribute heterogeneous network |
CN114091429A (en) * | 2021-10-15 | 2022-02-25 | 山东师范大学 | Text abstract generation method and system based on heterogeneous graph neural network |
CN114581540A (en) * | 2022-03-03 | 2022-06-03 | 上海人工智能创新中心 | Scene task processing method, device, equipment and computer readable storage medium |
CN114860920A (en) * | 2022-04-20 | 2022-08-05 | 内蒙古工业大学 | Method for generating monolingual subject abstract based on heteromorphic graph |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115906867B (en) * | 2022-11-30 | 2023-10-31 | 华中师范大学 | Test question feature extraction and knowledge point labeling method based on hidden knowledge space mapping |
CN116450813B (en) * | 2023-06-19 | 2023-09-19 | 深圳得理科技有限公司 | Text key information extraction method, device, equipment and computer storage medium |
CN117520995B (en) * | 2024-01-03 | 2024-04-02 | 中国海洋大学 | Abnormal user detection method and system in network information platform |
CN118069828B (en) * | 2024-04-22 | 2024-06-28 | 曲阜师范大学 | Article recommendation method based on heterogeneous graph and semantic fusion |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110046698A (en) * | 2019-04-28 | 2019-07-23 | 北京邮电大学 | Heterogeneous figure neural network generation method, device, electronic equipment and storage medium |
CN110334192A (en) * | 2019-07-15 | 2019-10-15 | 河北科技师范学院 | Text snippet generation method and system, electronic equipment and storage medium |
US20200210526A1 (en) * | 2019-01-02 | 2020-07-02 | Netapp, Inc. | Document classification using attention networks |
CN111858913A (en) * | 2020-07-08 | 2020-10-30 | 北京嘀嘀无限科技发展有限公司 | Method and system for automatically generating text abstract |
CN112084331A (en) * | 2020-08-27 | 2020-12-15 | 清华大学 | Text processing method, text processing device, model training method, model training device, computer equipment and storage medium |
CN112288091A (en) * | 2020-10-30 | 2021-01-29 | 西南电子技术研究所(中国电子科技集团公司第十研究所) | Knowledge inference method based on multi-mode knowledge graph |
CN112380435A (en) * | 2020-11-16 | 2021-02-19 | 北京大学 | Literature recommendation method and recommendation system based on heterogeneous graph neural network |
CN112464657A (en) * | 2020-12-07 | 2021-03-09 | 上海交通大学 | Hybrid text abstract generation method, system, terminal and storage medium |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9195941B2 (en) * | 2013-04-23 | 2015-11-24 | International Business Machines Corporation | Predictive and descriptive analysis on relations graphs with heterogeneous entities |
US10055486B1 (en) * | 2014-08-05 | 2018-08-21 | Hrl Laboratories, Llc | System and method for real world event summarization with microblog data |
CN109657051A (en) * | 2018-11-30 | 2019-04-19 | 平安科技(深圳)有限公司 | Text snippet generation method, device, computer equipment and storage medium |
CN110737768B (en) * | 2019-10-16 | 2022-04-08 | 信雅达科技股份有限公司 | Text abstract automatic generation method and device based on deep learning and storage medium |
CN112818113A (en) * | 2021-01-26 | 2021-05-18 | 山西三友和智慧信息技术股份有限公司 | Automatic text summarization method based on heteromorphic graph network |
2021
- 2021-05-17 CN CN202110533278.5A patent/CN113127632B/en active Active
- 2021-06-30 WO PCT/CN2021/103504 patent/WO2022241913A1/en active Application Filing
Non-Patent Citations (2)
Title |
---|
DANQING WANG: "Heterogeneous Graph Neural Networks for Extractive Document Summarization", 《arXiv》 *
WANG JUNLI: "A Survey of Research on Automatic Summarization Based on Graph Ranking Algorithms" (in Chinese), 《Computer Science (计算机科学)》 *
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113672706A (en) * | 2021-08-31 | 2021-11-19 | 清华大学苏州汽车研究院(相城) | Text abstract extraction method based on attribute heterogeneous network |
CN113672706B (en) * | 2021-08-31 | 2024-04-26 | 清华大学苏州汽车研究院(相城) | Text abstract extraction method based on attribute heterogeneous network |
CN114091429A (en) * | 2021-10-15 | 2022-02-25 | 山东师范大学 | Text abstract generation method and system based on heterogeneous graph neural network |
CN114581540A (en) * | 2022-03-03 | 2022-06-03 | 上海人工智能创新中心 | Scene task processing method, device, equipment and computer readable storage medium |
CN114581540B (en) * | 2022-03-03 | 2024-07-02 | 上海人工智能创新中心 | Scene task processing method, device, equipment and computer readable storage medium |
CN114860920A (en) * | 2022-04-20 | 2022-08-05 | 内蒙古工业大学 | Method for generating monolingual subject abstract based on heteromorphic graph |
Also Published As
Publication number | Publication date |
---|---|
WO2022241913A1 (en) | 2022-11-24 |
CN113127632B (en) | 2022-07-26 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN113127632B (en) | Text summarization method and device based on heterogeneous graph, storage medium and terminal | |
CN110909164A (en) | Text enhancement semantic classification method and system based on convolutional neural network | |
CN111159485B (en) | Tail entity linking method, device, server and storage medium | |
Xie et al. | Fast and accurate near-duplicate image search with affinity propagation on the ImageWeb | |
CN112819023A (en) | Sample set acquisition method and device, computer equipment and storage medium | |
CN112148831B (en) | Image-text mixed retrieval method and device, storage medium and computer equipment | |
CN111914950B (en) | Unsupervised cross-modal retrieval model training method based on depth dual variational hash | |
CN109325146A (en) | A kind of video recommendation method, device, storage medium and server | |
CN113971222A (en) | Multi-mode composite coding image retrieval method and system | |
CN111985228A (en) | Text keyword extraction method and device, computer equipment and storage medium | |
CN112287069A (en) | Information retrieval method and device based on voice semantics and computer equipment | |
CN116703531B (en) | Article data processing method, apparatus, computer device and storage medium | |
CN114492669B (en) | Keyword recommendation model training method, recommendation device, equipment and medium | |
CN115062134A (en) | Knowledge question-answering model training and knowledge question-answering method, device and computer equipment | |
CN117609479A (en) | Model processing method, device, equipment, medium and product | |
CN114282528A (en) | Keyword extraction method, device, equipment and storage medium | |
CN116186220A (en) | Information retrieval method, question and answer processing method, information retrieval device and system | |
CN117033646A (en) | Information query method, device, electronic equipment and computer readable storage medium | |
CN113157892A (en) | User intention processing method and device, computer equipment and storage medium | |
CN117931858B (en) | Data query method, device, computer equipment and storage medium | |
CN117938951B (en) | Information pushing method, device, computer equipment and storage medium | |
CN117648427A (en) | Data query method, device, computer equipment and storage medium | |
CN116630731A (en) | Model training method, object recognition method, related device and storage medium | |
CN116975315A (en) | Text matching method, device, computer equipment and storage medium | |
CN118230224A (en) | Label scoring method, label scoring model training method and device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||