CN114860920A - Method for generating monolingual subject abstract based on heteromorphic graph - Google Patents

Method for generating monolingual subject abstract based on heteromorphic graph

Info

Publication number
CN114860920A
CN114860920A (application CN202210416073.3A)
Authority
CN
China
Prior art keywords
word
sentence
nodes
embedding
language
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210416073.3A
Other languages
Chinese (zh)
Inventor
云静
郑博飞
焦磊
袁静姝
刘利民
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Inner Mongolia University of Technology
Original Assignee
Inner Mongolia University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Inner Mongolia University of Technology filed Critical Inner Mongolia University of Technology
Priority to CN202210416073.3A
Publication of CN114860920A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval of unstructured textual data
    • G06F16/34 - Browsing; Visualisation therefor
    • G06F16/345 - Summarisation for human users
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/205 - Parsing
    • G06F40/211 - Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/279 - Recognition of textual entities
    • G06F40/284 - Lexical analysis, e.g. tokenisation or collocates
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/279 - Recognition of textual entities
    • G06F40/289 - Phrasal analysis, e.g. finite state techniques or chunking
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Data Mining & Analysis (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Evolutionary Computation (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Machine Translation (AREA)

Abstract

A method for generating a monolingual topic summary based on a heterogeneous graph comprises: crawling a summarization data set in a source language from the web and performing word segmentation, sentence segmentation and labeling; using adversarial training to learn a linear mapping from the source-language space to the target-language space, obtaining word vectors of both languages in one shared vector space, and preprocessing the segmented and labeled data into vectors containing word nodes, sentence nodes and edge features, wherein the source language is the language of the data set to be summarized and the target language is a high-resource language; using a graph attention network to aggregate information over the vectors containing word nodes, sentence nodes and edge features, continually updating the word nodes and sentence nodes to obtain sentence nodes after information aggregation; and performing node classification on the aggregated sentence nodes, taking cross-entropy loss as the training objective, sorting by sentence score, and screening the sentence nodes suitable to serve as the summary.

Description

Method for generating a monolingual topic summary based on a heterogeneous graph
Technical Field
The invention belongs to the technical field of artificial intelligence, and particularly relates to a method for generating a monolingual topic summary based on a heterogeneous graph.
Background
With the rapid development of the internet, the flood of text data such as news makes it difficult for people to obtain the topic information of a text quickly. Moreover, the same news event is covered from different perspectives by different reporters, so a user needs help to grasp the overall picture of an event and to condense all reports into one main viewpoint; and a reader skimming a long news item in spare moments wants to know its main content quickly. How to obtain the core content of text information quickly is therefore particularly necessary and urgent under the current circumstances.
Existing GCN or GAT models already use heterogeneous graphs to realize monolingual summarization. Their drawback is that many relations between words are not considered, and most prior art targets English; for other languages, the corresponding word embeddings are lacking.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides a method for generating a monolingual topic summary based on a heterogeneous graph, which uses a multi-GCN to model the relations between words, such as the syntactic relation and the semantic relation; uses a GAN to generate word embeddings for other languages, so that the summarization problem for multiple languages can be addressed; and improves the accuracy of the generated summary by aggregating the node information of the heterogeneous graph with a graph attention network.
In order to achieve the purpose, the invention adopts the technical scheme that:
a method for generating a monolingual subject abstract based on an abnormal picture comprises the following steps:
step 1, crawling a summary data set of a source language from a network, and performing word segmentation, sentence segmentation and labeling operation, wherein the source language is the language of the data set needing to generate the summary;
step 2, using countertraining to learn a space linear mapping from a source language to a target language to obtain word vectors of the source language and the target language in the same shared vector space, and preprocessing data obtained by word segmentation, sentence segmentation and labeling operation to obtain vectors containing word nodes, sentence nodes and edge features, wherein the target language is a large language;
step 3, using a graph attention network to perform information aggregation on the vectors containing the word nodes, the sentence nodes and the edge features, and continuously updating the word nodes and the sentence nodes to obtain the sentence nodes after information aggregation;
and 4, carrying out node classification on the sentence nodes after information aggregation, taking cross entropy loss as a training target, sorting according to the scores of the sentences, and screening the sentence nodes suitable for serving as the abstract.
Compared with the prior art, the invention addresses the situation where crowded news text makes it difficult to obtain the main topic quickly and a complete picture of an event must be grasped fast: using the heterogeneous graph and the graph attention mechanism, each sentence aggregates the information of the words in the data set, and the word nodes are updated iteratively, so that the sentences of higher importance, namely the topic summary of the article, can be obtained. The method largely alleviates the problem of inaccurate summaries for long texts, improves the accuracy of extractive summarization, and helps users obtain news information quickly.
Drawings
FIG. 1 is a schematic overall flow diagram of the present invention.
FIG. 2 is a multi-GCN model diagram according to an embodiment of the present invention.
FIG. 3 is a diagram illustrating sentence node information updating according to an embodiment of the present invention.
Detailed Description
The embodiments of the present invention will be described in detail below with reference to the drawings and examples.
The invention discloses a method for generating a monolingual topic summary based on a heterogeneous graph, which, as shown in FIG. 1, comprises the following steps:
step 1, crawling a summary data set of a source language from a network.
The data set crawled from the web contains many invalid characters and errors, so the data must first be cleaned to ensure the correctness and completeness of the data set.
In this embodiment, Chinese is used as the source language. Word segmentation (Chinese must be segmented into words for the subsequent word embedding; languages already delimited by spaces do not need this operation), sentence segmentation and labeling are performed on the data in the summarization data set, wherein the source language is the language of the data set to be summarized.
For Chinese, word segmentation can be performed with the jieba library, abnormal characters and separators remaining after segmentation can be cleaned away, and sentence segmentation and labeling can then be performed on the segmented data set. A label can record that a summary sentence is the (i-1)-th sentence of the text, i.e. a 0-indexed sentence position.
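The sentence segmentation and labeling of step 1 can be sketched with the standard library alone; the punctuation set and the 0-indexed label format below are illustrative assumptions (in practice jieba would additionally segment Chinese into words):

```python
import re

def split_sentences(text):
    """Split Chinese text on end-of-sentence punctuation, keeping non-empty pieces."""
    return [p.strip() for p in re.split(r"[。！？]", text) if p.strip()]

def label_sentences(doc_sentences, summary_sentences):
    """Label each document sentence 1 if it also appears in the reference summary,
    so a summary sentence's 0-indexed position can be recorded as its label."""
    ref = set(summary_sentences)
    return [1 if s in ref else 0 for s in doc_sentences]
```

The binary labels produced this way are exactly the supervision needed for the cross-entropy objective of step 4.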
Step 2, preprocessing the data obtained by word segmentation, sentence segmentation and labeling to obtain vectors containing word nodes, sentence nodes and edge features.
In this step, the word-embedding operation is applied to the segmented and labeled data using the word vectors of the source language; sentence embeddings are obtained by integrating word embeddings; word embeddings serve as word nodes, sentence embeddings as sentence nodes, and the relations between word embeddings and sentence embeddings as edge features, from which the heterogeneous graph is constructed.
In order to make the method applicable to all languages, word vectors of the source language must be generated for word embedding. According to the characteristics of the heterogeneous graph, the graph structure is divided into word nodes, sentence nodes and edge features, and the data set undergoes four preprocessing steps to obtain the vectors containing word nodes, sentence nodes and edge features. The specific steps are as follows:
Step 2.1, in order to generate word vectors of the source language as the material required for the summarization model's word embedding, the method adopts a GAN. It uses one-to-one corresponding data sets of two languages (one of which is the source language and the other the target language, the target language generally being a high-resource language such as English or French) as the adversarial data set and, through adversarial training, learns a linear spatial mapping from the source language to the target language, obtaining word vectors of the source and target languages in one shared vector space. This step generates the word vectors of the source language through GAN training.
The mapping function W of the linear spatial mapping is obtained as:

    W* = argmin_{W ∈ M_d(R)} || W X - Y ||_F

where X denotes the word embeddings of the source language, Y the corresponding word embeddings of the target language, M_d(R) the set of real d × d matrices, ||·||_F the Frobenius-norm symbol, and W* the value of the mapping function when the Frobenius norm of WX - Y is minimal; the generator generates the mapping W from X to Y. The discriminator discriminates the difference between WX and the corresponding Y; through the constant contest of discriminator and generator, training proceeds until WX is so similar to Y that the discriminator cannot distinguish them.
The parameters of the discriminator being θ_D, the loss function of the discriminator is:

    L_D(θ_D | W) = -(1/n) Σ_{i=1}^{n} log P_{θ_D}(source = 1 | W x_i) - (1/m) Σ_{i=1}^{m} log P_{θ_D}(source = 0 | y_i)

where n is the number of words in the source language, m the number of words in the target language, x_i the word embedding of the i-th source-language word and y_i the word embedding of the i-th target-language word; P_{θ_D}(source = 1 | W x_i) is the probability the discriminator assigns to W x_i being a source-language embedding, and P_{θ_D}(source = 0 | y_i) the probability it assigns to y_i being a target-language embedding.
W is trained so that the discriminator cannot distinguish the mapped embeddings WX from Y; its loss function is:

    L_W(W | θ_D) = -(1/n) Σ_{i=1}^{n} log P_{θ_D}(source = 0 | W x_i) - (1/m) Σ_{i=1}^{m} log P_{θ_D}(source = 1 | y_i)

where P_{θ_D}(source = 0 | W x_i) is the probability the discriminator assigns to W x_i being a target-language embedding, and P_{θ_D}(source = 1 | y_i) the probability it assigns to y_i being a source-language embedding.
Given input samples, the discriminator and the mapping function W are updated in turn by stochastic gradient descent so that the sum of L_D(θ_D | W) and L_W(W | θ_D) is minimized; this alternating update is the standard adversarial-network training flow.
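The two adversarial losses can be sketched numerically as follows; the linear-logistic discriminator here is a stand-in assumption for illustration, not the embodiment's actual discriminator architecture:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, m = 4, 50, 50
X = rng.normal(size=(n, d))                      # source-language word embeddings
W_map = np.linalg.qr(rng.normal(size=(d, d)))[0] # an orthogonal mapping
Y = X @ W_map.T                                  # target embeddings: a rotation of X

def p_source(v, theta):
    """Toy logistic discriminator: probability that v is a (mapped) source embedding."""
    return 1.0 / (1.0 + np.exp(-(v @ theta)))

def L_D(theta, W):
    """Discriminator loss: be confident that WX is source-side and Y is target-side."""
    return (-np.mean(np.log(p_source(X @ W.T, theta) + 1e-9))
            - np.mean(np.log(1.0 - p_source(Y, theta) + 1e-9)))

def L_W(theta, W):
    """Mapping loss: fool the discriminator, i.e. flip both labels."""
    return (-np.mean(np.log(1.0 - p_source(X @ W.T, theta) + 1e-9))
            - np.mean(np.log(p_source(Y, theta) + 1e-9)))
```

With θ_D = 0 the discriminator outputs 0.5 everywhere and both losses equal 2 ln 2, the equilibrium value at which WX and Y are indistinguishable.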
In order to generate reliable matching pairs between the languages, this embodiment improves the retrieval criterion by adopting the CSLS (cross-domain similarity local scaling) method. Word vectors of the source and target languages in the same shared vector space are finally obtained, the nearest neighbors of any source-language word being its corresponding target-language words, and the generated word vectors of the source language serve as the material required for the summarization model's word embedding.
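The CSLS criterion can be sketched in NumPy: the plain cosine score is penalized by each word's average similarity to its k nearest neighbours in the other language, which suppresses hub words (the function name and the default k are illustrative assumptions):

```python
import numpy as np

def csls_scores(S, T, k=10):
    """S: mapped source embeddings (n x d); T: target embeddings (m x d).
    Returns the n x m matrix CSLS(s, t) = 2 cos(s, t) - r_T(s) - r_S(t)."""
    S = S / np.linalg.norm(S, axis=1, keepdims=True)
    T = T / np.linalg.norm(T, axis=1, keepdims=True)
    cos = S @ T.T                                        # all pairwise cosines
    k_s, k_t = min(k, T.shape[0]), min(k, S.shape[0])
    r_T = np.sort(cos, axis=1)[:, -k_s:].mean(axis=1)    # avg sim to k nearest targets
    r_S = np.sort(cos, axis=0)[-k_t:, :].mean(axis=0)    # avg sim to k nearest sources
    return 2 * cos - r_T[:, None] - r_S[None, :]
```

Taking the argmax of each row then yields the translation pair for each source word.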
Step 2.2, performing the word-embedding operation on the words in the data set using the word vectors generated in step 2.1.
Step 2.3, using the word embeddings generated in step 2.2, initializing the words of each sentence in the data set with a CNN + BiLSTM, capturing multiple relations between the words with the multi-GCN to obtain the sentence's word embeddings, and integrating the word embeddings to obtain the sentence embedding.
In this step, based on the word vectors of the source language, a convolutional neural network (CNN) captures the local n-gram features of each sentence, namely the joint probability of adjacent words. A bidirectional long short-term memory network (BiLSTM) then captures sentence-level features; the local n-gram features and the sentence-level features are concatenated to obtain context word embeddings, namely the initialization result, on which the multi-GCN captures multiple relations between words.
In particular, referring to FIG. 2: for the syntactic relation, A_r[w_i, w_j] = 1 when there is a dependency between the two words and A_r[w_i, w_j] = 0 when there is none; the semantic relation is constructed using the absolute value of the dot product between word embeddings:

    A_r[w_i, w_j] = | x_i^T x_j |

where A_r[w_i, w_j] denotes the syntactic or semantic relation between the i-th word w_i and the j-th word w_j, x_i^T is the transpose of the word vector of the i-th word, x_j is the word vector of the j-th word, and | x_i^T x_j | is the absolute value of their dot product, used to judge whether the two words are semantically similar.
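The two relation matrices of the multi-GCN follow directly from their definitions; in practice the dependency pairs would come from a syntactic parser, so here they are simply passed in (a minimal sketch):

```python
import numpy as np

def semantic_relation(Xw):
    """A_sem[i, j] = |x_i . x_j|: absolute dot product between word embeddings."""
    return np.abs(Xw @ Xw.T)

def syntactic_relation(n, dependencies):
    """A_syn[i, j] = 1 iff a dependency links word i and word j (symmetric)."""
    A = np.zeros((n, n))
    for i, j in dependencies:
        A[i, j] = A[j, i] = 1.0
    return A
```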
Next, the relation matrices A_r[w_i, w_j] are fused layer by layer; the l-th fusion layer computes the relation between a word and the words l positions away (for example, for the word sequence "I / like / eating / apples / and / like / playing / badminton", layer 3 captures the relation between "eating" and "apples"). The update function is defined as:

    H_r^(l) = σ( A_r H^(l-1) W_r^(l) + b_r^(l) ),    H^(l) = Σ_r H_r^(l)

where W_r^(l) and b_r^(l) denote the weight and the bias, H^(0) = X_w is the initial embedding after initialization, i.e. the context word embedding, H^(l-1) is the word embedding produced by layer l-1, H_r^(l) is the word embedding after a word fuses relation r at the l-th GCN layer, and H^(l) is the final word embedding of the l-th GCN layer. After several GCN layers the last updated result H is obtained, and the final word embedding is F_w = H + X_w; integrating the word embeddings of each sentence yields the sentence embedding F_s.
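One fusion layer of the multi-GCN and the residual word embedding F_w = H + X_w can be sketched as follows; the ReLU activation and the summation over relations are assumptions about details that appear only as figures in the original:

```python
import numpy as np

def multi_gcn_layer(A_list, H, W_list, b_list):
    """H_r = relu(A_r @ H @ W_r + b_r) for each relation r, summed into H^(l)."""
    out = np.zeros((H.shape[0], W_list[0].shape[1]))
    for A_r, W_r, b_r in zip(A_list, W_list, b_list):
        out += np.maximum(A_r @ H @ W_r + b_r, 0.0)
    return out

def final_word_embedding(A_list, Xw, layer_params):
    """Run several fusion layers, then add the residual: F_w = H + X_w."""
    H = Xw
    for W_list, b_list in layer_params:
        H = multi_gcn_layer(A_list, H, W_list, b_list)
    return H + Xw
```

Averaging the rows of F_w per sentence would then give the sentence embedding F_s.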
In this step the final word embeddings of all sentences and the corresponding sentence embeddings are obtained; the word embeddings serve as the word nodes of the heterogeneous graph structure and the sentence embeddings as its sentence nodes.
Step 2.4, using TF-IDF to represent the relation between words and sentences as the edge features of the graph structure.
In this step the TF-IDF value is injected into the edge features: the term frequency TF represents the number of occurrences of the i-th word w_i in the j-th sentence s_j, and the inverse document frequency IDF is an inverse function of how widely w_i occurs.
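The TF-IDF edge features can be sketched with the standard library; the exact IDF formula below (log of sentence count over sentence frequency) is an assumption, since the text only states that IDF is an inverse function of how widely the word occurs:

```python
import math
from collections import Counter

def tfidf_edges(sentences):
    """sentences: list of token lists. Returns the edge weight
    tf(w_i, s_j) * idf(w_i) for every (word, sentence) pair that occurs."""
    n = len(sentences)
    df = Counter(w for s in sentences for w in set(s))   # sentence frequency per word
    edges = {}
    for j, s in enumerate(sentences):
        for w, c in Counter(s).items():                  # c = term frequency in s_j
            edges[(w, j)] = c * math.log(n / df[w])
    return edges
```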
Step 3, using the graph attention network to aggregate information over the vectors containing word nodes, sentence nodes and edge features, and continually updating the word nodes and sentence nodes to obtain the sentence nodes after information aggregation. This comprises the following steps:
Step 3.1, modifying the GAT (graph attention network) by combining a multi-head attention mechanism with residual connections.
Specifically, in this step the graph attention network takes the graph convolutional neural network as its basic framework, introduces an attention mechanism into that framework, and adds residual connections; this embodiment adopts multi-head attention. The attention mechanism is introduced to collect and aggregate the feature representations of nearby neighbor nodes; multi-head attention plays an integrating role and prevents overfitting; and the residual connections prevent vanishing gradients during the iterations of node-information aggregation. The specific modification process is as follows:
With the attention mechanism introduced, the word-node and sentence-node features are denoted F_w and F_s respectively, the node features are F_w ∪ F_s, the edge features are denoted E, the graph built from the node and edge features is denoted G, and the graph attention network is used to update the representations of the semantic nodes.
Let h_i denote the hidden state of an output node; the attention layer is designed as:

    z_ij = LeakyReLU( W_a [ W_q h_i ; W_k h_j ] )
    α_ij = exp(z_ij) / Σ_{l ∈ N_i} exp(z_il)
    u_i = σ( Σ_{j ∈ N_i} α_ij W_v h_j )

where W_a, W_q, W_k, W_v are trainable weights and α_ij is the attention weight between h_i and h_j, which with K-head attention is expressed as:

    u_i = ‖_{k=1}^{K} σ( Σ_{j ∈ N_i} α_ij^k W_v^k h_j )

To prevent the gradient from vanishing after multiple iterations of information aggregation, a residual connection is added, so the final output is represented as:

    h'_i = u_i + h_i

Then the graph attention network is further modified: the scalar edge weight e_ij is mapped into a multi-dimensional embedding space, and the attention-layer formula is modified to:

    z_ij = LeakyReLU( W_a [ W_q h_i ; W_k h_j ; e_ij ] )

Finally, a position-wise feed-forward layer is also added after the attention layer.
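A single-head NumPy sketch of the modified attention layer, including the edge-feature term and the residual connection; σ is taken as the identity here, and W_v is assumed to preserve the node dimension so the residual is well-defined:

```python
import numpy as np

def leaky_relu(z, slope=0.2):
    return np.where(z > 0, z, slope * z)

def gat_layer(Hq, Hkv, Wq, Wk, Wv, wa, E=None, We=None):
    """z_ij = LeakyReLU(wa . [Wq h_i ; Wk h_j ; e_ij]); softmax over j;
    u_i = sum_j alpha_ij Wv h_j; output h'_i = u_i + h_i (residual)."""
    q, k, v = Hq @ Wq.T, Hkv @ Wk.T, Hkv @ Wv.T
    n, m = q.shape[0], k.shape[0]
    pair = [np.repeat(q[:, None, :], m, axis=1),
            np.repeat(k[None, :, :], n, axis=0)]
    if E is not None:
        pair.append(E @ We.T)                  # scalar edge weight mapped to an embedding
    z = leaky_relu(np.concatenate(pair, axis=-1) @ wa)
    alpha = np.exp(z - z.max(axis=1, keepdims=True))
    alpha /= alpha.sum(axis=1, keepdims=True)  # attention weights alpha_ij
    return alpha @ v + Hq                      # residual connection
```

With the attention vector wa set to zero the weights are uniform, so each output row is the mean of the value vectors plus the residual.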
This step describes the formal process by which the model grows from the GCN, via the attention mechanism, into the GAT. The result of information aggregation is a new representation of each sentence node's feature vector; in step 4 these feature vectors are scored against the important vocabulary (keywords) by a similarity calculation, and the higher the score, the better the sentence serves as a summary candidate.
Step 3.2, updating the sentence nodes with the network modified in step 3.1.
Updating word nodes and sentence nodes with the graph attention network proceeds as follows:

    U_{w→s}^1 = GAT( H_s^0, H_w^0, H_w^0 )
    H_s^1 = FFN( U_{w→s}^1 + H_s^0 )

where U_{w→s}^1 is the word-level information aggregated by each sentence, H_s^1 denotes the sentence nodes updated with the word nodes, and GAT(H_s^0, H_w^0, H_w^0) denotes a calculation of the attention mechanism in which H_s^0 = F_s is the query and H_w^0 = F_w is the key and value.
Then a new representation of the word nodes is obtained from the updated sentence nodes, and the sentence nodes are further updated iteratively. Each iteration involves a sentence-to-word and a word-to-sentence update process. The t-th iteration can be expressed as:

    U_{s→w}^t = GAT( H_w^{t-1}, H_s^{t-1}, H_s^{t-1} )
    H_w^t = FFN( U_{s→w}^t + H_w^{t-1} )
    U_{w→s}^t = GAT( H_s^{t-1}, H_w^t, H_w^t )
    H_s^t = FFN( U_{w→s}^t + H_s^{t-1} )

where U_{w→s}^t represents the word-level information aggregated by each sentence at the t-th iteration, H_s^t denotes the sentence nodes updated with the word nodes at the t-th iteration, H_w^t and H_s^{t-1} are the key and value of the attention mechanism at the t-th iteration, and the key and value of the attention mechanism are refreshed through the feed-forward layer FFN; FFN is a feed-forward network and GAT is the graph attention network.
referring to fig. 3, the processing steps of updating sentence nodes (each iteration of the sentence nodes is to update the sentence nodes, that is, the information contained in the nodes is updated, by calculating the feature vector through query, key and value at the GAT to obtain a new feature vector) are as follows:
(1) each sentence s in the document i Aggregating the contained word-level information;
(2) by the word w i The sentence s i The new representation of (2) updates the sentence node. Since the characteristic vector of the sentence node is the feature of the word nodeAnd if the feature vectors of the word nodes are updated after the vectors are added, the feature vectors of the sentence nodes are also updated synchronously. Thus, the sentence nodes can be updated with the new representation of the words contained in the sentence (i.e., the new feature vectors after the feature vector update).
Through this step a new representation of each sentence node's feature vector is obtained; the score calculation (namely a similarity calculation) is carried out in step 4, and a high score marks the sentence as a summary candidate.
Step 4, performing node classification on the sentence nodes after information aggregation, taking cross-entropy loss as the training objective, sorting by sentence score, and screening the sentence nodes suitable to serve as the summary. The specific steps are as follows:
(1) score and rank the updated sentence nodes, specifically:
1) linearly transform each sentence-node feature vector into a probability of appearing in the summary (concretely, the more keywords a sentence contains, the higher its score and the higher its probability of serving as summary; this probability is also related to the edge-feature vector obtained from TF-IDF);
2) sort by probability and select the top k as the summary;
3) discard sentences that share duplicate triples (trigrams) with higher-ranked sentences.
(2) remove the low-scoring sentences and keep the high-scoring ones as key sentences;
(3) among the key sentences, remove later-ranked sentences whose semantics duplicate those of an earlier sentence or that repeat too many of its keywords;
(4) extract the final summary.
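The ranking-and-filtering stage of step 4 can be sketched as follows; the trigram test implements the duplicate-triple filter, while the linear scoring layer is left abstract (scores are passed in, and the function name is illustrative):

```python
def select_summary(sentences, scores, k=3):
    """Rank sentences by score and keep the top k, skipping any candidate whose
    trigrams duplicate an already-selected sentence (the duplicate-triple filter)."""
    def trigrams(tokens):
        return {tuple(tokens[i:i + 3]) for i in range(len(tokens) - 2)}
    chosen, seen = [], set()
    for idx in sorted(range(len(sentences)), key=lambda i: -scores[i]):
        tg = trigrams(sentences[idx])
        if tg & seen:
            continue            # shares a trigram with a higher-ranked pick
        chosen.append(idx)
        seen |= tg
        if len(chosen) == k:
            break
    return sorted(chosen)       # restore document order for the final summary
```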
In one embodiment of the invention, the hardware is a computer with the following configuration. Hardware environment: CPU: Intel Core processor (3.1 GHz)/4.5 GHz/8 GT; GPU: 6 × 16G TESLA-P100_4096b_P_CAC; memory: 16 × 32G ECC Registered DDR4 2666. Software environment: operating system: Ubuntu 16.04; deep-learning framework: PyTorch; language and development environment: Python 3.6, Anaconda 3.
In this embodiment, the NLPCC 2017 evaluation data set is used as the Chinese analysis object and the CNN/DM data set as the English one. The Chinese data undergo word segmentation, sentence segmentation and the other operations of the steps above, while the English data only need sentence segmentation. Word-node features, sentence-node features and edge features are extracted from the processed data sets, sentence features are aggregated, word-node information is aggregated, the sentence nodes are updated, the final sentence nodes are ranked, the suitable summary sentences are screened out, and the final Chinese and English summaries are obtained.
Table 1 shows part of the data sets in the two languages, Chinese and English, after word segmentation, sentence segmentation and the other operations:
TABLE 1
(Table 1 is reproduced only as an image in the original publication.)
Table 2 below shows the results of feature extraction, sentence-node aggregation and updating on the data set of Table 1, and the sentences selected as suitable for the summary:
TABLE 2
(Table 2 is reproduced only as an image in the original publication.)

Claims (9)

1. A method for generating a monolingual topic summary based on a heterogeneous graph, characterized by comprising the following steps:
step 1, crawling a summarization data set in a source language from the web, and performing word segmentation, sentence segmentation and labeling, wherein the source language is the language of the data set to be summarized;
step 2, using adversarial training to learn a linear mapping from the source-language space to the target-language space to obtain word vectors of the source and target languages in one shared vector space, and preprocessing the segmented and labeled data into vectors containing word nodes, sentence nodes and edge features, wherein the target language is a high-resource language;
step 3, using a graph attention network to aggregate information over the vectors containing word nodes, sentence nodes and edge features, continually updating the word nodes and sentence nodes to obtain the sentence nodes after information aggregation;
and step 4, performing node classification on the sentence nodes after information aggregation, taking cross-entropy loss as the training objective, sorting by sentence score, and screening the sentence nodes suitable to serve as the summary.
2. The method for generating a monolingual topic summary based on a heterogeneous graph as claimed in claim 1, wherein in step 2 the word-embedding operation is performed on the segmented and labeled data using word vectors of the source language, word embeddings are integrated to obtain sentence embeddings, word embeddings serve as word nodes, sentence embeddings as sentence nodes, and the relations between word embeddings and sentence embeddings as edge features, so as to construct the heterogeneous graph.
3. The method for generating a monolingual subject abstract based on a heterogeneous graph according to claim 2, wherein the adversarial training employs a GAN, the data sets used for the adversarial training are in one-to-one correspondence, and the mapping function of the spatial linear mapping is:

W^* = \operatorname{argmin}_{W \in \mathbb{R}^{d \times d}} \| WX - Y \|_F

wherein X denotes the word embeddings of the source language, Y denotes the word embeddings of the target language corresponding to X, the generator produces the mapping W from X to Y, \mathbb{R}^{d \times d} is the set of real matrices of dimension d × d, \| \cdot \|_F is the Frobenius-norm symbol, and W^* is the value of the mapping function when the Frobenius norm of WX − Y is minimal; the discriminator distinguishes WX from the corresponding Y, and through the continuous adversarial game between the discriminator and the generator, training proceeds until the discriminator can no longer tell them apart, thereby ensuring that WX and Y are similar;

the parameters of the discriminator are θ_D, and the loss function of the discriminator is:

L_D(\theta_D \mid W) = -\frac{1}{n} \sum_{i=1}^{n} \log P_{\theta_D}(\mathrm{source}=1 \mid W x_i) - \frac{1}{m} \sum_{i=1}^{m} \log P_{\theta_D}(\mathrm{source}=0 \mid y_i)

wherein n is the number of words in the source language, m is the number of words in the target language, x_i denotes the word embedding of the i-th word in the source language, and y_i denotes the word embedding of the i-th word in the target language; P_{\theta_D}(\mathrm{source}=1 \mid W x_i) is the probability that the discriminator believes W x_i is a source-language embedding, and P_{\theta_D}(\mathrm{source}=0 \mid y_i) is the probability that the discriminator believes y_i is a target-language embedding;

W is trained so that the discriminator cannot distinguish WX from Y; its loss function is:

L_W(W \mid \theta_D) = -\frac{1}{n} \sum_{i=1}^{n} \log P_{\theta_D}(\mathrm{source}=0 \mid W x_i) - \frac{1}{m} \sum_{i=1}^{m} \log P_{\theta_D}(\mathrm{source}=1 \mid y_i)

wherein P_{\theta_D}(\mathrm{source}=0 \mid W x_i) is the probability that the discriminator believes W x_i is a target-language embedding, and P_{\theta_D}(\mathrm{source}=1 \mid y_i) is the probability that the discriminator believes y_i is a source-language embedding;

given input samples, the discriminator and the mapping function W are updated in turn by stochastic gradient descent so that the sum of L_D(\theta_D \mid W) and L_W(W \mid \theta_D) is minimized;

finally, word vectors of the source language and the target language are obtained in the same shared vector space, in which the nearest neighbors of any source-language word are the corresponding words of the target language.
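The two adversarial losses of claim 3 can be sketched numerically as follows. This is an illustrative sketch only: the claim does not specify the discriminator's form, so a toy logistic discriminator (`sigmoid(v @ theta)`) is assumed here purely to make the losses computable.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def p_source(v, theta):
    """Toy logistic discriminator: P_theta(source = 1 | v)."""
    return sigmoid(v @ theta)

def loss_D(W, X, Y, theta):
    """Discriminator loss L_D(theta | W): reward calling Wx_i
    'source' and y_i 'target'."""
    p_src = p_source(X @ W.T, theta)   # mapped source embeddings
    p_tgt = p_source(Y, theta)         # target embeddings
    return -np.log(p_src).mean() - np.log(1.0 - p_tgt).mean()

def loss_W(W, X, Y, theta):
    """Mapping loss L_W(W | theta): reward fooling the discriminator."""
    p_src = p_source(X @ W.T, theta)
    p_tgt = p_source(Y, theta)
    return -np.log(1.0 - p_src).mean() - np.log(p_tgt).mean()

rng = np.random.default_rng(0)
d, n, m = 4, 5, 6
X, Y = rng.normal(size=(n, d)), rng.normal(size=(m, d))
W, theta = np.eye(d), rng.normal(size=d)
ld, lw = loss_D(W, X, Y, theta), loss_W(W, X, Y, theta)
```

In the full method, gradient steps on `theta` (minimizing `loss_D`) alternate with gradient steps on `W` (minimizing `loss_W`), per the claim's stochastic-gradient-descent description.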
4. The method as claimed in claim 3, wherein, based on the word vectors of the source language, a convolutional neural network is first used to capture the local n-gram features of each sentence, a bidirectional long short-term memory network is then used to capture sentence-level features, and the local n-gram features are concatenated with the sentence-level features to obtain contextual word embeddings; a multi-GCN is then used to capture multiple relationships between words; for the syntactic relationship, A_r[w_i, w_j] = 1 when a dependency exists between w_i and w_j, and A_r[w_i, w_j] = 0 when no dependency exists; the semantic relationship is constructed using the absolute value of the dot product between word embeddings:

A_r[w_i, w_j] = \left| x_i^{\top} x_j \right|

wherein A_r[w_i, w_j] denotes the syntactic or semantic relationship between the i-th word w_i and the j-th word w_j, x_i^{\top} is the transpose of the word vector of the i-th word, x_j is the word vector of the j-th word, and \left| x_i^{\top} x_j \right|, the absolute value of their dot product, is used to judge whether the two words are similar.
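The two relation matrices of claim 4 can be sketched directly. A minimal NumPy sketch; the function names and the symmetric treatment of dependencies are assumptions for illustration.

```python
import numpy as np

def syntactic_adjacency(n_words, dependencies):
    """A[i, j] = 1 when a dependency links w_i and w_j, else 0
    (treated here as symmetric)."""
    A = np.zeros((n_words, n_words))
    for i, j in dependencies:
        A[i, j] = A[j, i] = 1.0
    return A

def semantic_adjacency(X):
    """A[i, j] = |x_i . x_j|: the absolute dot product of claim 4,
    computed for all word pairs at once."""
    return np.abs(X @ X.T)

X = np.array([[1.0, 0.0],   # w_0
              [0.0, 2.0],   # w_1 (orthogonal to w_0)
              [1.0, 1.0]])  # w_2
A_syn = syntactic_adjacency(3, [(0, 1)])
A_sem = semantic_adjacency(X)
```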
5. The method for generating a monolingual subject abstract based on a heterogeneous graph as claimed in claim 4, wherein the relationships A_r[w_i, w_j] are fused; the fusion at the l-th layer computes the relationship between each word and the i-th word, and the update function is defined as:

h_i^{(l)} = \sigma\Big( \sum_{j} A_r[w_i, w_j] \, W_r^{(l)} h_j^{(l-1)} + b_r^{(l)} \Big)

h^{(l)} = \mathrm{GCN}^{(l)}\big( A_r, h^{(l-1)} \big)

wherein W_r^{(l)} and b_r^{(l)} denote the weight and the bias; h^{(0)} = X_w is the initial embedding after initialization, i.e. the contextual word embedding; h^{(l-1)} denotes the word embedding of a word obtained at layer l − 1; \mathrm{GCN}^{(l)} denotes the l-th GCN layer; h_i^{(l)} denotes the word embedding of a word after fusing the relationship with the i-th word; and h^{(l)} denotes the word embedding finally produced by the l-th GCN layer. After passing through several GCN layers, the final updated result H is obtained, and the final word embedding is F_w = H + X_w; the word embeddings of a sentence are aggregated to obtain the sentence embedding F_s.
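The stacked-GCN fusion of claim 5, with its residual F_w = H + X_w, can be sketched as follows. This is a simplified sketch under stated assumptions: relation matrices are simply summed at each layer, ReLU stands in for σ, and mean pooling stands in for the unspecified sentence aggregation.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def multi_gcn(X_w, adjs, weights, biases):
    """Stacked GCN layers over several relation matrices, followed by
    the residual connection F_w = H + X_w from claim 5."""
    H = X_w
    for W_l, b_l in zip(weights, biases):
        msg = sum(A @ H for A in adjs)   # aggregate over all relations
        H = relu(msg @ W_l + b_l)        # one GCN layer
    return H + X_w                       # residual: F_w = H + X_w

rng = np.random.default_rng(1)
n, d, layers = 4, 3, 2
X_w = rng.normal(size=(n, d))            # contextual word embeddings
adjs = [np.eye(n), rng.random((n, n))]   # two toy relation matrices
weights = [rng.normal(size=(d, d)) for _ in range(layers)]
biases = [np.zeros(d) for _ in range(layers)]
F_w = multi_gcn(X_w, adjs, weights, biases)
F_s = F_w.mean(axis=0)                   # sentence embedding by pooling
```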
6. The method of claim 5, wherein TF-IDF values are injected into the edge features: the term frequency TF represents the number of times the i-th word w_i occurs in the j-th sentence s_j, and the inverse document frequency IDF is an inverse function of the number of sentences in which w_i occurs.
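The TF-IDF edge feature of claim 6 can be sketched as follows, treating each sentence as a "document". The logarithmic IDF used here is a common convention assumed for the example; the claim only states that IDF is an inverse function of the word's occurrence.

```python
import math

def tfidf(word, sent_idx, sentences):
    """TF-IDF edge weight between a word node and sentence node:
    TF = occurrences of the word in sentence j;
    IDF = log(#sentences / #sentences containing the word)."""
    tf = sentences[sent_idx].count(word)
    df = sum(1 for s in sentences if word in s)
    idf = math.log(len(sentences) / df) if df else 0.0
    return tf * idf

sents = [["the", "cat", "sat"], ["the", "dog"], ["cat", "cat"]]
score = tfidf("cat", 2, sents)   # TF = 2, DF = 2 of 3 sentences
```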
7. The method for generating a monolingual subject abstract based on a heterogeneous graph, wherein in the step 3, the graph attention network takes a graph convolutional neural network as its basic framework, introduces an attention mechanism, and adds residual connections; the word nodes and sentence nodes are updated with the graph attention network as follows:

U_{s \leftarrow w}^{1} = \mathrm{GAT}\big( H_s^{0}, H_w^{0}, H_w^{0} \big)

H_s^{1} = \mathrm{FFN}\big( U_{s \leftarrow w}^{1} + H_s^{0} \big)

wherein U_{s \leftarrow w}^{1} is the word-level information aggregated by each sentence, H_s^{1} indicates that the sentence nodes are updated with the word nodes, and \mathrm{GAT}(H_s^{0}, H_w^{0}, H_w^{0}) denotes a computation of the attention mechanism in which H_s^{0}, the sentence nodes, serves as the query, and H_w^{0}, the word nodes, serves as the key and the value of the attention mechanism;

then the updated sentence nodes are used to obtain new representations of the word nodes, and the sentence nodes are further updated iteratively; each iteration comprises a sentence-to-word and a word-to-sentence update process, and the t-th iteration is expressed as:

U_{s \leftarrow w}^{t+1} = \mathrm{GAT}\big( H_s^{t}, H_w^{t+1}, H_w^{t+1} \big)

H_s^{t+1} = \mathrm{FFN}\big( U_{s \leftarrow w}^{t+1} + H_s^{t} \big)

wherein U_{s \leftarrow w}^{t+1} represents the word-level information aggregated by each sentence at the t-th iteration, H_s^{t+1} indicates that the t-th iteration updates the sentence nodes with the word nodes, H_w^{t+1} represents the key and the value of the attention mechanism at the t-th iteration and is itself updated with a feedforward layer, FFN denotes the feedforward network, and GAT denotes the graph attention network;

the processing steps for updating the sentence nodes are as follows:

(1) each sentence s_i in the document aggregates the word-level information it contains;

(2) the sentence nodes are updated with the new representations of the words w_i in sentence s_i.
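The iterative sentence-word update loop of claim 7 can be sketched as follows. A simplified sketch under stated assumptions: plain scaled dot-product attention stands in for the GAT layer, and a residual addition stands in for the FFN.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attend(queries, keys, values):
    """Scaled dot-product attention standing in for the GAT layer."""
    scores = queries @ keys.T / np.sqrt(keys.shape[1])
    return softmax(scores) @ values

def iterate(H_s, H_w, steps=2):
    """Each iteration: sentence-to-word then word-to-sentence update,
    with residual additions in place of the FFN of claim 7."""
    for _ in range(steps):
        H_w = H_w + attend(H_w, H_s, H_s)   # sentence -> word
        H_s = H_s + attend(H_s, H_w, H_w)   # word -> sentence
    return H_s, H_w

rng = np.random.default_rng(2)
H_s = rng.normal(size=(2, 4))   # 2 sentence nodes, dim 4
H_w = rng.normal(size=(5, 4))   # 5 word nodes, dim 4
H_s2, H_w2 = iterate(H_s, H_w)
```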
8. The method for generating a monolingual subject abstract based on a heterogeneous graph according to claim 1, wherein in the step 4, the abstract sentences are selected as follows:
(1) the updated sentence nodes are scored and ranked;
(2) the sentences with lower scores are removed, and the sentences with higher scores are kept as key sentences;
(3) among the key sentences, lower-ranked sentences whose semantics duplicate those of higher-ranked sentences, or that repeat too many of their keywords, are removed;
(4) the final abstract is extracted.
9. The method for generating a monolingual subject abstract based on a heterogeneous graph according to claim 8, wherein the specific method for scoring and ranking the updated sentence nodes is as follows:
1) the sentence-node feature vectors are linearly transformed into the probability of appearing in the abstract;
2) the sentences are sorted by probability, and the top k are selected as the abstract;
3) sentences sharing duplicate trigrams (word triples) with higher-ranked sentences are discarded.
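The selection procedure of claims 8 and 9 can be sketched as follows. An illustrative sketch: scores are taken as given (standing in for the linear-layer probabilities), and trigram overlap stands in for the "duplicate triples" redundancy test.

```python
def trigrams(tokens):
    return {tuple(tokens[i:i + 3]) for i in range(len(tokens) - 2)}

def select_summary(sentences, scores, k=2):
    """Rank sentences by score, keep the top k, and drop any
    candidate sharing a trigram with an already-kept sentence."""
    order = sorted(range(len(sentences)), key=lambda i: -scores[i])
    kept, seen = [], set()
    for i in order:
        tg = trigrams(sentences[i])
        if tg & seen:
            continue                 # redundant with a kept sentence
        kept.append(i)
        seen |= tg
        if len(kept) == k:
            break
    return sorted(kept)              # restore document order

sents = [["a", "b", "c", "d"], ["a", "b", "c"], ["x", "y", "z"]]
summary = select_summary(sents, scores=[0.9, 0.8, 0.7], k=2)
# sentence 1 repeats the trigram (a, b, c) of sentence 0, so it is
# skipped and sentence 2 is chosen instead
```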
CN202210416073.3A 2022-04-20 2022-04-20 Method for generating monolingual subject abstract based on heteromorphic graph Pending CN114860920A (en)

Published as CN114860920A on 2022-08-05.


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination