CN111104797B - Dual-based sequence-to-sequence generation paper network representation learning method - Google Patents
- Publication number: CN111104797B (application CN201911300281.1A)
- Authority: CN (China)
- Legal status: Active
Abstract
A paper network representation learning method based on dual sequence-to-sequence generation, the method comprising: a paper parallel sequence generation part; a paper node identification part (paper content embedding, paper content sequence encoding, paper identification sequence generation); a paper content generation part (paper node identification embedding, paper identification sequence encoding, paper semantic decoding, paper content generation); and a dual fusion part. The method integrates the content information of the paper nodes in the paper network (namely the titles or abstracts of the papers) with the structural information of the papers (namely the citation relations among the papers), fuses the two kinds of information more fully through their mutual mapping process, and learns more meaningful representations of the paper nodes. The invention can also continue to decode new text after decoding the text content of the input paper sequence, i.e., new paper content predicted from the structural information and content information of the input paper sequence.
Description
Technical Field
The invention belongs to the technical fields of computer applications, data mining, and network representation learning.
Background
Network representation learning is becoming an increasingly popular research topic because it can be applied to many different downstream tasks. However, the structure of network data is very complex and is often accompanied by side information. For example, large-scale paper network data includes not only the titles and abstracts of papers but also the citation relations among them, and this highly nonlinear information poses challenges for the study of network representation. In recent years, researchers have made great efforts in the field of network representation learning and obtained many research results; based on the input information of the models, network representation learning methods can be roughly classified into two categories.
One type is structure-preserving network embedding. The classical DeepWalk model [1] uses the first-order neighbor structure to perform random walk sampling and learns node representations based on the resulting node sequences. The node2vec model [2] further proposes a random walk algorithm based on the second-order neighbor structure. The large-scale information network embedding model LINE [3] proposed by Tang et al. directly models first- and second-order neighbor structures between nodes through reconstruction losses, and the GraRep model [4] further generalizes this to higher-order neighbor structures. However, existing models often require manual specification of the structural information to be preserved, such as first order, second order, etc., and still have certain limitations in practical applications.
The other type is network embedding fused with side information. Nodes in real network data are often accompanied by information such as labels, types, and attributes in addition to structural information. This side information and the topological structure belong to completely different modalities and describe the characteristics of the nodes and the high-level semantic relations among them from different angles. Building on the DeepWalk model, Liu Zhiyuan et al. of Tsinghua University introduce node content [5] and label information [6] respectively, effectively improving the performance of node classification tasks. In the embedding research on heterogeneous information networks, models such as HINE [7] and HNE [8] further consider the types of nodes and edges so as to model network structure information at a finer granularity. However, the existing methods lack deep mining of node content information and have certain limitations.
References:
[1] Perozzi B, Al-Rfou R, Skiena S. DeepWalk: Online learning of social representations[C]. Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2014: 701-710.
[2] Grover A, Leskovec J. node2vec: Scalable feature learning for networks[C]. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2016: 855-864.
[3] Tang J, Qu M, Wang M, et al. LINE: Large-scale information network embedding[C]. Proceedings of the 24th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 2015: 1067-1077.
[4] Cao S, Lu W, Xu Q. GraRep: Learning graph representations with global structural information[C]. Proceedings of the 24th ACM International Conference on Information and Knowledge Management. ACM, 2015: 891-900.
[5] Yang C, Liu Z, Zhao D, et al. Network representation learning with rich text information[C]. Proceedings of the 24th International Joint Conference on Artificial Intelligence. 2015: 2111-2117.
[6] Tu C, Zhang W, Liu Z, et al. Max-margin DeepWalk: Discriminative learning of network representation[C]. Proceedings of the 25th International Joint Conference on Artificial Intelligence. 2016: 3889-3895.
[7] Huang Z, Mamoulis N. Heterogeneous information network embedding for meta path based proximity[J]. arXiv preprint arXiv:1701.05291, 2017.
[8] Chang S, Han W, Tang J, et al. Heterogeneous network embedding via deep architectures[C]. Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2015: 119-128.
Disclosure of Invention
The invention aims to solve the problem of effectively fusing the complex network structure and the paper node content information in a paper network, and provides a paper network representation learning method based on dual sequence-to-sequence generation.
The technical solution of the invention:
A dual sequence-to-sequence generation-based paper network representation learning method comprises the following steps:
Step 1) a paper parallel sequence generation part
Firstly, a random walk method is adopted to walk the paper network to obtain paper node sequences. Because each paper in the paper network has two types of information, the paper number and the paper text content, each paper node sequence obtained by the walk corresponds to two sequences containing different information, namely a paper node identification sequence and a paper node content sequence. The paper node identification sequence contains the structural information of the paper nodes, namely the citation relations among papers; the paper node content sequence contains the content information of the papers and part of the inter-paper structural information; the two sequences form a group of paper parallel sequences. Because the two sequences contain different information, the paper network structural information and the paper node content information can be fused through the mutual mapping process of the two sequences.
Step 2) a paper node identification part for implementing mapping from a paper node content sequence to a paper node identification sequence
Step 2.1) paper content embedding of the paper node identification part
For the text content of each paper node, firstly the text is segmented and each word vector is randomly initialized; then a convolutional neural network (Convolutional Neural Network, CNN) is adopted to capture the text content information of the paper node, and each paper node obtains its corresponding paper node semantic feature;
Step 2.2) paper node content sequence encoding of the paper node identification part
A bidirectional long short-term memory network (Bidirectional Long Short-Term Memory, Bi-LSTM) is adopted to encode the paper node content sequence into a context feature representation; the Bi-LSTM captures the forward and backward information of the paper sequence, and the semantic representation vector obtained by encoding contains the semantic information of the whole paper node content sequence and the structural information among the paper nodes implied in the sequence, namely the citation relations among papers;
Step 2.3) paper node identification sequence generation of the paper node identification part
The semantic representation vector obtained by encoding is decoded through a long short-term memory network (Long Short-Term Memory, LSTM), and the decoded vectors are mapped into the paper node identification space to complete the generation process of the paper node identification sequence;
Step 3) a paper content generation part for implementing mapping from a paper node identification sequence to a paper node content sequence
Step 3.1) paper node identification embedding of the paper content generation part
A paper node identification embedding layer is adopted, and the vector representations of the different paper node identifications in the paper node identification sequence are obtained by looking up an initialized embedding matrix of the paper nodes;
Step 3.2) paper node identification sequence encoding of the paper content generation part
Bi-LSTM is adopted to encode the paper node identification sequence into a context feature representation according to the sequence structure information among the paper nodes, namely the citation relations among papers, as input of the subsequent semantic decoding process;
Step 3.3) paper semantic decoding of the paper content generation part
Before the paper node content is generated, the context feature representation must be decoded to obtain a paper semantic feature sequence, which is used to connect the two modal spaces of the paper network structure and the paper node content; the decoder adopts an LSTM;
Step 3.4) paper content generation of the paper content generation part
A classical LSTM is adopted to generate text content, namely a word sequence, from the semantic representation of each paper node in the paper semantic feature sequence;
Step 4) dual fusion of the paper node identification part and the paper content generation part
Through sharing of the intermediate hidden layers of the paper node identification part and the paper content generation part, the two parts are learned simultaneously, and the context feature representations obtained in step 2.2) and step 3.2) are fused in a linear fusion manner.
The sequence-to-sequence model is a translation model that maps one sequence into another, e.g., translating a sentence in one language into another language. It consists of an encoder and a decoder: the input sequence is first encoded into a semantic representation vector, which is then decoded into an output sequence, completing the sequence-to-sequence mapping. The sequence-to-sequence model was initially applied in the field of natural language processing for machine translation and abstract generation; it is now also applied in the field of network representation learning, fusing different information through the sequence-to-sequence mapping process and adopting intermediate results of the model as node representations in the network.
As shown in FIG. 1, the paper network representation learning method based on dual sequence-to-sequence generation provided by the invention utilizes the structural relations of the paper nodes in the paper network to obtain paper node sequences by random walk. Each paper node has two modalities of information: the paper node identification (i.e., the number of the paper) and the paper node content (i.e., the title or abstract of the paper). A group of paper parallel sequences is finally obtained, namely the paper node identification sequence and the corresponding paper node content sequence.
Based on the paper parallel sequences, the invention designs two dual sequence-to-sequence generation parts, namely a paper node identification part (Node Identification, NI) and a paper content generation part (Content Generation, CG), i.e., semantic mapping modeling from the paper node content sequence to the paper node identification sequence and from the paper node identification sequence to the paper node content sequence. Based on the proposed dual fusion method, the two parts can carry out effective knowledge transfer through a certain fusion strategy. Finally, the hidden vectors in the intermediate layers of the paper node identification part and the paper content generation part are extracted as the learned paper node representations and applied to subsequent paper network analysis tasks.
The invention has the advantages and beneficial effects that:
Paper node characterization
The method integrates the content information of the paper nodes and the structural information among the paper nodes in the paper network to learn the representations of the paper nodes. Compared with previous research, the content information and the structural information of the paper nodes are fused more fully, and the learned representations of the paper nodes are more meaningful.
Paper content prediction
With the trained method, the invention can continue to generate the text content of new papers: in the paper content generation stage of the paper content generation part, after decoding the text content of the input paper sequence, decoding can continue to the content of a new paper, i.e., the structural information and content information of the input paper sequence are considered and the text content of a new paper is predicted.
Drawings
FIG. 1 is a flow chart of the present invention for learning paper node representations from a paper network.
FIG. 2 is a diagram of the method for performing dual fusion of the paper node identification part and the paper content generation part of the present invention.
Detailed Description
The invention provides a paper network representation learning method based on dual sequence-to-sequence generation, which is described in detail below with reference to the accompanying drawings and specific embodiments.
Example 1:
In order to ensure normal operation of the system, the invention mainly adopts deep learning technology to perform paper node representation learning on a paper network. In a specific implementation, the computer platform used is required to have no less than 11 GB of memory, no fewer than 4 CPU cores with a main frequency of at least 2.6 GHz, a GPU environment, and a Linux operating system, with the necessary software environments installed, such as Python 3.6 or above and PyTorch 0.4 or above.
As shown in FIG. 2, which illustrates the dual fusion performed by the paper node identification part and the paper content generation part, the paper network representation learning method based on dual sequence-to-sequence generation comprises the following detailed steps:
Step 1) paper parallel sequence generation part
The paper network is $G = (V, E)$, where $V$ represents the set of all paper nodes in the network and $E \subseteq V \times V$ is the set of edges in the paper network, which encodes the citation relations among papers: if a citing or cited relation exists between two papers, an edge exists between them. For each paper node $v \in V$ in the paper network, $v^i$ denotes the number of the paper node and $v^c$ denotes the content information of the paper node. A random walk method is adopted to walk the paper network, obtaining a walked paper node sequence $S = \{v_1, v_2, \ldots, v_T\}$. For each sequence $S$ there is a corresponding paper node identification sequence $S^i = \{v_1^i, v_2^i, \ldots, v_T^i\}$ and paper node content sequence $S^c = \{v_1^c, v_2^c, \ldots, v_T^c\}$; the paper node identification sequence and the paper node content sequence are referred to as a group of paper parallel sequences. For example, suppose there is an edge between paper 1 and paper 3, between paper 3 and paper 6, between paper 6 and paper 4, and between paper 4 and paper 9. A random walk starting from paper 1 walks to paper 3 and then on to papers 6, 4, and 9; if the walk length is set to 5, the walk sequence is paper 1 → paper 3 → paper 6 → paper 4 → paper 9. From the paper numbers, the paper node identification sequence 1 → 3 → 6 → 4 → 9 is obtained; from the content information of the papers, the paper node content sequence "data mining #" → "big data #" → "natural language processing" → "text analysis #" → "web data mining" is obtained.
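To make step 1) concrete, the following is a minimal Python sketch of parallel sequence generation over the toy network from the example above; the function and variable names are illustrative assumptions, not part of the patented implementation.

```python
import random

def generate_parallel_sequences(adj, contents, start, walk_len):
    """Random walk over the paper network; returns the paper node
    identification sequence and the parallel content sequence."""
    node = start
    id_seq, content_seq = [node], [contents[node]]
    for _ in range(walk_len - 1):
        neighbors = adj.get(node, [])
        if not neighbors:            # dead end: stop the walk early
            break
        node = random.choice(neighbors)   # uniform random walk step
        id_seq.append(node)
        content_seq.append(contents[node])
    return id_seq, content_seq

# Toy paper network from the example: citation edges 1-3, 3-6, 6-4, 4-9.
adj = {1: [3], 3: [1, 6], 6: [3, 4], 4: [6, 9], 9: [4]}
contents = {1: "data mining #", 3: "big data #", 6: "natural language processing",
            4: "text analysis #", 9: "web data mining"}
id_seq, content_seq = generate_parallel_sequences(adj, contents, start=1, walk_len=5)
print(id_seq)        # e.g. [1, 3, 6, 4, 9]
print(content_seq)   # the corresponding paper node content sequence
```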
Step 2) a paper node identification part for realizing mapping from a paper node content sequence to a paper node identification sequence
Step 2.1) paper content embedding of the paper node identification part:
For the text content of each paper node, firstly the text is segmented and each word vector is randomly initialized; then CNN is adopted to capture the text content information of the paper node, and each paper node obtains its corresponding paper node semantic feature.
A group of paper parallel sequences is $(S^c, S^i)$, where $S^c = \{v_1^c, v_2^c, \ldots, v_T^c\}$ is the paper node content sequence of sequence length $T$ and $S^i = \{v_1^i, v_2^i, \ldots, v_T^i\}$ is the paper node identification sequence of sequence length $T$. The dictionary is $\mathcal{D}$, and the randomly initialized word embedding matrix is $W \in \mathbb{R}^{|\mathcal{D}| \times k_m}$, where $|\mathcal{D}|$ is the size of the dictionary and $k_m$ represents the dimension of the word embeddings. First, a lookup function $\mathrm{LookUp}_w(\cdot)$ maps the text content $v_t^c$ of the $t$-th paper node in $S^c$ to a matrix $U(v_t)$ spliced from word embedding vectors, where $u_{t,i}$ is the $i$-th word in the content of the $t$-th paper node and $n_t$ is the number of content words of the $t$-th paper node:

$$U(v_t) = [\mathrm{LookUp}_w(u_{t,1}); \mathrm{LookUp}_w(u_{t,2}); \ldots; \mathrm{LookUp}_w(u_{t,n_t})]$$
For example, in a paper network the paper nodes are identified by the numbers of the papers, and the text content of a paper node is the title or abstract of the paper. Suppose the sequence length obtained by random walk is 5, the walked paper node identification sequence is 1 → 3 → 6 → 4 → 9, and the paper node content sequence is "data mining #" → "big data #" → "natural language processing" → "text analysis #" → "web data mining", where # is a padding character. First, the content words of each paper node are embedded and spliced; for example, embedding the word "data" yields the 100-dimensional word vector [1, 0.89, 1.23, 0.54, …, 1.03] corresponding to "data". The corresponding word vector is obtained for each content word of each paper node and spliced; for instance, the final paper content embedding result $U(v_t)$ of the paper node with identification 1 is the 3 × 100-dimensional matrix [[1, 0.89, 1.23, 0.54, …, 1.03], [0.48, 0.93, 1.07, 0.76, …, 1.32], [1.78, 1.24, 0.65, 0.79, …, 0.36]].
Convolution kernels of width $k_m$ and different heights are applied to $U(v_t)$ with convolution and max-pooling operations, modeling the local semantic relations among the content words of $v_t^c$ and learning the semantic feature $x_t$ of $v_t^c$:

$$x_t = \mathrm{CNN}(U(v_t))$$

The original paper node content sequence $S^c$ thus becomes the paper node semantic feature sequence $X = \{x_1, x_2, \ldots, x_T\}$, where $T$ is the sequence length. After CNN modeling, the paper content embedding result $U(v_t)$ of each paper node is convolved into a 100-dimensional vector $x_t$; for example, the content feature vector $x_1$ of the paper node with identification 1 is [0.79, 0.68, 1.03, 0.98, …, 0.76].
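A minimal PyTorch sketch of the paper content embedding in step 2.1) — word embedding lookup followed by convolution and max-pooling over $U(v_t)$. The class name, kernel heights, and the 100-dimensional sizes (taken from the running example) are illustrative assumptions.

```python
import torch
import torch.nn as nn

class PaperContentEmbedding(nn.Module):
    """Embeds the content words of one paper node and applies CNN +
    max-pooling to obtain the node semantic feature x_t (step 2.1)."""
    def __init__(self, vocab_size, k_m=100, out_dim=100, kernel_heights=(2, 3)):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, k_m)   # randomly initialized W
        # each kernel spans the full embedding width k_m, with varying heights
        self.convs = nn.ModuleList(
            nn.Conv2d(1, out_dim // len(kernel_heights), (h, k_m))
            for h in kernel_heights)

    def forward(self, word_ids):                 # word_ids: (n_t,) word indices
        u = self.word_emb(word_ids)              # U(v_t): (n_t, k_m)
        u = u.unsqueeze(0).unsqueeze(0)          # (1, 1, n_t, k_m)
        feats = [torch.relu(conv(u)).squeeze(3).max(dim=2)[0]  # max over positions
                 for conv in self.convs]
        return torch.cat(feats, dim=1).squeeze(0)    # x_t: (out_dim,)
```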
Step 2.2) paper node content sequence encoding of the paper node identification part:
In the paper node content sequence $S^c$, semantic association information exists among the contents of different paper nodes. To capture the global semantic information of the paper node content sequence, Bi-LSTM is adopted to encode the paper node semantic feature sequence $X = \{x_1, x_2, \ldots, x_T\}$ output by the paper content embedding layer. A forward LSTM accumulates the semantic features of all paper nodes passed through from the beginning of the sequence to the current position, obtaining the current hidden state vector $\overrightarrow{h}_t$:

$$\overrightarrow{h}_t = \overrightarrow{\mathrm{LSTM}}(x_t, \overrightarrow{h}_{t-1})$$

A backward LSTM accumulates, in reverse order, the semantic features of all paper nodes passed through from the end of the sequence to the current position, obtaining the current hidden state $\overleftarrow{h}_t$:

$$\overleftarrow{h}_t = \overleftarrow{\mathrm{LSTM}}(x_t, \overleftarrow{h}_{t+1})$$

where $\overrightarrow{\mathrm{LSTM}}(\cdot)$ and $\overleftarrow{\mathrm{LSTM}}(\cdot)$ respectively represent the fusion learning performed by the forward and backward LSTM when processing the $t$-th paper node in the sequence.
The representation of the $t$-th paper node in the paper node content sequence encoding stage is $h_t = [\overrightarrow{h}_t; \overleftarrow{h}_t]$.
For example, the paper node with identification 1 is the first node in the sequence, so its forward hidden state is $\overrightarrow{h}_1$ and its backward hidden state is $\overleftarrow{h}_1$. The representation finally learned for this node in the paper node content sequence encoding stage is $h_1 = [\overrightarrow{h}_1; \overleftarrow{h}_1]$, e.g., [0.38, -0.48, …, 0.19, 1.02, -0.98, 1.29, …, 0.96, 1.20]. If the paper node with identification 1 appears multiple times in the sequence, the average of its multiple representations is taken as its final representation; the same treatment applies when paper node representations are computed by the other parts of the method.
Finally, the context feature representation of the whole paper node semantic feature sequence is obtained by splicing the final hidden state representations of the forward and backward LSTMs. Since the last hidden states of the forward and backward LSTMs contain information of the entire sequence, their spliced representation is used as the representation of the whole sequence:

$$z_{NI} = [\overrightarrow{h}_T; \overleftarrow{h}_1]$$

where $[\cdot;\cdot]$ represents the process of longitudinally splicing vectors. The finally obtained context feature representation $z_{NI}$ of the whole paper node semantic feature sequence is, e.g., [1.39, -0.98, …, 0.29, 1.05].
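A sketch of the Bi-LSTM encoding in step 2.2), under the same illustrative assumptions: each $h_t$ splices the forward and backward hidden states, and $z_{NI}$ splices the final states of the two directions.

```python
import torch
import torch.nn as nn

class ContentSequenceEncoder(nn.Module):
    """Bi-LSTM encoder over the paper node semantic feature sequence X (step 2.2)."""
    def __init__(self, in_dim=100, hidden=64):
        super().__init__()
        self.bilstm = nn.LSTM(in_dim, hidden, bidirectional=True, batch_first=True)

    def forward(self, x):                         # x: (T, in_dim) semantic features
        out, (h_n, _) = self.bilstm(x.unsqueeze(0))
        h_t = out.squeeze(0)                      # (T, 2*hidden): [fwd_t; bwd_t] per node
        # splice final forward state (position T) and final backward state (position 1)
        z_ni = torch.cat([h_n[0], h_n[1]], dim=1).squeeze(0)
        return h_t, z_ni
```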
Step 2.3) paper node identification sequence generation of the paper node identification part:
the contextual characteristics obtained in step 2.2) represent z NI Fused paper node content sequenceContent information of all paper nodes in +.> and />Sequence information carried by the user. To generate a corresponding paper node identification sequence, LSTM is first employed, in z NI As an initial state, a high-level implicit characteristic sequence oriented to the paper node identification space is directly generated without inputting the characteristic sequence>Wherein the t-th implicit feature->The generation process of (2) is as follows:
then based on the high-layer implicit characteristic sequence obtained by decodingBy means of full connectionThe junction layer adds each node feature in the high-level implicit feature sequence>Mapping to a node identification space to obtain the identification of the t-th paper node in the node identification space ∈>A semantic mapping from content modalities to structural modalities is achieved,
wherein σ (·) is a sigmoid activation function, W NI-Tran and bNI-Tran The weight matrix and the bias term of the full connection layer are respectively. Subsequently, the softmax layer is further adoptedNormalized to probability distribution over all |v| paper node identities:
finally, probability distribution is obtainedIs a probability value such as 0.29 #>The probability of j representing the predicted t-th paper node is 0.29. By comparing the probabilities on all |V| paper node identifiers, the paper node identifier with the highest probability value is finally taken as the predicted node identifier of the t-th paper node.
In the generation stage of the paper node identification sequence, the expression of the t-th paper node is as follows
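A sketch of the decoding in step 2.3): an LSTM cell seeded with $z_{NI}$ generates the implicit features without external inputs by feeding back the previous output, and a fully connected layer plus softmax maps each feature onto the $|V|$ paper node identifications, as in the equations above. The wiring of the input-free first step is an assumption.

```python
import torch
import torch.nn as nn

class IdentificationDecoder(nn.Module):
    """Generates the paper node identification sequence from z_NI (step 2.3)."""
    def __init__(self, z_dim, num_nodes):
        super().__init__()
        self.cell = nn.LSTMCell(z_dim, z_dim)   # input-free decoding: feed back d_{t-1}
        self.fc = nn.Linear(z_dim, num_nodes)   # W_NI-Tran, b_NI-Tran

    def forward(self, z_ni, seq_len):
        h = z_ni.unsqueeze(0)                    # z_NI as the initial hidden state
        c = torch.zeros_like(h)
        d_prev = torch.zeros_like(h)             # no input at the first step
        probs = []
        for _ in range(seq_len):
            h, c = self.cell(d_prev, (h, c))     # d_t = LSTM_NI-Dec(d_{t-1})
            d_prev = h
            # sigmoid mapping to identification space, then softmax normalization
            probs.append(torch.softmax(torch.sigmoid(self.fc(h)), dim=1))
        return torch.stack(probs, dim=1)          # (1, T, |V|); argmax gives predicted ids
```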
Step 3) a paper content generation part for implementing mapping from a paper node identification sequence to a paper node content sequence
Step 3.1) paper node identification embedding of the paper content generation part
A paper node identification embedding layer is adopted, and the vector representations of the different paper node identifications in the paper node identification sequence are obtained by looking up an initialized embedding matrix of the paper nodes:

$$e_t = \mathrm{LookUp}_v(v_t^i)$$

where $\mathbf{V} \in \mathbb{R}^{|V| \times k_n}$ is the initialized embedding matrix for all $|V|$ paper node identifications, $e_t$ is the node identification vector of the $t$-th paper node, and $k_n$ is the dimension of the embedding vectors. The query function $\mathrm{LookUp}_v(\cdot)$ maps each paper node identification $v_t^i$ to its corresponding embedded vector $e_t$, and these are sequentially combined into the sequence $E = \{e_1, e_2, \ldots, e_T\}$.
For example, the random-walk paper node identification sequence is 1 → 3 → 6 → 4 → 9. By looking up the embedding matrix $\mathbf{V}$, each row of which represents the identification vector of the paper node at the corresponding position, the $k_n$-dimensional identification vector of each paper node is obtained. The matrix $\mathbf{V}$ is randomly initialized; the identification vector of the paper node with identification 1 is the first row of $\mathbf{V}$, which is taken as that node's identification vector $e_1$.
Step 3.2) paper node identification sequence encoding of the paper content generation part
After obtaining $E = \{e_1, e_2, \ldots, e_T\}$, Bi-LSTM is adopted to encode the paper node identification sequence according to the sequence structure information among the paper nodes, encoding it into a context feature representation $z_{CG}$ that serves as input to the subsequent content generation process. While processing the embedded vector $e_t$ of each paper node identification, a forward LSTM accumulates the identification features of all paper nodes passed through from the beginning of the sequence to the current position, obtaining the current hidden state vector $\overrightarrow{s}_t$:

$$\overrightarrow{s}_t = \overrightarrow{\mathrm{LSTM}}(e_t, \overrightarrow{s}_{t-1})$$

Simultaneously, a backward LSTM accumulates, in reverse order, the identification features of all paper nodes passed through from the end of the sequence to the current position, obtaining the current hidden state vector $\overleftarrow{s}_t$:

$$\overleftarrow{s}_t = \overleftarrow{\mathrm{LSTM}}(e_t, \overleftarrow{s}_{t+1})$$

where $\overrightarrow{\mathrm{LSTM}}(\cdot)$ and $\overleftarrow{\mathrm{LSTM}}(\cdot)$ respectively denote the learning performed by the forward and backward LSTM at step $t$.
The representation of the $t$-th paper node in the paper node identification sequence encoding stage is $s_t = [\overrightarrow{s}_t; \overleftarrow{s}_t]$.
For example, the paper node with identification 1 is the first node in the sequence, so its forward hidden state is $\overrightarrow{s}_1$ and its backward hidden state is $\overleftarrow{s}_1$. The representation finally learned for this node in the paper node identification sequence encoding stage is $s_1 = [\overrightarrow{s}_1; \overleftarrow{s}_1]$, e.g., [0.32, -0.78, …, 0.89, 1.89, -0.38, 1.02, …, 0.39, 1.01].
Through iterative learning over $E$, the structural semantic information in the paper node identification sequence is effectively mined from two opposite directions. Then, the final hidden state representations of the forward and backward LSTMs are spliced to obtain the fused representation of the whole paper node identification sequence:

$$z_{CG} = [\overrightarrow{s}_T; \overleftarrow{s}_1]$$

where $[\cdot;\cdot]$ represents the process of longitudinally splicing vectors. The finally obtained context feature representation $z_{CG}$ of the whole paper node identification feature sequence is, e.g., [1.39, -0.98, …, 0.29, 1.05].
Step 3.3) paper semantic decoding of the paper content generation part
After the paper node identification embedding layer and the paper node identification sequence encoding layer, the structural information of the paper node identification sequence $S^i$ has been fused and compressed into the context feature representation $z_{CG}$. As the key step before generating the paper node content, the context feature representation $z_{CG}$ must be decoded to obtain the semantic feature sequence $C = \{c_1, c_2, \ldots, c_T\}$ of the whole paper sequence, which connects the two modal spaces of the network structure and the paper node content. An LSTM is used which takes the context feature representation $z_{CG}$ as initial state and directly generates the output sequence without any input feature sequence, where the semantic feature $c_t$ of the $t$-th paper node is generated as follows:

$$c_t = \mathrm{LSTM}_{CG\text{-}Dec}(c_{t-1})$$

Based on the context feature representation $z_{CG}$, $\mathrm{LSTM}_{CG\text{-}Dec}(\cdot)$ sequentially generates, from front to back, the content semantic features corresponding to all $T$ paper nodes. Each $c_t$ has fused the paper node identification information contained in $S^i$ and the sequence structure within the sequence, and serves as the basis for generating content information. In addition, after the paper nodes in the input sequence have been generated, generation can continue and the semantic vectors of new paper nodes can be predicted: e.g., after decoding $c_T$, decoding can continue to $c_{T+1}$, the predicted content semantic vector of a new paper node.
Step 3.4) paper content generation of the paper content generation part
Finally, based on the decoded paper semantic feature sequence $C = \{c_1, c_2, \ldots, c_T\}$, text content, i.e., a word sequence, is generated for each $c_t$. Following convention, an LSTM is adopted which takes $c_t$ as initial state and directly generates the word representation sequence of the paper node.
Given the maximum length $L$ of the generated text, the LSTM starts from scratch and gradually generates the word sequence; the generation process stops when the length of the word sequence reaches $L$ or the generated word is the stop symbol <EOS>. For the $t$-th paper node in the sequence, the implicit representation $g_{t,l}$ of the $l$-th word is generated as follows:

$$g_{t,l} = \mathrm{LSTM}_{CG\text{-}Gen}(w_{t,l-1},\, g_{t,l-1})$$

When $l = 1$, the high-level semantic feature $c_t$ is used as the initial hidden state to directly generate the first hidden state $g_{t,1}$ without input features, for further generating the first word. When $l > 1$, the word vector representation $w_{t,l-1}$ of the last generated word is used as the input feature and is combined with the transferred hidden state $g_{t,l-1}$ to jointly generate the current hidden state $g_{t,l}$ for further generating the current word. In the training phase and the testing phase, the word vector representation $w_{t,l-1}$ of the last generated word has different settings. During training, in order to maximize the likelihood probability of the text content of the paper node, the $(l-1)$-th real word is selected from the given content $v_t^c$ and its word vector is used as $w_{t,l-1}$ input into the LSTM. For example, the text content of the paper node with identification 1 is "data mining #"; when the second word "mining" is predicted, the embedding vector of "data", [1, 0.89, 1.23, 0.54, …, 1.03], is input as the feature.
In the testing phase, when predicting new text content for a paper node, $w_{t,l-1}$ is the word vector corresponding to the word predicted in the previous step:

$$w_{t,l-1} = \mathrm{LookUp}_w\big(\arg\max_j\, p_{t,l-1,j}\big)$$

where $p_{t,l-1,j}$ represents the probability that the word predicted in the previous step is the $j$-th word in the vocabulary, and the $\arg\max$ selects the word with the highest generation probability. For example, if the word with maximum probability predicted as the first word of the paper node with identification 1 is "data", then the embedding vector of "data", [1, 0.89, 1.23, 0.54, …, 1.03], is used as the input $w_{t,1}$ to predict the next word.
Based on the above text generation process, a text semantic sequence $G_t = \{g_{t,1}, g_{t,2}, \ldots, g_{t,L}\}$ of length $L$ is decoded for the $t$-th paper node in the sequence (in the running example, the maximum length of the word sequence of the paper node content is set to 3). A fully connected layer maps each $g_{t,l}$ into the $|\mathcal{D}|$-dimensional dictionary space:

$$\hat{m}_{t,l} = \sigma(W_{CG\text{-}Word}\, g_{t,l} + b_{CG\text{-}Word})$$

where $\sigma(\cdot)$ is a sigmoid activation function, and $W_{CG\text{-}Word}$ and $b_{CG\text{-}Word}$ are respectively the weight matrix and bias term of the fully connected layer. A softmax layer then further converts $\hat{m}_{t,l}$ into a probability distribution over all $|\mathcal{D}|$ words:

$$p_{t,l} = \mathrm{softmax}(\hat{m}_{t,l})$$

Each element of the resulting probability distribution $p_{t,l}$ is a probability value; e.g., $p_{t,l,j} = 0.35$ means that the probability that the $l$-th word of the predicted $t$-th paper node is $m_j$ is 0.35. By comparing the probabilities over all $|\mathcal{D}|$ words, the word with the highest probability value is finally taken as the predicted $l$-th word of the $t$-th node; for the content of the paper node with identification 1, the generated result is "data mining #".
If it is desired to predict the content of a new paper node, the same content generation operation is performed on the newly decoded node semantic feature $c_{T+1}$ to generate the content word sequence of the new paper node.
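A sketch of the word-level generation in step 3.4), with the training/testing switch described above: during training the ground-truth previous word is fed back (teacher forcing), while at test time the previously predicted word is. The <EOS> early stop is omitted for brevity, and all names and sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ContentGenerator(nn.Module):
    """Generates the word sequence of one paper node from its semantic feature c_t."""
    def __init__(self, vocab_size, k_m=100, hidden=100):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, k_m)
        self.cell = nn.LSTMCell(k_m, hidden)
        self.fc = nn.Linear(hidden, vocab_size)   # W_CG-Word, b_CG-Word

    def forward(self, c_t, max_len=3, target_words=None):
        h = c_t.unsqueeze(0)                       # c_t as the initial hidden state
        c = torch.zeros_like(h)
        inp = torch.zeros(1, self.word_emb.embedding_dim)  # no input word at l = 1
        word_probs = []
        for l in range(max_len):
            h, c = self.cell(inp, (h, c))          # g_{t,l} from w_{t,l-1} and g_{t,l-1}
            p = torch.softmax(torch.sigmoid(self.fc(h)), dim=1)
            word_probs.append(p)
            if target_words is not None:           # training: teacher forcing
                prev = target_words[l].view(1)
            else:                                   # testing: feed back the prediction
                prev = p.argmax(dim=1)
            inp = self.word_emb(prev)
        return torch.stack(word_probs, dim=1)       # (1, L, |D|) word distributions
```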
Step 4) dual fusion of the paper node identification part and the paper content generation part
The paper node identification part and the paper content generation part are closely related: they model the cross-modal semantic generation relations between the paper node content sequence and the paper node identification sequence from two opposite angles. To realize the fusion of complementary knowledge in the two dual parts, a linear layer is used to couple the two parts together and learn them simultaneously through the sharing of an intermediate hidden layer:

$$z'_{NI} = W_{Dual,1}\,[z_{NI}; z_{CG}] + b_{Dual,1}, \qquad z'_{CG} = W_{Dual,2}\,[z_{NI}; z_{CG}] + b_{Dual,2}$$

where $W_{Dual,1}$, $b_{Dual,1}$, $W_{Dual,2}$, $b_{Dual,2}$ are the weights and bias terms of the linear fusion layer. After the above dual fusion process, $z'_{NI}$ and $z'_{CG}$ each contain some semantic information from the target modality. Therefore, $z'_{NI}$ and $z'_{CG}$ are respectively fed into the sequence decoding layers described in step 2.3) and step 3.3), thereby improving the accuracy of decoding and generation.
The final vector representation of the $t$-th paper node is:

$$r_t = [h_t; d_t; s_t; c_t]$$

where $[\cdot;\cdot]$ represents the process of longitudinally splicing vectors, so that the paper node with identification 1 is finally represented as [0.38, -0.48, …, 0.19, 1.02, -0.98, 1.29, …, 0.96, 1.20, 0.37, -0.21, …, 0.28, 1.79, 0.32, -0.78, …, 0.89, 1.89, -0.38, 1.02, …, 0.39, 1.01, 0.31, -0.51, …, 0.78, 1.23].
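Finally, a sketch of the dual fusion in step 4) and the final node representation: a shared linear layer mixes $z_{NI}$ and $z_{CG}$, and the learned representation of a paper node splices the intermediate vectors from both parts. The exact wiring is an assumption consistent with the equations above.

```python
import torch
import torch.nn as nn

class DualFusion(nn.Module):
    """Linear fusion layer shared by the NI and CG parts (step 4)."""
    def __init__(self, dim):
        super().__init__()
        self.fuse_ni = nn.Linear(2 * dim, dim)   # W_Dual,1, b_Dual,1
        self.fuse_cg = nn.Linear(2 * dim, dim)   # W_Dual,2, b_Dual,2

    def forward(self, z_ni, z_cg):
        z = torch.cat([z_ni, z_cg], dim=-1)      # shared intermediate hidden layer
        return self.fuse_ni(z), self.fuse_cg(z)  # fused z'_NI, z'_CG for the two decoders

# Final representation of the t-th paper node: splice the intermediate vectors
# h_t (step 2.2), d_t (step 2.3), s_t (step 3.2), c_t (step 3.3).
def node_representation(h_t, d_t, s_t, c_t):
    return torch.cat([h_t, d_t, s_t, c_t], dim=-1)
```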
Claims (10)
1. The dual sequence-to-sequence generation-based paper network representation learning method is characterized by comprising the following steps:
Step 1) a paper parallel sequence generation part
A random walk method is adopted to walk the paper network to obtain paper node sequences; because each paper in the paper network has two types of information, the paper number and the paper text content, each paper node sequence obtained by the walk corresponds to two sequences containing different information, namely a paper node identification sequence and a paper node content sequence, and the two sequences form a group of paper parallel sequences;
Step 2) a paper node identification part for realizing mapping from a paper node content sequence to a paper node identification sequence
Step 2.1) paper content embedding of the paper node identification part
For the text content of each paper node, firstly the text is segmented and each word vector is randomly initialized; then a convolutional neural network CNN is adopted to capture the text content information of the paper node, and each paper node obtains its corresponding paper node semantic feature;
Step 2.2) paper node content sequence encoding of the paper node identification part
A bidirectional long short-term memory network Bi-LSTM is adopted to encode the paper node content sequence into a context feature representation; the Bi-LSTM captures the forward and backward information of the paper sequence, and the semantic representation vector obtained by encoding contains the semantic information of the whole paper node content sequence and the structural information among the paper nodes implied in the sequence, namely the citation relations among papers;
Step 2.3) paper node identification sequence generation of the paper node identification part
The semantic representation vector obtained by encoding is decoded through a long short-term memory network LSTM, and the decoded vectors are mapped into the paper node identification space to complete the generation process of the paper node identification sequence;
Step 3) a paper content generation part for realizing mapping from a paper node identification sequence to a paper node content sequence
Step 3.1) paper node identification embedding of the paper content generation part
A paper node identification embedding layer is adopted, and the vector representations of the different paper node identifications in the paper node identification sequence are obtained by looking up an initialized embedding matrix of the paper nodes;
Step 3.2) paper node identification sequence encoding of the paper content generation part
Bi-LSTM is adopted to encode the paper node identification sequence into a context feature representation according to the sequence structure information among the paper nodes, namely the citation relations among papers, as input of the subsequent semantic decoding process;
Step 3.3) paper semantic decoding of the paper content generation part
Before the paper node content is generated, the context feature representation must be decoded to obtain a paper semantic feature sequence, which is used to connect the two modal spaces of the paper network structure and the paper node content; the decoder adopts an LSTM;
Step 3.4) paper content generation of the paper content generation part
A classical LSTM is adopted to generate text content, namely a word sequence, from the semantic representation of each paper node in the paper semantic feature sequence;
Step 4) dual fusion of the paper node identification part and the paper content generation part
Through sharing of the intermediate hidden layers of the paper node identification part and the paper content generation part, the two parts are learned simultaneously, and the context feature representations obtained in step 2.2) and step 3.2) are fused in a linear fusion manner.
2. The method for learning the paper network representation based on dual sequence-to-sequence generation according to claim 1, wherein the paper parallel sequence generation part in step 1) is as follows:
the paper network is $G = (V, E)$, where $V$ represents the set of all paper nodes in the network and $E \subseteq V \times V$ is the set of edges in the paper network; for each paper node $v \in V$ in the paper network, $v^i$ represents the number of the paper node and $v^c$ represents the content information of the paper node; the random walk method is adopted to walk the paper network, obtaining a walked paper node sequence $S = \{v_1, v_2, \ldots, v_T\}$, where $T$ represents the number of nodes contained in the paper node sequence $S$, i.e., the sequence length; for each sequence $S$ there is a corresponding paper node identification sequence $S^i = \{v_1^i, v_2^i, \ldots, v_T^i\}$ and paper node content sequence $S^c = \{v_1^c, v_2^c, \ldots, v_T^c\}$, and the paper node identification sequence and the paper node content sequence are called a group of paper parallel sequences; the paper node identification sequence $S^i$ contains the structural information among paper nodes, namely the citation relations among papers, while the paper node content sequence $S^c$ contains the content information of the papers and part of the inter-paper structural information; because the two sequences contain different information, the paper network structural information and the paper node content information can be fused through the mutual mapping process of the two sequences.
3. The method of claim 2, wherein the method of step 2.1) is as follows:
for the text content of each paper node, firstly, segmenting the text, randomly initializing each word vector, then capturing the text content information of the paper node by adopting CNN, and obtaining the corresponding node semantic feature by each paper node;
parallel sequences of articles are wherein />For the content sequence of the thesis node with the sequence length of T, < >>The sequence is identified for the paper node with the sequence length of T, and the dictionary is +.>Randomly initialized word embedding matrix is +.> For the size of the dictionary, k m Representing the dimension of word embedding, first, a look-up function LookUp is adopted w (. Cndot. Cndot.) will be->Text content of the t-th paper node in (2)>Matrix formed by splicing word embedding vectors wherein t=1,2,...,T,u t,i For the i-th word in the node content of the t-th paper,>number of content words for the t-th paper node:
using a plurality of widths k m Is set in U (v t ) Rolling and maximum pooling operations are performed on the model building machine, and modeling can be performedLearning +.>Is>
4. The method for learning the paper network representation based on dual sequence-to-sequence generation according to claim 3, wherein the method for encoding the paper node content sequence of the paper node identification part in step 2.2) is as follows:
in the paper node content sequence $S^c$, semantic association information exists among the contents of different paper nodes; to capture the global semantic information of the paper node content sequence, Bi-LSTM is adopted to encode the paper node semantic feature sequence $X = \{x_1, x_2, \ldots, x_T\}$ output by the paper content embedding method; a forward LSTM accumulates the semantic features of all paper nodes passed through from the beginning of the sequence to the current position, obtaining the current hidden state vector $\overrightarrow{h}_t$:

$$\overrightarrow{h}_t = \overrightarrow{\mathrm{LSTM}}(x_t, \overrightarrow{h}_{t-1})$$

a backward LSTM accumulates, in reverse order, the semantic features of all paper nodes passed through from the end of the sequence to the current position, obtaining the current hidden state $\overleftarrow{h}_t$:

$$\overleftarrow{h}_t = \overleftarrow{\mathrm{LSTM}}(x_t, \overleftarrow{h}_{t+1})$$

wherein $\overrightarrow{\mathrm{LSTM}}(\cdot)$ and $\overleftarrow{\mathrm{LSTM}}(\cdot)$ respectively represent the forward and backward LSTM networks, $\overrightarrow{h}_t$ and $\overleftarrow{h}_t$ respectively represent the forward and backward hidden states corresponding to the $t$-th node, and the value range of $t$ is $t = 1, 2, \ldots, T$;
the representation of the $t$-th paper node in the paper node content sequence encoding stage is $h_t = [\overrightarrow{h}_t; \overleftarrow{h}_t]$;
finally, the context feature representation $z_{NI}$ of the whole paper node semantic feature sequence is obtained by splicing the final hidden state representations of the forward and backward LSTMs,

$$z_{NI} = [\overrightarrow{h}_T; \overleftarrow{h}_1]$$

where $[\cdot;\cdot]$ represents the process of longitudinally splicing vectors.
5. The method for learning the paper network representation based on dual sequence-to-sequence generation according to claim 4, wherein the paper node identification sequence generation of the paper node identification part in step 2.3) is as follows:
the context feature representation $z_{NI}$ obtained in step 2.2) fuses the content information of all paper nodes in the paper node content sequence $S^c$ and the sequence information carried by the forward and backward LSTMs; in order to generate the corresponding paper node identification sequence, an LSTM is first adopted which takes $z_{NI}$ as initial state and directly generates a high-level implicit feature sequence $D = \{d_1, d_2, \ldots, d_T\}$ oriented to the paper node identification space without inputting a feature sequence, wherein the $t$-th implicit feature $d_t$ is generated as follows:

$$d_t = \mathrm{LSTM}_{NI\text{-}Dec}(d_{t-1})$$

then, based on the high-level implicit feature sequence $D$ obtained by decoding, a fully connected layer is used to map each node feature $d_t$ in the high-level implicit feature sequence to the node identification space, obtaining the identification representation $\hat{y}_t$ of the $t$-th paper node in the node identification space and realizing a semantic mapping from the content modality to the structural modality,

$$\hat{y}_t = \sigma(W_{NI\text{-}Tran}\, d_t + b_{NI\text{-}Tran})$$

wherein $\sigma(\cdot)$ is a sigmoid activation function, and $W_{NI\text{-}Tran}$ and $b_{NI\text{-}Tran}$ are respectively the weight matrix and bias term of the fully connected layer; subsequently, a softmax layer is further adopted to normalize $\hat{y}_t$ into a probability distribution over all $|V|$ node identifications:

$$p_t = \mathrm{softmax}(\hat{y}_t)$$

in the generation stage of the paper node identification sequence, the representation of the $t$-th paper node is $d_t$.
6. The method for learning the dual sequence-to-sequence generated paper network representation according to claim 5, wherein the paper node identification embedding method of the paper content generating part in step 3.1) is as follows:
adopting a paper node identification embedding layer, and acquiring identification vector representations of different paper nodes in a paper node identification sequence by searching an initialization embedding matrix of the paper nodes;
$$e_t = \mathrm{LookUp}_v(v_t^i)$$

wherein $\mathbf{V} \in \mathbb{R}^{|V| \times k_n}$ is the initialized embedding matrix for all $|V|$ paper node identifications, $e_t$ is the node identification vector of the $t$-th paper node, and $k_n$ is the dimension of the embedding vectors; the query function $\mathrm{LookUp}_v(\cdot)$ maps each paper node identification $v_t^i$ to the corresponding embedded vector $e_t$, and these are sequentially combined into the sequence $E = \{e_1, e_2, \ldots, e_T\}$.
7. The method for learning the network representation of the paper based on the dual sequence-to-sequence generation according to claim 6, wherein the method for encoding the node identification sequence of the paper in the paper content generation part in the step 3.2) is as follows:
after obtaining $E = \{e_1, e_2, \ldots, e_T\}$, Bi-LSTM is adopted to encode the paper node identification sequence according to the sequence structure information among the paper nodes, encoding the paper node identification sequence into a context feature representation $z_{CG}$ as input to the subsequent paper content generation process; while processing the embedded vector $e_t$ of each paper node identification, a forward LSTM accumulates the identification features of all paper nodes passed through from the beginning of the sequence to the current position, obtaining the current hidden state vector $\overrightarrow{s}_t$:

$$\overrightarrow{s}_t = \overrightarrow{\mathrm{LSTM}}(e_t, \overrightarrow{s}_{t-1})$$

simultaneously, a backward LSTM accumulates, in reverse order, the identification features of all paper nodes passed through from the end of the sequence to the current position, obtaining the current hidden state vector $\overleftarrow{s}_t$:

$$\overleftarrow{s}_t = \overleftarrow{\mathrm{LSTM}}(e_t, \overleftarrow{s}_{t+1})$$

wherein $\overrightarrow{\mathrm{LSTM}}(\cdot)$ and $\overleftarrow{\mathrm{LSTM}}(\cdot)$ respectively represent the forward and backward LSTM networks, $\overrightarrow{s}_t$ and $\overleftarrow{s}_t$ respectively represent the forward and backward hidden states corresponding to the $t$-th node, and the value range of $t$ is $t = 1, 2, \ldots, T$;
the representation of the $t$-th paper node in the paper node identification sequence encoding stage is $s_t = [\overrightarrow{s}_t; \overleftarrow{s}_t]$;
through iterative learning over $E$, the structural semantic information in the paper node identification sequence is effectively mined from two opposite directions; then the representation of the whole paper node identification sequence is obtained by splicing the final hidden state representations of the forward and backward LSTMs:

$$z_{CG} = [\overrightarrow{s}_T; \overleftarrow{s}_1]$$
8. The method for learning the paper network representation based on dual sequence-to-sequence generation according to claim 7, wherein the paper semantic decoding method of the paper content generation part in step 3.3) is as follows:
after the paper node identification embedding layer and the paper node identification sequence encoding layer, the structural information of the paper node identification sequence $S^i$ has been fused and compressed into the context feature representation $z_{CG}$; as a key step before node content is generated, the context feature representation $z_{CG}$ needs to be decoded to obtain the semantic feature sequence $C = \{c_1, c_2, \ldots, c_T\}$ of the whole paper sequence, which is used to connect the two modal spaces of the network structure and the node contents; an LSTM is used which takes the context feature representation $z_{CG}$ as initial state and directly generates the output sequence without inputting a feature sequence, wherein the semantic feature $c_t$ of the $t$-th paper node is generated as follows:

$$c_t = \mathrm{LSTM}_{CG\text{-}Dec}(c_{t-1})$$

based on the context feature representation $z_{CG}$, $\mathrm{LSTM}_{CG\text{-}Dec}(\cdot)$ sequentially generates, from front to back, the content semantic features corresponding to all $T$ paper nodes; each $c_t$ has fused the identification information of the paper nodes contained in $S^i$ and the sequence structure within the sequence, and serves as the basis for generating content information; in addition, after the semantic vectors of the paper nodes in the input sequence are generated, the semantic vectors of new paper nodes can be continuously generated and predicted.
9. The method for learning the paper network representation based on dual sequence-to-sequence generation according to claim 8, wherein the paper content generation method of the paper content generation part in step 3.4) is as follows:
finally, based on the decoded paper semantic feature sequence $C = \{c_1, c_2, \ldots, c_T\}$, an LSTM is adopted which takes $c_t$ as initial state and directly generates the word representation sequence of the node;
given the maximum length $L$ of the generated text, the LSTM starts from scratch and gradually generates the word sequence, stopping the generation process when the length of the word sequence reaches $L$ or the generated word is the stop symbol <EOS>; for the $t$-th paper node in the sequence, the implicit representation $g_{t,l}$ of the $l$-th word is generated as follows:

$$g_{t,l} = \mathrm{LSTM}_{CG\text{-}Gen}(w_{t,l-1},\, g_{t,l-1})$$

when $l = 1$, the high-level semantic feature $c_t$ is used as the initial hidden state to directly generate the first hidden state $g_{t,1}$ without input features, for further generating the first word; when $l > 1$, the word vector representation $w_{t,l-1}$ of the last generated word is used as the input feature and is combined with the transferred hidden state $g_{t,l-1}$ to jointly generate the current hidden state $g_{t,l}$ for further generating the current word; in the training phase and the testing phase, the word vector representation $w_{t,l-1}$ of the last generated word has different settings; during training, in order to maximize the likelihood probability of the text content of the node, the $(l-1)$-th real word is selected from the given content $v_t^c$ and its word vector is used as $w_{t,l-1}$ input into the LSTM;
in the testing phase, when predicting new text content for the paper node, $w_{t,l-1}$ is the word vector corresponding to the word predicted in the previous step:

$$w_{t,l-1} = \mathrm{LookUp}_w\big(\arg\max_j\, p_{t,l-1,j}\big)$$

wherein $p_{t,l-1,j}$ represents the probability that the word predicted in the previous step is the $j$-th word in the vocabulary, and the word with the highest probability is selected;
based on the above text generation process, a text semantic sequence $G_t = \{g_{t,1}, g_{t,2}, \ldots, g_{t,L}\}$ of length $L$ is decoded for the $t$-th paper node in the sequence; a fully connected layer maps each $g_{t,l}$ into the $|\mathcal{D}|$-dimensional dictionary space, obtaining the vector representation $\hat{m}_{t,l}$:

$$\hat{m}_{t,l} = \sigma(W_{CG\text{-}Word}\, g_{t,l} + b_{CG\text{-}Word})$$

wherein $\sigma(\cdot)$ is a sigmoid activation function, and $W_{CG\text{-}Word}$ and $b_{CG\text{-}Word}$ are respectively the weight matrix and bias term of the fully connected layer; a softmax layer further converts $\hat{m}_{t,l}$ into a probability distribution over all $|\mathcal{D}|$ words:

$$p_{t,l} = \mathrm{softmax}(\hat{m}_{t,l})$$

if it is desired to predict the content of a new paper node, the same operation is performed on the semantic vector of the predicted new paper node to obtain the content word sequence of the new paper node.
10. The method for learning the paper network representation based on dual sequence-to-sequence generation according to claim 9, wherein the method for the dual fusion of the paper node identification part and the paper content generation part in step 4) is as follows:
the paper node identification part and the paper content generation part are closely related: they model the cross-modal semantic generation relations between the paper node content sequence and the paper node identification sequence from two opposite angles; to realize the fusion of complementary knowledge in the two dual parts, a linear layer is used to couple the two parts together and learn them simultaneously through the sharing of an intermediate hidden layer;

$$z'_{NI} = W_{Dual,1}\,[z_{NI}; z_{CG}] + b_{Dual,1}, \qquad z'_{CG} = W_{Dual,2}\,[z_{NI}; z_{CG}] + b_{Dual,2}$$

wherein $W_{Dual,1}$, $b_{Dual,1}$, $W_{Dual,2}$, $b_{Dual,2}$ are the weights and bias terms of the linear fusion layer; after the above dual fusion process, $z'_{NI}$ and $z'_{CG}$ have included semantic information from the target modality; therefore, $z'_{NI}$ and $z'_{CG}$ are respectively fed into the sequence decoding layers described in step 2.3) and step 3.3), thereby improving the accuracy of decoding and generation;
the vector representation of the final $t$-th paper node is:

$$r_t = [h_t; d_t; s_t; c_t]$$
Priority Applications (1)
- CN201911300281.1A, priority and filing date 2019-12-17: CN111104797B, Dual-based sequence-to-sequence generation paper network representation learning method.
Publications (2)
- CN111104797A, published 2020-05-05.
- CN111104797B, granted 2023-05-02.