CN111104797B - Dual-based sequence-to-sequence generation paper network representation learning method - Google Patents
- Publication number: CN111104797B (application CN201911300281.1A)
- Authority: CN (China)
- Legal status: Active
Abstract
A paper network representation learning method based on dual sequence-to-sequence generation, the method comprising: a paper parallel sequence generation part; a paper node identification part (paper content embedding, paper content sequence encoding, paper identification sequence generation); a paper content generation part (paper node identification embedding, paper identification sequence encoding, paper semantic decoding, paper content generation); and a dual fusion part. The method integrates the content information of the paper nodes in the paper network (namely the titles or abstracts of the papers) with the structural information of the papers (namely the citation relations among the papers), fuses the two kinds of information more fully through their mutual mapping process, and learns more meaningful representations of the paper nodes. The invention can also continue to decode new text after decoding the text content of the input paper sequence, i.e., new paper content predicted from the structural information and content information of the input paper sequence.
Description
Technical Field
The invention belongs to the technical fields of computer applications, data mining, and network representation learning.
Background
Network representation learning is becoming an increasingly popular research topic because it can be applied to many different downstream tasks. However, the structure of network data is very complex and is often accompanied by side information. For example, large-scale paper network data includes not only the titles and abstracts of papers but also the citation relations among them, and this highly nonlinear information poses challenges for the study of network representation. In recent years, researchers have made great efforts in the field of network representation learning and obtained many research results; based on the input information of the models, network representation learning methods can be roughly classified into two categories.
One type is structure-preserving network embedding. The classical DeepWalk model [1] uses the first-order neighbor structure to perform random walk sampling and learns node representations based on the resulting node sequences. The node2vec model [2] further proposes a random walk algorithm based on the second-order neighbor structure. The large-scale information network embedding model LINE [3] proposed by Tang et al. directly models first- and second-order neighbor structures between nodes through reconstruction losses, and the GraRep model [4] further generalizes this to higher-order neighbor structures. However, existing models often require manual specification of the structural information to be preserved, such as first order, second order, etc., and still have certain limitations in practical applications.
The other type is network embedding fused with side information. Nodes in real network data are often accompanied by information such as labels, types, and attributes in addition to structural information. This side information and the topological structure belong to completely different modalities and describe the characteristics of the nodes and the high-level semantic relations among them from different angles. Building on the DeepWalk model, Liu Zhiyuan et al. of Tsinghua University introduce node content [5] and label information [6] respectively, effectively improving the performance of node classification tasks. In the embedding research on heterogeneous information networks, models such as HINE [7] and HNE [8] further consider the types of nodes and edges so as to model network structure information at a finer granularity. However, the existing methods lack deep mining of node content information and have certain limitations.
References:
[1] Perozzi B, Al-Rfou R, Skiena S. DeepWalk: Online learning of social representations[C]. Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2014: 701-710.
[2] Grover A, Leskovec J. node2vec: Scalable feature learning for networks[C]. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2016: 855-864.
[3] Tang J, Qu M, Wang M, et al. LINE: Large-scale information network embedding[C]. Proceedings of the 24th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 2015: 1067-1077.
[4] Cao S, Lu W, Xu Q. GraRep: Learning graph representations with global structural information[C]. Proceedings of the 24th ACM International Conference on Information and Knowledge Management. ACM, 2015: 891-900.
[5] Yang C, Liu Z, Zhao D, et al. Network representation learning with rich text information[C]. Proceedings of the 24th International Joint Conference on Artificial Intelligence. 2015: 2111-2117.
[6] Tu C, Zhang W, Liu Z, et al. Max-margin DeepWalk: Discriminative learning of network representation[C]. Proceedings of the 25th International Joint Conference on Artificial Intelligence. 2016: 3889-3895.
[7] Huang Z, Mamoulis N. Heterogeneous information network embedding for meta path based proximity[J]. arXiv preprint arXiv:1701.05291, 2017.
[8] Chang S, Han W, Tang J, et al. Heterogeneous network embedding via deep architectures[C]. Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2015: 119-128.
Disclosure of Invention
The invention aims to solve the problem of effectively fusing the complex network structure and the paper node content information in a paper network, and provides a paper network representation learning method based on dual sequence-to-sequence generation.
The technical solution of the invention:
A dual sequence-to-sequence generation-based paper network representation learning method comprises the following steps:
Step 1) a paper parallel sequence generation part
Firstly, a random walk method is adopted to walk the paper network to obtain paper node sequences. Because each paper in the paper network has two types of information, the paper number and the paper text content, each paper node sequence obtained by the walk corresponds to two sequences containing different information, namely a paper node identification sequence and a paper node content sequence. The paper node identification sequence contains the structural information of the paper nodes, namely the citation relations among papers; the paper node content sequence contains the content information of the papers and part of the inter-paper structural information; the two sequences form a group of paper parallel sequences. Because the two sequences contain different information, the paper network structural information and the paper node content information can be fused through the mutual mapping process of the two sequences.
Step 2) a paper node identification part for implementing mapping from a paper node content sequence to a paper node identification sequence
Step 2.1) paper content embedding of the paper node identification part
For the text content of each paper node, firstly the text is segmented and each word vector is randomly initialized; then a convolutional neural network (Convolutional Neural Network, CNN) is adopted to capture the text content information of the paper node, and each paper node obtains its corresponding paper node semantic feature;
Step 2.2) paper node content sequence encoding of the paper node identification part
A bidirectional long short-term memory network (Bidirectional Long Short-Term Memory, Bi-LSTM) is adopted to encode the paper node content sequence into a context feature representation; the Bi-LSTM captures the forward and backward information of the paper sequence, and the semantic representation vector obtained by encoding contains the semantic information of the whole paper node content sequence and the structural information among the paper nodes implied in the sequence, namely the citation relations among papers;
Step 2.3) paper node identification sequence generation of the paper node identification part
The semantic representation vector obtained by encoding is decoded through a long short-term memory network (Long Short-Term Memory, LSTM), and the decoded vectors are mapped into the paper node identification space to complete the generation process of the paper node identification sequence;
Step 3) a paper content generation part for implementing mapping from a paper node identification sequence to a paper node content sequence
Step 3.1) paper node identification embedding of the paper content generation part
A paper node identification embedding layer is adopted, and the vector representations of the different paper node identifications in the paper node identification sequence are obtained by looking up an initialized embedding matrix of the paper nodes;
Step 3.2) paper node identification sequence encoding of the paper content generation part
Bi-LSTM is adopted to encode the paper node identification sequence into a context feature representation according to the sequence structure information among the paper nodes, namely the citation relations among papers, as input of the subsequent semantic decoding process;
Step 3.3) paper semantic decoding of the paper content generation part
Before the paper node content is generated, the context feature representation must be decoded to obtain a paper semantic feature sequence, which is used to connect the two modal spaces of the paper network structure and the paper node content; the decoder adopts an LSTM;
Step 3.4) paper content generation of the paper content generation part
A classical LSTM is adopted to generate text content, namely a word sequence, from the semantic representation of each paper node in the paper semantic feature sequence;
Step 4) dual fusion of the paper node identification part and the paper content generation part
Through sharing of the intermediate hidden layers of the paper node identification part and the paper content generation part, the two parts are learned simultaneously, and the context feature representations obtained in step 2.2) and step 3.2) are fused in a linear fusion manner.
The sequence-to-sequence model is a translation model that maps one sequence into another, e.g., translating a sentence in one language into another language. It consists of an encoder and a decoder: the input sequence is first encoded into a semantic representation vector, which is then decoded into an output sequence, completing the sequence-to-sequence mapping. The sequence-to-sequence model was initially applied in the field of natural language processing for machine translation and abstract generation; it is now also applied in the field of network representation learning, fusing different information through the sequence-to-sequence mapping process and adopting intermediate results of the model as node representations in the network.
As shown in FIG. 1, the paper network representation learning method based on dual sequence-to-sequence generation provided by the invention utilizes the structural relations of the paper nodes in the paper network to obtain paper node sequences by random walk. Each paper node has two modalities of information: the paper node identification (i.e., the number of the paper) and the paper node content (i.e., the title or abstract of the paper). A group of paper parallel sequences is finally obtained, namely the paper node identification sequence and the corresponding paper node content sequence.
Based on the paper parallel sequences, the invention designs two dual sequence-to-sequence generation parts, namely a paper node identification part (Node Identification, NI) and a paper content generation part (Content Generation, CG), i.e., semantic mapping modeling from the paper node content sequence to the paper node identification sequence and from the paper node identification sequence to the paper node content sequence. Based on the proposed dual fusion method, the two parts can carry out effective knowledge transfer through a certain fusion strategy. Finally, the hidden vectors in the intermediate layers of the paper node identification part and the paper content generation part are extracted as the learned paper node representations and applied to subsequent paper network analysis tasks.
The invention has the advantages and beneficial effects that:
Paper node characterization
The method integrates the content information of the paper nodes and the structural information among the paper nodes in the paper network to learn the representations of the paper nodes. Compared with previous research, the content information and the structural information of the paper nodes are fused more fully, and the learned representations of the paper nodes are more meaningful.
Paper content prediction
With the trained method, the invention can continue to generate the text content of new papers: in the paper content generation stage of the paper content generation part, after decoding the text content of the input paper sequence, decoding can continue to the content of a new paper, i.e., the structural information and content information of the input paper sequence are considered and the text content of a new paper is predicted.
Drawings
FIG. 1 is a flow chart of the present invention for learning paper node representations from a paper network.
FIG. 2 is a diagram of the method for performing dual fusion of the paper node identification part and the paper content generation part of the present invention.
Detailed Description
The invention provides a paper network representation learning method based on dual sequence-to-sequence generation, which is described in detail below with reference to the accompanying drawings and specific embodiments.
Example 1:
In order to ensure normal operation of the system, the invention mainly adopts deep learning technology to perform paper node representation learning on a paper network. In a specific implementation, the computer platform used is required to have no less than 11 GB of memory, no fewer than 4 CPU cores with a main frequency of at least 2.6 GHz, a GPU environment, and a Linux operating system, with the necessary software environments installed, such as Python 3.6 or above and PyTorch 0.4 or above.
As shown in FIG. 2, which illustrates the dual fusion performed by the paper node identification part and the paper content generation part, the paper network representation learning method based on dual sequence-to-sequence generation comprises the following detailed steps:
Step 1) paper parallel sequence generation part
The paper network is $G = (V, E)$, where $V$ represents the set of all paper nodes in the network and $E \subseteq V \times V$ is the set of edges in the paper network, which encodes the citation relations among papers: if a citing or cited relation exists between two papers, an edge exists between them. For each paper node $v \in V$ in the paper network, $v^i$ denotes the number of the paper node and $v^c$ denotes the content information of the paper node. A random walk method is adopted to walk the paper network, obtaining a walked paper node sequence $S = \{v_1, v_2, \ldots, v_T\}$. For each sequence $S$ there is a corresponding paper node identification sequence $S^i = \{v_1^i, v_2^i, \ldots, v_T^i\}$ and paper node content sequence $S^c = \{v_1^c, v_2^c, \ldots, v_T^c\}$; the paper node identification sequence and the paper node content sequence are referred to as a group of paper parallel sequences. For example, suppose there is an edge between paper 1 and paper 3, between paper 3 and paper 6, between paper 6 and paper 4, and between paper 4 and paper 9. A random walk starting from paper 1 walks to paper 3 and then on to papers 6, 4, and 9; if the walk length is set to 5, the walk sequence is paper 1 → paper 3 → paper 6 → paper 4 → paper 9. From the paper numbers, the paper node identification sequence 1 → 3 → 6 → 4 → 9 is obtained; from the content information of the papers, the paper node content sequence "data mining #" → "big data #" → "natural language processing" → "text analysis #" → "web data mining" is obtained.
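To make step 1) concrete, the following is a minimal Python sketch of parallel sequence generation over the toy network from the example above; the function and variable names are illustrative assumptions, not part of the patented implementation.

```python
import random

def generate_parallel_sequences(adj, contents, start, walk_len):
    """Random walk over the paper network; returns the paper node
    identification sequence and the parallel content sequence."""
    node = start
    id_seq, content_seq = [node], [contents[node]]
    for _ in range(walk_len - 1):
        neighbors = adj.get(node, [])
        if not neighbors:            # dead end: stop the walk early
            break
        node = random.choice(neighbors)   # uniform random walk step
        id_seq.append(node)
        content_seq.append(contents[node])
    return id_seq, content_seq

# Toy paper network from the example: citation edges 1-3, 3-6, 6-4, 4-9.
adj = {1: [3], 3: [1, 6], 6: [3, 4], 4: [6, 9], 9: [4]}
contents = {1: "data mining #", 3: "big data #", 6: "natural language processing",
            4: "text analysis #", 9: "web data mining"}
id_seq, content_seq = generate_parallel_sequences(adj, contents, start=1, walk_len=5)
print(id_seq)        # e.g. [1, 3, 6, 4, 9]
print(content_seq)   # the corresponding paper node content sequence
```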
Step 2) a paper node identification part for realizing mapping from a paper node content sequence to a paper node identification sequence
Step 2.1) paper content embedding of the paper node identification part:
For the text content of each paper node, firstly the text is segmented and each word vector is randomly initialized; then CNN is adopted to capture the text content information of the paper node, and each paper node obtains its corresponding paper node semantic feature.
A group of paper parallel sequences is $(S^c, S^i)$, where $S^c = \{v_1^c, v_2^c, \ldots, v_T^c\}$ is the paper node content sequence of sequence length $T$ and $S^i = \{v_1^i, v_2^i, \ldots, v_T^i\}$ is the paper node identification sequence of sequence length $T$. The dictionary is $\mathcal{D}$, and the randomly initialized word embedding matrix is $W \in \mathbb{R}^{|\mathcal{D}| \times k_m}$, where $|\mathcal{D}|$ is the size of the dictionary and $k_m$ represents the dimension of the word embeddings. First, a lookup function $\mathrm{LookUp}_w(\cdot)$ maps the text content $v_t^c$ of the $t$-th paper node in $S^c$ to a matrix $U(v_t)$ spliced from word embedding vectors, where $u_{t,i}$ is the $i$-th word in the content of the $t$-th paper node and $n_t$ is the number of content words of the $t$-th paper node:

$$U(v_t) = [\mathrm{LookUp}_w(u_{t,1}); \mathrm{LookUp}_w(u_{t,2}); \ldots; \mathrm{LookUp}_w(u_{t,n_t})]$$
For example, in a paper network the paper nodes are identified by the numbers of the papers, and the text content of a paper node is the title or abstract of the paper. Suppose the sequence length obtained by random walk is 5, the walked paper node identification sequence is 1 → 3 → 6 → 4 → 9, and the paper node content sequence is "data mining #" → "big data #" → "natural language processing" → "text analysis #" → "web data mining", where # is a padding character. First, the content words of each paper node are embedded and spliced; for example, embedding the word "data" yields the 100-dimensional word vector [1, 0.89, 1.23, 0.54, …, 1.03] corresponding to "data". The corresponding word vector is obtained for each content word of each paper node and spliced; for instance, the final paper content embedding result $U(v_t)$ of the paper node with identification 1 is the 3 × 100-dimensional matrix [[1, 0.89, 1.23, 0.54, …, 1.03], [0.48, 0.93, 1.07, 0.76, …, 1.32], [1.78, 1.24, 0.65, 0.79, …, 0.36]].
Convolution kernels of width $k_m$ and different heights are applied to $U(v_t)$ with convolution and max-pooling operations, modeling the local semantic relations among the content words of $v_t^c$ and learning the semantic feature $x_t$ of $v_t^c$:

$$x_t = \mathrm{CNN}(U(v_t))$$

The original paper node content sequence $S^c$ thus becomes the paper node semantic feature sequence $X = \{x_1, x_2, \ldots, x_T\}$, where $T$ is the sequence length. After CNN modeling, the paper content embedding result $U(v_t)$ of each paper node is convolved into a 100-dimensional vector $x_t$; for example, the content feature vector $x_1$ of the paper node with identification 1 is [0.79, 0.68, 1.03, 0.98, …, 0.76].
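A minimal PyTorch sketch of the paper content embedding in step 2.1) — word embedding lookup followed by convolution and max-pooling over $U(v_t)$. The class name, kernel heights, and the 100-dimensional sizes (taken from the running example) are illustrative assumptions.

```python
import torch
import torch.nn as nn

class PaperContentEmbedding(nn.Module):
    """Embeds the content words of one paper node and applies CNN +
    max-pooling to obtain the node semantic feature x_t (step 2.1)."""
    def __init__(self, vocab_size, k_m=100, out_dim=100, kernel_heights=(2, 3)):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, k_m)   # randomly initialized W
        # each kernel spans the full embedding width k_m, with varying heights
        self.convs = nn.ModuleList(
            nn.Conv2d(1, out_dim // len(kernel_heights), (h, k_m))
            for h in kernel_heights)

    def forward(self, word_ids):                 # word_ids: (n_t,) word indices
        u = self.word_emb(word_ids)              # U(v_t): (n_t, k_m)
        u = u.unsqueeze(0).unsqueeze(0)          # (1, 1, n_t, k_m)
        feats = [torch.relu(conv(u)).squeeze(3).max(dim=2)[0]  # max over positions
                 for conv in self.convs]
        return torch.cat(feats, dim=1).squeeze(0)    # x_t: (out_dim,)
```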
Step 2.2) paper node content sequence encoding of the paper node identification part:
In the paper node content sequence $S^c$, semantic association information exists among the contents of different paper nodes. To capture the global semantic information of the paper node content sequence, Bi-LSTM is adopted to encode the paper node semantic feature sequence $X = \{x_1, x_2, \ldots, x_T\}$ output by the paper content embedding layer. A forward LSTM accumulates the semantic features of all paper nodes passed through from the beginning of the sequence to the current position, obtaining the current hidden state vector $\overrightarrow{h}_t$:

$$\overrightarrow{h}_t = \overrightarrow{\mathrm{LSTM}}(x_t, \overrightarrow{h}_{t-1})$$

A backward LSTM accumulates, in reverse order, the semantic features of all paper nodes passed through from the end of the sequence to the current position, obtaining the current hidden state $\overleftarrow{h}_t$:

$$\overleftarrow{h}_t = \overleftarrow{\mathrm{LSTM}}(x_t, \overleftarrow{h}_{t+1})$$

where $\overrightarrow{\mathrm{LSTM}}(\cdot)$ and $\overleftarrow{\mathrm{LSTM}}(\cdot)$ respectively represent the fusion learning performed by the forward and backward LSTM when processing the $t$-th paper node in the sequence.
The representation of the $t$-th paper node in the paper node content sequence encoding stage is $h_t = [\overrightarrow{h}_t; \overleftarrow{h}_t]$.
For example, the paper node with identification 1 is the first node in the sequence, so its forward hidden state is $\overrightarrow{h}_1$ and its backward hidden state is $\overleftarrow{h}_1$. The representation finally learned for this node in the paper node content sequence encoding stage is $h_1 = [\overrightarrow{h}_1; \overleftarrow{h}_1]$, e.g., [0.38, -0.48, …, 0.19, 1.02, -0.98, 1.29, …, 0.96, 1.20]. If the paper node with identification 1 appears multiple times in the sequence, the average of its multiple representations is taken as its final representation; the same treatment applies when paper node representations are computed by the other parts of the method.
Finally, the context feature representation of the whole paper node semantic feature sequence is obtained by splicing the final hidden state representations of the forward and backward LSTMs. Since the last hidden states of the forward and backward LSTMs contain information of the entire sequence, their spliced representation is used as the representation of the whole sequence:

$$z_{NI} = [\overrightarrow{h}_T; \overleftarrow{h}_1]$$

where $[\cdot;\cdot]$ represents the process of longitudinally splicing vectors. The finally obtained context feature representation $z_{NI}$ of the whole paper node semantic feature sequence is, e.g., [1.39, -0.98, …, 0.29, 1.05].
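A sketch of the Bi-LSTM encoding in step 2.2), under the same illustrative assumptions: each $h_t$ splices the forward and backward hidden states, and $z_{NI}$ splices the final states of the two directions.

```python
import torch
import torch.nn as nn

class ContentSequenceEncoder(nn.Module):
    """Bi-LSTM encoder over the paper node semantic feature sequence X (step 2.2)."""
    def __init__(self, in_dim=100, hidden=64):
        super().__init__()
        self.bilstm = nn.LSTM(in_dim, hidden, bidirectional=True, batch_first=True)

    def forward(self, x):                         # x: (T, in_dim) semantic features
        out, (h_n, _) = self.bilstm(x.unsqueeze(0))
        h_t = out.squeeze(0)                      # (T, 2*hidden): [fwd_t; bwd_t] per node
        # splice final forward state (position T) and final backward state (position 1)
        z_ni = torch.cat([h_n[0], h_n[1]], dim=1).squeeze(0)
        return h_t, z_ni
```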
Step 2.3) paper node identification sequence generation of the paper node identification part:
the contextual characteristics obtained in step 2.2) represent z NI Fused paper node content sequenceContent information of all paper nodes in +.> and />Sequence information carried by the user. To generate a corresponding paper node identification sequence, LSTM is first employed, in z NI As an initial state, a high-level implicit characteristic sequence oriented to the paper node identification space is directly generated without inputting the characteristic sequence>Wherein the t-th implicit feature->The generation process of (2) is as follows:
then based on the high-layer implicit characteristic sequence obtained by decodingBy means of full connectionThe junction layer adds each node feature in the high-level implicit feature sequence>Mapping to a node identification space to obtain the identification of the t-th paper node in the node identification space ∈>A semantic mapping from content modalities to structural modalities is achieved,
wherein σ (·) is a sigmoid activation function, W NI-Tran and bNI-Tran The weight matrix and the bias term of the full connection layer are respectively. Subsequently, the softmax layer is further adoptedNormalized to probability distribution over all |v| paper node identities:
finally, probability distribution is obtainedIs a probability value such as 0.29 #>The probability of j representing the predicted t-th paper node is 0.29. By comparing the probabilities on all |V| paper node identifiers, the paper node identifier with the highest probability value is finally taken as the predicted node identifier of the t-th paper node.
In the generation stage of the paper node identification sequence, the expression of the t-th paper node is as follows
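A sketch of the decoding in step 2.3): an LSTM cell seeded with $z_{NI}$ generates the implicit features without external inputs by feeding back the previous output, and a fully connected layer plus softmax maps each feature onto the $|V|$ paper node identifications, as in the equations above. The wiring of the input-free first step is an assumption.

```python
import torch
import torch.nn as nn

class IdentificationDecoder(nn.Module):
    """Generates the paper node identification sequence from z_NI (step 2.3)."""
    def __init__(self, z_dim, num_nodes):
        super().__init__()
        self.cell = nn.LSTMCell(z_dim, z_dim)   # input-free decoding: feed back d_{t-1}
        self.fc = nn.Linear(z_dim, num_nodes)   # W_NI-Tran, b_NI-Tran

    def forward(self, z_ni, seq_len):
        h = z_ni.unsqueeze(0)                    # z_NI as the initial hidden state
        c = torch.zeros_like(h)
        d_prev = torch.zeros_like(h)             # no input at the first step
        probs = []
        for _ in range(seq_len):
            h, c = self.cell(d_prev, (h, c))     # d_t = LSTM_NI-Dec(d_{t-1})
            d_prev = h
            # sigmoid mapping to identification space, then softmax normalization
            probs.append(torch.softmax(torch.sigmoid(self.fc(h)), dim=1))
        return torch.stack(probs, dim=1)          # (1, T, |V|); argmax gives predicted ids
```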
Step 3) a paper content generation part for implementing mapping from a paper node identification sequence to a paper node content sequence
Step 3.1) paper node identification embedding of the paper content generation part
A paper node identification embedding layer is adopted, and the vector representations of the different paper node identifications in the paper node identification sequence are obtained by looking up an initialized embedding matrix of the paper nodes:

$$e_t = \mathrm{LookUp}_v(v_t^i)$$

where $\mathbf{V} \in \mathbb{R}^{|V| \times k_n}$ is the initialized embedding matrix for all $|V|$ paper node identifications, $e_t$ is the node identification vector of the $t$-th paper node, and $k_n$ is the dimension of the embedding vectors. The query function $\mathrm{LookUp}_v(\cdot)$ maps each paper node identification $v_t^i$ to its corresponding embedded vector $e_t$, and these are sequentially combined into the sequence $E = \{e_1, e_2, \ldots, e_T\}$.
For example, the random-walk paper node identification sequence is 1 → 3 → 6 → 4 → 9. By looking up the embedding matrix $\mathbf{V}$, each row of which represents the identification vector of the paper node at the corresponding position, the $k_n$-dimensional identification vector of each paper node is obtained. The matrix $\mathbf{V}$ is randomly initialized; the identification vector of the paper node with identification 1 is the first row of $\mathbf{V}$, which is taken as that node's identification vector $e_1$.
Step 3.2) paper node identification sequence encoding of the paper content generation part
After obtaining $E = \{e_1, e_2, \ldots, e_T\}$, Bi-LSTM is adopted to encode the paper node identification sequence according to the sequence structure information among the paper nodes, encoding it into a context feature representation $z_{CG}$ that serves as input to the subsequent content generation process. While processing the embedded vector $e_t$ of each paper node identification, a forward LSTM accumulates the identification features of all paper nodes passed through from the beginning of the sequence to the current position, obtaining the current hidden state vector $\overrightarrow{s}_t$:

$$\overrightarrow{s}_t = \overrightarrow{\mathrm{LSTM}}(e_t, \overrightarrow{s}_{t-1})$$

Simultaneously, a backward LSTM accumulates, in reverse order, the identification features of all paper nodes passed through from the end of the sequence to the current position, obtaining the current hidden state vector $\overleftarrow{s}_t$:

$$\overleftarrow{s}_t = \overleftarrow{\mathrm{LSTM}}(e_t, \overleftarrow{s}_{t+1})$$

where $\overrightarrow{\mathrm{LSTM}}(\cdot)$ and $\overleftarrow{\mathrm{LSTM}}(\cdot)$ respectively denote the learning performed by the forward and backward LSTM at step $t$.
The representation of the $t$-th paper node in the paper node identification sequence encoding stage is $s_t = [\overrightarrow{s}_t; \overleftarrow{s}_t]$.
For example, the paper node with identification 1 is the first node in the sequence, so its forward hidden state is $\overrightarrow{s}_1$ and its backward hidden state is $\overleftarrow{s}_1$. The representation finally learned for this node in the paper node identification sequence encoding stage is $s_1 = [\overrightarrow{s}_1; \overleftarrow{s}_1]$, e.g., [0.32, -0.78, …, 0.89, 1.89, -0.38, 1.02, …, 0.39, 1.01].
Through iterative learning over $E$, the structural semantic information in the paper node identification sequence is effectively mined from two opposite directions. Then, the final hidden state representations of the forward and backward LSTMs are spliced to obtain the fused representation of the whole paper node identification sequence:

$$z_{CG} = [\overrightarrow{s}_T; \overleftarrow{s}_1]$$

where $[\cdot;\cdot]$ represents the process of longitudinally splicing vectors. The finally obtained context feature representation $z_{CG}$ of the whole paper node identification feature sequence is, e.g., [1.39, -0.98, …, 0.29, 1.05].
Step 3.3) paper semantic decoding of the paper content generation part
After the paper node identification embedding layer and the paper node identification sequence encoding layer, the structural information of the paper node identification sequence $S^i$ has been fused and compressed into the context feature representation $z_{CG}$. As the key step before generating the paper node content, the context feature representation $z_{CG}$ must be decoded to obtain the semantic feature sequence $C = \{c_1, c_2, \ldots, c_T\}$ of the whole paper sequence, which connects the two modal spaces of the network structure and the paper node content. An LSTM is used which takes the context feature representation $z_{CG}$ as initial state and directly generates the output sequence without any input feature sequence, where the semantic feature $c_t$ of the $t$-th paper node is generated as follows:

$$c_t = \mathrm{LSTM}_{CG\text{-}Dec}(c_{t-1})$$

Based on the context feature representation $z_{CG}$, $\mathrm{LSTM}_{CG\text{-}Dec}(\cdot)$ sequentially generates, from front to back, the content semantic features corresponding to all $T$ paper nodes. Each $c_t$ has fused the paper node identification information contained in $S^i$ and the sequence structure within the sequence, and serves as the basis for generating content information. In addition, after the paper nodes in the input sequence have been generated, generation can continue and the semantic vectors of new paper nodes can be predicted: e.g., after decoding $c_T$, decoding can continue to $c_{T+1}$, the predicted content semantic vector of a new paper node.
Step 3.4) paper content generation of the paper content generation part
Finally, based on the decoded paper semantic feature sequence $C = \{c_1, c_2, \ldots, c_T\}$, text content, i.e., a word sequence, is generated for each $c_t$. Following convention, an LSTM is adopted which takes $c_t$ as initial state and directly generates the word representation sequence of the paper node.
Given the maximum length $L$ of the generated text, the LSTM starts from scratch and gradually generates the word sequence; the generation process stops when the length of the word sequence reaches $L$ or the generated word is the stop symbol <EOS>. For the $t$-th paper node in the sequence, the implicit representation $g_{t,l}$ of the $l$-th word is generated as follows:

$$g_{t,l} = \mathrm{LSTM}_{CG\text{-}Gen}(w_{t,l-1},\, g_{t,l-1})$$

When $l = 1$, the high-level semantic feature $c_t$ is used as the initial hidden state to directly generate the first hidden state $g_{t,1}$ without input features, for further generating the first word. When $l > 1$, the word vector representation $w_{t,l-1}$ of the last generated word is used as the input feature and is combined with the transferred hidden state $g_{t,l-1}$ to jointly generate the current hidden state $g_{t,l}$ for further generating the current word. In the training phase and the testing phase, the word vector representation $w_{t,l-1}$ of the last generated word has different settings. During training, in order to maximize the likelihood probability of the text content of the paper node, the $(l-1)$-th real word is selected from the given content $v_t^c$ and its word vector is used as $w_{t,l-1}$ input into the LSTM. For example, the text content of the paper node with identification 1 is "data mining #"; when the second word "mining" is predicted, the embedding vector of "data", [1, 0.89, 1.23, 0.54, …, 1.03], is input as the feature.
In the testing phase, when predicting new text content for a paper node, $w_{t,l-1}$ is the word vector corresponding to the word predicted in the previous step:

$$w_{t,l-1} = \mathrm{LookUp}_w\big(\arg\max_j\, p_{t,l-1,j}\big)$$

where $p_{t,l-1,j}$ represents the probability that the word predicted in the previous step is the $j$-th word in the vocabulary, and the $\arg\max$ selects the word with the highest generation probability. For example, if the word with maximum probability predicted as the first word of the paper node with identification 1 is "data", then the embedding vector of "data", [1, 0.89, 1.23, 0.54, …, 1.03], is used as the input $w_{t,1}$ to predict the next word.
Based on the above text generation process, a text semantic sequence $G_t = \{g_{t,1}, g_{t,2}, \ldots, g_{t,L}\}$ of length $L$ is decoded for the $t$-th paper node in the sequence (in the running example, the maximum length of the word sequence of the paper node content is set to 3). A fully connected layer maps each $g_{t,l}$ into the $|\mathcal{D}|$-dimensional dictionary space:

$$\hat{m}_{t,l} = \sigma(W_{CG\text{-}Word}\, g_{t,l} + b_{CG\text{-}Word})$$

where $\sigma(\cdot)$ is a sigmoid activation function, and $W_{CG\text{-}Word}$ and $b_{CG\text{-}Word}$ are respectively the weight matrix and bias term of the fully connected layer. A softmax layer then further converts $\hat{m}_{t,l}$ into a probability distribution over all $|\mathcal{D}|$ words:

$$p_{t,l} = \mathrm{softmax}(\hat{m}_{t,l})$$

Each element of the resulting probability distribution $p_{t,l}$ is a probability value; e.g., $p_{t,l,j} = 0.35$ means that the probability that the $l$-th word of the predicted $t$-th paper node is $m_j$ is 0.35. By comparing the probabilities over all $|\mathcal{D}|$ words, the word with the highest probability value is finally taken as the predicted $l$-th word of the $t$-th node; for the content of the paper node with identification 1, the generated result is "data mining #".
If it is desired to predict the content of a new paper node, the same content generation operation is performed on the newly decoded node semantic feature $c_{T+1}$ to generate the content word sequence of the new paper node.
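A sketch of the word-level generation in step 3.4), with the training/testing switch described above: during training the ground-truth previous word is fed back (teacher forcing), while at test time the previously predicted word is. The <EOS> early stop is omitted for brevity, and all names and sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ContentGenerator(nn.Module):
    """Generates the word sequence of one paper node from its semantic feature c_t."""
    def __init__(self, vocab_size, k_m=100, hidden=100):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, k_m)
        self.cell = nn.LSTMCell(k_m, hidden)
        self.fc = nn.Linear(hidden, vocab_size)   # W_CG-Word, b_CG-Word

    def forward(self, c_t, max_len=3, target_words=None):
        h = c_t.unsqueeze(0)                       # c_t as the initial hidden state
        c = torch.zeros_like(h)
        inp = torch.zeros(1, self.word_emb.embedding_dim)  # no input word at l = 1
        word_probs = []
        for l in range(max_len):
            h, c = self.cell(inp, (h, c))          # g_{t,l} from w_{t,l-1} and g_{t,l-1}
            p = torch.softmax(torch.sigmoid(self.fc(h)), dim=1)
            word_probs.append(p)
            if target_words is not None:           # training: teacher forcing
                prev = target_words[l].view(1)
            else:                                   # testing: feed back the prediction
                prev = p.argmax(dim=1)
            inp = self.word_emb(prev)
        return torch.stack(word_probs, dim=1)       # (1, L, |D|) word distributions
```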
Step 4) dual fusion of the paper node identification part and the paper content generation part
The paper node identification part and the paper content generation part are closely related: they model the cross-modal semantic generation relations between the paper node content sequence and the paper node identification sequence from two opposite angles. To realize the fusion of complementary knowledge in the two dual parts, a linear layer is used to couple the two parts together and learn them simultaneously through the sharing of an intermediate hidden layer:

$$z'_{NI} = W_{Dual,1}\,[z_{NI}; z_{CG}] + b_{Dual,1}, \qquad z'_{CG} = W_{Dual,2}\,[z_{NI}; z_{CG}] + b_{Dual,2}$$

where $W_{Dual,1}$, $b_{Dual,1}$, $W_{Dual,2}$, $b_{Dual,2}$ are the weights and bias terms of the linear fusion layer. After the above dual fusion process, $z'_{NI}$ and $z'_{CG}$ each contain some semantic information from the target modality. Therefore, $z'_{NI}$ and $z'_{CG}$ are respectively fed into the sequence decoding layers described in step 2.3) and step 3.3), thereby improving the accuracy of decoding and generation.
The final vector representation of the $t$-th paper node is:

$$r_t = [h_t; d_t; s_t; c_t]$$

where $[\cdot;\cdot]$ represents the process of longitudinally splicing vectors, so that the paper node with identification 1 is finally represented as [0.38, -0.48, …, 0.19, 1.02, -0.98, 1.29, …, 0.96, 1.20, 0.37, -0.21, …, 0.28, 1.79, 0.32, -0.78, …, 0.89, 1.89, -0.38, 1.02, …, 0.39, 1.01, 0.31, -0.51, …, 0.78, 1.23].
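Finally, a sketch of the dual fusion in step 4) and the final node representation: a shared linear layer mixes $z_{NI}$ and $z_{CG}$, and the learned representation of a paper node splices the intermediate vectors from both parts. The exact wiring is an assumption consistent with the equations above.

```python
import torch
import torch.nn as nn

class DualFusion(nn.Module):
    """Linear fusion layer shared by the NI and CG parts (step 4)."""
    def __init__(self, dim):
        super().__init__()
        self.fuse_ni = nn.Linear(2 * dim, dim)   # W_Dual,1, b_Dual,1
        self.fuse_cg = nn.Linear(2 * dim, dim)   # W_Dual,2, b_Dual,2

    def forward(self, z_ni, z_cg):
        z = torch.cat([z_ni, z_cg], dim=-1)      # shared intermediate hidden layer
        return self.fuse_ni(z), self.fuse_cg(z)  # fused z'_NI, z'_CG for the two decoders

# Final representation of the t-th paper node: splice the intermediate vectors
# h_t (step 2.2), d_t (step 2.3), s_t (step 3.2), c_t (step 3.3).
def node_representation(h_t, d_t, s_t, c_t):
    return torch.cat([h_t, d_t, s_t, c_t], dim=-1)
```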
Claims (10)
1. The dual sequence-to-sequence generation-based paper network representation learning method is characterized by comprising the following steps:
Step 1) a paper parallel sequence generation part
A random walk method is adopted to walk the paper network to obtain paper node sequences; because each paper in the paper network has two types of information, the paper number and the paper text content, each paper node sequence obtained by the walk corresponds to two sequences containing different information, namely a paper node identification sequence and a paper node content sequence, and the two sequences form a group of paper parallel sequences;
Step 2) a paper node identification part for realizing mapping from a paper node content sequence to a paper node identification sequence
Step 2.1) paper content embedding of the paper node identification part
For the text content of each paper node, firstly the text is segmented and each word vector is randomly initialized; then a convolutional neural network CNN is adopted to capture the text content information of the paper node, and each paper node obtains its corresponding paper node semantic feature;
Step 2.2) paper node content sequence encoding of the paper node identification part
A bidirectional long short-term memory network Bi-LSTM is adopted to encode the paper node content sequence into a context feature representation; the Bi-LSTM captures the forward and backward information of the paper sequence, and the semantic representation vector obtained by encoding contains the semantic information of the whole paper node content sequence and the structural information among the paper nodes implied in the sequence, namely the citation relations among papers;
Step 2.3) paper node identification sequence generation of the paper node identification part
The semantic representation vector obtained by encoding is decoded through a long short-term memory network LSTM, and the decoded vectors are mapped into the paper node identification space to complete the generation process of the paper node identification sequence;
Step 3) a paper content generation part for realizing mapping from a paper node identification sequence to a paper node content sequence
Step 3.1) paper node identification embedding of the paper content generation part
A paper node identification embedding layer is adopted, and the vector representations of the different paper node identifications in the paper node identification sequence are obtained by looking up an initialized embedding matrix of the paper nodes;
Step 3.2) paper node identification sequence encoding of the paper content generation part
Bi-LSTM is adopted to encode the paper node identification sequence into a context feature representation according to the sequence structure information among the paper nodes, namely the citation relations among papers, as input of the subsequent semantic decoding process;
Step 3.3) paper semantic decoding of the paper content generation part
Before the paper node content is generated, the context feature representation must be decoded to obtain a paper semantic feature sequence, which is used to connect the two modal spaces of the paper network structure and the paper node content; the decoder adopts an LSTM;
Step 3.4) paper content generation of the paper content generation part
A classical LSTM is adopted to generate text content, namely a word sequence, from the semantic representation of each paper node in the paper semantic feature sequence;
Step 4) dual fusion of the paper node identification part and the paper content generation part
Through sharing of the intermediate hidden layers of the paper node identification part and the paper content generation part, the two parts are learned simultaneously, and the context feature representations obtained in step 2.2) and step 3.2) are fused in a linear fusion manner.
2. The method for learning the paper network representation based on dual sequence-to-sequence generation according to claim 1, wherein the paper parallel sequence generation part in step 1) is as follows:
the paper network is $G = (V, E)$, where $V$ represents the set of all paper nodes in the network and $E \subseteq V \times V$ is the set of edges in the paper network; for each paper node $v \in V$ in the paper network, $v^i$ represents the number of the paper node and $v^c$ represents the content information of the paper node; the random walk method is adopted to walk the paper network, obtaining a walked paper node sequence $S = \{v_1, v_2, \ldots, v_T\}$, where $T$ represents the number of nodes contained in the paper node sequence $S$, i.e., the sequence length; for each sequence $S$ there is a corresponding paper node identification sequence $S^i = \{v_1^i, v_2^i, \ldots, v_T^i\}$ and paper node content sequence $S^c = \{v_1^c, v_2^c, \ldots, v_T^c\}$, and the paper node identification sequence and the paper node content sequence are called a group of paper parallel sequences; the paper node identification sequence $S^i$ contains the structural information among paper nodes, namely the citation relations among papers, while the paper node content sequence $S^c$ contains the content information of the papers and part of the inter-paper structural information; because the two sequences contain different information, the paper network structural information and the paper node content information can be fused through the mutual mapping process of the two sequences.
3. The method of claim 2, wherein the method of step 2.1) is as follows:
for the text content of each paper node, firstly, segmenting the text, randomly initializing each word vector, then capturing the text content information of the paper node by adopting CNN, and obtaining the corresponding node semantic feature by each paper node;
parallel sequences of articles are wherein />For the content sequence of the thesis node with the sequence length of T, < >>The sequence is identified for the paper node with the sequence length of T, and the dictionary is +.>Randomly initialized word embedding matrix is +.> For the size of the dictionary, k m Representing the dimension of word embedding, first, a look-up function LookUp is adopted w (. Cndot. Cndot.) will be->Text content of the t-th paper node in (2)>Matrix formed by splicing word embedding vectors wherein t=1,2,...,T,u t,i For the i-th word in the node content of the t-th paper,>number of content words for the t-th paper node:
using a plurality of widths k m Is set in U (v t ) Rolling and maximum pooling operations are performed on the model building machine, and modeling can be performedLearning +.>Is>
4. The method for learning the paper network representation based on dual sequence-to-sequence generation according to claim 3, wherein the method for encoding the paper node content sequence of the paper node identification part in step 2.2) is as follows:
in the paper node content sequence $S^c$, semantic association information exists among the contents of different paper nodes; to capture the global semantic information of the paper node content sequence, Bi-LSTM is adopted to encode the paper node semantic feature sequence $X = \{x_1, x_2, \ldots, x_T\}$ output by the paper content embedding method; a forward LSTM accumulates the semantic features of all paper nodes passed through from the beginning of the sequence to the current position, obtaining the current hidden state vector $\overrightarrow{h}_t$:

$$\overrightarrow{h}_t = \overrightarrow{\mathrm{LSTM}}(x_t, \overrightarrow{h}_{t-1})$$

a backward LSTM accumulates, in reverse order, the semantic features of all paper nodes passed through from the end of the sequence to the current position, obtaining the current hidden state $\overleftarrow{h}_t$:

$$\overleftarrow{h}_t = \overleftarrow{\mathrm{LSTM}}(x_t, \overleftarrow{h}_{t+1})$$

wherein $\overrightarrow{\mathrm{LSTM}}(\cdot)$ and $\overleftarrow{\mathrm{LSTM}}(\cdot)$ respectively represent the forward and backward LSTM networks, $\overrightarrow{h}_t$ and $\overleftarrow{h}_t$ respectively represent the forward and backward hidden states corresponding to the $t$-th node, and the value range of $t$ is $t = 1, 2, \ldots, T$;
the representation of the $t$-th paper node in the paper node content sequence encoding stage is $h_t = [\overrightarrow{h}_t; \overleftarrow{h}_t]$;
finally, the context feature representation $z_{NI}$ of the whole paper node semantic feature sequence is obtained by splicing the final hidden state representations of the forward and backward LSTMs,

$$z_{NI} = [\overrightarrow{h}_T; \overleftarrow{h}_1]$$

where $[\cdot;\cdot]$ represents the process of longitudinally splicing vectors.
5. The method for learning the paper network representation based on dual sequence-to-sequence generation according to claim 4, wherein the paper node identification sequence generation of the paper node identification part in step 2.3) is as follows:
the context feature representation $z_{NI}$ obtained in step 2.2) fuses the content information of all paper nodes in the paper node content sequence $S^c$ and the sequence information carried by the forward and backward LSTMs; in order to generate the corresponding paper node identification sequence, an LSTM is first adopted which takes $z_{NI}$ as initial state and directly generates a high-level implicit feature sequence $D = \{d_1, d_2, \ldots, d_T\}$ oriented to the paper node identification space without inputting a feature sequence, wherein the $t$-th implicit feature $d_t$ is generated as follows:

$$d_t = \mathrm{LSTM}_{NI\text{-}Dec}(d_{t-1})$$

then, based on the high-level implicit feature sequence $D$ obtained by decoding, a fully connected layer is used to map each node feature $d_t$ in the high-level implicit feature sequence to the node identification space, obtaining the identification representation $\hat{y}_t$ of the $t$-th paper node in the node identification space and realizing a semantic mapping from the content modality to the structural modality,

$$\hat{y}_t = \sigma(W_{NI\text{-}Tran}\, d_t + b_{NI\text{-}Tran})$$

wherein $\sigma(\cdot)$ is a sigmoid activation function, and $W_{NI\text{-}Tran}$ and $b_{NI\text{-}Tran}$ are respectively the weight matrix and bias term of the fully connected layer; subsequently, a softmax layer is further adopted to normalize $\hat{y}_t$ into a probability distribution over all $|V|$ node identifications:

$$p_t = \mathrm{softmax}(\hat{y}_t)$$

in the generation stage of the paper node identification sequence, the representation of the $t$-th paper node is $d_t$.
6. The method for learning the dual sequence-to-sequence generated paper network representation according to claim 5, wherein the paper node identification embedding method of the paper content generating part in step 3.1) is as follows:
adopting a paper node identification embedding layer, and acquiring identification vector representations of different paper nodes in a paper node identification sequence by searching an initialization embedding matrix of the paper nodes;
$$e_t = \mathrm{LookUp}_v(v_t^i)$$

wherein $\mathbf{V} \in \mathbb{R}^{|V| \times k_n}$ is the initialized embedding matrix for all $|V|$ paper node identifications, $e_t$ is the node identification vector of the $t$-th paper node, and $k_n$ is the dimension of the embedding vectors; the query function $\mathrm{LookUp}_v(\cdot)$ maps each paper node identification $v_t^i$ to the corresponding embedded vector $e_t$, and these are sequentially combined into the sequence $E = \{e_1, e_2, \ldots, e_T\}$.
7. The method for learning the network representation of the paper based on the dual sequence-to-sequence generation according to claim 6, wherein the method for encoding the node identification sequence of the paper in the paper content generation part in the step 3.2) is as follows:
after obtaining $E = \{e_1, e_2, \ldots, e_T\}$, Bi-LSTM is adopted to encode the paper node identification sequence according to the sequence structure information among the paper nodes, encoding the paper node identification sequence into a context feature representation $z_{CG}$ as input to the subsequent paper content generation process; while processing the embedded vector $e_t$ of each paper node identification, a forward LSTM accumulates the identification features of all paper nodes passed through from the beginning of the sequence to the current position, obtaining the current hidden state vector $\overrightarrow{s}_t$:

$$\overrightarrow{s}_t = \overrightarrow{\mathrm{LSTM}}(e_t, \overrightarrow{s}_{t-1})$$

simultaneously, a backward LSTM accumulates, in reverse order, the identification features of all paper nodes passed through from the end of the sequence to the current position, obtaining the current hidden state vector $\overleftarrow{s}_t$:

$$\overleftarrow{s}_t = \overleftarrow{\mathrm{LSTM}}(e_t, \overleftarrow{s}_{t+1})$$

wherein $\overrightarrow{\mathrm{LSTM}}(\cdot)$ and $\overleftarrow{\mathrm{LSTM}}(\cdot)$ respectively represent the forward and backward LSTM networks, $\overrightarrow{s}_t$ and $\overleftarrow{s}_t$ respectively represent the forward and backward hidden states corresponding to the $t$-th node, and the value range of $t$ is $t = 1, 2, \ldots, T$;
the representation of the $t$-th paper node in the paper node identification sequence encoding stage is $s_t = [\overrightarrow{s}_t; \overleftarrow{s}_t]$;
through iterative learning over $E$, the structural semantic information in the paper node identification sequence is effectively mined from two opposite directions; then the representation of the whole paper node identification sequence is obtained by splicing the final hidden state representations of the forward and backward LSTMs:

$$z_{CG} = [\overrightarrow{s}_T; \overleftarrow{s}_1]$$
8. The method for learning the paper network representation based on dual sequence-to-sequence generation according to claim 7, wherein the paper semantic decoding method of the paper content generation part in step 3.3) is as follows:
after the paper node identification embedding layer and the paper node identification sequence encoding layer, the structural information of the paper node identification sequence $S^i$ has been fused and compressed into the context feature representation $z_{CG}$; as a key step before node content is generated, the context feature representation $z_{CG}$ needs to be decoded to obtain the semantic feature sequence $C = \{c_1, c_2, \ldots, c_T\}$ of the whole paper sequence, which is used to connect the two modal spaces of the network structure and the node contents; an LSTM is used which takes the context feature representation $z_{CG}$ as initial state and directly generates the output sequence without inputting a feature sequence, wherein the semantic feature $c_t$ of the $t$-th paper node is generated as follows:

$$c_t = \mathrm{LSTM}_{CG\text{-}Dec}(c_{t-1})$$

based on the context feature representation $z_{CG}$, $\mathrm{LSTM}_{CG\text{-}Dec}(\cdot)$ sequentially generates, from front to back, the content semantic features corresponding to all $T$ paper nodes; each $c_t$ has fused the identification information of the paper nodes contained in $S^i$ and the sequence structure within the sequence, and serves as the basis for generating content information; in addition, after the semantic vectors of the paper nodes in the input sequence are generated, the semantic vectors of new paper nodes can be continuously generated and predicted.
9. The method for learning the paper network representation based on dual sequence-to-sequence generation according to claim 8, wherein the paper content generation method of the paper content generation part in step 3.4) is as follows:
finally, based on the decoded paper semantic feature sequence $C = \{c_1, c_2, \ldots, c_T\}$, an LSTM is adopted which takes $c_t$ as initial state and directly generates the word representation sequence of the node;
given the maximum length $L$ of the generated text, the LSTM starts from scratch and gradually generates the word sequence, stopping the generation process when the length of the word sequence reaches $L$ or the generated word is the stop symbol <EOS>; for the $t$-th paper node in the sequence, the implicit representation $g_{t,l}$ of the $l$-th word is generated as follows:

$$g_{t,l} = \mathrm{LSTM}_{CG\text{-}Gen}(w_{t,l-1},\, g_{t,l-1})$$

when $l = 1$, the high-level semantic feature $c_t$ is used as the initial hidden state to directly generate the first hidden state $g_{t,1}$ without input features, for further generating the first word; when $l > 1$, the word vector representation $w_{t,l-1}$ of the last generated word is used as the input feature and is combined with the transferred hidden state $g_{t,l-1}$ to jointly generate the current hidden state $g_{t,l}$ for further generating the current word; in the training phase and the testing phase, the word vector representation $w_{t,l-1}$ of the last generated word has different settings; during training, in order to maximize the likelihood probability of the text content of the node, the $(l-1)$-th real word is selected from the given content $v_t^c$ and its word vector is used as $w_{t,l-1}$ input into the LSTM;
in the testing phase, when predicting new text content for the paper node, $w_{t,l-1}$ is the word vector corresponding to the word predicted in the previous step:

$$w_{t,l-1} = \mathrm{LookUp}_w\big(\arg\max_j\, p_{t,l-1,j}\big)$$

wherein $p_{t,l-1,j}$ represents the probability that the word predicted in the previous step is the $j$-th word in the vocabulary, and the word with the highest probability is selected;
based on the above text generation process, a text semantic sequence $G_t = \{g_{t,1}, g_{t,2}, \ldots, g_{t,L}\}$ of length $L$ is decoded for the $t$-th paper node in the sequence; a fully connected layer maps each $g_{t,l}$ into the $|\mathcal{D}|$-dimensional dictionary space, obtaining the vector representation $\hat{m}_{t,l}$:

$$\hat{m}_{t,l} = \sigma(W_{CG\text{-}Word}\, g_{t,l} + b_{CG\text{-}Word})$$

wherein $\sigma(\cdot)$ is a sigmoid activation function, and $W_{CG\text{-}Word}$ and $b_{CG\text{-}Word}$ are respectively the weight matrix and bias term of the fully connected layer; a softmax layer further converts $\hat{m}_{t,l}$ into a probability distribution over all $|\mathcal{D}|$ words:

$$p_{t,l} = \mathrm{softmax}(\hat{m}_{t,l})$$

if it is desired to predict the content of a new paper node, the same operation is performed on the semantic vector of the predicted new paper node to obtain the content word sequence of the new paper node.
10. The method for learning the paper network representation based on dual sequence-to-sequence generation according to claim 9, wherein the method for the dual fusion of the paper node identification part and the paper content generation part in step 4) is as follows:
the paper node identification part and the paper content generation part are closely related: they model the cross-modal semantic generation relations between the paper node content sequence and the paper node identification sequence from two opposite angles; to realize the fusion of complementary knowledge in the two dual parts, a linear layer is used to couple the two parts together and learn them simultaneously through the sharing of an intermediate hidden layer;

$$z'_{NI} = W_{Dual,1}\,[z_{NI}; z_{CG}] + b_{Dual,1}, \qquad z'_{CG} = W_{Dual,2}\,[z_{NI}; z_{CG}] + b_{Dual,2}$$

wherein $W_{Dual,1}$, $b_{Dual,1}$, $W_{Dual,2}$, $b_{Dual,2}$ are the weights and bias terms of the linear fusion layer; after the above dual fusion process, $z'_{NI}$ and $z'_{CG}$ have included semantic information from the target modality; therefore, $z'_{NI}$ and $z'_{CG}$ are respectively fed into the sequence decoding layers described in step 2.3) and step 3.3), thereby improving the accuracy of decoding and generation;
the vector representation of the final $t$-th paper node is:

$$r_t = [h_t; d_t; s_t; c_t]$$
Priority Applications (1)
- CN201911300281.1A, priority and filing date 2019-12-17: CN111104797B, Dual-based sequence-to-sequence generation paper network representation learning method.
Publications (2)
- CN111104797A, published 2020-05-05.
- CN111104797B, granted 2023-05-02.