CN113268993B

CN113268993B - Mutual information-based unsupervised network representation learning method for attribute heterogeneous information network

Info

Publication number: CN113268993B
Application number: CN202110599831.5A
Authority: CN
Inventors: 陈波冯; 王晓玲; 卢兴见; 张吉
Original assignee: East China Normal University; Zhejiang Lab
Current assignee: East China Normal University; Zhejiang Lab
Priority date: 2021-05-31
Filing date: 2021-05-31
Publication date: 2024-05-14
Anticipated expiration: 2041-05-31
Also published as: CN113268993A

Abstract

The invention discloses a mutual information-based unsupervised network representation learning method for an attribute heterogeneous information network. A multi-view network is extracted from the original attribute heterogeneous information network according to a preset node type and meta-paths, and an encoder is trained for each single-view network. Training uses an unsupervised learning method based on mutual information, with a loss function that considers the mutual information between the node representations and the global representation of each view. After the node representation matrix of each view is learned, an importance score is calculated for each view using mutual information, and the node representation matrices of the views are weighted and summed according to the importance scores to obtain the final node representation matrix. The invention jointly considers network structure information, node attribute information, the interaction between different edge types in the heterogeneous information network, and the importance of nodes in the networks of different views, and can effectively improve the accuracy of node representation.

Description

Mutual information-based unsupervised network representation learning method for attribute heterogeneous information network

Technical Field

The invention belongs to the technical field of data networks, and particularly relates to an attribute heterogeneous information network unsupervised network representation learning method based on mutual information.

Background

A graph, or network, is a general data structure that stores and expresses entities and their connections better than other data structures. It is widely used to represent complex structured data; common examples include social networks, biological networks, and financial networks. Compared with traditional structured data, the relationships between nodes in a network are more complex. Network representation learning aims to map each node in the network to a low-dimensional, dense vector so that the resulting vectors support representation and inference in vector space while retaining characteristics of the original network, for example that interconnected nodes have more similar representations. Once obtained, the node representation vectors are used for network analysis tasks such as node classification, link prediction, community detection, and other graph mining tasks. Network representation learning thus serves as a bridge between the raw network data and the subsequent network analysis tasks, which makes it highly significant. Most existing network embedding methods concentrate on homogeneous information networks. Modeling a real interactive system as a homogeneous information network usually extracts only part of its information, or fails to distinguish the heterogeneity between objects, and therefore loses information. More and more researchers have therefore turned to learning on heterogeneous information networks. Compared with a homogeneous information network, a heterogeneous information network contains multiple types of nodes or multiple types of edges, and so abstracts real-life scenes more accurately.

For example, in a social network, in addition to edges describing friendship between people, other types of edges may describe relationships such as that between colleagues. Compared with a homogeneous information network, a heterogeneous information network therefore contains multiple coexisting types of objects and relations and carries rich structural and semantic information.

Existing heterogeneous information network representation learning methods fall into two categories: random walk-based algorithms and graph neural network-based algorithms. A random walk-based algorithm generates a number of paths in the heterogeneous information network through meta-path-guided random walks, and then trains a neural network so that node pairs that co-occur in these paths with high probability have more similar vector representations. The main defect of this approach is that it does not use the attribute information in the network. A graph neural network-based algorithm first uses meta-paths to extract nodes with semantic relations, then uses a graph neural network to learn vector representations of the nodes, and finally constructs a supervised loss module to guide model training. The main defect of this approach is that model training must be guided by labeled samples. The challenge of attribute heterogeneous information network representation learning is therefore to consider not only the attribute information of nodes but also the correlation information between edges of different relations, and, further, to guide model training without manually annotated information.

Disclosure of Invention

The invention aims to overcome the defects of the prior art, and provides an unsupervised network representation learning method for an attribute heterogeneous information network based on mutual information, which utilizes mutual information maximization, comprehensively considers information in the network by an unsupervised method, so that the learned node representation can capture complex semantic information in the attribute heterogeneous information network, and high-quality node vector representation is obtained.

In order to achieve the above object, the invention provides a mutual information-based unsupervised network representation learning method for an attribute heterogeneous information network, comprising the following steps:

S1: extracting a multi-view network $G=\{V, A^{(1)}, A^{(2)}, \dots, A^{(M)}, X\}$ containing M views from the original attribute heterogeneous information network according to a preset node type and M meta-paths, wherein $V=\{v_1, v_2, \dots, v_N\}$ represents the node set, $v_i$ represents the ith node, $i=1,2,\dots,N$, and N represents the number of nodes; $A^{(m)}$ represents the N-order topology matrix of the view network extracted according to the mth meta-path, $m=1,2,\dots,M$; if there is an edge between node $v_i$ and node $v_j$ under the mth meta-path, the corresponding element of the topology matrix is $A^{(m)}_{ij}=1$, otherwise $A^{(m)}_{ij}=0$, $j=1,2,\dots,N$; X represents the node feature matrix of size N×K, whose ith row is the K-dimensional feature vector of node $v_i$;

S2: after obtaining the multi-view network $G=\{V, A^{(1)}, A^{(2)}, \dots, A^{(M)}, X\}$, training an encoder for each single-view network, the specific method being as follows:

splitting the multi-view network $G=\{V, A^{(1)}, A^{(2)}, \dots, A^{(M)}, X\}$ into M single-view networks $G^{m}=\{V, A^{(m)}, X\}$, and configuring an encoder $Q^{m}$ for each single-view network $G^{m}$, the structure of the encoder being set as required, wherein the input of the encoder $Q^{m}$ is the topology matrix $A^{(m)}$ and the node feature matrix X of the view, the output is the node representation matrix $H^{(m)}$ of the view, and the ith row of $H^{(m)}$ is the node representation vector $h_i^{(m)}$ of node $v_i$; training each encoder $Q^{m}$ with an unsupervised learning method based on mutual information by maximizing the loss function $L^{m}$ to obtain the node representation matrix $H^{(m)}$; the calculation formula of the loss function $L^{m}$ is as follows:

wherein D() represents a scoring function; $s^{(m)}$ represents the global representation vector of the single-view network $G^{m}$, computed from the node representation matrix $H^{(m)}$ by the readout function Readout(), in which η represents an activation function; and $\tilde{h}_i^{(m)}$ represents the ith row vector of the node representation matrix $\tilde{H}^{(m)}$ obtained by randomly shuffling the node feature matrix X and inputting it together with the topology matrix $A^{(m)}$ into the encoder $Q^{m}$;

S3: calculating the mutual information $MI^{(m)}$ between each view network and the other view networks in the attribute heterogeneous information network, the calculation formula being as follows:

wherein $h_i^{(m')}$ represents the node representation vector of node $v_i$ in the node representation matrix $H^{(m')}$ of the m'th view; calculating the importance score $a^{(m)}$ of the mth view:

generating the final representation vector matrix H of the nodes by weighted-average aggregation:

$$H=\sum_{m=1}^{M} a^{(m)} H^{(m)}$$

The i-th row vector in the final representation vector matrix H is the final node representation vector of the node v _i.

The invention discloses a mutual information-based attribute heterogeneous information network unsupervised network representation learning method, which is characterized in that a multi-view network is extracted from an original attribute heterogeneous information network according to a preset node type and a meta path, an encoder is trained for each single-view network respectively, the unsupervised learning method based on mutual information is adopted during training, the mutual information of each view node representation and global representation is comprehensively considered by a loss function, after a node representation matrix of each view is obtained through learning, the importance score of each view is calculated by using the mutual information, and the node representation matrix of each view is weighted and summed according to the importance score to obtain a final node representation matrix.

The invention has the following beneficial effects:

1) The invention jointly considers network structure information, node attribute information, the interaction between different edge types in the heterogeneous information network, and the importance of nodes in the networks of different views, and can effectively improve the accuracy of node representation;

2) The invention can learn a globally consistent node representation of the attribute heterogeneous information network, and the obtained node representation can be applied to various graph analysis tasks;

3) When the encoder is trained, a cross-view mutual information loss function can be adopted to guide model training, and the strategy not only captures the similarity relationship and the interaction association relationship of the nodes in the networks with different view angles, but also eliminates the information redundancy of the nodes in the different view angles;

4) The invention is suitable for graphs with rich node attributes, such as a film data network, in which the nodes (films) have rich text descriptions that capture information unique to each film.

Drawings

FIG. 1 is a flow chart of an embodiment of an unsupervised network representation learning method of the attribute heterogeneous information network of the present invention.

Detailed Description

The following description of embodiments of the invention is presented in conjunction with the accompanying drawings so that those skilled in the art can better understand the invention. It is expressly noted that, in the description below, detailed descriptions of known functions and designs are omitted where they might obscure the main content of the present invention.

Examples

FIG. 1 is a flow chart of an embodiment of an unsupervised network representation learning method of the attribute heterogeneous information network of the present invention. As shown in fig. 1, the method for learning the unsupervised network representation of the attribute heterogeneous information network comprises the following specific steps:

S101: extracting a multi-view network based on meta-paths:

A meta-path is a specific path connecting two entities. In this embodiment the attribute heterogeneous information network is a movie data network, in which the meta-path "actor -> movie -> director -> movie -> actor" connects two actors and can be regarded as a way of mining potential relationships between actors. Using meta-paths therefore exploits the network structure of the heterogeneous information network fully and intuitively and yields semantically rich edge information, from which the multi-view network is formed.

In the invention, a multi-view network $G=\{V, A^{(1)}, A^{(2)}, \dots, A^{(M)}, X\}$ containing M views is extracted from the original attribute heterogeneous information network according to a preset node type and M meta-paths, wherein $V=\{v_1, v_2, \dots, v_N\}$ represents the node set, $v_i$ represents the ith node, $i=1,2,\dots,N$, and N represents the number of nodes; $A^{(m)}$ represents the N-order topology matrix of the view network extracted according to the mth meta-path, $m=1,2,\dots,M$; if there is an edge between node $v_i$ and node $v_j$ under the mth meta-path, the corresponding element of the topology matrix is $A^{(m)}_{ij}=1$, otherwise $A^{(m)}_{ij}=0$, $j=1,2,\dots,N$; X represents the node feature matrix of size N×K, whose ith row is the K-dimensional feature vector of node $v_i$.

Taking the movie data network as an example, movie nodes are selected and 2 meta-paths are set, namely "movie -> director -> movie" and "movie -> actor -> movie", so as to extract a multi-view network containing 2 views.
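As an illustration of this extraction step, the following minimal Python sketch builds the topology matrix of each view from toy data. The function name `metapath_adjacency`, the dictionaries, and the toy movies are hypothetical and are not part of the embodiment. The sketch also assumes that each view is undirected and that self-loops are excluded, which the text does not specify.

```python
from itertools import combinations


def metapath_adjacency(num_movies, links):
    """Build the N x N topology matrix of a movie-X-movie meta-path view.

    links maps each intermediate node (an actor or a director) to the list
    of movie indices it is connected to. Two distinct movies get an edge
    when they share at least one intermediate node.
    """
    A = [[0] * num_movies for _ in range(num_movies)]
    for movies in links.values():
        for i, j in combinations(sorted(set(movies)), 2):
            A[i][j] = 1
            A[j][i] = 1  # assumption: the view is treated as undirected
    return A


# Hypothetical toy data: 4 movies, 2 directors, 3 actors.
director_to_movies = {"d0": [0, 1], "d1": [2, 3]}
actor_to_movies = {"a0": [0, 2], "a1": [1, 2, 3], "a2": [3]}

A_mdm = metapath_adjacency(4, director_to_movies)  # movie -> director -> movie
A_mam = metapath_adjacency(4, actor_to_movies)     # movie -> actor -> movie
```

Both matrices share the same node set (the four movies), so together with a feature matrix X they would form a multi-view network with M = 2.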

S102: building and training an encoder for each single-view network:

After the multi-view network $G=\{V, A^{(1)}, A^{(2)}, \dots, A^{(M)}, X\}$ is obtained, one encoder is trained for each single-view network. Each encoder expresses one meta-path, so that semantic information unique to the view is derived from the attribute information and topology information of the single-view network. The specific method is as follows:

The multi-view network $G=\{V, A^{(1)}, A^{(2)}, \dots, A^{(M)}, X\}$ is split into M single-view networks $G^{m}=\{V, A^{(m)}, X\}$, and an encoder $Q^{m}$ is configured for each single-view network $G^{m}$, with its structure set as required. The input of the encoder $Q^{m}$ is the topology matrix $A^{(m)}$ and the node feature matrix X of the view, the output is the node representation matrix $H^{(m)}$ of the view, and the ith row of $H^{(m)}$ is the node representation vector $h_i^{(m)}$ of node $v_i$. Each encoder $Q^{m}$ is trained with an unsupervised learning method based on mutual information by maximizing the loss function $L^{m}$, which yields the node representation matrix $H^{(m)}$.

The loss function $L^{m}$ in the present invention uses a mutual information loss. Mutual information is a useful measure in information theory: it can be seen as the amount of information that one random variable contains about another, or as the reduction in uncertainty of one random variable due to knowledge of another. Maximizing mutual information makes the output contain more information about the input and concentrate on patterns that occur more frequently in the input, which reduces the redundancy of the output. The system can be seen as a channel connecting input and output, with mutual information representing the average amount of information transferred per symbol over the channel, so that maximizing mutual information is equivalent to transferring more information with fewer symbols. In terms of vector representation, more information is expressed with a smaller embedding space, so that the redundancy of the embedding space is small.
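For two discrete random variables, the quantity described here has the standard closed form $I(X;Y)=\sum_{x,y} p(x,y)\log\frac{p(x,y)}{p(x)p(y)}$. The short sketch below evaluates it on two toy joint distributions. It only illustrates the definition and is not the estimator used by the invention, which works with a scoring function on learned representations; the function name and the tables are illustrative.

```python
import math


def mutual_information(joint):
    """I(X;Y) in nats for a joint probability table joint[x][y]."""
    px = [sum(row) for row in joint]           # marginal of X
    py = [sum(col) for col in zip(*joint)]     # marginal of Y
    mi = 0.0
    for x, row in enumerate(joint):
        for y, p in enumerate(row):
            if p > 0:                          # 0 * log 0 is taken as 0
                mi += p * math.log(p / (px[x] * py[y]))
    return mi


# Independent variables share no information.
independent = [[0.25, 0.25], [0.25, 0.25]]
# Y is a copy of X: knowing one removes all uncertainty about the other.
identical = [[0.5, 0.0], [0.0, 0.5]]

mi_independent = mutual_information(independent)
mi_identical = mutual_information(identical)
```

The independent table gives zero mutual information, while the copied variable gives log 2 nats, the full entropy of a fair binary variable.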

In order to enable an encoder of a single view network to learn global structure information of a data graph, a calculation formula of a loss function L ^m adopted in training of the encoder Q ^m is as follows:

where D() represents a scoring function, for which a bilinear function is employed in this embodiment; $s^{(m)}$ represents the global representation vector of the single-view network $G^{m}$, computed from the node representation matrix $H^{(m)}$ by the readout function Readout(), in which η represents an activation function, typically a sigmoid function; and $\tilde{h}_i^{(m)}$ represents the ith row vector of the node representation matrix $\tilde{H}^{(m)}$ obtained by randomly shuffling the node feature matrix X and inputting it together with the topology matrix $A^{(m)}$ into the encoder $Q^{m}$.
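The following sketch shows one way these ingredients could fit together. It assumes that the objective takes the form used by Deep Graph Infomax, namely $\sum_i \log D(h_i^{(m)}, s^{(m)}) + \sum_i \log\bigl(1 - D(\tilde{h}_i^{(m)}, s^{(m)})\bigr)$, and that the readout is the activation applied to the mean of the node vectors. Both are assumptions rather than the formula of the invention, and all function names and toy matrices are hypothetical.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def shuffle_rows(X, seed=0):
    """Corruption step: randomly permute the rows of the feature matrix X."""
    rows = [list(r) for r in X]
    random.Random(seed).shuffle(rows)
    return rows


def readout(H):
    """Assumed readout: sigmoid of the column-wise mean of the node vectors."""
    n = len(H)
    return [sigmoid(sum(col) / n) for col in zip(*H)]


def bilinear_score(h, W, s):
    """Bilinear scoring function D(h, s) = sigmoid(h^T W s)."""
    Ws = [sum(w * sj for w, sj in zip(row, s)) for row in W]
    return sigmoid(sum(hi * v for hi, v in zip(h, Ws)))


def view_objective(H, H_corrupt, W):
    """Assumed DGI-style objective for one view; larger is better."""
    s = readout(H)
    pos = sum(math.log(bilinear_score(h, W, s)) for h in H)
    neg = sum(math.log(1.0 - bilinear_score(h, W, s)) for h in H_corrupt)
    return pos + neg


# Toy check with hand-picked matrices. In the method, H and H_corrupt would
# both come from the encoder, the latter fed with shuffle_rows(X).
W = [[1.0, 0.0], [0.0, 1.0]]
H_real = [[2.0, 2.0], [2.0, 2.0]]
H_fake = [[-2.0, -2.0], [-2.0, -2.0]]
good = view_objective(H_real, H_fake, W)  # real nodes agree with the summary
bad = view_objective(H_fake, H_real, W)   # roles swapped
```

Under these assumptions the objective is higher when the real node vectors agree with the global vector and the corrupted ones do not, which is the behavior that maximization rewards.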

The encoder in this embodiment employs a single graph convolutional layer of a graph convolutional network (Graph Convolutional Network, GCN). The graph convolutional network generalizes convolution to the graph domain and extends existing deep neural network models to data represented as graphs. Its basic idea is an information propagation mechanism on the graph with three steps, namely information construction, neighbor aggregation, and representation updating, in which the vector of each node is updated from the previous-step state of its neighboring nodes under the homophily assumption. For the specific principle and updating process of the graph convolutional network model, see the paper "Kipf T N and Welling M. Semi-supervised classification with graph convolutional networks[J]. arXiv preprint arXiv:1609.02907, 2016". When a graph convolutional layer is adopted as the encoder, the expression of the node representation matrix $H^{(m)}$ is as follows:

where σ represents the activation function, typically set to the ReLU function; $I_N$ represents the identity matrix of order N; w represents a preset weight; the degree matrix in the expression is that of the nodes in the attribute heterogeneous information network; and $W^{(m)}$ represents the weight matrix of the graph convolutional layer.
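A single graph convolutional layer of this kind can be sketched as follows. The sketch assumes the renormalized propagation rule of the cited Kipf and Welling paper with a weighted self-loop, $H^{(m)} = \sigma\bigl(\hat{D}^{-1/2}(A^{(m)} + wI_N)\hat{D}^{-1/2} X W^{(m)}\bigr)$, where $\hat{D}$ is the degree matrix of $A^{(m)} + wI_N$. The symbol $\hat{D}$, the helper names, and the toy matrices are illustrative assumptions. Plain Python lists keep the example dependency-free; a tensor library would be used in practice.

```python
import math


def matmul(A, B):
    """Dense matrix product of two list-of-lists matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]


def gcn_layer(A, X, W, w=1.0):
    """One graph convolutional layer with self-loop weight w and ReLU.

    Assumes w > 0 so that every node has a positive degree.
    """
    n = len(A)
    # Add weighted self-loops: A_hat = A + w * I_N
    A_hat = [[A[i][j] + (w if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    deg = [sum(row) for row in A_hat]
    # Symmetric normalization: D^{-1/2} A_hat D^{-1/2}
    A_norm = [[A_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
              for i in range(n)]
    Z = matmul(matmul(A_norm, X), W)
    return [[max(0.0, z) for z in row] for row in Z]  # ReLU


# Toy view: two connected nodes with one-hot features.
A_toy = [[0, 1], [1, 0]]
X_toy = [[1, 0], [0, 1]]
W_identity = [[1.0, 0.0], [0.0, 1.0]]
H_toy = gcn_layer(A_toy, X_toy, W_identity)
```

With the identity weight matrix, each node's output is the average of its own features and its neighbor's, which shows the neighbor aggregation step in isolation.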

The node representations of the different views obtained through the above learning are independent of each other. However, since the networks of different relation types in the attribute heterogeneous information network share the same node set V and node feature matrix X, it is desirable for the trained weight parameters of the M encoders to be as similar as possible, so as to capture the hidden association information in the multi-view network. In this embodiment, mutual information may therefore further be used to model the interaction between the networks of different views, that is, to maximize the mutual information between the node representations of one view and the global representations of the other views. In other words, multi-view network collaborative training can further be performed by maximizing a loss function Loss, whose calculation formula is as follows:

where $s^{(m')}$ represents the global representation vector of the single-view network $G^{m'}$, computed from the node representation matrix $H^{(m')}$ by the readout function Readout().

The loss function $L^{m}$ learns view-specific node embedding vectors by maximizing the mutual information between the node representations and the global representation of each view. The loss function Loss performs multi-view network collaborative training by maximizing the mutual information between the node representations of one view and the global representations of the other views. It guides the node representation vectors learned by the encoders to capture the hidden association information and hidden similarity information in the multi-view network and to learn the interaction information between views.
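The cross-view objective can be sketched along the same lines. The code below assumes that Loss sums, over every ordered pair of distinct views (m, m'), the term $\sum_i \log D(h_i^{(m)}, s^{(m')}) + \sum_i \log\bigl(1 - D(\tilde{h}_i^{(m)}, s^{(m')})\bigr)$, that the readout is a sigmoid of the mean node vector, and that one bilinear matrix W is shared by all pairs. These are assumptions for illustration, as are all function names and toy values.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def readout(H):
    """Assumed readout: sigmoid of the column-wise mean of the node vectors."""
    n = len(H)
    return [sigmoid(sum(col) / n) for col in zip(*H)]


def score(h, W, s):
    """Bilinear scoring function D(h, s) = sigmoid(h^T W s)."""
    Ws = [sum(w * sj for w, sj in zip(row, s)) for row in W]
    return sigmoid(sum(hi * v for hi, v in zip(h, Ws)))


def cross_view_objective(H_views, H_corrupt_views, W):
    """Score nodes of view m against the global vector of every other view."""
    summaries = [readout(H) for H in H_views]
    total = 0.0
    for m, (H, H_c) in enumerate(zip(H_views, H_corrupt_views)):
        for m_other, s in enumerate(summaries):
            if m_other == m:
                continue  # the same-view term belongs to L^m, not to Loss
            total += sum(math.log(score(h, W, s)) for h in H)
            total += sum(math.log(1.0 - score(h, W, s)) for h in H_c)
    return total


def negate(H):
    return [[-x for x in row] for row in H]


W = [[1.0, 0.0], [0.0, 1.0]]
view_a = [[2.0, 2.0], [2.0, 2.0]]
view_b = [[1.0, 1.0], [1.0, 1.0]]
# Views whose representations point the same way score higher than views
# whose representations point in opposite directions.
agree = cross_view_objective([view_a, view_b], [negate(view_a), negate(view_b)], W)
clash = cross_view_objective([view_a, negate(view_b)], [negate(view_a), view_b], W)
```

Maximizing such a term pushes the encoders of different views toward representations that are informative about each other, which matches the stated aim of capturing cross-view association.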

To avoid overfitting of the model, training in this embodiment uses an early-stopping strategy: training stops when the loss function of the model has not decreased for 100 rounds.
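The early-stopping rule can be sketched as a small training loop. The patience of 100 rounds comes from the text; the callback `step`, the cap `max_epochs`, and the toy loss curve are hypothetical. The monitored value is treated as a quantity to minimize, for example the negative of the objective being maximized, which is an assumption about how the rule is applied.

```python
def train_with_early_stopping(step, max_epochs=10000, patience=100):
    """Run step(epoch) until the loss has not improved for `patience` rounds.

    step(epoch) performs one training round and returns the loss to minimize.
    Returns the best loss seen and the number of rounds actually run.
    """
    best = float("inf")
    stale = 0          # rounds since the last improvement
    epochs_run = 0
    for epoch in range(max_epochs):
        loss = step(epoch)
        epochs_run += 1
        if loss < best:
            best, stale = loss, 0
        else:
            stale += 1
            if stale >= patience:
                break
    return best, epochs_run


# Toy loss curve: decreases from 5.0 to 1.0 over five rounds, then plateaus.
def toy_step(epoch):
    return max(1.0, 5.0 - epoch)


best_loss, rounds = train_with_early_stopping(toy_step, patience=3)
```

With a patience of 3, the toy run improves for five rounds and stops after three further rounds without improvement, eight rounds in total.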

S103: mutual information-based node representation vector aggregation:

The importance of the same node differs across the networks of different views. To capture this difference, after the model optimization in step S102 yields the representation vectors of the nodes, the importance of the nodes in the different views is obtained through normalized mutual information, and the global vector representation is then obtained by weighted fusion.

Mutual information measures the amount of information shared between different variables. The mutual information $MI^{(m)}$ between each view network and the other view networks in the attribute heterogeneous information network is therefore calculated first, with the following formula:

where $h_i^{(m')}$ represents the node representation vector of node $v_i$ in the node representation matrix $H^{(m')}$ of the m'th view.

After the mutual information between each view network and the other views is obtained, the invention uses a normalization method to calculate the importance score $a^{(m)}$ of the mth view, which expresses the importance of that view network in generating the final node representation vectors. The calculation formula is as follows:

Finally, the final representation vector matrix H of the nodes is generated by weighted-average aggregation:

$$H=\sum_{m=1}^{M} a^{(m)} H^{(m)}$$
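The aggregation step can be sketched as follows. The mutual information values are taken as given inputs, since their formula is not reproduced here. The sketch assumes that the normalization divides each value by the sum over all views and that the values are positive; a softmax would be another plausible choice. Function names and numbers are hypothetical.

```python
def importance_scores(mi_values):
    """Assumed normalization: each view's MI divided by the sum over views."""
    total = sum(mi_values)
    return [v / total for v in mi_values]


def aggregate_views(H_views, scores):
    """Weighted sum of the per-view node representation matrices."""
    n = len(H_views[0])        # number of nodes
    d = len(H_views[0][0])     # embedding dimension
    return [[sum(a * H[i][k] for a, H in zip(scores, H_views))
             for k in range(d)]
            for i in range(n)]


# Toy example with M = 2 views, N = 2 nodes, and 2-dimensional embeddings.
mi = [3.0, 1.0]                              # hypothetical MI^(1), MI^(2)
H1 = [[1.0, 0.0], [0.0, 2.0]]
H2 = [[0.0, 4.0], [4.0, 0.0]]

a = importance_scores(mi)                    # weights sum to 1
H_final = aggregate_views([H1, H2], a)
```

Because the weights sum to one, each row of the result is a convex combination of that node's per-view vectors, with the view carrying more mutual information contributing more.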

In order to better illustrate the effectiveness of the invention, the invention is experimentally verified by adopting a specific attribute heterogeneous information network.

First, the movie data network IMDB is used, in which the nodes include actors, directors, and movies, and the edges include edges between actors and movies and edges between directors and movies. 2 different views are extracted through the meta-paths movie-actor-movie and movie-director-movie. The view network extracted by movie-actor-movie has 66428 edges, the view network extracted by movie-director-movie has 13788 edges, the number of nodes is 3550, and the node attributes represent the outline of each movie.

The invention obtains the node representation vectors of the movie data network and then performs node classification and node clustering tasks based on them. Network representation learning methods based on five existing models are used as comparison methods: GCN (Graph Convolutional Network), GAE (Graph Auto-Encoder), DGI (Deep Graph Infomax), GMI (Graph Representation Learning via Graphical Mutual Information Maximization), and HAN (Heterogeneous Graph Attention Network).

The node classification task classifies the input nodes from their node representation vectors using logistic regression and is evaluated with two F1 scores (Macro-F1 and Micro-F1). The node clustering task clusters the nodes using K-means and is evaluated with accuracy (ACC) and normalized mutual information (NMI). Table 1 compares the performance of the present invention and the five comparison methods on the movie data network.

TABLE 1

To illustrate the performance of the present invention on other data networks, two paper citation networks are also used for experimental verification in this embodiment: the ACM dataset and the DBLP dataset. The nodes in a paper citation network comprise papers, authors, and topics; the edges comprise edges between papers and authors, edges between papers and topics, and edges between papers, and different edges represent different semantic relationships. For the ACM dataset, 2 different views are extracted through the meta-paths paper-author-paper and paper-topic-paper, where the view network extracted by paper-author-paper has 29281 edges, the view network extracted by paper-topic-paper has 2210761 edges, the number of nodes is 3025, and the node attributes represent the abstract of each paper. For the DBLP dataset, 3 different views are extracted through the meta-paths paper-author-paper, paper-paper, and paper-author-institution-author-paper, where the view network extracted by paper-author-paper has 14238 edges, the view network extracted by paper-paper has 90145 edges, paper-topic-paper has 2210761 edges, paper-author-institution-author-paper has 57137515 edges, the number of nodes is 3025, and the node attributes represent the abstract of each paper.

Table 2 is a comparative table of the performance of the present invention and five comparative methods for the ACM dataset.

TABLE 2

Table 3 is a comparative table of the performance of the present invention and five comparative methods for the DBLP dataset.

TABLE 3

According to the performance comparison tables for the three data networks, the method achieves excellent results in unsupervised graph representation learning: in both classification and clustering tasks, its performance exceeds that of the most advanced comparison method by 3 percentage points on average.

While the foregoing describes illustrative embodiments of the present invention to help those skilled in the art understand it, it should be understood that the present invention is not limited to the scope of these embodiments; various changes remain protected insofar as they fall within the spirit and scope of the present invention as defined by the appended claims.

Claims

1. A mutual information-based unsupervised network representation learning method for an attribute heterogeneous information network, characterized by comprising the following steps:

S1: constructing a film data network as the attribute heterogeneous information network, wherein the nodes comprise actors, directors, and films, and the edges comprise edges between actors and films and edges between directors and films; extracting a multi-view network $G=\{V, A^{(1)}, A^{(2)}, \dots, A^{(M)}, X\}$ containing M views from the original attribute heterogeneous information network according to preset node types and M meta-paths, wherein $V=\{v_1, v_2, \dots, v_N\}$ represents the node set, $v_i$ represents the ith node, $i=1,2,\dots,N$, and N represents the number of nodes; $A^{(m)}$ represents the N-order topology matrix of the view network extracted according to the mth meta-path, $m=1,2,\dots,M$; if there is an edge between node $v_i$ and node $v_j$ under the mth meta-path, the corresponding element of the topology matrix is $A^{(m)}_{ij}=1$, otherwise $A^{(m)}_{ij}=0$; X represents the node feature matrix of size N×K, whose ith row is the K-dimensional feature vector of node $v_i$;

wherein $h_i^{(m')}$ represents the node representation vector of node $v_i$ in the node representation matrix $H^{(m')}$ of the m'th view;

Calculate an importance score a ^(m) for the mth view:

2. The unsupervised network representation learning method for an attribute heterogeneous information network according to claim 1, wherein the encoder in step S2 adopts a single graph convolutional layer of a graph convolutional network, and the expression of the node representation matrix $H^{(m)}$ is as follows:

wherein σ represents the activation function; $I_N$ represents the identity matrix of order N; w represents a preset weight; the degree matrix in the expression is that of the nodes in the attribute heterogeneous information network; and $W^{(m)}$ represents the weight matrix of the graph convolutional layer.

3. The unsupervised network representation learning method for an attribute heterogeneous information network according to claim 1, wherein in step S2, after each encoder $Q^{m}$ is trained by the unsupervised learning method based on mutual information, multi-view network collaborative training is further performed by maximizing a loss function Loss, the calculation formula of which is as follows: