CN109800504B - Heterogeneous information network embedding method and device - Google Patents

Heterogeneous information network embedding method and device

Info

Publication number
CN109800504B
Authority
CN
China
Prior art keywords
node
hyperbolic space
vector
embedding
similarity
Prior art date
Legal status
Active
Application number
CN201910054117.0A
Other languages
Chinese (zh)
Other versions
CN109800504A (en)
Inventor
石川 (Chuan Shi)
王啸 (Xiao Wang)
张依丁 (Yiding Zhang)
Current Assignee
Beijing University of Posts and Telecommunications
Original Assignee
Beijing University of Posts and Telecommunications
Priority date
Filing date
Publication date
Application filed by Beijing University of Posts and Telecommunications
Priority to CN201910054117.0A
Publication of CN109800504A
Application granted
Publication of CN109800504B

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The embodiment of the invention provides an embedding method and device for a heterogeneous information network. The method may include: determining a representation vector of each node of the heterogeneous information network to be embedded; inputting the determined representation vectors into a preset hyperbolic space embedding model; and performing exponential mapping in a hyperbolic space on the representation vectors based on the hyperbolic space embedding model, so as to obtain an embedding vector of each node in the hyperbolic space. By applying the embodiment of the invention, because the hyperbolic space and the heterogeneous information network share the same power-law distribution characteristic, the structure and semantic information of the heterogeneous information network can be reflected more faithfully in the hyperbolic space and thus preserved more completely. Therefore, the embedding accuracy can be improved.

Description

Heterogeneous information network embedding method and device
Technical Field
The present invention relates to the field of network embedding, and in particular, to a method and an apparatus for embedding a heterogeneous information network.
Background
Heterogeneous information network embedding means that the nodes of a heterogeneous information network and the relationships between them are projected into a metric space, where the relationships between nodes are expressed as vectors. Generally, the structure of the original heterogeneous information network and the integrity of its semantic information should be preserved as much as possible; that is, the relationships between the nodes of the original heterogeneous information network should be kept unchanged as much as possible.
Currently, most schemes embed heterogeneous information networks in Euclidean space. However, the inventors found in research that, since the node distribution in a heterogeneous information network follows a power-law distribution while Euclidean space does not have the power-law distribution characteristic, the structure of the heterogeneous information network and the semantic information therein are not completely preserved; that is, the embedding accuracy is not high enough when a heterogeneous information network is embedded into Euclidean space.
Disclosure of Invention
The embodiment of the invention aims to provide an embedding method and device of a heterogeneous information network, so as to improve the embedding accuracy.
In order to achieve the above object, an embodiment of the present invention discloses an embedding method for a heterogeneous information network, where the method includes:
determining a representation vector of each node to be embedded into a heterogeneous information network;
inputting the determined expression vector into a preset hyperbolic space embedding model;
and performing exponential mapping in a hyperbolic space on the expression vector based on the hyperbolic space embedding model to obtain an embedding vector of each node in the hyperbolic space.
Optionally, the step of determining a representation vector of each node to be embedded in the heterogeneous information network includes:
and randomly giving a representation vector of each node to be embedded into the heterogeneous information network.
Optionally, before the step of performing exponential mapping in a hyperbolic space on the representation vector based on the hyperbolic space embedding model to obtain an embedding vector of each node in the hyperbolic space, the method further includes:
determining an incidence relation between each node to be embedded in a heterogeneous information network;
for each node, determining a node with the distance to the node within a first preset range as a neighbor node based on the determined incidence relation, and acquiring a representation vector of the neighbor node;
calculating the distance between the representation vector of the node and the representation vector of the neighbor node in the hyperbolic space as a first distance;
calculating the similarity between the node and the neighbor node according to the first distance to serve as a first similarity;
calculating the gradient of the first similarity to the expression vector of the node as the first gradient of the node;
the step of performing exponential mapping in a hyperbolic space on the expression vector based on the hyperbolic space embedding model to obtain an embedding vector of each node in the hyperbolic space includes:
and aiming at each node, performing exponential mapping in a hyperbolic space on the expression vector of the node according to the expression vector of the node and the first gradient of the node based on the hyperbolic space embedding model to obtain the embedding vector of the node in the hyperbolic space.
Optionally, the step of determining an association relationship between each node to be embedded in the heterogeneous information network includes:
generating a meta-path of the heterogeneous information network to be embedded according to the type of each node in the heterogeneous information network to be embedded;
for each node to be embedded into the heterogeneous information network, determining a node with the distance to the node within a second preset range according to the meta-path, and using the node as a related node of the node; the second preset range is larger than the first preset range;
a sequence of relationships is generated that includes the node and the determined associated node.
Optionally, before the step of performing exponential mapping in a hyperbolic space on the representation vector based on the hyperbolic space embedding model to obtain an embedding vector of each node in the hyperbolic space, the method further includes:
determining an incidence relation between each node to be embedded in a heterogeneous information network;
for each node, determining a node with a distance to the node within a first preset range as a neighbor node based on the incidence relation between each node, and acquiring a representation vector of the neighbor node;
calculating the distance between the representation vector of the node and the representation vector of the neighbor node in the hyperbolic space as a first distance;
calculating the similarity between the node and the neighbor node according to the first distance to serve as a first similarity;
calculating the gradient of the first similarity to the expression vector of the node as the first gradient of the node;
for each node, determining a preset number of nodes which have no incidence relation with the node as negative sample nodes, and acquiring a representation vector of the negative sample nodes;
calculating the distance between the representation vector of the node and the representation vector of the negative sample node in the hyperbolic space as a second distance;
according to the second distance, calculating the similarity between the node and the negative sample node as a second similarity;
calculating the sum of the first similarity and the second similarity to obtain a similarity sum;
calculating the gradient of the similarity sum to the representation vector of the node as a second gradient of the node;
the step of performing exponential mapping in a hyperbolic space on the expression vector based on the hyperbolic space embedding model to obtain an embedding vector of each node in the hyperbolic space includes:
and aiming at each node, performing exponential mapping in a hyperbolic space on the expression vector of the node according to the expression vector of the node and the second gradient of the node based on the hyperbolic space embedding model to obtain the embedding vector of the node in the hyperbolic space.
Optionally, after the step of calculating the sum of the first similarity and the second similarity to obtain a total similarity, the method further includes:
calculating the gradient of the similarity sum to the representation vector of the neighbor node as the gradient of the neighbor node;
the step of performing exponential mapping in a hyperbolic space on the expression vector based on the hyperbolic space embedding model to obtain an embedding vector of each node in the hyperbolic space includes:
and aiming at each node, performing exponential mapping in the hyperbolic space on the expression vector of the neighbor node of the node according to the expression vector of the neighbor node of the node and the gradient of the neighbor node based on the hyperbolic space embedding model to obtain the embedding vector of the neighbor node of the node in the hyperbolic space.
Optionally, after the step of calculating the sum of the first similarity and the second similarity to obtain a total similarity, the method further includes:
calculating the gradient of the representation vector of the similarity sum to the negative sample node as the gradient of the negative sample node;
the step of performing exponential mapping in a hyperbolic space on the expression vector based on the hyperbolic space embedding model to obtain an embedding vector of each node in the hyperbolic space includes:
and aiming at each node, performing exponential mapping in a hyperbolic space on the expression vector of the negative sample node of the node according to the expression vector of the negative sample node of the node and the gradient of the negative sample node based on the hyperbolic space embedding model to obtain the embedding vector of the negative sample node of the node in the hyperbolic space.
Optionally, the step of performing exponential mapping in a hyperbolic space on the representation vector based on the hyperbolic space embedding model to obtain an embedding vector of each node in the hyperbolic space includes:
performing exponential mapping in a hyperbolic space on the expression vector based on the hyperbolic space embedding model to obtain a mapping result;
judging whether the mapping times reach a preset value or not;
if yes, determining the mapping result as an embedded vector of each node in the hyperbolic space;
and if not, updating the expression vector based on the mapping result, and returning to execute the step of performing exponential mapping in the hyperbolic space on the expression vector based on the hyperbolic space embedded model.
In order to achieve the above object, an embodiment of the present invention further discloses an embedding apparatus for a heterogeneous information network, where the apparatus includes:
the determining module is used for determining the expression vector of each node to be embedded into the heterogeneous information network;
the input module is used for inputting the determined expression vector into a preset hyperbolic space embedding model;
and the mapping module is used for performing exponential mapping in a hyperbolic space on the expression vector based on the hyperbolic space embedding model to obtain an embedding vector of each node in the hyperbolic space.
Optionally, the determining module is specifically configured to randomly give a representation vector of each node to be embedded in the heterogeneous information network.
According to the embedding method and device for a heterogeneous information network provided by the embodiments of the invention, the representation vector of each node in the heterogeneous information network is input into a preset hyperbolic space embedding model, and exponential mapping in a hyperbolic space is performed on the representation vectors based on the hyperbolic space embedding model, so as to obtain the embedding vector of each node in the hyperbolic space. Because data with the power-law distribution characteristic can be modeled more accurately in hyperbolic space, and the data in a heterogeneous information network has the power-law distribution characteristic, embedding the heterogeneous information network in hyperbolic space reflects its structure and semantic information more faithfully, so that the structure and semantic information of the heterogeneous information network are preserved more completely. Therefore, the embedding accuracy can be improved.
Of course, it is not necessary for any product or method of practicing the invention to achieve all of the above-described advantages at the same time.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present invention; for those skilled in the art, other drawings can be obtained from these drawings without creative effort.
FIG. 1 is a flow chart of a heterogeneous information network embedding method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a heterogeneous information network having three node types;
fig. 3 is a schematic diagram illustrating a process of updating a representative vector of a node according to an embodiment of the present invention;
fig. 4 is a schematic structural diagram of an embedding apparatus of a heterogeneous information network according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention. Obviously, the described embodiments are only a part of the embodiments of the present invention, not all of them. All other embodiments obtained by a person skilled in the art from the embodiments given herein without creative effort shall fall within the protection scope of the present invention.
In order to solve the problem of the prior art, embodiments of the present invention provide an embedding method and apparatus for a heterogeneous information network. The method and the device can be applied to various electronic devices, and are not limited specifically. First, an embedding method of a heterogeneous information network according to an embodiment of the present invention is described below.
As shown in fig. 1, fig. 1 is a schematic flowchart of an embedding method of a heterogeneous information network according to an embodiment of the present invention, and the method may include:
s101: determining a representation vector of each node to be embedded into a heterogeneous information network;
the heterogeneous information network to be embedded is the heterogeneous information network which needs to be embedded in the metric space. The heterogeneous information network to be embedded may be composed of various types of nodes and edges. For example, the to-be-embedded can be represented as
Figure BDA0001951840830000061
Where V represents a set of nodes in the network, E represents a set of nodes and edges between nodes, T represents a set of types of nodes or edges,
Figure BDA0001951840830000062
a mapping function representing a node to a node type,
Figure BDA0001951840830000063
TVrepresenting a set of node types, psi representing a mapping of edges to edge typesFunction, ψ (v): E → TE,TERepresents a collection of edge types, | TV|+|TE|>2。
For example, assume that a heterogeneous information network with three node types is constructed from data of DBLP (Digital Bibliography & Library Project, an academic paper data set). Referring to fig. 2, the node types in the network include: author (A), paper (P) and conference (V). The edge types in the network, i.e. the types of associations between nodes, include A-P, which can be a writing/written-by relation, and P-V, which can be a publishing/published-in relation. In the figure, a_1 ~ a_4 represent nodes of type author, p_1 ~ p_3 represent nodes of type paper, v_1 ~ v_2 represent nodes of type conference, and the connecting lines between nodes represent the association relationships between them.
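As an illustration of the fig. 2 example, the sketch below encodes author, paper and conference nodes with typed, undirected edges. The node names follow the figure, but the exact edge set here is an assumption made for illustration only (a Python sketch, not part of the patent):

```python
# Toy heterogeneous network in the style of fig. 2: author (A),
# paper (P) and conference (V) nodes; A-P ("writes") and
# P-V ("published in") edges. The concrete edges are illustrative.
node_type = {
    "a1": "A", "a2": "A", "a3": "A", "a4": "A",
    "p1": "P", "p2": "P", "p3": "P",
    "v1": "V", "v2": "V",
}
edges = [
    ("a1", "p1"), ("a2", "p1"), ("a2", "p2"), ("a3", "p2"),
    ("a3", "p3"), ("a4", "p3"), ("p1", "v1"), ("p2", "v1"), ("p3", "v2"),
]
# Undirected adjacency list: each edge links both of its endpoints.
adj = {n: [] for n in node_type}
for u, w in edges:
    adj[u].append(w)
    adj[w].append(u)
```

With this structure, the neighbors of paper p1 are the two authors who wrote it and the conference that published it.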
In one embodiment, S101 may include: randomly giving a representation vector to each node of the heterogeneous information network to be embedded. Thus, each node corresponds to a representation vector θ_i, where i may be a node number, i = 1, 2, …, n.
S102: inputting the determined expression vector into a preset hyperbolic space embedding model;
Continuing with the above embodiment, the randomly given representation vector of each node may be input into the preset hyperbolic space embedding model. The preset hyperbolic space embedding model may be an HHNE (Hyperbolic Heterogeneous Information Network Embedding) model, which represents the nodes of the heterogeneous information network as embedding vectors while keeping the structure of the original heterogeneous information network and the semantic information therein as complete as possible.
S103: and performing exponential mapping in a hyperbolic space on the representation vector based on the hyperbolic space embedding model to obtain an embedding vector of each node in the hyperbolic space.
In an embodiment, before S103, the method may further include: determining an incidence relation between each node to be embedded in a heterogeneous information network; for each node, determining a node with a distance to the node within a first preset range as a neighbor node based on the incidence relation between each node, and acquiring a representation vector of the neighbor node; calculating the distance between the representation vector of the node and the representation vector of the neighbor node in the hyperbolic space as a first distance; calculating the similarity between the node and the neighbor node according to the first distance to serve as a first similarity; and calculating the gradient of the first similarity to the expression vector of the node as the first gradient of the node.
In one case, the step of determining the association relationships between the nodes of the heterogeneous information network to be embedded may include: generating a meta-path of the heterogeneous information network to be embedded according to the types of the nodes in the heterogeneous information network to be embedded; for each node of the heterogeneous information network to be embedded, determining the nodes whose distance from the node is within a second preset range according to the meta-path, as associated nodes of the node, the second preset range being larger than the first preset range; and generating a relationship sequence including the node and the determined associated nodes.
For example, a meta-path is a relationship sequence formed by connecting node types through edge types, and can be represented as

P: V_1 −R_1→ V_2 −R_2→ … −R_{n−1}→ V_n

where P represents the meta-path, V_1, V_2, …, V_n represent node types in the node-type set T_V, and R_1, R_2, …, R_{n−1} represent the types of the edges e_1, e_2, …, e_{n−1}. The meta-paths generated from the heterogeneous information network shown in fig. 2 may be APA, APVPA, etc.; one meta-path instance of the meta-path APA may be a_1 → p_1 → a_2.
For each node, the nodes whose distance from the node is within the second preset range are determined according to the meta-path as associated nodes of the node, and a relationship sequence including the node and the determined associated nodes is generated. For example, referring to fig. 2, the meta-path APA is selected, node a_1 is used as the initial node, and the second preset range is set to 4 edges; the relationship sequence formed by the nodes a_1, p_1, a_2, p_2 and a_3 passed through by the random walk according to the meta-path is a_1 p_1 a_2 p_2 a_3. In one case, the random walk is realized by calculating the transition probability of the nodes based on the meta-path P, where the transition probability is calculated by formula one:

p(v^{i+1} | v_t^i, P) = 1 / |N_{t+1}(v_t^i)|, if (v^{i+1}, v_t^i) ∈ E and φ(v^{i+1}) = t+1; otherwise 0 (formula one)

where v_t^i represents the i-th node in the node set V, whose type is t; N_{t+1}(v_t^i) represents the set of neighbor nodes of v_t^i whose node type is t+1; φ(v^{i+1}) represents the node type of the (i+1)-th node; and P is the meta-path.
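The meta-path-guided random walk described above can be sketched as follows. `metapath_walk` is a hypothetical helper name; the walk realizes the uniform transition probability of formula one by choosing uniformly at random among neighbors whose type matches the next type required by the meta-path:

```python
import random

def metapath_walk(adj, node_type, start, metapath, length, rng):
    """Meta-path-guided random walk (formula one): from the current node,
    step uniformly at random to a neighbor whose type matches the next
    type in the (symmetric) meta-path pattern, e.g. "APA" -> A,P,A,P,...
    Stops early if no type-matching neighbor exists."""
    walk = [start]
    i = 0
    while len(walk) < length:
        i += 1
        want = metapath[i % (len(metapath) - 1)]  # cycle through the pattern
        cands = [n for n in adj[walk[-1]] if node_type[n] == want]
        if not cands:
            break
        walk.append(rng.choice(cands))
    return walk
```

A seeded `random.Random` makes the walk reproducible, which is convenient when debugging the generated relationship sequences.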
For each node, based on the association relationships between the nodes, the nodes whose distance from the node is within the first preset range are determined as neighbor nodes, and the representation vectors of the neighbor nodes are acquired. For example, the first preset range is set to 3 edges; for node a_1, based on the relationship sequence a_1 p_1 a_2 p_2 a_3 generated by the random walk, the nodes p_1, a_2 and p_2 are determined to be neighbor nodes of node a_1.
The distance between the representation vector of the node and the representation vector of the neighbor node in the hyperbolic space is calculated as the first distance. For example, each node is uniformly denoted v and the neighbor nodes of each node are uniformly denoted c_t; then the representation vector corresponding to each node may be θ_v, and the representation vector corresponding to a neighbor node may be θ_{c_t}. The first distance may be calculated by formula two:

d(θ_v, θ_{c_t}) = arcosh(1 + 2·||θ_v − θ_{c_t}||² / ((1 − ||θ_v||²)(1 − ||θ_{c_t}||²))) (formula two)

where θ_v represents the representation vector corresponding to node v in the heterogeneous information network, θ_{c_t} represents the representation vector corresponding to the neighbor node c_t of the node, and d(θ_v, θ_{c_t}) represents the first distance, in the hyperbolic space, between the representation vector θ_v of node v and the representation vector θ_{c_t} of its neighbor node c_t.
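The first distance can be computed as below, assuming the standard Poincaré-ball form of the hyperbolic distance (a minimal numpy sketch; `poincare_distance` is an illustrative name):

```python
import numpy as np

def poincare_distance(u, v):
    """Formula two: geodesic distance between two points of the
    Poincaré ball model of hyperbolic space (||u||, ||v|| < 1)."""
    uu = np.dot(u, u)
    vv = np.dot(v, v)
    duv = np.dot(u - v, u - v)
    return np.arccosh(1.0 + 2.0 * duv / ((1.0 - uu) * (1.0 - vv)))
```

The distance is symmetric and grows without bound as either point approaches the unit sphere, which is what gives the ball its power-law-friendly, tree-like geometry.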
According to the first distance, the similarity between the node and the neighbor node is calculated as the first similarity, where the first similarity may be calculated by formula three:

p(c_t | v; Θ) = σ(−d(θ_v, θ_{c_t})) (formula three)

where Θ represents the set of representation vectors of the nodes and σ(x) = 1/(1 + e^{−x}) is the sigmoid function. After the first similarity is obtained through formula three, the objective function of formula four,

L(Θ) = Σ_{v∈V} Σ_{c_t∈N(v)} log p(c_t | v; Θ) (formula four)

where N(v) represents the neighbor nodes of node v, is used to maximize the first similarity. Specifically, the gradient of L(Θ) with respect to the representation vector θ_v of the node can be calculated as the first gradient of the node, so as to realize the maximization of the first similarity. That is, the gradient in Euclidean space, ∇_E = ∂L(Θ)/∂θ_v, is first calculated; then, according to formula five,

∇_R = ((1 − ||θ_v||²)² / 4) · ∇_E (formula five)

the Riemannian gradient ∇_R is obtained as the first gradient of the node, where (1 − ||θ_v||²)²/4 in formula five represents the coefficient of the gradient transformation.
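The Euclidean-to-Riemannian gradient conversion of formula five amounts to a single rescaling. The sketch below assumes the Poincaré-ball conformal factor (1 − ||θ||²)²/4; `riemannian_gradient` and `sigmoid` are illustrative helper names:

```python
import numpy as np

def sigmoid(x):
    """sigma(x) = 1 / (1 + exp(-x)), used by formula three."""
    return 1.0 / (1.0 + np.exp(-x))

def riemannian_gradient(theta, euclidean_grad):
    """Formula five: rescale the Euclidean gradient by the conformal
    factor (1 - ||theta||^2)^2 / 4 of the Poincaré ball metric."""
    factor = (1.0 - np.dot(theta, theta)) ** 2 / 4.0
    return factor * euclidean_grad
```

Note that the factor shrinks toward zero near the boundary of the ball, so updates naturally slow down for points embedded far from the origin.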
after the first gradient of each node is obtained, for each node, performing exponential mapping in a hyperbolic space on the representation vector of the node according to the representation vector of the node and the first gradient of the node, and obtaining an embedded vector of the node in the hyperbolic space. The exponential mapping may be implemented by the following equation six:
Figure BDA0001951840830000099
Figure BDA00019518408300000910
wherein,
Figure BDA00019518408300000911
eta represents the coefficient of the exponential mapping,
Figure BDA00019518408300000912
obtaining a representative vector theta of the nodesvResults of performing exponential mapping
Figure BDA00019518408300000913
After that, use
Figure BDA00019518408300000914
Updating a representation vector θ of a nodevAnd obtaining an embedded vector of the node v in the hyperbolic space. Since an embedded vector in the hyperbolic space is obtained for each node, the embedded vector of each node in the hyperbolic space in the heterogeneous information network is obtained substantially.
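A sketch of the exponential-mapping update, assuming the Möbius-addition form of the Poincaré-ball exponential map (a standard equivalent parametrization, not necessarily the exact expression used in the patent); `exp_map` and `mobius_add` are illustrative names:

```python
import numpy as np

def mobius_add(x, y):
    """Möbius addition in the Poincaré ball (curvature -1)."""
    xy = np.dot(x, y)
    xx = np.dot(x, x)
    yy = np.dot(y, y)
    num = (1.0 + 2.0 * xy + yy) * x + (1.0 - xx) * y
    return num / (1.0 + 2.0 * xy + xx * yy)

def exp_map(theta, s):
    """Formula six (Möbius-addition form): map the scaled gradient s,
    a tangent vector at theta, back onto the Poincaré ball."""
    norm_s = np.linalg.norm(s)
    if norm_s == 0.0:
        return theta
    lam = 2.0 / (1.0 - np.dot(theta, theta))
    return mobius_add(theta, np.tanh(lam * norm_s / 2.0) * s / norm_s)
```

Unlike a plain Euclidean step theta + s, this update can never leave the unit ball, so no projection step is needed.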
In another embodiment, before S103, a first similarity is obtained according to the above embodiment; for each node, determining a preset number of nodes which have no incidence relation with the node as negative sample nodes, and obtaining a representation vector of the negative sample nodes; calculating the distance between the representation vector of the node and the representation vector of the negative sample node in the hyperbolic space as a second distance; according to the second distance, calculating the similarity between the node and the negative sample node as a second similarity; calculating the sum of the first similarity and the second similarity to obtain a similarity sum; the gradient of the similarity sum to the representation vector of the node is calculated as the second gradient of the node.
Specifically, a node determined to have no association relationship with the node may be a node that shares no edge with the node.
In one case, the representation vector of a negative sample node may be obtained M times, and M second similarities may be obtained. If the negative sample node of each node is denoted n^m, then the representation vector of the negative sample node may be θ_{n^m}, and the second distance may be d(θ_{n^m}, θ_v), where n^m is the negative sample node selected the m-th time and θ_{n^m} is the representation vector corresponding to the negative sample node n^m. The second similarity may be σ(d(θ_{n^m}, θ_v)). The M second similarities and the first similarity are then summed to obtain the similarity sum, and the similarity sum is maximized through formula seven, so as to maximize the similarity between a node in the network and its neighbor nodes and minimize the similarity between the node and its negative sample nodes:

L(Θ) = log σ(−d(θ_{c_t}, θ_v)) + Σ_{m=1}^{M} E_{n^m∼P(n)} [log σ(d(θ_{n^m}, θ_v))] (formula seven)

where L(Θ) represents the sum of the first similarity and the second similarities, M represents the total number of times negative sample nodes are selected, and P(n) represents a preset distribution over nodes.
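The similarity sum of formula seven can be sketched as below for one (node, neighbor) pair and a list of already-sampled negative nodes; the function name and the sampled (rather than expected-value) form of the negative term are illustrative assumptions:

```python
import numpy as np

def poincare_distance(u, v):
    """Formula two: Poincaré-ball distance."""
    return np.arccosh(1.0 + 2.0 * np.dot(u - v, u - v)
                      / ((1.0 - np.dot(u, u)) * (1.0 - np.dot(v, v))))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def similarity_sum(theta_v, theta_ct, theta_negs):
    """Formula seven (sampled form): log-similarity of the positive
    (node, neighbor) pair plus the log-similarities of the sampled
    negative nodes; larger is better."""
    pos = np.log(sigmoid(-poincare_distance(theta_ct, theta_v)))
    neg = sum(np.log(sigmoid(poincare_distance(n, theta_v))) for n in theta_negs)
    return pos + neg
```

Pushing a negative node farther from the node increases the objective, which is exactly the behavior the gradient updates below aim for.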
Through formula seven, the maximization of the similarity sum can be achieved by calculating the gradient of the similarity sum with respect to the representation vector of the node.
Specifically, the gradient in Euclidean space, ∇_E = ∂L(Θ)/∂θ_v, is first calculated through formula eight:

∂L(Θ)/∂θ_v = Σ_{m=0}^{M} [σ(−d(θ_{u^m}, θ_v)) − I_{c_t}[u^m]] · ∂d(θ_{u^m}, θ_v)/∂θ_v (formula eight)

where u^0 = c_t and u^m = n^m for m = 1, 2, …, M; I_{c_t}[u^m] is an indicator function for judging whether u^m is equal to c_t (u^m may be c_t or n^m): if u^m = c_t then I_{c_t}[u^m] = 1, and if u^m = n^m then I_{c_t}[u^m] = 0. The m = 0 term corresponds to the first similarity and the m ≥ 1 terms to the second similarities. Then, according to formula five and formula eight, the Riemannian gradient ∇_R is obtained as the second gradient of node v.
After the second gradient of each node is obtained, for each node, exponential mapping in the hyperbolic space is performed on the representation vector of the node according to the representation vector of the node and the second gradient of the node, so as to obtain the embedding vector of the node in the hyperbolic space. The exponential mapping may likewise be implemented by formula six, with s = η∇_R now taking the second gradient of the node: after the result exp_{θ_v}(η∇_R) of performing the exponential mapping on the representation vector θ_v of the node is obtained, θ_v is updated with exp_{θ_v}(η∇_R), and the embedding vector of node v in the hyperbolic space is obtained. Since an embedding vector in the hyperbolic space is obtained for each node, the embedding vectors of all nodes of the heterogeneous information network in the hyperbolic space can be obtained.
In one implementation, after the sum of the first similarity and the second similarity is calculated to obtain the similarity sum, the gradient of the similarity sum with respect to the representation vector of the neighbor node is calculated as the gradient of the neighbor node; and for each node, based on the hyperbolic space embedding model, exponential mapping in the hyperbolic space is performed on the representation vector of the neighbor node of the node according to the representation vector of the neighbor node and the gradient of the neighbor node, so as to obtain the embedding vector of the neighbor node of the node in the hyperbolic space.
Specifically, the gradient of the similarity sum with respect to the representation vector of the neighbor node in Euclidean space is calculated through formula nine:

∂L(Θ)/∂θ_{c_t} = −σ(d(θ_{c_t}, θ_v)) · ∂d(θ_{c_t}, θ_v)/∂θ_{c_t} (formula nine)

since, of the terms of the similarity sum in formula seven, only the first similarity depends on the representation vector θ_{c_t} of the neighbor node. Then the Riemannian gradient ∇_R is obtained according to formula five and formula nine as the gradient of the neighbor node of node v. According to formula six, with s = η∇_R, exponential mapping in the hyperbolic space is performed on the representation vector of the neighbor node of each node, so as to obtain the embedding vector of the neighbor node of the node in the hyperbolic space, which serves as the embedding vector of that node of the heterogeneous information network in the hyperbolic space. Since the neighbor node of each node obtains an embedding vector in the hyperbolic space, the embedding vector of each node of the heterogeneous information network in the hyperbolic space is substantially obtained.
In another embodiment, the sum of the first similarity and the second similarity is calculated; after the similarity sum is obtained, the gradient of the similarity sum with respect to the representation vector of the negative sample node is calculated as the gradient of the negative sample node. Then, for each node, based on the hyperbolic space embedding model, exponential mapping in the hyperbolic space is performed on the representation vector of the node's negative sample node, according to that representation vector and the gradient of the negative sample node, to obtain the embedding vector of the negative sample node of the node in the hyperbolic space.
The gradient of the similarity sum with respect to the representation vector of the negative sample node in Euclidean space can be calculated by formula eight:

[formula images: the Euclidean gradient of the similarity sum with respect to the representation vector of the negative sample node, together with its expanded form]
The Riemannian gradient is then obtained according to formula five and formula eight, as the gradient of the negative sample node. According to formula six, this gradient is substituted in, and, for each node, exponential mapping in the hyperbolic space is performed on the representation vector of the node's negative sample node to obtain the embedding vector of that negative sample node in the hyperbolic space, which serves as an embedding vector of a node of the heterogeneous information network in the hyperbolic space. Since the negative sample node of each node obtains an embedding vector in the hyperbolic space, the embedding vectors of each node of the heterogeneous information network in the hyperbolic space are in substance obtained.
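On the Poincaré ball, the Riemannian gradient referred to above is commonly obtained from the Euclidean gradient by a conformal rescaling. A minimal sketch follows, assuming the standard rescaling factor (1−‖θ‖²)²/4 for formula five (the patent's exact formula is not reproduced in this section) and a projected retraction step as a simpler stand-in for the full exponential map; all names and the learning rate are illustrative.

```python
import numpy as np

def riemannian_grad(theta, euclidean_grad):
    """Conformal rescaling of the Euclidean gradient on the Poincare ball."""
    factor = ((1.0 - float(np.dot(theta, theta))) ** 2) / 4.0
    return factor * euclidean_grad

def project(theta, eps=1e-5):
    """Keep the updated point strictly inside the unit ball."""
    n = float(np.linalg.norm(theta))
    return theta if n < 1.0 - eps else theta / n * (1.0 - eps)

def ascent_step(theta, euclidean_grad, lr=0.1):
    """One gradient-ascent step on the similarity sum (retraction variant)."""
    return project(theta + lr * riemannian_grad(theta, euclidean_grad))

# At the origin the rescaling factor is 1/4, so the step is lr/4 along the gradient
theta_new = ascent_step(np.zeros(2), np.array([1.0, 0.0]))
```

A retraction like this is cheaper than the exact exponential map and is a common substitute in Riemannian optimization; the direction and scaling of the step are the same in both variants.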
In one embodiment, S103 may include: performing exponential mapping in the hyperbolic space on the representation vector based on the hyperbolic space embedding model to obtain a mapping result; judging whether the number of mappings has reached a preset value; if so, determining the mapping result as the embedding vector of each node in the hyperbolic space; and if not, updating the representation vector based on the mapping result and returning to the step of performing exponential mapping in the hyperbolic space on the representation vector based on the hyperbolic space embedding model.
Specifically, as shown in fig. 3, fig. 3 is a schematic diagram of the update flow of a node's representation vector. After S102, S301 may be performed: perform exponential mapping in the hyperbolic space on the representation vector based on the hyperbolic space embedding model to obtain a mapping result. Then S302 is executed: judge whether the number of mappings has reached a preset value. If it has, execute S304: determine the mapping result as the embedding vector of each node in the hyperbolic space. If it has not, execute S303: update the representation vector based on the mapping result and return to the step of performing exponential mapping in the hyperbolic space on the representation vector based on the hyperbolic space embedding model. The preset number of mappings may be 3, or a value greater than 3, so that the nodes in the heterogeneous information network are embedded more accurately; of course, the preset value may also be 1 or 2.
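The S301–S304 flow above amounts to a bounded update loop. A minimal sketch, with `update_once` standing in for one exponential mapping of the representation vector (the function names, learning rate, and gradient source are illustrative, not fixed by the patent):

```python
import numpy as np

def update_once(theta, grad, lr=0.01, eps=1e-5):
    """One mapping step: Riemannian rescaling plus a projected update."""
    scaled = ((1.0 - float(theta @ theta)) ** 2) / 4.0 * grad
    new = theta + lr * scaled
    n = float(np.linalg.norm(new))
    return new if n < 1.0 - eps else new / n * (1.0 - eps)

def embed_node(theta, grad_fn, preset_value=3):
    """Fig. 3 flow: map, check the mapping count, update, repeat."""
    for _ in range(preset_value):                    # S302: count against the preset value
        theta = update_once(theta, grad_fn(theta))   # S301/S303: map and update
    return theta                                     # S304: final embedding vector

theta = embed_node(np.zeros(2), lambda t: np.array([1.0, 0.0]))
```

With a constant gradient and three iterations, the point drifts a small, bounded distance from the origin while remaining inside the unit ball, which matches the intent of capping the number of mappings at a preset value.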
By applying the embodiment shown in fig. 1, the representation vector of each node in the heterogeneous information network is input into a preset hyperbolic space embedding model, and exponential mapping in a hyperbolic space is performed on the representation vector based on that model, so as to obtain the embedding vector of each node in the hyperbolic space. Because data with power-law distribution characteristics can be modeled more accurately in hyperbolic space, and the data of a heterogeneous information network has such characteristics, embedding the heterogeneous information network in hyperbolic space reflects its structural and semantic information more closely, so that this information is preserved more completely. Therefore, the embedding accuracy can be improved.
Taking the DBLP network and MovieLens (a movie information data set) as examples, the hyperbolic space embedding model HHNE of the present scheme is compared with a prior-art heterogeneous information network embedding model, such as metapath2vec. Specifically, after embedding with both schemes, the AUC (Area Under the ROC Curve) value of network reconstruction is used to evaluate the two schemes; the results are shown in Table 1:
table 1: AUC values for network reconstruction
[table images: AUC values of network reconstruction for HHNE and metapath2vec on the DBLP and MovieLens networks]
As can be seen from Table 1, when the spatial dimension is small, for example 2, the AUC value of HHNE is greater than that of metapath2vec; that is, HHNE maintains the original network structure more effectively and can therefore reconstruct the network better, especially for the P-V and M-D edges. Therefore, HHNE embeds the nodes of the heterogeneous information network more accurately and better preserves the structure of the network.
In one case, the superiority of HHNE can also be shown by link prediction. Link prediction infers unknown links in a network from an observed link structure, thereby verifying the generalization performance of a network embedding method. Specifically, 20% of the edges of each type are randomly deleted from the network while ensuring that the remaining network structure is still connected, forming a residual network; a preset number of node pairs without edge connections are randomly selected in the residual network as negative samples; the nodes in the residual network are then embedded, and the link prediction result is evaluated by the AUC value. The results are shown in Table 2.
Table 2: AUC value of link prediction
[table images: AUC values of link prediction for HHNE and metapath2vec]
As can be seen from Table 2, the AUC value of HHNE in every dimension is greater than that of metapath2vec, indicating that the generalization ability of HHNE is better. Therefore, HHNE can accurately embed different types of nodes in a heterogeneous information network and better preserve the structure of the network.
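The link-prediction evaluation described above (scoring candidate edges and computing an AUC over positive and negative pairs) can be sketched with the Poincaré distance as the link score. A minimal, self-contained sketch on toy embeddings; the node coordinates and the pairwise AUC formula below are illustrative, not the patent's data or exact evaluation code:

```python
import math

def poincare_dist(u, v):
    """Poincare-ball distance: acosh(1 + 2|u-v|^2 / ((1-|u|^2)(1-|v|^2)))."""
    du = sum(x * x for x in u)
    dv = sum(x * x for x in v)
    duv = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.acosh(1 + 2 * duv / ((1 - du) * (1 - dv)))

def auc(pos_scores, neg_scores):
    """Probability that a kept edge outscores a sampled non-edge (ties count half)."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))

# Toy embeddings: "a" and "b" are linked (close), "c" is an unlinked negative sample
emb = {"a": (0.10, 0.00), "b": (0.12, 0.01), "c": (-0.60, 0.50)}
pos = [-poincare_dist(emb["a"], emb["b"])]   # score of a held-out true edge
neg = [-poincare_dist(emb["a"], emb["c"])]   # score of a sampled non-edge
result = auc(pos, neg)                       # 1.0 on this toy example
```

Scoring edges by negative hyperbolic distance reflects the similarity used during training: the closer two embedding vectors are in the ball, the more likely the corresponding nodes are linked.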
Furthermore, corresponding to the above method embodiment, the present invention also provides an embedding apparatus for a heterogeneous information network. As shown in fig. 4, the apparatus may include:
a determining module 401, configured to determine a representation vector of each node to be embedded in the heterogeneous information network;
an input module 402, configured to input the determined representation vector into a preset hyperbolic space embedding model;
a mapping module 403, configured to perform exponential mapping in a hyperbolic space on the representation vector based on the hyperbolic space embedding model, to obtain an embedded vector of each node in the hyperbolic space.
As an embodiment, the determining module may be specifically configured to randomly give a representation vector of each node to be embedded in the heterogeneous information network.
By applying the embodiment shown in fig. 4, the representation vector of each node in the heterogeneous information network is input into a preset hyperbolic space embedding model, and exponential mapping in a hyperbolic space is performed on the representation vector based on that model, so as to obtain the embedding vector of each node in the hyperbolic space. Because data with power-law distribution characteristics can be modeled more accurately in hyperbolic space, and the data of a heterogeneous information network has such characteristics, embedding the heterogeneous information network in hyperbolic space reflects its structural and semantic information more closely, so that this information is preserved more completely. Therefore, the embedding accuracy can be improved.
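The module division of fig. 4 can be sketched as a small pipeline. The class and method names and the random-initialization range below are illustrative (the patent does not fix them); the update function is left as a parameter standing in for the exponential mapping:

```python
import random

class HINEmbedder:
    """Sketch of the three modules in fig. 4: determine (401), input (402), map (403)."""

    def determine(self, nodes, dim=2):
        # determining module 401: representation vectors, here randomly given
        return {n: [random.uniform(-1e-3, 1e-3) for _ in range(dim)] for n in nodes}

    def input_to_model(self, vectors):
        # input module 402: hand the vectors to the hyperbolic space embedding model
        self.vectors = vectors
        return self

    def map(self, update_fn, preset_value=3):
        # mapping module 403: repeated exponential mapping in hyperbolic space
        for _ in range(preset_value):
            self.vectors = {n: update_fn(v) for n, v in self.vectors.items()}
        return self.vectors

emb = HINEmbedder()
out = emb.input_to_model(emb.determine(["a1", "p1", "v1"])).map(lambda v: v)
```

Chaining the three calls mirrors the method flow S101–S103: determine the representation vectors, feed them to the model, and map them a preset number of times.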
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
All the embodiments in the present specification are described in a related manner, and the same and similar parts among the embodiments may be referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, as for the apparatus embodiment, since it is substantially similar to the method embodiment, the description is relatively simple, and for the relevant points, reference may be made to the partial description of the method embodiment.
The above description is only for the preferred embodiment of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.

Claims (9)

1. A method for embedding a heterogeneous information network, the method comprising:
determining a representation vector of each node to be embedded into a heterogeneous information network;
inputting the determined representation vector into a preset hyperbolic space embedding model;
performing exponential mapping in a hyperbolic space on the representation vector based on the hyperbolic space embedding model to obtain an embedding vector of each node in the hyperbolic space, which comprises: performing exponential mapping in the hyperbolic space on the representation vector based on the hyperbolic space embedding model to obtain a mapping result; judging whether the number of mappings has reached a preset value; if so, determining the mapping result as the embedding vector of each node in the hyperbolic space; and if not, updating the representation vector based on the mapping result and returning to the step of performing exponential mapping in the hyperbolic space on the representation vector based on the hyperbolic space embedding model.
2. The method of claim 1, wherein the step of determining a representative vector for each node to be embedded in the heterogeneous information network comprises:
and randomly giving a representation vector of each node to be embedded into the heterogeneous information network.
3. The method according to claim 1, wherein before the step of performing exponential mapping in hyperbolic space on the representation vector based on the hyperbolic space embedding model to obtain the embedding vector of each node in hyperbolic space, the method further comprises:
determining an association relationship between the nodes to be embedded in the heterogeneous information network;
for each node, determining, based on the determined association relationship, a node whose distance to the node is within a first preset range as a neighbor node, and acquiring a representation vector of the neighbor node;
calculating the distance between the representation vector of the node and the representation vector of the neighbor node in the hyperbolic space as a first distance;
calculating the similarity between the node and the neighbor node according to the first distance as a first similarity;
calculating the gradient of the first similarity with respect to the representation vector of the node as a first gradient of the node;
wherein the step of performing exponential mapping in a hyperbolic space on the representation vector based on the hyperbolic space embedding model to obtain an embedding vector of each node in the hyperbolic space comprises:
for each node, performing exponential mapping in the hyperbolic space on the representation vector of the node according to the representation vector of the node and the first gradient of the node, based on the hyperbolic space embedding model, to obtain the embedding vector of the node in the hyperbolic space.
4. The method of claim 3, wherein the step of determining the association relationship between each node to be embedded in the heterogeneous information network comprises:
generating a meta-path of the heterogeneous information network to be embedded according to the type of each node in the heterogeneous information network to be embedded;
for each node to be embedded in the heterogeneous information network, determining, according to the meta-path, a node whose distance to the node is within a second preset range as an associated node of the node, the second preset range being larger than the first preset range; and
generating a relationship sequence comprising the node and the determined associated node.
5. The method according to claim 1, wherein before the step of performing exponential mapping in hyperbolic space on the representation vector based on the hyperbolic space embedding model to obtain the embedding vector of each node in hyperbolic space, the method further comprises:
determining an association relationship between the nodes to be embedded in the heterogeneous information network;
for each node, determining, based on the association relationship between the nodes, a node whose distance to the node is within a first preset range as a neighbor node, and acquiring a representation vector of the neighbor node;
calculating the distance between the representation vector of the node and the representation vector of the neighbor node in the hyperbolic space as a first distance;
calculating the similarity between the node and the neighbor node according to the first distance as a first similarity;
calculating the gradient of the first similarity with respect to the representation vector of the node as a first gradient of the node;
for each node, determining a preset number of nodes having no association relationship with the node as negative sample nodes, and acquiring representation vectors of the negative sample nodes;
calculating the distance between the representation vector of the node and the representation vector of the negative sample node in the hyperbolic space as a second distance;
calculating the similarity between the node and the negative sample node according to the second distance as a second similarity;
calculating the sum of the first similarity and the second similarity to obtain a similarity sum;
calculating the gradient of the similarity sum with respect to the representation vector of the node as a second gradient of the node;
wherein the step of performing exponential mapping in a hyperbolic space on the representation vector based on the hyperbolic space embedding model to obtain an embedding vector of each node in the hyperbolic space comprises:
for each node, performing exponential mapping in the hyperbolic space on the representation vector of the node according to the representation vector of the node and the second gradient of the node, based on the hyperbolic space embedding model, to obtain the embedding vector of the node in the hyperbolic space.
6. The method of claim 5, further comprising, after the step of calculating the sum of the first similarity and the second similarity to obtain a sum of similarities:
calculating the gradient of the similarity sum with respect to the representation vector of the neighbor node as a gradient of the neighbor node;
wherein the step of performing exponential mapping in a hyperbolic space on the representation vector based on the hyperbolic space embedding model to obtain an embedding vector of each node in the hyperbolic space comprises:
for each node, performing exponential mapping in the hyperbolic space on the representation vector of the neighbor node of the node according to the representation vector of the neighbor node and the gradient of the neighbor node, based on the hyperbolic space embedding model, to obtain the embedding vector of the neighbor node of the node in the hyperbolic space.
7. The method of claim 5, further comprising, after the step of calculating the sum of the first similarity and the second similarity to obtain a sum of similarities:
calculating the gradient of the similarity sum with respect to the representation vector of the negative sample node as a gradient of the negative sample node;
wherein the step of performing exponential mapping in a hyperbolic space on the representation vector based on the hyperbolic space embedding model to obtain an embedding vector of each node in the hyperbolic space comprises:
for each node, performing exponential mapping in the hyperbolic space on the representation vector of the negative sample node of the node according to the representation vector of the negative sample node and the gradient of the negative sample node, based on the hyperbolic space embedding model, to obtain the embedding vector of the negative sample node of the node in the hyperbolic space.
8. An apparatus for embedding a heterogeneous information network, the apparatus comprising:
a determining module, configured to determine a representation vector of each node to be embedded in the heterogeneous information network;
an input module, configured to input the determined representation vector into a preset hyperbolic space embedding model;
a mapping module, configured to perform exponential mapping in a hyperbolic space on the representation vector based on the hyperbolic space embedding model, to obtain an embedding vector of each node in the hyperbolic space; wherein the mapping module is configured to: perform exponential mapping in the hyperbolic space on the representation vector based on the hyperbolic space embedding model to obtain a mapping result; judge whether the number of mappings has reached a preset value; if so, determine the mapping result as the embedding vector of each node in the hyperbolic space; and if not, update the representation vector based on the mapping result and return to the step performed by the mapping module.
9. The apparatus according to claim 8, wherein the determining module is specifically configured to randomly assign a representation vector for each node to be embedded in the heterogeneous information network.
CN201910054117.0A 2019-01-21 2019-01-21 Heterogeneous information network embedding method and device Active CN109800504B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910054117.0A CN109800504B (en) 2019-01-21 2019-01-21 Heterogeneous information network embedding method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910054117.0A CN109800504B (en) 2019-01-21 2019-01-21 Heterogeneous information network embedding method and device

Publications (2)

Publication Number Publication Date
CN109800504A CN109800504A (en) 2019-05-24
CN109800504B true CN109800504B (en) 2020-10-27

Family

ID=66559787

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910054117.0A Active CN109800504B (en) 2019-01-21 2019-01-21 Heterogeneous information network embedding method and device

Country Status (1)

Country Link
CN (1) CN109800504B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111209611A (en) * 2020-01-08 2020-05-29 北京师范大学 Hyperbolic geometry-based directed network space embedding method
CN111368074B (en) * 2020-02-24 2022-06-10 西安电子科技大学 Link prediction method based on network structure and text information
CN111193540B (en) * 2020-04-08 2020-09-01 北京大学深圳研究生院 Hyperbolic geometry-based sky and land information network unified routing method
CN112887143B (en) * 2021-01-27 2023-03-24 武汉理工大学 Bionic control method based on meta-search
CN113111302B (en) * 2021-04-21 2023-05-12 上海电力大学 Information extraction method based on non-European space
CN116151354B (en) * 2023-04-10 2023-07-18 之江实验室 Learning method and device of network node, electronic device and storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103559318A (en) * 2013-11-21 2014-02-05 北京邮电大学 Method for sequencing objects included in heterogeneous information network
CN106777339A (en) * 2017-01-13 2017-05-31 深圳市唯特视科技有限公司 A kind of method that author is recognized based on heterogeneous network incorporation model

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105913125B (en) * 2016-04-12 2018-05-25 北京邮电大学 Heterogeneous information network element path determines, link prediction method and device
CN107944629B (en) * 2017-11-30 2020-08-07 北京邮电大学 Recommendation method and device based on heterogeneous information network representation
CN109002488B (en) * 2018-06-26 2020-10-02 北京邮电大学 Recommendation model training method and device based on meta-path context
CN108984532A (en) * 2018-07-27 2018-12-11 福州大学 Aspect abstracting method based on level insertion


Also Published As

Publication number Publication date
CN109800504A (en) 2019-05-24

Similar Documents

Publication Publication Date Title
CN109800504B (en) Heterogeneous information network embedding method and device
Webster et al. Detecting overfitting of deep generative networks via latent recovery
CN109816032B (en) Unbiased mapping zero sample classification method and device based on generative countermeasure network
Chen et al. Signal recovery on graphs: Variation minimization
CN111008447B (en) Link prediction method based on graph embedding method
Fan et al. Querying big graphs within bounded resources
EP4053718A1 (en) Watermark information embedding method and apparatus
WO2017181866A1 (en) Making graph pattern queries bounded in big graphs
US11663485B2 (en) Classification of patterns in an electronic circuit layout using machine learning based encoding
WO2023274059A1 (en) Method for training alternating sequence generation model, and method for extracting graph from text
Huai et al. Zerobn: Learning compact neural networks for latency-critical edge systems
CN115238582A (en) Reliability evaluation method, system, equipment and medium for knowledge graph triples
Mo et al. Network simplification and K-terminal reliability evaluation of sensor-cloud systems
Xu et al. Learning simple thresholded features with sparse support recovery
CN112529057A (en) Graph similarity calculation method and device based on graph convolution network
CN116978450A (en) Protein data processing method, device, electronic equipment and storage medium
CN116467466A (en) Knowledge graph-based code recommendation method, device, equipment and medium
Ma et al. Fuzzy nodes recognition based on spectral clustering in complex networks
CN107038211B (en) A kind of paper impact factor appraisal procedure based on quantum migration
CN113159976B (en) Identification method for important users of microblog network
Yang et al. Large-scale metagenomic sequence clustering on map-reduce clusters
Cyrus et al. Meta-interpretive Learning from Fractal Images
CN114155410A (en) Graph pooling, classification model training and reconstruction model training method and device
Li et al. Community-aware efficient graph contrastive learning via personalized self-training
Matsushita et al. C-AP: Cell-based Algorithm for Efficient Affinity Propagation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant