CN111209408A - Time-carrying knowledge graph embedding method based on hybrid translation model - Google Patents

Time-carrying knowledge graph embedding method based on hybrid translation model

Info

Publication number
CN111209408A
CN111209408A CN201911335182.7A CN201911335182A
Authority
CN
China
Prior art keywords
entity
translation model
knowledge graph
time
data set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911335182.7A
Other languages
Chinese (zh)
Inventor
王治豪
李鑫
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
East China Normal University
Original Assignee
East China Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by East China Normal University filed Critical East China Normal University
Priority to CN201911335182.7A priority Critical patent/CN111209408A/en
Publication of CN111209408A publication Critical patent/CN111209408A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367Ontology

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Animal Behavior & Ethology (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a time-carrying knowledge graph embedding method based on a hybrid translation model, which comprises the following steps: 1) inputting a data set related to the knowledge graph and performing initialization according to the characteristics of the data set; 2) updating the embedded representation of the knowledge graph by using the hybrid translation model to obtain an embedded representation result; 3) performing a completion operation on the related knowledge graph data set according to the embedded representation result. The method addresses the problem that existing algorithms cannot process time and multivariate relations simultaneously, and improves the accuracy of the completion judgment.

Description

Time-carrying knowledge graph embedding method based on hybrid translation model
Technical Field
The invention relates to the field of knowledge graph embedding, in particular to a time-carrying knowledge graph embedding method based on a hybrid translation model.
Background
The knowledge graph is a directed graph that takes different entities in the real world as nodes and different relations as edges. A set of facts existing in the real world is usually represented in the form of triples (h, r, t), where h denotes the head entity, t denotes the tail entity, and r denotes the relation between the two entities. Although a constructed knowledge graph contains a large number of facts, it usually still needs to be completed. The task of knowledge graph completion is to predict the most likely missing entities and relations; for example, head-entity prediction finds the most likely head entity given the relation and the tail entity. Since knowledge graphs constructed from the real world are large and structurally complex, symbol-based approaches and traditional logic-based approaches are neither scalable nor suitable for the current knowledge graph completion task.
Knowledge graph embedding has become one of the important methods for knowledge graph completion. In knowledge graph representation learning, translation-model-based embedding methods are simple and effective and perform well on related prediction tasks. They attempt to learn a low-dimensional embedding of each entity and relation in a continuous vector space and evaluate the plausibility of triples with a scoring function defined on the entity and relation embeddings. Among representation learning methods, translation-based embedding balances model simplicity with prediction accuracy. Translation-based models originate from TransE: when (h, r, t) is a valid triple, it expects h + r ≈ t to hold in the embedding space. TransE is suitable for modeling 1-to-1 relations, but has drawbacks when dealing with reflexive and multivariate relations. To address this issue, researchers have proposed further translation-based models, including TransH, TransR/CTransR and TransD, in order to model knowledge graphs with diverse types of multivariate relations more effectively.
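As an illustration of the translation principle described above (a minimal, hypothetical NumPy sketch, not part of the claimed method), a distance-based score for h + r ≈ t can be computed as follows:

import numpy as np

def transe_score(h, r, t, norm=1):
    # Distance-based translation score: smaller means (h, r, t) is more plausible.
    return np.linalg.norm(h + r - t, ord=norm)

# Toy 4-dimensional embeddings; in practice these vectors are learned.
h = np.array([0.1, 0.3, -0.2, 0.5])
r = np.array([0.4, -0.1, 0.2, 0.0])
t = np.array([0.5, 0.2, 0.0, 0.5])
print(transe_score(h, r, t))  # close to 0 when h + r is close to t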
Previous translation-based knowledge graph embedding models (including TransH, TransR/CTransR and TransD) focus on modeling static knowledge graphs, in which triples are assumed to hold universally. In fact, many facts in knowledge graphs (KGs) are time-dependent; for example, a triple such as (Einstein, diedIn, Princeton) is only valid at a specific time.
Another existing approach uses neural networks to model temporal relations as events that change over time. It uses an RNN as an event encoder to model temporal and multivariate relational interactions between entities, and a so-called neighborhood aggregator to model concurrent interactions at the same timestamp.
Disclosure of Invention
The invention aims to provide a time-carrying knowledge graph embedding method based on a hybrid translation model which, by mixing the TransD and TransH models, processes time and multivariate relations in a temporal knowledge graph simultaneously, improves the completion accuracy of the knowledge graph, and learns a better knowledge graph embedding representation.
The specific technical scheme for realizing the purpose of the invention is as follows:
a time-carrying knowledge graph embedding method based on a hybrid translation model comprises the following steps:
step 1: inputting a temporal knowledge graph data set to be completed, and performing initialization according to the size of the data set; wherein the initialization specifically is:
step A1: randomly initializing the entities and relations in the data set and representing them in vector form to obtain initial relation vectors and entity vectors;
step A2: adding each relation vector into the relation set R, and adding each entity vector into the entity set E;
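As a hedged illustration of steps A1-A2 (hypothetical NumPy code; the embedding dimension and the uniform initialization range are assumptions, not specified by the invention):

import numpy as np

def init_embeddings(entities, relations, dim=100, seed=0):
    # Step A1: randomly initialize a vector for every entity and relation.
    # Step A2: collect them into the entity set E and relation set R.
    rng = np.random.default_rng(seed)
    bound = 6.0 / np.sqrt(dim)  # assumed uniform initialization range
    E = {e: rng.uniform(-bound, bound, dim) for e in entities}
    R = {r: rng.uniform(-bound, bound, dim) for r in relations}
    return E, R

E, R = init_embeddings(["Einstein", "Princeton"], ["diedIn"], dim=50)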
step 2: updating the embedded representation of the knowledge graph by using the hybrid translation model to obtain an embedded representation result; wherein updating the embedded representation of the knowledge graph with the hybrid translation model specifically comprises:
step B1: sampling a mini-batch of fixed size from the training set of the data set;
step B2: constructing a negative sample data set: for each triple in the batch, randomly replacing h, r or t to generate a corrupted triple and adding it to the batch, forming the training data used by the current batch;
step B3: mapping the correct triple (h, r, t) and the corrupted triples [(h, r', t), (h', r, t) or (h, r, t')] from the entity space into the relation space, obtaining the correct triple (h_⊥, r, t_⊥) and the corrupted triples [(h_⊥, r', t_⊥), (h'_⊥, r, t_⊥) or (h_⊥, r, t'_⊥)] in the relation space;
step B4: projecting all triples in the relation space onto the hyperplane constructed for the corresponding time τ;
step B5: calculating the loss function, and updating the embedded representations of the entities and relations with a gradient descent algorithm;
step B6: repeating steps B1-B5 until the obtained result is stable;
in step B2, the method for constructing the negative sample data set is as follows:
D^-_{h,τ} = { (h', r, t, τ) : h' ∈ E, (h', r, t, τ) ∉ D^+ }
D^-_{r,τ} = { (h, r', t, τ) : r' ∈ R, (h, r', t, τ) ∉ D^+ }
D^-_{t,τ} = { (h, r, t', τ) : t' ∈ E, (h, r, t', τ) ∉ D^+ }
where h, r, t, h', r', t' denote embedded representations of entities and relations, h' is a randomly substituted head entity, r' a randomly substituted relation, t' a randomly substituted tail entity, D^+ is the set of positive samples, and D^-_{x,τ} (x ∈ {h, r, t}) is the negative sample data set;
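A minimal sketch of this negative-sampling step (hypothetical Python code, assuming the facts are stored as quadruples (h, r, t, τ) and that E and R are the entity and relation sets from step A2):

import random

def corrupt(quad, entities, relations, positives):
    # Step B2: randomly replace h, r or t and keep the result only if it is
    # not already a known positive quadruple.
    h, r, t, tau = quad
    while True:
        slot = random.choice(["h", "r", "t"])
        if slot == "h":
            candidate = (random.choice(entities), r, t, tau)
        elif slot == "r":
            candidate = (h, random.choice(relations), t, tau)
        else:
            candidate = (h, r, random.choice(entities), tau)
        if candidate not in positives:
            return candidate

positives = {("Einstein", "diedIn", "Princeton", 1955)}
entities = ["Einstein", "Princeton", "Ulm"]
relations = ["diedIn", "bornIn"]
print(corrupt(("Einstein", "diedIn", "Princeton", 1955), entities, relations, positives))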
in step B3, the mapping from the entity space to the relationship space is as follows:
h_⊥ = M_rh · h, with M_rh = r_p · h_p^T + I^{m×n}
t_⊥ = M_rt · t, with M_rt = r_p · t_p^T + I^{m×n}
where M_rh and M_rt are the mapping matrices, r_p is the projection vector of the relation, I^{m×n} is an identity matrix of size m × n, h_p^T is the transposed projection vector of the head entity, h_⊥ is the embedded representation of the head entity in the relation space after mapping, t_p^T is the transposed projection vector of the tail entity, and t_⊥ is the embedded representation of the tail entity in the relation space after mapping;
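The following hedged NumPy sketch illustrates this TransD-style mapping into the relation space (an illustration only; the dimensions m and n are assumed values):

import numpy as np

def map_to_relation_space(e, e_p, r_p):
    # Build the mapping matrix M_r = r_p * e_p^T + I (m x n) and
    # return the relation-space embedding M_r @ e.
    m, n = r_p.shape[0], e.shape[0]
    M = np.outer(r_p, e_p) + np.eye(m, n)
    return M @ e

rng = np.random.default_rng(0)
n, m = 4, 3  # entity-space and relation-space dimensions (assumed values)
h, h_p = rng.normal(size=n), rng.normal(size=n)
r_p = rng.normal(size=m)
h_perp = map_to_relation_space(h, h_p, r_p)  # h projected into the relation space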
in step B4, the projection of all triples in the relationship space onto the hyperplane constructed at the corresponding time τ is as follows:
h_τ = h_⊥ - (ω_τ^T h_⊥) ω_τ
r_τ = r - (ω_τ^T r) ω_τ
t_τ = t_⊥ - (ω_τ^T t_⊥) ω_τ
where ω_τ represents the hyperplane (normal vector) constructed for the corresponding time τ;
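A hedged sketch of this temporal hyperplane projection (hypothetical NumPy code; ω_τ is assumed to be normalized to unit length, as the projection formula implies):

import numpy as np

def project_to_time_hyperplane(v, w_tau):
    # Project v onto the hyperplane with unit normal w_tau:
    # v_tau = v - (w_tau . v) * w_tau
    w = w_tau / np.linalg.norm(w_tau)
    return v - np.dot(w, v) * w

rng = np.random.default_rng(1)
w_tau = rng.normal(size=3)  # hyperplane normal for time tau (randomly chosen here)
h_perp, r_vec, t_perp = rng.normal(size=3), rng.normal(size=3), rng.normal(size=3)
h_tau = project_to_time_hyperplane(h_perp, w_tau)
r_tau = project_to_time_hyperplane(r_vec, w_tau)
t_tau = project_to_time_hyperplane(t_perp, w_tau)
score = np.linalg.norm(h_tau + r_tau - t_tau, ord=1)  # f_tau with the L1 norm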
in step B5, the loss function is:
L = Σ_{τ∈T} Σ_{x∈D^+_τ} Σ_{y∈D^-_{x,τ}} max(0, f_τ(x) - f_τ(y) + γ)
where f_τ(x) and f_τ(y) are the scores of a positive sample x and a negative sample y under the scoring function f_τ = ||h_τ + r_τ - t_τ||_{L1/L2} (using either the L1 or the L2 norm), γ is the minimum margin between positive and negative samples, and T represents the set of all times;
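A minimal sketch of this margin-based loss (hypothetical NumPy code; in practice the gradient descent update of step B5 would be handled by an automatic differentiation framework):

import numpy as np

def margin_loss(pos_scores, neg_scores, gamma=1.0):
    # Hinge loss max(0, f_tau(x) - f_tau(y) + gamma) summed over all
    # positive/negative score pairs of one batch.
    pos = np.asarray(pos_scores)[:, None]
    neg = np.asarray(neg_scores)[None, :]
    return np.maximum(0.0, pos - neg + gamma).sum()

print(margin_loss([0.3, 0.5], [1.2, 0.9], gamma=1.0))  # -> 1.4 for these toy scores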
step 3: performing the completion operation on the knowledge graph data set in use according to the embedded representation result; specifically:
step C1: for a triple (h, r, t), first checking whether h and t belong to E and whether r belongs to R; if not, h, t or r cannot serve as a head/tail entity or relation and the triple does not hold; if so, executing the next step;
step C2: calculating the scoring function f = ||h_τ + r_τ - t_τ||_{L1/L2} for each candidate triple, ranking all candidate triples by score, and taking the top-ranked results as the optimal results used for completion.
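A hedged sketch of the completion step (hypothetical Python code; score_triple stands for the scoring function f above and is assumed to already include the relation-space mapping and the time-hyperplane projection):

def complete_tail(h, r, tau, entity_set, relation_set, score_triple, top_k=10):
    # Step C1: reject queries whose head entity or relation is unknown.
    if h not in entity_set or r not in relation_set:
        return []
    # Step C2: score every candidate tail entity and rank (lower score = better).
    candidates = [(t, score_triple(h, r, t, tau)) for t in entity_set]
    candidates.sort(key=lambda pair: pair[1])
    return candidates[:top_k]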
The invention has the following advantages:
1) The invention enables a traditional translation algorithm in this field to process time and multivariate relations in the knowledge graph simultaneously.
2) The invention provides a novel negative sample construction method.
3) The invention has better performance on tasks such as link prediction, relation prediction and the like.
Drawings
FIG. 1 is a flow chart of the present invention.
Detailed Description
The following describes embodiments of the present invention with reference to the accompanying drawings, so that those skilled in the art can better understand the invention.
The effectiveness of the method is verified by comparative experiments on YAGO11K and Wikidata12K. Each of these data sets consists of a training set, a validation set and a test set. Table 1 lists the statistics of the data sets. Each training entry is a triple (h, r, t) indicating that h and t have the relationship r.
TABLE 1 data set partitioning for Wikidata12K and YAGO11K (k denotes 1000)
The following describes how the invention learns the knowledge graph embedding representation, taking YAGO11K as an example, and uses the obtained entity vectors and relation vectors for link prediction and relation prediction. As shown in FIG. 1, the knowledge graph embedding method based on the hybrid translation model includes the following specific steps:
S1-1: randomly initialize the entities and relations in the data set and represent them in vector form to obtain initial relation vectors and entity vectors, i.e. randomly generate a representation vector for each of the 10 relations and 10623 entities in the data set;
S1-2: add each relation vector that appears into the relation set R, and each entity vector that appears into the entity set E;
S2.1: sample a mini-batch of fixed size from the 16400 training triples of YAGO11K;
S2.2: construct the negative sample data set: for each triple in the batch, randomly replace h, r or t to generate a corrupted triple, and build the training data used by the current batch; the negative samples are constructed as follows:
D^-_{h,τ} = { (h', r, t, τ) : h' ∈ E, (h', r, t, τ) ∉ D^+ }
D^-_{r,τ} = { (h, r', t, τ) : r' ∈ R, (h, r', t, τ) ∉ D^+ }
D^-_{t,τ} = { (h, r, t', τ) : t' ∈ E, (h, r, t', τ) ∉ D^+ }
where h, r, t, h', r', t' denote embedded representations of entities and relations, h' is a randomly substituted head entity, r' a randomly substituted relation, t' a randomly substituted tail entity, D^+ is the set of positive samples, and D^-_{x,τ} is the negative sample set. By randomly replacing the head entity, the relation or the tail entity, new triples are formed that do not belong to the positive sample set, and these erroneous triples are put into the training data.
S2.3: map the triple (h, r, t) from the entity space into the relation space to obtain the relation-space triple (h_⊥, r, t_⊥); the mapping is:
h_⊥ = M_rh · h, with M_rh = r_p · h_p^T + I^{m×n}
t_⊥ = M_rt · t, with M_rt = r_p · t_p^T + I^{m×n}
where M_rh and M_rt are the mapping matrices, r_p is the projection vector of the relation, I^{m×n} is an identity matrix of size m × n, h_p^T is the transposed projection vector of the head entity, h_⊥ is the embedded representation of the head entity in the relation space after mapping, t_p^T is the transposed projection vector of the tail entity, and t_⊥ is the embedded representation of the tail entity in the relation space after mapping.
S2.4: project the relation-space triple (h_⊥, r, t_⊥) onto the hyperplane constructed for the corresponding time τ to obtain the triple (h_τ, r_τ, t_τ); the projection is:
h_τ = h_⊥ - (ω_τ^T h_⊥) ω_τ
r_τ = r - (ω_τ^T r) ω_τ
t_τ = t_⊥ - (ω_τ^T t_⊥) ω_τ
where ω_τ represents the hyperplane (normal vector) constructed for the corresponding time τ.
S2.5: calculate the loss function and update the embedded representations of the entities and relations with a gradient descent algorithm. The loss function is
L = Σ_{τ∈T} Σ_{x∈D^+_τ} Σ_{y∈D^-_{x,τ}} max(0, f_τ(x) - f_τ(y) + γ)
where f_τ(x) and f_τ(y) are the scores of a positive sample x and a negative sample y under the scoring function f_τ = ||h_τ + r_τ - t_τ||_{L1/L2} (using either the L1 or the L2 norm), and γ is the minimum margin between positive and negative samples.
Repeat steps S2.1 to S2.5 until the result of the whole algorithm is stable.
S3-1: for a triple (h, r, t), first check whether h and t belong to E and whether r belongs to R; if not, h, t or r cannot serve as a head/tail entity or relation and the triple does not hold; if so, execute the next step;
S3-2: compute the scoring function f = ||h_τ + r_τ - t_τ||_{L1/L2} for each candidate triple, sort all candidate results by score, and take the top-ranked results as the optimal results used for completion.
The experiments compare the invention with the current state-of-the-art models HyTE and RE-NET on the test set, mainly on two metrics: Mean Rank and top-10 recall (Hits@10). For each triple (h, r, t) in the test set, when the head entity h and the relation r are given to predict the tail entity t, scores are computed for all candidate tail entities in the data set with the scoring function and then sorted. The lower the score of the true tail entity t, the higher it ranks, and the more successfully the triple (h, r, t) is predicted. Therefore, the rank of the tail entity t, and whether it falls within the top ten, serve as metrics for evaluating the algorithm. Similarly, the relation r and the tail entity t can be given to predict the head entity h, yielding the rank of the head entity and whether it falls within the top ten. As Tables 2, 3 and 4 show, the invention performs better overall than methods of the same type.
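A hedged sketch of how the two metrics can be computed from the 1-based ranks of the true entities (hypothetical Python code):

def mean_rank_and_hits(ranks, k=10):
    # Mean Rank and Hits@k from the 1-based ranks of the true entities.
    mean_rank = sum(ranks) / len(ranks)
    hits_at_k = sum(1 for r in ranks if r <= k) / len(ranks)
    return mean_rank, hits_at_k

# toy example: ranks of the true tail entity over four test triples
print(mean_rank_and_hits([1, 4, 37, 2], k=10))  # -> (11.0, 0.75)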
TABLE 2 Comparison with translation-based models on the link prediction task (for Hits@10, larger is better; for Mean Rank, smaller is better; bold indicates the results of the invention)
TABLE 3 Comparison with translation-based models on the relation prediction task (for Hits@1, larger is better; for Mean Rank, smaller is better; bold indicates the results of the invention)
TABLE 4 Comparison with neural network models on the link prediction task (for Hits@1, Hits@3 and Hits@10, larger is better; for MRR, larger is better; bold indicates the results of the invention)

Claims (8)

1. A time-carrying knowledge graph embedding method based on a hybrid translation model, characterized by comprising the following steps:
step 1: inputting a temporal knowledge graph data set to be completed, and performing initialization according to the size of the data set;
step 2: updating the embedded representation of the knowledge graph by using the hybrid translation model to obtain an embedded representation result;
step 3: performing the completion operation on the knowledge graph data set in use according to the embedded representation result.
2. The time-carrying knowledge graph embedding method based on the hybrid translation model according to claim 1, wherein in step 1, the initialization comprises the following specific steps:
step A1: randomly initializing entities and relations in the data set, and displaying the entities and relations in a vector form to obtain initial relation vectors and entity vectors;
step A2: for each relationship vector, add to the relationship set R, and for each entity vector, add to the entity set E.
3. The time-carrying knowledge graph embedding method based on the hybrid translation model according to claim 1, wherein in step 2, updating the embedded representation of the knowledge graph with the hybrid translation model comprises the following specific steps:
step B1: sampling a mini-batch of fixed size from the training set of the data set;
step B2: constructing a negative sample data set: for each triple in the batch, randomly replacing h, r or t to generate a corrupted triple and adding it to the batch, forming the training data used by the current batch;
step B3: mapping the correct triple (h, r, t) and the corrupted triples [(h, r', t), (h', r, t) or (h, r, t')] from the entity space into the relation space, obtaining the correct triple (h_⊥, r, t_⊥) and the corrupted triples [(h_⊥, r', t_⊥), (h'_⊥, r, t_⊥) or (h_⊥, r, t'_⊥)] in the relation space;
step B4: projecting all triples in the relation space onto the hyperplane constructed for the corresponding time τ;
step B5: calculating the loss function, and updating the embedded representations of the entities and relations with a gradient descent algorithm;
step B6: repeating steps B1-B5 until the obtained result is stable.
4. The time-carrying knowledge graph embedding method based on the hybrid translation model according to claim 3, wherein in step B2, the negative sample data set is constructed as:
D^-_{h,τ} = { (h', r, t, τ) : h' ∈ E, (h', r, t, τ) ∉ D^+ }
D^-_{r,τ} = { (h, r', t, τ) : r' ∈ R, (h, r', t, τ) ∉ D^+ }
D^-_{t,τ} = { (h, r, t', τ) : t' ∈ E, (h, r, t', τ) ∉ D^+ }
where h, r, t, h', r', t' denote embedded representations of entities and relations, h' is a randomly substituted head entity, r' a randomly substituted relation, t' a randomly substituted tail entity, D^+ is the set of positive samples, and D^-_{x,τ} is the negative sample data set.
5. The time-carrying knowledge graph embedding method based on the hybrid translation model according to claim 3, wherein in step B3, the mapping from the entity space to the relation space is:
h_⊥ = M_rh · h, with M_rh = r_p · h_p^T + I^{m×n}
t_⊥ = M_rt · t, with M_rt = r_p · t_p^T + I^{m×n}
where M_rh and M_rt are the mapping matrices, r_p is the projection vector of the relation, I^{m×n} is an identity matrix of size m × n, h_p^T is the transposed projection vector of the head entity, h_⊥ is the embedded representation of the head entity in the relation space after mapping, t_p^T is the transposed projection vector of the tail entity, and t_⊥ is the embedded representation of the tail entity in the relation space after mapping.
6. The time-carrying knowledge graph embedding method based on the hybrid translation model according to claim 3, wherein in step B4, all triples in the relation space are projected onto the hyperplane constructed for the corresponding time τ as follows:
h_τ = h_⊥ - (ω_τ^T h_⊥) ω_τ
r_τ = r - (ω_τ^T r) ω_τ
t_τ = t_⊥ - (ω_τ^T t_⊥) ω_τ
where ω_τ represents the hyperplane (normal vector) constructed for the corresponding time τ.
7. The time-carrying knowledge graph embedding method based on the hybrid translation model according to claim 3, wherein in step B5, the loss function is:
L = Σ_{τ∈T} Σ_{x∈D^+_τ} Σ_{y∈D^-_{x,τ}} max(0, f_τ(x) - f_τ(y) + γ)
where f_τ(x) and f_τ(y) are the scores of a positive sample x and a negative sample y under the scoring function f_τ = ||h_τ + r_τ - t_τ||_{L1/L2} (using either the L1 or the L2 norm), γ is the minimum margin between positive and negative samples, and T represents the set of all times.
8. The time-carrying knowledge graph embedding method based on the hybrid translation model according to claim 1, wherein step 3 specifically comprises:
step C1: for a triple (h, r, t), first checking whether h and t belong to E and whether r belongs to R; if not, h, t or r cannot serve as a head/tail entity or relation and the triple does not hold; if so, executing the next step;
step C2: calculating the scoring function f = ||h_τ + r_τ - t_τ||_{L1/L2} for each candidate triple, ranking all candidate triples by score, and taking the top-ranked results as the optimal results used for completion.
CN201911335182.7A 2019-12-23 2019-12-23 Time-carrying knowledge graph embedding method based on hybrid translation model Pending CN111209408A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911335182.7A CN111209408A (en) 2019-12-23 2019-12-23 Time-carrying knowledge graph embedding method based on hybrid translation model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911335182.7A CN111209408A (en) 2019-12-23 2019-12-23 Time-carrying knowledge graph embedding method based on hybrid translation model

Publications (1)

Publication Number Publication Date
CN111209408A true CN111209408A (en) 2020-05-29

Family

ID=70786374

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911335182.7A Pending CN111209408A (en) 2019-12-23 2019-12-23 Time-carrying knowledge graph embedding method based on hybrid translation model

Country Status (1)

Country Link
CN (1) CN111209408A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112364108A (en) * 2020-11-13 2021-02-12 四川省人工智能研究院(宜宾) Time sequence knowledge graph completion method based on space-time architecture
CN112380864A (en) * 2020-11-03 2021-02-19 广西大学 Text triple labeling sample enhancement method based on translation
CN112380355A (en) * 2020-11-20 2021-02-19 华南理工大学 Method for representing and storing time slot heterogeneous knowledge graph
CN112395423A (en) * 2020-09-09 2021-02-23 北京邮电大学 Recursive time-series knowledge graph completion method and device
CN113033914A (en) * 2021-04-16 2021-06-25 哈尔滨工业大学 Entity and relation prediction method for machining process knowledge graph
CN113312492A (en) * 2021-05-24 2021-08-27 哈尔滨理工大学 Knowledge graph representation learning method based on dynamic translation
CN114357191A (en) * 2021-12-31 2022-04-15 科大讯飞(苏州)科技有限公司 Knowledge extraction, question answering and recommendation method, related device, equipment and medium

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112395423A (en) * 2020-09-09 2021-02-23 北京邮电大学 Recursive time-series knowledge graph completion method and device
CN112395423B (en) * 2020-09-09 2022-08-26 北京邮电大学 Recursive time sequence knowledge graph completion method and device
CN112380864A (en) * 2020-11-03 2021-02-19 广西大学 Text triple labeling sample enhancement method based on translation
CN112364108A (en) * 2020-11-13 2021-02-12 四川省人工智能研究院(宜宾) Time sequence knowledge graph completion method based on space-time architecture
CN112364108B (en) * 2020-11-13 2021-06-15 四川省人工智能研究院(宜宾) Time sequence knowledge graph completion method based on space-time architecture
CN112380355A (en) * 2020-11-20 2021-02-19 华南理工大学 Method for representing and storing time slot heterogeneous knowledge graph
CN113033914A (en) * 2021-04-16 2021-06-25 哈尔滨工业大学 Entity and relation prediction method for machining process knowledge graph
CN113033914B (en) * 2021-04-16 2022-03-25 哈尔滨工业大学 Entity and relation prediction method for machining process knowledge graph
CN113312492A (en) * 2021-05-24 2021-08-27 哈尔滨理工大学 Knowledge graph representation learning method based on dynamic translation
CN114357191A (en) * 2021-12-31 2022-04-15 科大讯飞(苏州)科技有限公司 Knowledge extraction, question answering and recommendation method, related device, equipment and medium

Similar Documents

Publication Publication Date Title
CN111209408A (en) Time-carrying knowledge graph embedding method based on hybrid translation model
US10719301B1 (en) Development environment for machine learning media models
US20230195845A1 (en) Fast annotation of samples for machine learning model development
US11537506B1 (en) System for visually diagnosing machine learning models
CN111753101B (en) Knowledge graph representation learning method integrating entity description and type
WO2020232874A1 (en) Modeling method and apparatus based on transfer learning, and computer device and storage medium
CN109840595B (en) Knowledge tracking method based on group learning behavior characteristics
JP6962123B2 (en) Label estimation device and label estimation program
CN113377964B (en) Knowledge graph link prediction method, device, equipment and storage medium
Pan et al. Identifying protein complexes from protein-protein interaction networks based on fuzzy clustering and GO semantic information
CN114707641A (en) Training method, device, equipment and medium for neural network model of double-view diagram
CN117828377B (en) Educational perception clustering method and system based on fairness weighting factors
CN106156857A (en) The method and apparatus selected for mixed model
CN110491443B (en) lncRNA protein correlation prediction method based on projection neighborhood non-negative matrix decomposition
CN115269861A (en) Reinforced learning knowledge graph reasoning method based on generative confrontation and imitation learning
Fang et al. Flexible variable selection for recovering sparsity in nonadditive nonparametric models
Manolopoulou et al. BPEC: An R package for Bayesian phylogeographic and ecological clustering
Lonij et al. Open-world visual recognition using knowledge graphs
Li et al. Robust multi-label semi-supervised classification
Song et al. Toward robustness in multi-label classification: A data augmentation strategy against imbalance and noise
CN116128013B (en) Temporary collaboration method and device based on diversity population training and computer equipment
KR20180087069A (en) A method for predicting drug-target interactions via self-training
CN115129029B (en) Industrial system fault diagnosis method and system based on sub-field adaptive dictionary learning
Mosin et al. Comparing input prioritization techniques for testing deep learning algorithms
CN116108189A (en) Quaternion-based knowledge graph embedding method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20200529