CN114077676A

CN114077676A - Knowledge graph noise detection method based on path confidence

Info

Publication number: CN114077676A
Application number: CN202111393836.9A
Authority: CN
Inventors: 马江涛; 周辰宇; 王艳军; 李端阳; 贾泽臣; 马宇科; 李霆; 卢威光; 张蓓蕾; 李清扬; 赵一帆
Original assignee: Henan Tupu Information Technology Co ltd; Zhengzhou University of Light Industry
Current assignee: Henan Tupu Information Technology Co ltd; Zhengzhou University of Light Industry
Priority date: 2021-11-23
Filing date: 2021-11-23
Publication date: 2022-02-22
Anticipated expiration: 2041-11-23
Also published as: CN114077676B

Abstract

The invention provides a knowledge graph noise detection method based on path confidence, comprising the following steps. First, the triples are initialized, all paths of all triples are found, each triple on each path is embedded using the translation model TransE, and all paths of a triple are represented as path embedding sequences, in which every two adjacent triples form a node. Second, the nodes are input in sequence to the CPLL to calculate the confidence score of each node on each path, and the node confidence scores of each path are input to a Bi-GRU to obtain a score matrix for each path. Finally, the L2 norm of each path's score matrix is taken as the path confidence, and the score matrix with the highest path confidence is taken as the optimal embedding matrix of the triple. By combining the path-based approach with the rule-based approach, the invention improves the efficiency of detecting noise in the knowledge graph and thereby improves the quality of the knowledge graph.

Description

Knowledge graph noise detection method based on path confidence

Technical Field

The invention relates to the technical field of knowledge graphs, in particular to a knowledge graph noise detection method based on path confidence.

Background

Knowledge graphs now play an important role in artificial intelligence tasks. However, manually or automatically constructed knowledge graphs suffer from a number of quality problems and often contain erroneous or missing triples. Noise in a knowledge graph may be caused by human error or by errors in the data, and most noise appears as erroneous entities or relations in triples. A growing number of researchers are addressing the problem of knowledge graph noise and have proposed many solutions.

Noise detection methods for knowledge graphs can be broadly divided into path-based methods and rule-based methods. Path-based methods originate from translation models such as TransE, TransH and TransR, which are mostly used for knowledge graph embedding and completion but can also be used to detect noise. The PaTyBRED model proposed by Melo et al. incorporates type and path features into a local relation classifier and retains a specific path for each relation to indicate whether a triple is erroneous. Xie et al. proposed the CKRL model, which uses the local and global information of triples to represent the probability that a triple is erroneous. However, path-based methods are weak at finding noise and are not suitable for knowledge graphs containing complex relations. Rule-based methods generally have stronger noise detection capability. The PSL model proposed by Brocheler et al. extracts the most likely correct triples from uncertain triples using first-order predicate logic and weighted rules. Abedini et al. proposed Correction Tower, which identifies discrepancies, inconsistencies and erroneous relations in triples in three steps. However, rule-based methods lack the ability to represent knowledge: after they detect and remove noise, the knowledge graph must still be mapped to a continuous vector space so that it can be manipulated easily in downstream tasks.

If the path-based and rule-based approaches are combined, noise can be found and a noise-free knowledge graph representation can be constructed at the same time. Specifically, rules are first formulated to screen out effective features on the paths of a triple. These features must distinguish noise from correct information, where correct information includes global triple information and local triple information. The features are then used to complete both noise detection and triple representation, which improves the quality of the knowledge graph and the user experience.

Disclosure of Invention

The invention provides a knowledge graph noise detection method based on path confidence, which addresses two technical problems: existing path-based methods are weak at finding noise and are unsuitable for knowledge graphs containing complex relations, and rule-based methods lack the ability to represent knowledge.

The technical scheme of the invention is realized as follows:

A knowledge graph noise detection method based on path confidence includes the following steps:

Step one: initializing the number of triples, finding all paths of all triples, embedding each triple on each path using the translation model TransE, and representing all paths of a triple as path embedding sequences; every two adjacent triples in a path embedding sequence form a node, and the number of nodes is n;

Step two: inputting the nodes in sequence to the confidence- and correlation-based probabilistic logic layer (CPLL) and calculating a confidence score matrix for each node on each path;

Step three: inputting the confidence score matrices of all nodes on each path into the Bi-GRU to obtain a score matrix for each path;

Step four: taking the L2 norm of each path's score matrix as the path confidence, and taking the score matrix with the highest path confidence as the optimal embedding matrix of the triple.
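Step one relies on TransE, which models a valid triple as a translation h + r ≈ t in embedding space. The following is a minimal sketch of that scoring rule and of one way to lay out a triple embedding. The function names, the 3 x d row layout and the random vectors are illustrative assumptions, not details given in this document.

```python
import numpy as np

def transe_score(h, r, t):
    """TransE distance ||h + r - t||_2; smaller means a more plausible triple."""
    return float(np.linalg.norm(h + r - t))

def embed_triple(h, r, t):
    """Stack head, relation and tail embeddings as rows of a 3 x d matrix
    (assumed layout; the document does not specify one)."""
    return np.stack([h, r, t])

d = 4                              # embedding dimension (hypothetical)
rng = np.random.default_rng(0)
h = rng.normal(size=d)
r = rng.normal(size=d)
t_good = h + r                     # tail that satisfies the translation exactly
t_bad = t_good + 1.0               # tail shifted away from h + r

N = embed_triple(h, r, t_good)     # 3 x d embedding of one path triple
print(transe_score(h, r, t_good), transe_score(h, r, t_bad))
```

A corrupted tail yields a larger distance than the consistent one, which is why translation models can be reused for noise detection.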

Preferably, in the second step, the specific method is as follows:

S21, initializing the input node T:

$T = N'_i \cdot (N'_{i+1})^{\mathsf{T}}$ (1);

$N'_i = (x'_i, r'_i, x'_{i+1})$ (2);

$N'_{i+1} = (x'_{i+1}, r'_{i+1}, x'_{i+2})$ (3);

where $N'_i$ denotes the embedding matrix of the i-th triple on the path, $N'_{i+1}$ the embedding matrix of the (i+1)-th triple on the path, and $(N'_{i+1})^{\mathsf{T}}$ the transpose of the triple embedding matrix $N'_{i+1}$; $x'_i$, $x'_{i+1}$ and $x'_{i+2}$ all denote entities, and $r'_i$ and $r'_{i+1}$ both denote relations;

S22, multiplying the node T by the parameter matrix $W_0$ to obtain the global confidence between the triples, i.e. the global triple confidence:

$GTT(i, i+1) = T \cdot W_0$ (4);

where $GTT(i, i+1)$ is the global triple confidence;

S23, the node T enters the Separate&Padd layer, which separates the sub-matrix blocks $T_1, T_2, T_3$ on the diagonal of T; $T_1, T_2, T_3$ are then multiplied by the parameter matrices $W_1, W_2, W_3$ respectively to obtain D, E and F; correlation-based logic operations are performed on D, E and F, and the results are added to obtain the local confidence between the triples, i.e. the local triple confidence:

$T_1 = x'_i \cdot x'_{i+1},\; T_2 = r'_i \cdot r'_{i+1},\; T_3 = x'_{i+1} \cdot x'_{i+2}$ (5);

$D = T_1 \cdot W_1,\; E = T_2 \cdot W_2,\; F = T_3 \cdot W_3$ (6);

[Equations (7) to (11), which define the logic operations and LTT, are not reproduced in this text.]

where MIN(·) denotes the minimum value of a matrix, MAX(·) the maximum value of a matrix, 1 a matrix whose elements are all 1, and -1 a matrix whose elements are all -1; the operators in equations (7) to (11) denote different logic operations, and $LTT(i, j)$ is the local triple confidence;

S24, multiplying the global triple confidence by the local triple confidence to obtain the confidence score $G_i$ of the node T:

$G_i = GTT(i, i+1) \cdot LTT(i, i+1)$ (12).
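The matrix algebra of S21 to S24 can be sketched as below, assuming each triple embedding is a 3 x d matrix whose rows are the head, relation and tail vectors. Under that assumption the diagonal of T holds exactly the three products of equation (5). The logic operations of equations (7) to (11) are not available in this text, so LTT is passed in as a given value rather than computed. The function names, the concrete numbers, the identity W0 and the scalar W1 to W3 are all illustrative assumptions.

```python
import numpy as np

def node_matrix(n_i, n_next):
    """Eq. (1): T = N'_i (N'_{i+1})^T for two 3 x d triple embeddings."""
    return n_i @ n_next.T

def global_confidence(T, W0):
    """Eq. (4): GTT = T W0."""
    return T @ W0

def local_terms(T, W1, W2, W3):
    """Eqs. (5)-(6): diagonal entries T1, T2, T3 of T scaled by W1, W2, W3."""
    T1, T2, T3 = np.diag(T)
    return T1 * W1, T2 * W2, T3 * W3

def node_score(GTT, LTT):
    """Eq. (12): G_i = GTT * LTT. LTT is supplied by the caller because
    eqs. (7)-(11) are not reproduced in the text."""
    return GTT * LTT

# Hypothetical 2-dimensional embeddings for two adjacent triples.
x_i, r_i, x_i1 = np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([1.0, 1.0])
r_i1, x_i2 = np.array([2.0, 0.0]), np.array([0.0, 3.0])

N_i = np.stack([x_i, r_i, x_i1])       # (x'_i, r'_i, x'_{i+1})
N_next = np.stack([x_i1, r_i1, x_i2])  # (x'_{i+1}, r'_{i+1}, x'_{i+2})

T = node_matrix(N_i, N_next)
GTT = global_confidence(T, np.eye(3))  # identity W0, for illustration only
D, E, F = local_terms(T, 0.5, 0.5, 0.5)
G = node_score(GTT, LTT=0.5)           # placeholder LTT value
```

With row-stacked embeddings, `np.diag(T)` returns the dot products of x'_i with x'_{i+1}, r'_i with r'_{i+1}, and x'_{i+1} with x'_{i+2}, which matches equation (5) term by term.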

Preferably, in step three, the specific method is as follows:

S31, taking the confidence score $G_i$ of each node and the confidence scores $G_{i+1}$, $G_{i-1}$ of its neighboring nodes as the input of the bidirectional GRU, the i-th forward GRU and backward GRU are computed by equations (13) and (14) respectively:

[Equations (13) and (14) are not reproduced in this text.]

where the quantities defined by equations (13) and (14) are the output of the forward GRU and the output of the backward GRU respectively, and GRU(·) denotes the gated recurrent unit network.

S32, performing concatenation, linear and normalization operations on the final outputs of the forward GRU and the backward GRU to obtain the path score matrix:

[Equation (15) is not reproduced in this text.]

where $h(p)$ denotes the output of the gated recurrent unit network, i.e. the path score matrix, computed from the final output of the forward GRU and the final output of the backward GRU; concat(·) denotes the concatenation function, line(·) the linear function, and softmax(·) the normalization function.
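Step three can be illustrated with a small NumPy Bi-GRU: one GRU reads the node scores front to back, another reads them back to front, and the two final states are concatenated, passed through a linear map and normalized with softmax. This is a sketch using the standard GRU update rule. The exact form of equations (13) to (15), the flattening of each node score into a vector, the dimensions and the random weights are assumptions, and the output here is a vector rather than a matrix.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - np.max(x))  # subtract the max for numerical stability
    return e / e.sum()

def init_gru(input_dim, hidden_dim, rng):
    """Small random weights for one GRU direction (illustrative only)."""
    params = {}
    for gate in ("z", "r", "h"):
        params["W" + gate] = rng.normal(scale=0.1, size=(hidden_dim, input_dim))
        params["U" + gate] = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
    return params

def gru_step(p, x, h):
    """One standard GRU update (biases omitted for brevity)."""
    z = sigmoid(p["Wz"] @ x + p["Uz"] @ h)            # update gate
    r = sigmoid(p["Wr"] @ x + p["Ur"] @ h)            # reset gate
    h_new = np.tanh(p["Wh"] @ x + p["Uh"] @ (r * h))  # candidate state
    return (1.0 - z) * h + z * h_new

def final_state(p, inputs, hidden_dim):
    """Run one direction over the inputs and return its last hidden state."""
    h = np.zeros(hidden_dim)
    for x in inputs:
        h = gru_step(p, x, h)
    return h

def path_score(node_scores, fwd, bwd, W_lin, hidden_dim):
    """softmax(linear(concat(forward final state, backward final state)))."""
    h_fwd = final_state(fwd, node_scores, hidden_dim)        # first -> last node
    h_bwd = final_state(bwd, node_scores[::-1], hidden_dim)  # last -> first node
    return softmax(W_lin @ np.concatenate([h_fwd, h_bwd]))

rng = np.random.default_rng(0)
input_dim, hidden_dim, out_dim = 9, 5, 9   # hypothetical sizes
node_scores = [rng.normal(size=input_dim) for _ in range(3)]  # three flattened G_i
fwd = init_gru(input_dim, hidden_dim, rng)
bwd = init_gru(input_dim, hidden_dim, rng)
W_lin = rng.normal(scale=0.1, size=(out_dim, 2 * hidden_dim))
h_p = path_score(node_scores, fwd, bwd, W_lin, hidden_dim)
```

Reading the sequence in both directions lets the path score depend on each node's predecessors and successors, which is the stated reason for feeding neighboring confidence scores to the GRU.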

Preferably, in step four, the path confidence and the optimal triple are calculated as follows:

$g(p_j) = \lVert h(p_j) \rVert_2$ (16);

when $g(f_k) = \max_j g(p_j)$, $h(f_k) = h(p_j)$ (17);

where $g(p)$ denotes the path confidence, $h(p_j)$ the path score matrix, $\lVert \cdot \rVert_2$ the L2 norm of a matrix, $g(f_k)$ the maximum path confidence, and $h(f_k)$ the optimal path score matrix of the triple, which is also the optimal embedding matrix of the triple.
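The selection rule of step four amounts to a norm followed by an argmax. The sketch below assumes the L2 norm of a matrix means the Frobenius norm, which is NumPy's default for 2-D arrays; the three score matrices are made-up values for illustration.

```python
import numpy as np

def path_confidence(score_matrix):
    """g(p): L2 norm of the path score matrix (Frobenius norm assumed)."""
    return float(np.linalg.norm(score_matrix))

def best_path(score_matrices):
    """Return (index, confidence, score matrix) of the most confident path."""
    confidences = [path_confidence(m) for m in score_matrices]
    k = int(np.argmax(confidences))
    return k, confidences[k], score_matrices[k]

# Hypothetical score matrices for three paths of one triple.
paths = [
    np.array([[0.1, 0.2], [0.0, 0.1]]),
    np.array([[0.6, 0.8], [0.0, 0.0]]),
    np.array([[0.3, 0.0], [0.0, 0.4]]),
]
k, g_fk, h_fk = best_path(paths)   # h_fk plays the role of h(f_k)
```

The matrix returned as `h_fk` is the one the text uses as the triple's embedding, and `g_fk` is the confidence later used for ranking.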

Preferably, the designed loss function is as follows:

$L = \sum_{(h,r,t) \in \{T' \cup T''\}} \log\left[1 + \exp\left(l_{(h,r,t)} \cdot P(h,r,t)\right)\right]$ (18);

where exp(·) denotes the exponential function with the natural constant e as base, log(·) the logarithmic function, L the loss function, $P(h,r,t)$ a path from the head entity h to the tail entity t, and r the relation; $T'$ is the set of valid triples and $T''$ the set of invalid triples; an invalid triple is formed by randomly replacing the head entity or the tail entity of an original triple, and the valid triples are the original triples.
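Equation (18) is a softplus loss summed over valid and invalid triples. The sketch below treats P(h, r, t) as a scalar score per triple and l as a label of -1 for valid triples and +1 for invalid ones, so that a higher score lowers the loss of a valid triple. Both choices are assumptions, since the text does not state the sign convention of l. `np.logaddexp(0, x)` evaluates log(1 + exp(x)) without overflow.

```python
import math
import numpy as np

def path_loss(scores, labels):
    """Sum of log(1 + exp(l * P)) over all triples, as in eq. (18)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=float)
    return float(np.sum(np.logaddexp(0.0, labels * scores)))

# Hypothetical scores: one valid triple (label -1) and one invalid triple (label +1).
scores = [2.0, -1.0]
labels = [-1.0, 1.0]
loss = path_loss(scores, labels)

# Swapping the scores makes both triples look wrong, so the loss rises.
worse = path_loss([-1.0, 2.0], labels)
```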

Compared with the prior art, the invention has the following beneficial effects:

1) Building on the path-based internal structural information of the knowledge graph, the invention introduces a correlation-based probabilistic model and fuses it into a neural network structure to detect noise in the knowledge graph and to represent the knowledge graph.

2) The invention constructs a path confidence network to calculate the global triple confidence and the local triple confidence, and combines it with a bidirectional gated recurrent network to obtain the path confidence and path score matrix of each triple; the path confidence is used to determine whether the triple is correct, and the path score matrix is used to represent the triple.

3) The invention addresses the problem of knowledge graph noise, completes the representation of the knowledge graph, and achieves good results in knowledge graph noise detection tests.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.

FIG. 1 is a sub-graph of all paths from the entity "champion" to the entity "team";

FIG. 2 is a flow chart of the present invention;

FIG. 3 is a flow chart of the proposed model of the present invention;

FIG. 4 is a block diagram of a correlation-based probabilistic logic model of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without inventive effort based on the embodiments of the present invention, are within the scope of the present invention.

In general, the relationship between triples in a knowledge graph can be expressed in the form of a path. When the triple f is expressed as (h, r, t), the path P = (h, R, t) from the head entity h to the tail entity t cannot be ignored. Here R includes at least one relation and possibly several entities; these entities and relations may form several triples N, which the present invention refers to as path triples. Every two adjacent triples constitute a node. Moreover, R ≧ r; when R = r, the path P is equal to f, indicating that f is the shortest path.
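The path and node structure described here can be sketched with plain tuples: a path is an ordered list of triples in which each tail is the next head, and every two adjacent triples form a node. The single-triple case follows the description's remark that a lone triple is the only node on its path. The entity and relation names below are placeholders, not taken from FIG. 1.

```python
def is_connected(path):
    """True if the tail of each triple is the head of the next one."""
    return all(a[2] == b[0] for a, b in zip(path, path[1:]))

def path_nodes(path):
    """Pair every two adjacent triples on a path into a node."""
    if len(path) == 1:
        return [(path[0],)]        # shortest path: the triple is its own node
    return list(zip(path, path[1:]))

# Hypothetical path from entity "a" to entity "d" through two intermediate entities.
path = [("a", "r1", "b"), ("b", "r2", "c"), ("c", "r3", "d")]
nodes = path_nodes(path)           # two nodes for three triples
shortest = path_nodes([("a", "r", "d")])
```

A path of m triples therefore yields m - 1 nodes, and the node sequence preserves path order, which matters later when the scores are read by a sequence model.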

There may be multiple paths from the head entity to the tail entity, but some paths are incorrect, some are incomplete, and the information in some is not suitable for representing the triple. FIG. 1 shows the set of all paths of f₁ = ("champion", "joining", "team"), i.e., the set of paths from the entity "champion" to the entity "team". In FIG. 1, f₁ is the shortest path and also the triple itself, while f₂ and f₃ are erroneous triples, the correct triple being ("basketball game", "equals", "match"). Thus, any path that combines f₂ or f₃ with other triples is noisy. These noisy paths must be processed before their path score matrices can be used to represent the triple.

However, most path-based knowledge graph representation methods do not exclude the noise contained in a path, whereas the rule-based approach is well suited to this problem. Specifically, each node on a path is given a confidence indicating how likely the node is to be correct, and a path confidence, indicating how likely the path is to be correct, is then obtained by probabilistic combination. If the only path from the head entity to the tail entity is the triple itself, the triple is the only node on the path, and the triple confidence, node confidence and path confidence are equal. In practice there may be multiple paths, and the path with the highest path confidence is the most appropriate one to represent the triple. If triples are represented as matrices, the path score matrix is obtained by probabilistic combination of the node confidences, and the L2 norm of the path score matrix is used as the path confidence.

As shown in fig. 2, an embodiment of the present invention provides a method for detecting noise in a knowledge-graph based on a path confidence, which includes the following specific steps:

Step one: for a set of E triples, initialize the triple count to E, traverse all triples and find all paths of each triple; traverse all paths of each triple, where the number of paths is P, and initialize the path count to P. Each triple on each path is embedded using the translation model TransE, and all paths of a triple are represented as path embedding sequences; every two adjacent triples in a path embedding sequence form a node, and the number of nodes is n; the node count is initialized to N. The structure of the proposed model is shown in FIG. 3.

Step two: as shown in FIG. 4, the nodes are input in sequence to the correlation-based probabilistic logic layer (CPLL), and the confidence score of each node on each path is calculated;

in the second step, the specific method is as follows:

S21, initializing the input node T:

$T = N'_i \cdot (N'_{i+1})^{\mathsf{T}}$ (1);

$N'_i = (x'_i, r'_i, x'_{i+1})$ (2);

$N'_{i+1} = (x'_{i+1}, r'_{i+1}, x'_{i+2})$ (3);

where $N'_i$ and $N'_{i+1}$ denote the embedding matrices of the i-th and (i+1)-th triples on the path respectively, $(N'_{i+1})^{\mathsf{T}}$ denotes the transpose of the triple embedding matrix $N'_{i+1}$, $x'_i$, $x'_{i+1}$ and $x'_{i+2}$ denote entities, and $r'_i$ and $r'_{i+1}$ denote relations.

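S22, multiplying the node T by the parameter matrix $W_0$ to obtain the global triple confidence:

$GTT(i, i+1) = T \cdot W_0$ (4);

where $GTT(i, i+1)$ is the global triple confidence.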

S23, the node T enters the separation and padding operation layer, which separates the sub-matrix blocks $T_1, T_2, T_3$ on the diagonal of T; $T_1, T_2, T_3$ are then multiplied by the parameter matrices $W_1, W_2, W_3$ respectively to obtain D, E and F; correlation-based logic operations are performed on D, E and F, and the results are added to obtain the local confidence between the triples, i.e. the local triple confidence:

$D = T_1 \cdot W_1,\; E = T_2 \cdot W_2,\; F = T_3 \cdot W_3$ (6);

[Equations (7) to (11), which define the logic operations and LTT, are not reproduced in this text.]

where the operators in equations (7) to (11) denote different logic operations, and $LTT(i, j)$ is the local triple confidence.

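S24, multiplying the global triple confidence by the local triple confidence to obtain the confidence score $G_i$ of the node T:

$G_i = GTT(i, i+1) \cdot LTT(i, i+1)$ (12).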

Step three: the confidence scores of all nodes on each path are input, in order, into a Bi-GRU (bidirectional gated recurrent unit network) to obtain a score matrix for each path;

In step three, the specific method is as follows:

S31, the confidence score $G_i$ of each node and the confidence scores $G_{i+1}$, $G_{i-1}$ of its neighboring nodes are taken as the input of the bidirectional GRU, and the i-th forward GRU and backward GRU are computed by equations (13) and (14) respectively:

[Equations (13) and (14) are not reproduced in this text.]

where the quantities defined by equations (13) and (14) are the output of the forward GRU and the output of the backward GRU respectively.

S32, in order to retain the effective information to the maximum, concatenation, linear and normalization operations are performed on the final outputs of the forward GRU and the backward GRU to obtain the path score matrix:

[Equation (15) is not reproduced in this text.]

Step four: the L2 norm of each path's score matrix is taken as the path confidence, and the score matrix with the highest path confidence is taken as the optimal embedding matrix of the triple.

In step four, the path confidence and the optimal triple are calculated as follows:

$g(p_j) = \lVert h(p_j) \rVert_2$ (16);

when $g(f_k) = \max_j g(p_j)$, $h(f_k) = h(p_j)$ (17);

To train the proposed model, the following loss function is designed:

$L = \sum_{(h,r,t) \in \{T' \cup T''\}} \log\left[1 + \exp\left(l_{(h,r,t)} \cdot P(h,r,t)\right)\right]$ (18);

The present invention uses three benchmark datasets for knowledge graph noise detection, FB15K, WN18 and NELL995, which are constructed from information extracted from the Freebase, WordNet and NELL knowledge bases respectively. Their statistics are listed in Table 1.

TABLE 1 Statistics of the benchmark datasets FB15K, WN18 and NELL995

To evaluate the performance of the model, noise must be added to the datasets described above. The basic method is as follows: for a given positive triple (h, r, t), either the head entity or the tail entity is randomly replaced to form a negative triple (h', r, t) or (h, r, t') as noise. In this way, datasets containing 10%, 20% and 40% noise are constructed for each benchmark dataset. These noisy datasets share the same entities, relations, validation set and test set as the original dataset, and all generated noise is merged into the original training set.
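The corruption procedure can be sketched as follows. The function names and the toy graph are hypothetical. Two details are assumptions beyond the text: the noise ratio is taken relative to the size of the training set, and corrupted triples that already exist in the graph are rejected so that every noise triple is actually false.

```python
import random

def corrupt(triple, entities, rng):
    """Replace either the head or the tail with a different random entity."""
    h, r, t = triple
    if rng.random() < 0.5:
        return (rng.choice([e for e in entities if e != h]), r, t)
    return (h, r, rng.choice([e for e in entities if e != t]))

def make_noise(triples, entities, ratio, seed=0):
    """Generate round(ratio * len(triples)) distinct negative triples."""
    rng = random.Random(seed)
    known = set(triples)
    target = round(ratio * len(triples))
    noise = []
    while len(noise) < target:
        candidate = corrupt(rng.choice(triples), entities, rng)
        if candidate not in known:     # skip true or already generated triples
            known.add(candidate)
            noise.append(candidate)
    return noise

# Toy graph: a chain of six hypothetical entities linked by one relation.
entities = [f"e{i}" for i in range(6)]
train = [(f"e{i}", "rel", f"e{i + 1}") for i in range(5)]
noise = make_noise(train, entities, ratio=0.4)
noisy_train = train + noise            # noise is merged into the training set
```

Running the same function with ratios 0.1, 0.2 and 0.4 would produce the three noise levels described above, with the validation and test sets left untouched.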

The invention takes the L2 norm of the path score matrix as the path confidence, and all triples in the training set are then ranked by path confidence. The greater the path confidence of a triple, the more likely the triple is valid.
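Ranking by path confidence can be sketched as a sort. The confidence values below are invented for illustration, and treating the k lowest-ranked triples as suspected noise is an assumption about how the ranking would be used, not a procedure stated in the text.

```python
def rank_by_confidence(confidences):
    """Sort triples from highest to lowest path confidence."""
    return sorted(confidences.items(), key=lambda kv: kv[1], reverse=True)

def suspected_noise(confidences, k):
    """Return the k triples with the lowest path confidence."""
    ranked = rank_by_confidence(confidences)
    return [triple for triple, _ in ranked[-k:]] if k > 0 else []

# Hypothetical path confidences for four training triples.
confidences = {
    ("a", "r1", "b"): 0.91,
    ("b", "r2", "c"): 0.15,
    ("c", "r3", "d"): 0.78,
    ("a", "r1", "d"): 0.08,
}
ranking = rank_by_confidence(confidences)
flagged = suspected_noise(confidences, k=2)
```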

The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims

1. A knowledge graph noise detection method based on path confidence is characterized by comprising the following steps:

2. The method for detecting knowledge-graph noise based on path confidence as claimed in claim 1, wherein in the second step, the specific method is:

S21, initializing the input node T:

$T = N'_i \cdot (N'_{i+1})^{\mathsf{T}}$ (1);

$N'_i = (x'_i, r'_i, x'_{i+1})$ (2);

$N'_{i+1} = (x'_{i+1}, r'_{i+1}, x'_{i+2})$ (3);

$GTT(i, i+1) = T \cdot W_0$ (4);

where $GTT(i, i+1)$ is the global triple confidence;

$D = T_1 \cdot W_1,\; E = T_2 \cdot W_2,\; F = T_3 \cdot W_3$ (6);

$G_i = GTT(i, i+1) \cdot LTT(i, i+1)$ (12).

3. The method for detecting knowledge-graph noise based on path confidence as claimed in claim 2, wherein in step three, the specific method is:

wherein the forward GRU and the backward GRU each produce an output for every node; the final output of the forward GRU and the final output of the backward GRU are combined to give the path score matrix, where concat(·) denotes the concatenation function, line(·) the linear function, and softmax(·) the normalization function.

4. The method for knowledge-graph noise detection based on path confidence as claimed in claim 3, wherein in step four, the path confidence and the optimal triplet are calculated by:

when $g(f_k) = \max_j g(p_j)$, $h(f_k) = h(p_j)$ (17);

5. The method of knowledge-graph noise detection based on path confidence of claim 4, wherein the designed loss function is as follows: