CN112948597A

CN112948597A - Unsupervised knowledge graph entity alignment method and unsupervised knowledge graph entity alignment equipment

Info

Publication number: CN112948597A
Application number: CN202110404436.7A
Authority: CN
Inventors: 赵翔; 曾维新; 唐九阳; 李欣奕; 谭真; 谭跃进; 姜江
Original assignee: National University of Defense Technology
Current assignee: National University of Defense Technology
Priority date: 2021-01-06
Filing date: 2021-04-15
Publication date: 2021-06-11

Abstract

The invention discloses an unsupervised knowledge graph entity alignment method and equipment. The method comprises the following steps: acquiring data of two knowledge graphs; generating a text distance matrix by using auxiliary information of the entities in the knowledge graphs; generating an initial alignment result by threshold bidirectional nearest neighbor search and using it as a seed entity pair set; learning a structure distance matrix of the entities with a graph convolution network, using the seed entity pair set as labeled data; fusing the text distance matrix and the structure distance matrix of the entities to obtain a fused distance matrix; performing progressive learning to obtain newly generated aligned entity pairs, merging them into the seed entity pair set, and using the merged seed entity pair set to iteratively update the structure embedding; and repeating the previous three steps until the number of newly generated aligned entity pairs is lower than a preset value, thereby obtaining the final entity alignment result.

Description

Unsupervised knowledge graph entity alignment method and unsupervised knowledge graph entity alignment equipment

Technical Field

The invention relates to the technical field of knowledge graphs in natural language processing, in particular to an unsupervised knowledge graph entity alignment method and unsupervised knowledge graph entity alignment equipment.

Background

In recent years, a large number of knowledge graphs (KGs) have emerged, such as YAGO, DBpedia, NELL, CN-DBpedia, and Zhishi. Knowledge graphs are used in fields such as natural language processing and information retrieval. Constructing a knowledge graph inevitably involves a trade-off between coverage and accuracy, so no knowledge graph is either complete or entirely correct.

One feasible way to improve the coverage and accuracy of a knowledge graph is to introduce relevant knowledge from other knowledge graphs, because knowledge graphs constructed in different ways contain both redundant and complementary knowledge. To integrate knowledge from an external knowledge graph into a target knowledge graph, the most important step is to align the two knowledge graphs. For this reason, the Entity Alignment (EA) task has been proposed and has received wide attention. The task is to find pairs of entities in different knowledge graphs that refer to the same thing. These entity pairs then serve as hubs that link the different knowledge graphs for subsequent tasks.

Currently, state-of-the-art entity alignment solutions assume that equivalent entities typically have similar neighborhood information. They therefore generate structure embeddings of the entities in each knowledge graph using a knowledge graph embedding model (e.g., TransE) or a Graph Neural Network (GNN) model (e.g., GCN) (non-patent document: Zhichun Wang, Qingsong Lv, Xiaohan Lan, and Yu Zhang. Cross-lingual knowledge graph alignment via graph convolutional networks. In EMNLP, pages 349-). These separate embeddings are then projected into a unified embedding space, using the seed entity pairs as connections, so that entities from different knowledge graphs can be compared directly. Finally, to determine the alignment result, most current work formulates the alignment process as a ranking problem; that is, for each entity in the source knowledge graph, all entities in the target knowledge graph are ranked according to some distance metric, and the closest entity is taken as the equivalent target entity.

However, these methods have certain drawbacks: 1) Dependence on labeled data. Most methods rely on pre-aligned seed entity pairs to connect two knowledge graphs and use a unified embedding of the knowledge graph structure to align entities. However, such labeled data may not exist for real-world knowledge graphs, in which case state-of-the-art methods that rely only on structural information cannot be applied. 2) Closed-domain assumption. All current entity alignment solutions operate under a closed-domain setting; that is, they assume that each entity in the source knowledge graph has an equivalent entity in the target knowledge graph. In a real environment, however, there are always entities with no corresponding match. To our knowledge, there is currently no entity alignment method aimed at handling such unmatchable entities.

Disclosure of Invention

The present invention is directed to solving at least one of the problems of the prior art. To this end, the invention discloses an unsupervised knowledge graph entity alignment method and unsupervised knowledge graph entity alignment equipment. The method first uses threshold bidirectional nearest neighbor search (TBNNS) on the auxiliary information of the entities to generate preliminary alignment results. These preliminary matches are treated as pseudo-labeled data for connecting the two separate knowledge graphs and generating a unified knowledge graph structure embedding. The method then combines the signals from the auxiliary information and the knowledge graph structure to provide a more comprehensive alignment view.

The technical scheme of the invention is that an unsupervised knowledge graph entity alignment method comprises the following steps:

step 1, acquiring data of two knowledge graphs;

step 2, generating a text distance matrix by using auxiliary information of entities in the knowledge graphs;

step 3, generating an initial alignment result by utilizing threshold bidirectional nearest neighbor search, and using the initial alignment result as a seed entity pair set;

step 4, using the seed entity pair set as labeled data, learning the structure distance matrix of the entities by using a graph convolution network;

step 5, fusing the text distance matrix and the structure distance matrix of the entities to obtain a fused distance matrix;

step 6, performing progressive learning, adaptively adjusting the threshold, obtaining newly generated aligned entity pairs by utilizing threshold bidirectional nearest neighbor search on the fused distance matrix, merging the newly generated aligned entity pairs into the seed entity pair set, and using the merged seed entity pair set to iteratively update the structure embedding;

step 7, repeating steps 4 to 6 until the number of newly generated aligned entity pairs is lower than a preset value, and obtaining the final entity alignment result.

Specifically, the auxiliary information in step 2 is the entity name, which is considered at both the semantic level and the string level. At the semantic level, averaged word embeddings are used to capture the semantic information of the entity name; given the semantic embeddings of a source entity and a target entity, a semantic distance score is calculated by subtracting their cosine similarity score from 1, and the resulting matrix is denoted M^n. At the string level, the Levenshtein distance is used to measure the degree of difference between the two name sequences, and the resulting matrix is denoted M^l. The text distance matrix is as follows:

M^t = αM^n + (1-α)M^l

where α is a hyperparameter that balances the semantic information and the string-level difference information.
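As an illustration of step 2, the following is a minimal sketch of how the text distance matrix M^t might be assembled. It is not the patented implementation: the helper names (`levenshtein`, `name_embedding`, `text_distance_matrix`), the toy `word_vectors` dictionary, and the default `alpha=0.5` are all hypothetical. The patent also does not state how the Levenshtein distance is scaled; dividing by the longer name length so that M^l lies in [0, 1] is an assumption made here so the two terms are comparable.

```python
import numpy as np

def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def name_embedding(name, word_vectors, dim):
    """Average the vectors of the known words in an entity name."""
    vecs = [word_vectors[w] for w in name.lower().split() if w in word_vectors]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def cosine_distance_matrix(X, Y):
    """1 - cosine similarity for every (row of X, row of Y) pair."""
    Xn = X / np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1e-12)
    Yn = Y / np.maximum(np.linalg.norm(Y, axis=1, keepdims=True), 1e-12)
    return 1.0 - Xn @ Yn.T

def text_distance_matrix(src_names, tgt_names, word_vectors, dim, alpha=0.5):
    """M^t = alpha * M^n + (1 - alpha) * M^l."""
    X = np.stack([name_embedding(n, word_vectors, dim) for n in src_names])
    Y = np.stack([name_embedding(n, word_vectors, dim) for n in tgt_names])
    M_n = cosine_distance_matrix(X, Y)  # semantic level
    # String level; length normalization is an assumption, not from the patent.
    M_l = np.array([[levenshtein(s, t) / max(len(s), len(t), 1)
                     for t in tgt_names] for s in src_names])
    return alpha * M_n + (1.0 - alpha) * M_l
```

Rows of the returned matrix index source entities and columns index target entities, which matches the M(u, v) convention used for the nearest neighbor search.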

Specifically, the threshold bidirectional nearest neighbor search in step 3 is as follows: given a source entity u and a target entity v, (u, v) is considered an aligned entity pair if u and v are each other's nearest neighbors and the distance between them is below a given threshold θ, that is, M(u, v) < θ, where M(u, v) denotes the element in the u-th row and v-th column of the distance matrix M. By setting the threshold, the threshold bidirectional nearest neighbor search effectively filters out inaccurate entity pairs and thus largely prevents noise from affecting the subsequent process; in addition, the strategy is less restrictive than other methods, so a sufficient number of aligned entity pairs can still be obtained.
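The search rule described above can be sketched in a few lines of NumPy. This is an illustrative reading of the rule rather than the patent's own code; the function name `tbnns` and the toy matrix are hypothetical.

```python
import numpy as np

def tbnns(M, theta):
    """Threshold bidirectional nearest neighbor search.

    Returns the pairs (u, v) such that v is the nearest target of u,
    u is the nearest source of v, and M[u, v] < theta.
    """
    M = np.asarray(M, dtype=float)
    if M.size == 0:
        return []
    nearest_tgt = M.argmin(axis=1)  # closest target for each source row
    nearest_src = M.argmin(axis=0)  # closest source for each target column
    pairs = []
    for u, v in enumerate(nearest_tgt):
        if nearest_src[v] == u and M[u, v] < theta:
            pairs.append((u, int(v)))
    return pairs

# Toy distance matrix: source 2 has no mutual nearest neighbor.
M = np.array([[0.10, 0.90, 0.80],
              [0.70, 0.20, 0.90],
              [0.60, 0.65, 0.70]])
pairs = tbnns(M, theta=0.5)  # [(0, 0), (1, 1)]
```

In the toy matrix, source entity 2 is closest to target 0, but target 0 is closest to source 0, so entity 2 is left unmatched. This is how the rule keeps entities without a counterpart out of the result instead of forcing a match.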

Specifically, in step 4, a graph convolution network is adopted to capture the neighborhood information of the entities and obtain the structure embedding matrix Z of the entities. The structure distance score between a source entity and a target entity is calculated by subtracting the cosine similarity score of their embeddings from 1, which yields the structure distance matrix, denoted M^s. The text distance matrix and the structure distance matrix of the entities are fused in step 5 as follows:

M = βM^t + (1-β)M^s

where β is a hyperparameter that balances the two distance matrices.
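A minimal sketch of steps 4 and 5 at the matrix level, assuming the structure embeddings have already been learned: it derives M^s from embedding rows and applies the fusion formula. The function names and the default `beta=0.5` are hypothetical choices for illustration.

```python
import numpy as np

def structure_distance_matrix(Z_src, Z_tgt):
    """M^s: 1 - cosine similarity between source and target embeddings."""
    Zs = Z_src / np.maximum(np.linalg.norm(Z_src, axis=1, keepdims=True), 1e-12)
    Zt = Z_tgt / np.maximum(np.linalg.norm(Z_tgt, axis=1, keepdims=True), 1e-12)
    return 1.0 - Zs @ Zt.T

def fuse(M_t, M_s, beta=0.5):
    """M = beta * M^t + (1 - beta) * M^s."""
    M_t = np.asarray(M_t, dtype=float)
    M_s = np.asarray(M_s, dtype=float)
    if M_t.shape != M_s.shape:
        raise ValueError("text and structure distance matrices must have the same shape")
    return beta * M_t + (1.0 - beta) * M_s
```

With β close to 1 the fused matrix is dominated by the entity names; with β close to 0 it is dominated by the graph structure.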

Further, the adaptive threshold adjustment method in the progressive learning process in step 6 is to set the threshold in the threshold bidirectional nearest neighbor search to a small value θ₀ at the beginning and to increase the threshold by η in each subsequent round. By dynamically adjusting the threshold, progressive learning ensures the precision of the aligned entity pairs generated in the initial learning stage and reduces their negative influence on the subsequent learning process; in the middle and later stages of learning, the relaxed threshold admits more aligned entity pairs, which improves the overall entity alignment result.

The invention also discloses an unsupervised knowledge graph entity alignment device, which comprises:

a processor;

and a memory for storing executable instructions of the processor;

wherein the processor is configured to perform the entity alignment method described above via execution of the executable instructions.

Compared with the prior art, the method has the following advantages: (1) an unsupervised entity alignment framework is provided, which can align entities without labeled data; (2) the auxiliary information of the entities and the structural information of the knowledge graph are merged to provide a more comprehensive alignment view; (3) by utilizing threshold bidirectional nearest neighbor search, unmatchable entities can be effectively filtered out and excluded from the alignment result; (4) a progressive learning algorithm is employed that iteratively updates the structure embedding using the alignment results of the previous round, which yields not only better knowledge graph embeddings but also more accurate alignment.

Drawings

FIG. 1 shows a schematic flow diagram of an embodiment of the invention;

FIG. 2 shows an algorithm pseudo-code diagram of an embodiment of the invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention clearer, the present invention will be described in further detail with reference to the accompanying drawings, and it is apparent that the described embodiments are only a part of the embodiments of the present invention, not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

Example one

Fig. 1 shows a schematic flow chart of a first embodiment of the present invention. An unsupervised knowledge-graph entity alignment method, comprising the steps of:

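step 1, acquiring data of two knowledge graphs;

step 2, generating a text distance matrix by using auxiliary information of entities in the knowledge graphs;

step 3, generating an initial alignment result by utilizing threshold bidirectional nearest neighbor search, and using the initial alignment result as a seed entity pair set;

step 4, using the seed entity pair set as labeled data, learning the structure distance matrix of the entities by using a graph convolution network;

step 5, fusing the text distance matrix and the structure distance matrix of the entities to obtain a fused distance matrix;

step 6, performing progressive learning, adaptively adjusting the threshold, obtaining newly generated aligned entity pairs by utilizing threshold bidirectional nearest neighbor search on the fused distance matrix, merging the newly generated aligned entity pairs into the seed entity pair set, and using the merged seed entity pair set to iteratively update the structure embedding;

step 7, repeating steps 4 to 6 until the number of newly generated aligned entity pairs is lower than a preset value, and obtaining the final entity alignment result.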

The auxiliary information in step 2 is the entity name, which is considered at both the semantic level and the string level. At the semantic level, averaged word embeddings are used to capture the semantic information of the entity name; given the semantic embeddings of a source entity and a target entity, a semantic distance score is calculated by subtracting their cosine similarity score from 1, and the resulting matrix is denoted M^n. At the string level, the Levenshtein distance is used to measure the difference between the two name sequences, and the resulting matrix is denoted M^l. The text distance matrix is as follows:

M^t = αM^n + (1-α)M^l

where α is a hyperparameter that balances the semantic information and the string-level difference information.

The threshold bidirectional nearest neighbor search in step 3 is as follows: given a source entity u and a target entity v, (u, v) is considered an aligned entity pair if u and v are each other's nearest neighbors and the distance between them is below a given threshold θ, that is, M(u, v) < θ, where M(u, v) denotes the element in the u-th row and v-th column of the distance matrix M.

In step 4, a graph convolution network is adopted to capture the neighborhood information of the entities and obtain the structure embedding matrix Z of the entities. The structure distance score between a source entity and a target entity is calculated by subtracting the cosine similarity score of their embeddings from 1, which yields the structure distance matrix, denoted M^s.
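The patent does not specify the network depth, the input features, or the training objective of the graph convolution network. As a rough illustration of how such a network aggregates neighborhood information into an embedding matrix Z, the sketch below implements the standard GCN propagation rule (self-loops plus symmetric degree normalization). The function names are hypothetical, and the weight matrices are passed in as placeholders; in the method they would be trained with the seed entity pairs as labeled data.

```python
import numpy as np

def normalized_adjacency(A):
    """D^(-1/2) (A + I) D^(-1/2) for a non-negative adjacency matrix A."""
    A_hat = np.asarray(A, dtype=float) + np.eye(len(A))  # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))        # degrees are >= 1
    return A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def gcn_forward(A, X, weights):
    """Stacked GCN layers: H <- ReLU(A_norm @ H @ W), no ReLU on the last layer.

    A: (n, n) adjacency matrix, X: (n, d) input features,
    weights: list of weight matrices (untrained placeholders in this sketch).
    Returns the structure embedding matrix Z with one row per entity.
    """
    A_norm = normalized_adjacency(A)
    H = np.asarray(X, dtype=float)
    for i, W in enumerate(weights):
        H = A_norm @ H @ W
        if i < len(weights) - 1:
            H = np.maximum(H, 0.0)  # ReLU between layers
    return H
```

Each layer mixes an entity's representation with those of its direct neighbors, so two layers reach two-hop neighborhoods.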

The text distance matrix and the structure distance matrix of the entities are fused in step 5 as follows:

M = βM^t + (1-β)M^s

where β is a hyperparameter that balances the two distance matrices.

In step 6, the adaptive threshold adjustment method in the progressive learning process is to set the threshold in the threshold bidirectional nearest neighbor search to a small value θ₀ at the beginning and to increase the threshold by η in each subsequent round.

The specific algorithm flow is shown in fig. 2, and comprises the following steps:

first, a text distance matrix is generated using the auxiliary information (line 1);

the text distance matrix is then used to generate an initial alignment result by threshold bidirectional nearest neighbor search (TBNNS) (line 2);

the entities in the alignment result are excluded from the entity sets (line 3);

the alignment result is used as labeled data to generate the structure embedding, and the distance matrices are fused (line 5);

the fused distance matrix is used to generate more aligned entity pairs from the remaining entities (line 6);

the newly generated entity pairs are added to the alignment result, and the entities in the alignment result are removed from the entity sets (line 7);

finally, to progressively improve the quality of the knowledge graph embedding and detect more alignment results, lines 5-7 are executed repeatedly until the number of newly generated entity pairs is below a given threshold γ.
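The flow above can be sketched as a single loop. This is an interpretation of the described steps, not a transcription of the pseudo-code in fig. 2. The callback `learn_structure_distance` is a hypothetical stand-in for training the graph convolution network on the current seed pairs and returning M^s, `max_rounds` is a safety cap that the patent does not mention, and all default parameter values are illustrative only.

```python
import numpy as np

def tbnns(M, theta):
    """Mutual nearest neighbors (u, v) with M[u, v] < theta."""
    if M.size == 0:
        return []
    nearest_tgt, nearest_src = M.argmin(axis=1), M.argmin(axis=0)
    return [(u, int(v)) for u, v in enumerate(nearest_tgt)
            if nearest_src[v] == u and M[u, v] < theta]

def progressive_alignment(M_t, learn_structure_distance, theta0=0.05,
                          eta=0.1, beta=0.5, gamma=1, max_rounds=50):
    """Iteratively grow the set of aligned pairs.

    M_t: text distance matrix (sources x targets).
    learn_structure_distance: hypothetical callback, seed pairs -> M^s.
    """
    free_src = list(range(M_t.shape[0]))  # entities not yet aligned
    free_tgt = list(range(M_t.shape[1]))

    def search(M, theta):
        # Run TBNNS on the remaining entities, then map back to global ids.
        sub = M[np.ix_(free_src, free_tgt)]
        return [(free_src[i], free_tgt[j]) for i, j in tbnns(sub, theta)]

    def exclude(pairs):
        done_src = {u for u, _ in pairs}
        done_tgt = {v for _, v in pairs}
        free_src[:] = [u for u in free_src if u not in done_src]
        free_tgt[:] = [v for v in free_tgt if v not in done_tgt]

    theta = theta0
    aligned = search(M_t, theta)  # initial seeds from text distances only
    exclude(aligned)
    for _ in range(max_rounds):
        M_s = learn_structure_distance(aligned)   # structure distances from seeds
        M = beta * M_t + (1.0 - beta) * M_s       # fused distance matrix
        theta += eta                              # relax the threshold each round
        new_pairs = search(M, theta)
        aligned.extend(new_pairs)
        exclude(new_pairs)
        if len(new_pairs) < gamma:                # too few new pairs: stop
            break
    return aligned
```

Because the search only considers entities that are still unaligned, each entity appears in at most one pair, and entities that never fall below the threshold are simply absent from the result.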

Example two

An unsupervised knowledge-graph entity alignment apparatus, comprising:

a processor;

and a memory for storing executable instructions of the processor;

wherein the processor is configured to perform the entity alignment method of embodiment one via execution of the executable instructions.

As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

Claims

1. An unsupervised knowledge-graph entity alignment method, comprising the steps of:

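step 1, acquiring data of two knowledge graphs;

step 2, generating a text distance matrix by using auxiliary information of entities in the knowledge graphs;

step 3, generating an initial alignment result by utilizing threshold bidirectional nearest neighbor search, and using the initial alignment result as a seed entity pair set;

step 4, using the seed entity pair set as labeled data, learning the structure distance matrix of the entities by using a graph convolution network;

step 5, fusing the text distance matrix and the structure distance matrix of the entities to obtain a fused distance matrix;

step 6, performing progressive learning, adaptively adjusting the threshold, obtaining newly generated aligned entity pairs by utilizing threshold bidirectional nearest neighbor search on the fused distance matrix, merging the newly generated aligned entity pairs into the seed entity pair set, and using the merged seed entity pair set to iteratively update the structure embedding;

step 7, repeating steps 4 to 6 until the number of newly generated aligned entity pairs is lower than a preset value, and obtaining the final entity alignment result.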

2. The unsupervised knowledge-graph entity alignment method of claim 1, wherein the auxiliary information in step 2 is the entity name, which is considered at the semantic level and at the string level; at the semantic level, averaged word embeddings are used to capture the semantic information of the entity name, and, given the semantic embeddings of a source entity and a target entity, a semantic distance score is calculated by subtracting their cosine similarity score from 1, the resulting matrix being denoted M^n; at the string level, the Levenshtein distance is used to measure the difference between the two name sequences, the resulting matrix being denoted M^l; and the text distance matrix is as follows:

M^t = αM^n + (1-α)M^l

where α is a hyperparameter that balances the semantic information and the string-level difference information.

3. The unsupervised knowledge-graph entity alignment method of claim 1, wherein the threshold bidirectional nearest neighbor search in step 3 is: given a source entity u and a target entity v, (u, v) is considered an aligned entity pair if u and v are each other's nearest neighbors and the distance between them is below a given threshold θ, that is, M(u, v) < θ, where M(u, v) denotes the element in the u-th row and v-th column of the distance matrix M.

4. The unsupervised knowledge-graph entity alignment method of claim 2 or 3, wherein in step 4 a graph convolution network is used to capture the neighborhood information of the entities and obtain the structure embedding matrix Z of the entities, and the structure distance score between a source entity and a target entity is calculated by subtracting the cosine similarity score of their embeddings from 1, thereby obtaining the structure distance matrix, denoted M^s; and the text distance matrix and the structure distance matrix of the entities are fused in step 5 as follows:

M = βM^t + (1-β)M^s

where β is a hyperparameter that balances the two distance matrices.

5. The unsupervised knowledge-graph entity alignment method of claim 4, wherein the adaptive threshold adjustment method in the progressive learning process in step 6 is to initially set the threshold in the threshold bidirectional nearest neighbor search to a small value θ₀ and to increase the threshold by η in each subsequent round.

6. An unsupervised knowledge-graph entity alignment apparatus, comprising:

a processor;

and a memory for storing executable instructions of the processor;

wherein the processor is configured to perform, via execution of the executable instructions, the entity alignment method of any one of claims 1 to 5.