CN112948597A - Unsupervised knowledge graph entity alignment method and unsupervised knowledge graph entity alignment equipment

Info

Publication number
CN112948597A
Authority
CN
China
Prior art keywords
entity
alignment
distance matrix
knowledge
graph
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110404436.7A
Other languages
Chinese (zh)
Inventor
赵翔
曾维新
唐九阳
李欣奕
谭真
谭跃进
姜江
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2021-01-06
Filing date
2021-04-15
Publication date
2021-06-11
Application filed by National University of Defense Technology
Publication of CN112948597A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/38 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/383 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/194 Calculation of difference between files
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/088 Non-supervised learning, e.g. competitive learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Library & Information Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Animal Behavior & Ethology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an unsupervised knowledge graph entity alignment method and equipment. The method comprises the following steps: acquiring data of two knowledge graphs; generating a text distance matrix from auxiliary information of the entities in the knowledge graphs; generating an initial alignment result by threshold bidirectional nearest neighbor search and using it as a seed entity pair set; using the seed entity pair set as labeled data, learning a structure distance matrix of the entities with a graph convolution network; fusing the text distance matrix and the structure distance matrix of the entities to obtain a fused distance matrix; performing progressive learning to obtain newly generated aligned entity pairs, merging them into the seed entity pair set, and using the merged set to iteratively update the structure embedding; and repeating the previous three steps until the number of newly generated aligned entity pairs falls below a preset value, thereby obtaining the final entity alignment result.

Description

Unsupervised knowledge graph entity alignment method and unsupervised knowledge graph entity alignment equipment
Technical Field
The invention relates to the technical field of knowledge graphs in natural language processing, in particular to an unsupervised knowledge graph entity alignment method and unsupervised knowledge graph entity alignment equipment.
Background
In recent years, a large number of knowledge graphs (KGs) have emerged, such as YAGO, DBpedia, NELL, CN-DBpedia, and zhishi. Knowledge graphs have been used in fields such as natural language processing and information retrieval. Constructing a knowledge graph inevitably involves a trade-off between coverage and accuracy, so no knowledge graph is complete or entirely correct.
One way to improve the coverage and accuracy of a knowledge graph is to introduce relevant knowledge from other knowledge graphs, because knowledge graphs constructed in different ways contain both redundant and complementary knowledge. To integrate knowledge from an external knowledge graph into a target knowledge graph, the most important step is to align the two knowledge graphs. For this reason, the Entity Alignment (EA) task was proposed and has received wide attention. The task is to find pairs of entities in different knowledge graphs that express the same meaning. These entity pairs serve as hubs that link different knowledge graphs and support subsequent tasks.
Currently, state-of-the-art entity alignment solutions assume that equivalent entities typically have similar neighborhood information. They therefore generate a structure embedding for the entities in each knowledge graph using a knowledge graph embedding model (e.g., TransE) or a graph neural network (GNN) model (e.g., GCN) (non-patent document: Zhichun Wang, Qingsong Lv, Xiaohan Lan, and Yu Zhang. Cross-lingual knowledge graph alignment via graph convolutional networks. In EMNLP, pages 349-). These separate embeddings are then projected into a unified embedding space, using seed entity pairs as connections, so that entities from different knowledge graphs can be compared directly. Finally, most current work formalizes the comparison as a ranking problem; that is, for each entity in the source knowledge graph, all entities in the target knowledge graph are ranked by some distance metric, and the closest one is taken as the equivalent target entity.
However, these methods have drawbacks. 1) They depend on labeled data. Most methods rely on pre-aligned seed entity pairs to connect two knowledge graphs and align entities using a unified embedding of the knowledge graph structure. However, such labeled data may not exist for real-world knowledge graphs, in which case methods that rely only on structural information cannot be applied. 2) All current entity alignment solutions operate under a closed-domain setting; that is, they assume that every entity in the source knowledge graph has an equivalent entity in the target knowledge graph. In a real environment, however, there are always entities with no corresponding match. To our knowledge, there is currently no entity alignment method aimed at handling such unmatchable entities.
Disclosure of Invention
The present invention aims to solve at least one of the problems of the prior art. To this end, the invention discloses an unsupervised knowledge graph entity alignment method and unsupervised knowledge graph entity alignment equipment. The method first applies threshold bidirectional nearest neighbor search (TBNNS) to the auxiliary information of the entities to generate preliminary alignment results. These preliminary matches are treated as pseudo-labeled data for connecting the two separate knowledge graphs and generating a unified knowledge graph structure embedding. The method then combines the signals from the auxiliary information and the knowledge graph structure to provide a more comprehensive alignment view.
The technical scheme of the invention is that an unsupervised knowledge graph entity alignment method comprises the following steps:
step 1, acquiring data of two knowledge graphs;
step 2, generating a text distance matrix by using auxiliary information of the entities in the knowledge graphs;
step 3, generating an initial alignment result by threshold bidirectional nearest neighbor search, and using the initial alignment result as a seed entity pair set;
step 4, using the seed entity pair set as labeled data, learning a structure distance matrix of the entities with a graph convolution network;
step 5, fusing the text distance matrix and the structure distance matrix of the entities to obtain a fused distance matrix;
step 6, performing progressive learning: adaptively adjusting the threshold, obtaining newly generated aligned entity pairs by threshold bidirectional nearest neighbor search on the fused distance matrix, merging the newly generated aligned entity pairs into the seed entity pair set, and using the merged seed entity pair set to iteratively update the structure embedding;
step 7, repeating steps 4 to 6 until the number of newly generated aligned entity pairs is lower than a preset value, and obtaining the final entity alignment result.
Specifically, the auxiliary information in step 2 is the entity name, which is considered at the semantic level and at the string level. At the semantic level, averaged word embeddings are used to capture the semantic information of an entity name; given the semantic embeddings of a source entity and a target entity, the semantic distance score is calculated by subtracting their cosine similarity from 1, and the resulting matrix is denoted M_n. At the string level, the Levenshtein distance is used to measure the difference between two character sequences, and the resulting matrix is denoted M_l. The text distance matrix is M_t = αM_n + (1 - α)M_l, where α is a hyper-parameter that balances the semantic information and the string-difference information.
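The text distance matrix of step 2 can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names, the toy `word_vectors` lookup, and the normalization of the Levenshtein distance by the longer name length (so that it is on a scale comparable to the cosine distance) are all assumptions that the text above does not specify.

```python
import numpy as np

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def name_embedding(name, word_vectors, dim):
    """Average the vectors of the words in a name; unknown words are skipped."""
    vecs = [word_vectors[w] for w in name.lower().split() if w in word_vectors]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def cosine_distance_matrix(A, B, eps=1e-12):
    """Pairwise 1 - cosine similarity between the rows of A and the rows of B."""
    A = A / np.maximum(np.linalg.norm(A, axis=1, keepdims=True), eps)
    B = B / np.maximum(np.linalg.norm(B, axis=1, keepdims=True), eps)
    return 1.0 - A @ B.T

def text_distance_matrix(src_names, tgt_names, word_vectors, dim, alpha=0.5):
    """M_t = alpha * M_n + (1 - alpha) * M_l."""
    E_src = np.stack([name_embedding(n, word_vectors, dim) for n in src_names])
    E_tgt = np.stack([name_embedding(n, word_vectors, dim) for n in tgt_names])
    M_n = cosine_distance_matrix(E_src, E_tgt)  # semantic level
    # String level; dividing by the longer length is an assumption of this sketch.
    M_l = np.array([[levenshtein(s, t) / max(len(s), len(t), 1)
                     for t in tgt_names] for s in src_names])
    return alpha * M_n + (1 - alpha) * M_l
```

Rows of the returned matrix index source entities and columns index target entities, which matches the M(u, v) convention used by the search in step 3.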
Specifically, the threshold bidirectional nearest neighbor search in step 3 is as follows: given a source entity u and a target entity v, (u, v) is considered an aligned entity pair if u and v are each other's nearest neighbors and the distance between them is below a given threshold θ, i.e., M(u, v) < θ, where M(u, v) denotes the element in the u-th row and v-th column of the distance matrix M. By setting the threshold, the search effectively filters out inaccurate entity pairs and thus largely prevents noise from affecting the subsequent process; in addition, the strategy is less restrictive than other methods, so that a sufficient number of aligned entity pairs can still be obtained.
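The search described above can be sketched in a few lines of NumPy. The function name `tbnns` and the choice of returning index pairs are illustrative assumptions; the matching rule itself (mutual nearest neighbors plus a distance below θ) follows the definition in the text.

```python
import numpy as np

def tbnns(M, theta):
    """Return pairs (u, v) that are mutual nearest neighbours with M[u, v] < theta."""
    M = np.asarray(M, dtype=float)
    if M.size == 0:
        return []
    nearest_tgt = M.argmin(axis=1)  # nearest target column for each source row
    nearest_src = M.argmin(axis=0)  # nearest source row for each target column
    pairs = []
    for u, v in enumerate(nearest_tgt):
        if nearest_src[v] == u and M[u, v] < theta:
            pairs.append((u, int(v)))
    return pairs

# Toy distance matrix (values are made up): source 2 has no mutual neighbour.
M = np.array([[0.10, 0.90, 0.80],
              [0.70, 0.20, 0.60],
              [0.30, 0.95, 0.90]])
print(tbnns(M, theta=0.15))  # [(0, 0)]
print(tbnns(M, theta=0.50))  # [(0, 0), (1, 1)]
```

In the toy matrix, source entity 2 is closest to target 0, but target 0 is closest to source 0, so entity 2 is left unmatched at any threshold. This is how the search can exclude entities that have no counterpart.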
Specifically, in step 4, a graph convolution network is used to capture the neighborhood information of the entities and obtain the structure embedding matrix Z of the entities. The structure distance score between a source entity and a target entity is calculated by subtracting the cosine similarity of their embeddings from 1, giving the structure distance matrix, denoted M_s. In step 5, the text distance matrix and the structure distance matrix of the entities are fused as follows:
M = βM_t + (1 - β)M_s
where β is a hyper-parameter that balances the two distance matrices.
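Steps 4 and 5 can be illustrated with the sketch below. It assumes the widely used graph-convolution propagation rule ReLU(D^{-1/2}(A + I)D^{-1/2} H W); the text above does not state which GCN variant or training objective is used, so the layer form, the function names, and the untrained weight matrix `W` are assumptions. Only the cosine-distance and fusion formulas come directly from the description.

```python
import numpy as np

def normalized_adjacency(A):
    """Symmetric normalization with self-loops: D^{-1/2} (A + I) D^{-1/2}."""
    A_hat = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    return A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def gcn_layer(A_norm, H, W):
    """One graph-convolution layer (forward pass only): ReLU(A_norm @ H @ W)."""
    return np.maximum(A_norm @ H @ W, 0.0)

def cosine_distance_matrix(Z_src, Z_tgt, eps=1e-12):
    """Structure distance M_s: 1 - cosine similarity of the entity embeddings."""
    Z_src = Z_src / np.maximum(np.linalg.norm(Z_src, axis=1, keepdims=True), eps)
    Z_tgt = Z_tgt / np.maximum(np.linalg.norm(Z_tgt, axis=1, keepdims=True), eps)
    return 1.0 - Z_src @ Z_tgt.T

def fuse(M_t, M_s, beta=0.5):
    """Fused distance matrix M = beta * M_t + (1 - beta) * M_s."""
    return beta * M_t + (1.0 - beta) * M_s
```

In a full system the weights would be learned so that the embeddings of seed entity pairs end up close together; that training step is not shown here.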
Further, the adaptive threshold adjustment in the progressive learning of step 6 sets the threshold of the threshold bidirectional nearest neighbor search to a small value θ_0 at the beginning and increases it by η in each subsequent round. By dynamically adjusting the threshold, progressive learning ensures the precision of the aligned entity pairs generated in the early stage of learning and reduces their negative influence on the subsequent learning process; in the middle and later stages, the relaxed threshold admits more aligned entity pairs, which improves the entity alignment result.
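As a worked illustration of this schedule, the threshold after k rounds can be written as below. Counting rounds from k = 0 is a notational assumption, and the numeric values are hypothetical, chosen only to show the arithmetic.

```latex
\theta_k = \theta_0 + k\,\eta, \qquad k = 0, 1, 2, \dots
% Example with hypothetical values \theta_0 = 0.05 and \eta = 0.1:
% \theta_0 = 0.05, \quad \theta_1 = 0.15, \quad \theta_2 = 0.25
```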
The invention also discloses an unsupervised knowledge graph entity alignment device, which comprises:
a processor;
and a memory for storing executable instructions of the processor;
wherein the processor is configured to perform the entity alignment method described above via execution of the executable instructions.
Compared with the prior art, the invention has the following advantages: (1) it provides an unsupervised entity alignment framework, which solves entity matching without labeled data; (2) it merges the auxiliary information of the entities with the structural information of the knowledge graph to provide a more comprehensive alignment view; (3) by using threshold bidirectional nearest neighbor search, unmatchable entities can be effectively filtered out and excluded from the alignment result; (4) it employs a progressive learning algorithm that iteratively updates the structure embedding using the alignment results of the previous round, which yields not only better knowledge graph embeddings but also more accurate matching and alignment.
Drawings
FIG. 1 shows a schematic flow diagram of an embodiment of the invention;
FIG. 2 shows an algorithm pseudo-code diagram of an embodiment of the invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the present invention will be described in further detail with reference to the accompanying drawings, and it is apparent that the described embodiments are only a part of the embodiments of the present invention, not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Example one
Fig. 1 shows a schematic flow chart of a first embodiment of the present invention. An unsupervised knowledge-graph entity alignment method, comprising the steps of:
step 1, acquiring data of two knowledge graphs;
step 2, generating a text distance matrix by using auxiliary information of the entities in the knowledge graphs;
step 3, generating an initial alignment result by threshold bidirectional nearest neighbor search, and using the initial alignment result as a seed entity pair set;
step 4, using the seed entity pair set as labeled data, learning a structure distance matrix of the entities with a graph convolution network;
step 5, fusing the text distance matrix and the structure distance matrix of the entities to obtain a fused distance matrix;
step 6, performing progressive learning: adaptively adjusting the threshold, obtaining newly generated aligned entity pairs by threshold bidirectional nearest neighbor search on the fused distance matrix, merging the newly generated aligned entity pairs into the seed entity pair set, and using the merged seed entity pair set to iteratively update the structure embedding;
step 7, repeating steps 4 to 6 until the number of newly generated aligned entity pairs is lower than a preset value, and obtaining the final entity alignment result.
The auxiliary information in step 2 is the entity name, which is considered at the semantic level and at the string level. At the semantic level, averaged word embeddings are used to capture the semantic information of an entity name; given the semantic embeddings of a source entity and a target entity, the semantic distance score is calculated by subtracting their cosine similarity from 1, and the resulting matrix is denoted M_n. At the string level, the Levenshtein distance is used to measure the difference between two character sequences, and the resulting matrix is denoted M_l. The text distance matrix is M_t = αM_n + (1 - α)M_l, where α is a hyper-parameter that balances the semantic information and the string-difference information.
The threshold bidirectional nearest neighbor search in step 3 is as follows: given a source entity u and a target entity v, (u, v) is considered an aligned entity pair if u and v are each other's nearest neighbors and the distance between them is below a given threshold θ, i.e., M(u, v) < θ, where M(u, v) denotes the element in the u-th row and v-th column of the distance matrix M.
In step 4, a graph convolution network is used to capture the neighborhood information of the entities and obtain the structure embedding matrix Z of the entities. The structure distance score between a source entity and a target entity is calculated by subtracting the cosine similarity of their embeddings from 1, giving the structure distance matrix, denoted M_s.
In step 5, the text distance matrix and the structure distance matrix of the entities are fused as follows:
M = βM_t + (1 - β)M_s
where β is a hyper-parameter that balances the two distance matrices.
In step 6, the adaptive threshold adjustment in the progressive learning sets the threshold of the threshold bidirectional nearest neighbor search to a small value θ_0 at the beginning and increases it by η in each subsequent round.
The specific algorithm flow is shown in fig. 2, and comprises the following steps:
first, a text distance matrix is generated from the auxiliary information (line 1);
the text distance matrix is then used to generate an initial alignment result by threshold bidirectional nearest neighbor search (TBNNS) (line 2);
the entities in the alignment result are removed from the entity sets (line 3);
the alignment result is used as labeled data to generate the structure embedding, and the text and structure distance matrices are fused (line 5);
the fused distance matrix is used to generate more aligned entity pairs from the remaining entities (line 6);
the newly generated entity pairs are added to the alignment result, and the entities in the alignment result are removed from the entity sets (line 7);
finally, to progressively improve the quality of the knowledge graph embedding and detect more alignment results, lines 5-7 are executed repeatedly until the number of newly generated entity pairs falls below a given threshold γ.
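The flow above can be sketched as a loop. This is an illustration under stated assumptions rather than the pseudo-code of Fig. 2: `structure_distance` is a hypothetical callback standing in for the GCN-based learning of M_s from the current seed pairs, `max_rounds` is a safety cap added here, and raising the threshold by η at the start of every progressive round is one possible reading of the schedule.

```python
import numpy as np

def tbnns(M, theta, src_ids, tgt_ids):
    """TBNNS restricted to the still-unaligned source rows and target columns."""
    if not src_ids or not tgt_ids:
        return []
    sub = M[np.ix_(src_ids, tgt_ids)]
    nn_tgt = sub.argmin(axis=1)
    nn_src = sub.argmin(axis=0)
    return [(src_ids[i], tgt_ids[int(j)]) for i, j in enumerate(nn_tgt)
            if nn_src[j] == i and sub[i, j] < theta]

def progressive_alignment(M_t, structure_distance, theta0, eta, beta, gamma,
                          max_rounds=50):
    src = list(range(M_t.shape[0]))
    tgt = list(range(M_t.shape[1]))
    aligned = tbnns(M_t, theta0, src, tgt)            # line 2: initial seeds
    theta = theta0
    for _round in range(max_rounds):
        used_src = {u for u, v in aligned}
        used_tgt = {v for u, v in aligned}
        src = [u for u in src if u not in used_src]   # lines 3 and 7
        tgt = [v for v in tgt if v not in used_tgt]
        M_s = structure_distance(aligned)             # line 5: hypothetical callback
        M = beta * M_t + (1 - beta) * M_s             # line 5: fusion
        theta += eta                                  # relax the threshold
        new_pairs = tbnns(M, theta, src, tgt)         # line 6
        aligned.extend(new_pairs)                     # line 7
        if len(new_pairs) < gamma:                    # stopping rule
            break
    return aligned
```

Because already aligned entities are removed before each search, every entity appears in at most one pair, and entities that never pass the test stay unaligned in the final result.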
Example two
An unsupervised knowledge-graph entity alignment apparatus, comprising:
a processor;
and a memory for storing executable instructions of the processor;
wherein the processor is configured to perform the entity alignment method of embodiment one via execution of the executable instructions.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

Claims (6)

1. An unsupervised knowledge-graph entity alignment method, comprising the steps of:
step 1, acquiring data of two knowledge graphs;
step 2, generating a text distance matrix by using auxiliary information of the entities in the knowledge graphs;
step 3, generating an initial alignment result by threshold bidirectional nearest neighbor search, and using the initial alignment result as a seed entity pair set;
step 4, using the seed entity pair set as labeled data, learning a structure distance matrix of the entities with a graph convolution network;
step 5, fusing the text distance matrix and the structure distance matrix of the entities to obtain a fused distance matrix;
step 6, performing progressive learning: adaptively adjusting the threshold, obtaining newly generated aligned entity pairs by threshold bidirectional nearest neighbor search on the fused distance matrix, merging the newly generated aligned entity pairs into the seed entity pair set, and using the merged seed entity pair set to iteratively update the structure embedding;
step 7, repeating steps 4 to 6 until the number of newly generated aligned entity pairs is lower than a preset value, and obtaining the final entity alignment result.
2. The unsupervised knowledge-graph entity alignment method of claim 1, wherein the auxiliary information in step 2 is the entity name, which is considered at the semantic level and at the string level; at the semantic level, averaged word embeddings are used to capture the semantic information of an entity name, and given the semantic embeddings of a source entity and a target entity, the semantic distance score is calculated by subtracting their cosine similarity from 1, the resulting matrix being denoted M_n; at the string level, the Levenshtein distance is used to measure the difference between two character sequences, the resulting matrix being denoted M_l; and the text distance matrix is M_t = αM_n + (1 - α)M_l, where α is a hyper-parameter that balances the semantic information and the string-difference information.
3. The unsupervised knowledge-graph entity alignment method of claim 1, wherein the threshold bidirectional nearest neighbor search in step 3 is: given a source entity u and a target entity v, (u, v) is considered an aligned entity pair if u and v are each other's nearest neighbors and the distance between them is below a given threshold θ, i.e., M(u, v) < θ, where M(u, v) denotes the element in the u-th row and v-th column of the distance matrix M.
4. The unsupervised knowledge-graph entity alignment method of claim 2 or 3, wherein in step 4 a graph convolution network is used to capture the neighborhood information of the entities and obtain the structure embedding matrix Z of the entities, and the structure distance score between a source entity and a target entity is calculated by subtracting the cosine similarity of their embeddings from 1, giving the structure distance matrix, denoted M_s; and wherein in step 5 the text distance matrix and the structure distance matrix of the entities are fused as follows:
M = βM_t + (1 - β)M_s
where β is a hyper-parameter that balances the two distance matrices.
5. The unsupervised knowledge-graph entity alignment method of claim 4, wherein the adaptive threshold adjustment in the progressive learning of step 6 initially sets the threshold of the threshold bidirectional nearest neighbor search to a small value θ_0 and increases it by η in each subsequent round.
6. An unsupervised knowledge-graph entity alignment apparatus, comprising:
a processor;
and a memory for storing executable instructions of the processor;
wherein the processor is configured to perform, via execution of the executable instructions, the entity alignment method of any one of claims 1 to 5.
CN202110404436.7A 2021-01-06 2021-04-15 Unsupervised knowledge graph entity alignment method and unsupervised knowledge graph entity alignment equipment Pending CN112948597A (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110011250 2021-01-06
CN2021100112505 2021-01-06

Publications (1)

Publication Number Publication Date
CN112948597A (en) 2021-06-11

Family

ID=76232639

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110404436.7A Pending CN112948597A (en) 2021-01-06 2021-04-15 Unsupervised knowledge graph entity alignment method and unsupervised knowledge graph entity alignment equipment

Country Status (1)

Country Link
CN (1) CN112948597A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115168620A (en) * 2022-09-09 2022-10-11 之江实验室 Self-supervision joint learning method oriented to knowledge graph entity alignment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination