CN115618097A - Entity alignment method for prior data insufficient multi-social media platform knowledge graph

Info

Publication number: CN115618097A
Application number: CN202211075622.1A
Authority: CN
Inventors: 王柱; 刘慧�; 刘囡囡; 徐沛; 郑贺源; 郭斌; 於志文
Original assignee: Northwestern Polytechnical University
Current assignee: Northwestern Polytechnical University
Priority date: 2022-09-05
Filing date: 2022-09-05
Publication date: 2023-01-17

Abstract

The invention relates to an entity alignment method for knowledge graphs of multiple social media platforms with insufficient prior data. The method uses knowledge graph information to supplement the prior information of the multiple social media platforms, thereby improving the accuracy of entity alignment. By introducing an iteration mechanism, the invention adds entities that are likely to be aligned to the prior data set during iteration, and at the same time introduces an alignment judgment mechanism to prevent errors from accumulating over the iterations. The invention also introduces an embedding distribution alignment mechanism that constrains the shape of unlabeled entities in the embedding space by means of an adversarial network, so that the source and target embedding distributions become approximately isomorphic without requiring additional prior data. The invention provides an entity alignment model architecture that achieves good results with only a small amount of prior information and effectively reduces the dependence on data labels.

Description

Entity alignment method for prior data insufficient multi-social media platform knowledge graph

Technical Field

The invention relates to the field of machine learning and the field of natural language processing, in particular to an entity alignment method for a knowledge graph of multiple social media platforms with insufficient prior data.

Background

With the continuous development of the internet in recent years, users have become increasingly active on social media platforms. A user can follow the activities of verified ("big V") users and trending news on Facebook, and can add friends, chat, and share updates on Twitter. The knowledge graph formed from a user's social media data contains various kinds of content, such as users, events, and topics. Because different social media platforms serve different main functions, a user's data on different platforms has no obvious correlation. The process of determining that different entities in the knowledge graphs constructed from social media platforms refer to the same real-world thing is called entity alignment, and achieving entity alignment is of great significance for applications such as user profiling and recommendation systems.

Entity alignment based on knowledge graph embedding has received increasing attention. Based on the assumption that equivalent entities in different knowledge graphs have similar neighborhood structures, entities are embedded into a low-dimensional feature space by a representation learning method, and vector similarity is computed to find potentially aligned entity pairs. Meanwhile, to address data sparsity, many studies use additional information to improve entity alignment performance.

Entity alignment based on knowledge graph embedding relies on annotated prior information, but data annotation is typically costly and requires a significant amount of time and money. Current semi-supervised entity alignment algorithms cannot guarantee accuracy when only a small amount of labeled data is available. It is therefore important to maintain high accuracy while reducing the dependence on prior information.

Disclosure of Invention

Technical problem to be solved

To address the low accuracy of entity alignment results for knowledge graphs of multiple social media platforms when prior data is insufficient, the invention provides an entity alignment method for such knowledge graphs. The method provides a model that introduces an iteration strategy and an embedding distribution alignment strategy: entities newly found to be likely aligned during training are added to the training set, which dynamically increases the prior data, while the shape of unlabeled entities in the embedding space is constrained by an adversarial network. The method supplements the information in the original data and thereby ensures the accuracy of entity alignment when prior data is insufficient.

Technical scheme

An entity alignment method for a multi-social-media-platform knowledge graph with insufficient prior data, characterized by comprising the following steps:

S1: constructing a knowledge graph according to the attributes, data, and interactions of users on a plurality of social media platforms;

S2: obtaining similarity matrices in three dimensions (structure, semantics, and character strings) from the knowledge graph, and performing feature fusion to obtain the initial entity similarity;

S3: iteratively adding likely aligned entities with high entity similarity to the training data set by automatic labeling, to serve as prior data, thereby dynamically increasing the prior data;

S4: performing frequency sampling on the unlabeled entities and aligning the embedding distributions of the source and target knowledge graphs in a generative adversarial network, so that the distance between alignable entities is reduced;

S5: recalculating the structural similarity and performing feature fusion to obtain the entity alignment result.

A further technical scheme of the invention is as follows: in the knowledge graph constructed in S1, users, attributes, and the keywords of the comment content after word segmentation all serve as entity nodes, and the knowledge graph is constructed from user-user, user-attribute, and user-comment-keyword relations in the form of triples <h, r, t>.

A further technical scheme of the invention is as follows: the initial entity similarity calculation of S2 comprises obtaining three similarity matrices (structural, semantic, and character string) based on the knowledge graph of S1, and fusing the similarity matrices of the three dimensions to obtain a unified entity similarity matrix.

The further technical scheme of the invention is as follows: in S3, the iteration strategy specifically includes:

S31: according to the initial entity similarity of S2, for any entity x, when the similarity between entity y and entity x is greater than a threshold, adding entity y to the candidate aligned entities of entity x;

S32: constructing a bipartite graph from all x and y that satisfy the condition, wherein nodes represent entities and edges represent the probability that the nodes can be aligned;

S33: searching the bipartite graph for the edges with the maximum probability values that share no nodes, to obtain the one-to-one entity pairs most likely to be aligned, and labeling these entity pairs as aligned;

S34: during iteration, using an alignment judgment method to prevent labeled entities from being labeled repeatedly or becoming unlabeled;

S35: adding the automatically labeled entity pairs to the seed entity pairs, connecting the two knowledge graphs through these entities, and enriching each knowledge graph with information from the aligned entity in the other.

The further technical scheme of the invention is as follows: in S4, the embedding distribution strategy specifically includes:

S41: performing frequency sampling on all entity pairs;

S42: defining an indicator for the GAN that represents which knowledge graph an entity comes from;

S43: discriminating the domain features of the entities with the GAN discriminator;

S44: using the least squares generative adversarial network (LSGAN) loss as the adversarial loss, that is, using the least squares loss for the discriminator, with the adversarial module adopting 0-1 coding.

The further technical scheme of the invention is as follows: in S5, re-fusing the similarity matrices of the three dimensions to obtain a uniform entity similarity matrix specifically includes:

S51: calculating the similarity between the obtained $Ge_s$ and $e_t$ to obtain the similarity matrix of the structural feature embeddings;

S52: setting the same weight for each dimension and obtaining a unified entity similarity matrix by weighted averaging.

Advantageous effects

According to the entity alignment method provided by the invention for knowledge graphs of multiple social media platforms with insufficient prior data, knowledge graph information is used to supplement the prior information of the multiple social media platforms, thereby improving the accuracy of entity alignment. By introducing an iteration mechanism, the invention adds entities that are likely to be aligned to the prior data set during iteration, and at the same time introduces an alignment judgment mechanism to prevent errors from accumulating over the iterations. The invention also introduces an embedding distribution alignment mechanism that constrains the shape of unlabeled entities in the embedding space by means of an adversarial network, so that the source and target embedding distributions become approximately isomorphic without requiring additional prior data. The invention provides an entity alignment model architecture that achieves good results with only a small amount of prior information and effectively reduces the dependence on data labels.

Drawings

The drawings are only for purposes of illustrating particular embodiments and are not to be construed as limiting the invention, wherein like reference numerals are used to designate like parts throughout.

FIG. 1 is an example knowledge graph for the entity alignment method for a multi-social-media-platform knowledge graph with insufficient prior data;

FIG. 2 is a diagram illustrating the motivation for embedding distribution alignment in the entity alignment method for a multi-social-media-platform knowledge graph with insufficient prior data;

FIG. 3 is a model flow diagram of the entity alignment method for a multi-social-media-platform knowledge graph with insufficient prior data.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail below with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.

An entity alignment method for a multi-social-media-platform knowledge graph with insufficient prior data comprises the following steps:

S1: constructing a knowledge graph according to the attributes, data, and interactions of users on a plurality of social media platforms;

S2: obtaining similarity matrices in three dimensions (structure, semantics, and character strings) from the knowledge graph, and performing feature fusion to obtain the initial entity similarity;

S3: iteratively adding entities that are likely to be aligned to the training data set by automatic labeling, thereby dynamically increasing the prior data;

S4: performing frequency sampling on the unlabeled entities and aligning the embedding distributions of the source and target knowledge graphs in a generative adversarial network, so that the distance between alignable entities is reduced;

S5: recalculating the structural similarity and performing feature fusion to obtain the entity alignment result.

The technical solution of the present invention is described in detail below:

In the constructed knowledge graph, users, attributes, and the keywords of the comment content after word segmentation all serve as entity nodes, and the knowledge graph is constructed among users, attributes, and comments in the form of triples <h, r, t>. Specifically, the knowledge graph is constructed with user IDs, user nicknames, user comment topics, and user comment keywords as entity nodes, and the triples are <user, friend, user>, <user, nickname, user nickname>, <user, discussion, user comment topic>, <user, comment, user comment keyword>, and <user comment keyword, comment connection, user comment keyword>.
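The triple construction described above can be sketched as follows. This is a minimal illustration and not the implementation of the invention: the input record layout (the keys `id`, `nickname`, `friends`, `topics`, `keywords`) is hypothetical, and linking consecutive keywords with the "comment connection" relation is an assumption, since the text does not state which keyword pairs are connected.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    head: str
    relation: str
    tail: str


def build_triples(users):
    """Build <h, r, t> triples from user records.

    `users` is a list of dicts with the hypothetical keys
    'id', 'nickname', 'friends', 'topics', and 'keywords'.
    """
    triples = []
    for user in users:
        uid = user["id"]
        triples.append(Triple(uid, "nickname", user["nickname"]))
        for friend in user.get("friends", []):
            triples.append(Triple(uid, "friend", friend))
        for topic in user.get("topics", []):
            triples.append(Triple(uid, "discussion", topic))
        keywords = user.get("keywords", [])
        for keyword in keywords:
            triples.append(Triple(uid, "comment", keyword))
        # Assumption: consecutive keywords of one user are linked to each other.
        for left, right in zip(keywords, keywords[1:]):
            triples.append(Triple(left, "comment connection", right))
    return triples
```

Each platform would yield its own triple list, giving one knowledge graph per platform to be aligned.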

the specific process for obtaining the initial entity similarity comprises the following steps:

constructing a dual relation graph, in which each edge carries a weight between two relations, the weight being the similarity value of the two relations; the weight is calculated from the similarity of the head node sets and of the tail node sets connected by the two relations, wherein $r_i$ and $r_j$ each denote a relation in the knowledge graph, $H_i$ denotes the set of head nodes connected by relation $r_i$, $H_j$ denotes the set of head nodes connected by relation $r_j$, $H$ denotes the similarity value of the head nodes connected by $r_i$ and $r_j$, $T_i$ denotes the set of tail nodes connected by relation $r_i$, $T_j$ denotes the set of tail nodes connected by relation $r_j$, and $T$ denotes the similarity value of the tail nodes connected by $r_i$ and $r_j$; the resulting value is the weight between relation $r_i$ and relation $r_j$ in the knowledge graph;
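One possible way to compute such relation weights is sketched below. The exact formula is not reproduced in the text, so this sketch makes two assumptions: the head and tail similarities H and T are each taken to be the Jaccard overlap of the corresponding node sets, and the weight is taken to be their sum. The function names are illustrative only.

```python
def jaccard(a, b):
    """Jaccard overlap of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def relation_weights(triples):
    """Return {(r_i, r_j): weight} for every unordered pair of relations.

    Assumption: weight = H + T, where H and T are the Jaccard overlaps of
    the head node sets and tail node sets connected by the two relations.
    """
    heads, tails = {}, {}
    for h, r, t in triples:
        heads.setdefault(r, set()).add(h)
        tails.setdefault(r, set()).add(t)
    relations = sorted(heads)
    weights = {}
    for i, r_i in enumerate(relations):
        for r_j in relations[i + 1:]:
            head_sim = jaccard(heads[r_i], heads[r_j])
            tail_sim = jaccard(tails[r_i], tails[r_j])
            weights[(r_i, r_j)] = head_sim + tail_sim
    return weights
```

Under these assumptions, two relations that share many head or tail entities receive a large weight in the dual relation graph.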

iteratively applying a graph attention mechanism to obtain the node representations e of the dual relation graph and the original graph;

calculating the similarity between the vectors with the cosine similarity algorithm to obtain the structural similarity matrix, where the cosine similarity is calculated as follows:

$$\mathrm{sim}(e_i, e_j) = \frac{e_i \cdot e_j}{\lVert e_i \rVert \, \lVert e_j \rVert}$$

wherein $e_i$ denotes the vector representation of node i in the knowledge graph, $e_j$ denotes the vector representation of node j in the knowledge graph, and sim denotes the structural similarity;
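The cosine similarity step can be illustrated with a short sketch. It uses plain Python lists as embeddings and hypothetical function names; returning 0.0 for a zero vector is a convention chosen here, not something stated in the text.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention for zero vectors (assumption)
    return dot / (norm_u * norm_v)


def similarity_matrix(source_embeddings, target_embeddings):
    """Rows index source entities, columns index target entities."""
    return [[cosine(s, t) for t in target_embeddings] for s in source_embeddings]
```

The same helper would apply to the semantic dimension, where the vectors come from word embedding models instead of the graph attention network.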

performing word segmentation on the comment corpora of all users in the original data, and inputting the segmented corpora into the Word2vec, fastText, and GloVe pre-trained word vector models for training;

inputting the entities in the knowledge graph into the three word vector models to obtain vector representations;

calculating the similarity between the vectors with the cosine similarity algorithm to obtain the semantic similarity matrix;

measuring the difference between two character strings with the Levenshtein distance;

calculating the Levenshtein ratio, which represents the similarity between entity names, wherein $e_i$ and $e_j$ denote entity name vector representations in the knowledge graph, lev denotes the Levenshtein distance between the two character string vector representations, and r denotes the similarity between entity names;
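A minimal sketch of the string dimension follows. The Levenshtein distance is the standard dynamic-programming definition; the ratio, however, is an assumption, because the text does not reproduce its formula. Here it is taken as (|a| + |b| - lev) / (|a| + |b|), which is one common normalization and may differ from the one used by the invention.

```python
def levenshtein(a, b):
    """Edit distance with unit cost for insert, delete, and substitute."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        current = [i]
        for j, char_b in enumerate(b, 1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution
            ))
        previous = current
    return previous[-1]


def levenshtein_ratio(a, b):
    """Assumed normalization: (|a| + |b| - lev) / (|a| + |b|)."""
    total = len(a) + len(b)
    if total == 0:
        return 1.0  # two empty names are treated as identical
    return (total - levenshtein(a, b)) / total
```

Applying `levenshtein_ratio` to every pair of entity names would give the string similarity matrix used in the fusion step.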

assuming that each dimension is equally important, a unified entity similarity matrix is obtained as follows:

$$\hat{S}^k = \frac{S^k - \mathrm{mean}(S^k)}{\mathrm{std}(S^k)}, \qquad S^* = \frac{1}{3} \sum_{k=1}^{3} \hat{S}^k$$

wherein $S^k$ denotes the string similarity matrix, the semantic similarity matrix, or the structural similarity matrix, mean is the mean function, std is the function that calculates the standard deviation, $\hat{S}^k$ denotes the corresponding normalized string, semantic, or structural similarity matrix, and $S^*$ denotes the averaged similarity matrix;
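The normalization by mean and standard deviation followed by equal-weight averaging can be sketched as below. The sketch assumes that mean and std are taken over all entries of each matrix and that the population standard deviation is used; neither detail is stated in the text, and the function names are illustrative.

```python
import math


def zscore(matrix):
    """Standardize all entries of a matrix with its global mean and std."""
    values = [v for row in matrix for v in row]
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    if std == 0:
        return [[0.0 for _ in row] for row in matrix]  # constant matrix
    return [[(v - mean) / std for v in row] for row in matrix]


def fuse(matrices):
    """Equal-weight average of the standardized similarity matrices."""
    normalized = [zscore(m) for m in matrices]
    rows, cols = len(matrices[0]), len(matrices[0][0])
    return [
        [sum(n[i][j] for n in normalized) / len(normalized) for j in range(cols)]
        for i in range(rows)
    ]
```

Standardizing first keeps a dimension with a wide numeric range, such as raw string scores, from dominating the average.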

S3, iteratively adding entities that are likely to be aligned to the training data set by automatic labeling, thereby dynamically increasing the prior data;

automatically labeling entities and adding them to the training data set specifically comprises:

S31: according to the initial entity similarity calculated in S2, for any entity x, when the similarity between entity y and entity x is greater than a threshold, adding entity y to the candidate aligned entities of entity x; for example, in the t-th round of iteration:

$$\pi(y \mid x; \Theta^t) = \sigma(\mathrm{sim}(x, y))$$

wherein x denotes any entity in the knowledge graph, y denotes any entity other than x in the knowledge graph, sim denotes the similarity between entity x and entity y obtained from the similarity matrix fusing the multidimensional features in S2, $\Theta^t$ denotes the representation calculated from the multidimensional features in the t-th round, and $\pi$ denotes the probability of assigning a label;

$$Y'_x = \{\, y \mid y \in Y' \text{ and } \pi(y \mid x; \Theta^t) > \gamma_1 \,\}$$

wherein $\psi^t$ denotes the indicator function, $X'$ denotes the entity set, $Y'_x$ denotes the candidate aligned entities of entity x, $\gamma_1$ denotes the threshold, and max denotes the maximum function;
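The candidate selection of S31 can be sketched as follows. The text does not define σ; this sketch assumes it is the logistic sigmoid. The dictionary-based input format and the function names are hypothetical.

```python
import math


def sigmoid(z):
    """Logistic sigmoid (assumed meaning of sigma in the formula above)."""
    return 1.0 / (1.0 + math.exp(-z))


def candidate_sets(similarity, gamma_1):
    """Return {x: {y: probability}} for pairs whose probability exceeds gamma_1.

    `similarity` maps (x, y) pairs to fused similarity values.
    """
    candidates = {}
    for (x, y), sim in similarity.items():
        probability = sigmoid(sim)
        if probability > gamma_1:
            candidates.setdefault(x, {})[y] = probability
    return candidates
```

For example, with a threshold of 0.5 only pairs with positive fused similarity survive, because the sigmoid of zero is exactly 0.5.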

S32: constructing a bipartite graph from all x and y that satisfy the condition, wherein nodes represent entities and edges represent the probability that the nodes can be aligned;

S33: searching the bipartite graph for the edges with the maximum probability values that share no nodes, to obtain the one-to-one entity pairs most likely to be aligned, and labeling these entity pairs as aligned;
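One simple way to select high-probability edges that share no nodes is a greedy pass over the edges in descending order of probability, sketched below. This greedy strategy is an assumption made for illustration; it does not always find the maximum-weight matching, and the text does not specify which search procedure the invention uses.

```python
def greedy_one_to_one(edges):
    """Pick edges in descending probability so that no entity is used twice.

    `edges` is a list of (x, y, probability) tuples from the bipartite graph.
    """
    used_sources, used_targets, matched = set(), set(), []
    for x, y, probability in sorted(edges, key=lambda e: e[2], reverse=True):
        if x not in used_sources and y not in used_targets:
            matched.append((x, y, probability))
            used_sources.add(x)
            used_targets.add(y)
    return matched
```

The returned pairs are one-to-one and could then receive alignment labels as described in S33.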

when entities conflict, the model prefers the entity pair with the higher probability, which is determined by calculating the difference in similarity between the conflicting entity pairs, wherein x denotes any entity, y denotes one candidate aligned entity of entity x, y' denotes another candidate aligned entity of entity x, and $\pi$ denotes the label assignment probability defined in S31;
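The conflict handling can be illustrated with the sketch below. It assumes the similarity difference is the label probability of the new candidate minus that of the existing one, and that the existing label is replaced only when this difference is positive; the formula itself is not reproduced in the text, so both points are assumptions. The names `labeled` and `probability` are hypothetical.

```python
def resolve_conflict(labeled, x, y_new, probability):
    """Keep, for entity x, whichever candidate has the higher probability.

    `labeled` maps x to its currently labeled counterpart; `probability`
    is a function (x, y) -> float giving the label assignment probability.
    """
    y_old = labeled.get(x)
    if y_old is None or y_old == y_new:
        labeled[x] = y_new
        return labeled
    # Assumed difference: pi(y_new | x) - pi(y_old | x).
    delta = probability(x, y_new) - probability(x, y_old)
    if delta > 0:
        labeled[x] = y_new
    return labeled
```

A rule of this kind keeps each entity tied to a single counterpart across iterations, which limits the accumulation of labeling errors.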

S35: iteratively adding the automatically labeled entity pairs to the seed entity pairs, connecting the two knowledge graphs through these entities, and enriching each knowledge graph with information from the aligned entity in the other.

In S4, the embedding distribution alignment specifically comprises the following steps:

S41: performing frequency sampling on all entity pairs;

the frequency sampling vector is calculated from the following quantities: $f_{hi}$ denotes the number of times entity i appears as the head entity in all triples, $f_{ti}$ denotes the number of times entity i appears as the tail entity in all triples, $P_i$ denotes the frequency sampling vector of entity i, and E denotes the number of nodes in the knowledge graph;
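A sketch of frequency sampling is given below. The formula is not reproduced in the text, so the sketch assumes that the sampling weight of entity i is its total number of head and tail occurrences divided by the total over all entities, which yields a probability distribution over the nodes. The function name is illustrative.

```python
from collections import Counter


def frequency_sampling_vector(triples):
    """Return {entity: sampling probability} from head and tail counts.

    Assumption: P_i = (f_hi + f_ti) / sum_j (f_hj + f_tj).
    """
    head_count = Counter(h for h, _, _ in triples)
    tail_count = Counter(t for _, _, t in triples)
    entities = sorted(set(head_count) | set(tail_count))
    total = sum(head_count[e] + tail_count[e] for e in entities)
    return {e: (head_count[e] + tail_count[e]) / total for e in entities}
```

Entities that occur in many triples are then sampled more often when training the adversarial module.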

S43: discriminating the domain features of the entities with the GAN discriminator;

the optimal discriminator is defined in terms of the following quantities: $KG_s$ denotes the source knowledge graph, $KG_t$ denotes the target knowledge graph, and $D^*$ is the optimal discriminator;

S44: using the least squares generative adversarial network (LSGAN) loss as the adversarial loss, that is, using the least squares loss for the discriminator, with the adversarial module adopting 0-1 coding;

D and G are defined in terms of the following quantities: $\delta$ denotes the label smoothing value, $e_s$ and $e_t$ denote the inputs to the generative adversarial network module produced by the structural feature embedding module of S2, D denotes the discriminator, which indicates which knowledge graph an entity comes from, G denotes the generator trained by the generative adversarial network module, $\lambda_1$ denotes the ratio used to balance the loss functions of the different modules, $P_t$ denotes the sampling frequency vector of the target knowledge graph $KG_t$, and $P_s$ denotes the sampling frequency vector of the source knowledge graph $KG_s$;
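The least squares adversarial losses with 0-1 coding can be sketched as below, using the standard LSGAN form. Several details are assumptions because the definitions of D and G are not reproduced in the text: target embeddings are coded 1 and mapped source embeddings are coded 0, the smoothing value delta shifts the targets to 1 - delta and delta, and the frequency sampling weights and the ratio lambda_1 are omitted.

```python
def lsgan_discriminator_loss(d_target, d_mapped, delta=0.0):
    """Least squares loss for the discriminator.

    d_target: discriminator outputs on target embeddings e_t (coded 1).
    d_mapped: discriminator outputs on mapped source embeddings G(e_s) (coded 0).
    delta: label smoothing value (assumed to shift the 0-1 targets).
    """
    real = sum((d - (1.0 - delta)) ** 2 for d in d_target) / len(d_target)
    fake = sum((d - delta) ** 2 for d in d_mapped) / len(d_mapped)
    return 0.5 * (real + fake)


def lsgan_generator_loss(d_mapped, delta=0.0):
    """Least squares loss for the generator, which wants G(e_s) coded as 1."""
    return 0.5 * sum((d - (1.0 - delta)) ** 2 for d in d_mapped) / len(d_mapped)
```

Compared with the cross-entropy GAN loss, the squared error still penalizes samples that are classified correctly but lie far from the decision boundary, which tends to give smoother gradients.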

S5, recalculating the structural similarity and performing feature fusion;

the similarity of the structural feature embeddings is calculated as follows:

$$d(Ge_s, e_t) = \lVert Ge_s - e_t \rVert$$

wherein $e_s$ and $e_t$ denote the inputs to the generative adversarial network module produced by the structural feature embedding module of S2, G denotes the generator trained by the generative adversarial network module, and d denotes the similarity of the structural feature embeddings;
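The distance between a mapped source embedding and a target embedding can be sketched as follows. The sketch assumes that G acts as a linear map, represented as a matrix given by a list of rows, and that the norm is Euclidean; the text writes only Ge_s and a norm without further detail.

```python
import math


def apply_generator(G, e):
    """Matrix-vector product G e, with G given as a list of rows."""
    return [sum(g * x for g, x in zip(row, e)) for row in G]


def embedding_distance(G, e_s, e_t):
    """Euclidean norm of G e_s - e_t (smaller means more similar)."""
    mapped = apply_generator(G, e_s)
    return math.sqrt(sum((m - t) ** 2 for m, t in zip(mapped, e_t)))
```

Because d is a distance, it would need to be converted to a similarity (for example by negation) before being fused with the other two dimensions; that conversion is an assumption and is not shown here.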

calculating the final similarity of the entities with the same fusion method as in S2;

and finally, obtaining the entity alignment result according to the similarity matrix between the entities.
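Reading the alignment result off the final similarity matrix can be sketched as below. The sketch assumes that each source entity is aligned to the target entity with the highest fused similarity in its row; the text does not state the selection rule, and a one-to-one matching could be used instead.

```python
def align(similarity, source_ids, target_ids):
    """Map each source entity to the target with the highest similarity.

    `similarity[i][j]` is the fused similarity between source i and target j.
    """
    result = {}
    for i, row in enumerate(similarity):
        best = max(range(len(row)), key=row.__getitem__)
        result[source_ids[i]] = target_ids[best]
    return result
```

With this rule two source entities may select the same target, so a downstream check for duplicates may be needed.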

The invention provides an entity alignment method for knowledge graphs of multiple social media platforms with insufficient prior data. Because users and user data on different social media platforms are not directly related, labeling samples for entity alignment takes a great deal of time and money. The invention therefore provides an entity alignment method that introduces an iteration strategy and an embedding distribution alignment strategy. The iteration strategy dynamically increases the prior data by adding entities newly found to be likely aligned during training to the training set, and introduces an alignment judgment mechanism to prevent errors from accumulating over the iterations. The embedding distribution alignment strategy uses an adversarial network to constrain the shape of unlabeled entities in the embedding space, that is, it requires the shapes of the two embedding distributions to be as similar as possible, so as to ensure the accuracy of entity alignment when prior data is insufficient. By using knowledge graph information and introducing the iteration strategy and the embedding distribution alignment strategy, the invention compensates for the insufficient prior data of the original users and achieves higher entity alignment accuracy.

While the invention has been described with reference to specific embodiments, the invention is not limited thereto, and various equivalent modifications or substitutions can be easily made by those skilled in the art within the technical scope of the present disclosure.

Claims

1. An entity alignment method for a multi-social-media-platform knowledge graph with insufficient prior data, characterized by comprising the following steps:

S3: iteratively adding likely aligned entities with high entity similarity to the training data set by automatic labeling, to serve as prior data, thereby dynamically increasing the prior data;

2. The entity alignment method for a multi-social-media-platform knowledge graph with insufficient prior data according to claim 1, wherein: in the knowledge graph constructed in S1, users, attributes, and the keywords of the comment content after word segmentation all serve as entity nodes, and the knowledge graph is constructed from user-user, user-attribute, and user-comment-keyword relations in the form of triples <h, r, t>.

3. The entity alignment method for a multi-social-media-platform knowledge graph with insufficient prior data according to claim 2, wherein: the initial entity similarity calculation of S2 comprises obtaining three similarity matrices (structural, semantic, and character string) based on the knowledge graph of S1, and fusing the similarity matrices of the three dimensions to obtain a unified entity similarity matrix.

4. The entity alignment method for the priori data insufficient multi-social media platform knowledge graph according to claim 3, wherein in S3, the iteration strategy specifically comprises:

5. The entity alignment method for a multi-social-media-platform knowledge graph with insufficient prior data according to claim 4, wherein in S4, the embedding distribution strategy specifically comprises:

S41: performing frequency sampling on all entity pairs;

S42: defining an indicator for the GAN that represents which knowledge graph an entity comes from;

S44: using the least squares generative adversarial network (LSGAN) loss as the adversarial loss, that is, using the least squares loss for the discriminator, with the adversarial module adopting 0-1 coding.

6. The entity alignment method for a priori data insufficient multi-social media platform knowledge graph according to claim 5, wherein: in S5, re-fusing the similarity matrices of the three dimensions to obtain a uniform entity similarity matrix specifically includes: