CN113886598A - Knowledge graph representation method based on federal learning - Google Patents

Info

Publication number
CN113886598A
CN113886598A (application CN202111134706.3A)
Authority
CN
China
Prior art keywords
entity
client
matrix
knowledge graph
embedded
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111134706.3A
Other languages
Chinese (zh)
Inventor
Zhang Wen
Chen Mingyang
Yao Zhen
Chen Huajun
Current Assignee
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date
Filing date
Publication date
Application filed by Zhejiang University ZJU filed Critical Zhejiang University ZJU
Priority to CN202111134706.3A
Publication of CN113886598A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval of unstructured textual data
    • G06F 16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F 16/367 Ontology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval of unstructured textual data
    • G06F 16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F 17/10 Complex mathematical operations
    • G06F 17/16 Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q 30/00 Commerce
    • G06Q 30/02 Marketing; Price estimation or determination; Fundraising
    • G06Q 30/0201 Market modelling; Market analysis; Collecting market data

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Accounting & Taxation (AREA)
  • Development Economics (AREA)
  • Finance (AREA)
  • Strategic Management (AREA)
  • Databases & Information Systems (AREA)
  • Computing Systems (AREA)
  • Computational Mathematics (AREA)
  • Pure & Applied Mathematics (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Mathematical Optimization (AREA)
  • Mathematical Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Economics (AREA)
  • Marketing (AREA)
  • General Business, Economics & Management (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Animal Behavior & Ethology (AREA)
  • Computational Linguistics (AREA)
  • Game Theory and Decision Science (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Algebra (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a knowledge graph representation learning method based on federated learning, which comprises the following steps: first, a central server and a plurality of clients are established; the central server aggregates entity embeddings from different clients and sends the aggregated embeddings back to each client; each client updates its entity and relation embeddings using its local triples and sends the updated entity embedding matrix back to the central server. Further, since the embeddings learned by the federated knowledge graph embedding framework are complementary to embeddings trained on a single knowledge graph without the federated setting, a knowledge graph fusion step is provided to fuse the embeddings learned with and without the federated setting. The method allows multiple knowledge graphs to complement one another while ensuring data privacy, and has good practical value for the knowledge graph completion task.

Description

Knowledge graph representation method based on federal learning
Technical Field
The invention belongs to the technical field of knowledge graph representation, and particularly relates to a knowledge graph representation learning method based on federal learning.
Background
A knowledge graph (KG) is a data set in which head entities and tail entities are connected by relations in the form of triples. A triple is written (head entity, relation, tail entity), abbreviated (h, r, t). Many large-scale knowledge graphs such as FreeBase, YAGO and WordNet have been built, and they provide an effective basis for many important AI tasks such as semantic search, recommendation and question answering. Knowledge graphs often contain a large amount of information, of which two of the more important kinds are structural information and textual information. Structural information refers to the relations that hold between an entity and other entities, and the structural information of an entity is often embodied by its neighboring triples; textual information refers to the semantic information in the textual descriptions of the entities and relations in the knowledge graph, and is usually embodied by the names of entities and relations, additional word descriptions of them, and so on. However, many knowledge graphs are still incomplete, so predicting missing triples from the existing triples, i.e., the knowledge graph completion (KGC) task, is important.
In practical knowledge graph applications, it is common for the same entity to appear in different knowledge graphs; in this case the knowledge graph becomes a multi-source knowledge graph, and multiple knowledge graphs can complement each other's knowledge and obtain better performance on link prediction and other problems. However, knowledge graphs often involve sensitive fields (such as finance or medicine) or are restricted by regulations. Therefore, how to exploit the complementary capabilities of different related knowledge graphs while protecting data privacy is an urgent problem in practical applications.
Patent application publication No. CN112200321A discloses an inference method based on knowledge federation and graph networks, comprising: each participant server constructs a knowledge graph from local entity data; generates a low-dimensional knowledge vector from the initial node features and structural information of the knowledge graph using a pre-trained graph neural network model; and sends each low-dimensional knowledge vector to a trusted third-party server; the trusted third-party server fuses the received low-dimensional knowledge vectors using a pre-trained fusion model to obtain a fused feature representation; and, for a knowledge inference request, inference is performed over the indexed fused feature representation to obtain an inference result.
Patent application publication No. CN111767411A discloses a knowledge graph representation learning optimization method, which determines a training sample set from a local knowledge graph data set; and carrying out federal learning on a local knowledge graph representation learning model based on the training sample set and the other data ends to obtain a target knowledge graph representation learning model, wherein the other data ends participate in federal learning based on the training sample set determined in the local knowledge graph data sets.
Both of the above technical schemes learn knowledge graphs by federated learning, but neither considers fusing the entity embedded representations obtained with and without participation in federated learning, so the resulting entity embedded representations are inaccurate.
Disclosure of Invention
In view of the above, the invention provides a knowledge graph representation learning method based on federated learning, which protects data privacy while obtaining a global entity embedded representation through federated learning; each client fuses the entity embedded representations obtained with and without participation in federated learning, thereby further improving the accuracy of the entity embedded representation.
The technical scheme provided by the invention is as follows:
a knowledge graph representation learning method based on federated learning comprises the following steps:
(1) the central server maintains an entity list of all knowledge graphs, defines a permutation matrix and a presence vector for each client, initializes an entity embedding matrix, screens the entity embedding matrix according to the permutation matrices to determine the entity embedding matrix corresponding to each client, and sends it to the clients;
(2) and circulating the federal learning process: the client side performs knowledge graph representation learning by adopting a local knowledge graph according to the received entity embedded matrix to update the entity embedded matrix, and uploads the updated entity embedded matrix to the central server; the central server aggregates all entity embedded matrixes uploaded by all the clients according to the permutation matrixes and the existence vectors, screens the aggregated entity embedded matrixes according to the permutation matrixes to determine entity embedded matrixes corresponding to all the clients and issues the entity embedded matrixes to the clients;
(3) and each client side fuses the entity embedded matrix determined by participating in the federal learning and the entity embedded matrix determined by not participating in the federal learning so as to determine a final entity embedded matrix.
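The inner loop of steps (1) and (2) can be sketched in a few lines. The following is a minimal NumPy sketch of one server/client round under simplifying assumptions (dense matrices, a client's local training modeled as a plain update function); all names are illustrative, not from the patent:

```python
import numpy as np

def federated_round(server_E, local_updates, P, v):
    """One round of steps (1) and (2), as a NumPy sketch.

    server_E      : (n, d) global entity embedding matrix on the server
    local_updates : list of functions; local_updates[c] stands in for
                    client c's local knowledge-graph training, mapping its
                    embedding matrix to an updated one
    P             : list of (n, n_c) 0/1 permutation matrices, one per client
    v             : list of (n,) 0/1 presence vectors, one per client
    """
    uploaded = []
    for c in range(len(local_updates)):
        E_c = P[c].T @ server_E                 # server screens and sends down
        uploaded.append(local_updates[c](E_c))  # client updates with local triples
    counts = np.maximum(sum(v), 1)              # clients holding each entity
    total = sum(P[c] @ uploaded[c] for c in range(len(uploaded)))
    return total / counts[:, None]              # per-entity weighted average
```

Entities held by several clients are averaged across them; entities held by a single client pass through unchanged.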
Compared with the prior art, the invention has the beneficial effects that at least:
the method comprises the steps that a local knowledge graph of each client is introduced into federal setting, a traditional knowledge graph completion task is expressed as a new task, namely the federal knowledge graph completion task, privacy of the knowledge graphs of the clients is guaranteed while a plurality of knowledge graphs are utilized, and moreover, the accuracy of an entity embedding matrix participating in federal learning and determining is improved in a federal learning mode.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.
Fig. 1 is a flowchart of a knowledge graph representation learning method based on federated learning according to an embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the detailed description and specific examples, while indicating the scope of the invention, are intended for purposes of illustration only and are not intended to limit the scope of the invention.
Inspired by the concepts of federated learning and knowledge graph embedding, and in order to fully utilize different related knowledge graphs while ensuring data privacy, this embodiment provides a knowledge graph representation learning method based on federated learning to learn embeddings of shared entities. The system that realizes this learning comprises a plurality of clients and a central server that communicates with each client, each client holding a local knowledge graph. Specifically, by learning embeddings of shared entities, the knowledge graph embeddings gain the ability to predict missing triples without collecting triples from the different client servers onto the central server. Privacy is protected by training embeddings on each client and aggregating embeddings on the server side, without gathering the clients' triples into the central server. For each client, its triples and relation set are never disclosed to other clients, and its entity set is read only by the central server, so data privacy is guaranteed. In addition, a model fusion process on each client is designed in order to fuse the embeddings learned on a single client with those learned in the federated environment.
Fig. 1 is a flowchart of a knowledge graph representation learning method based on federated learning according to an embodiment. As shown in fig. 1, the knowledge graph representation learning method based on federated learning provided in the embodiment includes the following steps:
and step 1, establishing a federal learning system.
In the embodiment, the established federated learning system comprises a plurality of clients and one central server. The central server exchanges data with each client individually, the clients do not exchange data with each other, and each client holds a local knowledge graph. Let E denote the entity set of a knowledge graph, R its relation set, and TP its triple set. The federated knowledge graph is

G = {G_1, G_2, ..., G_C}

where G_c = {E_c, R_c, TP_c} represents the knowledge graph stored at the c-th client and C denotes the total number of clients.
And 2, maintaining an entity list of all knowledge graphs by the central server, and defining a permutation matrix and a presence vector for each client.
In an embodiment, an entity list T is maintained in the central server, and all unique entities included in the knowledge graph of all clients participating in federal learning are recorded in the entity list T, that is, the entities maintained in the entity list T are not duplicated.
In the embodiment, the permutation matrix defined by the central server for each client is

P^c ∈ {0, 1}^(n × n_c), c = 1, 2, ..., C

where each element P^c_ij of the permutation matrix P^c represents the correspondence between entities in the central server's entity list and entities of the client: P^c_ij = 1 indicates that the i-th entity in the entity list is the j-th entity of the c-th client, and P^c_ij = 0 indicates that it is not. C denotes the total number of clients, n the number of entities in the entity list, and n_c the number of entities of the c-th client.

The presence vector defined by the central server for each client is

v^c ∈ {0, 1}^n, c = 1, 2, ..., C

where each element v^c_i of the presence vector v^c indicates whether the i-th entity exists at the c-th client: v^c_i = 1 indicates that the c-th client has the i-th entity, and v^c_i = 0 indicates that it does not.
The permutation matrix and the presence vector are used for subsequent screening and aggregation of entity embedding matrixes corresponding to the clients.
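As a concrete illustration of these definitions, the permutation matrix and presence vector for one client can be built from the server's entity list and the client's entity list. This is a hypothetical helper (the function name and list-based representation are assumptions, not from the patent):

```python
import numpy as np

def build_permutation_and_presence(global_entities, client_entities):
    """Construct P^c and v^c for one client.

    global_entities : server's deduplicated entity list (length n)
    client_entities : the c-th client's entity list (length n_c)
    """
    n, n_c = len(global_entities), len(client_entities)
    index = {e: j for j, e in enumerate(client_entities)}
    P = np.zeros((n, n_c), dtype=int)  # P[i, j] = 1 iff global entity i is client entity j
    v = np.zeros(n, dtype=int)         # v[i] = 1 iff the client holds global entity i
    for i, e in enumerate(global_entities):
        if e in index:
            P[i, index[e]] = 1
            v[i] = 1
    return P, v
```

Each column of P then holds exactly one 1, and v marks which rows of the server's matrix belong to the client.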
And 3, initializing the entity embedded matrix by the central server, screening the entity embedded matrix according to the permutation matrix to determine the entity embedded matrix corresponding to each client and issuing the entity embedded matrix to the client.
In the embodiment, the entity embedding matrix initialized by the central server is

E ∈ R^(n × d_e)

where d_e is the embedding dimension of the entities. Before being sent to each client, the entity embedding matrix is screened to determine the matrix corresponding to each client. Specifically, the entity embedding matrix is screened according to the permutation matrix using the following formula:

E^c_t = (P^c)^T · E_t

where E_t is the entity embedding matrix aggregated by the central server in the t-th round of federated learning, P^c is the permutation matrix, T denotes transposition, and E^c_t is the entity embedding matrix corresponding to the c-th client in the t-th round. When t = 0, E_0 is the entity embedding matrix initialized by the central server and E^c_0 is the initial entity embedding matrix for the c-th client.

In the t-th round, the central server sends the entity embedding matrix E^c_t to each client; only the client-specific entity embedding matrix is returned to the corresponding client, which ensures that entity and relation data cannot be leaked.
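A small numeric check of the screening step (toy values, not from the patent): the transposed 0/1 permutation matrix simply gathers, in client order, the rows of the global embedding matrix belonging to the client's entities.

```python
import numpy as np

# Screening E^c_t = (P^c)^T E_t with a 3-entity global list and a client
# that holds global entities 2 and 0 (in that local order).
E = np.array([[1., 2.], [3., 4.], [5., 6.]])  # global matrix, n = 3, d_e = 2
P_c = np.array([[0, 1], [0, 0], [1, 0]])       # client's permutation matrix
E_c = P_c.T @ E                                # shape (n_c, d_e)
assert np.allclose(E_c, [[5., 6.], [1., 2.]])  # rows 2 and 0, in client order
```

The client never sees rows for entities it does not hold, which is how the screening restricts what leaves the server.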
And step 4, each client and the central server train through federated learning based on the entity embedding matrices and the knowledge graphs so as to update the entity embedding matrices.
In the examples, the federated learning process is cycled: the client side performs knowledge graph representation learning by adopting a local knowledge graph according to the received entity embedded matrix to update the entity embedded matrix, and uploads the updated entity embedded matrix to the central server; and the central server aggregates all the entity embedded matrixes uploaded by all the clients according to the permutation matrixes and the existence vectors, screens the aggregated entity embedded matrixes according to the permutation matrixes to determine the entity embedded matrixes corresponding to all the clients and issues the entity embedded matrixes to the clients.
In the embodiment, after each client receives the entity embedding matrix E^c_t, it updates the entity embeddings using its local triples and its relation embedding matrix, and the client also updates its own relation embedding matrix. When a client performs knowledge graph representation learning on its local knowledge graph, for each triple (h, r, t) consisting of a head entity, a relation and a tail entity, a scoring function is used to construct a loss function; the embedding model is trained with this loss function, and the entity embedding matrix is updated at the same time;

wherein the loss function is:

L(h, r, t) = −log σ(γ + f_r(h, r, t)) − Σ_{i=1}^{m} p(h, r, t'_i) · log σ(−f_r(h, r, t'_i) − γ)

where L(h, r, t) is the loss of the triple (h, r, t), f_r(h, r, t) is the scoring function of the triple (h, r, t), f_r(h, r, t'_i) is the scoring function of the triple (h, r, t'_i) obtained by negatively sampling the tail entity t of (h, r, t) as the i-th entity t'_i, m is the number of negative samples, γ is a margin value ranging over the real numbers, σ(·) is the sigmoid function, and p(h, r, t'_i) is the negative sample weight of the triple (h, r, t'_i), defined as:

p(h, r, t'_i) = exp(α · f_r(h, r, t'_i)) / Σ_{j=1}^{m} exp(α · f_r(h, r, t'_j))

where α represents the negative sampling temperature.
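This is a self-adversarial negative-sampling loss in the style of RotatE. A minimal NumPy sketch for a single positive triple and its negatives; the γ and α values are arbitrary placeholders, and the sign convention assumes higher scores mean more plausible triples:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def self_adversarial_loss(pos_score, neg_scores, gamma=9.0, alpha=1.0):
    """Loss for one triple (sketch; gamma and alpha are placeholder values).

    pos_score  : f_r(h, r, t), scalar score of the positive triple
    neg_scores : array of f_r(h, r, t'_i) for the m negative triples
    """
    # p(h, r, t'_i): softmax over negative scores with temperature alpha,
    # so harder (higher-scoring) negatives get more weight
    w = np.exp(alpha * neg_scores)
    p = w / w.sum()
    pos_term = -np.log(sigmoid(gamma + pos_score))
    neg_term = -(p * np.log(sigmoid(-neg_scores - gamma))).sum()
    return float(pos_term + neg_term)
```

In practice this would be computed with an autodiff framework so the embeddings can be updated by gradient descent; the NumPy version only shows the arithmetic.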
In the embodiment, the specific form of the scoring function may vary with the embedding model selected; the following table lists embedding models that may be used and the corresponding scoring functions.

TABLE 1

Model     | Scoring function f_r(h, r, t)
TransE    | −‖h + r − t‖
DistMult  | h^T diag(r) t
ComplEx   | Re(h^T diag(r) t̄)
RotatE    | −‖h ∘ r − t‖

It is understood that: the embedding model may adopt the TransE model with corresponding scoring function f_r(h, r, t) = −‖h + r − t‖; or the DistMult model with corresponding scoring function f_r(h, r, t) = h^T diag(r) t; or the ComplEx model with corresponding scoring function f_r(h, r, t) = Re(h^T diag(r) t̄); or the RotatE model with corresponding scoring function f_r(h, r, t) = −‖h ∘ r − t‖; where diag(·) denotes a diagonal matrix, t̄ denotes the complex conjugate of the entity t, Re(·) denotes the real part, and ∘ denotes element-wise multiplication.
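The scoring functions in Table 1 are one-liners over the embedding vectors. A NumPy sketch of three of them, taking ComplEx embeddings as complex arrays; the formulas are the standard ones, though the helper names are illustrative:

```python
import numpy as np

def score_transe(h, r, t):
    # TransE: f_r(h, r, t) = -||h + r - t||
    return -float(np.linalg.norm(h + r - t))

def score_distmult(h, r, t):
    # DistMult: f_r(h, r, t) = h^T diag(r) t, a tri-linear dot product
    return float(np.sum(h * r * t))

def score_complex(h, r, t):
    # ComplEx: f_r(h, r, t) = Re(h^T diag(r) conj(t)), complex embeddings
    return float(np.real(np.sum(h * r * np.conj(t))))
```

Any of these can be plugged into the client-side loss as f_r without changing the rest of the federated procedure.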
After the clients have carried out P local training rounds in round t − 1, each client uploads its trained entity embedding matrix E^c_t to the central server, and the central server aggregates all the entity embedding matrices uploaded by the clients according to the permutation matrices and presence vectors:

E_t ← (1 ⊘ Σ_{c=1}^{C} v^c) ⊗ Σ_{c=1}^{C} P^c · E^c_t

where P^c is the permutation matrix of the c-th client, v^c is the presence vector of the c-th client, E^c_t is the entity embedding matrix corresponding to the c-th client in the t-th round of federated learning, E_t is the entity embedding matrix aggregated by the central server in the t-th round, 1 denotes the all-ones vector (a vector whose elements are all 1), ⊘ denotes element-wise division, ⊗ denotes element-wise multiplication, and ← denotes assignment. The aggregation formula can be understood as follows: the entity embedding matrices from different clients are permuted into list order by the corresponding permutation matrices, and a weight vector is computed from the number of clients at which each entity exists, as given by the presence vectors.
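A numeric check of the aggregation formula with two clients and three global entities (toy values, not from the patent). Entity 1 is held by both clients, so its embedding is averaged; entities 0 and 2 are copied from their single owners:

```python
import numpy as np

# Client 0 holds global entities {0, 1}; client 1 holds {1, 2}.
P0 = np.array([[1, 0], [0, 1], [0, 0]]); v0 = np.array([1, 1, 0])
P1 = np.array([[0, 0], [1, 0], [0, 1]]); v1 = np.array([0, 1, 1])
E0 = np.array([[2.], [4.]])    # client 0's uploaded matrix (entities 0, 1)
E1 = np.array([[6.], [8.]])    # client 1's uploaded matrix (entities 1, 2)

total = P0 @ E0 + P1 @ E1      # sum in global list order: [[2], [10], [8]]
counts = v0 + v1               # clients per entity: [1, 2, 1]
E_t = total / counts[:, None]  # element-wise division by the weight vector
assert np.allclose(E_t, [[2.], [5.], [8.]])  # entity 1 averaged: (4 + 6) / 2
```

A production implementation would also guard against entities held by no client (a zero count), which this toy example does not exercise.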
And 5, fusing the entity embedded matrix determined by participating in the federal learning and the entity embedded matrix determined by not participating in the federal learning by each client so as to determine a final entity embedded matrix.
The embeddings learned by the federated knowledge graph embedding framework are complementary to embeddings trained on only one knowledge graph without the federated setting. The embodiment therefore also designs a fusion process applied on each client's knowledge graph to fuse the entity embedding matrices learned with and without the federated setting. The process by which each client fuses the entity embedding matrix determined with participation in federated learning and the one determined without is as follows:
after splicing the entity embedded matrix determined by participating in the federal learning and the entity embedded matrix determined by not participating in the federal learning, fusing the splicing result by using a trained linear classifier, which is specifically represented as:
Figure BDA0003281854820000091
wherein,
Figure BDA0003281854820000092
[;]indicating that the two score vectors are concatenated by column,
Figure BDA0003281854820000093
Figure BDA0003281854820000094
score vectors representing training of a single knowledge-graph not participating in federal learning,
Figure BDA0003281854820000095
a scoring vector representing federal knowledge graph training participating in federal learning.
In the embodiment, the linear classifier is trained by interval-loss ranking, so that positive triples are ranked higher than negative triples; the loss function adopted in training is:

L_f(h, r, t) = max(0, β − s(h, r, t) + s(h, r, t'))

where s(h, r, t) is the fusion score of the positive triple (h, r, t) under the linear classifier, s(h, r, t') is the fusion score of the negative triple (h, r, t') under the linear classifier, and β is an interval (margin) parameter ranging over the real numbers. The training goal of the linear classifier is to minimize this triple model-fusion loss.
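The fusion score and the interval loss can be sketched as follows; `w` stands in for the linear classifier's trained weights, and all names and values are illustrative, not from the patent:

```python
import numpy as np

def fused_score(w, s_local, s_fed):
    # s(h, r, t) = w^T [s_local ; s_fed]: the non-federated and federated
    # score vectors are concatenated and fused by a linear classifier
    return float(w @ np.concatenate([s_local, s_fed]))

def interval_loss(s_pos, s_neg, beta=1.0):
    # max(0, beta - s(h,r,t) + s(h,r,t')): push the positive triple's fused
    # score above the negative's by at least the margin beta (placeholder value)
    return max(0.0, beta - s_pos + s_neg)
```

When the positive triple already outranks the negative by more than β the loss is zero, so training focuses on pairs the fused model still orders incorrectly.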
The knowledge graph representation learning method based on federated learning provided by the embodiment can be applied to financial knowledge graphs and related applications that require privacy protection. For example, different financial institutions construct and maintain different relations between entities including but not limited to customers, companies and financial products; the relations may include purchasing, following, holding, and so on. For the task of classifying customers, each financial institution classifies customers on its own, but a better classification effect could be obtained if the data of different institutions could be utilized; under privacy-protection requirements, however, different financial institutions are unwilling to exchange data directly. The federated-learning-based knowledge graph representation learning method presented herein can effectively address this problem: the different financial institutions are regarded as clients, and a trusted institution, such as a government body, is regarded as the server. In the end, each client learns its own embedded representations, which fuse the information of the other clients on the basis of privacy protection, so that a better customer classification effect can be obtained.
In the knowledge graph representation learning method based on federated learning provided in the above embodiment, the local knowledge graph of each client is introduced into a federated setting and the conventional knowledge graph completion task is recast as a new task, the federated knowledge graph completion task, so that the privacy of each client's knowledge graph is guaranteed while multiple knowledge graphs are utilized; furthermore, the federated learning manner improves the accuracy of the entity embedding matrices determined through federated learning.
The above-mentioned embodiments are intended to illustrate the technical solutions and advantages of the present invention, and it should be understood that the above-mentioned embodiments are only the most preferred embodiments of the present invention, and are not intended to limit the present invention, and any modifications, additions, equivalents, etc. made within the scope of the principles of the present invention should be included in the scope of the present invention.

Claims (8)

1. A knowledge graph representation learning method based on federal learning is characterized in that a system for realizing the method comprises a plurality of clients and a central server which is respectively communicated with each client, each client is provided with a local knowledge graph, and the method comprises the following steps:
(1) the central server maintains an entity list of all knowledge maps, defines a permutation matrix and a presence vector for each client, initializes an entity embedded matrix, screens the entity embedded matrix according to the permutation matrix to determine an entity embedded matrix corresponding to each client and sends the entity embedded matrix to the clients;
(2) and circulating the federal learning process: the client side performs knowledge graph representation learning by adopting a local knowledge graph according to the received entity embedded matrix to update the entity embedded matrix, and uploads the updated entity embedded matrix to the central server; the central server aggregates all entity embedded matrixes uploaded by all the clients according to the permutation matrixes and the existence vectors, screens the aggregated entity embedded matrixes according to the permutation matrixes to determine entity embedded matrixes corresponding to all the clients and issues the entity embedded matrixes to the clients;
(3) and each client side fuses the entity embedded matrix determined by participating in the federal learning and the entity embedded matrix determined by not participating in the federal learning so as to determine a final entity embedded matrix.
2. The knowledge graph representation learning method based on federated learning of claim 1, wherein the permutation matrix defined by the central server for each client is

P^c ∈ {0, 1}^(n × n_c), c = 1, 2, ..., C

where each element P^c_ij of the permutation matrix P^c represents the correspondence between entities in the central server's entity list and entities of the client: P^c_ij = 1 indicates that the i-th entity in the entity list is the j-th entity of the c-th client, and P^c_ij = 0 indicates that it is not; C denotes the total number of clients, n the number of entities in the entity list, and n_c the number of entities of the c-th client;

the presence vector defined by the central server for each client is

v^c ∈ {0, 1}^n, c = 1, 2, ..., C

where each element v^c_i of the presence vector v^c indicates whether the i-th entity exists at the c-th client: v^c_i = 1 indicates that the c-th client has the i-th entity, and v^c_i = 0 indicates that it does not.
3. The knowledge graph representation learning method based on federated learning of claim 1, wherein the entity embedding matrix is screened according to the permutation matrix to determine the entity embedding matrix corresponding to each client using the following formula:

E^c_t = (P^c)^T · E_t

where E_t is the entity embedding matrix aggregated by the central server in the t-th round of federated learning, P^c is the permutation matrix, T denotes transposition, and E^c_t is the entity embedding matrix corresponding to the c-th client in the t-th round; when t = 0, E_0 is the entity embedding matrix initialized by the central server and E^c_0 is the initial entity embedding matrix for the c-th client.
4. The knowledge graph representation learning method based on federated learning of claim 1, wherein, when the client performs knowledge graph representation learning on its local knowledge graph with the received entity embedding matrix, for each triplet (h, r, t) consisting of a head entity, a relation and a tail entity in the knowledge graph, a scoring function is used to construct a loss function, the embedding model is trained with the loss function, and the entity embedding matrix is updated at the same time;

wherein the loss function is:

L(h, r, t) = -log σ(γ + f_r(h, r, t)) - Σ_{i=1}^{m} p(h, r, t'_i) log σ(-f_r(h, r, t'_i) - γ)

wherein L(h, r, t) is the loss function of the triplet (h, r, t), f_r(h, r, t) represents the scoring function of the triplet (h, r, t), f_r(h, r, t'_i) denotes the scoring function of the triplet (h, r, t'_i) obtained by negatively sampling the tail entity t of the triplet (h, r, t) as the i-th entity t'_i, m is the number of negative samples, γ represents a margin value whose value range is the whole real number set, σ(·) represents the sigmoid function, and p(h, r, t'_i) denotes the negative-sample weight of the triplet (h, r, t'_i).
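A minimal sketch of the claim-4 loss, assuming the usual self-adversarial negative-sampling reading (as in RotatE/FedE) in which a higher score means a more plausible triple; the sampling temperature `alpha` is an assumption not spelled out in the claim:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def self_adversarial_loss(pos_score, neg_scores, gamma=6.0, alpha=1.0):
    """Sketch: L = -log sig(gamma + f(pos))
                 - sum_i p_i * log sig(-f(neg_i) - gamma),
    with p_i the softmax of the negative scores (self-adversarial
    weights p(h, r, t'_i))."""
    p = np.exp(alpha * neg_scores)
    p = p / p.sum()                      # negative-sample weights
    pos_term = -np.log(sigmoid(gamma + pos_score))
    neg_term = -(p * np.log(sigmoid(-neg_scores - gamma))).sum()
    return pos_term + neg_term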
5. The knowledge graph representation learning method based on federated learning of claim 4, wherein the embedding model adopts the TransE model, with the corresponding scoring function f_r(h, r, t) = -‖h + r - t‖; or the embedding model adopts the DistMult model, with the corresponding scoring function f_r(h, r, t) = h^T diag(r) t; or the embedding model adopts the ComplEx model, with the corresponding scoring function f_r(h, r, t) = Re(h^T diag(r) t̄); or the embedding model adopts the RotatE model, with the corresponding scoring function f_r(h, r, t) = -‖h ∘ r - t‖;

wherein diag(·) denotes the diagonal matrix formed from a vector, t̄ represents the complex conjugate of the entity t, Re(·) represents the real part, and ∘ represents element-wise multiplication.
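The four scoring functions of claim 5 are each one line of linear algebra; the sketch below uses illustrative 2-dimensional vectors (ComplEx and RotatE operate on complex vectors, with RotatE relations constrained to unit modulus):

```python
import numpy as np

# Scoring-function sketches for claim 5 (higher score = more plausible).
def transe(h, r, t):                   # f = -||h + r - t||
    return -np.linalg.norm(h + r - t)

def distmult(h, r, t):                 # f = h^T diag(r) t
    return float(h @ np.diag(r) @ t)

def complex_score(h, r, t):            # f = Re(h^T diag(r) conj(t))
    return float(np.real(np.sum(h * r * np.conj(t))))

def rotate(h, r, t):                   # f = -||h o r - t||, |r_i| = 1
    return -np.linalg.norm(h * r - t)
```

For TransE the score is maximal (zero) exactly when t = h + r, and for RotatE exactly when t = h ∘ r, i.e. when the relation rotates the head onto the tail in the complex plane.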
6. The knowledge graph representation learning method based on federated learning of claim 1, wherein the central server aggregates the entity embedding matrices uploaded by all clients according to the permutation matrices and presence vectors using the following formula:

E^t ← (1 ⊘ Σ_{c=1}^{C} V^c) ⊗ Σ_{c=1}^{C} P_c E_c^t

wherein P_c represents the permutation matrix of the c-th client, V^c represents the presence vector of the c-th client, E_c^t represents the entity embedding matrix corresponding to the c-th client in the t-th round of federated learning, E^t represents the entity embedding matrix aggregated by the central server in the t-th round of federated learning, 1 represents an all-ones vector, ⊘ represents element-wise division, ⊗ represents element-wise multiplication, and ← represents assignment.
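A sketch of the claim-6 aggregation: each server entity's embedding is averaged over only the clients that hold it. The zero-division guard for entities held by no client is an assumption, since the claim does not state how that case is handled:

```python
import numpy as np

def aggregate(client_Es, Ps, Vs):
    """Claim-6 aggregation sketch:
    E <- (1 / sum_c V^c) elementwise* (sum_c P_c @ E_c),
    averaging each server entity row over the clients holding it."""
    n, d = Ps[0].shape[0], client_Es[0].shape[1]
    num = np.zeros((n, d))
    den = np.zeros(n)
    for E_c, P_c, V_c in zip(client_Es, Ps, Vs):
        num += P_c @ E_c        # scatter client rows back to server rows
        den += V_c              # count how many clients hold each entity
    den = np.maximum(den, 1.0)  # avoid division by zero (assumption)
    return num / den[:, None]
```

An entity shared by two clients ends up as the arithmetic mean of the two uploaded rows; an entity held by one client is passed through unchanged.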
7. The knowledge graph representation learning method based on federated learning of claim 1, wherein each client fuses the entity embedding matrix determined by participating in federated learning with the entity embedding matrix determined without participating in federated learning as follows:

the entity embedding matrix determined by participating in federated learning and the entity embedding matrix determined without participating in federated learning are concatenated, and the concatenated result is fused by a trained linear classifier.
8. The knowledge graph representation learning method based on federated learning of claim 7, wherein the linear classifier is trained with a margin ranking loss such that positive triplets are scored higher than negative triplets, the loss function employed in the training process being:

L_f(h, r, t) = max(0, β - s(h, r, t) + s(h, r, t'))

wherein s(h, r, t) represents the fused score of the positive triplet (h, r, t) under the linear classifier, s(h, r, t') represents the fused score of the negative triplet (h, r, t') under the linear classifier, and β represents the margin parameter, whose value range is the real number set.
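A sketch of the claim-7/claim-8 fusion step: the federated and non-federated scores of a triple are concatenated into a feature vector, a linear classifier w produces the fused score s = w · feat, and training minimizes the margin ranking loss. The two-element feature layout is an assumption for illustration:

```python
import numpy as np

def margin_rank_loss(w, feat_pos, feat_neg, beta=1.0):
    """Claim-8 sketch: fused scores s = w . feat for a positive and a
    negative triple, penalized unless the positive outranks the
    negative by at least the margin beta."""
    s_pos = float(w @ feat_pos)   # fused score of the positive triplet
    s_neg = float(w @ feat_neg)   # fused score of the negative triplet
    return max(0.0, beta - s_pos + s_neg)
```

The loss is zero once s(h, r, t) exceeds s(h, r, t') by β, so gradient steps on w only act on pairs that are still mis-ranked or inside the margin.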
CN202111134706.3A 2021-09-27 2021-09-27 Knowledge graph representation method based on federal learning Pending CN113886598A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111134706.3A CN113886598A (en) 2021-09-27 2021-09-27 Knowledge graph representation method based on federal learning

Publications (1)

Publication Number Publication Date
CN113886598A true CN113886598A (en) 2022-01-04

Family

ID=79006933

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111134706.3A Pending CN113886598A (en) 2021-09-27 2021-09-27 Knowledge graph representation method based on federal learning

Country Status (1)

Country Link
CN (1) CN113886598A (en)


Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20180092194A (en) * 2017-02-08 2018-08-17 경북대학교 산학협력단 Method and system for embedding knowledge gragh reflecting logical property of relations, recording medium for performing the method
CN111767411A (en) * 2020-07-01 2020-10-13 深圳前海微众银行股份有限公司 Knowledge graph representation learning optimization method and device and readable storage medium
CN111858955A (en) * 2020-07-01 2020-10-30 石家庄铁路职业技术学院 Knowledge graph representation learning enhancement method and device based on encrypted federated learning
CN112200321A (en) * 2020-12-04 2021-01-08 同盾控股有限公司 Inference method, system, device and medium based on knowledge federation and graph network


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
MINGYANG CHEN et al.: "FedE: Embedding Knowledge Graphs in Federated Setting", ARXIV, 24 October 2020 (2020-10-24), pages 1 - 11 *
CHEN Xi; CHEN Huajun; ZHANG Wen: "Rule-Enhanced Knowledge Graph Representation Learning Method", Technology Intelligence Engineering, no. 01, 15 February 2017 (2017-02-15) *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115062159A (en) * 2022-06-13 2022-09-16 西南交通大学 Multi-granularity dynamic knowledge graph embedded model construction method based on federal learning
CN115062159B (en) * 2022-06-13 2024-05-24 西南交通大学 Multi-granularity event early warning dynamic knowledge graph embedding model construction method based on federal learning
CN116757275A (en) * 2023-06-07 2023-09-15 京信数据科技有限公司 Knowledge graph federal learning device and method
CN116757275B (en) * 2023-06-07 2024-06-11 京信数据科技有限公司 Knowledge graph federal learning device and method
CN116502709A (en) * 2023-06-26 2023-07-28 浙江大学滨江研究院 Heterogeneous federal learning method and device
CN116703553A (en) * 2023-08-07 2023-09-05 浙江鹏信信息科技股份有限公司 Financial anti-fraud risk monitoring method, system and readable storage medium
CN116703553B (en) * 2023-08-07 2023-12-05 浙江鹏信信息科技股份有限公司 Financial anti-fraud risk monitoring method, system and readable storage medium
CN116821375A (en) * 2023-08-29 2023-09-29 之江实验室 Cross-institution medical knowledge graph representation learning method and system
CN116821375B (en) * 2023-08-29 2023-12-22 之江实验室 Cross-institution medical knowledge graph representation learning method and system

Similar Documents

Publication Publication Date Title
CN113886598A (en) Knowledge graph representation method based on federal learning
US20230039182A1 (en) Method, apparatus, computer device, storage medium, and program product for processing data
Criado et al. Non-iid data and continual learning processes in federated learning: A long road ahead
CN112861967B (en) Social network abnormal user detection method and device based on heterogeneous graph neural network
Yang et al. Friend or frenemy? Predicting signed ties in social networks
CN112232925A (en) Method for carrying out personalized recommendation on commodities by fusing knowledge maps
CN113609398B (en) Social recommendation method based on heterogeneous graph neural network
CN112015749A (en) Method, device and system for updating business model based on privacy protection
CN114677200B (en) Business information recommendation method and device based on multiparty high-dimension data longitudinal federation learning
CN113535825A (en) Cloud computing intelligence-based data information wind control processing method and system
CN111291125B (en) Data processing method and related equipment
Xu et al. Trust propagation and trust network evaluation in social networks based on uncertainty theory
CN112257841A (en) Data processing method, device and equipment in graph neural network and storage medium
CN115686868B (en) Cross-node-oriented multi-mode retrieval method based on federated hash learning
CN111858928A (en) Social media rumor detection method and device based on graph structure counterstudy
CN112861006A (en) Recommendation method and system fusing meta-path semantics
CN115062732A (en) Resource sharing cooperation recommendation method and system based on big data user tag information
CN115098692A (en) Cross-domain recommendation method and device, electronic equipment and storage medium
Yin et al. An efficient recommendation algorithm based on heterogeneous information network
Yang et al. Federated continual learning via knowledge fusion: A survey
Sai et al. Machine un-learning: an overview of techniques, applications, and future directions
CN116541592A (en) Vector generation method, information recommendation method, device, equipment and medium
CN115600642A (en) Streaming media-oriented decentralized federal learning method based on neighbor trust aggregation
Madhavi et al. Gradient boosted decision tree (GBDT) AND Grey Wolf Optimization (GWO) based intrusion detection model
Sheu et al. On the potential of a graph attention network in money laundering detection

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination