CN111026865A - Relation alignment method, device and equipment of knowledge graph and storage medium - Google Patents

Relation alignment method, device and equipment of knowledge graph and storage medium Download PDF

Info

Publication number
CN111026865A
CN111026865A CN201910992324.0A CN201910992324A CN111026865A CN 111026865 A CN111026865 A CN 111026865A CN 201910992324 A CN201910992324 A CN 201910992324A CN 111026865 A CN111026865 A CN 111026865A
Authority
CN
China
Prior art keywords
relationship
vector
relation
sub
initial
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910992324.0A
Other languages
Chinese (zh)
Other versions
CN111026865B (en
Inventor
凌岚
刘嘉伟
于修铭
陈晨
李可
汪伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN201910992324.0A priority Critical patent/CN111026865B/en
Priority to PCT/CN2019/119305 priority patent/WO2021072891A1/en
Publication of CN111026865A publication Critical patent/CN111026865A/en
Application granted granted Critical
Publication of CN111026865B publication Critical patent/CN111026865B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367Ontology

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Animal Behavior & Ethology (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to the field of artificial intelligence, and discloses a relation alignment method, a device, equipment and a storage medium of a knowledge graph. The method comprises the following steps: acquiring triple group data in the knowledge graph; converting the ternary group data through a preset vector conversion model to obtain a relationship initial vector of each relationship; clustering all the relation initial vectors through a preset clustering algorithm to obtain a relation clustering vector of each relation; clustering initial relation vectors of the same relation in the triple data through a preset clustering algorithm to obtain a relation sub-vector of each relation; calculating the relationship similarity between the relationships in any two triples based on the relationship initial vector, the relationship clustering vector and the relationship sub-vector; and classifying the relation according to the relation similarity.

Description

Relation alignment method, device and equipment of knowledge graph and storage medium
Technical Field
The invention relates to the field of artificial intelligence, in particular to a relation alignment method, a device, equipment and a storage medium of a knowledge graph.
Background
Knowledge graph technology is increasingly becoming the basis of artificial intelligence, and is an important method for machine understanding of natural language and construction of knowledge network. In recent years, different knowledge graph fusion is invisibly created, and the knowledge graph fusion method helps workers in different working fields to retrieve related contents, so that the working quality and the working efficiency are improved.
The legal knowledge system is very complex and is a combination of various logics. In order to construct a complete knowledge system, knowledge from various fields needs to be fused, the knowledge system is continuously updated, new entities and relationships are added in the updating process, and a problem and a challenge are presented in how to correspond the new relationships to the existing knowledge graph. Relationships of the same meaning may have different expressions, such as "country of the country" and "nationality", and the same relationship also has multiple meanings, such as < new york, belong to, usa > and < display, belong to, tv > and partial relationship. However, no relevant research on relation classification exists at present, and the efficiency of relation classification in the knowledge graph is low.
Disclosure of Invention
The invention provides a relation alignment method, a device, equipment and a storage medium of a knowledge graph, which can determine whether each relation represents the same meaning and classify the relation by vectorizing the relation based on the upper and lower hierarchical structure information of the relation and calculating the similarity between the relation vectors, thereby improving the efficiency of the relation alignment in the knowledge graph.
A first aspect of an embodiment of the present invention provides a relation alignment method for a knowledge graph, including: acquiring triple data in a knowledge graph, wherein the triple comprises a main body, a relation and an attribute; converting the triple data through a preset vector conversion model to obtain a relationship initial vector of each relationship; clustering all the relation initial vectors through a preset clustering algorithm to obtain a relation clustering vector of each relation; clustering the initial relation vectors of the same relation in the triple data through the preset clustering algorithm to obtain a relation sub-vector of each relation; calculating the relationship similarity between the relationships in any two triples based on the relationship initial vector, the relationship cluster vector and the relationship sub-vector; and classifying the relation according to the relation similarity.
Optionally, in a first implementation manner of the first aspect of the embodiment of the present invention, the converting the triple data by using a preset vector conversion model to obtain a relationship initial vector of each relationship includes: mapping subjects and objects of a target triplet into low-dimensional vectors
Figure BDA0002238651840000021
And
Figure BDA0002238651840000022
the target triple is any triple in the triple data, and i is the total number of the triples in the triple data; adjustment by preset loss function
Figure BDA0002238651840000023
And
Figure BDA0002238651840000024
up to
Figure BDA0002238651840000025
At a minimum, wherein
Figure BDA0002238651840000026
When said
Figure BDA0002238651840000027
At the minimum, will
Figure BDA0002238651840000028
Said target triplets arranged as vector type, said
Figure BDA0002238651840000029
Is the relationship initial vector.
Optionally, in a second implementation manner of the first aspect of the embodiment of the present invention, the clustering all the relationship initial vectors by using a preset clustering algorithm to obtain a relationship clustering vector of each relationship includes: selecting k relation initial vectors from all the relation initial vectors as a first centroid vector, wherein k is an integer greater than 1; calculating Euclidean distances between each relationship initial vector and k first centroid vectors, and determining a first clustering vector of each relationship initial vector, wherein the first clustering vector is the first centroid vector with the smallest Euclidean distance with one relationship initial vector; classifying each relationship initial vector and the corresponding first clustering vector into a cluster class to obtain k first cluster classes; calculating a second centroid vector of each first cluster class based on a mean principle; calculating Euclidean distances between each relationship initial vector and k second centroid vectors, and determining a second clustering vector of each relationship initial vector, wherein the second clustering vector is the second centroid vector with the smallest Euclidean distance with one relationship initial vector; classifying each relationship initial vector and the corresponding second cluster vector into a cluster to obtain k second clusters; calculating a third centroid vector of each second cluster class based on a mean principle; and repeatedly calculating the (N + 1) th centroid vector of each N-th cluster until the N-th centroid vector is the same as the (N + 1) th centroid vector, taking the N-th cluster as k clusters, taking the (N + 1) th centroid vector as a relational clustering vector, and taking N as an integer greater than 1.
Optionally, in a third implementation manner of the first aspect of the embodiment of the present invention, the clustering the initial relationship vectors of the same relationship in the triple data by using the preset clustering algorithm to obtain a relationship sub-vector of each relationship includes: selecting n initial relation vectors from all the initial relation vectors of a target relation as n first sub-centroid vectors, wherein the target relation is the same at least one relation, and n is an integer greater than 1; calculating Euclidean distances between each relation initial vector of the target relation and n first sub-centroid vectors, and determining a first sub-vector of each relation initial vector of the target relation, wherein the first sub-vector is the first sub-centroid vector with the smallest Euclidean distance with one relation initial vector; taking each relationship initial vector of the target relationship and the corresponding first sub-vector as a sub-cluster class to obtain n first sub-cluster classes; calculating a second sub-centroid vector of each first sub-cluster class based on a mean principle; calculating Euclidean distances between each relationship initial vector of the target relationship and n second sub-centroid vectors, and determining a second sub-vector of each relationship initial vector of the target relationship, wherein the second sub-vector is a second sub-centroid vector with the minimum Euclidean distance from one relationship initial vector; classifying each relationship initial vector of the target relationship and the corresponding second sub-vector into a cluster class to obtain n second sub-cluster classes; calculating a third sub-centroid vector of each second sub-cluster class based on a mean principle; repeatedly calculating the M +1 sub-centroid vector of each Mth sub-cluster until the Mth sub-centroid vector is the same as the M +1 sub-centroid vector, taking n Mth sub-clusters as the subclasses of the target relationship, taking the M +1 sub-centroid vector as the relationship sub-vector of the target relationship, and taking M as an integer greater than 1.
Optionally, in a fourth implementation manner of the first aspect of the embodiment of the present invention, the calculating a relationship similarity between any two relationships based on the relationship initial vector, the relationship cluster vector, and the relationship sub-vector includes: generating a relationship vector of the relationship in each triplet through a preset formula
Figure BDA0002238651840000031
Wherein the content of the first and second substances,
Figure BDA0002238651840000032
the relationship vector is represented by a vector of the relationship,
Figure BDA0002238651840000033
representing the set of relational cluster vectors in a cluster of,
Figure BDA0002238651840000034
the initial vector of said relationship is represented by,
Figure BDA0002238651840000035
representing the relationship subvectors;
calculating the relationship similarity by cosine distance:
Figure BDA0002238651840000036
wherein
Figure BDA0002238651840000037
And
Figure BDA0002238651840000038
the relationship vector representing any two relationships.
Optionally, in a fifth implementation manner of the first aspect of the embodiment of the present invention, the classifying the relationship according to the relationship similarity includes: determining relationship similarity results among all relationships according to the relationship similarity, wherein the relationship similarity results comprise similarity and dissimilarity; if the relationship similarity results of the relationship pairs to be classified are similar, and the relationship pairs to be classified are any two relationships, bisecting the relationship pairs to be classified into a cluster; and if the similar results of the relations of the relation pairs to be classified are not similar, keeping the original clustering of each relation in the relation pairs to be classified.
Optionally, the determining a relationship similarity result between all relationships according to the relationship similarity includes: judging whether the relationship similarity of the relationship pairs to be classified exceeds a preset threshold value; if the relationship similarity of the relationship pair to be classified exceeds a preset threshold, determining that the relationship pair to be classified is similar; and if the relationship similarity of the relationship pair to be classified does not exceed the preset threshold, determining that the relationship pair to be classified is not similar.
A second aspect of an embodiment of the present invention provides a relation alignment apparatus for a knowledge graph, including: the acquisition unit is used for acquiring triple data in the knowledge graph, wherein the triple comprises a main body, a relation and an attribute; the conversion unit is used for converting the triple data through a preset vector conversion model to obtain a relationship initial vector of each relationship; the first clustering unit is used for clustering all the relation initial vectors through a preset clustering algorithm to obtain a relation clustering vector of each relation; the second clustering unit is used for clustering the initial relationship vectors of the same relationship in the triple data through the preset clustering algorithm to obtain a relationship sub-vector of each relationship; the calculating unit is used for calculating the relationship similarity between the relationships in any two triples based on the relationship initial vector, the relationship clustering vector and the relationship sub-vector; and the classification unit is used for classifying the relationship according to the relationship similarity.
Optionally, in a first implementation manner of the second aspect of the embodiment of the present invention, the conversion unit is specifically configured to: mapping subjects and objects of a target triplet into low-dimensional vectors
Figure BDA0002238651840000041
And
Figure BDA0002238651840000042
the target triple is any triple in the triple data, and i is the total number of the triples in the triple data; adjustment by preset loss function
Figure BDA0002238651840000043
And
Figure BDA0002238651840000044
up to
Figure BDA0002238651840000045
At a minimum, wherein
Figure BDA0002238651840000046
When said
Figure BDA0002238651840000047
At the minimum, will
Figure BDA0002238651840000048
Said target triplets arranged as vector type, said
Figure BDA0002238651840000049
Is the relationship initial vector.
Optionally, in a second implementation manner of the second aspect of the embodiment of the present invention, the first clustering unit is specifically configured to: selecting k relation initial vectors from all the relation initial vectors as a first centroid vector, wherein k is an integer greater than 1; calculating Euclidean distances between each relationship initial vector and k first centroid vectors, and determining a first clustering vector of each relationship initial vector, wherein the first clustering vector is the first centroid vector with the smallest Euclidean distance with one relationship initial vector; classifying each relationship initial vector and the corresponding first clustering vector into a cluster class to obtain k first cluster classes; calculating a second centroid vector of each first cluster class based on a mean principle; calculating Euclidean distances between each relationship initial vector and k second centroid vectors, and determining a second clustering vector of each relationship initial vector, wherein the second clustering vector is the second centroid vector with the smallest Euclidean distance with one relationship initial vector; classifying each relationship initial vector and the corresponding second cluster vector into a cluster to obtain k second clusters; calculating a third centroid vector of each second cluster class based on a mean principle; and repeatedly calculating the (N + 1) th centroid vector of each N-th cluster until the N-th centroid vector is the same as the (N + 1) th centroid vector, taking the N-th cluster as k clusters, taking the (N + 1) th centroid vector as a relational clustering vector, and taking N as an integer greater than 1.
Optionally, in a third implementation manner of the second aspect of the embodiment of the present invention, the second clustering unit is specifically configured to: selecting n initial relation vectors from all the initial relation vectors of a target relation as n first sub-centroid vectors, wherein the target relation is the same at least one relation, and n is an integer greater than 1; calculating Euclidean distances between each relation initial vector of the target relation and n first sub-centroid vectors, and determining a first sub-vector of each relation initial vector of the target relation, wherein the first sub-vector is the first sub-centroid vector with the smallest Euclidean distance with one relation initial vector; taking each relationship initial vector of the target relationship and the corresponding first sub-vector as a sub-cluster class to obtain n first sub-cluster classes; calculating a second sub-centroid vector of each first sub-cluster class based on a mean principle; calculating Euclidean distances between each relationship initial vector of the target relationship and n second sub-centroid vectors, and determining a second sub-vector of each relationship initial vector of the target relationship, wherein the second sub-vector is a second sub-centroid vector with the minimum Euclidean distance from one relationship initial vector; classifying each relationship initial vector of the target relationship and the corresponding second sub-vector into a cluster class to obtain n second sub-cluster classes; calculating a third sub-centroid vector of each second sub-cluster class based on a mean principle; repeatedly calculating the M +1 sub-centroid vector of each Mth sub-cluster until the Mth sub-centroid vector is the same as the M +1 sub-centroid vector, taking n Mth sub-clusters as the subclasses of the target relationship, taking the M +1 sub-centroid vector as the relationship sub-vector of the target relationship, and taking M as an integer greater than 1.
Optionally, in a fourth implementation manner of the second aspect of the embodiment of the present invention, the calculation unit includes: a generating module, configured to generate a relationship vector of the relationship in each triplet according to a preset formula
Figure BDA0002238651840000051
Wherein the content of the first and second substances,
Figure BDA0002238651840000052
the relationship vector is represented by a vector of the relationship,
Figure BDA0002238651840000053
representing the set of relational cluster vectors in a cluster of,
Figure BDA0002238651840000054
the initial vector of said relationship is represented by,
Figure BDA0002238651840000055
representing the relationship subvectors; the calculating module is used for calculating the relation similarity through the cosine distance:
Figure BDA0002238651840000056
wherein
Figure BDA0002238651840000057
And
Figure BDA0002238651840000058
the relationship vector representing any two relationships.
Optionally, in a fifth implementation manner of the second aspect of the embodiment of the present invention, the classifying unit includes: the determining module is used for determining relationship similarity results among all relationships according to the relationship similarity, wherein the relationship similarity results comprise similarity and dissimilarity; the classification module is used for bisecting the relation to be classified into a cluster if the relation similarity results of the relation pairs to be classified are similar and the relation pairs to be classified are any two relations; and the classification module is also used for keeping the original clustering of each relation in the relation pairs to be classified if the similar results of the relations in the relation pairs to be classified are not similar.
Optionally, in a sixth implementation manner of the second aspect of the embodiment of the present invention, the determining module is specifically configured to: judging whether the relationship similarity of the relationship pairs to be classified exceeds a preset threshold value; if the relationship similarity of the relationship pair to be classified exceeds a preset threshold, determining that the relationship pair to be classified is similar; and if the relationship similarity of the relationship pair to be classified does not exceed the preset threshold, determining that the relationship pair to be classified is not similar.
A third aspect of the embodiments of the present invention provides a relation alignment apparatus of a knowledge graph, including a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor implements the relation alignment method of the knowledge graph according to any one of the above embodiments when executing the computer program.
A fourth aspect of an embodiment of the present invention provides a computer-readable storage medium, including instructions, which, when executed on a computer, cause the computer to perform the steps of the relation alignment method of a knowledge-graph according to any one of the above embodiments.
In the technical scheme provided by the embodiment of the invention, triple data in a knowledge graph is obtained, wherein the triple comprises a main body, a relation and an attribute; converting the triple data through a preset vector conversion model to obtain a relationship initial vector of each relationship; clustering all the relation initial vectors through a preset clustering algorithm to obtain a relation clustering vector of each relation; clustering the initial relation vectors of the same relation in the triple data through the preset clustering algorithm to obtain a relation sub-vector of each relation; calculating the relationship similarity between the relationships in any two triples based on the relationship initial vector, the relationship cluster vector and the relationship sub-vector; and classifying the relation according to the relation similarity. According to the embodiment of the invention, the upper and lower hierarchical structures of the relation of the knowledge graph are constructed, the relation is vectorized based on the upper and lower hierarchical structure information, whether each relation represents the same meaning or not is determined by calculating the similarity between the relation vectors, and the relation is classified, so that the efficiency of relation alignment in the knowledge graph is improved.
Drawings
FIG. 1 is a diagram of an embodiment of a relation alignment method of a knowledge graph in an embodiment of the present invention;
FIG. 2 is a diagram of another embodiment of the relation alignment method of the knowledge-graph in the embodiment of the invention;
FIG. 3 is a diagram of an embodiment of a relation alignment apparatus of a knowledge-graph in an embodiment of the present invention;
FIG. 4 is a schematic diagram of another embodiment of a relation alignment apparatus of a knowledge-graph in an embodiment of the invention;
FIG. 5 is a diagram of an embodiment of a relation alignment apparatus of a knowledge-graph in an embodiment of the present invention.
Detailed Description
The embodiment of the invention provides a relation alignment method, a device, equipment and a storage medium of a knowledge graph, which can determine whether each relation represents the same meaning and classify the relation by vectorizing the relation based on the upper and lower hierarchical structure information of the relation and calculating the similarity between relation vectors, thereby improving the efficiency of relation alignment in the knowledge graph.
In order to make the technical field of the invention better understand the scheme of the invention, the embodiment of the invention will be described in conjunction with the attached drawings in the embodiment of the invention.
The terms "first," "second," "third," "fourth," and the like in the description and in the claims, as well as in the drawings, if any, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It will be appreciated that the data so used may be interchanged under appropriate circumstances such that the embodiments described herein may be practiced otherwise than as specifically illustrated or described herein. Furthermore, the terms "comprises," "comprising," or "having," and any variations thereof, are intended to cover non-exclusive inclusions, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Referring to fig. 1, a flowchart of a relation alignment method of a knowledge graph according to an embodiment of the present invention specifically includes:
101. and acquiring triple data in the knowledge graph, wherein the triple comprises a main body, a relation and an attribute.
The server acquires triple data in the legal knowledge graph, wherein the triple comprises a main body, a relation and an attribute. The data structure of the knowledge graph is a triplet, in which in a triplet (h, r, t), h is subject, t is object, and r is a relationship, such as a triplet (new york, belong to, usa), new york is subject, usa is object, belong to relationship. A triplet is an intuitive data structure, where a subject and an object are collectively referred to as an entity (entity), and the relationship has irreversible attributes, and in a triplet, the subject and the object cannot be interchanged.
102. And converting the ternary group data through a preset vector conversion model to obtain a relation initial vector of each relation.
And the server converts the ternary group data through a preset vector conversion model to obtain a relationship initial vector of each relationship. The preset vector conversion model converts character-type triple data into vector-type triple data, converts character-type relational data into relational initial vectors, converts character-type subject data into subject vectors, and converts character-type object data into object vectors. Each triplet corresponds to a set of vectors.
Specifically, the subject and the object of the target triple are mapped into low-dimensional vectors
Figure BDA0002238651840000071
And
Figure BDA0002238651840000072
the target triple is any triple in the triple data, and i is the total number of the triples in the triple data; adjustment by preset loss function
Figure BDA0002238651840000081
And
Figure BDA0002238651840000082
up to
Figure BDA0002238651840000083
At a minimum, wherein
Figure BDA0002238651840000084
When in use
Figure BDA0002238651840000085
At the minimum, will
Figure BDA0002238651840000086
The target triplets are set to be of the vector type,
Figure BDA0002238651840000087
for the relationship initial vector, for example, the triplet (taiwan, belong to, china), the server maps the subject "taiwan" and the object "china" into a low-dimensional vector and adjusts the vectors to be (1, 0) and (0, 1), resulting in that the relationship initial vector of the relationship "belong to" is (1, -1).
It should be noted that, in a knowledge graph, the vector form of a subject, object or relationship in any triplet is unique.
The preset vector conversion model is a conversion embedding (TransE) model, the TransE model is trained based on samples in the legal knowledge graph, and triple data in the legal knowledge graph are converted into a vector form.
103. And clustering all the relation initial vectors through a preset clustering algorithm to obtain a relation clustering vector of each relation.
And the server clusters all the relation initial vectors through a preset clustering algorithm to obtain a relation clustering vector of each relation. In the knowledge graph, different relations may have the same meaning, for example, "country of the country" and "nationality" have the same meaning, the server classifies different relations representing the same meaning into one class, and sets a relation clustering vector of the same class of relations. In the same type of relationship, the Euclidean distance between the relationship initial vector of each relationship and the relationship clustering vector of the type is the minimum.
Specifically, the server selects k relation initial vectors from all relation initial vectors as a first centroid vector, wherein k is an integer greater than 1; the server calculates the Euclidean distance between each relation initial vector and k first centroid vectors, and determines a first clustering vector of each relation initial vector, wherein the first clustering vector is the first centroid vector with the minimum Euclidean distance to one relation initial vector; the server classifies each relationship initial vector and the corresponding first clustering vector into a cluster class to obtain k first cluster classes; the server calculates a second centroid vector of each first cluster class based on a mean principle; the server calculates Euclidean distances between each relation initial vector and k second centroid vectors, and determines a second clustering vector of each relation initial vector, wherein the second clustering vector is the second centroid vector with the minimum Euclidean distance to one relation initial vector; the server classifies each relationship initial vector and the corresponding second cluster vector into a cluster class to obtain k second cluster classes; the server calculates a third centroid vector of each second cluster based on a mean principle; and the server repeatedly calculates the (N + 1) th centroid vector of each N-th cluster until the N-th centroid vector is the same as the (N + 1) th centroid vector, the N-th cluster is used as k clusters, the (N + 1) th centroid vector is used as a relational clustering vector, and N is an integer greater than 1.
For example, when the relation initial vectors include (1, 0), (1, 2), (2, 1), (0, 3), the server selects 2 relation initial vectors (1, 0) and (2, 1) as the first centroid vector, and the euclidean distance between the relation initial vector (1, 2) and the first centroid vector (1, 0) is 2 and the euclidean distance between the relation initial vector (1, 2) and the first centroid vector (2, 1)
Figure BDA0002238651840000092
The first cluster vector of the relationship initial vector (1, 2) is (2, 1), the first cluster vector of the relationship initial vector (0, 3) is (2, 1), then (1, 0) is a first cluster class, (1, 2), (0, 3) and (2, 1) are another cluster class, and the server obtains the second centroid vectors of the first cluster class by calculation as (1/1, 0/1) and (2, 1)
Figure BDA0002238651840000091
I.e., (1, 0) and (1, 2), respectivelyAnd the server determines that the first centroid vector is the same as the second centroid vector, and then the server takes the second centroid vector as a relational clustering vector, and the 2 first clusters are 2 clusters.
104. And clustering the initial relation vectors of the same relation in the triple data through a preset clustering algorithm to obtain a relation sub-vector of each relation.
And the server clusters all the initial relation vectors of each relation through a preset clustering algorithm to obtain a relation sub-vector of each relation. The server converts the multiple triples of the relation through a preset vector conversion model to obtain multiple initial relation vectors of the relation. In a triplet, the subject, the object and the relationship are all converted into vectors through a preset vector conversion model, and the vectors obtained by converting different triples are different.
It should be noted that a relationship exists in different triples, a relationship has the same or different meaning in different triples, and an initial relationship vector converted from a relationship in different triples is different, for example, a "belonging" relationship exists in a triplet (new york, belong to, usa), and in a triplet (new york, belong to, usa), the "belonging" relationship represents a regional dependency relationship, and the initial relationship vector of the "belonging" relationship is (1,0,1, 1); the "belonging" relationship also exists in the triplet (display, belonging, tv), where the "belonging" relationship represents partial and global dependencies, and the "belonging" relationship's initial relationship vector is (1,0,0, 1); the "belonging" relationship also exists in the triplet (beijing, belonged to, china), in which the initial relationship vector of "belonging" is (1,0, 1), and the meaning of "belonging" relationship in the triplet (beijing, belonged to, china) is the same as that in the triplet (new york, belonged to, usa).
Specifically, the server selects n relation initial vectors from all initial relation vectors of a target relation as n first sub-centroid vectors, the target relation is the same at least one relation, and n is an integer greater than 1; the server calculates Euclidean distances between each relation initial vector of the target relation and n first sub-centroid vectors, and determines a first sub-vector of each relation initial vector of the target relation, wherein the first sub-vector is a first sub-centroid vector with the smallest Euclidean distance with one relation initial vector; the server takes each relationship initial vector of the target relationship and the corresponding first sub-vector as a sub-cluster class to obtain n first sub-cluster classes; the server calculates a second sub-centroid vector of each first sub-cluster class based on a mean principle;
the server calculates Euclidean distances between each relation initial vector of the target relation and n second sub-centroid vectors, and determines a second sub-vector of each relation initial vector of the target relation, wherein the second sub-vector is a second sub-centroid vector with the minimum Euclidean distance from one relation initial vector; the server classifies each relationship initial vector of the target relationship and the corresponding second sub-vector into a cluster class to obtain n second sub-cluster classes; the server calculates a third sub-centroid vector of each second sub-cluster class based on a mean principle; and the server repeatedly calculates the M +1 sub-centroid vector of each M sub-cluster until the M sub-centroid vector is the same as the M +1 sub-centroid vector, the n M sub-clusters are taken as the subclasses of the target relationship, the M +1 sub-centroid vector is taken as the relationship sub-vector of the target relationship, and M is an integer greater than 1.
105. And calculating the relationship similarity between the relationships in any two triples based on the relationship initial vector, the relationship cluster vector and the relationship sub-vector.
And the server calculates the relationship similarity between any two relationships based on the relationship initial vector, the relationship clustering vector and the relationship sub-vector. The relationship similarity is calculated by combining the relationship initial vector, the relationship clustering vector and the relationship sub-vector, and whether the meaning of the relationship in each triple is the same or not can be more accurately determined by utilizing the hierarchical structure information of the relationship, so that the relationship can be classified, and the accuracy of identifying the relationship is improved.
106. And classifying the relation according to the relation similarity.
The server classifies the relations according to the relation similarity, the similar relations are divided into a cluster, and if the two relations are a cluster, the similar relations are divided into a subclass.
It can be understood that, when a new relationship is added to the knowledge graph spectrum, the server first divides the distance of the relationship by calculating the euclidean distance between the initial relationship vector of the relationship and each relationship clustering vector; then, according to whether the same relation exists in the knowledge graph or not, the subclass of the relation is divided, if the same relation exists in the knowledge graph, the subclass of the relation is divided by calculating the Euclidean distance between the initial relation vector of the relation and each relation sub-vector of the same relation; if the relationship does not have the same relationship in the knowledge graph, the initial relationship vector of the relationship is used as a subclass of the relationship.
According to the embodiment of the invention, the relations are vectorized based on the upper and lower hierarchical structure information, whether the relations represent the same meaning or not is determined by calculating the similarity between the relation vectors, and the relations are classified, so that the efficiency of relation alignment in the knowledge graph is improved.
Referring to fig. 2, another embodiment of the method for aligning relationships of a knowledge graph according to an embodiment of the present invention includes:
201. and acquiring triple data in the knowledge graph, wherein the triple comprises a main body, a relation and an attribute.
The server acquires triple data in the legal knowledge graph, wherein the triple comprises a main body, a relation and an attribute. The data structure of the knowledge graph is a triplet, in which in a triplet (h, r, t), h is subject, t is object, and r is a relationship, such as a triplet (new york, belong to, usa), new york is subject, usa is object, belong to relationship. A triplet is an intuitive data structure, where the subject and object are collectively referred to as entity, and the relationship has irreversible attributes, and in a triplet, the subject and object cannot be interchanged.
202. And converting the ternary group data through a preset vector conversion model to obtain a relation initial vector of each relation.
And the server converts the ternary group data through a preset vector conversion model to obtain a relationship initial vector of each relationship. The preset vector conversion model converts character-type triple data into vector-type triple data, converts character-type relational data into relational initial vectors, converts character-type subject data into subject vectors, and converts character-type object data into object vectors. Each triplet corresponds to a set of vectors.
Specifically, the subject and the object of the target triple are mapped into low-dimensional vectors
Figure BDA0002238651840000111
And
Figure BDA0002238651840000112
the target triple is any triple in the triple data, and i is the total number of the triples in the triple data; adjustment by preset loss function
Figure BDA0002238651840000113
And
Figure BDA0002238651840000114
up to
Figure BDA0002238651840000115
At a minimum, wherein
Figure BDA0002238651840000116
When in use
Figure BDA0002238651840000117
At the minimum, will
Figure BDA0002238651840000118
The target triplets are set to be of the vector type,
Figure BDA0002238651840000119
for a relationship initial vector, e.g., a triplet (taiwan, of, china), the server maps the subject "taiwan" and the object "china" to a low dimensional vector and adjusts the settings to (1,0) and (0, 1), resulting in a relationship initial vector of (1, -1) for the relationship "belonging".
It should be noted that, in a knowledge graph, the vector form of a subject, object or relationship in any triplet is unique.
The preset vector conversion model is a TransE model, the TransE model is trained based on samples in the legal knowledge graph, and the triple data in the legal knowledge graph are converted into a vector form.
203. And clustering all the relation initial vectors through a preset clustering algorithm to obtain a relation clustering vector of each relation.
And the server clusters all the relation initial vectors through a preset clustering algorithm to obtain a relation clustering vector of each relation. In the knowledge graph, different relations may have the same meaning, for example, "country of the country" and "nationality" have the same meaning, the server classifies different relations representing the same meaning into one class, and sets a relation clustering vector of the same class of relations. In the same type of relationship, the Euclidean distance between the relationship initial vector of each relationship and the relationship clustering vector of the type is the minimum.
Specifically, the server selects k relation initial vectors from all relation initial vectors as a first centroid vector, wherein k is an integer greater than 1; the server calculates the Euclidean distance between each relation initial vector and k first centroid vectors, and determines a first clustering vector of each relation initial vector, wherein the first clustering vector is the first centroid vector with the minimum Euclidean distance to one relation initial vector; the server classifies each relationship initial vector and the corresponding first clustering vector into a cluster class to obtain k first cluster classes; the server calculates a second centroid vector of each first cluster class based on a mean principle; the server calculates Euclidean distances between each relation initial vector and k second centroid vectors, and determines a second clustering vector of each relation initial vector, wherein the second clustering vector is the second centroid vector with the minimum Euclidean distance to one relation initial vector; the server classifies each relationship initial vector and the corresponding second cluster vector into a cluster class to obtain k second cluster classes; the server calculates a third centroid vector of each second cluster based on a mean principle; and the server repeatedly calculates the (N + 1) th centroid vector of each N-th cluster until the N-th centroid vector is the same as the (N + 1) th centroid vector, the N-th cluster is used as k clusters, the (N + 1) th centroid vector is used as a relational clustering vector, and N is an integer greater than 1.
For example, when the relation initial vectors include (1, 0), (1, 2), (2, 1), (0, 3), the server selects 2 relation initial vectors (1, 0) and (2, 1) as the first centroid vector, and the euclidean distance between the relation initial vector (1, 2) and the first centroid vector (1, 0) is 2 and the euclidean distance between the relation initial vector (1, 2) and the first centroid vector (2, 1)
Figure BDA0002238651840000122
The first cluster vector of the relationship initial vector (1, 2) is (2, 1), the first cluster vector of the relationship initial vector (0, 3) is (2, 1), then (1, 0) is a first cluster class, (1, 2), (0, 3) and (2, 1) are another cluster class, and the server obtains the second centroid vectors of the first cluster class by calculation as (1/1, 0/1) and (2, 1)
Figure BDA0002238651840000121
Namely (1, 0) and (1, 2), the server determines that the first centroid vector is the same as the second centroid vector, and then the server takes the second centroid vector as a relational cluster vector, and 2 first clusters are 2 clusters.
204. And clustering the initial relation vectors of the same relation in the triple data through a preset clustering algorithm to obtain a relation sub-vector of each relation.
And the server clusters all the initial relation vectors of each relation through a preset clustering algorithm to obtain a relation sub-vector of each relation. The server converts the multiple triples of the relation through a preset vector conversion model to obtain multiple initial relation vectors of the relation. In a triplet, the subject, the object and the relationship are all converted into vectors through a preset vector conversion model, and the vectors obtained by converting different triples are different.
It should be noted that a relationship exists in different triples, a relationship has the same or different meaning in different triples, and an initial relationship vector converted from a relationship in different triples is different, for example, a "belonging" relationship exists in a triplet (new york, belong to, usa), and in a triplet (new york, belong to, usa), the "belonging" relationship represents a regional dependency relationship, and the initial relationship vector of the "belonging" relationship is (1,0,1, 1); the "belonging" relationship also exists in the triplet (display, belonging, tv), where the "belonging" relationship represents partial and global dependencies, and the "belonging" relationship's initial relationship vector is (1,0,0, 1); the "belonging" relationship also exists in the triplet (beijing, belonged to, china), in which the initial relationship vector of "belonging" is (1,0, 1), and the meaning of "belonging" relationship in the triplet (beijing, belonged to, china) is the same as that in the triplet (new york, belonged to, usa).
Specifically, the server selects n relation initial vectors from all initial relation vectors of a target relation as n first sub-centroid vectors, the target relation is the same at least one relation, and n is an integer greater than 1; the server calculates Euclidean distances between each relation initial vector of the target relation and n first sub-centroid vectors, and determines a first sub-vector of each relation initial vector of the target relation, wherein the first sub-vector is a first sub-centroid vector with the smallest Euclidean distance with one relation initial vector; the server takes each relationship initial vector of the target relationship and the corresponding first sub-vector as a sub-cluster class to obtain n first sub-cluster classes; the server calculates a second sub-centroid vector of each first sub-cluster class based on a mean principle; the server calculates Euclidean distances between each relation initial vector of the target relation and n second sub-centroid vectors, and determines a second sub-vector of each relation initial vector of the target relation, wherein the second sub-vector is a second sub-centroid vector with the minimum Euclidean distance from one relation initial vector; the server classifies each relationship initial vector of the target relationship and the corresponding second sub-vector into a cluster class to obtain n second sub-cluster classes; the server calculates a third sub-centroid vector of each second sub-cluster class based on a mean principle; and the server repeatedly calculates the M +1 sub-centroid vector of each M sub-cluster until the M sub-centroid vector is the same as the M +1 sub-centroid vector, the n M sub-clusters are taken as the subclasses of the target relationship, the M +1 sub-centroid vector is taken as the relationship sub-vector of the target relationship, and M is an integer greater than 1.
205. Generating a relation vector of the relation in each triple through a preset formula
Figure BDA0002238651840000131
Wherein the content of the first and second substances,
Figure BDA0002238651840000132
the relationship vector is represented by a vector of relationships,
Figure BDA0002238651840000133
a cluster vector of the relationships is represented,
Figure BDA0002238651840000134
the initial vector of the relationship is represented,
Figure BDA0002238651840000135
representing the relationship subvectors.
The server generates a relation vector of the relation in each triple through a preset formula
Figure BDA0002238651840000136
Wherein the content of the first and second substances,
Figure BDA0002238651840000137
the relationship vector is represented by a vector of relationships,
Figure BDA0002238651840000138
a cluster vector of the relationships is represented,
Figure BDA0002238651840000139
the initial vector of the relationship is represented,
Figure BDA00022386518400001310
representing the relationship subvectors. The preset formula combines the relation vector with the relation clustering vector, the relation initial vector and the relation sub-vector, so that the accuracy of identifying the relation is improved.
For example, in a triplet (taiwan, belonging, china), the relationship "belonging" has the relationship initial vector of (1, -1), the relationship clustering vector of (1, 0), and the relationship word vector of (0, 1), and then in the triplet (taiwan, belonging, china), the relationship vector of the relationship "belonging" is (2, 0).
206. Calculating the relationship similarity by cosine distance:
Figure BDA0002238651840000141
wherein
Figure BDA0002238651840000142
And
Figure BDA0002238651840000143
a relationship vector representing any two relationships.
And the server calculates the relationship similarity between any two relationships based on the relationship initial vector, the relationship clustering vector and the relationship sub-vector. The relationship similarity is calculated by combining the relationship initial vector, the relationship clustering vector and the relationship sub-vector, and whether the meaning of the relationship in each triple is the same or not can be more accurately determined by utilizing the hierarchical structure information of the relationship, so that the relationship can be classified, and the accuracy of identifying the relationship is improved. The cosine similarity calculation formula of the relationship similarity is as follows:
Figure BDA0002238651840000144
wherein
Figure BDA0002238651840000145
And
Figure BDA0002238651840000146
a relationship vector representing any two relationships.
It is understood that, in addition to calculating the relationship similarity through the cosine distance, the relationship similarity may also be calculated through other distance formulas, and is not limited herein.
207. And determining a relationship similarity result between all relationships according to the relationship similarity, wherein the relationship similarity result comprises similarity and dissimilarity.
And the server determines a relationship similarity result among all the relationships according to the relationship similarity. The server obtains a preset threshold value based on the relation similarity rule of the relation vectors in the database, judges that the relations are similar when the relation similarity exceeds the preset threshold value, and judges that the relations are dissimilar when the relation similarity does not exceed the preset threshold value. Specifically, the server judges whether the relationship similarity of the relationship pair to be classified exceeds a preset threshold value; if the relationship similarity of the relationship pair to be classified exceeds a preset threshold, the server determines that the relationship pair to be classified is similar; and if the relation similarity of the relation pair to be classified does not exceed the preset threshold, the server determines that the relation pair to be classified is not similar.
208. And if the similar results of the relations of the relation pairs to be classified are similar and the relation pairs to be classified are any two relations, bisecting the relations to be classified into a cluster.
And if the relationship similarity results of the relationship pairs to be classified are similar and the relationship pairs to be classified are any two relationships, the server divides the relationship pairs to be classified into a cluster. In the same cluster, the relational clustering vectors of each relation are the same. And if the similar results of the relations of the relation pairs to be classified are not similar, the server keeps the original cluster of each relation in the relation pairs to be classified.
It should be noted that, after determining the cluster of the relationship, the server determines the subclass of the relationship according to whether the same relationship exists in the cluster and the original subclass of the same relationship. If the same relationship of a relationship does not exist in the knowledge graph, the server divides the relationship into a subclass separately.
It can be understood that when a new relationship is added to the knowledge graph, if the server determines that there is no similarity relationship of the relationship in the knowledge graph, the server treats the relationship alone as a cluster.
According to the embodiment of the invention, the relations are vectorized according to the upper and lower hierarchical structure information of the relations, whether the relations represent the same meaning or not is determined by calculating the similarity between the relation vectors, and the relations are classified, so that the efficiency of relation alignment in the knowledge graph is improved.
In the above description of the relation alignment method of the knowledge graph in the embodiment of the present invention, referring to fig. 3, the relation alignment apparatus of the knowledge graph in the embodiment of the present invention is described below, and an embodiment of the relation alignment apparatus of the knowledge graph in the embodiment of the present invention includes:
an obtaining unit 301, configured to obtain triple data in a knowledge graph, where the triple includes a main body, a relationship, and an attribute;
a conversion unit 302, configured to convert the triple data through a preset vector conversion model to obtain a relationship initial vector of each relationship;
a first clustering unit 303, configured to cluster all the relationship initial vectors through a preset clustering algorithm to obtain a relationship clustering vector of each relationship;
a second clustering unit 304, configured to cluster the initial relationship vectors of the same relationship in the triple data through the preset clustering algorithm to obtain a relationship sub-vector of each relationship;
a calculating unit 305, configured to calculate a relationship similarity between relationships in any two triples based on the relationship initial vector, the relationship cluster vector, and the relationship sub-vector;
a classifying unit 306, configured to classify the relationship according to the relationship similarity.
According to the embodiment of the invention, the relations are vectorized and the similarity among the relation vectors is calculated through the upper and lower hierarchical structure information based on the relations, whether the relations represent the same meaning or not is determined, and the relations are classified, so that the relation alignment efficiency in the knowledge graph is improved.
Referring to fig. 4, an embodiment of a relationship alignment apparatus for a knowledge graph according to an embodiment of the present invention includes:
an obtaining unit 301, configured to obtain triple data in a knowledge graph, where the triple includes a main body, a relationship, and an attribute;
a conversion unit 302, configured to convert the triple data through a preset vector conversion model to obtain a relationship initial vector of each relationship;
a first clustering unit 303, configured to cluster all the relationship initial vectors through a preset clustering algorithm to obtain a relationship clustering vector of each relationship;
a second clustering unit 304, configured to cluster the initial relationship vectors of the same relationship in the triple data through the preset clustering algorithm to obtain a relationship sub-vector of each relationship;
a calculating unit 305, configured to calculate a relationship similarity between relationships in any two triples based on the relationship initial vector, the relationship cluster vector, and the relationship sub-vector;
a classifying unit 306, configured to classify the relationship according to the relationship similarity.
Optionally, the conversion unit 302 is specifically configured to:
mapping subjects and objects of a target triplet into low-dimensional vectors
Figure BDA0002238651840000161
And
Figure BDA0002238651840000162
the target triple is any triple in the triple data, and i is the total number of the triples in the triple data; adjustment by preset loss function
Figure BDA0002238651840000163
And
Figure BDA0002238651840000164
up to
Figure BDA0002238651840000165
At a minimum, wherein
Figure BDA0002238651840000166
When said
Figure BDA0002238651840000167
At the minimum, will
Figure BDA0002238651840000168
Said target triplets arranged as vector type, said
Figure BDA0002238651840000169
Is the relationship initial vector.
Optionally, the first clustering unit 303 is specifically configured to:
selecting k relation initial vectors from all the relation initial vectors as a first centroid vector, wherein k is an integer greater than 1; calculating Euclidean distances between each relationship initial vector and k first centroid vectors, and determining a first clustering vector of each relationship initial vector, wherein the first clustering vector is the first centroid vector with the smallest Euclidean distance with one relationship initial vector; classifying each relationship initial vector and the corresponding first clustering vector into a cluster class to obtain k first cluster classes; calculating a second centroid vector of each first cluster class based on a mean principle; calculating Euclidean distances between each relationship initial vector and k second centroid vectors, and determining a second clustering vector of each relationship initial vector, wherein the second clustering vector is the second centroid vector with the smallest Euclidean distance with one relationship initial vector; classifying each relationship initial vector and the corresponding second cluster vector into a cluster to obtain k second clusters; calculating a third centroid vector of each second cluster class based on a mean principle; and repeatedly calculating the (N + 1) th centroid vector of each N-th cluster until the N-th centroid vector is the same as the (N + 1) th centroid vector, taking the N-th cluster as k clusters, taking the (N + 1) th centroid vector as a relational clustering vector, and taking N as an integer greater than 1.
Optionally, the second clustering unit 304 is specifically configured to:
selecting n initial relation vectors from all the initial relation vectors of a target relation as n first sub-centroid vectors, wherein the target relation is the same at least one relation, and n is an integer greater than 1; calculating Euclidean distances between each relation initial vector of the target relation and n first sub-centroid vectors, and determining a first sub-vector of each relation initial vector of the target relation, wherein the first sub-vector is the first sub-centroid vector with the smallest Euclidean distance with one relation initial vector; taking each relationship initial vector of the target relationship and the corresponding first sub-vector as a sub-cluster class to obtain n first sub-cluster classes; calculating a second sub-centroid vector of each first sub-cluster class based on a mean principle; calculating Euclidean distances between each relationship initial vector of the target relationship and n second sub-centroid vectors, and determining a second sub-vector of each relationship initial vector of the target relationship, wherein the second sub-vector is a second sub-centroid vector with the minimum Euclidean distance from one relationship initial vector; classifying each relationship initial vector of the target relationship and the corresponding second sub-vector into a cluster class to obtain n second sub-cluster classes; calculating a third sub-centroid vector of each second sub-cluster class based on a mean principle; repeatedly calculating the M +1 sub-centroid vector of each Mth sub-cluster until the Mth sub-centroid vector is the same as the M +1 sub-centroid vector, taking n Mth sub-clusters as the subclasses of the target relationship, taking the M +1 sub-centroid vector as the relationship sub-vector of the target relationship, and taking M as an integer greater than 1.
Optionally, the calculating unit 305 includes:
a generating module 3051, configured to generate a relationship vector of the relationship in each triplet according to a preset formula
Figure BDA0002238651840000171
Wherein the content of the first and second substances,
Figure BDA0002238651840000172
the relationship vector is represented by a vector of the relationship,
Figure BDA0002238651840000173
representing the set of relational cluster vectors in a cluster of,
Figure BDA0002238651840000174
the initial vector of said relationship is represented by,
Figure BDA0002238651840000175
representing the relationship subvectors; 3052, calculating a relationship similarity by cosine distance:
Figure BDA0002238651840000176
wherein
Figure BDA0002238651840000177
And
Figure BDA0002238651840000178
the relationship vector representing any two relationships.
Optionally, the classifying unit 306 includes:
a determining module 3061, configured to determine relationship similarity results among all relationships according to the relationship similarity, where the relationship similarity results include similarity and dissimilarity; the classification module 3062, configured to divide the relationship pair to be classified into two clusters if the relationship similarity results of the relationship pair to be classified are similar and the relationship pair to be classified is any two relationships; the classifying module 3062 is further configured to keep the original cluster of each relationship in the pair of relationships to be classified if the results of similarity of the relationship of the pair of relationships to be classified are not similar.
Optionally, the determining module 3061 is specifically configured to:
judging whether the relationship similarity of the relationship pairs to be classified exceeds a preset threshold value; if the relationship similarity of the relationship pair to be classified exceeds a preset threshold, determining that the relationship pair to be classified is similar; and if the relationship similarity of the relationship pair to be classified does not exceed the preset threshold, determining that the relationship pair to be classified is not similar.
According to the embodiment of the invention, the relations are vectorized based on the upper and lower hierarchical structure information, whether the relations represent the same meaning or not is determined by calculating the similarity between the relation vectors, and the relations are classified, so that the efficiency of relation alignment in the knowledge graph is improved.
Fig. 3 to 4 describe the relation alignment apparatus of the knowledge graph in the embodiment of the present invention in detail from the perspective of the modular functional entity, and the relation alignment apparatus of the knowledge graph in the embodiment of the present invention is described in detail from the perspective of hardware processing.
Fig. 5 is a schematic structural diagram of a relation alignment apparatus of a knowledge graph 500 according to an embodiment of the present invention, which may generate relatively large differences due to different configurations or performances, and may include one or more processors (CPUs) 501 (e.g., one or more processors) and a memory 509, and one or more storage media 508 (e.g., one or more mass storage devices) storing applications 507 or data 506. Memory 509 and storage medium 508 may be, among other things, transient storage or persistent storage. The program stored on the storage medium 508 may include one or more modules (not shown), each of which may include a series of instruction operations in a relationship alignment facility for a knowledge graph. Still further, the processor 501 may be configured to communicate with the storage medium 508 to execute a series of instruction operations in the storage medium 508 on the knowledge-graph relationship alignment apparatus 500.
The relation alignment apparatus 500 of the knowledge graph may also include one or more power supplies 502, one or more wired or wireless network interfaces 503, one or more input-output interfaces 504, and/or one or more operating systems 505, such as Windows Server, Mac OS X, Unix, Linux, FreeBSD, and the like. Those skilled in the art will appreciate that the knowledge-graph relationship alignment apparatus configuration shown in fig. 5 does not constitute a limitation of knowledge-graph relationship alignment apparatus and may include more or fewer components than shown, or some components in combination, or a different arrangement of components. The processor 501 may perform the functions of the acquisition unit 301, the conversion unit 302, the first clustering unit 303, the second clustering unit 304, the calculation unit 305, and the classification unit 306 in the above-described embodiments.
The following describes the components of the relation alignment apparatus of the knowledge-graph in detail with reference to fig. 5:
the processor 501 is a control center of the relation alignment apparatus of the knowledge map, and may perform processing according to a set relation alignment method. The processor 501 connects various portions of the overall knowledge-graph relationship alignment apparatus using various interfaces and lines to perform the knowledge-graph relationship alignment by running or executing software programs and/or modules stored in the memory 509 and invoking data stored in the memory 509 to perform various functions of the knowledge-graph relationship alignment apparatus and process the data. The storage medium 508 and the memory 509 are carriers for storing data, the storage medium 508 may be an internal memory with a small storage capacity but a high storage speed, and the memory 509 may be an external memory with a large storage capacity but a low storage speed.
The memory 509 may be used to store software programs and modules, and the processor 501 executes various functional applications and data processing of the knowledge-graph relationship alignment apparatus 500 by executing the software programs and modules stored in the memory 509. The memory 509 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program (classifying a relationship, etc.) required for at least one function, and the like; the storage data area may store data created from use of the relation alignment apparatus of the knowledge map (such as relation similarity, etc.), and the like. Further, the memory 509 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid-state storage device. The relation alignment method program of the knowledge graph provided in the embodiment of the present invention and the received data stream are stored in the memory, and when they need to be used, the processor 501 calls from the memory 509.
When loaded and executed on a computer, cause the processes or functions described in accordance with the embodiments of the invention to occur, in whole or in part. The computer may be a general purpose computer, a special purpose computer, a network of computers, or other programmable device. The computer instructions may be stored in a computer readable storage medium or transmitted from one computer readable storage medium to another computer readable storage medium, for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wire (e.g., coaxial cable, optical fiber, twisted pair) or wireless (e.g., infrared, wireless, microwave, etc.). The computer-readable storage medium can be any available medium that a computer can store or a data storage device, such as a server, a data center, etc., that is integrated with one or more available media. The usable medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., compact disk), or a semiconductor medium (e.g., Solid State Disk (SSD)), among others.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the embodiments provided in the present invention, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a read-only memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present invention, and not for limiting the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A relation alignment method of a knowledge graph is characterized by comprising the following steps:
acquiring triple data in a knowledge graph, wherein the triple comprises a main body, a relation and an attribute;
converting the triple data through a preset vector conversion model to obtain a relationship initial vector of each relationship;
clustering all the relation initial vectors through a preset clustering algorithm to obtain a relation clustering vector of each relation;
clustering the initial relation vectors of the same relation in the triple data through the preset clustering algorithm to obtain a relation sub-vector of each relation;
calculating the relationship similarity between the relationships in any two triples based on the relationship initial vector, the relationship cluster vector and the relationship sub-vector;
and classifying the relation according to the relation similarity.
2. The relation alignment method of the knowledge-graph according to claim 1, wherein the converting the triple data by a preset vector conversion model to obtain a relation initial vector of each relation comprises:
mapping subjects and objects of a target triplet into low-dimensional vectors
Figure FDA0002238651830000011
And
Figure FDA0002238651830000012
the target triple is any triple in the triple data, and i is the total number of the triples in the triple data;
adjustment by preset loss function
Figure FDA0002238651830000013
And
Figure FDA0002238651830000014
up to
Figure FDA0002238651830000015
At a minimum, wherein
Figure FDA0002238651830000016
When said
Figure FDA0002238651830000017
At the minimum, will
Figure FDA0002238651830000018
Said target triplets arranged as vector type, said
Figure FDA0002238651830000019
Is the relationship initial vector.
3. The method for aligning relationships of knowledge-graphs according to claim 1, wherein said clustering all of said relationship initial vectors by a preset clustering algorithm to obtain a relationship clustering vector of each relationship comprises:
selecting k relation initial vectors from all the relation initial vectors as a first centroid vector, wherein k is an integer greater than 1;
calculating Euclidean distances between each relationship initial vector and k first centroid vectors, and determining a first clustering vector of each relationship initial vector, wherein the first clustering vector is the first centroid vector with the smallest Euclidean distance with one relationship initial vector;
classifying each relationship initial vector and the corresponding first clustering vector into a cluster class to obtain k first cluster classes;
calculating a second centroid vector of each first cluster class based on a mean principle;
calculating Euclidean distances between each relationship initial vector and k second centroid vectors, and determining a second clustering vector of each relationship initial vector, wherein the second clustering vector is the second centroid vector with the smallest Euclidean distance with one relationship initial vector;
classifying each relationship initial vector and the corresponding second cluster vector into a cluster to obtain k second clusters;
calculating a third centroid vector of each second cluster class based on a mean principle;
and repeatedly calculating the (N + 1) th centroid vector of each N-th cluster until the N-th centroid vector is the same as the (N + 1) th centroid vector, taking the N-th cluster as k clusters, taking the (N + 1) th centroid vector as a relational clustering vector, and taking N as an integer greater than 1.
4. The method of aligning relationships of a knowledge-graph according to claim 1, wherein the clustering the initial relationship vectors of the same relationship in the triplet data by the preset clustering algorithm to obtain a relationship sub-vector of each relationship comprises:
selecting n initial relation vectors from all the initial relation vectors of a target relation as n first sub-centroid vectors, wherein the target relation is the same at least one relation, and n is an integer greater than 1;
calculating Euclidean distances between each relation initial vector of the target relation and n first sub-centroid vectors, and determining a first sub-vector of each relation initial vector of the target relation, wherein the first sub-vector is the first sub-centroid vector with the smallest Euclidean distance with one relation initial vector;
taking each relationship initial vector of the target relationship and the corresponding first sub-vector as a sub-cluster class to obtain n first sub-cluster classes;
calculating a second sub-centroid vector of each first sub-cluster class based on a mean principle;
calculating Euclidean distances between each relationship initial vector of the target relationship and n second sub-centroid vectors, and determining a second sub-vector of each relationship initial vector of the target relationship, wherein the second sub-vector is a second sub-centroid vector with the minimum Euclidean distance from one relationship initial vector;
classifying each relationship initial vector of the target relationship and the corresponding second sub-vector into a cluster class to obtain n second sub-cluster classes;
calculating a third sub-centroid vector of each second sub-cluster class based on a mean principle;
repeatedly calculating the M +1 sub-centroid vector of each Mth sub-cluster until the Mth sub-centroid vector is the same as the M +1 sub-centroid vector, taking n Mth sub-clusters as the subclasses of the target relationship, taking the M +1 sub-centroid vector as the relationship sub-vector of the target relationship, and taking M as an integer greater than 1.
5. The method of knowledge-graph relationship alignment of claim 1, wherein said calculating a relationship similarity between any two relationships based on said relationship initial vector, said relationship cluster vector and said relationship sub-vector comprises:
generating a relationship vector of the relationship in each triplet through a preset formula
Figure FDA0002238651830000031
Wherein the content of the first and second substances,
Figure FDA0002238651830000032
the relationship vector is represented by a vector of the relationship,
Figure FDA0002238651830000033
representing the set of relational cluster vectors in a cluster of,
Figure FDA0002238651830000034
the initial vector of said relationship is represented by,
Figure FDA0002238651830000035
representing the relationship subvectors;
by passingCosine distance calculation relationship similarity:
Figure FDA0002238651830000036
wherein
Figure FDA0002238651830000037
And
Figure FDA0002238651830000038
the relationship vector representing any two relationships.
6. The relation alignment method of the knowledge-graph according to any one of claims 1-5, wherein the classifying the relation according to the relation similarity comprises:
determining relationship similarity results among all relationships according to the relationship similarity, wherein the relationship similarity results comprise similarity and dissimilarity;
if the relationship similarity results of the relationship pairs to be classified are similar, and the relationship pairs to be classified are any two relationships, bisecting the relationship pairs to be classified into a cluster;
and if the similar results of the relations of the relation pairs to be classified are not similar, keeping the original clustering of each relation in the relation pairs to be classified.
7. The relation alignment method of knowledge-graph according to claim 6, wherein said determining relation similarity results between all relations according to said relation similarity comprises:
judging whether the relationship similarity of the relationship pairs to be classified exceeds a preset threshold value or not;
if the relationship similarity of the relationship pair to be classified exceeds a preset threshold, determining that the relationship pair to be classified is similar;
and if the relationship similarity of the relationship pair to be classified does not exceed the preset threshold, determining that the relationship pair to be classified is not similar.
8. A relationship alignment apparatus for a knowledge graph, comprising:
the acquisition unit is used for acquiring triple data in the knowledge graph, wherein the triple comprises a main body, a relation and an attribute;
the conversion unit is used for converting the triple data through a preset vector conversion model to obtain a relationship initial vector of each relationship;
the first clustering unit is used for clustering all the relation initial vectors through a preset clustering algorithm to obtain a relation clustering vector of each relation;
the second clustering unit is used for clustering the initial relationship vectors of the same relationship in the triple data through the preset clustering algorithm to obtain a relationship sub-vector of each relationship;
the calculating unit is used for calculating the relationship similarity between the relationships in any two triples based on the relationship initial vector, the relationship clustering vector and the relationship sub-vector;
and the classification unit is used for classifying the relationship according to the relationship similarity.
9. A relation alignment apparatus of a knowledge graph, comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the relation alignment method of a knowledge graph according to any one of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium comprising instructions that when executed on a computer cause the computer to perform a method of relationship alignment of a knowledge-graph as claimed in any one of claims 1 to 7.
CN201910992324.0A 2019-10-18 2019-10-18 Knowledge graph relationship alignment method, device, equipment and storage medium Active CN111026865B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201910992324.0A CN111026865B (en) 2019-10-18 2019-10-18 Knowledge graph relationship alignment method, device, equipment and storage medium
PCT/CN2019/119305 WO2021072891A1 (en) 2019-10-18 2019-11-19 Knowledge graph relationship alignment method, apparatus and device, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910992324.0A CN111026865B (en) 2019-10-18 2019-10-18 Knowledge graph relationship alignment method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN111026865A true CN111026865A (en) 2020-04-17
CN111026865B CN111026865B (en) 2023-07-21

Family

ID=70200915

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910992324.0A Active CN111026865B (en) 2019-10-18 2019-10-18 Knowledge graph relationship alignment method, device, equipment and storage medium

Country Status (2)

Country Link
CN (1) CN111026865B (en)
WO (1) WO2021072891A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112149400A (en) * 2020-09-23 2020-12-29 腾讯科技(深圳)有限公司 Data processing method, device, equipment and storage medium
WO2022222226A1 (en) * 2021-04-19 2022-10-27 平安科技(深圳)有限公司 Structured-information-based relation alignment method and apparatus, and device and medium

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113282764B (en) * 2021-06-29 2023-05-23 南方电网科学研究院有限责任公司 Method and device for constructing network security data knowledge graph
CN117370583B (en) * 2023-12-08 2024-03-19 湘江实验室 Knowledge-graph entity alignment method and system based on generation of countermeasure network

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107665252A (en) * 2017-09-27 2018-02-06 深圳证券信息有限公司 A kind of method and device of creation of knowledge collection of illustrative plates
CN109783582A (en) * 2018-12-04 2019-05-21 平安科技(深圳)有限公司 A kind of knowledge base alignment schemes, device, computer equipment and storage medium
CN109885692A (en) * 2019-01-11 2019-06-14 平安科技(深圳)有限公司 Knowledge data storage method, device, computer equipment and storage medium

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR101983455B1 (en) * 2017-09-21 2019-05-28 숭실대학교산학협력단 Knowledge Base completion method and server
CN107943874B (en) * 2017-11-13 2019-08-23 平安科技(深圳)有限公司 Knowledge mapping processing method, device, computer equipment and storage medium

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107665252A (en) * 2017-09-27 2018-02-06 深圳证券信息有限公司 A kind of method and device of creation of knowledge collection of illustrative plates
CN109783582A (en) * 2018-12-04 2019-05-21 平安科技(深圳)有限公司 A kind of knowledge base alignment schemes, device, computer equipment and storage medium
CN109885692A (en) * 2019-01-11 2019-06-14 平安科技(深圳)有限公司 Knowledge data storage method, device, computer equipment and storage medium

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112149400A (en) * 2020-09-23 2020-12-29 腾讯科技(深圳)有限公司 Data processing method, device, equipment and storage medium
CN112149400B (en) * 2020-09-23 2021-07-27 腾讯科技(深圳)有限公司 Data processing method, device, equipment and storage medium
WO2022222226A1 (en) * 2021-04-19 2022-10-27 平安科技(深圳)有限公司 Structured-information-based relation alignment method and apparatus, and device and medium

Also Published As

Publication number Publication date
CN111026865B (en) 2023-07-21
WO2021072891A1 (en) 2021-04-22

Similar Documents

Publication Publication Date Title
CN111026865A (en) Relation alignment method, device and equipment of knowledge graph and storage medium
US10579661B2 (en) System and method for machine learning and classifying data
JP5503046B2 (en) Image search based on shape
US8781255B2 (en) Methods and apparatus for visual search
CN108804641A (en) A kind of computational methods of text similarity, device, equipment and storage medium
Wei et al. On the spatial distribution of buildings for map generalization
Hahsler et al. Introduction to stream: An extensible framework for data stream clustering research with R
JP6905603B2 (en) Image retrieval methods, devices, equipment and readable storage media
US8849030B2 (en) Image retrieval using spatial bag-of-features
CN109697451B (en) Similar image clustering method and device, storage medium and electronic equipment
CN111046186A (en) Entity alignment method, device and equipment of knowledge graph and storage medium
CN105117407B (en) A kind of image search method of the range direction histogram based on cluster
CN112257801B (en) Incremental clustering method and device for images, electronic equipment and storage medium
CN110825894A (en) Data index establishing method, data index retrieving method, data index establishing device, data index retrieving device, data index establishing equipment and storage medium
Wang et al. Duplicate discovery on 2 billion internet images
WO2023030184A1 (en) Data retrieval method and related device
WO2023221790A1 (en) Image encoder training method and apparatus, device, and medium
WO2015001416A1 (en) Multi-dimensional data clustering
JP2014211730A (en) Image searching system, image searching device, and image searching method
WO2021000674A1 (en) Cell image recognition method and system, computer device, and readable storage medium
CN110209895B (en) Vector retrieval method, device and equipment
CN109446408A (en) Retrieve method, apparatus, equipment and the computer readable storage medium of set of metadata of similar data
CN115617945A (en) Cross-modal data retrieval model establishing method and cross-modal data retrieval method
EP4163803A1 (en) Sample data annotation system, method, and related device
CN108268620A (en) A kind of Document Classification Method based on hadoop data minings

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant