CN109597856B - Data processing method and device, electronic equipment and storage medium - Google Patents


Info

Publication number
CN109597856B
CN109597856B (application CN201811485414.2A)
Authority
CN
China
Prior art keywords
node
matrix
target knowledge
graph network
similarity
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811485414.2A
Other languages
Chinese (zh)
Other versions
CN109597856A (en)
Inventor
曾山松
岳永鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Knownsec Information Technology Co Ltd
Original Assignee
Beijing Knownsec Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Knownsec Information Technology Co Ltd
Priority to CN201811485414.2A
Publication of CN109597856A
Application granted
Publication of CN109597856B
Legal status: Active

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10Complex mathematical operations
    • G06F17/16Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Pure & Applied Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Computational Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computing Systems (AREA)
  • Algebra (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a data processing method and device, an electronic device and a storage medium. The method comprises the following steps: converting the attribute information of each node in a target knowledge-graph network into numerically represented vector-space features to obtain an entity attribute feature matrix; acquiring a Laplace matrix of the entity relationship graph representing the nodes in the target knowledge-graph network; determining the final vector space representation of each node in the target knowledge-graph network from the entity attribute feature matrix and the Laplace matrix; calculating the final vector similarity between every two nodes in the target knowledge-graph network; and fusing node pairs whose vector similarity exceeds a preset threshold. Because the method learns the vector space representation of each entity from both the attribute information and the adjacency information of the entity in the knowledge graph, a more comprehensive and accurate vector representation can be obtained, and the inaccurate entity-similarity calculation caused by missing entity attributes or changing attribute values is alleviated.

Description

Data processing method and device, electronic equipment and storage medium
Technical Field
The invention belongs to the technical field of data processing, and particularly relates to a data processing method and device, electronic equipment and a storage medium.
Background
A knowledge graph needs to be updated as knowledge itself is continuously updated: the attributes or relationships of existing entities may change, or new entities and relationships may be added. This requires determining whether a newly added entity already exists in the original graph; if it does, the new entity must be linked to the original entity, fused into a single unique entity, and the entity's attributes and relationships updated.
A common existing method for entity fusion determines whether entities from different sources can be aligned by using their attribute information: if the entities carry a unique-identifier attribute, the two entities can be matched via that identifier; if no unique identifier exists, the attribute information of each entity can be vectorized and the similarity of the two vectors calculated.
Disclosure of Invention
In view of this, embodiments of the present invention provide a data processing method, an apparatus, an electronic device, and a storage medium, so as to mitigate the impact that incomplete or changing attribute information has on similarity calculation, and thereby on the accuracy of entity fusion.
The embodiment of the invention is realized by the following steps:
in a first aspect, an embodiment of the present invention provides a data processing method, including: acquiring a target knowledge graph network; converting the attribute information of each node in the target knowledge graph network into space vector characteristics represented by numerical values to obtain an entity attribute characteristic matrix; acquiring a Laplace matrix of an entity relationship graph representing each node in the target knowledge graph network; determining final vector space representation of each node in the target knowledge graph network according to the entity attribute feature matrix and the Laplace matrix; calculating the final vector similarity between every two nodes in the target knowledge graph network; and fusing the node pairs with the vector similarity calculation result larger than a preset threshold value.
In the embodiment of the application, the vector space representation of each entity is learned from both the attribute information and the adjacency information of the entity in the knowledge graph, so that a more comprehensive and accurate vector representation can be obtained; this avoids the inaccurate entity-similarity calculation caused by missing entity attributes or changing attribute values, and improves the accuracy and reliability of entity fusion.
With reference to a possible implementation manner of the embodiment of the first aspect, the obtaining a laplacian matrix of an entity relationship graph representing each node in the target knowledge-graph network includes: acquiring a degree matrix for representing the degree of each node in the target knowledge graph network; acquiring an adjacency matrix representing each node connection object in the target knowledge graph network; determining the Laplace matrix according to the degree matrix and the adjacency matrix.
With reference to still another possible implementation manner of the embodiment of the first aspect, the target knowledge-graph network includes n nodes, where n is an integer greater than 1; determining a final vector space representation of each node in the target knowledge-graph network according to the entity attribute feature matrix and the laplacian matrix, including:
calculating vector similarity between the ith node and each node in the n nodes in the target knowledge graph network based on the entity attribute feature matrix to obtain a similarity matrix, wherein the ith row in the similarity matrix represents the vector similarity between the ith node and each node in the n nodes, and i is greater than or equal to 1 and less than or equal to n; and determining final vector space representation of each node in the target knowledge graph network according to the similarity matrix and the Laplace matrix.
With reference to yet another possible implementation manner of the embodiment of the first aspect, determining a final vector space representation of each node in the target knowledge-graph network according to the similarity matrix and the laplacian matrix includes: determining a final vector space representation of each node in the target knowledge graph network according to a final vector space representation function, the similarity matrix and the Laplace matrix, wherein the final vector space representation function is as follows:
\[
H^{*} = \arg\min_{H}\ \lambda\,\lVert S - HH^{\mathsf{T}}\rVert_F^2 + (1-\lambda)\,\mathrm{tr}\!\left(H^{\mathsf{T}} W H\right)
\]
wherein S represents the similarity matrix, W represents the Laplace matrix, H represents the final vector space representation matrix, λ is an adjustment coefficient with 0 ≤ λ ≤ 1, and the ith row hᵢ of H represents the final vector space representation of the ith node.
With reference to yet another possible implementation manner of the embodiment of the first aspect, calculating a vector similarity between every two nodes in the target knowledge-graph network includes: clustering vector features corresponding to each node in the target knowledge graph network through a clustering algorithm; and calculating the final vector similarity between every two nodes which belong to the same cluster.
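The cluster-then-compare optimization described above (compute similarity only for node pairs that fall in the same cluster) can be sketched in Python. This is an illustrative mock-up under assumed details, not the patent's reference implementation: the clustering algorithm is taken to be a minimal k-means, cosine similarity stands in for the vector similarity measure, and all function names are hypothetical.

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Minimal k-means clustering: returns one cluster label per row of X."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(iters):
        # Assign each point to its nearest center.
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        # Move each center to the mean of its points (skip emptied clusters).
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return labels

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def candidate_pairs(X, k=2):
    """Compute similarity only for node pairs that share a cluster."""
    labels = kmeans(X, k)
    n = len(X)
    return [(i, j, cosine(X[i], X[j]))
            for i in range(n) for j in range(i + 1, n)
            if labels[i] == labels[j]]

# Four nodes forming two obvious clusters: only 2 of the 6 pairs are compared.
X = np.array([[1.0, 0.0], [1.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
pairs = candidate_pairs(X, k=2)
```

Restricting the pairwise comparison to same-cluster nodes reduces the number of similarity computations from O(n²) toward the sum of squared cluster sizes.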
In a second aspect, an embodiment of the present invention further provides a data processing apparatus, including: a first acquisition module, a conversion module, a second acquisition module, a determination module, a calculation module and a fusion module. The first acquisition module is used for acquiring a target knowledge graph network; the conversion module is used for converting the attribute information of each node in the target knowledge graph network into vector space features represented by numerical values to obtain an entity attribute feature matrix; the second acquisition module is used for acquiring a Laplace matrix of an entity relationship graph representing each node in the target knowledge graph network; the determining module is configured to determine a final vector space representation of each node in the target knowledge graph network according to the entity attribute feature matrix and the Laplace matrix; the calculation module is used for calculating the vector similarity between every two nodes in the target knowledge graph network; and the fusion module is used for fusing the node pairs whose vector similarity calculation result is larger than a preset threshold value.
With reference to a possible implementation manner of the embodiment of the second aspect, the second obtaining module is further configured to: acquiring a degree matrix for representing the degree of each node in the target knowledge graph network; acquiring an adjacency matrix representing each node connection object in the target knowledge graph network; determining the Laplace matrix according to the degree matrix and the adjacency matrix.
In combination with yet another possible implementation manner of the embodiment of the second aspect, the target knowledge-graph network includes n nodes, where n is an integer greater than 1; the determining module is further configured to: calculating vector similarity between the ith node and each node in the n nodes in the target knowledge graph network based on the entity attribute feature matrix to obtain a similarity matrix, wherein the ith row in the similarity matrix represents the vector similarity between the ith node and each node in the n nodes, and i is greater than or equal to 1 and less than or equal to n; and determining final vector space representation of each node in the target knowledge graph network according to the similarity matrix and the Laplace matrix.
With reference to still another possible implementation manner of the embodiment of the second aspect, the determining module is further configured to: determining a final vector space representation of each node in the target knowledge graph network according to a final vector space representation function, the similarity matrix and the Laplace matrix, wherein the final vector space representation function is as follows:
\[
H^{*} = \arg\min_{H}\ \lambda\,\lVert S - HH^{\mathsf{T}}\rVert_F^2 + (1-\lambda)\,\mathrm{tr}\!\left(H^{\mathsf{T}} W H\right)
\]
wherein S represents the similarity matrix, W represents the Laplace matrix, H represents the final vector space representation matrix, λ is an adjustment coefficient with 0 ≤ λ ≤ 1, and the ith row hᵢ of H represents the final vector space representation of the ith node.
With reference to still another possible implementation manner of the embodiment of the second aspect, the calculating module is further configured to: clustering vector features corresponding to each node in the target knowledge graph network through a clustering algorithm; and calculating the final vector similarity between every two nodes which belong to the same cluster.
In a third aspect, an embodiment of the present invention further provides an electronic device, including: a memory and a processor, the memory and the processor being connected; the memory is used for storing a program; the processor is configured to invoke the program stored in the memory to perform the method provided in the embodiment of the first aspect and/or in any possible implementation manner of the embodiment of the first aspect.
In a fourth aspect, an embodiment of the present invention further provides a storage medium, where the storage medium includes a computer program, and the computer program is executed by a computer to perform the method provided in the embodiment of the first aspect and/or in connection with any one of the possible implementation manners of the embodiment of the first aspect.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by the practice of the embodiments of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings without creative efforts. The above and other objects, features and advantages of the present invention will become more apparent from the accompanying drawings. Like reference numerals refer to like parts throughout the drawings. The drawings are not necessarily to scale, emphasis instead being placed upon illustrating the principles of the invention.
Fig. 1 shows a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Fig. 2 is a schematic flow chart illustrating a data processing method according to an embodiment of the present invention.
FIG. 3 shows a schematic diagram of a target knowledge-graph network provided by an embodiment of the invention.
Fig. 4 shows a block diagram of a data processing apparatus according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. The components of embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations.
Thus, the following detailed description of the embodiments of the present invention, presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures.
In the description of the present invention, it should be noted that the terms "first", "second", "third", and the like are used only for distinguishing the description, and are not intended to indicate or imply relative importance. Further, the term "and/or" in the present application is only one kind of association relationship describing the associated object, and means that three kinds of relationships may exist, for example, a and/or B may mean: a exists alone, A and B exist simultaneously, and B exists alone.
As shown in fig. 1, fig. 1 is a block diagram illustrating a structure of an electronic device 100 according to an embodiment of the present invention. The electronic device 100 includes: data processing device 110, memory 120, memory controller 130, and processor 140.
The memory 120, the memory controller 130, and the processor 140 are electrically connected to each other directly or indirectly to realize data transmission or interaction. For example, the components may be electrically connected to each other via one or more communication buses or signal lines. The data processing device 110 includes at least one software function module which can be stored in the memory 120 in the form of software or firmware (firmware) or solidified in an Operating System (OS) of the electronic device 100. The processor 140 is used to execute executable modules stored in the memory 120, such as software functional modules or computer programs included in the data processing apparatus 110.
The Memory 120 may be, but is not limited to, a Random Access Memory (RAM), a Read Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), and the like. The memory 120 is configured to store a program, and the processor 140 executes the program after receiving an execution instruction; the method executed by the electronic device 100, defined by any flow disclosed in the embodiments described later, may be applied to the processor 140 or implemented by the processor 140.
The processor 140 may be an integrated circuit chip having signal processing capabilities. The Processor may be a general-purpose Processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; but may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic device, discrete hardware components. The various methods, steps and logic blocks disclosed in the embodiments of the present invention may be implemented or performed. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
In the embodiment of the present invention, the electronic device 100 may be, but is not limited to, a web server, a database server, a cloud server, and the like.
Referring to fig. 2, steps included in a data processing method applied to the electronic device 100 according to an embodiment of the present invention will be described with reference to fig. 2.
Step S101: and acquiring a target knowledge graph network.
When a given knowledge-graph network is to be analyzed, that network is taken as the target knowledge-graph network. A knowledge-graph network is a graph data structure representing entity relations: each node in the graph represents an entity existing in the real world, and each edge is a relation between entities. In short, a knowledge-graph network uses a relationship network to connect the various entities of the world through their interrelations, and helps analysts perform association analysis and reasoning between entities.
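As a minimal illustration of this data structure, the 8-node example network that appears later in FIG. 3 can be held as an adjacency list. This is only a sketch; the edge list below is read off the example matrices given later in Tables 3 to 5.

```python
from collections import defaultdict

# Each node is a real-world entity; each undirected edge is a relation
# between two entities. Edges of the FIG. 3 example network:
edges = [(1, 3), (2, 3), (3, 4), (3, 5), (3, 6), (6, 7), (5, 8), (6, 8)]

graph = defaultdict(set)
for u, v in edges:
    # Relations are treated as undirected here, so record both directions.
    graph[u].add(v)
    graph[v].add(u)
```

Node 3 acts as the hub of the example network: it is related to five other entities.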
Step S102: and converting the attribute information of each node in the target knowledge graph network into space vector characteristics represented by numerical values to obtain an entity attribute characteristic matrix.
The vector space representation of each entity in the target knowledge-graph network is learned through network representation learning. Network Representation Learning is a distributed representation learning technique for learning low-dimensional vector representations of network nodes; it supports many analysis tasks such as link prediction and node clustering. In a static flat network graph, network representation learning generally learns the vector representation of a node from the node's adjacency information alone. In a knowledge-graph network, however, each node usually carries rich attribute information; for example, a node of entity type Person may include attribute information such as name, date of birth, native place, and occupation. Adjacency information alone therefore cannot fully capture a node's representation in the network space, which is also why the similarity calculation is inaccurate under the existing fusion approach.
It should be noted that the existing common method for entity fusion determines whether entities from different sources can be aligned by using their attribute information: if a unique-identifier attribute exists, the identifier is used for matching; otherwise the attribute information is vectorized and the similarity of the two vectors is calculated. The inventors found during the course of this invention that, because attribute information may not be collected comprehensively in an engineering implementation, some dimensions of an entity's attributes can be missing, making similarity results based on attribute information alone inaccurate. In addition, since entity attributes change dynamically over time, disambiguating entities purely by attribute information may cause the same entity, which possesses different attributes at different times, to be misjudged as different entities.
The defects existing in the prior art are the results obtained after the inventor practices and researches, so that the discovery process of the above problems and the solution proposed by the following embodiments of the invention to the above problems should be the contribution of the inventor to the invention in the process of the invention.
Therefore, in the embodiment of the application, the defects existing in the existing fusion mode are solved by learning the vector space representation of the attribute information and the adjacent information of each node (entity) in the knowledge graph.
The vector space representation of the attribute information of each node (entity) in the knowledge graph can be learned as follows: using feature engineering, the attribute information of each node in the target knowledge-graph network is converted into numerically represented vector features, yielding the entity attribute feature matrix of the knowledge graph. Text information in an attribute can be converted into a numeric feature vector with word2vec, and category information in an attribute can be encoded into numeric features with one-hot encoding. For example, the entity attribute information of each node shown in Table 1 can be converted in this way into the entity attribute feature matrix shown in Table 2, where the name of each entity is learned by character embedding (such as word2vec), and the native place and occupation are one-hot encoded.
TABLE 1
Entity | Name      | Height (cm) | Weight (kg) | Native place | Occupation
1      | Li Gang   | 173         | 56          | Henan        | Doctor
2      | Li Jing   | 168         | 72          | Henan        | Doctor
3      | Li Gang   | 166         | 77          | Henan        | Doctor
4      | Li Gang   | 179         | 63          | Hebei        | Doctor
5      | Li Yugang | 180         | 66          | Hebei        | Teacher
6      | Zhang Xin | 172         | 64          | Hubei        | Engineer
7      | Li Gang   | 177         | 63          | Hunan        | Officer
8      | Wang Rong | 185         | 69          | Hunan        | Officer
TABLE 2
[0.2, 0.3] | 173 | 56 | [0,0,0,1] | [0,0,0,1]
[0.4, 0.3] | 168 | 72 | [0,0,0,1] | [0,0,0,1]
[0.2, 0.3] | 166 | 77 | [0,0,0,1] | [0,0,0,1]
[0.2, 0.3] | 179 | 63 | [0,0,1,0] | [0,0,0,1]
[0.3, 0.3] | 180 | 66 | [0,0,1,0] | [0,0,1,0]
[0.2, 0.3] | 172 | 64 | [0,1,0,0] | [0,1,0,0]
[0.2, 0.3] | 177 | 63 | [1,0,0,0] | [1,0,0,0]
[0.5, 0.3] | 185 | 69 | [1,0,0,0] | [1,0,0,0]
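The conversion from Table 1 to Table 2 can be sketched as follows. This is an illustrative sketch only: the 2-dimensional name embeddings are stand-in values (the text uses character embedding such as word2vec for names), the vocabulary orderings are chosen so as to reproduce the one-hot codes of Table 2, and the helper names are hypothetical.

```python
import numpy as np

# One-hot vocabularies, ordered so that they reproduce the codes of Table 2
# (e.g. Henan -> [0,0,0,1], Doctor -> [0,0,0,1]).
NATIVE_PLACES = ["Hunan", "Hubei", "Hebei", "Henan"]
OCCUPATIONS = ["Officer", "Engineer", "Teacher", "Doctor"]

def one_hot(value, vocabulary):
    vec = [0.0] * len(vocabulary)
    vec[vocabulary.index(value)] = 1.0
    return vec

def to_feature_row(name_embedding, height, weight, native_place, occupation):
    """Concatenate embedded, numeric, and one-hot features into one matrix row."""
    return np.array(
        list(name_embedding) + [height, weight]
        + one_hot(native_place, NATIVE_PLACES)
        + one_hot(occupation, OCCUPATIONS),
        dtype=float,
    )

# Entity 1 of Table 1, with a stand-in word2vec-style name embedding [0.2, 0.3]:
row = to_feature_row([0.2, 0.3], 173, 56, "Henan", "Doctor")
```

Stacking one such row per entity yields the entity attribute feature matrix of Table 2.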
Step S103: and acquiring a Laplace matrix of an entity relation graph representing each node in the target knowledge graph network.
After a target knowledge graph network to be analyzed is obtained, a Laplace matrix of an entity relation graph representing each node in the target knowledge graph network is obtained. The Laplace matrix can be determined according to the degree matrix and the adjacency matrix of the target knowledge graph network. Therefore, obtaining a laplacian matrix of an entity relationship graph representing each node in the target knowledge-graph network includes: acquiring a degree matrix for representing the degree of each node in the target knowledge graph network; acquiring an adjacency matrix representing each node connection object in the target knowledge graph network; determining the Laplace matrix according to the degree matrix and the adjacency matrix. The calculation formula of the Laplace matrix is as follows:
L = D - C, where L is the Laplace matrix to be calculated, D is the degree matrix of the graph, and C is the adjacency matrix of the graph.
For example, the degree matrix, adjacency matrix, laplace matrix of the target knowledge-graph network shown in fig. 3 can be represented by the following tables:
TABLE 3 (degree matrix of FIG. 3)
1 0 0 0 0 0 0 0
0 1 0 0 0 0 0 0
0 0 5 0 0 0 0 0
0 0 0 1 0 0 0 0
0 0 0 0 2 0 0 0
0 0 0 0 0 3 0 0
0 0 0 0 0 0 1 0
0 0 0 0 0 0 0 2
TABLE 4 (adjacency matrix of FIG. 3)
0 0 1 0 0 0 0 0
0 0 1 0 0 0 0 0
1 1 0 1 1 1 0 0
0 0 1 0 0 0 0 0
0 0 1 0 0 0 0 1
0 0 1 0 0 0 1 1
0 0 0 0 0 1 0 0
0 0 0 0 1 1 0 0
TABLE 5 (Laplace matrix of FIG. 3)
1 0 -1 0 0 0 0 0
0 1 -1 0 0 0 0 0
-1 -1 5 -1 -1 -1 0 0
0 0 -1 1 0 0 0 0
0 0 -1 0 2 0 0 -1
0 0 -1 0 0 3 -1 -1
0 0 0 0 0 -1 1 0
0 0 0 0 -1 -1 0 2
Where table 3 is the degree matrix of fig. 3, table 4 is the adjacency matrix of fig. 3, and table 5 is the laplacian matrix of fig. 3. Where each row in the table corresponds to a respective node in fig. 3, e.g., the first row in table 3 corresponds to node 1 in fig. 3, the second row in table 3 corresponds to node 2 in fig. 3, and the rest is similar.
Wherein, for node 1, there is only one edge, so the value in the corresponding degree matrix is 1; similarly, there is only one edge for node 2, so the value in the corresponding degree matrix is 1; similarly, there are 5 edges for node 3, and thus the corresponding degree matrix has a value of 5, which is similar for the rest of the cases.
Wherein, for node 1, the node connected to it is 3, so the value of the 3 rd column in the corresponding adjacency matrix is 1; similarly, for node 2, the node connected to it is 3, and therefore, the value of column 3 in the corresponding adjacency matrix is 1; similarly, for node 3, the nodes connected to it are 1, 2, 4, 5, 6, so the value of the 1 st, 2 nd, 4 th, 5 th, 6 th column in the corresponding adjacency matrix is 1, and the rest is similar.
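The construction of Tables 3 to 5 follows directly from the formula L = D - C. A short sketch (illustrative only, with the FIG. 3 edge list taken from the matrices above):

```python
import numpy as np

# Symmetric adjacency matrix C of the FIG. 3 example network (8 nodes).
edges = [(1, 3), (2, 3), (3, 4), (3, 5), (3, 6), (6, 7), (5, 8), (6, 8)]
n = 8
C = np.zeros((n, n), dtype=int)
for u, v in edges:
    C[u - 1, v - 1] = C[v - 1, u - 1] = 1   # undirected edge

# Degree matrix D: each node's edge count on the diagonal (cf. Table 3).
D = np.diag(C.sum(axis=1))

# Laplace matrix per the formula in the text: L = D - C (cf. Table 5).
L = D - C
```

Because C is symmetric and D holds the row sums of C, every row of L sums to zero, a quick sanity check for the hand-built tables.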
Step S104: and determining final vector space representation of each node in the target knowledge graph network according to the entity attribute feature matrix and the Laplace matrix.
After the entity attribute feature matrix and the Laplace matrix of the target knowledge graph network are learned, the final vector space representation of each node in the target knowledge graph network can be determined from these two matrices. As an alternative implementation, the final vector space representation of each node is calculated by the formula V = f(T, L), where T represents the entity attribute feature matrix, L represents the Laplace matrix of the entity relationship graph, V represents the final vector space representation of the entities in the target knowledge graph, and f is the function that computes the final vector space representation, usually a convolutional neural network. For example, to calculate the final vector space representation of node 1, the entity attribute features of node 1 in the entity attribute feature matrix and the relationship features of node 1 in the Laplace matrix are input into the convolutional neural network; the final vector space representations of the other nodes are obtained similarly.
As another optional implementation, after the entity attribute feature matrix of the target knowledge-graph network is learned, the vector similarity between each node and every node in the network is further calculated to obtain a similarity matrix. For ease of understanding, assume the target knowledge-graph network includes n nodes, n being an integer greater than 1. The process is then: calculate, based on the entity attribute feature matrix, the vector similarity between the ith node and each of the n nodes to obtain a similarity matrix, where the ith row of the similarity matrix represents the vector similarity between the ith node and each of the n nodes, and 1 ≤ i ≤ n. That is, for a target knowledge-graph network containing n nodes, an n × n similarity matrix is obtained by computing the similarity of the attribute feature vectors pairwise. Taking the target knowledge-graph network shown in fig. 3 as an example: after the entity attribute feature matrix of fig. 3 is obtained, for node 1, the vector similarity between node 1 and each of nodes 2 through 8 needs to be calculated.
Similarly, for node 2, the vector similarity between node 2 and node 1, the vector similarity between node 2 and node 2, the vector similarity between node 2 and node 3, the vector similarity between node 2 and node 4, the vector similarity between node 2 and node 5, the vector similarity between node 2 and node 6, the vector similarity between node 2 and node 7, and the vector similarity between node 2 and node 8 need to be calculated. The calculation of each of the other nodes is similar. This results in a n x n-dimensional similarity matrix.
As one example, the vector similarity may be calculated as the cosine similarity of the two attribute feature vectors:

\mathrm{sim}(A, B) = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}

where vector A represents the attribute features of one node and vector B represents the attribute features of the other node. Each entry of an attribute vector is the feature value of one attribute of the entity; for example, node 1 has 5 attributes (name, height, weight, native place, occupation).
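The pairwise similarities above can be computed for all n nodes at once by normalizing the rows of the attribute feature matrix; a minimal sketch (the function name is an assumption):

```python
import numpy as np

def cosine_similarity_matrix(T):
    """Pairwise cosine similarity of the rows of the entity attribute
    feature matrix T; returns an n x n similarity matrix S."""
    norms = np.linalg.norm(T, axis=1, keepdims=True)
    U = T / np.clip(norms, 1e-12, None)  # guard against all-zero vectors
    return U @ U.T

T = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [0.0, 1.0]])
S = cosine_similarity_matrix(T)
print(S.shape)            # (3, 3)
print(round(S[0, 1], 4))  # similarity of node 1 and node 2: 0.7071
```

Row i of S is exactly the ith row of the similarity matrix described above: the similarity of node i to each of the n nodes, with S[i, i] = 1.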
After the similarity matrix is obtained, the final vector space representation of each node in the target knowledge graph network is determined according to the similarity matrix and the Laplacian matrix. For example, the final vector space representation of each node may be determined according to a final vector space representation function, the similarity matrix, and the Laplacian matrix, where the final vector space representation function takes the form:

H^{*} = \arg\min_{H} \; \lambda \sum_{i,j} S_{ij}\,\lVert h_i - h_j \rVert^{2} + (1-\lambda)\,\mathrm{tr}\!\left(H^{\mathsf{T}} W H\right)

wherein

H = [h_1, h_2, \ldots, h_n]^{\mathsf{T}}

S represents the similarity matrix, W represents the Laplacian matrix, H represents the final vector space representation matrix, λ is an adjustment coefficient with 0 ≤ λ ≤ 1, and h_i represents the final vector space representation of node i. By solving this representation function, the final vector space representation h_i of each entity is obtained.
The final vector space representation is computed so that it satisfies two conditions: (1) two entities whose attribute values are close in the attribute space are also similar in the resulting vector space (the first condition), and (2) two entities that are adjacent in the knowledge graph network are also similar in the resulting vector space (the second condition). The first term of the representation function above corresponds to the first condition, and the second term corresponds to the second condition.
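One way to obtain embeddings satisfying both conditions is gradient descent on a weighted two-term objective. The patent's exact representation function is rendered only as images, so the objective coded below is an assumption consistent with the two conditions (a similarity-weighted term plus a Laplacian term), not necessarily the patent's exact function; in practice a constraint or normalization would also be added to keep H from collapsing toward zero:

```python
import numpy as np

def objective(H, S, W, lam):
    """Two-term cost: attribute-space closeness plus graph-adjacency closeness."""
    Ls = np.diag(S.sum(axis=1)) - S        # Laplacian of the similarity graph
    first = 2.0 * np.trace(H.T @ Ls @ H)   # equals sum_ij S_ij * ||h_i - h_j||^2
    second = np.trace(H.T @ W @ H)         # small when graph-adjacent nodes are close
    return lam * first + (1.0 - lam) * second

def refine(H, S, W, lam, lr=0.01, steps=50):
    """A few plain gradient steps on the objective above (illustrative only)."""
    Ls = np.diag(S.sum(axis=1)) - S
    for _ in range(steps):
        grad = 4.0 * lam * (Ls @ H) + 2.0 * (1.0 - lam) * (W @ H)
        H = H - lr * grad
    return H

S = np.array([[1.0, 0.8], [0.8, 1.0]])     # similarity matrix
W = np.array([[1.0, -1.0], [-1.0, 1.0]])   # Laplacian of a 2-node, 1-edge graph
H0 = np.array([[1.0, 0.0], [0.0, 1.0]])    # initial node vectors
H = refine(H0, S, W, lam=0.5)
print(objective(H, S, W, 0.5) < objective(H0, S, W, 0.5))  # True
```

Both terms are nonnegative (each is a quadratic form in a positive semidefinite Laplacian), so descending the objective pulls attribute-similar and graph-adjacent nodes together simultaneously.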
Step S105: and calculating final vector similarity between every two nodes in the target knowledge-graph network.
After the final vector space representation of each node in the target knowledge graph network is obtained, the final vector similarity between every two nodes in the target knowledge graph network is calculated, yielding a final vector similarity for each node pair: for example, the final vector similarity of node 1 and node 2, of node 1 and node 3, of node 1 and node 4, of node 1 and node 5, of node 1 and node 6, of node 2 and node 3, and so on.
As an optional implementation, to reduce the computational cost, the vector features corresponding to each node in the target knowledge graph network may first be clustered by a clustering algorithm, dividing the nodes of the target knowledge graph network into several clusters in which node attribute association is strong within each cluster and weak between clusters. Then, when calculating the final vector similarity, it is computed only between pairs of nodes in the same cluster rather than between every pair of nodes in the whole network, which reduces the computational cost and saves time.
The clustering algorithm may be an existing clustering algorithm, such as the K-Means clustering algorithm.
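A minimal illustration of this pre-clustering step, using a toy K-Means written with NumPy (a library implementation such as scikit-learn's `KMeans` would normally be used instead):

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Minimal K-Means for illustration: random initial centers,
    then alternate assignment and center updates."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels

# Only node pairs inside the same cluster are compared afterwards.
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])  # node vectors
labels = kmeans(X, k=2)
pairs = [(i, j) for i in range(len(X)) for j in range(i + 1, len(X))
         if labels[i] == labels[j]]
print(pairs)  # [(0, 1), (2, 3)] — 2 within-cluster pairs instead of 6 overall
```

For n nodes split into balanced clusters, this replaces O(n²) pairwise comparisons with a much smaller number of within-cluster comparisons.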
Step S106: and fusing the node pairs with the vector similarity calculation result larger than a preset threshold value.
After the final vector similarity of each node pair is obtained, the node pairs whose vector similarity exceeds the preset threshold are selected, and the attributes of each qualifying node pair are fused. For example, if the vector similarity of node 1 and node 2 is greater than the preset threshold, node 1 and node 2 are merged into a single new node.
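Because fusion is applied to every qualifying pair, pairs that share a node naturally chain into one merged group; a union-find sketch of this step (the function name is an assumption):

```python
def fuse_nodes(sim, threshold):
    """Union-find merge: every node pair whose similarity exceeds the
    threshold ends up in the same fused group."""
    n = len(sim)
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i in range(n):
        for j in range(i + 1, n):
            if sim[i][j] > threshold:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

sim = [[1.0, 0.95, 0.10],
       [0.95, 1.0, 0.20],
       [0.10, 0.20, 1.0]]
print(fuse_nodes(sim, threshold=0.9))  # [[0, 1], [2]]
```

Here nodes 0 and 1 exceed the threshold and become one new fused node, while node 2 remains separate.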
To sum up, in the data processing method provided in the embodiments of the present application, the attribute information of each node in the target knowledge graph network is converted into space vector features represented by numerical values to obtain an entity attribute feature matrix; a Laplacian matrix of the entity relationship graph of the nodes in the target knowledge graph network is obtained; the final vector space representation of each node is determined from the entity attribute feature matrix and the Laplacian matrix; the final vector similarity between every two nodes is calculated from these representations; and node pairs whose vector similarity exceeds a preset threshold are fused. Because both the attribute information and the adjacency information of each entity in the knowledge graph are used to learn the entity vector representation, a more comprehensive and accurate vector representation is obtained, which mitigates the inaccurate entity similarity caused by attribute values distorted by missing entity attributes. In addition, the entity vectors are partitioned into different subspaces by a clustering algorithm, and pairwise similarity is calculated only within each subspace, which avoids the performance problem of pairwise computation over the entire graph under large data volumes.
The embodiment of the present application further provides a data processing apparatus 110, as shown in fig. 4. The data processing apparatus 110 includes: a first obtaining module 111, a converting module 112, a second obtaining module 113, a determining module 114, a calculating module 115, and a fusing module 116.
The first obtaining module 111 is configured to obtain a target knowledge-graph network.
A converting module 112, configured to convert the attribute information of each node in the target knowledge-graph network into a space vector feature represented by a numerical value, so as to obtain an entity attribute feature matrix.
A second obtaining module 113, configured to obtain a laplacian matrix of an entity relationship graph representing each node in the target knowledge-graph network.
A determining module 114, configured to determine a final vector space representation of each node in the target knowledge-graph network according to the entity attribute feature matrix and the laplacian matrix.
A calculating module 115, configured to calculate a vector similarity between every two nodes in the target knowledge-graph network.
And the fusion module 116 is configured to fuse the node pairs whose vector similarity calculation result is greater than the preset threshold.
It should be noted that, in the present specification, the embodiments are all described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments may be referred to each other.
The data processing apparatus 110 according to the embodiment of the present invention has the same implementation principle and technical effect as the foregoing method embodiments, and for the sake of brief description, reference may be made to the corresponding contents in the foregoing method embodiments for the parts of the apparatus embodiments that are not mentioned.
The embodiment of the present application further provides a storage medium, where the storage medium includes a computer program, and the computer program is executed by a computer to perform the data processing method.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method can be implemented in other ways. The apparatus embodiments described above are merely illustrative, and for example, the flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
In addition, the functional modules in the embodiments of the present invention may be integrated together to form an independent part, or each module may exist separately, or two or more modules may be integrated to form an independent part.
The functions, if implemented in the form of software functional modules and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a notebook computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes. It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A data processing method, comprising:
acquiring a target knowledge graph network representing attribute relations of people;
converting the attribute information of each node in the target knowledge graph network into space vector features represented by numerical values to obtain an entity attribute feature matrix, wherein when the entity represented by the node is a person, the attribute information comprises: name, height, native place, occupation, weight;
acquiring a Laplace matrix of an entity relationship graph representing each node in the target knowledge graph network;
determining final vector space representation of each node in the target knowledge graph network according to the entity attribute feature matrix and the Laplace matrix;
calculating the final vector similarity between every two nodes in the target knowledge graph network;
and fusing the node pairs with the vector similarity calculation result larger than a preset threshold value.
2. The method of claim 1, wherein obtaining the laplacian matrix of the entity relationship graph characterizing each node in the target knowledge-graph network comprises:
acquiring a degree matrix for representing the degree of each node in the target knowledge graph network;
acquiring an adjacency matrix representing each node connection object in the target knowledge graph network;
determining the Laplace matrix according to the degree matrix and the adjacency matrix.
3. The method of claim 1, wherein the target knowledge-graph network comprises n nodes, n being an integer greater than 1; determining a final vector space representation of each node in the target knowledge-graph network according to the entity attribute feature matrix and the laplacian matrix, including:
calculating vector similarity between the ith node and each node in the n nodes in the target knowledge graph network based on the entity attribute feature matrix to obtain a similarity matrix, wherein the ith row in the similarity matrix represents the vector similarity between the ith node and each node in the n nodes, and i is greater than or equal to 1 and less than or equal to n;
and determining final vector space representation of each node in the target knowledge graph network according to the similarity matrix and the Laplace matrix.
4. The method of claim 3, wherein determining a final vector space representation for each node in the target knowledge-graph network based on the similarity matrix and the Laplace matrix comprises:
determining a final vector space representation of each node in the target knowledge graph network according to a final vector space representation function, the similarity matrix and the Laplace matrix, wherein the final vector space representation function is as follows:
H^{*} = \arg\min_{H} \; \lambda \sum_{i,j} S_{ij}\,\lVert h_i - h_j \rVert^{2} + (1-\lambda)\,\mathrm{tr}\!\left(H^{\mathsf{T}} W H\right)
wherein
H = [h_1, h_2, \ldots, h_n]^{\mathsf{T}}
S represents the similarity matrix, W represents the Laplacian matrix, H represents a final vector space representation matrix, λ is an adjustment coefficient with 0 ≤ λ ≤ 1, and h_i is the final vector space representation of each node.
5. The method of claim 1, wherein calculating the vector similarity between each two nodes in the target knowledge-graph network comprises:
clustering vector features corresponding to each node in the target knowledge graph network through a clustering algorithm;
and calculating the final vector similarity between every two nodes which belong to the same cluster.
6. A data processing apparatus, comprising:
the first acquisition module is used for acquiring a target knowledge graph network representing the attribute relationship of a person;
a conversion module, configured to convert attribute information of each node in the target knowledge-graph network into a space vector feature represented by a numerical value, so as to obtain an entity attribute feature matrix, where, when an entity represented by a node is a person, the attribute information includes: name, height, native place, occupation, weight;
the second acquisition module is used for acquiring a Laplace matrix of an entity relationship graph representing each node in the target knowledge graph network;
a determining module, configured to determine a final vector space representation of each node in the target knowledge-graph network according to the entity attribute feature matrix and the laplacian matrix;
the calculation module is used for calculating the vector similarity between every two nodes in the target knowledge graph network;
and the fusion module is used for fusing the node pairs with the vector similarity calculation result larger than the preset threshold value.
7. The apparatus of claim 6, wherein the second obtaining module is further configured to: acquiring a degree matrix for representing the degree of each node in the target knowledge graph network;
acquiring an adjacency matrix representing each node connection object in the target knowledge graph network;
determining the Laplace matrix according to the degree matrix and the adjacency matrix.
8. The apparatus of claim 6, wherein the target knowledge-graph network comprises n nodes, n being an integer greater than 1; the determining module is further configured to:
calculating vector similarity between the ith node and each node in the n nodes in the target knowledge graph network based on the entity attribute feature matrix to obtain a similarity matrix, wherein the ith row in the similarity matrix represents the vector similarity between the ith node and each node in the n nodes, and i is greater than or equal to 1 and less than or equal to n;
and determining final vector space representation of each node in the target knowledge graph network according to the similarity matrix and the Laplace matrix.
9. An electronic device, comprising: a memory and a processor, the memory and the processor connected;
the memory is used for storing programs;
the processor is configured to invoke a program stored in the memory to perform the method of any of claims 1-5.
10. A storage medium, characterized in that the storage medium comprises a computer program which, when executed by a computer, performs the method according to any one of claims 1-5.
CN201811485414.2A 2018-12-05 2018-12-05 Data processing method and device, electronic equipment and storage medium Active CN109597856B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811485414.2A CN109597856B (en) 2018-12-05 2018-12-05 Data processing method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811485414.2A CN109597856B (en) 2018-12-05 2018-12-05 Data processing method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN109597856A CN109597856A (en) 2019-04-09
CN109597856B true CN109597856B (en) 2020-12-25

Family

ID=65962131

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811485414.2A Active CN109597856B (en) 2018-12-05 2018-12-05 Data processing method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN109597856B (en)

Families Citing this family (27)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110263324B (en) * 2019-05-16 2021-02-12 华为技术有限公司 Text processing method, model training method and device
CN112118278B (en) * 2019-06-04 2023-07-04 杭州海康威视***技术有限公司 Computing node access method, device, electronic equipment and machine-readable storage medium
CN110580294B (en) * 2019-09-11 2022-11-29 腾讯科技(深圳)有限公司 Entity fusion method, device, equipment and storage medium
CN112651764B (en) * 2019-10-12 2023-03-31 武汉斗鱼网络科技有限公司 Target user identification method, device, equipment and storage medium
CN111046186A (en) * 2019-10-30 2020-04-21 平安科技(深圳)有限公司 Entity alignment method, device and equipment of knowledge graph and storage medium
CN111160847B (en) * 2019-12-09 2023-08-25 中国建设银行股份有限公司 Method and device for processing flow information
CN111125376B (en) * 2019-12-23 2023-08-29 秒针信息技术有限公司 Knowledge graph generation method and device, data processing equipment and storage medium
CN111191462B (en) * 2019-12-30 2022-02-22 北京航空航天大学 Method and system for realizing cross-language knowledge space entity alignment based on link prediction
CN111241095B (en) * 2020-01-03 2023-06-23 北京百度网讯科技有限公司 Method and apparatus for generating vector representations of nodes
CN111353002B (en) * 2020-02-03 2024-05-03 中国人民解放军国防科技大学 Training method and device for network representation learning model, electronic equipment and medium
CN111392538A (en) * 2020-03-17 2020-07-10 浙江新再灵科技股份有限公司 Elevator comprehensive fault early warning method based on multi-dimensional Internet of things atlas big data
CN111460234B (en) * 2020-03-26 2023-06-09 平安科技(深圳)有限公司 Graph query method, device, electronic equipment and computer readable storage medium
CN113553436A (en) * 2020-04-23 2021-10-26 广东博智林机器人有限公司 Knowledge graph updating method and device, electronic equipment and storage medium
CN111599472B (en) * 2020-05-14 2023-10-24 重庆大学 Method and device for identifying psychological state of student and computer
CN111581467B (en) * 2020-05-15 2024-04-02 北京交通大学 Partial mark learning method based on subspace representation and global disambiguation method
CN111625435B (en) * 2020-05-21 2022-06-10 苏州浪潮智能科技有限公司 Server analysis method, device and equipment and computer readable storage medium
CN111522968B (en) * 2020-06-22 2023-09-08 中国银行股份有限公司 Knowledge graph fusion method and device
CN111563191A (en) * 2020-07-07 2020-08-21 成都数联铭品科技有限公司 Data processing system based on graph network
CN111538895A (en) * 2020-07-07 2020-08-14 成都数联铭品科技有限公司 Data processing system based on graph network
CN112149759A (en) * 2020-10-26 2020-12-29 北京明略软件***有限公司 Event map matching method and device, electronic equipment and storage medium
CN112000718B (en) * 2020-10-28 2021-05-18 成都数联铭品科技有限公司 Attribute layout-based knowledge graph display method, system, medium and equipment
CN112364181B (en) * 2020-11-27 2024-05-28 深圳市慧择时代科技有限公司 Insurance product matching degree determining method and apparatus
CN112785350B (en) * 2021-02-24 2023-09-19 深圳市慧择时代科技有限公司 Product vector determining method and device
CN113269248B (en) * 2021-05-24 2023-06-23 平安科技(深圳)有限公司 Method, device, equipment and storage medium for data standardization
CN113505214B (en) * 2021-06-30 2024-06-14 北京明略软件***有限公司 Content recommendation method, device, computer equipment and storage medium
CN113590846B (en) * 2021-09-24 2021-12-17 天津汇智星源信息技术有限公司 Legal knowledge map construction method and related equipment
CN113868438B (en) * 2021-11-30 2022-03-04 平安科技(深圳)有限公司 Information reliability calibration method and device, computer equipment and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103699663A (en) * 2013-12-27 2014-04-02 中国科学院自动化研究所 Hot event mining method based on large-scale knowledge base
CN103729402A (en) * 2013-11-22 2014-04-16 浙江大学 Method for establishing mapping knowledge domain based on book catalogue
CN105468605A (en) * 2014-08-25 2016-04-06 济南中林信息科技有限公司 Entity information map generation method and device
CN105786980A (en) * 2016-02-14 2016-07-20 广州神马移动信息科技有限公司 Method and apparatus for combining different examples for describing same entity and equipment
CN107357846A (en) * 2017-06-26 2017-11-17 北京金堤科技有限公司 The methods of exhibiting and device of relation map
CN108280062A (en) * 2018-01-19 2018-07-13 北京邮电大学 Entity based on deep learning and entity-relationship recognition method and device

Family Cites Families (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8332333B2 (en) * 2006-10-19 2012-12-11 Massachusetts Institute Of Technology Learning algorithm for ranking on graph data
US8805653B2 (en) * 2010-08-11 2014-08-12 Seiko Epson Corporation Supervised nonnegative matrix factorization
CN103093239B (en) * 2013-01-18 2016-04-13 上海交通大学 A kind of merged point to neighborhood information build drawing method
CN104809176B (en) * 2015-04-13 2018-08-07 中央民族大学 Tibetan language entity relation extraction method
CN105005594B (en) * 2015-06-29 2018-07-13 嘉兴慧康智能科技有限公司 Abnormal microblog users recognition methods
CN106777274B (en) * 2016-06-16 2018-05-29 北京理工大学 A kind of Chinese tour field knowledge mapping construction method and system
CN107391906B (en) * 2017-06-19 2020-04-28 华南理工大学 Healthy diet knowledge network construction method based on neural network and map structure
CN107943874B (en) * 2017-11-13 2019-08-23 平安科技(深圳)有限公司 Knowledge mapping processing method, device, computer equipment and storage medium
CN108563710B (en) * 2018-03-27 2021-02-02 腾讯科技(深圳)有限公司 Knowledge graph construction method and device and storage medium
CN108874957B (en) * 2018-06-06 2022-02-01 华东师范大学 Interactive music recommendation method based on Meta-graph knowledge graph representation
CN108920678A (en) * 2018-07-10 2018-11-30 福州大学 A kind of overlapping community discovery method based on spectral clustering with fuzzy set


Also Published As

Publication number Publication date
CN109597856A (en) 2019-04-09


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: Room 311501, Unit 1, Building 5, Courtyard 1, Futong East Street, Chaoyang District, Beijing

Applicant after: Beijing Zhichuangyu Information Technology Co., Ltd.

Address before: Room 311501, Unit 1, Building 5, Courtyard 1, Futong East Street, Chaoyang District, Beijing

Applicant before: Beijing Knows Chuangyu Information Technology Co.,Ltd.

GR01 Patent grant