CN109597856B

CN109597856B - Data processing method and device, electronic equipment and storage medium

Info

Publication number: CN109597856B
Application number: CN201811485414.2A
Authority: CN
Inventors: 曾山松; 岳永鹏
Original assignee: Beijing Knownsec Information Technology Co Ltd
Current assignee: Beijing Knownsec Information Technology Co Ltd
Priority date: 2018-12-05
Filing date: 2018-12-05
Publication date: 2020-12-25
Anticipated expiration: 2038-12-05
Also published as: CN109597856A

Abstract

The invention relates to a data processing method, a data processing device, an electronic device and a storage medium. The method comprises the following steps: converting attribute information of each node in a target knowledge graph network into space vector characteristics represented by numerical values to obtain an entity attribute characteristic matrix; acquiring a Laplace matrix of an entity relationship graph representing each node in a target knowledge graph network; determining final vector space representation of each node in the target knowledge graph network according to the entity attribute feature matrix and the Laplace matrix; calculating the final vector similarity between every two nodes in the target knowledge graph network; and fusing the node pairs with the vector similarity calculation result larger than a preset threshold value. According to the method, the attribute information and the adjacent information of the entity in the knowledge graph are utilized to learn the space representation of the entity vector, more comprehensive and accurate vector representation can be obtained, and the problem of inaccurate calculation of the entity similarity caused by the fact that the entity attribute is lost and the attribute value changes is solved.

Description

Data processing method and device, electronic equipment and storage medium

Technical Field

The invention belongs to the technical field of data processing, and particularly relates to a data processing method and device, electronic equipment and a storage medium.

Background

The knowledge graph construction needs to be updated with the continuous update of knowledge, for example, the attributes or relationships of the original entities need to be updated, or new entities and relationships need to be added. The method needs to determine whether a newly added entity already exists in an original map, and if so, the new entity needs to be linked to the original entity, fused into a unique entity, and update the attribute and relationship of the entity.

The existing common method for entity fusion is to determine whether different source entities can be aligned or not by using attribute information of the entities, if the attributes of the entities have unique identifiers, the unique identifiers between the two entities can be used for determination, and if the attributes of the unique identifiers do not exist, the attribute information of the entities can be vectorized and expressed, and the similarity of the two vectors is calculated.

Disclosure of Invention

In view of this, embodiments of the present invention provide a data processing method, an apparatus, an electronic device, and a storage medium, so as to effectively improve the calculation of the similarity affected by the incomplete attribute information or the change of the attribute information, thereby affecting the accuracy of entity fusion.

The embodiment of the invention is realized by the following steps:

in a first aspect, an embodiment of the present invention provides a data processing method, including: acquiring a target knowledge graph network; converting the attribute information of each node in the target knowledge graph network into space vector characteristics represented by numerical values to obtain an entity attribute characteristic matrix; acquiring a Laplace matrix of an entity relationship graph representing each node in the target knowledge graph network; determining final vector space representation of each node in the target knowledge graph network according to the entity attribute feature matrix and the Laplace matrix; calculating the final vector similarity between every two nodes in the target knowledge graph network; and fusing the node pairs with the vector similarity calculation result larger than a preset threshold value.

In the embodiment of the application, the entity vector space representation is learned by utilizing the attribute information and the adjacent information of the entity in the knowledge graph, so that more comprehensive and accurate vector representation can be obtained, the problem of inaccurate calculation of entity similarity caused by attribute value change due to entity attribute loss is avoided, and the accuracy and the reliability of entity fusion are improved.

With reference to a possible implementation manner of the embodiment of the first aspect, the obtaining a laplacian matrix of an entity relationship graph representing each node in the target knowledge-graph network includes: acquiring a degree matrix for representing the degree of each node in the target knowledge graph network; acquiring an adjacency matrix representing each node connection object in the target knowledge graph network; determining the Laplace matrix according to the degree matrix and the adjacency matrix.

With reference to still another possible implementation manner of the embodiment of the first aspect, the target knowledge-graph network includes n nodes, where n is an integer greater than 1; determining a final vector space representation of each node in the target knowledge-graph network according to the entity attribute feature matrix and the laplacian matrix, including:

calculating vector similarity between the ith node and each node in the n nodes in the target knowledge graph network based on the entity attribute feature matrix to obtain a similarity matrix, wherein the ith row in the similarity matrix represents the vector similarity between the ith node and each node in the n nodes, and i is greater than or equal to 1 and less than or equal to n; and determining final vector space representation of each node in the target knowledge graph network according to the similarity matrix and the Laplace matrix.

With reference to yet another possible implementation manner of the embodiment of the first aspect, determining a final vector space representation of each node in the target knowledge-graph network according to the similarity matrix and the laplacian matrix includes: determining a final vector space representation of each node in the target knowledge graph network according to a final vector space representation function, the similarity matrix and the Laplace matrix, wherein the final vector space representation function is as follows:

wherein

S represents the similarity matrix, W represents the Laplace matrix, H represents a final vector space representation matrix, lambda is an adjustment coefficient, is greater than or equal to 0 and less than or equal to 1, and H represents the final vector space representation of each node.

With reference to yet another possible implementation manner of the embodiment of the first aspect, calculating a vector similarity between every two nodes in the target knowledge-graph network includes: clustering vector features corresponding to each node in the target knowledge graph network through a clustering algorithm; and calculating the final vector similarity between every two nodes which belong to the same cluster.

In a second aspect, an embodiment of the present invention further provides a data processing apparatus, including: the device comprises a first acquisition module, a conversion module, a second acquisition module, a determination module, a calculation module and a fusion module; the first acquisition module is used for acquiring a target knowledge graph network; the conversion module is used for converting the attribute information of each node in the target knowledge graph network into space vector characteristics represented by numerical values to obtain an entity attribute characteristic matrix; the second acquisition module is used for representing a Laplace matrix of an entity relationship graph of each node in the target knowledge graph network; a determining module, configured to determine a final vector space representation of each node in the target knowledge-graph network according to the entity attribute feature matrix and the laplacian matrix; the calculation module is used for calculating the vector similarity between every two nodes in the target knowledge graph network; and the fusion module is used for fusing the node pairs with the vector similarity calculation result larger than the preset threshold value.

With reference to a possible implementation manner of the embodiment of the second aspect, the second obtaining module is further configured to: acquiring a degree matrix for representing the degree of each node in the target knowledge graph network; acquiring an adjacency matrix representing each node connection object in the target knowledge graph network; determining the Laplace matrix according to the degree matrix and the adjacency matrix.

In combination with yet another possible implementation manner of the embodiment of the second aspect, the target knowledge-graph network includes n nodes, where n is an integer greater than 1; the determining module is further configured to: calculating vector similarity between the ith node and each node in the n nodes in the target knowledge graph network based on the entity attribute feature matrix to obtain a similarity matrix, wherein the ith row in the similarity matrix represents the vector similarity between the ith node and each node in the n nodes, and i is greater than or equal to 1 and less than or equal to n; and determining final vector space representation of each node in the target knowledge graph network according to the similarity matrix and the Laplace matrix.

With reference to still another possible implementation manner of the embodiment of the second aspect, the determining module is further configured to: determining a final vector space representation of each node in the target knowledge graph network according to a final vector space representation function, the similarity matrix and the Laplace matrix, wherein the final vector space representation function is as follows:

wherein

With reference to still another possible implementation manner of the embodiment of the second aspect, the calculating module is further configured to: clustering vector features corresponding to each node in the target knowledge graph network through a clustering algorithm; and calculating the final vector similarity between every two nodes which belong to the same cluster.

In a third aspect, an embodiment of the present invention further provides an electronic device, including: a memory and a processor, the memory and the processor connected; the memory is used for storing programs; the processor is configured to invoke a program stored in the memory to perform the first aspect embodiment and/or a method provided in connection with any possible implementation manner of the first aspect embodiment.

In a fourth aspect, an embodiment of the present invention further provides a storage medium, where the storage medium includes a computer program, and the computer program is executed by a computer to perform the method provided in the embodiment of the first aspect and/or in connection with any one of the possible implementation manners of the embodiment of the first aspect.

Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by the practice of the embodiments of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings without creative efforts. The above and other objects, features and advantages of the present invention will become more apparent from the accompanying drawings. Like reference numerals refer to like parts throughout the drawings. The drawings are not necessarily to scale, emphasis instead being placed upon illustrating the principles of the invention.

Fig. 1 shows a schematic structural diagram of an electronic device according to an embodiment of the present invention.

Fig. 2 is a schematic flow chart illustrating a data processing method according to an embodiment of the present invention.

FIG. 3 shows a schematic diagram of a target knowledge-graph network provided by an embodiment of the invention.

Fig. 4 shows a block diagram of a data processing apparatus according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. The components of embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations.

Thus, the following detailed description of the embodiments of the present invention, presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures.

In the description of the present invention, it should be noted that the terms "first", "second", "third", and the like are used only for distinguishing the description, and are not intended to indicate or imply relative importance. Further, the term "and/or" in the present application is only one kind of association relationship describing the associated object, and means that three kinds of relationships may exist, for example, a and/or B may mean: a exists alone, A and B exist simultaneously, and B exists alone.

As shown in fig. 1, fig. 1 is a block diagram illustrating a structure of an electronic device 100 according to an embodiment of the present invention. The electronic device 100 includes: data processing device 110, memory 120, memory controller 130, and processor 140.

The memory 120, the memory controller 130, and the processor 140 are electrically connected to each other directly or indirectly to realize data transmission or interaction. For example, the components may be electrically connected to each other via one or more communication buses or signal lines. The data processing device 110 includes at least one software function module which can be stored in the memory 120 in the form of software or firmware (firmware) or solidified in an Operating System (OS) of the electronic device 100. The processor 140 is used to execute executable modules stored in the memory 120, such as software functional modules or computer programs included in the data processing apparatus 110.

The Memory 120 may be, but is not limited to, a Random Access Memory (RAM), a Read Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Read-Only Memory (EPROM), an electrically Erasable Read-Only Memory (EEPROM), and the like. The memory 120 is configured to store a program, and the processor 140 executes the program after receiving an execution instruction, and a method executed by the electronic device 100 defined by a flow disclosed in any embodiment of the invention described later may be applied to the processor 140, or implemented by the processor 140.

The processor 140 may be an integrated circuit chip having signal processing capabilities. The Processor may be a general-purpose Processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; but may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic device, discrete hardware components. The various methods, steps and logic blocks disclosed in the embodiments of the present invention may be implemented or performed. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.

In the embodiment of the present invention, the electronic device 100 may be, but is not limited to, a web server, a database server, a cloud server, and the like.

Referring to fig. 2, steps included in a data processing method applied to the electronic device 100 according to an embodiment of the present invention will be described with reference to fig. 2.

Step S101: and acquiring a target knowledge graph network.

When a given knowledge-graph network is analyzed, the network is used as a target knowledge-graph network for analysis. The knowledge-graph network is a graph data structure representing entity relations, each node in the graph represents an entity existing in the real world, and each edge is a relation between the entities. Briefly, the knowledge-graph network is a method of using a relationship network to connect various entities in the world together through their interrelations, and helps analysts perform association analysis and reasoning between the entities.

Step S102: and converting the attribute information of each node in the target knowledge graph network into space vector characteristics represented by numerical values to obtain an entity attribute characteristic matrix.

Learning a vector space representation of an entity in a target knowledge-graph network through a network representation. The Network Representation Learning is a distributed Representation Learning technology and is used for Learning low-dimensional social Network node vector Representation. It helps many analysis tasks to make link prediction, node clustering. In general, in a static flat network graph, network representation learning learns the vector representation of nodes in a vector space by using the adjacent information of the nodes, but in a knowledge graph network, each node in the graph is often provided with abundant attribute information, for example, a node with an entity type of Person may comprise attribute information of name, birth year and month, native place, occupation and the like. The representation of the nodes in the network space cannot be comprehensively learned by using the adjacent information of the nodes in the graph alone, which is also the reason that the similarity calculation result is inaccurate under the existing fusion mode.

It should be noted that, in the existing common method for entity fusion, it is determined whether different source entities can be aligned by using attribute information of the entities, if there is a unique identifier in attributes of the entities, it can be determined by using the unique identifier between the two entities, and if there is no unique identifier attribute, vectorization representation can be performed on the attribute information of the entities, and the similarity of the two vectors is calculated. The inventor of the application finds out in the process of the invention application that: due to the fact that the attribute information of the entity may not be comprehensively collected in engineering implementation, certain dimensionality of the entity attribute is lost, and the similarity calculation result according to the entity attribute information is inaccurate. In addition, since the attributes of entities are dynamically changing over time, merely disambiguating an entity based on the attribute information of the entity may result in misinterpretation as a different entity because the same entity possesses different attributes at different times.

The defects existing in the prior art are the results obtained after the inventor practices and researches, so that the discovery process of the above problems and the solution proposed by the following embodiments of the invention to the above problems should be the contribution of the inventor to the invention in the process of the invention.

Therefore, in the embodiment of the application, the defects existing in the existing fusion mode are solved by learning the vector space representation of the attribute information and the adjacent information of each node (entity) in the knowledge graph.

The vector space representation of the attribute information of each node (entity) in the knowledge graph can be learned through the following steps, for example, the attribute information of each node in the target knowledge graph network is converted into the space vector feature represented by a numerical value by using feature engineering, so as to obtain the entity attribute feature matrix of the knowledge graph. The text information in the attribute can be converted into a numerical characteristic vector in a word2vec mode, and the category information in the attribute can be coded into numerical characteristics by one-hot coding. For example, the entity attribute information of each node shown in table 1 may be converted into the entity attribute feature matrix shown in table 2 in the above manner. Wherein the name of each entity is learned by character embedding (such as word2vec), and is coded by one-hot method by China and profession.

TABLE 1

Entity	Name (I)	Height of a person	Body weight	Native place	Occupation of the world
						1	Plum steel	173	56	Henan province	Doctor
2	Li Jing	168	72	Henan province	Doctor
						3	Plum steel	166	77	Henan province	Doctor
4	Plum steel	179	63	Hebei river	Doctor
						5	Plum jade steel	180	66	Hebei river	Teacher's teacher
6	Zhang Xin	172	64	North of a lake	Engineer(s)
						7	Plum steel	177	63	Hunan province	Officer
8	King and honour	185	69	Hunan province	Officer

TABLE 2

[0.2，0.3]	173	56	[0，0，0，1]	[0，0，0，1]
					[0.4，0.3]	168	72	[0，0，0，1]	[0，0，0，1]
[0.2，0.3]	166	77	[0，0，0，1]	[0，0，0，1]
					[0.2，0.3]	179	63	[0，0，1，0]	[0，0，0，1]
[0.3，0.3]	180	66	[0，0，1，0]	[0，0，1，0]
					[0.2，0.3]	172	64	[0，1，0，0]	[0，1，0，0]
[0.2，0.3]	177	63	[1，0，0，0]	[1，0，0，0]
					[0.5，0.3]	185	69	[1，0，0，0]	[1，0，0，0]

Step S103: and acquiring a Laplace matrix of an entity relation graph representing each node in the target knowledge graph network.

After a target knowledge graph network to be analyzed is obtained, a Laplace matrix of an entity relation graph representing each node in the target knowledge graph network is obtained. The Laplace matrix can be determined according to the degree matrix and the adjacency matrix of the target knowledge graph network. Therefore, obtaining a laplacian matrix of an entity relationship graph representing each node in the target knowledge-graph network includes: acquiring a degree matrix for representing the degree of each node in the target knowledge graph network; acquiring an adjacency matrix representing each node connection object in the target knowledge graph network; determining the Laplace matrix according to the degree matrix and the adjacency matrix. The calculation formula of the Laplace matrix is as follows:

l ═ D-C, where L matrix is the laplacian matrix to be calculated, D matrix is the degree matrix of the graph, and C is the adjacency matrix of the graph.

For example, the degree matrix, adjacency matrix, laplace matrix of the target knowledge-graph network shown in fig. 3 can be represented by the following tables:

table 3 (degree matrix of figure 3)

1	0	0	0	0	0	0	0
								0	1	0	0	0	0	0	0
0	0	5	0	0	0	0	0
								0	0	0	1	0	0	0	0
0	0	0	0	2	0	0	0
								0	0	0	0	0	3	0	0
0	0	0	0	0	0	1	0
								0	0	0	0	0	0	0	2

TABLE 4 (adjacency matrix of FIG. 3)

0	0	1	0	0	0	0	0
								0	0	1	0	0	0	0	0
1	1	0	1	1	1	0	0
								0	0	1	0	0	0	0	0
0	0	1	0	0	0	0	1
								0	0	1	0	0	0	0	1
0	0	0	0	0	1	0	0
								0	0	0	0	1	1	0	0

Table 5 (Laplace matrix of FIG. 3)

1	0	－1	0	0	0	0	0
								0	1	－1	0	0	0	0	0
－1	－1	5	1	－1	－1	0	0
								0	0	－1	1	0	0	0	0
0	0	－1	0	2	0	0	－1
								0	0	－1	0	0	3	0	－1
0	0	0	0	0	－1	1	0
								0	0	0	0	－1	－1	0	2

Where table 3 is the degree matrix of fig. 3, table 4 is the adjacency matrix of fig. 3, and table 5 is the laplacian matrix of fig. 3. Where each row in the table corresponds to a respective node in fig. 3, e.g., the first row in table 3 corresponds to node 1 in fig. 3, the second row in table 3 corresponds to node 2 in fig. 3, and the rest is similar.

Wherein, for node 1, there is only one edge, so the value in the corresponding degree matrix is 1; similarly, there is only one edge for node 2, so the value in the corresponding degree matrix is 1; similarly, there are 5 edges for node 3, and thus the corresponding degree matrix has a value of 5, which is similar for the rest of the cases.

Wherein, for node 1, the node connected to it is 3, so the value of the 3 rd column in the corresponding adjacency matrix is 1; similarly, for node 2, the node connected to it is 3, and therefore, the value of column 3 in the corresponding adjacency matrix is 1; similarly, for node 3, the nodes connected to it are 1, 2, 4, 5, 6, so the value of the 1 st, 2 nd, 4 th, 5 th, 6 th column in the corresponding adjacency matrix is 1, and the rest is similar.

Step S104: and determining final vector space representation of each node in the target knowledge graph network according to the entity attribute feature matrix and the Laplace matrix.

After the entity attribute feature matrix and the Laplace matrix of the target knowledge graph network are learned, the final vector space representation of each node in the target knowledge graph network can be determined according to the two matrices. As an alternative implementation, the final vector space representation of each node in the target knowledge-graph network is calculated, for example, by the formula V ═ f (T, L). Wherein T represents an entity attribute feature matrix, L represents a Laplacian matrix of an entity relation graph, V represents a final vector space representation of an entity in a target knowledge graph, and f is a function for calculating the final vector space representation, which is usually a convolutional neural network. For example, when the final vector space representation of the node 1 is calculated, the entity attribute feature of the node 1 in the entity attribute feature matrix and the relationship feature of the node 1 in the laplace matrix are input into the convolutional neural network, so that the final vector space representation of the node 1 can be obtained, and the obtaining processes of the final vector space representations of the other nodes are similar to each other.

As another optional implementation, after the entity attribute feature matrix of the target knowledge graph network is learned, the vector similarity between each node in the target knowledge graph network and each node in all nodes is further calculated, so as to obtain a similarity matrix. For ease of understanding, it is assumed that the target knowledge-graph network includes n nodes, n being an integer greater than 1. At this time, the above process is to calculate the vector similarity between the ith node and each of the n nodes in the target knowledge graph network based on the entity attribute feature matrix to obtain a similarity matrix, where the ith row in the similarity matrix represents the vector similarity between the ith node and each of the n nodes, and i is greater than or equal to 1 and less than or equal to n. That is, a target knowledge graph network containing n nodes can obtain an n-dimensional similarity matrix by pairwise calculating the similarity of the attribute feature vectors. For convenience of understanding, the target knowledge-graph network shown in fig. 3 is taken as an example for explanation, that is, after the entity attribute feature matrix of fig. 3 is obtained, for the node 1, the vector similarity between the node 1 and the node 2, the vector similarity between the node 1 and the node 3, the vector similarity between the node 1 and the node 4, the vector similarity between the node 1 and the node 5, the vector similarity between the node 1 and the node 6, the vector similarity between the node 1 and the node 7, and the vector similarity between the node 1 and the node 8 need to be calculated. Similarly, for node 2, the vector similarity between node 2 and node 1, the vector similarity between node 2 and node 2, the vector similarity between node 2 and node 3, the vector similarity between node 2 and node 4, the vector similarity between node 2 and node 5, the vector similarity between node 2 and node 6, the vector similarity between node 2 and node 7, and the vector similarity between node 2 and node 8 need to be calculated. The calculation of each of the other nodes is similar. This results in a n x n-dimensional similarity matrix.

The calculation formula of the vector similarity is as follows:

wherein the attribute characteristics of one node are represented by a vector a and the attributes of the other node are represented by a vector B. Each item in the attribute vector is a feature value corresponding to each attribute of the entity, for example, for

node

1, 5 attributes are included, such as (name, height, weight, native place, occupation).

And after a similarity matrix is obtained, determining the final vector space representation of each node in the target knowledge graph network according to the similarity matrix and the Laplace matrix. For example, the final vector space representation of each node in the target knowledge-graph network may be determined according to a final vector space representation function, the similarity matrix, and the laplacian matrix, where the final vector space representation function is:

，

wherein

S represents the similarity matrix, W represents the Laplace matrix, H represents a final vector space representation matrix, lambda is an adjustment coefficient, is greater than or equal to 0 and less than or equal to 1, and H represents the final vector space representation of each node. By solving the above representation functions, the final vector space representation h of each entity can be found.

The calculation of the final vector space indicates that the final vector space should satisfy (1) two entities that are originally close in attribute space attribute value are also similar in the finally obtained vector space (first condition), (2) two entities that are originally adjacent in the knowledge-graph network are also similar in the finally obtained vector space (second condition). That is, the first term in the above-mentioned representing function corresponds to the above-mentioned first condition, and the second term in the representing function corresponds to the above-mentioned second condition.

Step S105: and calculating final vector similarity between every two nodes in the target knowledge-graph network.

After the final vector space representation of each node in the target knowledge-graph network is obtained, calculating the final vector similarity between every two nodes in the target knowledge-graph network, and obtaining the final vector similarity of each node pair, such as obtaining the final vector similarity of the node 1 and the node 2, the final vector similarity of the node 1 and the node 3, the final vector similarity of the node 1 and the node 4, the final vector similarity of the node 1 and the node 5, the final vector similarity of the node 1 and the node 6, the final vector similarity of the node 2 and the node 3, and the like.

As an optional implementation manner, in order to reduce the calculation difficulty, the vector features corresponding to each node in the target knowledge graph network may be clustered through a clustering algorithm, so that each node in the target knowledge graph network may be divided into a plurality of different clusters, the node attribute association inside each cluster is strong, and the node attribute association between clusters is weak. Therefore, when the final vector similarity between every two nodes in the calculation is calculated, the final vector similarity between every two nodes in the same cluster can be calculated, the calculation difficulty can be reduced, and the time can be saved.

The clustering algorithm may be a current clustering algorithm, such as a K-Means clustering algorithm.

Step S106: and fusing the node pairs with the vector similarity calculation result larger than a preset threshold value.

After the final vector similarity of each node pair is obtained, the node pairs with the vector similarity calculation results larger than a preset threshold value are screened out, and the node pairs meeting the conditions, namely the attributes are fused. For example, if the vector similarity calculation result of the node 1 and the node 2 is greater than the preset threshold, the node 1 and the node 2 are merged, and at this time, a new node is formed.

To sum up, in the data processing method provided in the embodiment of the present application, the attribute information of each node in the target knowledge-graph network is converted into a space vector feature represented by a numerical value, so as to obtain an entity attribute feature matrix, obtain a laplacian matrix of an entity relationship graph representing each node in the target knowledge-graph network, determine a final vector space representation of each node in the target knowledge-graph network according to the entity attribute feature matrix and the laplacian matrix, and calculate a final vector similarity between every two nodes based on the final vector space representation of each node in the target knowledge-graph network, so as to merge nodes whose vector similarity calculation results are greater than a preset threshold. By fully utilizing the attribute information and the adjacent information of the entity in the knowledge graph to learn the space representation of the entity vector, more comprehensive and accurate vector representation can be obtained, and the problem of inaccurate calculation of the entity similarity caused by the fact that the attribute value is changed due to the fact that the entity attribute is lost is solved. In addition, the space vector of the entity is divided into different subspaces by using a clustering algorithm, and finally, the pairwise similarity is calculated in the subspaces, so that the performance problem caused by pairwise calculation of the entity of the whole graph under the condition of large data volume is avoided.

The embodiment of the present application further provides a data processing apparatus 110, as shown in fig. 4. The data processing apparatus 110 includes: a first obtaining module 111, a converting module 112, a second obtaining module 113, a determining module 114, a calculating module 115, and a fusing module 116.

The first obtaining module 111 is configured to obtain a target knowledge-graph network.

A converting module 112, configured to convert the attribute information of each node in the target intellectual graph network into a space vector feature represented by a numerical value, so as to obtain an entity attribute feature matrix.

A second obtaining module 113, configured to obtain a laplacian matrix of an entity relationship graph representing each node in the target knowledge-graph network.

A determining module 114, configured to determine a final vector space representation of each node in the target knowledge-graph network according to the entity attribute feature matrix and the laplacian matrix.

A calculating module 115, configured to calculate a vector similarity between every two nodes in the target knowledge-graph network.

And the fusion module 116 is configured to fuse the node pairs whose vector similarity calculation result is greater than the preset threshold.

It should be noted that, in the present specification, the embodiments are all described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments may be referred to each other.

The data processing apparatus 110 according to the embodiment of the present invention has the same implementation principle and technical effect as the foregoing method embodiments, and for the sake of brief description, reference may be made to the corresponding contents in the foregoing method embodiments for the parts of the apparatus embodiments that are not mentioned.

The embodiment of the present application further provides a storage medium, where the storage medium includes a computer program, and the computer program is executed by a computer to perform the data processing method.

In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method can be implemented in other ways. The apparatus embodiments described above are merely illustrative, and for example, the flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

In addition, the functional modules in the embodiments of the present invention may be integrated together to form an independent part, or each module may exist separately, or two or more modules may be integrated to form an independent part.

The functions, if implemented in the form of software functional modules and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a notebook computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes. It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.

The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

1. A data processing method, comprising:

acquiring a target knowledge graph network representing attribute relations of people;

converting the attribute information of each node in the target knowledge graph network into space vector features represented by numerical values to obtain an entity attribute feature matrix, wherein when the entity represented by the node is a person, the attribute information comprises: name, height, native place, occupation, weight;

acquiring a Laplace matrix of an entity relationship graph representing each node in the target knowledge graph network;

determining final vector space representation of each node in the target knowledge graph network according to the entity attribute feature matrix and the Laplace matrix;

calculating the final vector similarity between every two nodes in the target knowledge graph network;

and fusing the node pairs with the vector similarity calculation result larger than a preset threshold value.

2. The method of claim 1, wherein obtaining the laplacian matrix of the entity relationship graph characterizing each node in the target knowledge-graph network comprises:

acquiring a degree matrix for representing the degree of each node in the target knowledge graph network;

acquiring an adjacency matrix representing each node connection object in the target knowledge graph network;

determining the Laplace matrix according to the degree matrix and the adjacency matrix.

3. The method of claim 1, wherein the target knowledge-graph network comprises n nodes, n being an integer greater than 1; determining a final vector space representation of each node in the target knowledge-graph network according to the entity attribute feature matrix and the laplacian matrix, including:

calculating vector similarity between the ith node and each node in the n nodes in the target knowledge graph network based on the entity attribute feature matrix to obtain a similarity matrix, wherein the ith row in the similarity matrix represents the vector similarity between the ith node and each node in the n nodes, and i is greater than or equal to 1 and less than or equal to n;

and determining final vector space representation of each node in the target knowledge graph network according to the similarity matrix and the Laplace matrix.

4. The method of claim 3, wherein determining a final vector space representation for each node in the target knowledge-graph network based on the similarity matrix and the Laplace matrix comprises:

determining a final vector space representation of each node in the target knowledge graph network according to a final vector space representation function, the similarity matrix and the Laplace matrix, wherein the final vector space representation function is as follows:

，

wherein

S represents the similarity matrix, W represents the Laplace matrix, H represents a final vector space representation matrix, and lambda is an adjustment coefficient and is greater than or equal to 0 and less than or equal to 0Equal to 1, h is the final vector space representation of each node.

5. The method of claim 1, wherein calculating the vector similarity between each two nodes in the target knowledge-graph network comprises:

clustering vector features corresponding to each node in the target knowledge graph network through a clustering algorithm;

and calculating the final vector similarity between every two nodes which belong to the same cluster.

6. A data processing apparatus, comprising:

the first acquisition module is used for acquiring a target knowledge graph network representing the attribute relationship of a person;

a conversion module, configured to convert attribute information of each node in the target knowledge-graph network into a space vector feature represented by a numerical value, so as to obtain an entity attribute feature matrix, where, when an entity represented by a node is a person, the attribute information includes: name, height, native place, occupation, weight;

the second acquisition module is used for representing a Laplace matrix of an entity relationship graph of each node in the target knowledge graph network;

a determining module, configured to determine a final vector space representation of each node in the target knowledge-graph network according to the entity attribute feature matrix and the laplacian matrix;

the calculation module is used for calculating the vector similarity between every two nodes in the target knowledge graph network;

and the fusion module is used for fusing the node pairs with the vector similarity calculation result larger than the preset threshold value.

7. The apparatus of claim 6, wherein the second obtaining module is further configured to: acquiring a degree matrix for representing the degree of each node in the target knowledge graph network;

8. The apparatus of claim 6, wherein the target knowledge-graph network comprises n nodes, n being an integer greater than 1; the determining module is further configured to:

9. An electronic device, comprising: a memory and a processor, the memory and the processor connected;

the memory is used for storing programs;

the processor is configured to invoke a program stored in the memory to perform the method of any of claims 1-5.

10. A storage medium, characterized in that the storage medium comprises a computer program which, when executed by a computer, performs the method according to any one of claims 1-5.