CN114118310A

CN114118310A - Clustering method and device based on comprehensive similarity

Info

Publication number: CN114118310A
Application number: CN202210103985.5A
Authority: CN
Inventors: 张家华; 郑重; 经小川; 郑俊康; 诗博雅; 李瑞群
Original assignee: Aerospace Hongkang Intelligent Technology Beijing Co ltd
Current assignee: Aerospace Hongkang Intelligent Technology Beijing Co ltd
Priority date: 2022-01-28
Filing date: 2022-01-28
Publication date: 2022-03-01

Abstract

A clustering method and device based on comprehensive similarity are disclosed. The method comprises the following steps: obtaining an entity data set comprising an entity; calculating the comprehensive similarity of every two entities in the entities; and carrying out hierarchical clustering on the entities according to the comprehensive similarity, wherein for a first entity and a second entity in any two entities, the first entity comprises a first name and a plurality of first attributes, the second entity comprises a second name and a plurality of second attributes, and the comprehensive similarity is calculated by adopting the following steps: calculating the editing distance between the first name and the second name to obtain the similarity of the editing distance; extracting word vectors of the first name and the second name based on a pre-training language model, and calculating cosine similarity of the word vectors; extracting attribute word vectors of a plurality of first attributes and a plurality of second attributes based on a pre-training language model, and calculating attribute similarity based on the attribute word vectors; and weighting and adding the editing distance similarity, the word vector cosine similarity and the attribute similarity to obtain the comprehensive similarity.

Description

Clustering method and device based on comprehensive similarity

Technical Field

The present disclosure relates to solving entity mapping and category identification problems in the field of natural language processing entity mapping, and more particularly, to a clustering method and apparatus based on comprehensive similarity.

Background

The ontology in the information processing field can be regarded as a resource set of general knowledge and special knowledge, and provides rich knowledge for artificial intelligence application such as information extraction and natural language processing, and lays a solid foundation. The ontology representation is a collection of a class or set of entities within a particular domain that have common attributes. An ontology includes conceptual definitions of a field of knowledge or technology and may include relationships between concepts. In order to facilitate knowledge sharing and information propagation, different ontologies in related fields are required to be effectively linked or fused, and an information user can conveniently and accurately master knowledge in the fields on the whole. Since different ontologies are based on different construction criteria (heterogeneous ontologies from one another), naming or description of the same concept in different ontologies often has a large difference. This hinders the identification of entities describing the same concept, making it difficult to perform a fusion operation of heterogeneous ontologies.

For entity information obtained by natural language processing, how to perform ontology mapping is a key technology. In order to realize the interoperation between heterogeneous ontologies, various methods for discovering the mapping relationship between ontologies have been proposed in recent years. In various mapping methods, the concept similarity calculation based on the word constitution characteristics basically does not need other corpus resource support except the words, and the calculation has the characteristics of directness, rapidness and the like, so the method is widely applied. However, the existing correlation method still has the problems that the computation of similarity of synonyms and synonyms is difficult for variant bodies with the same semantics but not completely consistent in writing, the distribution strategy of the weight of the constituent words of the concept terms to be matched is incomplete, and the like.

Disclosure of Invention

The disclosure provides a clustering method and device based on comprehensive similarity.

According to a first aspect of the embodiments of the present disclosure, there is provided a clustering method based on comprehensive similarity, the method including: obtaining an entity data set comprising a plurality of entities; calculating the comprehensive similarity of every two entities in the plurality of entities; and performing hierarchical clustering on the plurality of entities according to the comprehensive similarity, wherein for a first entity and a second entity in any two entities, the first entity comprises a first name and a plurality of first attributes, the second entity comprises a second name and a plurality of second attributes, and the comprehensive similarity is calculated by adopting the following steps: calculating the editing distance between the first name and the second name to obtain the similarity of the editing distance; extracting a first name word vector of the first name and a second name word vector of the second name based on a pre-training language model, and calculating the cosine similarity of the word vectors of the first name word vector and the second name word vector; extracting a first attribute word vector of a plurality of first attributes and a second attribute word vector of a plurality of second attributes based on a pre-training language model, and calculating attribute similarity based on the first attribute word vector and the second attribute word vector; and weighting and adding the editing distance similarity, the word vector cosine similarity and the attribute similarity to obtain the comprehensive similarity.

Alternatively, the edit distance similarity may be expressed as

Where x is a first name, y is a second name, D (x, y) is an edit distance of x to y, and | x | and | y | represent string lengths of the first name and the second name, respectively.

Alternatively, the cosine similarity of the word vector may be expressed as

Wherein A is the first name word directionQuantity, B is the second name word vector, A_iThe ith word vector being the first name word vector, B_iAn ith word vector that is the second name word vector.

Alternatively, the attribute similarity may be expressed as

,

Wherein, U_iAn ith first attribute word vector representing a plurality of first attributes, m being the number of the plurality of first attributes, V_jA jth second attribute word vector representing the plurality of second attributes, n being the number of the plurality of second attributes, { e | e = sim_vect(U_i,V_j) J ∈ {1.. n } } and { e | e = sim_vect(U_i,V_j) I ∈ {1.. m } } respectively denote a set made up of elements e.

Optionally, the clustering method may include: and judging the entities with the comprehensive similarity smaller than or equal to a first threshold as the same ontology, and judging the entities with the comprehensive similarity larger than the first threshold and smaller than or equal to a second threshold as the relatives, wherein the ontology represents a set of entities with common attributes in a specific field.

According to a second aspect of the embodiments of the present disclosure, there is provided a clustering apparatus based on integrated similarity, the apparatus including: a dataset acquisition unit configured to acquire an entity dataset including a plurality of entities; a similarity calculation unit configured to calculate a comprehensive similarity of each two entities of the plurality of entities; the hierarchical clustering unit is configured to perform hierarchical clustering on the plurality of entities according to the comprehensive similarity; wherein, for a first entity and a second entity in any two entities, the first entity includes a first name and a plurality of first attributes, the second entity includes a second name and a plurality of second attributes, the similarity calculation unit executes the following steps: calculating the editing distance between the first name and the second name to obtain the similarity of the editing distance; extracting a first name word vector of the first name and a second name word vector of the second name based on a pre-training language model, and calculating the cosine similarity of the word vectors of the first name word vector and the second name word vector; extracting a plurality of first attribute word vectors of first attributes and a plurality of second attribute word vectors of second attributes based on a pre-training language model, and calculating attribute similarity based on the first attribute word vectors and the second attribute word vectors; and weighting and adding the editing distance similarity, the word vector cosine similarity and the attribute similarity to obtain the comprehensive similarity.

Optionally, the clustering device may further include a mapping relation determining unit. The mapping relation determination unit is configured to determine entities with integrated similarity smaller than or equal to a first threshold as the same ontology, and determine entities with integrated similarity larger than the first threshold and smaller than or equal to a second threshold as a relativity class, wherein the ontology represents a set of entities with common attributes in a specific domain.

According to a third aspect of the embodiments of the present disclosure, there is provided an electronic apparatus including: at least one processor; at least one memory storing computer-executable instructions, wherein the computer-executable instructions, when executed by the at least one processor, cause the at least one processor to perform a comprehensive similarity-based clustering method as described above.

According to a fourth aspect of embodiments of the present disclosure, there is provided a computer-readable storage medium storing instructions that, when executed by at least one processor, cause the at least one processor to perform the integrated similarity-based clustering method as described above.

The technical scheme provided by the embodiment of the disclosure at least brings the following beneficial effects:

compared with the prior art, in the ontology similarity measurement technology in the field of natural language processing, according to one or more exemplary embodiments of the present disclosure, similarity measurement is performed based on a pre-trained language model, so that the model speed is faster and the accuracy is higher, for example, the cosine similarity of a word vector obtained by the pre-trained language model has higher accuracy, and in addition, the editing distance similarity, the cosine similarity of the word vector and the attribute similarity are comprehensively considered to have higher accuracy. Compared with the existing ontology mapping method, the hierarchical clustering algorithm is used for clustering the similarity of different entities, possible relatives among different ontologies can be obtained while ontology mapping is achieved, and developers can conveniently construct the child-parent relationship of the ontologies.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the disclosure and are not to be construed as limiting the disclosure.

FIG. 1 is a flow chart illustrating a comprehensive similarity-based clustering method according to the present disclosure;

FIG. 2 is a flow chart illustrating another clustering method based on integrated similarity according to the present disclosure;

FIG. 3 is an example illustrating a clustering method according to the present disclosure;

FIG. 4 is a schematic diagram illustrating a clustering apparatus according to the present disclosure; and

fig. 5 is a block diagram illustrating an electronic device according to an exemplary embodiment of the present disclosure.

Detailed Description

In order to make the technical solutions of the present disclosure better understood by those of ordinary skill in the art, the technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.

It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the disclosure described herein are capable of operation in sequences other than those illustrated or otherwise described herein. The embodiments described in the following examples do not represent all embodiments consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.

In this case, the expression "at least one of the items" in the present disclosure means a case where three types of parallel expressions "any one of the items", "a combination of any plural ones of the items", and "the entirety of the items" are included. For example, "include at least one of a and B" includes the following three cases in parallel: (1) comprises A; (2) comprises B; (3) including a and B. For another example, "at least one of the first step and the second step is performed", which means that the following three cases are juxtaposed: (1) executing the step one; (2) executing the step two; (3) and executing the step one and the step two.

FIG. 1 is a flow chart illustrating a comprehensive similarity-based clustering method according to the present disclosure.

Specifically, the clustering method based on the comprehensive similarity can solve the problem of how to measure the similarity of the ontologies in the ontology mapping field, and also solve the classification of the categories among the ontologies on a certain level. For example, whether the objects belong to the same ontology or not, and whether the relatives such as siblings, child parents, and the like are related to each other, are determined.

Referring to fig. 1, in step S100, an entity data set including a plurality of entities, each of which includes a name (concept name) and a plurality of attributes, is acquired.

In step S200, a comprehensive similarity of each two entities of the plurality of entities is calculated.

In an example embodiment, for a first entity and a second entity of any two entities, the first entity including a first name and a plurality of first attributes, and the second entity including a second name and a plurality of second attributes, the calculation of the integrated similarity may be performed by the following sub-steps.

In sub-step S210, an edit distance of the first name and the second name is calculated to obtain an edit distance similarity. By way of example, the edit distance may be to the smallest operand that translates from the first name to the second name (operations include insert, delete, replace, etc.). Generally, the smaller the edit distance between concepts, the greater the similarity of their names.

In an embodiment, the edit distance similarity may be expressed as

Where x is a first name, y is a second name, D (x, y) is an edit distance from x to y, and | x | and | y | represent string lengths of the first name and the second name, respectively.

As an example, if x is "horse", y is "ros", horse → horse (replace 'h' with 'r'), horse → rose (delete 'r'), rose → ros (delete 'e'), and D (x, y) is 3.

In the sub-step S220, a first name word vector of the first name and a second name word vector of the second name are extracted based on the pre-training language model, and the cosine similarity of the word vectors of the first name word vector and the second name word vector is calculated.

For example, in an embodiment, a first name word vector for a first name and a second name word vector for a second name may be extracted using an embedding layer of the ERNIE pre-training language model.

In an embodiment, the cosine similarity of the word vector may be expressed as

Wherein A is a first name word vector, B is a second name word vector, A_iThe ith word vector being the first name word vector, B_iThe ith word vector is the second name word vector.

In sub-step S230, a first attribute word vector of a plurality of first attributes and a second attribute word vector of a plurality of second attributes are extracted based on the pre-trained language model, and attribute similarity is calculated based on the first attribute word vector and the second attribute word vector. For example, a set of attribute word vectors may be generated for each of the plurality of first attributes, and if the number of first attributes is m, a plurality of m first attribute word vectors corresponding to the plurality of first attributes in a one-to-one correspondence may be collected based on the pre-trained language model.

In an embodiment, the attribute similarity is expressed as

Wherein, U_iAn ith first attribute word vector representing a plurality of first attributes, m being the number of the plurality of first attributes, V_jA jth second attribute word vector representing a plurality of second attributes, n being the number of the plurality of second attributes, wherein { e | e = sim }_vect(U_i,V_j) J ∈ {1.. n } } and { e | e = sim_vect(U_i,V_j) I ∈ {1.. m } } respectively denote a set made up of elements e.

Function sim in addition to substituting the first attribute word vector and the second attribute word vector_vect(U_i, V_j) The same as or similar to the cosine similarity calculation method described with reference to step S220, and redundant description is omitted here. Expression max ({ e | e = sim)_vect(U_i,V_j) J e {1.. n } }) represents one having the largest similarity among the set of similarities of the ith first attribute word vector and the jth second attribute word vector. That is, the expression max is equivalent to matching a plurality of first attributes with a plurality of second attributes, and when the attribute similarity is evaluated, firstly, the best matching pair is found through matching, and the similarity of the attribute pairs matched with each other is obtained by calculating the cosine similarity of the attribute word vectors.

In sub-step S240, the edit distance similarity, the word vector cosine similarity, and the attribute similarity are weighted and added to obtain a comprehensive similarity.

In step S300, the multiple entities are hierarchically clustered according to the integrated similarity. The method of hierarchical clustering will be further described later with reference to fig. 3.

FIG. 2 is a flow chart illustrating another clustering method based on integrated similarity according to the present disclosure.

Except for step S400, the clustering method illustrated in fig. 2 is substantially the same as or similar to the clustering method described with reference to fig. 1, and redundant description is omitted herein.

In step S400, entities having an integrated similarity equal to or less than a first threshold are determined as the same ontology, and entities having an integrated similarity greater than the first threshold and equal to or less than a second threshold are determined as a family, where the ontology represents a set of entities having a common attribute in a specific domain. The method of hierarchical clustering will be further described later with reference to fig. 3.

Fig. 3 is an example illustrating a clustering method according to the present disclosure.

In an embodiment, Hierarchical Clustering (Hierarchical Clustering) is a kind of Clustering algorithm, and a Hierarchical nested cluster tree is created by calculating similarities between data points of different classes. In a cluster tree, the original data points of different classes are the lowest level of the tree, and the top level of the tree is the root node of a cluster. The clustering tree is created by two methods of bottom-up combination and top-down division. The merging algorithm of hierarchical clustering combines two most similar data points of all data points by calculating the similarity between the two types of data points, and iterates the process repeatedly. In brief, the merging algorithm of hierarchical clustering determines the similarity between data points of each category by calculating the distance between them, and the smaller the distance, the higher the similarity. And combining the two data points or categories with the closest distance to generate a clustering tree.

Specifically, referring to the illustration in fig. 3, the similarity is greatest for entity 1 and entity 14, followed by entity 13 and entity 16, and entity 0 and entity 12. Thus, entity 1 and entity 14 are merged first, then entity 13 and entity 16 can be merged, and further entity 0 and entity 12. Finding out two clusters (entities) with the minimum integrated similarity distance and combining the two clusters into a new cluster, wherein in the clustering process, the method of average link is adopted for judging the similarity between two categories by combining the small clusters into a large cluster, namely, the distances between every two sample points in two sets are put together to calculate the average value, and the similarity of the two entity clusters is measured by the average value. And recalculating the distance between the new cluster and other clusters, and repeating the iteration until all the clusters are merged into a large class.

In an embodiment, a suitable threshold may be further selected to determine the constructed hierarchical clustering tree. For example, a first threshold (ontology mapping threshold) and a second threshold (relatives discrimination threshold) are selected. And performing ontology mapping on the entity with the comprehensive similarity distance below the ontology mapping threshold as the same ontology. And judging that the entities with the comprehensive similarity clustering between the first threshold and the second threshold have the relativity relationship. More specifically, a parent class may be defined as hierarchically high and a child class may be defined as hierarchically low. In another embodiment, the entity pairs (or clusters) may be ordered by similarity, and the first threshold may be the entities ranked less than or equal to a first percentage of all entity pairs (or clusters) (e.g., the entities with the top 10% similarity order). The second threshold may be entities ranked greater than a first percentage of all entity pairs (or clusters) and less than or equal to a second percentage (e.g., entities with a 10% to 20% similarity ranking). As shown in fig. 3, if the first threshold is 10%, the entities (entity pairs) satisfying the first threshold are entities (1, 14), (13, 16), and (0, 12), that is, the

entities

1 and 14 can be determined as the same entity, the

entities

13 and 16 are the same entity, and the

entities

0 and 12 are the same entity. In the entities (entity pairs or clusters) of more than 10% and 20% or less, the similarity between the entity 18 and the entities (1, 14) is greater than the ontology mapping threshold, so that it can be determined that different ontologies exist, but the similarity is less than or equal to the relatives determination threshold, so that it can be determined that the relatives relationship may exist. Similarly, entity 4 may have a familial relationship with entity (13, 16).

According to one or more exemplary embodiments of the present disclosure, the similarity measurement based on the pre-trained language model is used, so that the model speed is faster and the accuracy is higher, for example, the word vector cosine similarity obtained through the pre-trained language model has higher accuracy, and in addition, the editing distance similarity, the word vector cosine similarity and the attribute similarity are comprehensively considered to have higher accuracy. Compared with the existing ontology mapping method, the hierarchical clustering algorithm is used for clustering the similarity of different entities, possible relatives among different ontologies can be obtained while ontology mapping is achieved, and developers can conveniently construct the child-parent relationship of the ontologies.

Fig. 4 is a schematic diagram illustrating the integrated similarity-based clustering apparatus 10 according to the present disclosure.

Referring to fig. 4, the integrated similarity-based clustering apparatus 10 includes: a data set acquisition unit 100, a similarity calculation unit 200, and a hierarchical clustering unit 300.

The dataset acquisition unit 100 is configured to acquire an entity dataset comprising a plurality of entities. The data set acquisition unit 100 is configured to perform the method described with reference to step S100 of fig. 1.

The similarity calculation unit 200 is configured to calculate a comprehensive similarity of each two entities of the plurality of entities. The similarity calculation unit 200 is configured to perform the method described with reference to step S200 of fig. 1, in particular, to perform sub-step S210 to sub-step S240.

The hierarchical clustering unit 300 is configured to hierarchically cluster the plurality of entities according to the integrated similarity. The hierarchical clustering unit 300 is configured to perform the method described with reference to step S300 of fig. 1.

Furthermore, the clustering apparatus 10 based on the integrated similarity may further include a mapping relation determination unit 240. The mapping relation determination unit 240 is configured to determine entities having an integrated similarity equal to or less than a first threshold as the same ontology, and determine entities having an integrated similarity greater than the first threshold and equal to or less than a second threshold as a family. The mapping relation determining unit 240 is configured to perform the method described with reference to step S400 of fig. 2.

With regard to the apparatus in the above-described embodiment, the specific manner in which each module/unit performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated herein.

Fig. 5 is a block diagram illustrating an electronic device 500 according to an example embodiment of the present disclosure.

Referring to fig. 5, an electronic device 500 includes at least one memory 501 and at least one processor 502, the at least one memory 501 storing computer-executable instructions that, when executed by the at least one processor 502, cause the at least one processor 502 to perform a comprehensive similarity-based clustering method according to embodiments of the present disclosure.

By way of example, the electronic device 500 may be a PC computer, tablet device, personal digital assistant, smart phone, or other device capable of executing the instructions described above. Here, the electronic device 500 need not be a single electronic device, but can be any arrangement or collection of circuits that can individually or jointly execute the instructions (or sets of instructions) described above. The electronic device 500 may also be part of an integrated control system or system manager, or may be configured as a portable electronic device that interfaces with local or remote (e.g., via wireless transmission).

In the electronic device 500, the processor 502 may include a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a programmable logic device, a special-purpose processor system, a microcontroller, or a microprocessor. By way of example, and not limitation, processors may also include analog processors, digital processors, microprocessors, multi-core processors, processor arrays, network processors, and the like.

The processor 502 may execute instructions or code stored in the memory 501, wherein the memory 501 may also store data. The instructions and data may also be transmitted or received over a network via a network interface device, which may employ any known transmission protocol.

The memory 501 may be integrated with the processor 502, for example, by having RAM or flash memory disposed within an integrated circuit microprocessor or the like. Further, memory 501 may comprise a stand-alone device, such as an external disk drive, storage array, or any other storage device usable by a database system. The memory 501 and the processor 502 may be operatively coupled or may communicate with each other, e.g., through I/O ports, network connections, etc., such that the processor 502 is able to read files stored in the memory.

In addition, the electronic device 500 may also include a video display (such as a liquid crystal display) and a user interaction interface (such as a keyboard, mouse, touch input device, etc.). All components of the electronic device 500 may be connected to each other via a bus and/or a network.

According to an embodiment of the present disclosure, there may also be provided a computer-readable storage medium, wherein when executed by at least one processor, instructions stored in the computer-readable storage medium cause the at least one processor to perform a comprehensive similarity-based clustering method according to an embodiment of the present disclosure. Examples of the computer-readable storage medium herein include: read-only memory (ROM), random-access programmable read-only memory (PROM), electrically erasable programmable read-only memory (EEPROM), random-access memory (RAM), dynamic random-access memory (DRAM), static random-access memory (SRAM), flash memory, non-volatile memory, CD-ROM, CD-R, CD + R, CD-RW, CD + RW, DVD-ROM, DVD-R, DVD + R, DVD-RW, DVD + RW, DVD-RAM, BD-ROM, BD-R, BD-R LTH, BD-RE, Blu-ray or compact disc memory, Hard Disk Drive (HDD), solid-state drive (SSD), card-type memory (such as a multimedia card, a Secure Digital (SD) card or a extreme digital (XD) card), magnetic tape, a floppy disk, a magneto-optical data storage device, an optical data storage device, a hard disk, a magnetic tape, a magneto-optical data storage device, a hard disk, a magnetic tape, a magnetic data storage device, a magnetic tape, a magnetic data storage device, a magnetic tape, a magnetic data storage device, a magnetic tape, a magnetic data storage device, a magnetic tape, a magnetic data storage device, A solid state disk, and any other device configured to store and provide a computer program and any associated data, data files, and data structures to a processor or computer in a non-transitory manner such that the processor or computer can execute the computer program. The computer program in the computer-readable storage medium described above can be run in an environment deployed in a computer apparatus, such as a client, a host, a proxy device, a server, and the like, and further, in one example, the computer program and any associated data, data files, and data structures are distributed across a networked computer system such that the computer program and any associated data, data files, and data structures are stored, accessed, and executed in a distributed fashion by one or more processors or computers.

Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.

It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims

1. A clustering method based on comprehensive similarity is characterized by comprising the following steps:

obtaining an entity data set comprising a plurality of entities;

calculating the comprehensive similarity of every two entities in the plurality of entities; and

performing hierarchical clustering on the plurality of entities according to the comprehensive similarity,

wherein, aiming at a first entity and a second entity in any two entities, the first entity comprises a first name and a plurality of first attributes, the second entity comprises a second name and a plurality of second attributes, and the comprehensive similarity is calculated by adopting the following steps:

calculating the editing distance between the first name and the second name to obtain the similarity of the editing distance;

extracting a first name word vector of the first name and a second name word vector of the second name based on a pre-training language model, and calculating the cosine similarity of the word vectors of the first name word vector and the second name word vector;

extracting a first attribute word vector of the plurality of first attributes and a second attribute word vector of the plurality of second attributes based on the pre-training language model, and calculating attribute similarity based on the first attribute word vector and the second attribute word vector; and

and weighting and adding the editing distance similarity, the word vector cosine similarity and the attribute similarity to obtain comprehensive similarity.

2. The method of claim 1, wherein the edit distance similarity is expressed as

Where x is the first name, y is the second name, D (x, y) is an edit distance from x to y, and | x | and | y | represent string lengths of the first name and the second name, respectively.

3. The method of claim 1, wherein the cosine similarity of the word vector is expressed as

Wherein A is the first name word vector, B is the second name word vector, A_iThe ith word vector being the first name word vector, B_iAn ith word vector that is the second name word vector.

4. The method of claim 1, wherein the attribute similarity is expressed as

Wherein, U_iAn ith first attribute word vector representing the plurality of first attributes, m being the number of the plurality of first attributes, V_jA jth second attribute word vector representing the plurality of second attributes, n being a number of the plurality of second attributes,

wherein, { e | e = sim_vect(U_i,V_j), j∈{1...n}And { e | e = sim }_vect(U_i,V_j) I ∈ {1.. m } } respectively denote a set made up of elements e.

5. The method of claim 1, wherein the clustering method comprises: and judging the entities with the comprehensive similarity smaller than or equal to a first threshold as the same ontology, and judging the entities with the comprehensive similarity larger than the first threshold and smaller than or equal to a second threshold as the relatives, wherein the ontology represents the set of the entities with common attributes in a specific field.

6. An apparatus for clustering based on integrated similarity, the apparatus comprising:

a dataset acquisition unit configured to acquire an entity dataset including a plurality of entities;

a similarity calculation unit configured to calculate a comprehensive similarity of each two entities of the plurality of entities; and

a hierarchical clustering unit configured to perform hierarchical clustering on the plurality of entities according to the comprehensive similarity;

wherein, for a first entity and a second entity in any two entities, the first entity includes a first name and a plurality of first attributes, the second entity includes a second name and a plurality of second attributes, the similarity calculation unit executes the following steps:

7. The apparatus of claim 6, wherein the edit distance similarity is expressed as

8. The apparatus of claim 6, wherein the cosine similarity of the word vector is expressed as

9. The apparatus of claim 6, wherein the attribute similarity is expressed as

wherein，{e|e=sim_vect(U_i,V_j) J ∈ {1.. n } } and { e | e = sim_vect(U_i,V_j) I ∈ {1.. m } } respectively denote a set made up of elements e.

10. The apparatus of claim 6, wherein the clustering means further comprises:

and the mapping relation judging unit is configured to judge the entities with the comprehensive similarity smaller than or equal to a first threshold as the same ontology, and judge the entities with the comprehensive similarity larger than the first threshold and smaller than or equal to a second threshold as the relatives, wherein the ontology represents a set of entities with common attributes in a specific field.

11. An electronic device, comprising:

at least one processor;

at least one memory storing computer-executable instructions,

wherein the computer-executable instructions, when executed by the at least one processor, cause the at least one processor to perform the clustering method of any one of claims 1 to 5.

12. A computer-readable storage medium storing instructions that, when executed by at least one processor, cause the at least one processor to perform the clustering method of any one of claims 1 to 5.