CN114118310A - Clustering method and device based on comprehensive similarity - Google Patents


Info

Publication number
CN114118310A
CN114118310A
Authority
CN
China
Prior art keywords
name
similarity
word vector
entities
attributes
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210103985.5A
Other languages
Chinese (zh)
Inventor
张家华
郑重
经小川
郑俊康
诗博雅
李瑞群
Current Assignee
Aerospace Hongkang Intelligent Technology Beijing Co ltd
Original Assignee
Aerospace Hongkang Intelligent Technology Beijing Co ltd
Priority date
Filing date
Publication date
Application filed by Aerospace Hongkang Intelligent Technology Beijing Co ltd filed Critical Aerospace Hongkang Intelligent Technology Beijing Co ltd
Priority to CN202210103985.5A
Publication of CN114118310A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G06F18/23 Clustering techniques
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition


Abstract

A clustering method and device based on comprehensive similarity are disclosed. The method comprises: obtaining an entity data set comprising a plurality of entities; calculating the comprehensive similarity of every two of the entities; and performing hierarchical clustering on the entities according to the comprehensive similarity. For a first entity and a second entity of any two entities, the first entity comprises a first name and a plurality of first attributes, and the second entity comprises a second name and a plurality of second attributes. The comprehensive similarity is calculated by the following steps: calculating the edit distance between the first name and the second name to obtain an edit distance similarity; extracting word vectors of the first name and the second name based on a pre-trained language model, and calculating a word vector cosine similarity; extracting attribute word vectors of the first attributes and the second attributes based on the pre-trained language model, and calculating an attribute similarity based on the attribute word vectors; and weighting and summing the edit distance similarity, the word vector cosine similarity, and the attribute similarity to obtain the comprehensive similarity.

Description

Clustering method and device based on comprehensive similarity
Technical Field
The present disclosure relates to entity mapping and category identification in the field of natural language processing, and more particularly, to a clustering method and apparatus based on comprehensive similarity.
Background
An ontology in the information processing field can be regarded as a resource set of general and specialized knowledge; it provides rich knowledge for artificial intelligence applications such as information extraction and natural language processing and lays a solid foundation for them. An ontology represents a class or set of entities within a particular domain that share common attributes. An ontology includes conceptual definitions of a field of knowledge or technology and may include relationships between concepts. To facilitate knowledge sharing and information propagation, different ontologies in related fields need to be effectively linked or fused, so that an information user can conveniently and accurately grasp the knowledge of those fields as a whole. Since different ontologies are built on different construction criteria (i.e., they are heterogeneous), the naming or description of the same concept often differs considerably across ontologies. This hinders the identification of entities describing the same concept and makes it difficult to fuse heterogeneous ontologies.
For entity information obtained by natural language processing, how to perform ontology mapping is a key technology. To realize interoperation between heterogeneous ontologies, various methods for discovering mapping relationships between ontologies have been proposed in recent years. Among these methods, concept similarity calculation based on word-composition features requires essentially no corpus resources beyond the words themselves and is direct and fast to compute, so it is widely applied. However, existing methods still have problems: similarity between synonyms is hard to compute for variant terms that share the same semantics but are not written identically, and the strategy for weighting the constituent words of the concept terms to be matched is incomplete.
Disclosure of Invention
The disclosure provides a clustering method and device based on comprehensive similarity.
According to a first aspect of the embodiments of the present disclosure, there is provided a clustering method based on comprehensive similarity, the method including: obtaining an entity data set comprising a plurality of entities; calculating the comprehensive similarity of every two entities of the plurality of entities; and performing hierarchical clustering on the plurality of entities according to the comprehensive similarity, wherein, for a first entity and a second entity of any two entities, the first entity comprises a first name and a plurality of first attributes, the second entity comprises a second name and a plurality of second attributes, and the comprehensive similarity is calculated by the following steps: calculating the edit distance between the first name and the second name to obtain an edit distance similarity; extracting a first name word vector of the first name and a second name word vector of the second name based on a pre-trained language model, and calculating the word vector cosine similarity of the first name word vector and the second name word vector; extracting first attribute word vectors of the plurality of first attributes and second attribute word vectors of the plurality of second attributes based on the pre-trained language model, and calculating an attribute similarity based on the first attribute word vectors and the second attribute word vectors; and weighting and summing the edit distance similarity, the word vector cosine similarity, and the attribute similarity to obtain the comprehensive similarity.
Alternatively, the edit distance similarity may be expressed as

sim_edit(x, y) = 1 - D(x, y) / max(|x|, |y|),

where x is the first name, y is the second name, D(x, y) is the edit distance from x to y, and |x| and |y| represent the string lengths of the first name and the second name, respectively.
Alternatively, the word vector cosine similarity may be expressed as

sim_vect(A, B) = (Σ_i A_i B_i) / (sqrt(Σ_i A_i²) · sqrt(Σ_i B_i²)),

where A is the first name word vector, B is the second name word vector, A_i is the ith component of the first name word vector, and B_i is the ith component of the second name word vector.
Alternatively, the attribute similarity may be expressed as

sim_attr(U, V) = ( Σ_{i=1..m} max({e | e = sim_vect(U_i, V_j), j ∈ {1, ..., n}}) + Σ_{j=1..n} max({e | e = sim_vect(U_i, V_j), i ∈ {1, ..., m}}) ) / (m + n),

where U_i denotes the ith first attribute word vector of the plurality of first attributes, m is the number of first attributes, V_j denotes the jth second attribute word vector of the plurality of second attributes, n is the number of second attributes, and {e | e = sim_vect(U_i, V_j), j ∈ {1, ..., n}} and {e | e = sim_vect(U_i, V_j), i ∈ {1, ..., m}} each denote a set composed of elements e.
Optionally, the clustering method may further include: determining entities whose comprehensive similarity distance is smaller than or equal to a first threshold as the same ontology, and determining entities whose comprehensive similarity distance is greater than the first threshold and smaller than or equal to a second threshold as relatives, wherein an ontology represents a set of entities with common attributes in a specific field.
According to a second aspect of the embodiments of the present disclosure, there is provided a clustering apparatus based on comprehensive similarity, the apparatus including: a data set acquisition unit configured to acquire an entity data set including a plurality of entities; a similarity calculation unit configured to calculate the comprehensive similarity of every two entities of the plurality of entities; and a hierarchical clustering unit configured to perform hierarchical clustering on the plurality of entities according to the comprehensive similarity, wherein, for a first entity and a second entity of any two entities, the first entity comprises a first name and a plurality of first attributes, the second entity comprises a second name and a plurality of second attributes, and the similarity calculation unit performs the following steps: calculating the edit distance between the first name and the second name to obtain an edit distance similarity; extracting a first name word vector of the first name and a second name word vector of the second name based on a pre-trained language model, and calculating the word vector cosine similarity of the first name word vector and the second name word vector; extracting first attribute word vectors of the plurality of first attributes and second attribute word vectors of the plurality of second attributes based on the pre-trained language model, and calculating an attribute similarity based on the first attribute word vectors and the second attribute word vectors; and weighting and summing the edit distance similarity, the word vector cosine similarity, and the attribute similarity to obtain the comprehensive similarity.
Optionally, the clustering device may further include a mapping relation determination unit. The mapping relation determination unit is configured to determine entities whose comprehensive similarity distance is smaller than or equal to a first threshold as the same ontology, and to determine entities whose comprehensive similarity distance is greater than the first threshold and smaller than or equal to a second threshold as relatives, wherein an ontology represents a set of entities with common attributes in a specific domain.
According to a third aspect of the embodiments of the present disclosure, there is provided an electronic apparatus including: at least one processor; at least one memory storing computer-executable instructions, wherein the computer-executable instructions, when executed by the at least one processor, cause the at least one processor to perform a comprehensive similarity-based clustering method as described above.
According to a fourth aspect of embodiments of the present disclosure, there is provided a computer-readable storage medium storing instructions that, when executed by at least one processor, cause the at least one processor to perform the integrated similarity-based clustering method as described above.
The technical scheme provided by the embodiment of the disclosure at least brings the following beneficial effects:
Compared with prior-art ontology similarity measurement techniques in the field of natural language processing, one or more exemplary embodiments of the present disclosure perform similarity measurement based on a pre-trained language model, so the model is faster and more accurate; for example, the word vector cosine similarity obtained through the pre-trained language model has higher accuracy. In addition, jointly considering the edit distance similarity, the word vector cosine similarity, and the attribute similarity further improves accuracy. Compared with existing ontology mapping methods, clustering the similarities of different entities with a hierarchical clustering algorithm achieves ontology mapping while also revealing possible relative relationships between different ontologies, which helps developers construct the child-parent relationships of the ontologies.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the disclosure and are not to be construed as limiting the disclosure.
FIG. 1 is a flow chart illustrating a comprehensive similarity-based clustering method according to the present disclosure;
FIG. 2 is a flow chart illustrating another clustering method based on integrated similarity according to the present disclosure;
FIG. 3 is an example illustrating a clustering method according to the present disclosure;
FIG. 4 is a schematic diagram illustrating a clustering apparatus according to the present disclosure; and
FIG. 5 is a block diagram illustrating an electronic device according to an exemplary embodiment of the present disclosure.
Detailed Description
In order to make the technical solutions of the present disclosure better understood by those of ordinary skill in the art, the technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the disclosure described herein are capable of operation in sequences other than those illustrated or otherwise described herein. The embodiments described in the following examples do not represent all embodiments consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
In the present disclosure, the expression "at least one of the items" covers three parallel cases: "any one of the items", "a combination of any plurality of the items", and "all of the items". For example, "includes at least one of A and B" covers three parallel cases: (1) includes A; (2) includes B; (3) includes A and B. Likewise, "at least one of step one and step two is performed" covers three parallel cases: (1) step one is performed; (2) step two is performed; (3) both step one and step two are performed.
FIG. 1 is a flow chart illustrating a comprehensive similarity-based clustering method according to the present disclosure.
Specifically, the clustering method based on the comprehensive similarity can solve the problem of how to measure the similarity of the ontologies in the ontology mapping field, and also solve the classification of the categories among the ontologies on a certain level. For example, whether the objects belong to the same ontology or not, and whether the relatives such as siblings, child parents, and the like are related to each other, are determined.
Referring to fig. 1, in step S100, an entity data set including a plurality of entities, each of which includes a name (concept name) and a plurality of attributes, is acquired.
In step S200, a comprehensive similarity of each two entities of the plurality of entities is calculated.
In an example embodiment, for a first entity and a second entity of any two entities, the first entity including a first name and a plurality of first attributes, and the second entity including a second name and a plurality of second attributes, the calculation of the integrated similarity may be performed by the following sub-steps.
In sub-step S210, the edit distance between the first name and the second name is calculated to obtain the edit distance similarity. As an example, the edit distance may be the minimum number of operations required to transform the first name into the second name (the operations include insertion, deletion, and substitution). Generally, the smaller the edit distance between concepts, the greater the similarity of their names.
In an embodiment, the edit distance similarity may be expressed as

sim_edit(x, y) = 1 - D(x, y) / max(|x|, |y|),

where x is the first name, y is the second name, D(x, y) is the edit distance from x to y, and |x| and |y| represent the string lengths of the first name and the second name, respectively.
As an example, if x is "horse" and y is "ros": horse → rorse (replace 'h' with 'r'), rorse → rose (delete 'r'), rose → ros (delete 'e'), so D(x, y) = 3.
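The edit distance and the similarity derived from it can be sketched in Python as follows (a minimal illustration of the Levenshtein computation described above; the function names are chosen for this sketch and do not appear in the patent):

```python
def edit_distance(x: str, y: str) -> int:
    """Levenshtein distance: the minimum number of insertions,
    deletions, and substitutions needed to turn x into y."""
    m, n = len(x), len(y)
    # d[i][j] holds the edit distance between x[:i] and y[:j].
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if x[i - 1] == y[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n]


def edit_distance_similarity(x: str, y: str) -> float:
    """sim_edit(x, y) = 1 - D(x, y) / max(|x|, |y|)."""
    return 1 - edit_distance(x, y) / max(len(x), len(y))
```

For the "horse"/"ros" example above, the edit distance evaluates to 3, giving a similarity of 1 - 3/5 = 0.4.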
In sub-step S220, a first name word vector of the first name and a second name word vector of the second name are extracted based on the pre-trained language model, and the word vector cosine similarity of the first name word vector and the second name word vector is calculated.
For example, in an embodiment, a first name word vector for a first name and a second name word vector for a second name may be extracted using an embedding layer of the ERNIE pre-training language model.
In an embodiment, the word vector cosine similarity may be expressed as

sim_vect(A, B) = (Σ_i A_i B_i) / (sqrt(Σ_i A_i²) · sqrt(Σ_i B_i²)),

where A is the first name word vector, B is the second name word vector, A_i is the ith component of the first name word vector, and B_i is the ith component of the second name word vector.
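The cosine similarity of step S220 can be sketched as follows (a plain-Python illustration; in practice the input vectors would come from the embedding layer of the pre-trained language model):

```python
import math

def cosine_similarity(a, b):
    """sim_vect(A, B): the cosine of the angle between two
    equal-length vectors, i.e. (A . B) / (|A| |B|)."""
    dot = sum(ai * bi for ai, bi in zip(a, b))
    norm_a = math.sqrt(sum(ai * ai for ai in a))
    norm_b = math.sqrt(sum(bi * bi for bi in b))
    return dot / (norm_a * norm_b)
```

Identical directions give a similarity of 1.0, orthogonal vectors give 0.0.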
In sub-step S230, first attribute word vectors of the plurality of first attributes and second attribute word vectors of the plurality of second attributes are extracted based on the pre-trained language model, and the attribute similarity is calculated based on the first attribute word vectors and the second attribute word vectors. For example, an attribute word vector may be generated for each of the plurality of first attributes: if the number of first attributes is m, then m first attribute word vectors in one-to-one correspondence with the first attributes may be extracted based on the pre-trained language model.
In an embodiment, the attribute similarity is expressed as

sim_attr(U, V) = ( Σ_{i=1..m} max({e | e = sim_vect(U_i, V_j), j ∈ {1, ..., n}}) + Σ_{j=1..n} max({e | e = sim_vect(U_i, V_j), i ∈ {1, ..., m}}) ) / (m + n),

where U_i denotes the ith first attribute word vector of the plurality of first attributes, m is the number of first attributes, V_j denotes the jth second attribute word vector of the plurality of second attributes, n is the number of second attributes, and {e | e = sim_vect(U_i, V_j), j ∈ {1, ..., n}} and {e | e = sim_vect(U_i, V_j), i ∈ {1, ..., m}} each denote a set composed of elements e.
Except that the first attribute word vectors and second attribute word vectors are substituted in, the function sim_vect(U_i, V_j) is the same as or similar to the cosine similarity calculation described with reference to step S220, so redundant description is omitted here. The expression max({e | e = sim_vect(U_i, V_j), j ∈ {1, ..., n}}) represents the largest value in the set of similarities between the ith first attribute word vector and the n second attribute word vectors. That is, the max operation is equivalent to matching the plurality of first attributes against the plurality of second attributes: when evaluating the attribute similarity, the best-matching pair is found first, and the similarity of each matched attribute pair is obtained by computing the cosine similarity of their attribute word vectors.
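The best-match pairing described above can be sketched as follows (a hedged reconstruction: the normalization by m + n is an assumption, since the original formula survives only as an image placeholder, and `sim` stands in for the word-vector cosine similarity):

```python
def attribute_similarity(U, V, sim):
    """Bidirectional best-match attribute similarity: each of the m
    first attributes is scored against its most similar second
    attribute, each of the n second attributes against its most
    similar first attribute, and the maxima are averaged over m + n.
    (The m + n normalization is an assumption, not from the patent.)"""
    m, n = len(U), len(V)
    forward = sum(max(sim(u, v) for v in V) for u in U)
    backward = sum(max(sim(u, v) for u in U) for v in V)
    return (forward + backward) / (m + n)
```

With identical attribute sets and an exact-match similarity function the result is 1.0, and it decreases as unmatched attributes are added on either side.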
In sub-step S240, the edit distance similarity, the word vector cosine similarity, and the attribute similarity are weighted and summed to obtain the comprehensive similarity.
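The weighted combination of sub-step S240 can be sketched as follows (the weight values shown are illustrative assumptions; the patent does not fix them):

```python
def comprehensive_similarity(sim_edit, sim_vect, sim_attr,
                             w_edit=0.3, w_vect=0.4, w_attr=0.3):
    """Weighted sum of the three component similarities.
    The default weights are placeholders, not values from the patent."""
    return w_edit * sim_edit + w_vect * sim_vect + w_attr * sim_attr
```

If the weights sum to 1, the comprehensive similarity stays on the same scale as its components.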
In step S300, the multiple entities are hierarchically clustered according to the integrated similarity. The method of hierarchical clustering will be further described later with reference to fig. 3.
FIG. 2 is a flow chart illustrating another clustering method based on integrated similarity according to the present disclosure.
Except for step S400, the clustering method illustrated in fig. 2 is substantially the same as or similar to the clustering method described with reference to fig. 1, and redundant description is omitted herein.
In step S400, entities whose comprehensive similarity distance is smaller than or equal to a first threshold are determined as the same ontology, and entities whose comprehensive similarity distance is greater than the first threshold and smaller than or equal to a second threshold are determined as relatives, where an ontology represents a set of entities having common attributes in a specific domain. The method of hierarchical clustering will be further described later with reference to FIG. 3.
Fig. 3 is an example illustrating a clustering method according to the present disclosure.
In an embodiment, hierarchical clustering is a clustering algorithm that creates a hierarchical, nested cluster tree by calculating the similarities between data points of different classes. In the cluster tree, the original data points of the different classes form the lowest level, and the top of the tree is the root node of one cluster. A cluster tree can be created by two methods: bottom-up merging and top-down division. The merging algorithm of hierarchical clustering combines the two most similar of all data points by calculating the similarity between the two classes of data points, and iterates this process repeatedly. In brief, the merging algorithm determines the similarity between the data points of each category by calculating the distance between them: the smaller the distance, the higher the similarity. The two closest data points or categories are combined, and the cluster tree is generated step by step.
Specifically, referring to FIG. 3, the similarity is greatest between entity 1 and entity 14, followed by entity 13 and entity 16, and then entity 0 and entity 12. Thus entities 1 and 14 are merged first, then entities 13 and 16, and then entities 0 and 12. At each step, the two clusters (entities) with the smallest comprehensive similarity distance are found and merged into a new cluster. When merging small clusters into larger ones, the similarity between two categories is judged by the average link method: the distances between every pair of sample points drawn from the two sets are averaged, and this average measures the similarity of the two entity clusters. The distances between the new cluster and the remaining clusters are then recalculated, and the process iterates until all clusters are merged into one large class.
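The average-linkage merging loop described above can be sketched as follows (an O(n^3) illustration over a precomputed comprehensive-similarity-distance matrix; a production implementation would typically use an optimized library routine instead):

```python
def average_linkage_clustering(dist, threshold):
    """Agglomerative clustering over a precomputed distance matrix.
    Repeatedly merges the two clusters whose average pairwise
    point distance (average link) is smallest, stopping once the
    closest pair of clusters is farther apart than the threshold."""
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > 1:
        best = None  # (distance, index_a, index_b)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                pairs = [(i, j) for i in clusters[a] for j in clusters[b]]
                d = sum(dist[i][j] for i, j in pairs) / len(pairs)
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        if d > threshold:
            break  # remaining clusters are too far apart to merge
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters
```

With a small threshold only the closest points merge; with a large threshold everything collapses into one cluster.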
In an embodiment, a suitable threshold may further be selected to cut the constructed hierarchical clustering tree. For example, a first threshold (the ontology mapping threshold) and a second threshold (the relatives discrimination threshold) are selected. Entities whose comprehensive similarity distance is below the ontology mapping threshold are mapped as the same ontology. Entities whose comprehensive similarity distance falls between the first threshold and the second threshold are determined to have a relative relationship; more specifically, a parent class may be defined as hierarchically higher and a child class as hierarchically lower. In another embodiment, the entity pairs (or clusters) may be ranked by similarity: the first threshold may select the pairs ranked within a first percentage of all entity pairs (e.g., the pairs in the top 10% by similarity), and the second threshold may select the pairs ranked above the first percentage but within a second percentage (e.g., the pairs ranked between 10% and 20%). As shown in FIG. 3, if the first threshold is 10%, the entity pairs satisfying it are (1, 14), (13, 16), and (0, 12); that is, entities 1 and 14 may be determined as the same ontology, entities 13 and 16 as the same ontology, and entities 0 and 12 as the same ontology. Among the pairs (or clusters) ranked above 10% and within 20%, the similarity distance between entity 18 and the cluster (1, 14) is greater than the ontology mapping threshold, so they are determined to be different ontologies, but it is less than or equal to the relatives discrimination threshold, so a relative relationship may exist. Similarly, entity 4 may have a relative relationship with the cluster (13, 16).
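The percentile-based thresholding in the example above can be sketched as follows (the 10%/20% cut-offs follow the text; the function and parameter names are chosen for this sketch):

```python
def classify_pairs(pair_distances, first_pct=0.10, second_pct=0.20):
    """Rank entity pairs by comprehensive similarity distance;
    the closest first_pct of pairs are treated as the same ontology,
    and the next band (up to second_pct) as candidate relatives."""
    ranked = sorted(pair_distances.items(), key=lambda kv: kv[1])
    k1 = max(1, int(len(ranked) * first_pct))
    k2 = max(k1, int(len(ranked) * second_pct))
    same_ontology = [pair for pair, _ in ranked[:k1]]
    relatives = [pair for pair, _ in ranked[k1:k2]]
    return same_ontology, relatives
```

Everything beyond the second band is left unrelated; the two percentages play the roles of the ontology mapping threshold and the relatives discrimination threshold.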
According to one or more exemplary embodiments of the present disclosure, similarity measurement based on a pre-trained language model makes the model faster and more accurate; for example, the word vector cosine similarity obtained through the pre-trained language model has higher accuracy, and jointly considering the edit distance similarity, the word vector cosine similarity, and the attribute similarity further improves accuracy. Compared with existing ontology mapping methods, clustering the similarities of different entities with a hierarchical clustering algorithm achieves ontology mapping while also revealing possible relative relationships between different ontologies, which helps developers construct the child-parent relationships of the ontologies.
Fig. 4 is a schematic diagram illustrating the integrated similarity-based clustering apparatus 10 according to the present disclosure.
Referring to fig. 4, the integrated similarity-based clustering apparatus 10 includes: a data set acquisition unit 100, a similarity calculation unit 200, and a hierarchical clustering unit 300.
The dataset acquisition unit 100 is configured to acquire an entity dataset comprising a plurality of entities. The data set acquisition unit 100 is configured to perform the method described with reference to step S100 of fig. 1.
The similarity calculation unit 200 is configured to calculate a comprehensive similarity of each two entities of the plurality of entities. The similarity calculation unit 200 is configured to perform the method described with reference to step S200 of fig. 1, in particular, to perform sub-step S210 to sub-step S240.
The hierarchical clustering unit 300 is configured to hierarchically cluster the plurality of entities according to the integrated similarity. The hierarchical clustering unit 300 is configured to perform the method described with reference to step S300 of fig. 1.
Furthermore, the comprehensive similarity-based clustering apparatus 10 may further include a mapping relation determination unit 240. The mapping relation determination unit 240 is configured to determine entities whose comprehensive similarity distance is smaller than or equal to a first threshold as the same ontology, and to determine entities whose comprehensive similarity distance is greater than the first threshold and smaller than or equal to a second threshold as relatives. The mapping relation determination unit 240 is configured to perform the method described with reference to step S400 of FIG. 2.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module/unit performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated herein.
Fig. 5 is a block diagram illustrating an electronic device 500 according to an example embodiment of the present disclosure.
Referring to fig. 5, an electronic device 500 includes at least one memory 501 and at least one processor 502, the at least one memory 501 storing computer-executable instructions that, when executed by the at least one processor 502, cause the at least one processor 502 to perform a comprehensive similarity-based clustering method according to embodiments of the present disclosure.
By way of example, the electronic device 500 may be a PC, a tablet device, a personal digital assistant, a smart phone, or another device capable of executing the above instructions. The electronic device 500 need not be a single electronic device; it can be any arrangement or collection of circuits that can individually or jointly execute the above instructions (or instruction sets). The electronic device 500 may also be part of an integrated control system or system manager, or may be configured as a portable electronic device that interfaces locally or remotely (e.g., via wireless transmission).
In the electronic device 500, the processor 502 may include a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a programmable logic device, a special-purpose processor system, a microcontroller, or a microprocessor. By way of example, and not limitation, processors may also include analog processors, digital processors, microprocessors, multi-core processors, processor arrays, network processors, and the like.
The processor 502 may execute instructions or code stored in the memory 501, wherein the memory 501 may also store data. The instructions and data may also be transmitted or received over a network via a network interface device, which may employ any known transmission protocol.
The memory 501 may be integrated with the processor 502, for example, by having RAM or flash memory disposed within an integrated circuit microprocessor or the like. Further, memory 501 may comprise a stand-alone device, such as an external disk drive, storage array, or any other storage device usable by a database system. The memory 501 and the processor 502 may be operatively coupled or may communicate with each other, e.g., through I/O ports, network connections, etc., such that the processor 502 is able to read files stored in the memory.
In addition, the electronic device 500 may also include a video display (such as a liquid crystal display) and a user interaction interface (such as a keyboard, mouse, touch input device, etc.). All components of the electronic device 500 may be connected to each other via a bus and/or a network.
According to an embodiment of the present disclosure, there may also be provided a computer-readable storage medium storing instructions that, when executed by at least one processor, cause the at least one processor to perform the comprehensive similarity-based clustering method according to embodiments of the present disclosure. Examples of the computer-readable storage medium here include: read-only memory (ROM), programmable read-only memory (PROM), electrically erasable programmable read-only memory (EEPROM), random-access memory (RAM), dynamic random-access memory (DRAM), static random-access memory (SRAM), flash memory, non-volatile memory, CD-ROM, CD-R, CD+R, CD-RW, CD+RW, DVD-ROM, DVD-R, DVD+R, DVD-RW, DVD+RW, DVD-RAM, BD-ROM, BD-R, BD-R LTH, BD-RE, Blu-ray or other optical disc storage, hard disk drive (HDD), solid-state drive (SSD), card-type memory (such as a multimedia card, a Secure Digital (SD) card, or an eXtreme Digital (XD) card), magnetic tape, floppy disk, magneto-optical data storage devices, optical data storage devices, and any other device configured to store a computer program and any associated data, data files, and data structures in a non-transitory manner and to provide them to a processor or computer so that the processor or computer can execute the computer program.
The computer program in the above computer-readable storage medium can run in an environment deployed in a computer device such as a client, a host, a proxy device, or a server. Further, in one example, the computer program and any associated data, data files, and data structures are distributed across a networked computer system such that they are stored, accessed, and executed in a distributed fashion by one or more processors or computers.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (12)

1. A clustering method based on comprehensive similarity is characterized by comprising the following steps:
obtaining an entity data set comprising a plurality of entities;
calculating the comprehensive similarity of every two entities in the plurality of entities; and
performing hierarchical clustering on the plurality of entities according to the comprehensive similarity,
wherein, for a first entity and a second entity of any two entities, the first entity comprises a first name and a plurality of first attributes, the second entity comprises a second name and a plurality of second attributes, and the comprehensive similarity is calculated by the following steps:
calculating an edit distance between the first name and the second name to obtain an edit distance similarity;
extracting a first name word vector of the first name and a second name word vector of the second name based on a pre-trained language model, and calculating a word vector cosine similarity of the first name word vector and the second name word vector;
extracting first attribute word vectors of the plurality of first attributes and second attribute word vectors of the plurality of second attributes based on the pre-trained language model, and calculating an attribute similarity based on the first attribute word vectors and the second attribute word vectors; and
weighting and summing the edit distance similarity, the word vector cosine similarity, and the attribute similarity to obtain the comprehensive similarity.
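As a non-limiting illustration, the final weighted summation in claim 1 may be sketched as follows; the weight values and the function name are assumptions for illustration only and are not fixed by the claim:

```python
def comprehensive_similarity(sim_edit, sim_vect, sim_attr,
                             w_edit=0.3, w_vect=0.4, w_attr=0.3):
    # Weighted sum of the three component similarities; the weights here
    # are illustrative assumptions, not values fixed by the claim.
    return w_edit * sim_edit + w_vect * sim_vect + w_attr * sim_attr
```

In practice the three weights would be tuned on the target entity data set so that they sum to one.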
2. The method of claim 1, wherein the edit distance similarity is expressed as
sim_edit(x, y) = 1 − D(x, y) / max(|x|, |y|)
where x is the first name, y is the second name, D(x, y) is the edit distance from x to y, and |x| and |y| denote the string lengths of the first name and the second name, respectively.
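As a non-limiting illustration, the edit distance and the claimed edit distance similarity may be sketched as follows, assuming the common normalization 1 − D(x, y)/max(|x|, |y|); the function names are illustrative:

```python
def edit_distance(x, y):
    """Levenshtein distance D(x, y) via single-row dynamic programming."""
    m, n = len(x), len(y)
    dp = list(range(n + 1))          # dp[j] = distance between "" and y[:j]
    for i in range(1, m + 1):
        prev = dp[0]                 # distance(x[:i-1], y[:j-1]) for j = 1
        dp[0] = i                    # distance(x[:i], "")
        for j in range(1, n + 1):
            cur = dp[j]              # distance(x[:i-1], y[:j])
            dp[j] = min(dp[j] + 1,                       # deletion
                        dp[j - 1] + 1,                   # insertion
                        prev + (x[i - 1] != y[j - 1]))   # substitution
            prev = cur
    return dp[n]

def edit_distance_similarity(x, y):
    """sim = 1 - D(x, y) / max(|x|, |y|), one common normalization."""
    if not x and not y:
        return 1.0
    return 1.0 - edit_distance(x, y) / max(len(x), len(y))
```

Identical strings yield a similarity of 1.0, and strings with no characters in common tend toward 0.0.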
3. The method of claim 1, wherein the word vector cosine similarity is expressed as
sim_vect(A, B) = (Σ_i A_i · B_i) / (√(Σ_i A_i²) · √(Σ_i B_i²))
where A is the first name word vector, B is the second name word vector, A_i is the i-th word vector of the first name word vector, and B_i is the i-th word vector of the second name word vector.
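As a non-limiting illustration, the word vector cosine similarity may be sketched as follows; the function name is illustrative:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors a and b."""
    dot = sum(ai * bi for ai, bi in zip(a, b))
    norm_a = math.sqrt(sum(ai * ai for ai in a))
    norm_b = math.sqrt(sum(bi * bi for bi in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0                   # convention for an all-zero vector
    return dot / (norm_a * norm_b)
```

Parallel vectors score 1.0 and orthogonal vectors score 0.0, independent of vector magnitude.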
4. The method of claim 1, wherein the attribute similarity is expressed as
sim_attr = (1/2) · ( (1/m) Σ_{i=1}^{m} max{e | e = sim_vect(U_i, V_j), j ∈ {1...n}} + (1/n) Σ_{j=1}^{n} max{e | e = sim_vect(U_i, V_j), i ∈ {1...m}} )
where U_i denotes the i-th first attribute word vector of the plurality of first attributes, m is the number of the plurality of first attributes, V_j denotes the j-th second attribute word vector of the plurality of second attributes, n is the number of the plurality of second attributes, and {e | e = sim_vect(U_i, V_j), j ∈ {1...n}} and {e | e = sim_vect(U_i, V_j), i ∈ {1...m}} respectively denote sets made up of the elements e.
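As a non-limiting illustration, one possible reading of the attribute similarity — taking, for each attribute word vector, the maximum similarity over the other entity's attribute word vectors in both directions, then averaging — may be sketched as follows; this aggregation (max plus mean) and the function names are assumptions, not fixed by the claim:

```python
def attribute_similarity(U, V, sim_vect):
    """Symmetric best-match attribute similarity between attribute sets
    U (m word vectors) and V (n word vectors), given a pairwise
    similarity function sim_vect. The max-then-mean aggregation is an
    assumed interpretation of the sets defined in the claim."""
    if not U or not V:
        return 0.0
    # For each U_i, the best match among all V_j (max over the first set).
    forward = sum(max(sim_vect(u, v) for v in V) for u in U) / len(U)
    # For each V_j, the best match among all U_i (max over the second set).
    backward = sum(max(sim_vect(u, v) for u in U) for v in V) / len(V)
    return (forward + backward) / 2
```

With a real model, sim_vect would be the word vector cosine similarity applied to attribute embeddings; a toy exact-match function suffices to exercise the aggregation.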
5. The method of claim 1, wherein the clustering method comprises: determining entities whose comprehensive similarity is less than or equal to a first threshold to be the same ontology, and determining entities whose comprehensive similarity is greater than the first threshold and less than or equal to a second threshold to be related entities, wherein an ontology represents a set of entities having common attributes in a specific field.
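As a non-limiting illustration, the two-threshold judgment may be sketched as follows, following the threshold directions exactly as stated in claim 5; the function and label names are illustrative:

```python
def judge_relation(comprehensive_sim, first_threshold, second_threshold):
    """Classify an entity pair by its comprehensive similarity using the
    two thresholds of claim 5 (directions as stated in the claim)."""
    if comprehensive_sim <= first_threshold:
        return "same_ontology"       # <= first threshold: same ontology
    if comprehensive_sim <= second_threshold:
        return "related"             # (first, second]: related entities
    return "unrelated"               # above both thresholds
```

The thresholds partition pairs into three bands, which the hierarchical clustering step can use as merge criteria.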
6. A clustering apparatus based on comprehensive similarity, the apparatus comprising:
a dataset acquisition unit configured to acquire an entity dataset including a plurality of entities;
a similarity calculation unit configured to calculate a comprehensive similarity of each two entities of the plurality of entities; and
a hierarchical clustering unit configured to perform hierarchical clustering on the plurality of entities according to the comprehensive similarity;
wherein, for a first entity and a second entity of any two entities, the first entity comprises a first name and a plurality of first attributes, the second entity comprises a second name and a plurality of second attributes, and the similarity calculation unit is configured to perform the following steps:
calculating an edit distance between the first name and the second name to obtain an edit distance similarity;
extracting a first name word vector of the first name and a second name word vector of the second name based on a pre-trained language model, and calculating a word vector cosine similarity of the first name word vector and the second name word vector;
extracting first attribute word vectors of the plurality of first attributes and second attribute word vectors of the plurality of second attributes based on the pre-trained language model, and calculating an attribute similarity based on the first attribute word vectors and the second attribute word vectors; and
weighting and summing the edit distance similarity, the word vector cosine similarity, and the attribute similarity to obtain the comprehensive similarity.
7. The apparatus of claim 6, wherein the edit distance similarity is expressed as
sim_edit(x, y) = 1 − D(x, y) / max(|x|, |y|)
where x is the first name, y is the second name, D(x, y) is the edit distance from x to y, and |x| and |y| denote the string lengths of the first name and the second name, respectively.
8. The apparatus of claim 6, wherein the word vector cosine similarity is expressed as
sim_vect(A, B) = (Σ_i A_i · B_i) / (√(Σ_i A_i²) · √(Σ_i B_i²))
where A is the first name word vector, B is the second name word vector, A_i is the i-th word vector of the first name word vector, and B_i is the i-th word vector of the second name word vector.
9. The apparatus of claim 6, wherein the attribute similarity is expressed as
sim_attr = (1/2) · ( (1/m) Σ_{i=1}^{m} max{e | e = sim_vect(U_i, V_j), j ∈ {1...n}} + (1/n) Σ_{j=1}^{n} max{e | e = sim_vect(U_i, V_j), i ∈ {1...m}} )
where U_i denotes the i-th first attribute word vector of the plurality of first attributes, m is the number of the plurality of first attributes, V_j denotes the j-th second attribute word vector of the plurality of second attributes, n is the number of the plurality of second attributes, and {e | e = sim_vect(U_i, V_j), j ∈ {1...n}} and {e | e = sim_vect(U_i, V_j), i ∈ {1...m}} respectively denote sets made up of the elements e.
10. The apparatus of claim 6, wherein the clustering apparatus further comprises:
a mapping relation judging unit configured to determine entities whose comprehensive similarity is less than or equal to a first threshold to be the same ontology, and to determine entities whose comprehensive similarity is greater than the first threshold and less than or equal to a second threshold to be related entities, wherein an ontology represents a set of entities having common attributes in a specific field.
11. An electronic device, comprising:
at least one processor;
at least one memory storing computer-executable instructions,
wherein the computer-executable instructions, when executed by the at least one processor, cause the at least one processor to perform the clustering method of any one of claims 1 to 5.
12. A computer-readable storage medium storing instructions that, when executed by at least one processor, cause the at least one processor to perform the clustering method of any one of claims 1 to 5.
CN202210103985.5A 2022-01-28 2022-01-28 Clustering method and device based on comprehensive similarity Pending CN114118310A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210103985.5A CN114118310A (en) 2022-01-28 2022-01-28 Clustering method and device based on comprehensive similarity

Publications (1)

Publication Number Publication Date
CN114118310A true CN114118310A (en) 2022-03-01

Family

ID=80361937

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210103985.5A Pending CN114118310A (en) 2022-01-28 2022-01-28 Clustering method and device based on comprehensive similarity

Country Status (1)

Country Link
CN (1) CN114118310A (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109543189A (en) * 2018-11-28 2019-03-29 重庆邮电大学 Robot data interoperability domain body mapping method based on semantic similarity
CN112527938A (en) * 2020-12-17 2021-03-19 安徽迪科数金科技有限公司 Chinese POI matching method based on natural language understanding

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112966027A (en) * 2021-03-22 2021-06-15 青岛科技大学 Entity association mining method based on dynamic probe
CN112966027B (en) * 2021-03-22 2022-10-21 青岛科技大学 Entity association mining method based on dynamic probe
CN116226541A (en) * 2023-05-11 2023-06-06 湖南工商大学 Knowledge graph-based network hotspot information recommendation method, system and equipment

Similar Documents

Publication Publication Date Title
US11397753B2 (en) Scalable topological summary construction using landmark point selection
Souravlas et al. A classification of community detection methods in social networks: a survey
CN108292310A (en) For the relevant technology of digital entities
US10891315B2 (en) Landmark point selection
JP5749279B2 (en) Join embedding for item association
JP6686628B2 (en) Discovery informatics system, method, and computer program
CN110647904B (en) Cross-modal retrieval method and system based on unmarked data migration
US7805010B2 (en) Cross-ontological analytics for alignment of different classification schemes
US20210319054A1 (en) Encoding entity representations for cross-document coreference
CN114118310A (en) Clustering method and device based on comprehensive similarity
KR102046692B1 (en) Method and System for Entity summarization based on multilingual projected entity space
WO2019245887A1 (en) Taxonomic tree generation
CN117744784B (en) Medical scientific research knowledge graph construction and intelligent retrieval method and system
Yuan et al. Measurement of clustering effectiveness for document collections
US11704345B2 (en) Inferring location attributes from data entries
Nashipudimath et al. An efficient integration and indexing method based on feature patterns and semantic analysis for big data
CN115017315A (en) Leading edge theme identification method and system and computer equipment
Prasanth et al. Effective big data retrieval using deep learning modified neural networks
Park et al. Automatic extraction of user’s search intention from web search logs
Liu et al. A Multi-View–Based Collective Entity Linking Method
Danesh et al. Ensemble-based clustering of large probabilistic graphs using neighborhood and distance metric learning
Olech et al. Hierarchical gaussian mixture model with objects attached to terminal and non-terminal dendrogram nodes
Su et al. Semantically guided projection for zero-shot 3D model classification and retrieval
US20220284501A1 (en) Probabilistic determination of compatible content
Withanawasam Apache Mahout Essentials

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20220301
