CN113343702A - Entity matching method and system based on unmarked corpus - Google Patents


Info

Publication number
CN113343702A
CN113343702A (application number CN202110887645.1A)
Authority
CN
China
Prior art keywords
entity, candidate, entities, seed, template
Prior art date
Legal status
Granted
Application number
CN202110887645.1A
Other languages
Chinese (zh)
Other versions
CN113343702B (en)
Inventor
韩瑞峰
杨红飞
金霞
Current Assignee
Huoshi Creation Technology Co ltd
Original Assignee
Hangzhou Firestone Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Hangzhou Firestone Technology Co ltd filed Critical Hangzhou Firestone Technology Co ltd
Priority to CN202110887645.1A
Publication of CN113343702A
Application granted
Publication of CN113343702B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application relates to an entity matching method and system based on an unlabeled corpus. The method comprises the following steps: segmenting a target corpus to obtain a plurality of candidate entities and calculating statistical information of the candidate entities; obtaining a seed entity set; judging and selecting, from the candidate entities, the entities closest to the seed entities according to the seed entity set and the statistical information of the candidate entities, to obtain a plurality of optimal candidate entities; adding the optimal candidate entities into the seed entity set and repeating the judgment and selection until no optimal candidate entity is generated; and judging, based on the generated word vectors of the optimal candidate entities and the seed entities, whether each optimal candidate entity is an entity, to obtain an entity identification result. The method and system solve the problems of strong dependence on labeled samples and low identification accuracy in entity identification: a domain entity word list is used to obtain the entity identification result over an unlabeled target corpus, while at the same time expanding the domain entity word list.

Description

Entity matching method and system based on unmarked corpus
Technical Field
The present application relates to the field of data identification, and in particular, to an entity matching method and system based on an unlabeled corpus.
Background
In application scenarios of text information extraction, the scenarios are various and refined, so sample labeling becomes an important part of the text information extraction process; industrial applications currently face a shortage of labeled samples and high labeling costs.
At present, no effective solution has been proposed in the related art for the problems of strong dependence on labeled samples and low identification accuracy.
Disclosure of Invention
The embodiment of the application provides an entity matching method and system based on a non-labeled corpus, and aims to at least solve the problems of strong dependence on labeled samples and low identification accuracy rate in the related technology.
In a first aspect, an embodiment of the present application provides an entity matching method based on a non-labeled corpus, where the method includes:
segmenting the target corpus by a preset entity segmentation method to obtain a plurality of candidate entities, and calculating statistical information of the candidate entities, wherein the statistical information comprises entity-to-template mapping, template-to-entity mapping, entity-to-template correlation and entity-to-vector mapping;
acquiring a seed entity set from a field entity word list, wherein the seed entity set comprises a plurality of seed entities;
judging and selecting an entity closest to the seed entity from the candidate entities according to the seed entity set and the statistical information of the candidate entities to obtain a plurality of optimal candidate entities;
adding the optimal candidate entity into the seed entity set, and repeatedly executing the judgment and selection until no optimal candidate entity is generated;
and judging whether the optimal candidate entity is an entity or not based on the generated word vectors of the optimal candidate entity and the seed entity to obtain an entity identification result.
In some of these embodiments, calculating the statistical information of the candidate entities comprises:
counting a context template corresponding to each candidate entity and the frequency information of the context template appearing in the target corpus, and storing the context template and the frequency information in the mapping from the entity to the template;
counting the number of candidate entities and the number of candidate entities corresponding to each context template, and storing the candidate entities and the number of candidate entities in the mapping from the templates to the entities;
counting the information of the binding degree between the candidate entity and the context template, and storing the information of the binding degree in the correlation degree between the entity and the template;
and calculating a word vector of the candidate entity, and storing the word vector in an entity-to-vector mapping.
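The first two of the four statistics above can be sketched as plain Python dictionaries. This is an illustrative sketch, not the patent's implementation: the token-level segmentation, the one-word-each-side template shape, and all names here are assumptions.

```python
from collections import defaultdict

def build_statistics(sentences, candidate_entities):
    """Collect the entity-to-template and template-to-entity mappings.

    A context template is taken here, for illustration, as the pair of
    tokens immediately surrounding the entity; the patent leaves the
    exact template shape open.
    """
    ent2patterns = defaultdict(lambda: defaultdict(int))       # entity -> {template: count}
    patterns2entities = defaultdict(lambda: defaultdict(int))  # template -> {entity: count}
    for tokens in sentences:
        for i, tok in enumerate(tokens):
            if tok not in candidate_entities:
                continue
            left = tokens[i - 1] if i > 0 else "<s>"
            right = tokens[i + 1] if i + 1 < len(tokens) else "</s>"
            template = (left, "ENTITY", right)
            ent2patterns[tok][template] += 1
            patterns2entities[template][tok] += 1
    return ent2patterns, patterns2entities
```

The entity-to-template correlation and the entity-to-vector mapping would then be filled in from these counts and from a trained word vector model.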
In some embodiments, counting the information of the degree of binding between the candidate entity and the context template, and storing the information of the degree of binding in the entity-to-template correlation, comprises:
and counting the information of the combination degree of the candidate entities and the context template according to the total number of the candidate entities, the co-occurrence times of the candidate entities and the context template in the target corpus and the co-occurrence times of the context template and all the candidate entities in the target corpus, and storing the information of the combination degree in the correlation degree of the entities and the template.
In some of these embodiments, computing a word vector for the candidate entity, storing the word vector in an entity to vector mapping comprises:
training a word2vec model or a bert model according to sentences containing candidate entities to obtain a trained word vector model, obtaining word vectors of the candidate entities through the word vector model, and storing the word vectors in entity-vector mapping.
In some embodiments, determining and selecting an entity closest to the seed entity from the candidate entities according to the seed entity set and the statistical information of the candidate entities to obtain a plurality of optimal candidate entities includes:
selecting a plurality of core context templates with the highest combination degree with the seed entity according to the correlation degrees of the entity and the templates in the statistical information, wherein the combination degree information between the candidate entity and the context template is stored in the correlation degrees of the entity and the template;
selecting a plurality of core candidate entities through the core context template according to the mapping from the template to the entity in the statistical information, wherein the mapping from the template to the entity stores the candidate entity corresponding to the context template and the number of the candidate entities;
calculating and summing the similarity between the core candidate entity and each seed entity according to the correlation between the entity and the template in the statistical information to obtain a first score of the core candidate entity, and selecting a plurality of first core candidate entities according to the first score;
calculating the similarity between the core candidate entity and each seed entity through the mapping from the entity to the vector in the statistical information, summing the similarity to obtain a second score of the core candidate entity, and selecting a plurality of second core candidate entities according to the second score, wherein the word vector of the candidate entity is stored in the mapping from the entity to the vector;
and processing and sequencing the first core candidate entity and the second core candidate entity to obtain a plurality of optimal candidate entities.
In some embodiments, selecting the core context templates with the highest binding degree with the seed entity according to the correlation degree between the entity and the template in the statistical information comprises:
calculating the combination degree of the seed entity and the context template according to the mapping from the entity to the template in the statistical information, wherein the context template corresponding to the candidate entity and the frequency information of the occurrence of the context template in the target corpus are stored in the mapping from the entity to the template;
and selecting a plurality of core context templates with the highest combination degree with the seed entity according to the combination degree and the correlation degree of the entity and the template.
In some embodiments, calculating and summing similarities between the core candidate entities and each seed entity according to the correlation between the entities in the statistical information and the template to obtain a first score of the core candidate entities, and selecting a plurality of first core candidate entities according to the first score includes:
calculating the weighted Jaccard similarity of the core candidate entity and each seed entity through the entity-to-template correlation in the statistical information, wherein the weighted Jaccard similarity is calculated according to the combination degrees of the core candidate entity with the core context templates and of the seed entity with the core context templates;
and adding the Jaccard similarity of the core candidate entity and all the seed entities to obtain a first score of the core candidate entity, and selecting a plurality of candidate entities with the maximum first score as first core candidate entities.
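The weighted Jaccard score over template-weight dictionaries, summed over all seed entities, can be sketched as follows. The sum(min)/sum(max) form is the standard weighted Jaccard and is an assumption about the exact variant used; the function names are illustrative.

```python
def weighted_jaccard(w1, w2):
    """Weighted Jaccard over template -> weight dicts: sum(min) / sum(max)."""
    keys = set(w1) | set(w2)
    num = sum(min(w1.get(k, 0.0), w2.get(k, 0.0)) for k in keys)
    den = sum(max(w1.get(k, 0.0), w2.get(k, 0.0)) for k in keys)
    return num / den if den else 0.0

def first_score(cand_weights, seed_weight_list):
    """First score: weighted Jaccard of the candidate against each seed, summed."""
    return sum(weighted_jaccard(cand_weights, sw) for sw in seed_weight_list)
```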
In some embodiments, calculating and summing the similarity between the core candidate entity and each seed entity through the entity-to-vector mapping in the statistical information to obtain a second score of the core candidate entity, and selecting a plurality of second core candidate entities according to the second score, comprises:
obtaining the word vector of the core candidate entity through the entity-to-vector mapping in the statistical information;
calculating the word vector similarity between the core candidate entity and each seed entity according to the word vector of the core candidate entity and the word vectors of the seed entities;
and adding the word vector similarity between the core candidate entity and all the seed entities to obtain a second score of the core candidate entity, and selecting a plurality of candidate entities with the maximum second score as second core candidate entities.
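The second score can be sketched with a plain cosine similarity over word vectors; cosine is an assumption here, as the text only speaks of "word vector similarity".

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def second_score(cand_vec, seed_vecs):
    """Second score: word vector similarity against each seed, summed."""
    return sum(cosine(cand_vec, sv) for sv in seed_vecs)
```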
In some embodiments, determining whether the optimal candidate entity is an entity based on the generated word vectors of the optimal candidate entity and the seed entity, and obtaining the result of entity identification includes:
calculating the word vector of each optimal candidate entity in the target corpus, and calculating the distance between that word vector and the average word vector of the seed entities in the seed entity set; if the distance is within a preset threshold range, the corresponding optimal candidate entity is judged to be an entity; this is repeated until all optimal candidate entities have been judged, and the entity identification result is obtained.
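A minimal sketch of this final check, assuming Euclidean distance to the element-wise mean of the seed vectors (the distance metric is not fixed by the text):

```python
import math

def average_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def is_entity(cand_vec, seed_vecs, threshold):
    """Accept the candidate if its Euclidean distance to the mean seed
    vector is within the preset threshold."""
    mean = average_vector(seed_vecs)
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(cand_vec, mean)))
    return dist <= threshold
```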
In a second aspect, an embodiment of the present application provides an entity matching system based on a non-labeled corpus, where the system includes a segmentation statistics module, an acquisition module, a matching selection module, a circulation module, and an identification module;
the segmentation statistical module segments a target corpus by a preset entity segmentation method to obtain a plurality of candidate entities, and calculates statistical information of the candidate entities, wherein the statistical information comprises entity-to-template mapping, template-to-entity mapping, entity-to-template correlation and entity-to-vector mapping;
the acquisition module acquires a seed entity set from a field entity word list, wherein the seed entity set comprises a plurality of seed entities;
the matching selection module judges and selects an entity closest to the seed entity from the candidate entities according to the seed entity set and the statistical information of the candidate entities to obtain a plurality of optimal candidate entities;
the circulation module adds the optimal candidate entity into the seed entity set and repeatedly executes the judgment and selection until no optimal candidate entity is generated;
and the identification module judges whether the optimal candidate entity is an entity or not based on the generated word vectors of the optimal candidate entity and the seed entity to obtain an entity identification result.
Compared with the related art, the entity matching method and system based on an unlabeled corpus provided by the embodiments of the present application segment the target corpus by a preset entity segmentation method to obtain a plurality of candidate entities and calculate statistical information of the candidate entities; obtain a seed entity set from a domain entity word list, the seed entity set comprising a plurality of seed entities; judge and select, from the candidate entities, the entities closest to the seed entities according to the seed entity set and the statistical information of the candidate entities, to obtain a plurality of optimal candidate entities; add the optimal candidate entities into the seed entity set and repeat the judgment and selection until no optimal candidate entity is generated; and judge, based on the generated word vectors of the optimal candidate entities and the seed entities, whether each optimal candidate entity is an entity, to obtain the entity identification result. This solves the problems of strong dependence on labeled samples and low identification accuracy in entity identification, obtains the entity identification result of an unlabeled target corpus by means of the domain entity word list, and at the same time expands the domain entity word list.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:
FIG. 1 is a flowchart illustrating steps of an entity matching method based on unmarked corpus according to an embodiment of the present application;
FIG. 2 is a flowchart of the steps for computing candidate entity statistics according to an embodiment of the present application;
FIG. 3 is a flowchart of the steps for computing optimal candidate entities according to an embodiment of the present application;
FIG. 4 is a flowchart illustrating steps of an entity matching method based on unmarked corpus according to an embodiment of the present application;
fig. 5 is an internal structural diagram of an electronic device according to an embodiment of the present application.
Description of the reference numerals: 41. segmentation statistics module; 42. acquisition module; 43. matching selection module; 44. circulation module; 45. identification module.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be described and illustrated below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments provided in the present application without any inventive step are within the scope of protection of the present application.
It is obvious that the drawings in the following description are only examples or embodiments of the present application, and that it is also possible for a person skilled in the art to apply the present application to other similar contexts on the basis of these drawings without inventive effort. Moreover, it should be appreciated that in the development of any such actual implementation, as in any engineering or design project, numerous implementation-specific decisions must be made to achieve the developers' specific goals, such as compliance with system-related and business-related constraints, which may vary from one implementation to another.
Reference in the specification to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the specification. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those of ordinary skill in the art will explicitly and implicitly appreciate that the embodiments described herein may be combined with other embodiments without conflict.
Unless defined otherwise, technical or scientific terms referred to herein shall have the ordinary meaning as understood by those of ordinary skill in the art to which this application belongs. Reference to "a," "an," "the," and similar words throughout this application are not to be construed as limiting in number, and may refer to the singular or the plural. The present application is directed to the use of the terms "including," "comprising," "having," and any variations thereof, which are intended to cover non-exclusive inclusions; for example, a process, method, system, article, or apparatus that comprises a list of steps or modules (elements) is not limited to the listed steps or elements, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus. Reference to "connected," "coupled," and the like in this application is not intended to be limited to physical or mechanical connections, but may include electrical connections, whether direct or indirect. The term "plurality" as referred to herein means two or more. "and/or" describes an association relationship of associated objects, meaning that three relationships may exist, for example, "A and/or B" may mean: a exists alone, A and B exist simultaneously, and B exists alone. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship. Reference herein to the terms "first," "second," "third," and the like, are merely to distinguish similar objects and do not denote a particular ordering for the objects.
An embodiment of the present application provides an entity matching method based on an unlabeled corpus. Fig. 1 is a flowchart illustrating the steps of the entity matching method based on the unlabeled corpus according to the embodiment of the present application; as shown in fig. 1, the method includes the following steps:
step S102, segmenting the target corpus by a preset entity segmentation method to obtain a plurality of candidate entities, and calculating statistical information of the candidate entities, wherein the statistical information comprises entity-to-template mapping, template-to-entity mapping, entity-to-template correlation and entity-to-vector mapping;
step S104, a seed entity set is obtained from the field entity word list, wherein the seed entity set comprises a plurality of seed entities;
step S106, judging and selecting the entity closest to the seed entity from the candidate entities according to the seed entity set and the statistical information of the candidate entities to obtain a plurality of optimal candidate entities;
step S108, adding the optimal candidate entity into the seed entity set, and repeatedly executing judgment and selection until no optimal candidate entity is generated;
step S110, judging whether the optimal candidate entity is an entity or not based on the generated word vectors of the optimal candidate entity and the seed entity, and obtaining an entity identification result.
It should be noted that, in step S102, the preset entity segmentation method for segmenting the target corpus may be a phrase mining method that does not rely on manual annotation, such as an ngram method. The statistical information of the candidate entities is stored as follows: the entity-to-template mapping, with the variable name ent2patterns; the template-to-entity mapping, with the variable name patterns2entities; the entity-to-template correlation, with the variable name entAndPattern2strength; and the entity-to-vector mapping, with the variable name eid2embed.
Through the steps S102 to S110 in the embodiment of the application, the problems of strong dependence on the labeling sample and low identification accuracy rate in the entity identification are solved, the entity identification result of the target corpus without the label is obtained by utilizing the field entity word list, and meanwhile, the effect of expanding the field entity word list is achieved.
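The S102-to-S110 expansion loop can be sketched as follows, with the selection logic abstracted behind a callable; `max_rounds` is a safety bound added for illustration and is not part of the patent's method.

```python
def expand_seed_set(seed_set, select_best, max_rounds=100):
    """Iterative expansion (steps S106 and S108): repeatedly ask
    select_best() for the candidates closest to the current seeds,
    add them, and stop when no new optimal candidate is produced."""
    seeds = set(seed_set)
    for _ in range(max_rounds):
        best = select_best(seeds)
        new = set(best) - seeds
        if not new:
            break
        seeds |= new
    return seeds
```

Step S110 would then run the word-vector check over everything added to the original seed set.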
In some embodiments, fig. 2 is a flowchart illustrating steps of calculating statistical information of candidate entities according to an embodiment of the present application, and as shown in fig. 2, the step S102 of calculating the statistical information of the candidate entities includes the following steps:
step S202, counting context templates corresponding to each candidate entity and frequency information of the context templates appearing in the target corpus, and storing the context templates and the frequency information in the mapping from the entity to the templates;
step S204, counting the number of candidate entities and the number of candidate entities corresponding to each context template, and storing the candidate entities and the number of candidate entities in the mapping from the templates to the entities;
step S206, counting the information of the combination degree between the candidate entity and the context template, and storing the information of the combination degree in the correlation degree between the entity and the template;
step S208, calculating word vectors of the candidate entities, and storing the word vectors in the mapping from the entities to the vectors.
It should be noted that, in the context template in this embodiment, the corresponding variable name is pattern.
Optionally, in step S202, after the text segment formed by each context template and its candidate entity is replaced by an ENTITY string, a vector of the text segment is calculated by a pre-trained model such as bert and used as the vector of the context template. Context templates whose vector similarity is greater than a threshold are aggregated, and the aggregated context templates and the corresponding count information are stored in the entity-to-template mapping.
In some embodiments, step S206, counting information of the degree of association between the candidate entity and the context template, and saving the information of the degree of association in the degree of association between the entity and the template includes:
and counting the information of the combination degree of the candidate entities and the context template according to the total number of the candidate entities, the co-occurrence times of the candidate entities and the context template in the target corpus and the co-occurrence times of the context template and all the candidate entities in the target corpus, and storing the information of the combination degree in the correlation degree of the entities and the template. In particular, by the formula
f(e, c) = log(1 + X_{e,c}) * (log E - log Σ_{e'} X_{e',c})
the information of the combination degree between the candidate entity and the context template is counted and stored in the entity-to-template correlation, wherein e represents the candidate entity, c represents the context template, X_{e,c} is the number of co-occurrences of the candidate entity and the context template in the target corpus, E is the total number of candidate entities, and Σ_{e'} X_{e',c} is
the number of times the context template c co-occurs with all candidate entities in the target corpus, i.e. the number of times the context template c occurs.
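The formula itself appears only as an image in the published text. A TF-IDF-style weighting is one form consistent with the three quantities named here (the co-occurrence count, the total number of entities, and the per-template total), so the following sketch should be read as an assumption rather than the patent's exact formula:

```python
import math
from collections import defaultdict

def binding_strength(X, num_entities):
    """Binding degree f(e, c) = log(1 + X[e][c]) * (log E - log sum_e' X[e'][c]).

    X: nested dict, entity -> {template: co-occurrence count};
    num_entities: E, the total number of candidate entities.
    """
    col_sum = defaultdict(int)  # per-template total co-occurrences
    for row in X.values():
        for c, n in row.items():
            col_sum[c] += n
    strength = {}
    for e, row in X.items():
        for c, n in row.items():
            strength[(e, c)] = math.log(1 + n) * (math.log(num_entities) - math.log(col_sum[c]))
    return strength
```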
In some embodiments, step S208, calculating a word vector of the candidate entity, and storing the word vector in the entity-to-vector mapping includes:
training a word2vec model or a bert model according to sentences containing candidate entities to obtain a trained word vector model, obtaining word vectors of the candidate entities through the word vector model, and storing the word vectors in the mapping from the entities to the vectors.
Optionally, each candidate entity in the sentences of the target corpus is replaced by an ENTITY_i string, where i is the index of the entity in the entity-to-template mapping; for sentences with overlapping entities, a new sentence is constructed for each overlapping entity. The word vector model is then trained on these sentences with the word2vec or bert tool, the word vector corresponding to each candidate entity is obtained from the model, and the word vectors are stored in the entity-to-vector mapping.
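The ENTITY_i replacement described above can be sketched on tokenized sentences; the token-level matching and the function name are illustrative simplifications.

```python
def replace_entities(tokens, entity_index):
    """Replace each candidate entity token with ENTITY_i, where i is the
    entity's index in the entity-to-template mapping."""
    return [f"ENTITY_{entity_index[t]}" if t in entity_index else t for t in tokens]
```

The resulting sentences would then be fed to the word vector training step.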
Optionally, the bert model is pre-trained on the target corpus without string replacement. All sentences containing a candidate entity are fed into the model to obtain a forward vector and a backward vector at each word position; the concatenation of the backward vector at the candidate entity's first word and the forward vector at its last word is used as the word vector of the candidate entity in that sentence, and the word vectors obtained over all sentences containing the candidate entity are averaged to obtain its final word vector, which is stored in the entity-to-vector mapping. Optionally, syntactic analysis may be performed on the sentences in which the entity appears to obtain the chunk structure of each sentence, and the chunk containing the candidate entity is fed into the model instead of the whole sentence. The vector of a context template may likewise be calculated through the entity-to-template mapping, which supports computing the semantic proximity of context templates; the forward-backward concatenation vector is computed within the chunk or pattern containing the candidate entity.
Through this embodiment, the word vector of a phrase is obtained by concatenating the boundary vectors (the backward vector at its first word and the forward vector at its last word), which avoids retraining the word vector model after each re-segmentation. Limiting the context range to one chunk also reduces the semantic interference of long text.
In some embodiments, fig. 3 is a flowchart of steps of calculating to obtain optimal candidate entities according to an embodiment of the present application, and as shown in fig. 3, step S106 is to determine and select an entity closest to a seed entity from candidate entities according to statistical information of a set of seed entities and candidate entities, and obtain a plurality of optimal candidate entities includes the following steps:
step S302, selecting a plurality of core context templates with the highest combination degree with the seed entities by counting the correlation degrees of the entities and the templates in the information, wherein the combination degree information between the candidate entities and the context templates is stored in the correlation degrees of the entities and the templates, and the seed entities selected from the domain entity word list are in the candidate entities, namely the candidate entities comprise the seed entities;
step S304, selecting a plurality of core candidate entities through a core context template according to the mapping from the template to the entity in the statistical information, wherein the mapping from the template to the entity stores the candidate entities corresponding to the context template and the number of the candidate entities;
step S306, calculating and summing the similarity between the core candidate entity and each seed entity through the correlation between the entity and the template in the statistical information to obtain a first score of the core candidate entity, and selecting a plurality of first core candidate entities according to the first score;
step S308, calculating and summing the similarity between the core candidate entity and each seed entity through entity-vector mapping in the statistical information to obtain a second score of the core candidate entity, and selecting a plurality of second core candidate entities according to the second score, wherein word vectors of the candidate entities are stored in the entity-vector mapping;
step S310, the first core candidate entity and the second core candidate entity are processed and sorted to obtain a plurality of optimal candidate entities.
Optionally, in step S310, for each candidate entity the similarity scores with each seed entity obtained in steps S306 and S308 are multiplied and the square root is taken; the average over all seed entities is used as the comprehensive score of the candidate entity. The first and second core candidate entities are then merged and sorted by comprehensive score, and the top N candidate entities are selected as the optimal candidate entities.
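The comprehensive score of step S310 (multiply the two per-seed similarities, take the square root, average over seeds) can be sketched as:

```python
import math

def combined_score(sim1_per_seed, sim2_per_seed):
    """Per-seed geometric mean of the two similarity scores, averaged
    over all seed entities."""
    roots = [math.sqrt(a * b) for a, b in zip(sim1_per_seed, sim2_per_seed)]
    return sum(roots) / len(roots)
```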
In some embodiments, the step S302, selecting a plurality of core context templates with the highest binding degree with the seed entity according to the correlation degree between the entity and the template in the statistical information includes:
calculating the combination degree of the seed entity and the context template according to the mapping from the entity to the template in the statistical information, wherein the context template corresponding to the candidate entity and the frequency information of the occurrence of the context template in the target corpus are stored in the mapping from the entity to the template;
and selecting a plurality of core context templates with the highest combination degree with the seed entity according to the combination degree and the correlation degree of the entity and the template.
Specifically, the first M context templates with the highest degree of association with the seed entities are selected using the entity-template relevance: the context templates corresponding to each seed entity in the entity-to-template mapping are collected into a set; for each context template in the set, the sum of its degrees of association with all seed entities is computed from the entity-template relevance and used as the weight of the context template; the templates are ranked by weight and the first M are taken, after removing context templates whose occurrence count in the template-to-entity mapping is below a threshold.
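The template-selection step above can be sketched roughly as follows (the container layouts and the `min_count` cutoff parameter are assumptions about the described threshold, not details from the patent):

```python
from collections import defaultdict

def select_core_templates(seed_entities, entity_to_templates, relevance,
                          template_counts, m, min_count):
    """Pick the top-M context templates most strongly bound to the seeds.

    entity_to_templates: entity -> iterable of its context templates
    relevance:           (entity, template) -> degree of association
    template_counts:     template -> occurrence count in the
                         template-to-entity mapping
    """
    # Collect every template that co-occurs with at least one seed entity.
    candidate_templates = set()
    for seed in seed_entities:
        candidate_templates.update(entity_to_templates.get(seed, ()))
    # Weight = sum of the template's association with all seed entities.
    weights = defaultdict(float)
    for tpl in candidate_templates:
        for seed in seed_entities:
            weights[tpl] += relevance.get((seed, tpl), 0.0)
    # Drop low-frequency templates, rank the rest, keep the top M.
    kept = [t for t in candidate_templates if template_counts.get(t, 0) >= min_count]
    kept.sort(key=lambda t: weights[t], reverse=True)
    return kept[:m]
```
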
In some embodiments, step S306, calculating and summing the similarity between the core candidate entity and each seed entity through the correlation between the entity in the statistical information and the template to obtain a first score of the core candidate entity, and selecting a plurality of first core candidate entities according to the first score includes:
calculating the weighted Jaccard similarity of the core candidate entity and each seed entity through the correlation between the entity and the template in the statistical information, wherein the weighted Jaccard similarity is calculated according to the combination degree of the core candidate entity and the core context template and the seed entity and the core context template;
optionally, the weighted Jaccard similarity is

sim(e1, e2) = Σ_{c ∈ F} min(F_{e1,c}, F_{e2,c}) / Σ_{c ∈ F} max(F_{e1,c}, F_{e2,c})

wherein e1 is the core candidate entity, e2 is a seed entity, F is the set of core context templates, and F_{e1,c} is the degree of association between e1 and context template c in the entity-template relevance;
and adding the Jaccard similarity of the core candidate entity and all the seed entities to obtain a first score of the core candidate entity, and selecting a plurality of candidate entities with the maximum first score as first core candidate entities.
Specifically, for candidate entity e1 and seed entity e2, the set F is the set of core context templates, and F_{e1,c} is the degree of association between entity e1 and context template c in the entity-template relevance; if the pair (e1, c) does not appear in the entity-template relevance, its association value is taken as 0. The weighted Jaccard similarities of the candidate entity with all seed entities are added to obtain the first score of the candidate entity. The first K1 candidate entities with the largest first score are retained as first core candidate entities.
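Under these definitions, the weighted Jaccard scoring can be sketched as follows (function names are illustrative; missing (entity, template) pairs default to 0 as stated above):

```python
def weighted_jaccard(cand, seed, core_templates, relevance):
    """Weighted Jaccard similarity over the core context templates.

    relevance: (entity, template) -> degree of association; a pair
    absent from the table counts as 0.
    """
    num = den = 0.0
    for c in core_templates:
        f1 = relevance.get((cand, c), 0.0)
        f2 = relevance.get((seed, c), 0.0)
        num += min(f1, f2)
        den += max(f1, f2)
    return num / den if den else 0.0

def first_score(cand, seeds, core_templates, relevance):
    """Sum of the candidate's weighted Jaccard similarity to every seed."""
    return sum(weighted_jaccard(cand, s, core_templates, relevance) for s in seeds)
```
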
In some embodiments, step S308, in which the similarity between the core candidate entity and each seed entity is calculated via the entity-to-vector mapping in the statistical information and summed to obtain a second score of the core candidate entity, and a plurality of second core candidate entities are selected according to the second score, includes:
obtaining the word vectors of the core candidate entities through the entity-to-vector mapping in the statistical information;
Calculating word vector similarity between the core candidate entity and each seed entity according to the word vectors of the core candidate entity and the word vectors of the seed entities;
optionally, the word vector similarity is

sim(e1, e2) = cos(embed(e1), embed(e2))

wherein cos is the cosine similarity of two vectors, embed(e1) is the word vector of the core candidate entity, and embed(e2) is the word vector of the seed entity;
and adding the word vector similarity between the core candidate entity and all the seed entities to obtain a second score of the core candidate entity, and selecting a plurality of candidate entities with the maximum second score as second core candidate entities.
Specifically, the vectors corresponding to the candidate entity and the seed entities are looked up in the entity-to-vector mapping. For example, if the word vector of candidate entity e1 is embed(e1) = (1,0) and the word vector of seed entity e2 is embed(e2) = (2,0), the word vector similarity between the candidate entity and the seed entity is

sim(e1, e2) = cos(embed(e1), embed(e2)) = (1×2 + 0×0) / (1 × 2) = 1,

since the two vectors point in the same direction. The similarity is calculated against all seed entities and summed to obtain the second score of the candidate entity. The first K2 candidate entities with the largest second score are retained as second core candidate entities. It should be noted that although the word vectors of the candidate entity and the seed entity in this example are two-dimensional, word vectors of higher dimension, such as 200 dimensions, are equally applicable to this embodiment.
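The vector comparison in step S308 can be sketched as follows, assuming cos denotes plain cosine similarity (as the worked example with collinear vectors suggests); `second_score` is a hypothetical helper name:

```python
from math import sqrt

def cosine_similarity(v1, v2):
    """Cosine of the angle between two word vectors."""
    dot = sum(a * b for a, b in zip(v1, v2))
    n1 = sqrt(sum(a * a for a in v1))
    n2 = sqrt(sum(b * b for b in v2))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def second_score(cand_vec, seed_vecs):
    """Sum of the candidate's word-vector similarity to every seed."""
    return sum(cosine_similarity(cand_vec, sv) for sv in seed_vecs)
```
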
In some embodiments, in step S110, determining whether the optimal candidate entity is an entity based on the generated word vectors of the optimal candidate entity and the seed entity, and obtaining the result of entity identification includes:
and calculating a word vector of the optimal candidate entity in the target language material, calculating the distance between the word vector and the average word vector of the seed entities in the seed entity set, if the distance is within a preset threshold range, judging the corresponding optimal candidate entity as an entity until all the optimal candidate entities are judged, and obtaining an entity identification result.
Specifically, if the word vector of the optimal candidate entity a is (1,0), the average word vector of the seed entities in the seed entity set is (2,0), and the preset threshold is 2, then the (Euclidean) distance between the word vector of a and the average word vector is 1; since 1 < 2, the optimal candidate entity a is judged to be an entity. It should be noted that although the word vectors of the candidate entity and the seed entity in this example are two-dimensional, word vectors of higher dimension, such as 200 dimensions, are equally applicable to this embodiment.
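The threshold check described above can be sketched as follows, using Euclidean distance as in the example (the function name is illustrative):

```python
from math import dist  # Python 3.8+; Euclidean distance between two points

def judge_entities(candidate_vectors, seed_vectors, threshold):
    """Keep a candidate as an entity when its word vector lies within
    `threshold` (Euclidean distance) of the mean seed word vector."""
    dims = len(seed_vectors[0])
    mean = [sum(v[i] for v in seed_vectors) / len(seed_vectors) for i in range(dims)]
    return {e: dist(vec, mean) <= threshold for e, vec in candidate_vectors.items()}
```
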
Optionally, in the process of determining whether the optimal candidate entity is the entity, adding artificial assistance to determine to obtain an entity identification result.
It should be noted that the steps illustrated in the above-described flow diagrams or in the flow diagrams of the figures may be performed in a computer system, such as a set of computer-executable instructions, and that, although a logical order is illustrated in the flow diagrams, in some cases, the steps illustrated or described may be performed in an order different than here.
The embodiment of the present application provides an entity matching system based on an unlabeled corpus. Fig. 4 is a structural diagram of an entity matching system based on an unlabeled corpus according to an embodiment of the present application; as shown in fig. 4, the system includes a segmentation statistics module 41, an acquisition module 42, a matching selection module 43, a circulation module 44 and an identification module 45;
the segmentation statistical module 41 segments the target corpus by a preset entity segmentation method to obtain a plurality of candidate entities, and calculates statistical information of the candidate entities, wherein the statistical information includes entity-to-template mapping, template-to-entity mapping, entity-to-template correlation and entity-to-vector mapping;
the obtaining module 42 obtains a seed entity set from the field entity word list, where the seed entity set includes a plurality of seed entities;
the matching selection module 43 judges and selects an entity closest to the seed entity from the candidate entities according to the seed entity set and the statistical information of the candidate entities to obtain a plurality of optimal candidate entities;
the circulation module 44 adds the optimal candidate entity into the seed entity set, and repeatedly executes judgment and selection until no optimal candidate entity is generated;
the recognition module 45 determines whether the optimal candidate entity is an entity based on the generated word vectors of the optimal candidate entity and the seed entity, and obtains a result of entity recognition.
According to this embodiment of the application, the segmentation statistical module 41 segments the target corpus with a preset entity segmentation method to obtain a plurality of candidate entities and calculates their statistical information; the acquisition module 42 acquires a seed entity set, comprising a plurality of seed entities, from the domain entity word list; the matching selection module 43 judges and selects the entities closest to the seed entities from the candidate entities according to the seed entity set and the statistical information of the candidate entities, obtaining a plurality of optimal candidate entities; the circulation module 44 adds the optimal candidate entities to the seed entity set and repeatedly executes the judgment and selection until no optimal candidate entities are generated; and the identification module 45 judges, based on the generated word vectors of the optimal candidate entities and the seed entities, whether each optimal candidate entity is an entity, obtaining the entity recognition result. This solves the problems of strong dependence on labeled samples and low recognition accuracy in entity recognition, enables an entity recognition result to be obtained for an unlabeled target corpus using the domain entity word list, and at the same time achieves the effect of expanding the domain entity word list.
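The expansion loop driven by the circulation module 44 can be sketched as follows, with the per-round selection of steps S302-S310 abstracted into a callback (a hypothetical interface, not the patent's own API):

```python
def expand_entities(seed_set, candidates, select_optimal_fn):
    """Iterative expansion: select the optimal candidates, add them to the
    seed set, and repeat until no new optimal candidates are produced.

    select_optimal_fn(seeds, candidates) -> list of optimal candidates;
    it stands in for the judgment-and-selection of steps S302-S310.
    """
    seeds = set(seed_set)
    while True:
        best = [e for e in select_optimal_fn(seeds, candidates) if e not in seeds]
        if not best:          # no new optimal candidate generated: stop
            return seeds
        seeds.update(best)    # grow the seed set and run another round
```
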
The above modules may be functional modules or program modules, and may be implemented by software or hardware. For a module implemented by hardware, the modules may be located in the same processor; or the modules can be respectively positioned in different processors in any combination.
The embodiment of the application provides an information storage method for the entity matching method based on an unlabeled corpus. When the amount of information in the target corpus is large, the statistical information of the candidate entities, namely the entity-to-template mapping, the template-to-entity mapping, the entity-template relevance, and the entity-to-vector mapping, becomes very large and requires a huge amount of memory.
The statistical information of the candidate entities is therefore divided by entity among a plurality of processes and stored on a plurality of machines in a distributed manner. The four array structures are partitioned by entity: each partition of the entity-to-template mapping, the entity-template relevance, and the entity-to-vector mapping stores only the elements corresponding to its own subset of entities, and each partition of the template-to-entity mapping stores, for each context template, only the candidate entities belonging to that subset.
Further, if a partition of the template-to-entity mapping and the entity-template relevance is still too large, it may be further divided by context template until the memory of a single machine can accommodate the computation of one or more sub-expansion processes. The results obtained by the sub-expansion processes are then merged to obtain one round of expansion results.
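A minimal sketch of the entity-wise partitioning described above, assuming a simple hash-based assignment of entities to machines (the hash choice and container layouts are assumptions for illustration):

```python
def partition_by_entity(entity_to_template, relevance, entity_vectors, n_parts):
    """Shard the entity-keyed statistics across n_parts workers by hashing
    the entity, so each machine holds only the elements for its own
    subset of entities."""
    shards = [{"entity_to_template": {}, "relevance": {}, "vectors": {}}
              for _ in range(n_parts)]

    def pid(entity):
        return hash(entity) % n_parts

    for e, tpls in entity_to_template.items():
        shards[pid(e)]["entity_to_template"][e] = tpls
    for (e, c), v in relevance.items():       # relevance keyed by (entity, template)
        shards[pid(e)]["relevance"][(e, c)] = v
    for e, vec in entity_vectors.items():
        shards[pid(e)]["vectors"][e] = vec
    return shards
```
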
Optionally, when the target corpus exceeds the petabyte scale, the array structures are stored in an Elasticsearch (ES) database, and distributed computation is performed using ES.
By the embodiment of the application, the storage structure of the statistical information of the candidate entity is optimized, parallelization is realized, and the calculation speed is increased.
The present embodiment also provides an electronic device comprising a memory having a computer program stored therein and a processor configured to execute the computer program to perform the steps of any of the above method embodiments.
Optionally, the electronic apparatus may further include a transmission device and an input/output device, wherein the transmission device is connected to the processor, and the input/output device is connected to the processor.
It should be noted that, for specific examples in this embodiment, reference may be made to examples described in the foregoing embodiments and optional implementations, and details of this embodiment are not described herein again.
In addition, in combination with the entity matching method based on the unlabeled corpus in the above embodiments, the embodiment of the present application may provide a storage medium having a computer program stored thereon; when executed by a processor, the computer program implements any one of the entity matching methods based on the unlabeled corpus in the above embodiments.
In one embodiment, a computer device is provided, which may be a terminal. The computer device includes a processor, a memory, a network interface, a display screen, and an input device connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement an entity matching method based on the unlabeled corpus. The display screen of the computer equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer equipment can be a touch layer covered on the display screen, a key, a track ball or a touch pad arranged on the shell of the computer equipment, an external keyboard, a touch pad or a mouse and the like.
In one embodiment, fig. 5 is a schematic diagram of an internal structure of an electronic device according to an embodiment of the present application, and as shown in fig. 5, an electronic device is provided, where the electronic device may be a server, and the internal structure diagram may be as shown in fig. 5. The electronic device comprises a processor, a network interface, an internal memory and a non-volatile memory connected by an internal bus, wherein the non-volatile memory stores an operating system, a computer program and a database. The processor is used for providing calculation and control capability, the network interface is used for communicating with an external terminal through network connection, the internal memory is used for providing an environment for an operating system and the running of the computer program, the computer program is executed by the processor to realize an entity matching method based on the unmarked corpus, and the database is used for storing data.
Those skilled in the art will appreciate that the configuration shown in fig. 5 is a block diagram of only a portion of the configuration associated with the present application, and does not constitute a limitation on the electronic device to which the present application is applied, and a particular electronic device may include more or less components than those shown in the drawings, or may combine certain components, or have a different arrangement of components.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing the relevant hardware; the program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory, among others. Non-volatile memory can include read-only memory (ROM), Programmable ROM (PROM), Electrically Programmable ROM (EPROM), Electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDR SDRAM), Enhanced SDRAM (ESDRAM), Synchronous Link DRAM (SLDRAM), Rambus Direct RAM (RDRAM), Direct Rambus Dynamic RAM (DRDRAM), and Rambus Dynamic RAM (RDRAM).
It should be understood by those skilled in the art that various features of the above-described embodiments can be combined in any combination, and for the sake of brevity, all possible combinations of features in the above-described embodiments are not described in detail, but rather, all combinations of features which are not inconsistent with each other should be construed as being within the scope of the present disclosure.
The above-mentioned embodiments only express several embodiments of the present application, and the description thereof is more specific and detailed, but not construed as limiting the scope of the invention. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the concept of the present application, which falls within the scope of protection of the present application. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims (10)

1. An entity matching method based on a non-labeled corpus is characterized by comprising the following steps:
segmenting the target corpus by a preset entity segmentation method to obtain a plurality of candidate entities, and calculating statistical information of the candidate entities, wherein the statistical information comprises entity-to-template mapping, template-to-entity mapping, entity-to-template correlation and entity-to-vector mapping;
acquiring a seed entity set from a field entity word list, wherein the seed entity set comprises a plurality of seed entities;
judging and selecting an entity closest to the seed entity from the candidate entities according to the seed entity set and the statistical information of the candidate entities to obtain a plurality of optimal candidate entities;
adding the optimal candidate entity into the seed entity set, and repeatedly executing the judgment and selection until no optimal candidate entity is generated;
and judging whether the optimal candidate entity is an entity or not based on the generated word vectors of the optimal candidate entity and the seed entity to obtain an entity identification result.
2. The method of claim 1, wherein computing statistical information for the candidate entities comprises:
counting a context template corresponding to each candidate entity and the frequency information of the context template appearing in the target corpus, and storing the context template and the frequency information in the mapping from the entity to the template;
counting the number of candidate entities and the number of candidate entities corresponding to each context template, and storing the candidate entities and the number of candidate entities in the mapping from the templates to the entities;
counting the information of the binding degree between the candidate entity and the context template, and storing the information of the binding degree in the correlation degree between the entity and the template;
and calculating a word vector of the candidate entity, and storing the word vector in an entity-to-vector mapping.
3. The method of claim 2, wherein counting the information of the degree of association between the candidate entity and the context template, and wherein saving the information of the degree of association in the degree of association between the entity and the template comprises:
and counting the information of the combination degree of the candidate entities and the context template according to the total number of the candidate entities, the co-occurrence times of the candidate entities and the context template in the target corpus and the co-occurrence times of the context template and all the candidate entities in the target corpus, and storing the information of the combination degree in the correlation degree of the entities and the template.
4. The method of claim 2, wherein computing a word vector for the candidate entity, and wherein storing the word vector in an entity to vector mapping comprises:
training a word2vec model or a bert model according to sentences containing candidate entities to obtain a trained word vector model, obtaining word vectors of the candidate entities through the word vector model, and storing the word vectors in entity-vector mapping.
5. The method of claim 1, wherein determining and selecting the entity closest to the seed entity from the candidate entities according to the seed entity set and the statistical information of the candidate entities to obtain a plurality of optimal candidate entities comprises:
selecting a plurality of core context templates with the highest combination degree with the seed entity according to the correlation degrees of the entity and the templates in the statistical information, wherein the combination degree information between the candidate entity and the context template is stored in the correlation degrees of the entity and the template;
selecting a plurality of core candidate entities through the core context template according to the mapping from the template to the entity in the statistical information, wherein the mapping from the template to the entity stores the candidate entity corresponding to the context template and the number of the candidate entities;
calculating and summing the similarity between the core candidate entity and each seed entity according to the correlation between the entity and the template in the statistical information to obtain a first score of the core candidate entity, and selecting a plurality of first core candidate entities according to the first score;
calculating the similarity between the core candidate entity and each seed entity through the mapping from the entity to the vector in the statistical information, summing the similarity to obtain a second score of the core candidate entity, and selecting a plurality of second core candidate entities according to the second score, wherein the word vector of the candidate entity is stored in the mapping from the entity to the vector;
and processing and sequencing the first core candidate entity and the second core candidate entity to obtain a plurality of optimal candidate entities.
6. The method of claim 5, wherein selecting the core context templates with the highest binding degree to the seed entity according to the correlation degree between the entity and the template in the statistical information comprises:
calculating the combination degree of the seed entity and the context template according to the mapping from the entity to the template in the statistical information, wherein the context template corresponding to the candidate entity and the frequency information of the occurrence of the context template in the target corpus are stored in the mapping from the entity to the template;
and selecting a plurality of core context templates with the highest combination degree with the seed entity according to the combination degree and the correlation degree of the entity and the template.
7. The method of claim 5, wherein the calculating and summing the similarity between the core candidate entity and each seed entity through the correlation between the entity and the template in the statistical information to obtain a first score of the core candidate entity, and the selecting a plurality of first core candidate entities according to the first score comprises:
calculating the weighted Jaccard similarity of the core candidate entity and each seed entity according to the correlation between the entity and the template in the statistical information, wherein the weighted Jaccard similarity is calculated according to the combination degree of the core candidate entity and the core context template and the seed entity and the core context template;
and adding the Jaccard similarity of the core candidate entity and all the seed entities to obtain a first score of the core candidate entity, and selecting a plurality of candidate entities with the maximum first score as first core candidate entities.
8. The method of claim 5, wherein the similarity between the core candidate entity and each seed entity is calculated and summed by entity to vector mapping in the statistical information to obtain a second score for the core candidate entity, and a number of second core candidate entities are selected based on the second score,
obtaining word vectors of core candidate entities through mapping from entities to vectors in the statistical information
Calculating word vector similarity between the core candidate entity and each seed entity according to the word vectors of the core candidate entity and the word vectors of the seed entities;
and adding the word vector similarity between the core candidate entity and all the seed entities to obtain a second score of the core candidate entity, and selecting a plurality of candidate entities with the maximum second score as second core candidate entities.
9. The method of claim 1, wherein determining whether the optimal candidate entity is an entity based on the generated word vectors of the optimal candidate entity and the seed entity, and obtaining the result of entity identification comprises:
and calculating word vectors of the optimal candidate entities in the target corpus, calculating the distance between the word vectors and the average word vectors of the seed entities in the seed entity set, if the distance is within a preset threshold range, judging the corresponding optimal candidate entities as entities until all the optimal candidate entities are judged, and obtaining entity identification results.
10. An entity matching system based on a non-labeled corpus is characterized by comprising a segmentation statistical module, an acquisition module, a matching selection module, a circulation module and an identification module;
the segmentation statistical module segments a target corpus by a preset entity segmentation method to obtain a plurality of candidate entities, and calculates statistical information of the candidate entities, wherein the statistical information comprises entity-to-template mapping, template-to-entity mapping, entity-to-template correlation and entity-to-vector mapping;
the acquisition module acquires a seed entity set from a field entity word list, wherein the seed entity set comprises a plurality of seed entities;
the matching selection module judges and selects an entity closest to the seed entity from the candidate entities according to the seed entity set and the statistical information of the candidate entities to obtain a plurality of optimal candidate entities;
the circulation module adds the optimal candidate entity into the seed entity set and repeatedly executes the judgment and selection until no optimal candidate entity is generated;
and the identification module judges whether the optimal candidate entity is an entity or not based on the generated word vectors of the optimal candidate entity and the seed entity to obtain an entity identification result.
CN202110887645.1A 2021-08-03 2021-08-03 Entity matching method and system based on unmarked corpus Active CN113343702B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110887645.1A CN113343702B (en) 2021-08-03 2021-08-03 Entity matching method and system based on unmarked corpus


Publications (2)

Publication Number Publication Date
CN113343702A true CN113343702A (en) 2021-09-03
CN113343702B CN113343702B (en) 2021-11-30

Family

ID=77480598

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110887645.1A Active CN113343702B (en) 2021-08-03 2021-08-03 Entity matching method and system based on unmarked corpus

Country Status (1)

Country Link
CN (1) CN113343702B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106933802A (en) * 2017-02-24 2017-07-07 黑龙江特士信息技术有限公司 A kind of social security class entity recognition method and device towards multi-data source
CN110309515A (en) * 2019-07-10 2019-10-08 北京奇艺世纪科技有限公司 Entity recognition method and device
CN111339764A (en) * 2019-09-18 2020-06-26 华为技术有限公司 Chinese named entity recognition method and device
WO2021042516A1 (en) * 2019-09-02 2021-03-11 平安科技(深圳)有限公司 Named-entity recognition method and device, and computer readable storage medium
CN112800769A (en) * 2021-02-20 2021-05-14 深圳追一科技有限公司 Named entity recognition method and device, computer equipment and storage medium


Also Published As

Publication number Publication date
CN113343702B (en) 2021-11-30

Similar Documents

Publication Publication Date Title
CN111444320B (en) Text retrieval method and device, computer equipment and storage medium
CN113191152B (en) Entity identification method and system based on entity extension
CN107391565B (en) Matching method of cross-language hierarchical classification system based on topic model
CN112380837B (en) Similar sentence matching method, device, equipment and medium based on translation model
CN110472049B (en) Disease screening text classification method, computer device and readable storage medium
CN113536735B (en) Text marking method, system and storage medium based on keywords
CN114443850B (en) Label generation method, system, device and medium based on semantic similar model
CN109857957B (en) Method for establishing label library, electronic equipment and computer storage medium
CN111985228A (en) Text keyword extraction method and device, computer equipment and storage medium
Liu et al. Flexible discrete multi-view hashing with collective latent feature learning
CN113806493A (en) Entity relationship joint extraction method and device for Internet text data
CN111813888A (en) Training target model
CN111552802A (en) Text classification model training method and device
CN113343702B (en) Entity matching method and system based on unmarked corpus
Fakeri-Tabrizi et al. Multiview self-learning
CN112989040B (en) Dialogue text labeling method and device, electronic equipment and storage medium
CN114741499A (en) Text abstract generation method and system based on sentence semantic model
KR102383965B1 (en) Method, apparatus and system for determining similarity of patent documents based on similarity score and dissimilarity score
CN110069780B (en) Specific field text-based emotion word recognition method
CN112926340A (en) Semantic matching model for knowledge point positioning
CN113297378A (en) Text data labeling method and system, electronic equipment and storage medium
Ben Rejeb et al. Fuzzy VA-Files for multi-label image annotation based on visual content of regions
CN113254587B (en) Search text recognition method and device, computer equipment and storage medium
KR102300352B1 (en) Method, apparatus and system for determining similarity of patent documents based on importance score
CN114969339B (en) Text matching method and device, electronic equipment and readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CP01 Change in the name or title of a patent holder

Address after: 310000 7th floor, building B, No. 482, Qianmo Road, Xixing street, Binjiang District, Hangzhou City, Zhejiang Province

Patentee after: Huoshi Creation Technology Co.,Ltd.

Address before: 310000 7th floor, building B, No. 482, Qianmo Road, Xixing street, Binjiang District, Hangzhou City, Zhejiang Province

Patentee before: HANGZHOU FIRESTONE TECHNOLOGY Co.,Ltd.