CN112784018A - Text similarity entity disambiguation method and system for character entity library - Google Patents


Info

Publication number
CN112784018A
CN112784018A
Authority
CN
China
Prior art keywords
entity
disambiguated
resume
text
candidate
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110118190.7A
Other languages
Chinese (zh)
Inventor
郭鑫润
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xinhua Fusion Media Technology Development Beijing Co ltd
Xinhua Zhiyun Technology Co ltd
Original Assignee
Xinhua Zhiyun Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xinhua Zhiyun Technology Co ltd filed Critical Xinhua Zhiyun Technology Co ltd
Priority to CN202110118190.7A priority Critical patent/CN112784018A/en
Publication of CN112784018A publication Critical patent/CN112784018A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/3332 Query translation
    • G06F16/3335 Syntactic pre-processing, e.g. stopword elimination, stemming
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Character Discrimination (AREA)

Abstract

The invention discloses a text similarity entity disambiguation method and system for a character entity library. The method comprises the following steps: acquiring an entity to be disambiguated, and retrieving candidate entities from a knowledge base according to the entity to be disambiguated; coarsely recalling the candidate entities similar to the entity to be disambiguated by cosine similarity; calculating the text similarity between the entity to be disambiguated and each candidate entity with a Bi-LSTM model, where the Bi-LSTM model extracts features from the entity to be disambiguated and from the candidate entity separately, yielding a text sequence representation of each; computing feature vectors from the two text sequence representations and fusing the text sequence representation and feature vectors belonging to the same entity; and calculating the text similarity from the fused feature vectors of the entity to be disambiguated and the candidate entity.

Description

Text similarity entity disambiguation method and system for character entity library
Technical Field
The invention relates to the field of artificial intelligence, in particular to a text similarity entity disambiguation method and system for a character entity library.
Background
Entity disambiguation is of great significance to entity library construction. After an entity to be disambiguated is obtained, entity disambiguation judges whether it matches an entity already in the library, and the corresponding warehousing process is selected accordingly. Disambiguated entities are placed in the entity library, on top of which applications such as search and recommendation are subsequently built. The current entity disambiguation flow is: (1) obtain the entity to be disambiguated from a specified data source (typically a crawler); (2) obtain candidate entities from the entity library according to the name of the entity to be disambiguated; (3) compare their information to judge whether the entity to be disambiguated matches a candidate entity in the library; (4) select the entity registration process corresponding to the comparison result. Existing human entity disambiguation schemes generally use fields such as gender, party affiliation and occupation in the entity information, together with match/no-match rules, to find the candidate entity matching the entity to be disambiguated. If no candidate entity matches the entity to be disambiguated, it enters the registration process; if a candidate entity does match, it does not. If the disambiguation algorithm cannot decide, the entity to be disambiguated enters a manual confirmation and registration process.
The prior-art entity disambiguation above has a low recall rate, and because entity information is usually incomplete, the comparison between the entity to be disambiguated and the candidate entities is hard to extend. In addition, the entity fields being compared are too limited for the disambiguation process to generalize. Existing entity disambiguation also depends on the quality of the entity library: if that quality is low, combining multiple entity fields during disambiguation produces a large number of errors.
Disclosure of Invention
One object of the invention is to provide a text similarity entity disambiguation method and system for a character entity library that disambiguate resume entities based on text similarity and can thereby obtain more accurate entity information.
Another object of the invention is to provide a text similarity entity disambiguation method and system for a character entity library that combine text similarity with a coarse recall, so that resume information is used more fully during disambiguation, overcoming the difficulty conventional schemes have in exploiting long text and improving the performance of entity disambiguation.
A further object of the invention is to provide a text similarity entity disambiguation method and system for a character entity library.
Another object of the invention is to provide a text similarity entity disambiguation method and system for a character entity library that ensure the accuracy of the entity disambiguation model through periodic incremental training.
To achieve at least one of the above objects, the invention provides a text similarity entity disambiguation method for a character entity library, the method comprising the steps of:
acquiring an entity to be disambiguated, and retrieving candidate entities from a knowledge base according to the entity to be disambiguated;
calculating word2vec vectors from the character entity resume texts of the entity to be disambiguated and of each candidate entity, computing a cosine similarity score for each vector pair, setting a cosine similarity threshold, and coarsely recalling the candidate entities whose cosine similarity score with the entity to be disambiguated exceeds the threshold;
calculating the text similarity between the entity to be disambiguated and each recalled candidate entity with a Bi-LSTM model: the Bi-LSTM model extracts features from the entity to be disambiguated and from the candidate entity separately, yielding a text sequence representation of the entity to be disambiguated and a text sequence representation of the candidate entity;
computing feature vectors from the two text sequence representations and fusing the representation and feature vectors belonging to the same entity;
and calculating the text similarity from the fused feature vectors of the entity to be disambiguated and the candidate entity.
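The coarse-recall step above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the tiny hand-built `W2V` table stands in for trained word2vec vectors, and a resume's sentence vector is taken as the mean of its word vectors.

```python
import numpy as np

# Toy stand-in for trained word2vec vectors (hypothetical, for illustration only).
W2V = {"张三": np.array([1.0, 0.0]), "李四": np.array([0.0, 1.0])}

def sentence_vector(tokens, w2v, dim=2):
    """Average the word2vec vectors of a segmented resume text."""
    vecs = [w2v[t] for t in tokens if t in w2v]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def cosine(u, v):
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom else 0.0

def coarse_recall(query_tokens, candidates, w2v, threshold=0.5):
    """Keep candidates whose resume vector clears the cosine-similarity threshold."""
    q = sentence_vector(query_tokens, w2v)
    return [name for name, toks in candidates
            if cosine(q, sentence_vector(toks, w2v)) > threshold]
```

For example, `coarse_recall(["张三"], [("A", ["张三"]), ("B", ["李四"])], W2V)` keeps only candidate `"A"`; the recalled candidates then move on to the Bi-LSTM fine ranking.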
According to another preferred embodiment of the present invention, before the entity to be disambiguated is obtained, article-level screened data is produced by filtering the original data through the input layer.
According to another preferred embodiment of the present invention, the screened data is obtained and preprocessed, the preprocessing comprising: segmenting the screened text with a word segmenter to obtain word-level corpus information.
According to another preferred embodiment of the present invention, the preprocessing further comprises: removing stop words and meaningless function words from the screened text.
According to another preferred embodiment of the present invention, the input layer fetches candidate entities having the same name as the entity to be disambiguated and, after obtaining the resume information of each candidate entity, constructs a plurality of sentence pairs from each candidate entity resume and the resume of the entity to be disambiguated for Bi-LSTM model training.
According to another preferred embodiment of the present invention, a sentence pair comprises a candidate entity resume sentence and a resume sentence of the entity to be disambiguated; the former is a token sequence text composed of words from the candidate entity resume, and the latter a token sequence text composed of words from the resume of the entity to be disambiguated.
According to another preferred embodiment of the present invention, the Bi-LSTM model performs feature extraction on the token sequence text of the candidate entity resume and on that of the resume of the entity to be disambiguated, obtaining a first candidate entity resume text sequence representation A and a first to-be-disambiguated entity resume text sequence representation B.
According to another preferred embodiment of the present invention, soft attention feature matrices are computed for the candidate entity resume text sequence representation A and the to-be-disambiguated entity resume text sequence representation B, and the normalized soft attention weights are calculated.
According to another preferred embodiment of the present invention, a candidate entity resume text sequence soft attention vector representation A1 and a to-be-disambiguated entity resume text sequence soft attention vector representation B1 are computed; A1 is fused with the candidate entity resume text sequence representation A, and B1 with the to-be-disambiguated entity resume text sequence representation B, generating a fused candidate entity resume text representation A3 and a fused to-be-disambiguated entity resume text representation B3; A3 and B3 are then input into a Bi-LSTM model to generate a second candidate entity resume text sequence representation a and a second to-be-disambiguated entity resume text sequence representation b.
According to another preferred embodiment of the present invention, average pooling and maximum pooling are applied to the second representations a and b, producing four pooled feature vectors [a1, a2, b1, b2], where a1 and b1 are the average-pooled feature vectors and a2 and b2 the maximum-pooled ones; the four pooled vectors are concatenated (concat), the concatenated feature vector is input into a softmax function, and the final similarity value is output.
In order to achieve at least one of the above objects, the invention further provides a text similarity entity disambiguation system for a character entity library, the system employing the above text similarity entity disambiguation method for a character entity library.
Drawings
FIG. 1 is a schematic diagram showing a process of a text similarity entity disambiguation method for a people entity library according to the present invention;
FIG. 2 is a schematic diagram of a text similarity model framework according to the present invention;
FIG. 3 is a schematic diagram showing a principle of text similarity fine-ranking according to the present invention.
Detailed Description
The following description is presented to disclose the invention so as to enable any person skilled in the art to practice the invention. The preferred embodiments in the following description are given by way of example only, and other obvious variations will occur to those skilled in the art. The basic principles of the invention, as defined in the following description, may be applied to other embodiments, variations, modifications, equivalents, and other technical solutions without departing from the spirit and scope of the invention.
It will be understood by those skilled in the art that in the present disclosure, the terms "longitudinal," "lateral," "upper," "lower," "front," "rear," "left," "right," "vertical," "horizontal," "top," "bottom," "inner," "outer," and the like are used in an orientation or positional relationship indicated in the drawings for ease of description and simplicity of description, and do not indicate or imply that the referenced devices or components must be in a particular orientation, constructed and operated in a particular orientation, and thus the above terms are not to be construed as limiting the present invention.
Referring to fig. 1-3, the present invention discloses a text similarity entity disambiguation method and system for a people entity library, wherein the method comprises the following steps:
the method comprises the steps of obtaining original data in the Internet through a web crawler technology, and fishing entity resume information from the original data, wherein the entity resume information is entity resume information to be disambiguated. And constructing a candidate entity information base in the local knowledge base, and extracting candidate entity resume information from the local knowledge base. And establishing a plurality of sentence pairs by each candidate entity resume information and the resume information of the entity to be disambiguated, wherein the sentence pairs are used for Bi-LSTM model training.
Specifically, after being collected by the web crawler, the original data is fed to the input layer for data processing. The processing includes screening the entity information in the original data and obtaining article-level entity resume data through character recognition; the article-level entity resume data includes person information such as name, home address, work address, native place, position, job level and salary level. The screened article-level original data is then input to the data preprocessing layer, which produces clean, meaningful word-level data.
The word-level data is obtained as follows: a Chinese word segmenter performs Chinese word segmentation on the article-level original data to obtain word-level corpus data. The word-level corpus data covers name, home address, gender, nationality, work address, native place, position, job level, salary level and so on; for example, it may contain items such as "Zhang San", "male", "Han", "Hangzhou", "Zhejiang" and "financial supervision". The same segmenter performs Chinese word segmentation on the candidate entity resumes to obtain the corresponding word-level corpus information, which is used to construct the token sequence text sentences.
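The segmentation step can be sketched as below. A real system would use a trained Chinese segmenter (e.g. jieba); here a self-contained dictionary-based forward-maximum-matching segmenter stands in, and the vocabulary and stop-word list are purely illustrative.

```python
# Illustrative stand-in for a Chinese word segmenter: forward maximum matching
# against a small hand-built vocabulary (hypothetical; a production system
# would use a trained segmenter).
VOCAB = {"张三", "杭州", "浙江"}
STOPWORDS = {"的", "和"}          # function words treated as noise

def fmm_segment(text, vocab, max_len=4):
    """Greedily take the longest dictionary word at each position."""
    tokens, i = [], 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + length]
            if length == 1 or piece in vocab:   # fall back to single characters
                tokens.append(piece)
                i += length
                break
    return tokens

def to_word_level(text, vocab=VOCAB, stopwords=STOPWORDS):
    """Segment, then drop stop words to get clean word-level corpus data."""
    return [t for t in fmm_segment(text, vocab) if t not in stopwords]
```

For example, `to_word_level("张三的杭州和浙江")` yields `["张三", "杭州", "浙江"]`, the kind of clean word-level corpus the next stages consume.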
The input layer also includes a stop word module for removing stop words including, but not limited to, "the". Because the original data and the candidate entity resume information contain meaningless function words, which are one source of noise in the text, the input layer of the invention further includes a module that removes meaningless function words such as adverbs, prepositions, conjunctions and auxiliary words from the resume of the entity to be disambiguated and from the candidate entity resume texts. After these are removed, the resume texts are cleaner.
Further, the input layer uses the cleaned resume of the entity to be disambiguated and the cleaned candidate entity resume texts to construct token sequence text sentences, one per resume. The token sequence text sentence of the to-be-disambiguated entity resume is a sentence composed of n words with practical meaning, and that of the candidate entity resume a sentence composed of m such words; their formats are [token_1, token_2, ..., token_n] and [token_1, token_2, ..., token_m], where the numbers index the filled-in words. It should be noted that the invention uses BERT for feature representation. BERT is an unsupervised pre-training method that trains a general "language understanding" model on a large text corpus (Wikipedia); it was the first unsupervised, deeply bidirectional model for NLP pre-training and performs better than earlier methods.
The token sequence text sentences of the to-be-disambiguated entity resume and of the candidate entity resume are input into separate Bi-LSTM models for feature extraction, yielding a first candidate entity resume text sequence representation A and a first to-be-disambiguated entity resume text sequence representation B, where A = [a_1, a_2, ..., a_n] and B = [b_1, b_2, ..., b_m]. The soft attention feature matrix of A and that of B are then computed and normalized: the weights are e_ij = a_i^T b_j, where a_i is a vector of [a_1, a_2, ..., a_n] and b_j a vector of [b_1, b_2, ..., b_m].
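The alignment step can be sketched with NumPy as follows. This is a minimal sketch under the assumption that A and B are row-wise matrices of the Bi-LSTM outputs (it is not the patent's exact computation): e_ij = a_i^T b_j, then row/column softmax normalization yields the soft attention weights and the aligned representations A1 and B1.

```python
import numpy as np

def softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)   # shift for numerical stability
    w = np.exp(x)
    return w / w.sum(axis=axis, keepdims=True)

def soft_attention(A, B):
    """A: (n, d) text sequence representation; B: (m, d) the other one.
    Returns e with e_ij = a_i^T b_j plus the aligned representations A1, B1."""
    e = A @ B.T                    # (n, m) raw soft attention weights
    A1 = softmax(e, axis=1) @ B    # each a_i rewritten as a weighted sum over B
    B1 = softmax(e, axis=0).T @ A  # each b_j rewritten as a weighted sum over A
    return e, A1, B1
```

With A = [[1,0],[0,1]] and B = [[1,0]], e is [[1],[0]], and each row of A1 collapses onto B's single vector since the softmax over one column is 1.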
Further, a candidate entity resume text sequence soft attention vector representation A1 and a to-be-disambiguated entity resume text sequence soft attention vector representation B1 are computed. The first candidate entity resume text sequence representation A is fused with A1, and the first to-be-disambiguated entity resume text sequence representation B with B1, forming a fused candidate entity resume text representation A3 and a fused to-be-disambiguated entity resume text representation B3. It is worth mentioning that the fusion proceeds as follows: the two word vectors, the vector obtained by adding them, and the vector obtained by multiplying them are concatenated (spliced); the concatenated vector is input into the fusion layer, which finally outputs A3 and B3. It should be noted that the vector splicing may be in the horizontal or the vertical direction; the invention does not limit this.
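One plausible reading of the fusion just described (the exact composition is ambiguous in the text) splices each representation with its soft-attention counterpart, their element-wise sum, and their element-wise product, ESIM-style; the sketch below uses horizontal splicing, which the patent leaves open.

```python
import numpy as np

def fuse(X, X_aligned):
    """Splice (concat) a text sequence representation with its soft attention
    vector representation, their element-wise sum, and their element-wise
    product, producing the input of the fusion layer."""
    return np.concatenate([X, X_aligned, X + X_aligned, X * X_aligned], axis=-1)
```

Applied to A and A1 this yields (the input of) A3, with the feature dimension quadrupled by the splice.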
The fused candidate entity resume text representation A3 and fused to-be-disambiguated entity resume text representation B3 are input into the corresponding Bi-LSTM models for further feature extraction, obtaining a second candidate entity resume text sequence representation a and a second to-be-disambiguated entity resume text sequence representation b. Referring to fig. 3, a and b are input to the pooling layer for average pooling and maximum pooling: average pooling of a and b outputs the feature vectors a1 and b1, and maximum pooling outputs a2 and b2. The four pooled vectors [a1, a2, b1, b2] are concatenated (spliced), and the concatenated feature vector is input into a softmax function to obtain the final similarity value. It should be noted that the softmax function maps the outputs of several neurons into the interval (0, 1), so the final degree of text matching can be read off: within (0, 1), a higher score means a better match, and the resume of the entity to be disambiguated is merged into the highest-scoring candidate entity resume in the corresponding knowledge base.
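The pooling-and-scoring head can be sketched as follows; `W` and `bias` are hypothetical trained parameters of a plain linear layer standing in for whatever classifier precedes the softmax in the patent.

```python
import numpy as np

def softmax(x):
    x = x - x.max()               # shift for numerical stability
    w = np.exp(x)
    return w / w.sum()

def similarity_head(a, b, W, bias):
    """a, b: (seq_len, d) second Bi-LSTM outputs. Average- and max-pool each,
    splice the four pooled vectors [a1, a2, b1, b2], then map the result
    through softmax so the scores land in (0, 1)."""
    pooled = np.concatenate([a.mean(axis=0), a.max(axis=0),
                             b.mean(axis=0), b.max(axis=0)])
    return softmax(W @ pooled + bias)
```

The output probabilities sum to 1; the entry for the "match" class serves as the final similarity value.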
In another preferred embodiment of the invention, the trained text similarity model undergoes periodic incremental training: newly added entities to be disambiguated are processed into incremental data according to the preprocessing flow above and fed into the text similarity model for training. The model obtained after incremental training is more robust and can handle more character resumes.
It should be noted that, in another preferred embodiment of the invention, candidate entity resumes similar to the resume of the entity to be disambiguated are also coarsely recalled. The coarse recall works as follows: the candidate entity resume sentences similar to the resume of the entity to be disambiguated are segmented to obtain word vector lists; each word vector list is pooled into a sentence vector; the sentence text similarity score of every two sentence vectors is computed with cosine similarity; a text similarity threshold is set; and a candidate entity resume sentence whose text similarity exceeds the threshold is taken as a training text.
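The sentence-level coarse recall for assembling training texts can be sketched as below. Sentence vectors are formed here by max-pooling the word vectors (the text only says "pooling"), and the tiny `W2V` table is an illustrative stand-in for trained vectors.

```python
import numpy as np

# Hypothetical word vectors for illustration only.
W2V = {"杭州": np.array([1.0, 0.0]), "浙江": np.array([0.8, 0.6]),
       "北京": np.array([0.0, 1.0])}

def sent_vec(tokens, w2v, dim=2):
    """Max-pool the word vectors of a segmented sentence into one vector."""
    vecs = [w2v[t] for t in tokens if t in w2v]
    return np.max(vecs, axis=0) if vecs else np.zeros(dim)

def select_training_pairs(query_sents, cand_sents, w2v, threshold=0.8):
    """Pair each to-be-disambiguated resume sentence with every candidate
    resume sentence and keep the pairs whose cosine similarity exceeds
    the threshold, for use as training texts."""
    kept = []
    for q in query_sents:
        qv = sent_vec(q, w2v)
        for c in cand_sents:
            cv = sent_vec(c, w2v)
            denom = np.linalg.norm(qv) * np.linalg.norm(cv)
            score = float(qv @ cv / denom) if denom else 0.0
            if score > threshold:
                kept.append((q, c))
    return kept
```

With the toy vectors above, a "杭州" sentence pairs with another "杭州" sentence (similarity 1.0) but not with a "北京" sentence (similarity 0.0).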
It is understood that the terms "a" and "an" mean that the number of an element may be one in one embodiment and plural in another embodiment; the terms "a" and "an" are not to be interpreted as limiting the number.
In particular, according to the embodiments of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication section, and/or installed from a removable medium. The computer program, when executed by a Central Processing Unit (CPU), performs the above-described functions defined in the method of the present application. It should be noted that the computer readable medium mentioned above in the present application may be a computer readable signal medium or a computer readable storage medium or any combination of the two. The computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present application, a computer readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.
In this application, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wireline, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
It will be understood by those skilled in the art that the embodiments of the present invention described above and illustrated in the drawings are given by way of example only and not by way of limitation, the objects of the invention having been fully and effectively achieved, the functional and structural principles of the present invention having been shown and described in the embodiments, and that various changes or modifications may be made in the embodiments of the present invention without departing from such principles.

Claims (11)

1. A text similarity entity disambiguation method for a people entity library, the method comprising the steps of:
acquiring an entity to be disambiguated, and retrieving candidate entities from a knowledge base according to the entity to be disambiguated;
calculating word2vec vectors from the person-entity resume texts of the entity to be disambiguated and of the candidate entities, calculating cosine similarity scores from the word2vec vector pairs, setting a cosine similarity threshold, and coarsely recalling those candidate entities whose cosine similarity score with the entity to be disambiguated is greater than the cosine similarity threshold;
adopting a Bi-LSTM model to calculate the text similarity of the entity to be disambiguated and the candidate entity: performing feature extraction on the entity to be disambiguated and the candidate entity with the Bi-LSTM model respectively, and obtaining a text sequence representation of the entity to be disambiguated and a text sequence representation of the candidate entity;
calculating feature vectors from the text sequence representation of the entity to be disambiguated and from the text sequence representations of the candidate entities respectively, and fusing the text sequence representation and the feature vector of the same entity;
and calculating text similarity according to the feature vector fusion result of the entity to be disambiguated and the candidate entity.
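The coarse-recall step of claim 1 (averaged word2vec resume vectors, cosine similarity, threshold filter) can be sketched as follows; the 4-dimensional embedding table, token lists, and threshold values are illustrative stand-ins, not values from the patent:

```python
import numpy as np

# Toy 4-d "word2vec" table standing in for real pretrained vectors.
EMB = {
    "engineer":  np.array([0.9, 0.1, 0.0, 0.2]),
    "professor": np.array([0.1, 0.9, 0.1, 0.0]),
    "hangzhou":  np.array([0.2, 0.1, 0.8, 0.1]),
    "beijing":   np.array([0.1, 0.2, 0.9, 0.0]),
}

def resume_vector(tokens):
    """Average the word vectors of a tokenized resume (zeros if no hits)."""
    vecs = [EMB[t] for t in tokens if t in EMB]
    return np.mean(vecs, axis=0) if vecs else np.zeros(4)

def cosine(u, v):
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom else 0.0

def coarse_recall(query_tokens, candidates, threshold=0.5):
    """Keep candidates whose resume vector is cosine-similar enough to the query."""
    q = resume_vector(query_tokens)
    return [name for name, toks in candidates
            if cosine(q, resume_vector(toks)) > threshold]
```

Only candidates surviving this cheap filter reach the more expensive Bi-LSTM comparison described in the following steps.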
2. The method of claim 1, wherein, before the entity to be disambiguated is acquired, article-level filtered data is obtained by filtering the raw data through an input layer.
3. The method of claim 1, wherein the filtered data is acquired and preprocessed, the preprocessing comprising: segmenting the filtered data text with a tokenizer to obtain word-level corpus information.
4. The method of claim 3, wherein the preprocessing further comprises: removing stop words and meaningless function words from the filtered data text.
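The preprocessing of claims 3 and 4 (segmentation into word-level tokens, then stop-word/function-word removal) can be illustrated with a minimal forward maximum-match segmenter; a production system would use a trained tokenizer (e.g. jieba for Chinese), and the tiny vocabulary and stop-word set below are invented for the sketch:

```python
# Toy dictionary and stop-word list; real systems load these from resources.
VOCAB = {"新华", "智云", "科技", "公司", "的"}
STOPWORDS = {"的"}          # stop words / meaningless function words to drop
MAX_LEN = 4                 # longest dictionary entry we try to match

def segment(text):
    """Greedy forward maximum-match: take the longest dictionary hit at each position."""
    tokens, i = [], 0
    while i < len(text):
        for j in range(min(len(text), i + MAX_LEN), i, -1):
            # Fall back to a single character when nothing in VOCAB matches.
            if text[i:j] in VOCAB or j == i + 1:
                tokens.append(text[i:j])
                i = j
                break
    return tokens

def preprocess(text):
    """Segment, then filter out stop words to keep content-bearing tokens."""
    return [t for t in segment(text) if t not in STOPWORDS]
```

This yields the word-level corpus information that later feeds the word2vec and Bi-LSTM stages.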
5. The method of claim 2, wherein the input layer retrieves candidate entities with the same name as the entity to be disambiguated, and after obtaining the resume information of a plurality of candidate entities, constructs a plurality of sentence pairs for Bi-LSTM model training from each candidate entity resume and the entity to be disambiguated.
6. The method of claim 5, wherein the sentence pair comprises a candidate entity resume sentence and a to-be-disambiguated entity resume sentence, the candidate entity resume sentence is a token sequence text composed of a plurality of words in a candidate entity resume, and the to-be-disambiguated entity resume sentence is a token sequence text composed of a plurality of words in a to-be-disambiguated entity resume.
7. The method as claimed in claim 6, wherein the Bi-LSTM model is adopted to perform feature extraction on token sequence text of the resume of the candidate entity and token sequence text of the resume of the entity to be disambiguated, respectively, so as to obtain a first candidate entity resume text sequence representation a and a first entity resume text sequence representation B to be disambiguated.
8. The text similarity entity disambiguation method for the human entity library according to claim 7, wherein soft attention feature vector matrices are calculated for the candidate entity resume text sequence representation A and the to-be-disambiguated entity resume text sequence representation B respectively, and normalized soft attention weights are calculated.
9. The method of claim 8, wherein a candidate entity resume text sequence soft attention vector representation A1 and a to-be-disambiguated entity resume text sequence soft attention vector representation B1 are calculated; A1 is fused with the candidate entity resume text sequence representation A, and B1 is fused with the to-be-disambiguated entity resume text sequence representation B, generating a fused candidate entity resume text representation A3 and a fused to-be-disambiguated entity resume text representation B3 respectively; and A3 and B3 are respectively input into a Bi-LSTM model to generate a second candidate entity resume text sequence representation a and a second to-be-disambiguated entity resume text sequence representation b.
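The soft attention and fusion of claims 8 and 9 follow the familiar ESIM-style pattern: align the two Bi-LSTM output sequences via a dot-product score matrix, normalize with softmax, and fuse each sequence with its attended counterpart. A numpy sketch, where the concatenation form [x; x_att; x − x_att; x ⊙ x_att] is an assumed (common) fusion choice rather than one stated in the patent:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def soft_attention_fuse(A, B):
    """ESIM-style soft attention between two Bi-LSTM output sequences.

    A: (la, d) candidate resume representation; B: (lb, d) to-be-disambiguated
    resume representation. Returns fused representations A3 (la, 4d), B3 (lb, 4d).
    """
    E = A @ B.T                      # (la, lb) alignment score matrix
    A1 = softmax(E, axis=1) @ B      # each A token attends over B (soft attention A1)
    B1 = softmax(E.T, axis=1) @ A    # each B token attends over A (soft attention B1)
    A3 = np.concatenate([A, A1, A - A1, A * A1], axis=1)   # fuse A with A1
    B3 = np.concatenate([B, B1, B - B1, B * B1], axis=1)   # fuse B with B1
    return A3, B3
```

A3 and B3 would then be re-encoded by a second Bi-LSTM pass to produce the representations a and b used in claim 10.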
10. The method of claim 9, wherein the second candidate entity resume text sequence representation a and the second to-be-disambiguated entity resume text sequence representation b are each subjected to average pooling and maximum pooling to obtain four pooled feature vectors [a1, a2, b1, b2], where a1 and b1 are the average-pooled feature vectors and a2 and b2 are the max-pooled feature vectors; the four pooled feature vectors are concatenated, and the concatenated feature vector is input into a softmax function to output the final similarity value.
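The pooling-and-classification head of claim 10 can be sketched in a few lines; the final linear projection (W, bias) producing two scores [dissimilar, similar] is an assumed detail, as the claim only specifies pooling, concatenation, and softmax:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def similarity_head(a, b, W, bias):
    """Pool both sequences, concat [a1, a2, b1, b2], project, softmax.

    a: (la, d), b: (lb, d) second-pass Bi-LSTM outputs.
    W: (4d, 2), bias: (2,) — an assumed final linear layer.
    """
    a1, a2 = a.mean(axis=0), a.max(axis=0)   # average / max pooling of a
    b1, b2 = b.mean(axis=0), b.max(axis=0)   # average / max pooling of b
    feat = np.concatenate([a1, a2, b1, b2])  # concatenated feature vector (4d,)
    return softmax(feat @ W + bias)          # final similarity distribution
```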
11. A text similarity entity disambiguation system for a human entity library, the system employing a text similarity entity disambiguation method for a human entity library as described in any of claims 1-10 above.
CN202110118190.7A 2021-01-28 2021-01-28 Text similarity entity disambiguation method and system for character entity library Pending CN112784018A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110118190.7A CN112784018A (en) 2021-01-28 2021-01-28 Text similarity entity disambiguation method and system for character entity library

Publications (1)

Publication Number Publication Date
CN112784018A true CN112784018A (en) 2021-05-11

Family

ID=75759389

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110118190.7A Pending CN112784018A (en) 2021-01-28 2021-01-28 Text similarity entity disambiguation method and system for character entity library

Country Status (1)

Country Link
CN (1) CN112784018A (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109543180A (en) * 2018-11-08 2019-03-29 中山大学 A kind of text emotion analysis method based on attention mechanism
CN110083682A (en) * 2019-04-19 2019-08-02 西安交通大学 It is a kind of to understand answer acquisition methods based on the machine readings for taking turns attention mechanism more
CN110427612A (en) * 2019-07-02 2019-11-08 平安科技(深圳)有限公司 Based on multilingual entity disambiguation method, device, equipment and storage medium
WO2020193966A1 (en) * 2019-03-26 2020-10-01 Benevolentai Technology Limited Name entity recognition with deep learning
CN112016314A (en) * 2020-09-17 2020-12-01 汪秀英 Medical text understanding method and system based on BERT model
CN112182166A (en) * 2020-10-29 2021-01-05 腾讯科技(深圳)有限公司 Text matching method and device, electronic equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ALBERTO et al.: "Introduction to Data Science: Python" (《数据科学导论 Python语言》), 30 March 2020 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114580392A (en) * 2022-04-29 2022-06-03 中科雨辰科技有限公司 Data processing system for identifying entity
CN114580392B (en) * 2022-04-29 2022-07-29 中科雨辰科技有限公司 Data processing system for identifying entity

Similar Documents

Publication Publication Date Title
CN110705301B (en) Entity relationship extraction method and device, storage medium and electronic equipment
CN110110062B (en) Machine intelligent question and answer method and device and electronic equipment
WO2018157805A1 (en) Automatic questioning and answering processing method and automatic questioning and answering system
CN106874441B (en) Intelligent question-answering method and device
CN110929038B (en) Knowledge graph-based entity linking method, device, equipment and storage medium
CN111090736B (en) Question-answering model training method, question-answering method, device and computer storage medium
CN116775847B (en) Question answering method and system based on knowledge graph and large language model
CN112214593A (en) Question and answer processing method and device, electronic equipment and storage medium
CN112131881B (en) Information extraction method and device, electronic equipment and storage medium
CN113204611A (en) Method for establishing reading understanding model, reading understanding method and corresponding device
CN115203507A (en) Event extraction method based on pre-training model and oriented to document field
CN113468891A (en) Text processing method and device
CN107526721A (en) A kind of disambiguation method and device to electric business product review vocabulary
CN112580331A (en) Method and system for establishing knowledge graph of policy text
JP2023002690A (en) Semantics recognition method, apparatus, electronic device, and storage medium
CN117746078B (en) Object detection method and system based on user-defined category
CN115408488A (en) Segmentation method and system for novel scene text
CN111241843B (en) Semantic relation inference system and method based on composite neural network
CN116913278B (en) Voice processing method, device, equipment and storage medium
CN112784018A (en) Text similarity entity disambiguation method and system for character entity library
CN113705207A (en) Grammar error recognition method and device
CN112380861A (en) Model training method and device and intention identification method and device
CN112287077A (en) Statement extraction method and device for combining RPA and AI for document, storage medium and electronic equipment
US20140372106A1 (en) Assisted Free Form Decision Definition Using Rules Vocabulary
CN114416923A (en) News entity linking method and system based on rich text characteristics

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20221219

Address after: Room 430, cultural center, 460 Wenyi West Road, Xihu District, Hangzhou City, Zhejiang Province, 310012

Applicant after: XINHUA ZHIYUN TECHNOLOGY Co.,Ltd.

Applicant after: Xinhua fusion media technology development (Beijing) Co.,Ltd.

Address before: Room 430, cultural center, 460 Wenyi West Road, Xihu District, Hangzhou City, Zhejiang Province, 310012

Applicant before: XINHUA ZHIYUN TECHNOLOGY Co.,Ltd.

RJ01 Rejection of invention patent application after publication

Application publication date: 20210511
