CN112784018A - Text similarity entity disambiguation method and system for character entity library - Google Patents


Info

Publication number
CN112784018A
CN112784018A
Authority
CN
China
Prior art keywords
entity
disambiguated
resume
text
candidate
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110118190.7A
Other languages
Chinese (zh)
Inventor
郭鑫润
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xinhua Fusion Media Technology Development Beijing Co ltd
Xinhua Zhiyun Technology Co ltd
Original Assignee
Xinhua Zhiyun Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xinhua Zhiyun Technology Co ltd filed Critical Xinhua Zhiyun Technology Co ltd
Priority to CN202110118190.7A priority Critical patent/CN112784018A/en
Publication of CN112784018A publication Critical patent/CN112784018A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/3332 Query translation
    • G06F16/3335 Syntactic pre-processing, e.g. stopword elimination, stemming
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Character Discrimination (AREA)

Abstract

The invention discloses a text similarity entity disambiguation method and system for a character entity library. The method comprises the following steps: acquiring an entity to be disambiguated, and retrieving candidate entities from a knowledge base according to the entity to be disambiguated; coarsely recalling the candidate entities similar to the entity to be disambiguated by cosine similarity; calculating the text similarity between the entity to be disambiguated and each candidate entity with a Bi-LSTM model, where the Bi-LSTM model extracts features from the entity to be disambiguated and from the candidate entity separately, yielding a text sequence representation of each; computing feature vectors from the two text sequence representations and fusing the text sequence representation and feature vectors belonging to the same entity; and calculating the text similarity from the fused feature vectors of the entity to be disambiguated and the candidate entity.

Description

Text similarity entity disambiguation method and system for character entity library
Technical Field
The invention relates to the field of artificial intelligence, in particular to a text similarity entity disambiguation method and system for a character entity library.
Background
Entity disambiguation is of great significance to entity library construction. After an entity to be disambiguated is obtained, entity disambiguation judges whether it matches an entity already in the library, and the corresponding warehousing process is selected accordingly. Disambiguated entities are placed in the entity library, on top of which applications such as search and recommendation are subsequently built. The current entity disambiguation flow is: (1) obtain the entity to be disambiguated from a specified data source (typically a crawler); (2) obtain candidate entities from the entity library according to the name of the entity to be disambiguated; (3) compare their information to judge whether the entity to be disambiguated matches a candidate entity in the library; (4) select the entity registration process corresponding to the comparison result. Existing human entity disambiguation schemes generally use fields such as gender, party affiliation and occupation in the entity information, together with match/no-match rules, to find the candidate entity matching the entity to be disambiguated. If no candidate entity matches the entity to be disambiguated, it enters the registration process; if a candidate entity does match, it does not. If the disambiguation algorithm cannot decide, the entity to be disambiguated enters a manual confirmation and registration process.
The prior-art entity disambiguation above has a low recall rate, and because entity information is usually incomplete, the comparison between the entity to be disambiguated and the candidate entities is hard to extend. In addition, the entity fields being compared are too limited for the disambiguation process to generalize. Existing entity disambiguation also depends on the quality of the entity library: if that quality is low, combining multiple entity fields during disambiguation produces a large number of errors.
Disclosure of Invention
One object of the invention is to provide a text similarity entity disambiguation method and system for a character entity library that disambiguate resume entities based on text similarity and can thereby obtain more accurate entity information.
Another object of the invention is to provide a text similarity entity disambiguation method and system for a character entity library that combine text similarity with a coarse recall, so that resume information is used more fully during disambiguation, overcoming the difficulty conventional schemes have in exploiting long text and improving the performance of entity disambiguation.
A further object of the invention is to provide a text similarity entity disambiguation method and system for a character entity library.
Another object of the invention is to provide a text similarity entity disambiguation method and system for a character entity library that ensure the accuracy of the entity disambiguation model through periodic incremental training.
To achieve at least one of the above objects, the invention provides a text similarity entity disambiguation method for a character entity library, the method comprising the steps of:
acquiring an entity to be disambiguated, and retrieving candidate entities from a knowledge base according to the entity to be disambiguated;
calculating word2vec vectors from the character entity resume texts of the entity to be disambiguated and of each candidate entity, computing a cosine similarity score for each vector pair, setting a cosine similarity threshold, and coarsely recalling the candidate entities whose cosine similarity score with the entity to be disambiguated exceeds the threshold;
calculating the text similarity between the entity to be disambiguated and each recalled candidate entity with a Bi-LSTM model: the Bi-LSTM model extracts features from the entity to be disambiguated and from the candidate entity separately, yielding a text sequence representation of the entity to be disambiguated and a text sequence representation of the candidate entity;
computing feature vectors from the two text sequence representations and fusing the representation and feature vectors belonging to the same entity;
and calculating the text similarity from the fused feature vectors of the entity to be disambiguated and the candidate entity.
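The coarse-recall step above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the tiny hand-built `W2V` table stands in for trained word2vec vectors, and a resume's sentence vector is taken as the mean of its word vectors.

```python
import numpy as np

# Toy stand-in for trained word2vec vectors (hypothetical, for illustration only).
W2V = {"张三": np.array([1.0, 0.0]), "李四": np.array([0.0, 1.0])}

def sentence_vector(tokens, w2v, dim=2):
    """Average the word2vec vectors of a segmented resume text."""
    vecs = [w2v[t] for t in tokens if t in w2v]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def cosine(u, v):
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom else 0.0

def coarse_recall(query_tokens, candidates, w2v, threshold=0.5):
    """Keep candidates whose resume vector clears the cosine-similarity threshold."""
    q = sentence_vector(query_tokens, w2v)
    return [name for name, toks in candidates
            if cosine(q, sentence_vector(toks, w2v)) > threshold]
```

For example, `coarse_recall(["张三"], [("A", ["张三"]), ("B", ["李四"])], W2V)` keeps only candidate `"A"`; the recalled candidates then move on to the Bi-LSTM fine ranking.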
According to another preferred embodiment of the present invention, before the entity to be disambiguated is obtained, article-level screened data is produced by filtering the original data through the input layer.
According to another preferred embodiment of the present invention, the screened data is obtained and preprocessed, the preprocessing comprising: segmenting the screened text with a word segmenter to obtain word-level corpus information.
According to another preferred embodiment of the present invention, the preprocessing further comprises: removing stop words and meaningless function words from the screened text.
According to another preferred embodiment of the present invention, the input layer fetches candidate entities having the same name as the entity to be disambiguated and, after obtaining the resume information of each candidate entity, constructs a plurality of sentence pairs from each candidate entity resume and the resume of the entity to be disambiguated for Bi-LSTM model training.
According to another preferred embodiment of the present invention, a sentence pair comprises a candidate entity resume sentence and a resume sentence of the entity to be disambiguated; the former is a token sequence text composed of words from the candidate entity resume, and the latter a token sequence text composed of words from the resume of the entity to be disambiguated.
According to another preferred embodiment of the present invention, the Bi-LSTM model performs feature extraction on the token sequence text of the candidate entity resume and on that of the resume of the entity to be disambiguated, obtaining a first candidate entity resume text sequence representation A and a first to-be-disambiguated entity resume text sequence representation B.
According to another preferred embodiment of the present invention, soft attention feature matrices are computed for the candidate entity resume text sequence representation A and the to-be-disambiguated entity resume text sequence representation B, and the normalized soft attention weights are calculated.
According to another preferred embodiment of the present invention, a candidate entity resume text sequence soft attention vector representation A1 and a to-be-disambiguated entity resume text sequence soft attention vector representation B1 are computed; A1 is fused with the candidate entity resume text sequence representation A, and B1 with the to-be-disambiguated entity resume text sequence representation B, generating a fused candidate entity resume text representation A3 and a fused to-be-disambiguated entity resume text representation B3; A3 and B3 are then input into a Bi-LSTM model to generate a second candidate entity resume text sequence representation a and a second to-be-disambiguated entity resume text sequence representation b.
According to another preferred embodiment of the present invention, average pooling and maximum pooling are applied to the second representations a and b, producing four pooled feature vectors [a1, a2, b1, b2], where a1 and b1 are the average-pooled feature vectors and a2 and b2 the maximum-pooled ones; the four pooled vectors are concatenated (concat), the concatenated feature vector is input into a softmax function, and the final similarity value is output.
In order to achieve at least one of the above objects, the invention further provides a text similarity entity disambiguation system for a character entity library, the system employing the above text similarity entity disambiguation method for a character entity library.
Drawings
FIG. 1 is a schematic diagram showing a process of a text similarity entity disambiguation method for a people entity library according to the present invention;
FIG. 2 is a schematic diagram of a text similarity model framework according to the present invention;
FIG. 3 is a schematic diagram showing a principle of text similarity fine-ranking according to the present invention.
Detailed Description
The following description is presented to disclose the invention so as to enable any person skilled in the art to practice the invention. The preferred embodiments in the following description are given by way of example only, and other obvious variations will occur to those skilled in the art. The basic principles of the invention, as defined in the following description, may be applied to other embodiments, variations, modifications, equivalents, and other technical solutions without departing from the spirit and scope of the invention.
It will be understood by those skilled in the art that in the present disclosure, the terms "longitudinal," "lateral," "upper," "lower," "front," "rear," "left," "right," "vertical," "horizontal," "top," "bottom," "inner," "outer," and the like are used in an orientation or positional relationship indicated in the drawings for ease of description and simplicity of description, and do not indicate or imply that the referenced devices or components must be in a particular orientation, constructed and operated in a particular orientation, and thus the above terms are not to be construed as limiting the present invention.
Referring to fig. 1-3, the present invention discloses a text similarity entity disambiguation method and system for a people entity library, wherein the method comprises the following steps:
the method comprises the steps of obtaining original data in the Internet through a web crawler technology, and fishing entity resume information from the original data, wherein the entity resume information is entity resume information to be disambiguated. And constructing a candidate entity information base in the local knowledge base, and extracting candidate entity resume information from the local knowledge base. And establishing a plurality of sentence pairs by each candidate entity resume information and the resume information of the entity to be disambiguated, wherein the sentence pairs are used for Bi-LSTM model training.
Specifically, after being collected by the web crawler, the original data is fed to the input layer for data processing. The processing includes screening the entity information in the original data and obtaining article-level entity resume data through character recognition; the article-level entity resume data includes person information such as name, home address, work address, native place, position, job level and salary level. The screened article-level original data is then input to the data preprocessing layer, which produces clean, meaningful word-level data.
The word-level data is obtained as follows: a Chinese word segmenter performs Chinese word segmentation on the article-level original data to obtain word-level corpus data. The word-level corpus data covers name, home address, gender, nationality, work address, native place, position, job level, salary level and so on; for example, it may contain items such as "Zhang San", "male", "Han", "Hangzhou", "Zhejiang" and "financial supervision". The same segmenter performs Chinese word segmentation on the candidate entity resumes to obtain the corresponding word-level corpus information, which is used to construct the token sequence text sentences.
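The segmentation step can be sketched as below. A real system would use a trained Chinese segmenter (e.g. jieba); here a self-contained dictionary-based forward-maximum-matching segmenter stands in, and the vocabulary and stop-word list are purely illustrative.

```python
# Illustrative stand-in for a Chinese word segmenter: forward maximum matching
# against a small hand-built vocabulary (hypothetical; a production system
# would use a trained segmenter).
VOCAB = {"张三", "杭州", "浙江"}
STOPWORDS = {"的", "和"}          # function words treated as noise

def fmm_segment(text, vocab, max_len=4):
    """Greedily take the longest dictionary word at each position."""
    tokens, i = [], 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + length]
            if length == 1 or piece in vocab:   # fall back to single characters
                tokens.append(piece)
                i += length
                break
    return tokens

def to_word_level(text, vocab=VOCAB, stopwords=STOPWORDS):
    """Segment, then drop stop words to get clean word-level corpus data."""
    return [t for t in fmm_segment(text, vocab) if t not in stopwords]
```

For example, `to_word_level("张三的杭州和浙江")` yields `["张三", "杭州", "浙江"]`, the kind of clean word-level corpus the next stages consume.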
The input layer also includes a stop word module for removing stop words including, but not limited to, "the". Because the original data and the candidate entity resume information contain meaningless function words, which are one source of noise in the text, the input layer of the invention further includes a module that removes meaningless function words such as adverbs, prepositions, conjunctions and auxiliary words from the resume of the entity to be disambiguated and from the candidate entity resume texts. After these are removed, the resume texts are cleaner.
Further, the input layer uses the cleaned resume of the entity to be disambiguated and the cleaned candidate entity resume texts to construct token sequence text sentences, one per resume. The token sequence text sentence of the to-be-disambiguated entity resume is a sentence composed of n words with practical meaning, and that of the candidate entity resume a sentence composed of m such words; their formats are [token_1, token_2, ..., token_n] and [token_1, token_2, ..., token_m], where the numbers index the filled-in words. It should be noted that the invention uses BERT for feature representation. BERT is an unsupervised pre-training method that trains a general "language understanding" model on a large text corpus (Wikipedia); it was the first unsupervised, deeply bidirectional model for NLP pre-training and performs better than earlier methods.
The token sequence text sentences of the to-be-disambiguated entity resume and of the candidate entity resume are input into separate Bi-LSTM models for feature extraction, yielding a first candidate entity resume text sequence representation A and a first to-be-disambiguated entity resume text sequence representation B, where A = [a_1, a_2, ..., a_n] and B = [b_1, b_2, ..., b_m]. The soft attention feature matrix of A and that of B are then computed and normalized: the weights are e_ij = a_i^T b_j, where a_i is a vector of [a_1, a_2, ..., a_n] and b_j a vector of [b_1, b_2, ..., b_m].
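The alignment step can be sketched with NumPy as follows. This is a minimal sketch under the assumption that A and B are row-wise matrices of the Bi-LSTM outputs (it is not the patent's exact computation): e_ij = a_i^T b_j, then row/column softmax normalization yields the soft attention weights and the aligned representations A1 and B1.

```python
import numpy as np

def softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)   # shift for numerical stability
    w = np.exp(x)
    return w / w.sum(axis=axis, keepdims=True)

def soft_attention(A, B):
    """A: (n, d) text sequence representation; B: (m, d) the other one.
    Returns e with e_ij = a_i^T b_j plus the aligned representations A1, B1."""
    e = A @ B.T                    # (n, m) raw soft attention weights
    A1 = softmax(e, axis=1) @ B    # each a_i rewritten as a weighted sum over B
    B1 = softmax(e, axis=0).T @ A  # each b_j rewritten as a weighted sum over A
    return e, A1, B1
```

With A = [[1,0],[0,1]] and B = [[1,0]], e is [[1],[0]], and each row of A1 collapses onto B's single vector since the softmax over one column is 1.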
Further, a candidate entity resume text sequence soft attention vector representation A1 and a to-be-disambiguated entity resume text sequence soft attention vector representation B1 are computed. The first candidate entity resume text sequence representation A is fused with A1, and the first to-be-disambiguated entity resume text sequence representation B with B1, forming a fused candidate entity resume text representation A3 and a fused to-be-disambiguated entity resume text representation B3. It is worth mentioning that the fusion proceeds as follows: the two word vectors, the vector obtained by adding them, and the vector obtained by multiplying them are concatenated (spliced); the concatenated vector is input into the fusion layer, which finally outputs A3 and B3. It should be noted that the vector splicing may be in the horizontal or the vertical direction; the invention does not limit this.
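One plausible reading of the fusion just described (the exact composition is ambiguous in the text) splices each representation with its soft-attention counterpart, their element-wise sum, and their element-wise product, ESIM-style; the sketch below uses horizontal splicing, which the patent leaves open.

```python
import numpy as np

def fuse(X, X_aligned):
    """Splice (concat) a text sequence representation with its soft attention
    vector representation, their element-wise sum, and their element-wise
    product, producing the input of the fusion layer."""
    return np.concatenate([X, X_aligned, X + X_aligned, X * X_aligned], axis=-1)
```

Applied to A and A1 this yields (the input of) A3, with the feature dimension quadrupled by the splice.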
The fused candidate entity resume text representation A3 and fused to-be-disambiguated entity resume text representation B3 are input into the corresponding Bi-LSTM models for further feature extraction, obtaining a second candidate entity resume text sequence representation a and a second to-be-disambiguated entity resume text sequence representation b. Referring to fig. 3, a and b are input to the pooling layer for average pooling and maximum pooling: average pooling of a and b outputs the feature vectors a1 and b1, and maximum pooling outputs a2 and b2. The four pooled vectors [a1, a2, b1, b2] are concatenated (spliced), and the concatenated feature vector is input into a softmax function to obtain the final similarity value. It should be noted that the softmax function maps the outputs of several neurons into the interval (0, 1), so the final degree of text matching can be read off: within (0, 1), a higher score means a better match, and the resume of the entity to be disambiguated is merged into the highest-scoring candidate entity resume in the corresponding knowledge base.
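The pooling-and-scoring head can be sketched as follows; `W` and `bias` are hypothetical trained parameters of a plain linear layer standing in for whatever classifier precedes the softmax in the patent.

```python
import numpy as np

def softmax(x):
    x = x - x.max()               # shift for numerical stability
    w = np.exp(x)
    return w / w.sum()

def similarity_head(a, b, W, bias):
    """a, b: (seq_len, d) second Bi-LSTM outputs. Average- and max-pool each,
    splice the four pooled vectors [a1, a2, b1, b2], then map the result
    through softmax so the scores land in (0, 1)."""
    pooled = np.concatenate([a.mean(axis=0), a.max(axis=0),
                             b.mean(axis=0), b.max(axis=0)])
    return softmax(W @ pooled + bias)
```

The output probabilities sum to 1; the entry for the "match" class serves as the final similarity value.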
In another preferred embodiment of the invention, the trained text similarity model undergoes periodic incremental training: newly added entities to be disambiguated are processed into incremental data according to the preprocessing flow above and fed into the text similarity model for training. The model obtained after incremental training is more robust and can handle more character resumes.
It should be noted that, in another preferred embodiment of the invention, candidate entity resumes similar to the resume of the entity to be disambiguated are also coarsely recalled. The coarse recall works as follows: the candidate entity resume sentences similar to the resume of the entity to be disambiguated are segmented to obtain word vector lists; each word vector list is pooled into a sentence vector; the sentence text similarity score of every two sentence vectors is computed with cosine similarity; a text similarity threshold is set; and a candidate entity resume sentence whose text similarity exceeds the threshold is taken as a training text.
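The sentence-level coarse recall for assembling training texts can be sketched as below. Sentence vectors are formed here by max-pooling the word vectors (the text only says "pooling"), and the tiny `W2V` table is an illustrative stand-in for trained vectors.

```python
import numpy as np

# Hypothetical word vectors for illustration only.
W2V = {"杭州": np.array([1.0, 0.0]), "浙江": np.array([0.8, 0.6]),
       "北京": np.array([0.0, 1.0])}

def sent_vec(tokens, w2v, dim=2):
    """Max-pool the word vectors of a segmented sentence into one vector."""
    vecs = [w2v[t] for t in tokens if t in w2v]
    return np.max(vecs, axis=0) if vecs else np.zeros(dim)

def select_training_pairs(query_sents, cand_sents, w2v, threshold=0.8):
    """Pair each to-be-disambiguated resume sentence with every candidate
    resume sentence and keep the pairs whose cosine similarity exceeds
    the threshold, for use as training texts."""
    kept = []
    for q in query_sents:
        qv = sent_vec(q, w2v)
        for c in cand_sents:
            cv = sent_vec(c, w2v)
            denom = np.linalg.norm(qv) * np.linalg.norm(cv)
            score = float(qv @ cv / denom) if denom else 0.0
            if score > threshold:
                kept.append((q, c))
    return kept
```

With the toy vectors above, a "杭州" sentence pairs with another "杭州" sentence (similarity 1.0) but not with a "北京" sentence (similarity 0.0).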
It is understood that the terms "a" and "an" mean that the number of an element may be one in one embodiment and plural in another embodiment; the terms "a" and "an" are not to be interpreted as limiting the number.
In particular, according to the embodiments of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication section, and/or installed from a removable medium. The computer program, when executed by a Central Processing Unit (CPU), performs the above-described functions defined in the method of the present application. It should be noted that the computer readable medium mentioned above in the present application may be a computer readable signal medium or a computer readable storage medium or any combination of the two. The computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present application, a computer readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.
In this application, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wireline, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
It will be understood by those skilled in the art that the embodiments of the present invention described above and illustrated in the drawings are given by way of example only and not by way of limitation, the objects of the invention having been fully and effectively achieved, the functional and structural principles of the present invention having been shown and described in the embodiments, and that various changes or modifications may be made in the embodiments of the present invention without departing from such principles.

Claims (11)

1. A text similarity entity disambiguation method for a people entity library, the method comprising the steps of:
acquiring an entity to be disambiguated, and retrieving candidate entities from a knowledge base according to the entity to be disambiguated;
calculating word2vec vectors from the person-entity resume texts of the entity to be disambiguated and of the candidate entities, calculating cosine similarity scores from the word2vec vector pairs, setting a cosine similarity threshold, and coarsely recalling those candidate entities whose cosine similarity score with the entity to be disambiguated is greater than the cosine similarity threshold;
adopting a Bi-LSTM model to calculate the text similarity of the entity to be disambiguated and the candidate entity: performing feature extraction on the entity to be disambiguated and the candidate entity with the Bi-LSTM model respectively, and obtaining a text sequence representation of the entity to be disambiguated and a text sequence representation of the candidate entity;
calculating feature vectors from the text sequence representation of the entity to be disambiguated and from the text sequence representations of the candidate entities respectively, and fusing the text sequence representation and the feature vector of the same entity;
and calculating text similarity according to the feature vector fusion result of the entity to be disambiguated and the candidate entity.
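The coarse-recall step of claim 1 (averaged word2vec resume vectors, cosine similarity, threshold filter) can be sketched as follows; the 4-dimensional embedding table, token lists, and threshold values are illustrative stand-ins, not values from the patent:

```python
import numpy as np

# Toy 4-d "word2vec" table standing in for real pretrained vectors.
EMB = {
    "engineer":  np.array([0.9, 0.1, 0.0, 0.2]),
    "professor": np.array([0.1, 0.9, 0.1, 0.0]),
    "hangzhou":  np.array([0.2, 0.1, 0.8, 0.1]),
    "beijing":   np.array([0.1, 0.2, 0.9, 0.0]),
}

def resume_vector(tokens):
    """Average the word vectors of a tokenized resume (zeros if no hits)."""
    vecs = [EMB[t] for t in tokens if t in EMB]
    return np.mean(vecs, axis=0) if vecs else np.zeros(4)

def cosine(u, v):
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom else 0.0

def coarse_recall(query_tokens, candidates, threshold=0.5):
    """Keep candidates whose resume vector is cosine-similar enough to the query."""
    q = resume_vector(query_tokens)
    return [name for name, toks in candidates
            if cosine(q, resume_vector(toks)) > threshold]
```

Only candidates surviving this cheap filter reach the more expensive Bi-LSTM comparison described in the following steps.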
2. The method of claim 1, wherein, before the entity to be disambiguated is acquired, article-level filtered data is obtained by filtering the raw data through an input layer.
3. The method of claim 1, wherein the filtered data is acquired and preprocessed, the preprocessing comprising: segmenting the filtered data text with a tokenizer to obtain word-level corpus information.
4. The method of claim 3, wherein the preprocessing further comprises: removing stop words and meaningless function words from the filtered data text.
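The preprocessing of claims 3 and 4 (segmentation into word-level tokens, then stop-word/function-word removal) can be illustrated with a minimal forward maximum-match segmenter; a production system would use a trained tokenizer (e.g. jieba for Chinese), and the tiny vocabulary and stop-word set below are invented for the sketch:

```python
# Toy dictionary and stop-word list; real systems load these from resources.
VOCAB = {"新华", "智云", "科技", "公司", "的"}
STOPWORDS = {"的"}          # stop words / meaningless function words to drop
MAX_LEN = 4                 # longest dictionary entry we try to match

def segment(text):
    """Greedy forward maximum-match: take the longest dictionary hit at each position."""
    tokens, i = [], 0
    while i < len(text):
        for j in range(min(len(text), i + MAX_LEN), i, -1):
            # Fall back to a single character when nothing in VOCAB matches.
            if text[i:j] in VOCAB or j == i + 1:
                tokens.append(text[i:j])
                i = j
                break
    return tokens

def preprocess(text):
    """Segment, then filter out stop words to keep content-bearing tokens."""
    return [t for t in segment(text) if t not in STOPWORDS]
```

This yields the word-level corpus information that later feeds the word2vec and Bi-LSTM stages.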
5. The method of claim 2, wherein the input layer retrieves candidate entities with the same name as the entity to be disambiguated, and after obtaining the resume information of a plurality of candidate entities, constructs a plurality of sentence pairs for Bi-LSTM model training from each candidate entity resume and the entity to be disambiguated.
6. The method of claim 5, wherein the sentence pair comprises a candidate entity resume sentence and a to-be-disambiguated entity resume sentence, the candidate entity resume sentence is a token sequence text composed of a plurality of words in a candidate entity resume, and the to-be-disambiguated entity resume sentence is a token sequence text composed of a plurality of words in a to-be-disambiguated entity resume.
7. The method as claimed in claim 6, wherein the Bi-LSTM model is adopted to perform feature extraction on token sequence text of the resume of the candidate entity and token sequence text of the resume of the entity to be disambiguated, respectively, so as to obtain a first candidate entity resume text sequence representation a and a first entity resume text sequence representation B to be disambiguated.
8. The text similarity entity disambiguation method for the human entity library according to claim 7, wherein soft attention feature vector matrices are calculated for the candidate entity resume text sequence representation A and the to-be-disambiguated entity resume text sequence representation B respectively, and normalized soft attention weights are calculated.
9. The method of claim 8, wherein a candidate entity resume text sequence soft attention vector representation A1 and a to-be-disambiguated entity resume text sequence soft attention vector representation B1 are calculated; A1 is fused with the candidate entity resume text sequence representation A, and B1 is fused with the to-be-disambiguated entity resume text sequence representation B, generating a fused candidate entity resume text representation A3 and a fused to-be-disambiguated entity resume text representation B3 respectively; and A3 and B3 are respectively input into a Bi-LSTM model to generate a second candidate entity resume text sequence representation a and a second to-be-disambiguated entity resume text sequence representation b.
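The soft attention and fusion of claims 8 and 9 follow the familiar ESIM-style pattern: align the two Bi-LSTM output sequences via a dot-product score matrix, normalize with softmax, and fuse each sequence with its attended counterpart. A numpy sketch, where the concatenation form [x; x_att; x − x_att; x ⊙ x_att] is an assumed (common) fusion choice rather than one stated in the patent:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def soft_attention_fuse(A, B):
    """ESIM-style soft attention between two Bi-LSTM output sequences.

    A: (la, d) candidate resume representation; B: (lb, d) to-be-disambiguated
    resume representation. Returns fused representations A3 (la, 4d), B3 (lb, 4d).
    """
    E = A @ B.T                      # (la, lb) alignment score matrix
    A1 = softmax(E, axis=1) @ B      # each A token attends over B (soft attention A1)
    B1 = softmax(E.T, axis=1) @ A    # each B token attends over A (soft attention B1)
    A3 = np.concatenate([A, A1, A - A1, A * A1], axis=1)   # fuse A with A1
    B3 = np.concatenate([B, B1, B - B1, B * B1], axis=1)   # fuse B with B1
    return A3, B3
```

A3 and B3 would then be re-encoded by a second Bi-LSTM pass to produce the representations a and b used in claim 10.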
10. The method of claim 9, wherein the second candidate entity resume text sequence representation a and the second to-be-disambiguated entity resume text sequence representation b are each subjected to average pooling and maximum pooling to obtain four pooled feature vectors [a1, a2, b1, b2], where a1 and b1 are the average-pooled feature vectors and a2 and b2 are the max-pooled feature vectors; the four pooled feature vectors are concatenated, and the concatenated feature vector is input into a softmax function to output the final similarity value.
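The pooling-and-classification head of claim 10 can be sketched in a few lines; the final linear projection (W, bias) producing two scores [dissimilar, similar] is an assumed detail, as the claim only specifies pooling, concatenation, and softmax:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def similarity_head(a, b, W, bias):
    """Pool both sequences, concat [a1, a2, b1, b2], project, softmax.

    a: (la, d), b: (lb, d) second-pass Bi-LSTM outputs.
    W: (4d, 2), bias: (2,) — an assumed final linear layer.
    """
    a1, a2 = a.mean(axis=0), a.max(axis=0)   # average / max pooling of a
    b1, b2 = b.mean(axis=0), b.max(axis=0)   # average / max pooling of b
    feat = np.concatenate([a1, a2, b1, b2])  # concatenated feature vector (4d,)
    return softmax(feat @ W + bias)          # final similarity distribution
```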
11. A text similarity entity disambiguation system for a human entity library, the system employing a text similarity entity disambiguation method for a human entity library as described in any of claims 1-10 above.
CN202110118190.7A 2021-01-28 2021-01-28 Text similarity entity disambiguation method and system for character entity library Pending CN112784018A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110118190.7A CN112784018A (en) 2021-01-28 2021-01-28 Text similarity entity disambiguation method and system for character entity library

Publications (1)

Publication Number Publication Date
CN112784018A true CN112784018A (en) 2021-05-11

Family

ID=75759389

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110118190.7A Pending CN112784018A (en) 2021-01-28 2021-01-28 Text similarity entity disambiguation method and system for character entity library

Country Status (1)

Country Link
CN (1) CN112784018A (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109543180A (en) * 2018-11-08 2019-03-29 中山大学 A kind of text emotion analysis method based on attention mechanism
CN110083682A (en) * 2019-04-19 2019-08-02 西安交通大学 It is a kind of to understand answer acquisition methods based on the machine readings for taking turns attention mechanism more
CN110427612A (en) * 2019-07-02 2019-11-08 平安科技(深圳)有限公司 Based on multilingual entity disambiguation method, device, equipment and storage medium
WO2020193966A1 (en) * 2019-03-26 2020-10-01 Benevolentai Technology Limited Name entity recognition with deep learning
CN112016314A (en) * 2020-09-17 2020-12-01 汪秀英 Medical text understanding method and system based on BERT model
CN112182166A (en) * 2020-10-29 2021-01-05 腾讯科技(深圳)有限公司 Text matching method and device, electronic equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ALBERTO et al.: "Introduction to Data Science: Python" (《数据科学导论 Python语言》), 30 March 2020 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114580392A (en) * 2022-04-29 2022-06-03 中科雨辰科技有限公司 Data processing system for identifying entity
CN114580392B (en) * 2022-04-29 2022-07-29 中科雨辰科技有限公司 Data processing system for identifying entity

Similar Documents

Publication Publication Date Title
CN110705301B (en) Entity relationship extraction method and device, storage medium and electronic equipment
CN110110062B (en) Machine intelligent question and answer method and device and electronic equipment
WO2018157805A1 (en) Automatic questioning and answering processing method and automatic questioning and answering system
CN106874441B (en) Intelligent question-answering method and device
CN110929038B (en) Knowledge graph-based entity linking method, device, equipment and storage medium
CN111090736B (en) Question-answering model training method, question-answering method, device and computer storage medium
CN116775847B (en) Question answering method and system based on knowledge graph and large language model
CN112214593A (en) Question and answer processing method and device, electronic equipment and storage medium
CN112131881B (en) Information extraction method and device, electronic equipment and storage medium
CN113204611A (en) Method for establishing reading understanding model, reading understanding method and corresponding device
CN115203507A (en) Event extraction method based on pre-training model and oriented to document field
CN113468891A (en) Text processing method and device
CN107526721A (en) A kind of disambiguation method and device to electric business product review vocabulary
CN112580331A (en) Method and system for establishing knowledge graph of policy text
JP2023002690A (en) Semantics recognition method, apparatus, electronic device, and storage medium
CN117746078B (en) Object detection method and system based on user-defined category
CN115408488A (en) Segmentation method and system for novel scene text
CN111241843B (en) Semantic relation inference system and method based on composite neural network
CN116913278B (en) Voice processing method, device, equipment and storage medium
CN112784018A (en) Text similarity entity disambiguation method and system for character entity library
CN113705207A (en) Grammar error recognition method and device
CN112380861A (en) Model training method and device and intention identification method and device
CN112287077A (en) Statement extraction method and device for combining RPA and AI for document, storage medium and electronic equipment
US20140372106A1 (en) Assisted Free Form Decision Definition Using Rules Vocabulary
CN114416923A (en) News entity linking method and system based on rich text characteristics

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20221219

Address after: Room 430, cultural center, 460 Wenyi West Road, Xihu District, Hangzhou City, Zhejiang Province, 310012

Applicant after: XINHUA ZHIYUN TECHNOLOGY Co.,Ltd.

Applicant after: Xinhua fusion media technology development (Beijing) Co.,Ltd.

Address before: Room 430, cultural center, 460 Wenyi West Road, Xihu District, Hangzhou City, Zhejiang Province, 310012

Applicant before: XINHUA ZHIYUN TECHNOLOGY Co.,Ltd.

RJ01 Rejection of invention patent application after publication

Application publication date: 20210511
