CN114266245A - Entity linking method and device

Info

Publication number: CN114266245A
Application number: CN202010974083.XA
Authority: CN (China)
Legal status: Pending
Other languages: Chinese (zh)
Inventors: 弓源, 李长亮, 汪美玲
Applicant / Assignee: Beijing Kingsoft Digital Entertainment Co Ltd
Prior art keywords: vector, entity, text, linked, information

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract

The application provides an entity linking method and an entity linking device. The entity linking method comprises: inputting a text to be linked into a pre-trained information labeling model and obtaining a first encoding vector output by the information labeling model, wherein the first encoding vector represents an entity designation of the text to be linked; screening resume information related to the text to be linked from a pre-constructed knowledge graph based on the first encoding vector and time information contained in the text to be linked; inputting the resume information into a pre-trained vector coding model and obtaining a second encoding vector output by the vector coding model, wherein the second encoding vector represents a candidate entity of the resume information; and performing entity linking between the entity designation and the candidate entity under the condition that the matching degree between the first encoding vector and the second encoding vector is determined to be greater than a preset matching degree threshold.

Description

Entity linking method and device
Technical Field
The present application relates to the field of knowledge graph technology, and in particular, to an entity linking method and apparatus, a computing device, and a computer-readable storage medium.
Background
With the rapid development of the Internet, text on the network contains a large number of entity names, such as person names, place names and organization names. Due to the diversity of natural language expressions, an entity reference may point to multiple real entities; therefore, in order to correctly understand the real meaning of an entity reference in a text, it is necessary to link the entity reference in the text to the corresponding unambiguous entity in an entity knowledge base.
Owing to their advantages such as end-to-end learning and no need for manual feature engineering, neural network and deep learning methods have been rapidly applied to tasks in computer vision and natural language processing and have achieved results superior to those of traditional methods. The entity linking field is no exception: methods such as shallow word vectors and neural network models have improved the effect of the entity linking task to a certain extent, but problems such as low efficiency, limited applicability and low accuracy still exist and urgently need to be solved.
Disclosure of Invention
In view of this, embodiments of the present application provide an entity linking method and apparatus, a computing device, and a computer-readable storage medium, so as to solve technical defects in the prior art.
According to a first aspect of embodiments of the present application, there is provided an entity linking method, including:
inputting a text to be linked into a pre-trained information labeling model, and obtaining a first coding vector output by the information labeling model, wherein the first coding vector represents an entity designation of the text to be linked;
screening resume information related to the text to be linked from a pre-constructed knowledge graph based on the first encoding vector and time information contained in the text to be linked;
inputting the resume information into a vector coding model trained in advance to obtain a second coding vector output by the vector coding model, wherein the second coding vector represents a candidate entity of the resume information;
and performing entity linking between the entity designation and the candidate entity under the condition that the matching degree between the first encoding vector and the second encoding vector is determined to be greater than a preset matching degree threshold.
Optionally, the inputting a text to be linked into a pre-trained information labeling model to obtain a first coding vector output by the information labeling model, where the first coding vector represents an entity name of the text to be linked, includes:
inputting a text to be linked into the information labeling model, performing word segmentation processing on the text to be linked to obtain word units of the text to be linked, and performing pre-embedding processing on the word units to obtain word vectors, sentence vectors and position vectors corresponding to the word units;
adding the word vector, the sentence vector and the position vector corresponding to the word unit to generate a vector to be linked corresponding to the word unit;
a vector coding module of the information labeling model codes the vector to be linked to obtain an intermediate coding vector;
and a named entity labeling module of the information labeling model performs entity designation recognition on the text to be linked, and performs entity designation labeling on the intermediate coding vector according to the recognition result to generate the first coding vector.
Optionally, the inputting the resume information into a pre-trained vector coding model to obtain a second coding vector output by the vector coding model includes:
inputting the resume information into the vector coding model, performing word segmentation processing on the resume information to obtain word units of the resume information, and performing pre-embedding processing on the word units to obtain word vectors, sentence vectors and position vectors corresponding to the word units;
adding the word vector, the sentence vector and the position vector corresponding to the word unit to generate the intermediate vector corresponding to the word unit;
and inputting the intermediate vector into a vector coding module of the information labeling model for coding to obtain the second coding vector.
Optionally, the screening, based on the first encoding vector and the time information contained in the text to be linked, resume information related to the text to be linked from a pre-constructed knowledge graph includes:
preprocessing the text to be linked to obtain time information contained in the text to be linked;
performing first information screening in a pre-constructed knowledge graph based on the entity designation to obtain intermediate screening information;
and carrying out second information screening on the intermediate screening information based on the time information to obtain the resume information.
Optionally, after obtaining the second encoding vector output by the vector coding model, the method further includes:
splicing the first encoding vector with the second encoding vector;
determining the matching degree between the query text and the resume description information according to the spliced coding vectors;
and performing entity linking between the entity designation and the candidate entity under the condition that the matching degree between the first encoding vector and the second encoding vector is determined to be greater than a preset matching degree threshold.
Optionally, after obtaining the second encoding vector output by the vector coding model, the method further includes:
splicing the first encoding vector with the second encoding vector;
and inputting the code vectors obtained by splicing into a text matching module of the vector coding model, and calculating the matching degree of the code vectors by the text matching module and outputting a calculation result.
Optionally, the entity linking the entity reference with the candidate entity includes:
entity screening is carried out on the candidate entities to obtain target candidate entities;
determining a corresponding target relation and attribute information of the target candidate entity according to the target candidate entity;
and performing entity link on the entity designation and the candidate entity based on the target relationship and the attribute information.
Optionally, the entity linking method further includes:
determining position vectors corresponding to the entity designations in the text to be linked;
fusing the position vector and the first encoding vector to obtain a fused encoding vector;
and under the condition that the matching degree between the fused encoding vector and the second encoding vector is determined to be larger than a preset matching degree threshold value, performing entity link on the entity designation and the candidate entity.
According to a second aspect of embodiments of the present application, there is provided an entity linking apparatus, including:
the first processing module is configured to input a text to be linked into a pre-trained information labeling model, and obtain a first encoding vector output by the information labeling model, wherein the first encoding vector represents an entity designation of the text to be linked;
the screening module is configured to screen resume information related to the text to be linked from a pre-constructed knowledge graph on the basis of the first encoding vector and time information contained in the text to be linked;
the second processing module is configured to input the resume information into a pre-trained vector coding model, and obtain a second coding vector output by the vector coding model, wherein the second coding vector represents a candidate entity of the resume information;
a linking module configured to perform entity linking between the entity designation and the candidate entity if it is determined that the matching degree between the first encoding vector and the second encoding vector is greater than a preset matching degree threshold.
According to a third aspect of embodiments herein, there is provided a computing device comprising a memory, a processor and computer instructions stored on the memory and executable on the processor, the processor implementing the steps of the entity linking method when executing the instructions.
According to a fourth aspect of embodiments herein, there is provided a computer readable storage medium storing computer instructions which, when executed by a processor, implement the steps of the entity linking method.
In the embodiments of the present application, a text to be linked is input into a pre-trained information labeling model, and a first encoding vector output by the information labeling model is obtained, the first encoding vector representing an entity designation of the text to be linked; resume information related to the text to be linked is screened out from a pre-constructed knowledge graph based on the first encoding vector and the time information contained in the text to be linked; the resume information is input into a pre-trained vector coding model, and a second encoding vector output by the vector coding model is obtained, the second encoding vector representing a candidate entity of the resume information; and entity linking is performed between the entity designation and the candidate entity under the condition that the matching degree between the first encoding vector and the second encoding vector is determined to be greater than a preset matching degree threshold.
By adopting a joint training mode, the method and the device complete the entity recognition and entity linking tasks simultaneously; and by preprocessing the text to be linked to extract time information and screening the resume information in the knowledge graph with the time information and the entity labeling information, the accuracy of the screening result is improved.
Drawings
FIG. 1 is a block diagram of a computing device provided by an embodiment of the present application;
FIG. 2 is a flowchart of an entity linking method provided by an embodiment of the present application;
fig. 3 is a schematic diagram of an architecture of a vector encoding module according to an embodiment of the present application;
FIG. 4 is a schematic diagram of an implementation process of an entity linking method provided in an embodiment of the present application;
FIG. 5 is a flowchart of a processing procedure of an entity linking method according to an embodiment of the present disclosure;
fig. 6 is a schematic structural diagram of an entity linking apparatus according to an embodiment of the present application.
Detailed Description
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present application. This application is capable of implementation in many different ways than those herein set forth and of similar import by those skilled in the art without departing from the spirit of this application and is therefore not limited to the specific implementations disclosed below.
The terminology used in the one or more embodiments of the present application is for the purpose of describing particular embodiments only and is not intended to be limiting of the one or more embodiments of the present application. As used in one or more embodiments of the present application and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used in one or more embodiments of the present application refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It will be understood that, although the terms first, second, etc. may be used herein in one or more embodiments of the present application to describe various information, these information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, a first aspect may be termed a second aspect, and, similarly, a second aspect may be termed a first aspect, without departing from the scope of one or more embodiments of the present application.
First, the noun terms involved in one or more embodiments of the present application are explained.
Knowledge graph (KG): the scientific knowledge graph is a concept from the field of library and information science; it is used for drawing, analyzing and displaying the interrelationships between disciplines or academic research subjects, and is a visual tool for revealing and displaying the development process and structural relationships of scientific knowledge.
Entity: the entity is the basic unit of the knowledge graph and is also an important language unit carrying information in the text.
Mention: the language fragment in natural text that expresses an entity.
Entity linking: the task of linking a mention in the text to the corresponding entity in the KG.
In the present application, an entity linking method and apparatus, a computing device and a computer readable storage medium are provided, which are described in detail in the following embodiments one by one.
FIG. 1 shows a block diagram of a computing device 100 according to an embodiment of the present application. The components of the computing device 100 include, but are not limited to, memory 110 and processor 120. The processor 120 is coupled to the memory 110 via a bus 130 and a database 150 is used to store data.
Computing device 100 also includes access device 140, access device 140 enabling computing device 100 to communicate via one or more networks 160. Examples of such networks include the Public Switched Telephone Network (PSTN), a Local Area Network (LAN), a Wide Area Network (WAN), a Personal Area Network (PAN), or a combination of communication networks such as the internet. Access device 140 may include one or more of any type of network interface (e.g., a Network Interface Card (NIC)) whether wired or wireless, such as an IEEE802.11 Wireless Local Area Network (WLAN) wireless interface, a worldwide interoperability for microwave access (Wi-MAX) interface, an ethernet interface, a Universal Serial Bus (USB) interface, a cellular network interface, a bluetooth interface, a Near Field Communication (NFC) interface, and so forth.
In one embodiment of the present application, the above-mentioned components of the computing device 100 and other components not shown in fig. 1 may also be connected to each other, for example, by a bus. It should be understood that the block diagram of the computing device architecture shown in FIG. 1 is for purposes of example only and is not limiting as to the scope of the present application. Those skilled in the art may add or replace other components as desired.
Computing device 100 may be any type of stationary or mobile computing device, including a mobile computer or mobile computing device (e.g., tablet, personal digital assistant, laptop, notebook, netbook, etc.), a mobile phone (e.g., smartphone), a wearable computing device (e.g., smartwatch, smartglasses, etc.), or other type of mobile device, or a stationary computing device such as a desktop computer or PC. Computing device 100 may also be a mobile or stationary server.
Wherein the processor 120 may perform the steps of the entity linking method shown in fig. 2. Fig. 2 shows a flowchart of an entity linking method according to an embodiment of the present application, including steps 202 to 208.
Step 202, inputting a text to be linked into a pre-trained information labeling model, and obtaining a first coding vector output by the information labeling model, wherein the first coding vector represents an entity designation of the text to be linked.
The entity linking method provided by the embodiments of this specification can be applied to fields such as government affairs, finance and military affairs. The information labeling model is realized through a pre-trained model, wherein the pre-trained model comprises 12 stack layers connected in sequence, and each stack layer further comprises a self-attention layer, a first normalization layer, a feedforward layer and a second normalization layer. The text to be linked is input as an input set to the embedding layer of the information labeling model to obtain a text vector; the text vector is input to the 1st stack layer, the output vector of the 1st stack layer is input to the 2nd stack layer, and so on, until the output vector of the last stack layer is obtained. The output vector of the last stack layer is taken as the representation vector of each word unit and input to a feedforward layer for processing, so as to obtain the encoding vector of the input set.
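As a minimal sketch of the stack-layer structure described above (self-attention layer, first normalization layer, feedforward layer, second normalization layer, twelve layers connected in sequence), the following illustrative Python code may help; the layer sizes, the residual connections and the use of nn.MultiheadAttention are assumptions for illustration and are not specified by this application.

# Illustrative sketch only: one "stack layer" and twelve such layers connected in sequence.
import torch
import torch.nn as nn

class StackLayer(nn.Module):
    def __init__(self, hidden_size=768, num_heads=12, ff_size=3072):
        super().__init__()
        self.self_attention = nn.MultiheadAttention(hidden_size, num_heads, batch_first=True)
        self.first_norm = nn.LayerNorm(hidden_size)
        self.feed_forward = nn.Sequential(
            nn.Linear(hidden_size, ff_size), nn.GELU(), nn.Linear(ff_size, hidden_size))
        self.second_norm = nn.LayerNorm(hidden_size)

    def forward(self, x):                               # x: (batch, seq_len, hidden_size)
        attn_out, _ = self.self_attention(x, x, x)
        x = self.first_norm(x + attn_out)               # first normalization layer
        x = self.second_norm(x + self.feed_forward(x))  # feedforward + second normalization layer
        return x

# Twelve stack layers connected in sequence, as described above.
encoder = nn.Sequential(*[StackLayer() for _ in range(12)])
text_vector = torch.randn(1, 12, 768)                   # output of the embedding layer (illustrative)
output_of_last_stack_layer = encoder(text_vector)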
Further, the text to be linked is input into a pre-trained information labeling model, and a first encoding vector output by the information labeling model is obtained, where the first encoding vector represents an entity designation of the text to be linked, and the method can be specifically implemented in the following manner:
inputting a text to be linked into the information labeling model, performing word segmentation processing on the text to be linked to obtain word units of the text to be linked, and performing pre-embedding processing on the word units to obtain word vectors, sentence vectors and position vectors corresponding to the word units;
adding the word vector, the sentence vector and the position vector corresponding to the word unit to generate a vector to be linked corresponding to the word unit;
a vector coding module of the information labeling model codes the vector to be linked to obtain an intermediate coding vector;
and a named entity labeling module of the information labeling model performs entity designation recognition on the text to be linked, and performs entity designation labeling on the intermediate coding vector according to the recognition result to generate the first coding vector.
In the embodiments of this specification, the information labeling model is a Named Entity Recognition (NER) model implemented on the basis of a pre-trained model (a BERT model). In practical applications, since the pre-trained model can provide general linguistic information and prior knowledge of general fields, a corresponding model can be constructed by combining the pre-trained model with other layer structures; for example, combining the pre-trained model with a classification layer yields the information labeling model. For convenience of description, the BERT model within the information labeling model is collectively referred to as the vector coding module.
In the application stage, a query text (the text to be linked) is input into the information labeling model. The embedding layer of the information labeling model performs word segmentation on the text to be linked to obtain its word units, and performs pre-embedding on the word units to obtain the word vector, sentence vector and position vector corresponding to each word unit; these three vectors are then added to generate the vector to be linked corresponding to each word unit. The vector to be linked is encoded by the vector coding module of the information labeling model to obtain an intermediate encoding vector. Finally, the intermediate encoding vector is input to the classification layer (the named entity labeling module), which performs entity designation labeling on the intermediate encoding vector and outputs the first encoding vector carrying the entity designation labeling information; the first encoding vector characterizes an entity designation (mention) of the text to be linked.
Specifically, an entity designation is a language fragment expressing an entity in natural text. Taking the text to be linked (query text) "Zhang San is a leader in city X; he has worked in city X for three years and obtained a superior leader title in year Y" as an example, "Zhang San", "city X", "a leader", "he" and "superior leader" in the text to be linked can all be taken as entity designations.
In the embodiments of this specification, the input set may adopt the following format: [[CLS], text to be linked, [SEP]].
Taking a given text to be linked, "Zhang San is a hero in the Z game", as an example, the text to be linked can be used as an input set and input, as a character string, into the vector coding module of the NER model to obtain the output intermediate encoding vector; a specific schematic diagram is shown in fig. 3, where the input vector generated by the embedding layer is formed by summing the following three vectors:
word unit vector-the vector to which each word unit corresponds;
sentence vector-the sentence vector to which each word unit belongs;
position vector-a vector generated by the position corresponding to each word unit.
For example, for the text to be linked "Zhang San is a hero in the Z game", word segmentation is performed on the text to be linked to obtain a word unit set [[CLS], Zhang, San, ..., Ying, Xiong, [SEP]], where [CLS] is the sentence-start symbol and [SEP] is the sentence-separation symbol. The word unit set is embedded and input into the vector coding module, the output vector of the last stack layer of the vector coding module is taken as the representation vector of each word unit and input into a feedforward layer for processing, and the encoding vector of the input set is [A1, A2, ..., A10, A11].
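For readers who prefer code, the pre-embedding step described above (the input vector of each word unit being the sum of its word vector, sentence vector and position vector) can be sketched as follows; the vocabulary size, hidden size and maximum length are illustrative assumptions, not values fixed by this application.

# A minimal sketch of the pre-embedding step: the vector to be linked for each word unit
# is the sum of its word vector, sentence vector and position vector.
import torch
import torch.nn as nn

class PreEmbedding(nn.Module):
    def __init__(self, vocab_size=21128, hidden_size=768, max_len=512, num_sentences=2):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, hidden_size)       # word unit vector
        self.sent_emb = nn.Embedding(num_sentences, hidden_size)    # sentence vector
        self.pos_emb = nn.Embedding(max_len, hidden_size)           # position vector

    def forward(self, token_ids, sentence_ids):
        positions = torch.arange(token_ids.size(1), device=token_ids.device).unsqueeze(0)
        # The three vectors are added to generate the vector to be linked for each word unit.
        return self.word_emb(token_ids) + self.sent_emb(sentence_ids) + self.pos_emb(positions)

embedding = PreEmbedding()
token_ids = torch.randint(0, 21128, (1, 12))   # [CLS], the word units of the text, [SEP]
vectors_to_be_linked = embedding(token_ids, torch.zeros(1, 12, dtype=torch.long))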
After the vector coding module outputs the intermediate encoding vector, entity designation labeling is performed on each word unit of the intermediate encoding vector by the classification layer, so as to determine the start or end position index of each entity designation in the text to be linked. The vector coding module here is equivalent to an encoder with certain prior characteristics: after the input text to be linked is encoded by the vector coding module, the classification layer of the NER model executes a sequence labeling task on the intermediate encoding vector output by the vector coding module, that is, it recognizes the label category corresponding to each character in the intermediate encoding vector and performs information labeling according to that category, for example labeling non-target information with the label o and target information with the label pr.
According to the labeling result, it can be determined which entity designations exist in the text to be linked, the position index of each entity designation, the ID of each entity designation in the knowledge base, and so on.
In the embodiments of this specification, having the classification layer of the NER model execute the sequence labeling task on the intermediate encoding vector output by the vector coding module, after the input text to be linked has been encoded by that module, is beneficial to ensuring the accuracy of the labeling result.
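The sequence labeling step described above (per-word-unit labels o/pr, from which entity designation spans and their position indexes are derived) can be illustrated with the following hypothetical sketch; the label names follow the example above, and the decoding logic is only one plausible reading.

# Hypothetical decoding of the sequence labeling result: consecutive "pr" labels form an
# entity designation span with start/end position indexes.
def extract_designation_spans(labels):
    """labels: per-word-unit tags such as ["o", "pr", "pr", "o", ...]."""
    spans, start = [], None
    for i, tag in enumerate(labels):
        if tag == "pr" and start is None:
            start = i
        elif tag != "pr" and start is not None:
            spans.append((start, i - 1))
            start = None
    if start is not None:
        spans.append((start, len(labels) - 1))
    return spans

# For the example text labeled at the word-unit level, the first two word units ("Zhang"
# and "San") would be tagged "pr", giving the span (0, 1).
print(extract_designation_spans(["pr", "pr", "o", "o", "o", "o", "o", "o", "o", "o"]))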
Step 204, screening resume information related to the text to be linked from a pre-constructed knowledge graph based on the first encoding vector and the time information contained in the text to be linked.
In specific implementation, the resume information related to the text to be linked is screened out from the pre-constructed knowledge graph based on the first encoding vector and the time information contained in the text to be linked, which can be specifically realized in the following manner:
preprocessing the text to be linked to obtain time information contained in the text to be linked;
performing first information screening in a pre-constructed knowledge graph based on the entity designation to obtain intermediate screening information;
and carrying out second information screening on the intermediate screening information based on the time information to obtain the resume information.
Specifically, after the text to be linked is determined, the text to be linked may be preprocessed to obtain time information included in the text to be linked, and information screening may be performed in a pre-constructed knowledge graph according to the time information and entity designation marking information included in the first encoding vector to obtain resume information associated with the time information and the entity designation information.
The preprocessing may include performing semantic recognition on the text to be linked or extracting keywords from the text to be linked to obtain time information included in the text to be linked, and a specific processing manner may be determined according to actual requirements, which is not limited herein.
In practical applications of the entity linking method, most of the related entities are actual person names, and a large amount of resume information about these person names is stored in the knowledge graph; the personal basic information, educational background, work qualifications and the like in the resume information are all related to time. Therefore, after the entity designations and the time information contained in the text to be linked are determined, screening the resume information by combining the entity designations with the time information yields a more accurate screening result.
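The two-stage screening described in this step can be sketched as follows; the dictionary fields used for the knowledge graph entries ("name", "time", "text") are hypothetical and only illustrate the idea of filtering first by entity designation and then by time information.

# Hypothetical sketch of the two-stage screening: first filter the knowledge graph by the
# entity designation, then filter the intermediate result by the time information.
def screen_resume_information(knowledge_graph, designation, time_info):
    # First information screening: keep resume entries whose entity matches the designation.
    intermediate = [entry for entry in knowledge_graph if entry["name"] == designation]
    # Second information screening: keep entries related to the extracted time information.
    return [entry for entry in intermediate if time_info in entry.get("time", "")]

knowledge_graph = [
    {"name": "Zhang San", "time": "year Y", "text": "leader in city X, worked there for three years"},
    {"name": "Zhang San", "time": "year Z", "text": "a different person with the same name"},
]
print(screen_resume_information(knowledge_graph, "Zhang San", "year Y"))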
Step 206, inputting the resume information into a pre-trained vector coding model to obtain a second encoding vector output by the vector coding model, wherein the second encoding vector represents a candidate entity of the resume information.
Specifically, the vector coding model, i.e., the text matching model, may likewise be implemented by a pre-trained model (a BERT model); for convenience of description, the BERT model in the information labeling model is collectively referred to as the vector coding module in the embodiments of this specification.
In specific implementation, the resume information is input into a pre-trained vector coding model to obtain a second coding vector output by the vector coding model, which can be specifically realized by the following method:
inputting the resume information into the vector coding model, performing word segmentation processing on the resume information to obtain word units of the resume information, and performing pre-embedding processing on the word units to obtain word vectors, sentence vectors and position vectors corresponding to the word units;
adding the word vector, the sentence vector and the position vector corresponding to the word unit to generate the intermediate vector corresponding to the word unit;
and inputting the intermediate vector into a vector coding module of the information labeling model for coding to obtain the second coding vector.
Specifically, the specific implementation manner of the vector coding model for performing word segmentation processing, pre-embedding processing and coding process on the resume information is similar to the specific implementation manner of the information labeling model for processing the text to be linked, and is not repeated here.
Step 208, performing entity linking between the entity designation and the candidate entity when it is determined that the matching degree between the first encoding vector and the second encoding vector is greater than a preset matching degree threshold.
Specifically, after obtaining the resume information related to the text to be linked, the resume information may be input to a pre-trained vector coding model, and a second coding vector output by the vector coding model is obtained, where the second coding vector represents a candidate entity of the resume information.
Considering that entities contained in the pre-constructed knowledge graph may have duplicate names, the screening result obtained by screening resume information according to the entity designation and the time information may contain multiple pieces of resume information corresponding to the same entity designation. Therefore, after the first encoding vector of the text to be linked and the second encoding vectors corresponding to the resume information are obtained, the matching degree between the first encoding vector and each second encoding vector needs to be calculated; the entity contained in the resume information with the highest matching degree is taken as the candidate entity, and the candidate entity is entity-linked with the entity designation in the text to be linked.
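The selection logic described above can be illustrated with the following sketch, in which cosine similarity stands in for the matching degree computed by the model's text matching module; the threshold value and the similarity function are assumptions for illustration.

# Illustrative selection logic: keep the candidate with the highest matching degree and
# link only when it exceeds the preset matching degree threshold.
import torch
import torch.nn.functional as F

def pick_candidate(first_vec, second_vecs, threshold=0.9):
    scores = [F.cosine_similarity(first_vec, v, dim=-1).item() for v in second_vecs]
    best = max(range(len(scores)), key=lambda i: scores[i])
    return (best, scores[best]) if scores[best] > threshold else (None, scores[best])

first_vec = torch.randn(768)
second_vecs = [torch.randn(768) for _ in range(3)]
print(pick_candidate(first_vec, second_vecs))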
Further, after the second encoding vector output by the vector coding model is obtained, the process of performing entity linking between the entity designation and the candidate entity can be specifically realized in the following manner:
splicing the first encoding vector with the second encoding vector;
determining the matching degree between the query text and the resume description information according to the spliced coding vectors;
and performing entity linking between the entity designation and the candidate entity under the condition that the matching degree between the first encoding vector and the second encoding vector is determined to be greater than a preset matching degree threshold.
In addition, the matching degree between the first encoding vector and the second encoding vector can be calculated by a text matching module of a vector encoding model, and the method can be specifically realized by the following steps:
splicing the first encoding vector with the second encoding vector;
and inputting the encoding vector obtained by splicing into a text matching module of the vector coding model, where the text matching module calculates the matching degree for the encoding vector and outputs the calculation result.
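A minimal sketch of this splicing-and-matching step is given below, assuming the text matching module is a single classification layer with a sigmoid output over the spliced vector; the layer size and the output form are assumptions for illustration.

# Splice the two encoding vectors and pass the result through a text matching module.
import torch
import torch.nn as nn

class TextMatchingModule(nn.Module):
    def __init__(self, hidden_size=768):
        super().__init__()
        self.classifier = nn.Linear(2 * hidden_size, 1)

    def forward(self, first_vec, second_vec):
        spliced = torch.cat([first_vec, second_vec], dim=-1)   # splice the two encoding vectors
        return torch.sigmoid(self.classifier(spliced))         # matching degree in [0, 1]

matching_degree = TextMatchingModule()(torch.randn(768), torch.randn(768))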
Alternatively, it can also be realized by:
determining position vectors corresponding to the entity designations in the text to be linked;
fusing the position vector and the first encoding vector to obtain a fused encoding vector;
and under the condition that the matching degree between the fused encoding vector and the second encoding vector is determined to be larger than a preset matching degree threshold value, performing entity link on the entity designation and the candidate entity.
Specifically, after the first encoding vector output by the information labeling model is obtained, since the first encoding vector is used to represent the entity designations of the text to be linked, the entities that need to be linked and the position vector corresponding to each entity designation in the text to be linked can be determined. For example, if the text to be linked is "Li Bai is a song by Li xx", then "Li Bai" is an entity designation, and the position vector corresponding to "Li Bai" in the text to be linked is [0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]. After the position vector corresponding to each entity designation in the text to be linked is determined, the position vector is fused with the first encoding vector used to represent the entity designation, the encoding vector obtained by fusion is spliced with the second encoding vector, and the splicing result is input into the text matching model; the classification layer of the text matching model then executes a classification task on the fused encoding vector and the second encoding vector, so as to determine the matching degree between them.
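The fusion of the position vector with the first encoding vector is not spelled out in detail above; the following hypothetical sketch reads it as masked mean pooling over the word units of the entity designation, which is only one possible interpretation, not the one fixed by this application.

# Hypothetical fusion step: a 0/1 position vector marking the entity designation is used
# to pool the first encoding vector over the designation's word units.
import torch

def fuse_position(first_encoding, position_vector):
    # first_encoding: (seq_len, hidden_size); position_vector: (seq_len,) of 0/1 values.
    mask = position_vector.unsqueeze(-1).float()
    return (first_encoding * mask).sum(dim=0) / mask.sum().clamp(min=1.0)

position_vector = torch.tensor([0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0])  # marks "Li Bai"
fused_encoding_vector = fuse_position(torch.randn(12, 768), position_vector)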
In practical applications, the linking result can be displayed through an output interface. Alternatively, when it is determined that the matching degree between the fused encoding vector and the second encoding vector is less than or equal to the preset matching degree threshold, the linking result can be judged by manual intervention, that is, the resume information that is similar to or matches the information of the entity designation is judged manually; the resume information corresponding to the judgment result can then be updated, the entity contained in the updated resume information is taken as the candidate entity, the candidate entity is entity-linked with the entity designation in the text to be linked, and the linking result is displayed on the output interface. Alternatively, multiple candidate entities whose matching degree is lower than the preset threshold can be displayed on the output interface.
In practical applications, the lengths of the fused encoding vector and the second encoding vector may differ, so the fused encoding vector and the second encoding vector are spliced one after the other, and the specific front-back order is not particularly limited.
In addition, the model is constructed by freezing the shallow network layers of the BERT pre-trained model, retaining part of the deep network layers, and attaching the downstream NER and text matching tasks; this helps shorten the model training time, improves training efficiency, and can also improve the accuracy of the algorithm model to a certain extent.
In addition, when it is determined that the matching degree between the first encoding vector and the second encoding vector is lower than the preset threshold but higher than a standard value (for example, between 0.5 and 0.9, or between 0.5 and 0.8), the entity designation in the query text needs to be judged by manual intervention, that is, it is judged which resume information is similar to or matches the information of the entity designation; the resume information corresponding to the judgment result can be updated, the entity contained in the updated resume information is taken as the candidate entity, the candidate entity is entity-linked with the entity designation in the text to be linked, and an entity linking result is output.
In addition, the entity linking the entity designation with the candidate entity may be specifically implemented by:
entity screening is carried out on the candidate entities to obtain target candidate entities;
determining a corresponding target relation and attribute information of the target candidate entity according to the target candidate entity;
and performing entity link on the entity designation and the candidate entity based on the target relationship and the attribute information.
Specifically, in practical applications, the same word may have different meanings in different contexts, so entity screening is required. Entity screening may include, but is not limited to, entity disambiguation, entity normalization and coreference resolution. The purpose of entity disambiguation is to map the same word to different entities according to different contexts; for example, for "Li Bai" appearing in a context about songs, it may be determined to be the name of a song, while in a context about poetry, it may be determined to be the poet. Likewise, in practical applications, two different words may correspond to the same entity; for example, although "Beijing" and "the capital of the country" are literally two different expressions, they actually refer to the same entity, so an Entity Resolution (entity normalization) operation needs to be performed on the multiple candidate entities.
Coreference resolution (Co-reference Resolution) is also an important step in knowledge fusion. In the target information there are usually many pronouns such as "he", "it" and "they", and knowledge fusion also needs to determine the entity corresponding to each pronoun. For example, for the sentence "Zhang San is a leader in city X; he has worked in city X for three years and obtained a superior leader title in year Y", after coreference resolution is performed on "he", it is determined that "he" specifically refers to Zhang San.
The target candidate entity is determined by performing operations such as entity disambiguation, entity normalization and coreference resolution on the candidate entities; the target relation corresponding to the target candidate entity and the attribute information of the target candidate entity are then determined based on the target candidate entity; and given the target relation and the attribute information, entity linking is performed between the entity designation and the candidate entity, and the linking result is output.
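The entity normalization part of entity screening can be illustrated with the following toy sketch; the alias table and attribute store are hypothetical and merely show how different surface forms are mapped to one target candidate entity before its target relation and attribute information are looked up.

# Toy sketch of entity normalization within entity screening.
ALIASES = {"Beijing": "ENTITY_BEIJING", "capital of the country": "ENTITY_BEIJING"}
ATTRIBUTES = {"ENTITY_BEIJING": {"type": "city", "relation": "capital_of", "object": "China"}}

def screen_entities(candidate_mentions):
    targets = {ALIASES.get(m, m) for m in candidate_mentions}      # entity normalization
    return {t: ATTRIBUTES.get(t, {}) for t in targets}             # target relation / attributes

print(screen_entities(["Beijing", "capital of the country"]))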
In addition, a schematic diagram of an implementation process of the entity linking method provided by the embodiment of the present specification is shown in fig. 4: the Named Entity Recognition (NER) model in fig. 4 is used for entity recognition of query text, the text matching model is used for calculating the matching degree of two vectors, and both the named entity recognition model and the text matching model can be realized by a pre-training model (BERT model).
The labeled data is input into the Named Entity Recognition (NER) model and the text matching model for model training, and the two models construct a model framework in a joint training mode, where the NER model corresponds to a sequence labeling task and the text matching model corresponds to a text matching task; joint training and optimization are carried out with a multi-task loss.
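The joint training with a multi-task loss can be sketched as follows; the particular loss functions and the weighting coefficient are assumptions for illustration and are not fixed by this application.

# A minimal sketch of joint training: the sequence labeling (NER) loss and the text
# matching loss are combined into one multi-task loss and optimized together.
import torch
import torch.nn as nn

ner_loss_fn = nn.CrossEntropyLoss()        # sequence labeling task
match_loss_fn = nn.BCEWithLogitsLoss()     # text matching task

def multi_task_loss(ner_logits, ner_labels, match_logits, match_labels, alpha=0.5):
    ner_loss = ner_loss_fn(ner_logits.view(-1, ner_logits.size(-1)), ner_labels.view(-1))
    match_loss = match_loss_fn(match_logits.view(-1), match_labels.float().view(-1))
    return alpha * ner_loss + (1 - alpha) * match_loss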
In the application stage, a query text (the text to be linked) is input into the information labeling model, and a first encoding vector output by the information labeling model is obtained; the first encoding vector represents an entity designation (mention) of the text to be linked. Specifically, the vector coding module of the information labeling model into which the query text is input outputs an intermediate encoding vector corresponding to the query text, and the intermediate encoding vector is input into the classification layer, so that entity designation labeling is performed on the intermediate encoding vector and the first encoding vector carrying entity designation labeling information is output.
In addition, after the text to be linked is determined, the text to be linked can be preprocessed to obtain time information contained in the text to be linked, information screening is carried out in a pre-constructed knowledge graph according to the time information and entity designation marking information contained in the first coding vector, and resume information related to the time information and the entity designation information is obtained.
After obtaining the resume information related to the text to be linked, the resume information can be input into a vector coding model trained in advance, and a second coding vector output by the vector coding model is obtained, wherein the second coding vector represents a candidate entity of the resume information.
Considering that entities contained in the pre-constructed knowledge graph may have duplicate names, the screening result obtained by screening the resume information according to the entity designation and the time information may contain multiple pieces of resume information corresponding to the same entity designation. Therefore, after the first encoding vector of the text to be linked and the second encoding vectors corresponding to the resume information are obtained, the matching degree between the first encoding vector and each second encoding vector needs to be calculated; the entity contained in the resume information with the highest matching degree is taken as the candidate entity, and the candidate entity is entity-linked with the entity designation in the text to be linked.
In the embodiments of this specification, a text to be linked is input into a pre-trained information labeling model, and a first encoding vector output by the information labeling model is obtained, the first encoding vector characterizing an entity designation of the text to be linked; based on the first encoding vector and the time information contained in the text to be linked, information screening is performed in a pre-constructed knowledge graph to obtain resume information related to the text to be linked; the resume information is input into a pre-trained vector coding model to obtain a second encoding vector output by the vector coding model, the second encoding vector characterizing a candidate entity of the resume information; and under the condition that the matching degree between the first encoding vector and the second encoding vector is determined to be greater than a preset matching degree threshold, entity linking is performed between the entity designation and the candidate entity.
By adopting a joint training mode, the entity recognition and entity linking tasks are completed simultaneously; and by preprocessing the text to be linked to extract time information and screening the resume information in the knowledge graph with the time information and the entity labeling information, the accuracy of the screening result is improved.
Fig. 5 is a flowchart illustrating a processing procedure of an entity linking method according to an embodiment of the present application, including steps 502 to 522.
Step 502, inputting a text to be linked into an information labeling model.
Step 504, obtaining the information labeling model and outputting a first encoding vector, wherein the first encoding vector represents the entity name of the text to be linked.
Step 506, preprocessing the text to be linked to obtain the time information contained in the text to be linked.
And step 508, performing first information screening in a pre-constructed knowledge graph based on the entity designation to obtain intermediate screening information.
And 510, performing second information screening on the intermediate screening information based on the time information to obtain resume information.
Step 512, inputting the resume information into the vector coding model to obtain a second coding vector.
Specifically, the second encoding vector characterizes a candidate entity of the resume information.
Step 514, determining the position vector corresponding to each entity name in the text to be linked.
Step 516, fusing the position vector and the first encoding vector to obtain a fused encoding vector.
Step 518, vector splicing is performed on the fused encoded vector and the second encoded vector.
And step 520, determining the matching degree between the query text and the resume description information according to the spliced coding vectors.
Step 522, under the condition that the matching degree is determined to be larger than the preset matching degree threshold value, entity linking is carried out on the entity designation and the candidate entity, and a linking result is output.
The embodiments of this specification complete the entity recognition and entity linking tasks simultaneously by adopting a joint training mode, which helps improve linking efficiency and ensures the accuracy of the linking result; the text to be linked is preprocessed to extract time information, and the resume information in the knowledge graph is screened using the time information and the entity labeling information, so the accuracy of the screening result is high.
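To tie steps 502 to 522 together, the following end-to-end sketch uses trivial stand-ins for the models and modules described above; none of the helper functions are APIs defined by this application, and the random vectors merely make the sketch self-contained and runnable.

# End-to-end sketch of steps 502 to 522 with hypothetical stand-ins.
import torch
import torch.nn.functional as F

def label_text(text):                        # steps 502-504: information labeling model
    first_vec = torch.randn(len(text), 8)
    position_vec = torch.tensor([1, 1] + [0] * (len(text) - 2))
    return first_vec, "Zhang San", position_vec

def extract_time(text):                      # step 506: preprocessing for time information
    return "year Y" if "year Y" in text else ""

def encode_resume(resume):                   # step 512: vector coding model
    return torch.randn(8)

def link_entities(text, resumes, threshold=0.9):
    first_vec, designation, pos = label_text(text)
    time_info = extract_time(text)
    filtered = [r for r in resumes                                 # steps 508-510: screening
                if designation in r and (not time_info or time_info in r)]
    fused = (first_vec * pos.unsqueeze(-1)).sum(0) / pos.sum()     # steps 514-516: fusion
    links = []
    for resume in filtered:
        degree = F.cosine_similarity(fused, encode_resume(resume), dim=0).item()  # steps 518-520
        if degree > threshold:                                     # step 522
            links.append((designation, resume, degree))
    return links

print(link_entities("Zhang San obtained a superior leader title in year Y",
                    ["Zhang San, leader of city X, year Y"]))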
Corresponding to the above method embodiment, the present application further provides an embodiment of an entity linking device, and fig. 6 shows a schematic structural diagram of the entity linking device according to an embodiment of the present application. As shown in fig. 6, the apparatus 600 includes:
a first processing module 602, configured to input a text to be linked into a pre-trained information labeling model, and obtain a first encoding vector output by the information labeling model, where the first encoding vector represents an entity name of the text to be linked;
a screening module 604, configured to screen out resume information related to the text to be linked from a pre-constructed knowledge graph based on the first encoding vector and time information included in the text to be linked;
a second processing module 606 configured to input the resume information into a pre-trained vector coding model, and obtain a second coding vector output by the vector coding model, where the second coding vector represents a candidate entity of the resume information;
a linking module 608 configured to perform entity linking between the entity designation and the candidate entity if it is determined that the matching degree between the first encoding vector and the second encoding vector is greater than a preset matching degree threshold.
Optionally, the first processing module 602 includes:
the text processing submodule is configured to input a text to be linked into the information labeling model, perform word segmentation processing on the text to be linked to obtain a word unit of the text to be linked, and perform pre-embedding processing on the word unit to obtain a word vector, a sentence vector and a position vector corresponding to the word unit;
the vector to be linked generation submodule is configured to sum the word vector, the sentence vector and the position vector corresponding to the word unit to generate a vector to be linked corresponding to the word unit;
the coding submodule is configured to code the vector to be linked to obtain an intermediate coding vector;
and the marking submodule is configured to perform entity designation identification on the text to be linked, perform entity designation marking on the intermediate coding vector according to an identification result, and generate the first coding vector.
Optionally, the second processing module 606 includes:
the resume information processing submodule is configured to input the resume information into the vector coding model, perform word segmentation processing on the resume information to obtain word units of the resume information, and perform pre-embedding processing on the word units to obtain word vectors, sentence vectors and position vectors corresponding to the word units;
the intermediate vector generation submodule is configured to sum the word vector, the sentence vector and the position vector corresponding to the word unit to generate the intermediate vector corresponding to the word unit;
a second encoding vector generation submodule configured to input the intermediate vector into the vector encoding module of the information labeling model for encoding to obtain the second encoding vector.
Optionally, the screening module 604 includes:
the preprocessing submodule is configured to preprocess the text to be linked to obtain time information contained in the text to be linked;
a first screening submodule configured to perform first information screening in a pre-constructed knowledge graph based on the entity designation to obtain intermediate screening information;
and the second screening submodule is configured to perform second information screening on the intermediate screening information based on the time information to obtain the resume information.
Optionally, the entity linking apparatus further includes:
a splicing sub-module configured to splice the first encoded vector with the second encoded vector;
the determining submodule is configured to determine the matching degree between the query text and the resume description information according to the encoding vectors obtained by splicing;
and under the condition that the matching degree between the first encoding vector and the second encoding vector is determined to be larger than a preset matching degree threshold value, operating the linking module.
Optionally, the entity linking apparatus further includes:
a stitching module configured to stitch the first encoded vector with the second encoded vector;
and the matching degree calculation module is configured to input the encoding vectors obtained by splicing into the text matching module of the vector encoding model, and the text matching module performs matching degree calculation on the encoding vectors and outputs a calculation result.
Optionally, the link module 608 includes:
a target candidate entity determining submodule configured to perform entity screening on the candidate entities to obtain target candidate entities;
an information determination sub-module configured to determine a corresponding target relationship and attribute information of the target candidate entity according to the target candidate entity;
and the entity linking sub-module is configured to perform entity linking on the entity designation and the candidate entity based on the target relationship and the attribute information.
Optionally, the entity linking apparatus further includes:
the position vector determining module is configured to determine position vectors corresponding to the entity designations in the text to be linked;
a fusion module configured to fuse the position vector with the first encoding vector to obtain a fused encoding vector;
an entity linking module configured to perform entity linking on the entity designation and the candidate entity if it is determined that the degree of matching between the fused encoded vector and the second encoded vector is greater than a preset degree of matching threshold.
It should be noted that the components in the device claims should be understood as functional blocks which are necessary to implement the steps of the program flow or the steps of the method, and each functional block is not actually defined by functional division or separation. The device claims defined by such a set of functional modules are to be understood as a functional module framework for implementing the solution mainly by means of a computer program as described in the specification, and not as a physical device for implementing the solution mainly by means of hardware.
There is also provided in an embodiment of the present application a computing device comprising a memory, a processor, and computer instructions stored on the memory and executable on the processor, the processor implementing the steps of the entity linking method when executing the instructions.
An embodiment of the present application further provides a computer readable storage medium storing computer instructions, which when executed by a processor, implement the steps of the entity linking method as described above.
The above is an illustrative scheme of a computer-readable storage medium of the present embodiment. It should be noted that the technical solution of the storage medium belongs to the same concept as the technical solution of the entity linking method described above, and for details that are not described in detail in the technical solution of the storage medium, reference may be made to the description of the technical solution of the entity linking method described above.
The foregoing description of specific embodiments of the present application has been presented. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.
The computer instructions comprise computer program code which may be in the form of source code, object code, an executable file or some intermediate form, or the like. The computer-readable medium may include: any entity or device capable of carrying the computer program code, recording medium, usb disk, removable hard disk, magnetic disk, optical disk, computer Memory, Read-Only Memory (ROM), Random Access Memory (RAM), electrical carrier wave signals, telecommunications signals, software distribution medium, and the like. It should be noted that the computer readable medium may contain content that is subject to appropriate increase or decrease as required by legislation and patent practice in jurisdictions, for example, in some jurisdictions, computer readable media does not include electrical carrier signals and telecommunications signals as is required by legislation and patent practice.
It should be noted that, for the sake of simplicity, the above-mentioned method embodiments are described as a series of acts or combinations, but those skilled in the art should understand that the present application is not limited by the described order of acts, as some steps may be performed in other orders or simultaneously according to the present application. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required in this application.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
The preferred embodiments of the present application disclosed above are intended only to aid in the explanation of the application. Alternative embodiments are not exhaustive and do not limit the invention to the precise embodiments described. Obviously, many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles of the application and its practical applications, to thereby enable others skilled in the art to best understand and utilize the application. The application is limited only by the claims and their full scope and equivalents.

Claims (11)

1. An entity linking method, comprising:
inputting a text to be linked into a pre-trained information labeling model, and obtaining a first encoding vector output by the information labeling model, wherein the first encoding vector represents an entity designation of the text to be linked;
screening resume information related to the text to be linked from a pre-constructed knowledge graph based on the first encoding vector and time information contained in the text to be linked;
inputting the resume information into a pre-trained vector coding model to obtain a second encoding vector output by the vector coding model, wherein the second encoding vector represents a candidate entity of the resume information;
and performing entity linking on the entity designation and the candidate entity under the condition that the matching degree between the first encoding vector and the second encoding vector is determined to be greater than a preset matching degree threshold.
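To make the flow of claim 1 concrete, the following is a minimal, runnable sketch of the claimed pipeline. The toy character-bigram encoder, the knowledge-graph layout, and all names (toy_encode, extract_time, screen_resumes, link) are illustrative assumptions rather than the pre-trained information labeling model and vector coding model described in the application.

```python
import re
import numpy as np

def toy_encode(text, dim=256):
    """Stand-in for a pre-trained encoder: character-bigram hashing embedding."""
    v = np.zeros(dim)
    for i in range(len(text) - 1):
        v[hash(text[i:i + 2]) % dim] += 1.0
    n = np.linalg.norm(v)
    return v / n if n else v

def extract_time(text):
    """Treat a four-digit year found in the text as its time information."""
    m = re.search(r"(19|20)\d{2}", text)
    return m.group(0) if m else None

def screen_resumes(kg, mention, year):
    """First screen by entity designation, then by time information."""
    hits = [e for e in kg if mention in e["entity"]]
    return [e for e in hits if year is None or year in e["resume"]]

def link(text, mention, kg, threshold=0.3):
    first_vec = toy_encode(mention)                    # first encoding vector
    year = extract_time(text)
    best = (None, 0.0)
    for entry in screen_resumes(kg, mention, year):
        second_vec = toy_encode(entry["resume"])       # second encoding vector
        score = float(first_vec @ second_vec)          # matching degree
        if score > threshold and score > best[1]:
            best = (entry["entity"], score)
    return best

kg = [
    {"entity": "Zhang San (physicist)", "resume": "Zhang San joined the institute in 2018"},
    {"entity": "Zhang San (actor)", "resume": "Zhang San made his film debut in 2005"},
]
print(link("In 2018 Zhang San published a paper on optics", "Zhang San", kg))
```

The threshold value and the similarity function are placeholders; the claims only fix the overall order of labeling, screening, encoding and threshold comparison.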
2. The entity linking method of claim 1, wherein the inputting the text to be linked into a pre-trained information labeling model to obtain the first encoding vector output by the information labeling model comprises:
inputting the text to be linked into the information labeling model, performing word segmentation processing on the text to be linked to obtain word units of the text to be linked, and performing pre-embedding processing on the word units to obtain word vectors, sentence vectors and position vectors corresponding to the word units;
adding the word vector, the sentence vector and the position vector corresponding to each word unit to generate a vector to be linked corresponding to the word unit;
encoding, by a vector coding module of the information labeling model, the vector to be linked to obtain an intermediate encoding vector;
and performing, by a named entity labeling module of the information labeling model, entity designation recognition on the text to be linked, and performing entity designation labeling on the intermediate encoding vector according to a recognition result to generate the first encoding vector.
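As an illustration of the pre-embedding step in claim 2, the sketch below sums a word vector, a sentence vector and a position vector for each word unit, then passes the result through a stand-in vector coding module and a stand-in named entity labeling module. The toy vocabulary, the single linear layer and the fixed BIO tags are assumptions made only for this example.

```python
import numpy as np

vocab = {"[PAD]": 0, "in": 1, "2018": 2, "zhang": 3, "san": 4, "published": 5}
dim = 16
rng = np.random.default_rng(0)
word_emb = rng.standard_normal((len(vocab), dim))   # word (token) vectors
sent_emb = rng.standard_normal((2, dim))            # sentence (segment) vectors
pos_emb = rng.standard_normal((32, dim))            # position vectors, max length 32

tokens = ["in", "2018", "zhang", "san", "published"]  # word units after segmentation
ids = [vocab[t] for t in tokens]

# pre-embedding: word vector + sentence vector + position vector -> vector to be linked
to_link = word_emb[ids] + sent_emb[0] + pos_emb[: len(ids)]

# stand-in for the vector coding module: one linear layer with a nonlinearity
W = rng.standard_normal((dim, dim)) / np.sqrt(dim)
intermediate = np.tanh(to_link @ W)                 # intermediate encoding vectors

# stand-in for the named entity labeling module: BIO tags marking the designation span
bio_tags = ["O", "O", "B-PER", "I-PER", "O"]
first_encoding = list(zip(tokens, bio_tags, intermediate))
print([(tok, tag) for tok, tag, _ in first_encoding])
```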
3. The entity linking method of claim 1, wherein the inputting the resume information into a pre-trained vector coding model to obtain a second encoding vector output by the vector coding model comprises:
inputting the resume information into the vector coding model, performing word segmentation processing on the resume information to obtain word units of the resume information, and performing pre-embedding processing on the word units to obtain word vectors, sentence vectors and position vectors corresponding to the word units;
adding the word vector, the sentence vector and the position vector corresponding to each word unit to generate an intermediate vector corresponding to the word unit;
and inputting the intermediate vector into a vector coding module of the vector coding model for encoding to obtain the second encoding vector.
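Claim 3 applies the same three-way embedding sum to the resume information. The only extra step assumed in the short sketch below is mean pooling of the encoded word units into a single candidate-entity vector, since the claim does not spell out how the second encoding vector is aggregated.

```python
import numpy as np

rng = np.random.default_rng(1)
dim, seq_len = 16, 8
# word + sentence + position vector sums for the resume's word units
resume_vectors = rng.standard_normal((seq_len, dim))
W = rng.standard_normal((dim, dim)) / np.sqrt(dim)   # stand-in vector coding module

encoded = np.tanh(resume_vectors @ W)
second_encoding = encoded.mean(axis=0)   # one vector representing the candidate entity
print(second_encoding.shape)             # (16,)
```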
4. The entity linking method according to claim 1, wherein the screening out resume information related to the text to be linked from a pre-constructed knowledge graph based on the first encoding vector and time information contained in the text to be linked comprises:
preprocessing the text to be linked to obtain time information contained in the text to be linked;
performing first information screening in the pre-constructed knowledge graph based on the entity designation to obtain intermediate screening information;
and carrying out second information screening on the intermediate screening information based on the time information to obtain the resume information.
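The two-stage screening of claim 4 can be pictured as in the sketch below: a first pass keeps knowledge-graph entries whose name matches the entity designation, and a second pass keeps only resume items consistent with the extracted time information. The graph layout and the exact-year comparison are assumptions for illustration only.

```python
knowledge_graph = {
    "Li Si": [
        {"entity": "Li Si (coach)",
         "resume": [("2012", "appointed head coach"), ("2019", "retired")]},
        {"entity": "Li Si (author)",
         "resume": [("2016", "published first novel")]},
    ],
}

def screen(mention, year):
    # first information screening: match the entity designation
    intermediate = knowledge_graph.get(mention, [])
    resume_info = []
    # second information screening: keep resume items matching the time information
    for cand in intermediate:
        items = [(y, event) for y, event in cand["resume"] if y == year]
        if items:
            resume_info.append({"entity": cand["entity"], "resume": items})
    return resume_info

print(screen("Li Si", "2016"))
```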
5. The entity linking method according to claim 4, wherein after obtaining the second encoding vector output by the vector coding model, the method further comprises:
concatenating the first encoding vector with the second encoding vector;
determining the matching degree between the text to be linked and the resume information according to the concatenated encoding vector;
and performing entity linking on the entity designation and the candidate entity under the condition that the matching degree between the first encoding vector and the second encoding vector is determined to be greater than the preset matching degree threshold.
6. The entity linking method according to claim 3, wherein after obtaining the second encoding vector output by the vector coding model, the method further comprises:
concatenating the first encoding vector with the second encoding vector;
and inputting the concatenated encoding vector into a text matching module of the vector coding model, and calculating, by the text matching module, the matching degree and outputting a calculation result.
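One way to read the concatenation-and-matching step of claims 5 and 6 is sketched below: the two encoding vectors are concatenated and a small text matching module maps the pair to a matching degree in (0, 1). The two-layer network with a sigmoid output is an assumed stand-in; the claims only require that the module produce a result comparable with the preset threshold.

```python
import numpy as np

rng = np.random.default_rng(2)
dim = 16
first_vec = rng.standard_normal(dim)     # encodes the entity designation
second_vec = rng.standard_normal(dim)    # encodes the candidate entity

concatenated = np.concatenate([first_vec, second_vec])    # shape (32,)

# assumed text matching module: two layers ending in a sigmoid
W1 = rng.standard_normal((2 * dim, dim)) / np.sqrt(2 * dim)
w2 = rng.standard_normal(dim) / np.sqrt(dim)
hidden = np.tanh(concatenated @ W1)
matching_degree = 1.0 / (1.0 + np.exp(-(hidden @ w2)))    # value in (0, 1)

threshold = 0.5
print(round(float(matching_degree), 3), matching_degree > threshold)
```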
7. The entity linking method of claim 1, wherein the performing entity linking on the entity designation and the candidate entity comprises:
screening the candidate entities to obtain a target candidate entity;
determining a target relation and attribute information corresponding to the target candidate entity;
and performing entity linking on the entity designation and the target candidate entity based on the target relation and the attribute information.
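The sketch below illustrates claim 7 under the assumption that screening the candidate entities means keeping the highest-scoring one, and that the target relation and attribute information are simple key-value lookups in the knowledge graph; the scores, relations and attributes are made up for the example.

```python
candidates = [
    {"entity": "Wang Wu (singer)",  "score": 0.42},
    {"entity": "Wang Wu (painter)", "score": 0.81},
]
graph = {
    "Wang Wu (painter)": {
        "relations":  {"member_of": "National Art Association"},
        "attributes": {"birth_year": "1970"},
    },
    "Wang Wu (singer)": {"relations": {}, "attributes": {}},
}

target = max(candidates, key=lambda c: c["score"])   # screen the candidate entities
info = graph[target["entity"]]                        # target relation and attribute info
link_record = {
    "mention": "Wang Wu",
    "entity": target["entity"],
    "relations": info["relations"],
    "attributes": info["attributes"],
}
print(link_record)
```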
8. The entity linking method according to claim 1, further comprising:
determining a position vector corresponding to the entity designation in the text to be linked;
fusing the position vector and the first encoding vector to obtain a fused encoding vector;
and performing entity linking on the entity designation and the candidate entity under the condition that the matching degree between the fused encoding vector and the second encoding vector is determined to be greater than a preset matching degree threshold.
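Claim 8 fuses a position vector for the entity designation with the first encoding vector before matching. The sketch below reads fusing as element-wise addition of a pooled span position embedding; concatenation followed by a projection would be an equally plausible reading, and the span indices and dimensions are assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
dim, max_len = 16, 32
first_vec = rng.standard_normal(dim)      # first encoding vector for the designation

start, end = 3, 5                         # the mention is assumed to span tokens 3..4
pos_emb = rng.standard_normal((max_len, dim))
position_vec = pos_emb[start:end].mean(axis=0)   # one position vector for the span

fused = first_vec + position_vec          # fused encoding vector (fusion by addition)

second_vec = rng.standard_normal(dim)
matching_degree = float(
    fused @ second_vec / (np.linalg.norm(fused) * np.linalg.norm(second_vec))
)
print(round(matching_degree, 3))
```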
9. An entity linking apparatus, comprising:
a first processing module configured to input a text to be linked into a pre-trained information labeling model, and obtain a first encoding vector output by the information labeling model, wherein the first encoding vector represents an entity designation of the text to be linked;
a screening module configured to screen resume information related to the text to be linked from a pre-constructed knowledge graph based on the first encoding vector and time information contained in the text to be linked;
a second processing module configured to input the resume information into a pre-trained vector coding model, and obtain a second encoding vector output by the vector coding model, wherein the second encoding vector represents a candidate entity of the resume information;
and a linking module configured to perform entity linking on the entity designation and the candidate entity under the condition that the matching degree between the first encoding vector and the second encoding vector is determined to be greater than a preset matching degree threshold.
10. A computing device comprising a memory, a processor, and computer instructions stored on the memory and executable on the processor, wherein the processor implements the steps of the method of any one of claims 1-8 when executing the instructions.
11. A computer-readable storage medium storing computer instructions, which when executed by a processor, perform the steps of the method of any one of claims 1 to 8.
CN202010974083.XA 2020-09-16 2020-09-16 Entity linking method and device Pending CN114266245A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010974083.XA CN114266245A (en) 2020-09-16 2020-09-16 Entity linking method and device

Publications (1)

Publication Number Publication Date
CN114266245A true CN114266245A (en) 2022-04-01

Family

ID=80824288

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010974083.XA Pending CN114266245A (en) 2020-09-16 2020-09-16 Entity linking method and device

Country Status (1)

Country Link
CN (1) CN114266245A (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140372102A1 (en) * 2013-06-18 2014-12-18 Xerox Corporation Combining temporal processing and textual entailment to detect temporally anchored events
US20190205301A1 * 2016-10-10 2019-07-04 Microsoft Technology Licensing, Llc Combo of Language Understanding and Information Retrieval
CN110991187A (en) * 2019-12-05 2020-04-10 北京奇艺世纪科技有限公司 Entity linking method, device, electronic equipment and medium
CN111415748A (en) * 2020-02-18 2020-07-14 云知声智能科技股份有限公司 Entity linking method and device
CN111666427A (en) * 2020-06-12 2020-09-15 长沙理工大学 Entity relationship joint extraction method, device, equipment and medium

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116360752A (en) * 2023-06-02 2023-06-30 钱塘科技创新中心 Function programming method oriented to java, intelligent terminal and storage medium
CN116360752B (en) * 2023-06-02 2023-08-22 钱塘科技创新中心 Function programming method oriented to java, intelligent terminal and storage medium
CN116562303A (en) * 2023-07-04 2023-08-08 之江实验室 Reference resolution method and device for reference external knowledge
CN116562303B (en) * 2023-07-04 2023-11-21 之江实验室 Reference resolution method and device for reference external knowledge

Similar Documents

Publication Publication Date Title
US11922121B2 (en) Method and apparatus for information extraction, electronic device, and storage medium
CN111090987B (en) Method and apparatus for outputting information
CN111444340A (en) Text classification and recommendation method, device, equipment and storage medium
CN111368514B (en) Model training and ancient poem generating method, ancient poem generating device, equipment and medium
CN113127624B (en) Question-answer model training method and device
WO2023151314A1 (en) Protein conformation-aware representation learning method based on pre-trained language model
CN110852106A (en) Named entity processing method and device based on artificial intelligence and electronic equipment
JP7417679B2 (en) Information extraction methods, devices, electronic devices and storage media
CN114564593A (en) Completion method and device of multi-mode knowledge graph and electronic equipment
CN110807197A (en) Training method and device for recognition model and risk website recognition method and device
CN114266245A (en) Entity linking method and device
CN111767394A (en) Abstract extraction method and device based on artificial intelligence expert system
CN114495129A (en) Character detection model pre-training method and device
CN110795934B (en) Sentence analysis model training method and device and sentence analysis method and device
CN115269913A (en) Video retrieval method based on attention fragment prompt
CN114936565A (en) Method and device for extracting subject information
CN116975350A (en) Image-text retrieval method, device, equipment and storage medium
CN114138969A (en) Text processing method and device
CN114282555A (en) Translation model training method and device, and translation method and device
CN114120342A (en) Resume document identification method and device, computing device and storage medium
CN114077655A (en) Method and device for training answer extraction model
CN112328777B (en) Answer detection method and device
CN115757723A (en) Text processing method and device
CN115481246A (en) Text detection model training method and device
CN114647717A (en) Intelligent question and answer method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination