CN118114677A - Automatic labeling optimization method and system for entity identification based on dense retrieval


Info

Publication number
CN118114677A
Authority
CN
China
Prior art keywords
entity
dense
candidate
sample set
automatic labeling
Prior art date
Legal status
Pending
Application number
CN202410537686.1A
Other languages
Chinese (zh)
Inventor
李海
余金波
奉秋林
Current Assignee
Hangzhou Sirui Information Technology Co ltd
Original Assignee
Hangzhou Sirui Information Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Hangzhou Sirui Information Technology Co ltd filed Critical Hangzhou Sirui Information Technology Co ltd
Priority to CN202410537686.1A priority Critical patent/CN118114677A/en
Publication of CN118114677A publication Critical patent/CN118114677A/en
Pending legal-status Critical Current

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application relates to a method and a system for automatic labeling optimization in entity identification based on dense retrieval, wherein the method comprises the following steps: automatically labeling the data of an unlabeled corpus based on an entity knowledge base to obtain an automatic labeling sample set; generating a training sample set for training an entity dense coding model based on the sample relations in the automatic labeling sample set; obtaining entity dense vectors of the entities in the automatic labeling sample set and candidate entity dense vectors of the candidate entities in the unlabeled corpus through the trained entity dense coding model; and searching out the entity dense vectors similar to each candidate entity dense vector, and performing labeling optimization on the labeled candidate entities in the automatic labeling sample set based on the search results. The application creates a vector space through the entity dense coding model, thereby optimizing the quality of the automatic labeling samples and solving the problem of how to improve the automatic labeling quality of data sets.

Description

Automatic labeling optimization method and system for entity identification based on dense retrieval
Technical Field
The application relates to the technical field of natural language processing, in particular to an automatic labeling optimization method and system for entity identification based on dense retrieval.
Background
Named Entity Recognition (NER) aims at recognizing semantic entities with specific meanings from text and is the basis of Knowledge Graph construction and of numerous applications in natural-language-related fields. Traditional named entity recognition methods are based on supervised learning; existing supervised methods rely on manually annotated data sets, and constructing a manually annotated data set for the entities of every domain requires a great deal of time and resources.
Remote supervision (Distant Supervision) is a reliable way to avoid manual labeling. Remote supervision provides supervision by utilizing existing facts in a knowledge base and automatically generates a labeled data set by matching entity names between the knowledge base and the corpus. However, for lack of constraints in the process of generating the automatically labeled data set, the remote supervision method mislabels data, so the labeled data set contains a large amount of noise. Using such noisy data directly for training compromises the performance of the model.
At present, no effective solution has been proposed in the related art for the problem of how to improve the automatic labeling quality of data sets.
Disclosure of Invention
The embodiment of the application provides an entity identification automatic labeling optimization method and system based on dense retrieval, which at least solve the problem of how to improve the automatic labeling quality of a data set in the related technology.
In a first aspect, an embodiment of the present application provides a method for optimizing automatic labeling for entity identification based on dense retrieval, where the method includes:
Based on the entity knowledge base, automatically labeling the data of the unlabeled corpus to obtain an automatic labeling sample set;
generating a training sample set for training the entity dense coding model based on the sample relation in the automatic labeling sample set;
Obtaining entity dense vectors of entities in the automatic labeling sample set and candidate entity dense vectors of candidate entities in the unlabeled corpus through the trained entity dense coding model;
And searching out entity dense vectors similar to the entity dense vectors of the candidate entities, and performing labeling optimization on the labeled candidate entities in the automatic labeling sample set based on the searching result.
In some of these embodiments, generating a training sample set for entity-dense coding model training based on sample relationships in the automatically labeled sample set comprises:
Clustering the samples (e, u, s) in the automatic labeling sample set by the entity e to obtain an automatic labeling sample subset {(u_1, s_1), (u_1, s_2), …, (u_{i-1}, s_k), (u_i, s_k)}, wherein e represents an entity, u represents the candidate entity corresponding to the entity e, s represents the sentence in which the candidate entity u exists, and (u_i, s_k) represents that the candidate entity u_i exists in sentence s_k;
and generating a training sample set for training the entity dense coding model based on the automatic labeling sample subset, wherein the training sample set comprises candidate entity positive example samples and candidate entity negative example samples.
In some of these embodiments, generating candidate entity positive examples samples in a training sample set for entity dense coding model training based on the automatically labeled sample subset comprises:
calculating the similarity between candidate entities in the automatic labeling sample subset;
And matching based on the similarity to generate the candidate entity positive example samples (x_i, y_i^+) in the training sample set for training the entity dense coding model, wherein x_i represents the ith candidate entity and the sentence to which it belongs, and y_i^+ represents a candidate entity similar to x_i and the sentence to which that candidate entity belongs.
In some of these embodiments, generating candidate entity counterexample samples in a training sample set for entity dense coding model training based on the automatically labeled sample subset comprises:
screening out the entities which do not participate in the automatic labeling from the entity knowledge base;
And combining the entity with elements in the automatic labeling sample subset to generate candidate entity counterexample samples in a training sample set for training the entity dense coding model.
In some embodiments, searching for entity dense vectors similar to each candidate entity dense vector, and labeling the labeled candidate entities in the automatically labeled sample set based on the search results includes:
searching out entity dense vectors similar to the candidate entity dense vectors through a dense vector searching tool;
Filtering out the entity dense vectors whose similarity with the candidate entity dense vector is lower than a preset threshold value to obtain filtered entity dense vectors;
Based on the filtered entity dense vector, determining the category of the candidate entity corresponding to the candidate entity dense vector through a k-nearest neighbor algorithm;
And labeling and optimizing the labeled candidate entities in the automatic labeling sample set based on the category of the candidate entity.
In some embodiments, before obtaining, from the trained entity-dense coding model, entity-dense vectors for entities in the automatically labeled sample set and candidate entity-dense vectors for candidate entities in the unlabeled corpus, the method includes:
constructing an entity dense coding model based on a Transformer neural network structure;
And training the entity dense coding model based on the training sample set to obtain a trained entity dense coding model.
In some of these embodiments, constructing the entity dense coding model based on the Transformer neural network structure includes:
Constructing an entity encoder in an entity dense coding model based on a Transformer neural network structure, wherein the entity encoder comprises an input layer, a coding layer and an output layer, and the entity encoder is used for coding the entities in the automatic labeling sample set to obtain entity dense vectors;
And constructing a candidate entity encoder in an entity dense coding model based on a Transformer neural network structure, wherein the candidate entity encoder comprises an input layer, a coding layer and an output layer, and the candidate entity encoder is used for encoding the candidate entity in the unlabeled corpus to obtain a candidate entity dense vector.
In some embodiments, training the entity dense coding model based on the training sample set to obtain a trained entity dense coding model comprises:
Constructing a loss function of the entity dense coding model based on the training sample set:
L(x_i, y_i^+, y_{i,1}^-, …, y_{i,n}^-) = −log [ exp(sim(x_i, y_i^+)) / ( exp(sim(x_i, y_i^+)) + Σ_{j=1}^n exp(sim(x_i, y_{i,j}^-)) ) ]
wherein the training sample set comprises candidate entity positive example samples and candidate entity negative example samples, i denotes the ith training sample in the training sample set, (x_i, y_i^+) denotes the ith candidate entity positive example sample, y_{i,j}^- denotes the jth element of the ith candidate entity negative example sample, and sim() denotes the similarity calculation function;
And training and updating the parameters of the entity dense coding model through the loss function, so that the entity dense vectors output by the entity dense coding model are more similar to related candidate entity dense vectors and less similar to unrelated candidate entity dense vectors, to obtain the trained entity dense coding model.
In some embodiments, automatically labeling the data of the unlabeled corpus based on the entity knowledge base, and obtaining the automatically labeled sample set includes:
extracting sentence sets from the unlabeled corpus;
and automatically labeling the sentence set through character string matching based on the entity knowledge base, and generating an automatic labeling sample set of the candidate entity.
In a second aspect, the embodiment of the application provides an entity identification automatic labeling optimization system based on dense retrieval, which comprises an automatic labeling module, a sample construction module, a vector generation module and a labeling optimization module;
the automatic labeling module is used for automatically labeling the data of the unlabeled corpus according to the entity knowledge base to obtain an automatic labeling sample set;
the sample construction module is used for generating a training sample set for training the entity dense coding model according to the sample relations in the automatic labeling sample set;
the vector generation module is used for obtaining entity dense vectors of the entities in the automatic labeling sample set and candidate entity dense vectors of the candidate entities in the unlabeled corpus through the trained entity dense coding model;
And the labeling optimization module is used for searching out entity dense vectors similar to the entity dense vectors of each candidate entity, and labeling and optimizing the labeled candidate entities in the automatic labeling sample set based on the searching result.
Compared with the related art, in the dense-retrieval-based automatic labeling optimization method and system for entity identification provided by the embodiments of the application, the method automatically labels the data of an unlabeled corpus based on an entity knowledge base to obtain an automatic labeling sample set; generates a training sample set for training an entity dense coding model based on the sample relations in the automatic labeling sample set; obtains entity dense vectors of the entities in the automatic labeling sample set and candidate entity dense vectors of the candidate entities in the unlabeled corpus through the trained entity dense coding model; searches out the entity dense vectors similar to each candidate entity dense vector; and performs labeling optimization on the labeled candidate entities in the automatic labeling sample set based on the search results. A vector space is created through the entity dense coding model such that, in this space, the distance between a known entity and related candidate entities is smaller and the distance between the known entity and unrelated candidate entities is larger, thereby optimizing the quality of the automatic labeling samples and solving the problem of how to improve the automatic labeling quality of data sets.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute a limitation on the application. In the drawings:
FIG. 1 is a flow chart of steps of an automatic annotation optimization method for dense retrieval-based entity identification, according to an embodiment of the present application;
FIG. 2 is a schematic diagram of the structure of the entity dense coding model according to an embodiment of the application;
FIG. 3 is a schematic diagram of the structure of the encoder coding layer in the dense retrieval model according to an embodiment of the present application;
FIG. 4 is a block diagram of an entity identification automatic annotation optimization system based on dense retrieval according to an embodiment of the application;
fig. 5 is a schematic view of an internal structure of an electronic device according to an embodiment of the present application.
Detailed Description
The present application will be described and illustrated with reference to the accompanying drawings and examples in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application. All other embodiments, which can be made by a person of ordinary skill in the art based on the embodiments provided by the present application without making any inventive effort, are intended to fall within the scope of the present application.
It is apparent that the drawings in the following description are only some examples or embodiments of the present application, and it is possible for those of ordinary skill in the art to apply the present application to other similar situations according to these drawings without inventive effort. Moreover, it should be appreciated that while such a development effort might be complex and lengthy, it would nevertheless be a routine undertaking of design, fabrication, or manufacture for those of ordinary skill having the benefit of this disclosure.
Reference in the specification to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the application. The appearances of such phrases in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is to be expressly and implicitly understood by those of ordinary skill in the art that the described embodiments of the application can be combined with other embodiments without conflict.
Unless defined otherwise, technical or scientific terms used herein should be given the ordinary meaning as understood by one of ordinary skill in the art to which this application belongs. The terms "a," "an," "the," and similar referents in the context of the application are not to be construed as limiting the quantity, but rather as singular or plural. The terms "comprising," "including," "having," and any variations thereof, are intended to cover a non-exclusive inclusion; for example, a process, method, system, article, or apparatus that comprises a list of steps or modules (elements) is not limited to only those steps or elements but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus. The terms "connected," "coupled," and the like in connection with the present application are not limited to physical or mechanical connections, but may include electrical connections, whether direct or indirect. The term "plurality" as used herein means two or more. "and/or" describes an association relationship of an association object, meaning that there may be three relationships, e.g., "a and/or B" may mean: a exists alone, A and B exist together, and B exists alone. The character "/" generally indicates that the context-dependent object is an "or" relationship. The terms "first," "second," "third," and the like, as used herein, are merely distinguishing between similar objects and not representing a particular ordering of objects.
The embodiment of the application provides an entity identification automatic labeling optimization method based on dense retrieval, and fig. 1 is a step flow chart of the entity identification automatic labeling optimization method based on dense retrieval, as shown in fig. 1, and the method comprises the following steps:
step S102, automatically labeling the data of the unlabeled corpus based on the entity knowledge base to obtain an automatically labeled sample set;
In step S102, specifically, a sentence set is extracted from the unlabeled corpus; based on the entity knowledge base, the sentence set is automatically labeled through character string matching to generate an automatic labeling sample set of candidate entities.
In step S102, preferably, according to the remote supervision assumption, if an entity in the entity knowledge base is mentioned in a sentence of the unlabeled corpus, the corresponding candidate entity in that sentence is labeled as one sample. Whether an entity e is mentioned in a sentence s can be determined by string matching between each entity e ∈ K in the knowledge base and each sentence s ∈ S_C in the corpus, where string matching is divided into complete matching and inclusion matching.
Remote supervision based on complete string matching. X_eql denotes the automatic labeling sample set obtained by complete string matching, and (e, u, s) ∈ X_eql denotes an element of the set, where e represents an entity, u represents the candidate entity corresponding to entity e, and s represents the sentence in which candidate entity u exists, i.e., entity e is mentioned by candidate entity u in sentence s. The specific construction process is as follows:
(1) Use a natural language processing tool to perform word segmentation and part-of-speech tagging on the sentences of the unlabeled corpus, select continuous noun sequences and adjective-modified noun sequences, and generate a fragment set C_s;
(2) Generate all possible substrings of each fragment c ∈ C_s and filter out substrings shorter than 3 characters to obtain a phrase set U_c; all possible phrases in a sentence form the candidate entity set U_s; match the name of each candidate entity u ∈ U_s against the entities in the entity knowledge base to obtain the complete-matching automatic labeling sample set X_eql.
Remote supervision based on string inclusion matching. X_incl denotes the automatic labeling sample set obtained by string inclusion matching, and (e, u, s) ∈ X_incl denotes an element of the set, where e represents an entity, u represents the candidate entity corresponding to entity e, and s represents the sentence in which candidate entity u exists, i.e., a superstring of entity e is mentioned by candidate entity u in sentence s, where u ≠ e. The specific construction process is as follows:
(1) Use a natural language processing tool to perform word segmentation and part-of-speech tagging on the sentences of the unlabeled corpus, select continuous noun sequences and adjective-modified noun sequences, and generate a fragment set C_s;
(2) Generate all possible substrings of each fragment c ∈ C_s and filter out substrings shorter than 3 characters to obtain a phrase set U_c; all possible phrases in a sentence form the candidate entity set U_s; match the name of each candidate entity u ∈ U_s against the entities in the entity knowledge base, and whenever a match exists, find the superstrings of u in U_c to obtain the inclusion-matching automatic labeling sample set X_incl. A code sketch of both matching modes is given below.
Step S104, generating a training sample set for training an entity dense coding model based on sample relations in the automatic labeling sample set, wherein the training sample set comprises candidate entity positive example samples and candidate entity negative example samples;
Step S104 specifically includes the steps of:
Step S1041, clustering the samples (e, u, s) in the automatic labeling sample set by the entity e to obtain an automatic labeling sample subset {(u_1, s_1), (u_1, s_2), …, (u_{i-1}, s_k), (u_i, s_k)}, wherein e represents an entity, u represents the candidate entity corresponding to the entity e, s represents the sentence in which the candidate entity u exists, and (u_i, s_k) represents that the candidate entity u_i exists in sentence s_k;
Step S1042, based on the automatic labeling sample subset, generating candidate entity positive samples in a training sample set for training an entity dense coding model;
Step S1042 specifically, calculating the similarity between candidate entities in the automatic labeling sample subset;
And matching based on the similarity to generate the candidate entity positive example samples (x_i, y_i^+) in the training sample set for training the entity dense coding model, wherein x_i represents the ith candidate entity and the sentence to which it belongs, and y_i^+ represents a candidate entity similar to x_i and the sentence to which that candidate entity belongs.
It should be noted that there are two types of candidate entity positive example samples: in the first, the two candidate entities in the sample are identical but their sentences differ (used to contrastively capture the semantic features of a candidate entity in different contexts); in the second, both the candidate entities and the sentences differ, but the entity type is the same (used to contrastively capture the semantic features of similar entities).
Step S1043, generating candidate entity counterexample samples in a training sample set for training an entity dense coding model based on the automatic labeling sample subset;
Step S1043, specifically, screening out entities which do not participate in automatic labeling from the entity knowledge base;
And combining the entity with elements in the automatic labeling sample subset to generate candidate entity counterexample samples in a training sample set for training the entity dense coding model.
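As an illustrative Python sketch of steps S1041 to S1043 (not part of the patent disclosure), training-pair construction might look like this; the similarity function, the threshold, and random negative sampling are assumptions.

```python
# Hedged sketch of positive/negative training-pair construction.
import random

def build_pairs(subset_by_entity, unused_kb_entities, similarity, threshold=0.8):
    """subset_by_entity: {entity e: [(u, s), ...]} as clustered in step S1041.
    unused_kb_entities: KB entities that never participated in auto-labeling."""
    positives, negatives = [], []
    for e, mentions in subset_by_entity.items():
        for i, (u_i, s_i) in enumerate(mentions):
            for u_j, s_j in mentions[i + 1:]:
                # Type 1 positive: identical candidates, different sentences.
                # Type 2 positive: different candidates of the same entity,
                # kept only when they are sufficiently similar.
                if u_i == u_j or similarity(u_i, u_j) >= threshold:
                    positives.append(((u_i, s_i), (u_j, s_j)))
            # Negative: combine the element with an entity that did not
            # participate in automatic labeling (step S1043).
            negatives.append(((u_i, s_i), random.choice(unused_kb_entities)))
    return positives, negatives
```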
After step S104, the method further comprises step S105, wherein step S105 comprises the steps of:
Step S1051, constructing an entity dense coding model based on a Transformer neural network structure;
In step S1051, specifically, fig. 2 is a schematic structural diagram of an entity dense coding model according to an embodiment of the present application, and as shown in fig. 2, based on a Transformer neural network structure, an entity encoder and a candidate entity encoder in the entity dense coding model are constructed, where the entity encoder is used for encoding an entity in an automatic labeling sample set to obtain an entity dense vector, and the candidate entity encoder is used for encoding a candidate entity in an unlabeled corpus to obtain a candidate entity dense vector;
The two encoders have the same structure, each comprising an input layer, a coding layer and an output layer.
(1) Input layer
The input to the encoder is obtained by matrix addition as follows:
Input_(e,s) = WE_s + PE_(e,s)
where WE_s represents the word embedding matrix of the associated context (sentence) s of entity e, and PE_(e,s) represents the position embedding matrix of entity e in s.
(2) Coding layer
Fig. 3 is a schematic diagram of the structure of the encoder coding layer in the dense retrieval model according to an embodiment of the present application. As shown in Fig. 3, the coding layer is composed of a plurality of coding blocks, each of which consists of an attention module, a residual and normalization module, and a feed-forward network module.
Attention module: the attention mechanism is the core of the coding block and is used to capture long-range dependencies between input words. On top of the attention mechanism, a multi-head attention module is adopted to obtain semantic information at multiple levels; the multi-head attention mechanism concatenates the information of the individual heads and generates the output through a simple linear transformation, as follows:
MultiHeadAttention(X) = Concat(head_1, …, head_h) W^O
head_i = Attention(X W_i^Q, X W_i^K, X W_i^V)
wherein MultiHeadAttention() denotes multi-head attention, X denotes the input matrix, Concat() denotes the matrix concatenation function, and Attention() denotes the attention function; W^O denotes the parameter matrix of the linear transformation applied after the multi-head information is concatenated, and W_i^Q, W_i^K, W_i^V are the parameter matrices of the attention mechanism, whose values are learned automatically during model training. The multi-head attention in each coding block has its own parameter matrices, and attention parameter matrices are not shared between coding blocks.
Residual and normalization module: first, a residual connection joins the input and output of the attention module; residual connections allow deep networks to train more effectively and avoid problems such as vanishing gradients and network degradation. Then, layer normalization normalizes each sample along the feature dimension, which reduces dependencies between different features, makes model training more stable and faster to converge, and improves the generalization ability of the model. The residual and normalization computation is:
LayerNorm(X + MultiHeadAttention(X))
where LayerNorm() denotes the normalization layer, and X + MultiHeadAttention(X) denotes the residual connection of the input matrix with the result of multi-head attention.
Feed-forward network module: two fully connected layers serve as the feed-forward network; the first layer uses no activation function, and the second layer applies a ReLU activation for the nonlinear transformation. The feed-forward network computes:
FFN(X) = ReLU((X W_1 + b_1) W_2 + b_2)
where FFN() denotes the feed-forward network function, W_1 and W_2 denote the parameter matrices of the linear transformations, and b_1 and b_2 denote their bias terms.
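As a minimal PyTorch sketch of the dual-encoder structure of Fig. 2 (illustrative only, not the patented implementation): the hyperparameters, vocabulary, and mean pooling as the output layer are assumptions, and nn.TransformerEncoderLayer uses the standard Transformer feed-forward arrangement rather than the exact activation placement described above.

```python
# Hedged sketch of the entity / candidate-entity dual encoder (Fig. 2).
import torch
import torch.nn as nn

class SpanEncoder(nn.Module):
    def __init__(self, vocab_size, d_model=256, n_heads=4, n_layers=3, max_len=128):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, d_model)   # WE_s
        self.pos_emb = nn.Embedding(max_len, d_model)       # PE_(e,s)
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)  # coding layer

    def forward(self, token_ids):
        # Input layer: word embeddings plus position embeddings.
        pos = torch.arange(token_ids.size(1), device=token_ids.device)
        x = self.word_emb(token_ids) + self.pos_emb(pos)
        h = self.blocks(x)
        return h.mean(dim=1)     # output layer: one dense vector per span

# Same structure, separate parameters, as described above.
entity_encoder = SpanEncoder(vocab_size=30000)      # E_entity
candidate_encoder = SpanEncoder(vocab_size=30000)   # E_span
```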
Step S1052, training the entity dense coding model based on the training sample set to obtain a trained entity dense coding model.
In step S1052, specifically, a similarity metric is applied to the codes generated by the entity encoder and the candidate entity encoder, calculated as follows:
sim(x, y^+) = E_span(x) · E_entity(y^+)
where x denotes a candidate entity, y^+ denotes a known entity, E_span() denotes the candidate entity encoder, and E_entity() denotes the entity encoder; this similarity formula measures the similarity between the candidate entity and the known entity as the dot product of their vectors.
Construct the loss function of the entity dense coding model based on the training sample set:
L(x_i, y_i^+, y_{i,1}^-, …, y_{i,n}^-) = −log [ exp(sim(x_i, y_i^+)) / ( exp(sim(x_i, y_i^+)) + Σ_{j=1}^n exp(sim(x_i, y_{i,j}^-)) ) ]
wherein the training sample set comprises candidate entity positive example samples and candidate entity negative example samples, i denotes the ith training sample in the training sample set, (x_i, y_i^+) denotes the ith candidate entity positive example sample, and y_{i,j}^- denotes the jth element of the ith candidate entity negative example sample;
The parameters of the entity dense coding model are then trained and updated through this loss function, so that the entity dense vectors output by the model become more similar to related candidate entity dense vectors and less similar to unrelated candidate entity dense vectors, yielding the trained entity dense coding model.
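Assuming sim() is the dot product defined above, this loss reduces to a cross-entropy over the concatenated positive and negative scores; the batch layout below (one positive and n negatives per anchor) is an illustrative assumption, not the patent's prescribed implementation.

```python
# Hedged sketch of the contrastive loss above.
import torch
import torch.nn.functional as F

def contrastive_loss(anchor_vecs, pos_vecs, neg_vecs):
    """anchor_vecs: (B, d) candidate vectors E_span(x_i)
    pos_vecs:   (B, d) known-entity vectors E_entity(y_i^+)
    neg_vecs:   (B, n, d) negative entity vectors E_entity(y_{i,j}^-)"""
    pos_scores = (anchor_vecs * pos_vecs).sum(-1, keepdim=True)      # (B, 1)
    neg_scores = torch.einsum("bd,bnd->bn", anchor_vecs, neg_vecs)   # (B, n)
    logits = torch.cat([pos_scores, neg_scores], dim=1)
    # Cross-entropy against index 0 equals the negative log-likelihood
    # of the positive sample, matching the formula above.
    target = torch.zeros(logits.size(0), dtype=torch.long,
                         device=logits.device)
    return F.cross_entropy(logits, target)
```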
Step S106, obtaining entity dense vectors of entities in the automatic labeling sample set and candidate entity dense vectors of candidate entities in the unlabeled corpus through the trained entity dense coding model;
and step S108, searching out entity dense vectors similar to the entity dense vectors of the candidate entities, and labeling and optimizing the labeled candidate entities in the automatic labeling sample set based on the searching result.
Step S108 specifically includes the following steps:
Step S1081, searching out entity dense vectors similar to the candidate entity dense vectors through the dense vector search tool;
Step S1082, filtering out the entity dense vectors whose similarity with the candidate entity dense vector is lower than the preset threshold value to obtain the filtered entity dense vectors;
Step S1083, determining the category of the candidate entity corresponding to the candidate entity dense vector through a k-nearest neighbor algorithm based on the filtered entity dense vector;
Step S1084, performing labeling optimization on the labeled candidate entities in the automatic labeling sample set based on the categories of the candidate entities.
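Steps S1081 to S1084 might be realized with FAISS as the dense vector search tool, as in the hedged sketch below; the inner-product index, the threshold value, and k are assumptions not specified by the patent.

```python
# Hedged sketch of retrieval, threshold filtering, and k-NN relabeling.
from collections import Counter

import faiss   # dense vector search tool (one possible choice)

def relabel(entity_vecs, entity_types, cand_vecs, threshold=0.5, k=10):
    """entity_vecs: (N, d) float32 entity dense vectors
    entity_types: list of N category labels for those entities
    cand_vecs:   (M, d) float32 candidate entity dense vectors"""
    index = faiss.IndexFlatIP(entity_vecs.shape[1])  # inner-product search
    index.add(entity_vecs)
    scores, ids = index.search(cand_vecs, k)         # step S1081

    categories = []
    for row_scores, row_ids in zip(scores, ids):
        # Step S1082: drop neighbors below the similarity threshold.
        kept = [entity_types[i] for sc, i in zip(row_scores, row_ids)
                if sc >= threshold]
        # Step S1083: majority vote over surviving neighbors (k-NN).
        categories.append(Counter(kept).most_common(1)[0][0] if kept else None)
    return categories  # step S1084: None marks a sample to correct or drop
```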
Through the steps in the embodiment of the application, the vector space is created through the entity dense coding model, so that the distance between the known entity and the related candidate entity in the space is smaller, and the distance between the known entity and the unrelated candidate entity is larger, thereby optimizing the quality of the automatic labeling sample and solving the problem of how to improve the automatic labeling quality of the data set.
It should be noted that the steps illustrated in the above flowcharts may be performed in a computer system, such as a set of computer-executable instructions, and although a logical order is illustrated in the flowcharts, in some cases the steps shown or described may be performed in an order different from the one illustrated herein.
The embodiment of the application provides an entity identification automatic labeling optimization system based on dense retrieval, and FIG. 4 is a structural block diagram of the entity identification automatic labeling optimization system based on dense retrieval, as shown in FIG. 4, and the system comprises an automatic labeling module, a sample construction module, a vector generation module and a labeling optimization module;
The automatic labeling module is used for automatically labeling the data of the unlabeled corpus according to the entity knowledge base to obtain an automatic labeling sample set;
the sample construction module is used for generating a training sample set for training the entity dense coding model according to the sample relation in the automatic labeling sample set;
the vector generation module is used for obtaining entity dense vectors of entities in the automatic labeling sample set and candidate entity dense vectors of candidate entities in the unlabeled corpus through the trained entity dense coding model;
and the labeling optimization module is used for searching out entity dense vectors similar to the entity dense vectors of the candidate entities and labeling and optimizing the labeled candidate entities in the automatic labeling sample set based on the searching result.
By the automatic labeling module, the sample construction module, the vector generation module and the labeling optimization module in the embodiment of the application, the vector space is created by the entity dense coding model, so that the distance between the known entity and the related candidate entity in the space is smaller, and the distance between the known entity and the unrelated candidate entity is larger, thereby optimizing the quality of the automatic labeling sample and solving the problem of how to improve the automatic labeling quality of the data set.
The above-described respective modules may be functional modules or program modules, and may be implemented by software or hardware. For modules implemented in hardware, the various modules described above may be located in the same processor; or the above modules may be located in different processors in any combination.
The present embodiment also provides an electronic device comprising a memory having stored therein a computer program and a processor arranged to run the computer program to perform the steps of any of the method embodiments described above.
Optionally, the electronic apparatus may further include a transmission device and an input/output device, where the transmission device is connected to the processor, and the input/output device is connected to the processor.
It should be noted that, specific examples in this embodiment may refer to examples described in the foregoing embodiments and alternative implementations, and this embodiment is not repeated herein.
In addition, in combination with the method for optimizing entity identification automatic labeling based on dense search in the above embodiment, the embodiment of the application can be realized by providing a storage medium. The storage medium has a computer program stored thereon; the computer program, when executed by a processor, implements any of the dense retrieval-based entity identification automatic annotation optimization methods of the above embodiments.
In one embodiment, a computer device is provided, which may be a terminal. The computer device includes a processor, a memory, a network interface, a display screen, and an input device connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program, when executed by a processor, implements a dense retrieval based automatic labeling optimization method for entity identification. The display screen of the computer equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer equipment can be a touch layer covered on the display screen, can also be keys, a track ball or a touch pad arranged on the shell of the computer equipment, and can also be an external keyboard, a touch pad or a mouse and the like.
In one embodiment, an electronic device is provided, which may be a server; FIG. 5 is a schematic diagram of its internal structure according to an embodiment of the present application. The electronic device includes a processor, a network interface, an internal memory, and a non-volatile memory connected by an internal bus, where the non-volatile memory stores an operating system, computer programs, and a database. The processor provides computing and control capability, the network interface communicates with external terminals through a network connection, and the internal memory provides an environment for running the operating system and the computer programs; when executed by the processor, the computer program implements the dense-retrieval-based automatic labeling optimization method for entity identification, and the database stores data.
It will be appreciated by those skilled in the art that the structure shown in fig. 5 is merely a block diagram of a portion of the structure associated with the present inventive arrangements and is not limiting of the electronic device to which the present inventive arrangements are applied, and that a particular electronic device may include more or fewer components than shown, or may combine certain components, or have a different arrangement of components.
Those skilled in the art will appreciate that implementing all or part of the above-described methods may be accomplished by way of a computer program stored on a non-transitory computer-readable storage medium, which, when executed, may comprise the steps of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. The non-volatile memory can include Read-Only Memory (ROM), Programmable ROM (PROM), Electrically Programmable ROM (EPROM), Electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDR SDRAM), Enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus Direct RAM (RDRAM), Direct Rambus Dynamic RAM (DRDRAM), and Rambus Dynamic RAM (RDRAM), among others.
It should be understood by those skilled in the art that the technical features of the above-described embodiments may be combined in any manner; for brevity, not all possible combinations of these technical features have been described, but such combinations should be considered within the scope of this description as long as they are not contradictory.
The above examples illustrate only a few embodiments of the application, which are described in detail and are not to be construed as limiting the scope of the application. It should be noted that it will be apparent to those skilled in the art that several variations and modifications can be made without departing from the spirit of the application, which are all within the scope of the application. Accordingly, the scope of protection of the present application is to be determined by the appended claims.

Claims (10)

1. An automatic labeling optimization method for entity identification based on dense retrieval is characterized by comprising the following steps:
Based on the entity knowledge base, automatically labeling the data of the unlabeled corpus to obtain an automatic labeling sample set;
generating a training sample set for training the entity dense coding model based on the sample relation in the automatic labeling sample set;
Obtaining entity dense vectors of entities in the automatic labeling sample set and candidate entity dense vectors of candidate entities in the unlabeled corpus through the trained entity dense coding model;
And searching out entity dense vectors similar to the entity dense vectors of the candidate entities, and performing labeling optimization on the labeled candidate entities in the automatic labeling sample set based on the searching result.
2. The method of claim 1, wherein generating a training sample set for entity-dense coding model training based on sample relationships in the automatically labeled sample set comprises:
Clustering the samples (e, u, s) in the automatic labeling sample set by the entity e to obtain an automatic labeling sample subset {(u_1, s_1), (u_1, s_2), …, (u_{i-1}, s_k), (u_i, s_k)}, wherein e represents an entity, u represents the candidate entity corresponding to the entity e, s represents the sentence in which the candidate entity u exists, and (u_i, s_k) represents that the candidate entity u_i exists in sentence s_k;
and generating a training sample set for training the entity dense coding model based on the automatic labeling sample subset, wherein the training sample set comprises candidate entity positive example samples and candidate entity negative example samples.
3. The method of claim 2, wherein generating candidate entity positive examples samples in a training sample set for entity dense coding model training based on the automatically labeled sample subset comprises:
calculating the similarity between candidate entities in the automatic labeling sample subset;
And matching based on the similarity to generate the candidate entity positive example samples (x_i, y_i^+) in the training sample set for training the entity dense coding model, wherein x_i represents the ith candidate entity and the sentence to which it belongs, and y_i^+ represents a candidate entity similar to x_i and the sentence to which that candidate entity belongs.
4. The method of claim 2, wherein generating candidate entity counterexample samples in a training sample set for entity dense coding model training based on the automatically labeled sample subset comprises:
screening out the entities which do not participate in the automatic labeling from the entity knowledge base;
And combining the entity with elements in the automatic labeling sample subset to generate candidate entity counterexample samples in a training sample set for training the entity dense coding model.
5. The method of claim 1, wherein searching for entity dense vectors that are similar to respective candidate entity dense vectors, labeling optimization of labeled candidate entities in the set of automatically labeled samples based on the results of the searching comprises:
searching out entity dense vectors similar to the candidate entity dense vectors through a dense vector searching tool;
Filtering out the entity dense vectors whose similarity with the candidate entity dense vector is lower than a preset threshold value to obtain filtered entity dense vectors;
Based on the filtered entity dense vector, determining the category of the candidate entity corresponding to the candidate entity dense vector through a k-nearest neighbor algorithm;
And labeling and optimizing the labeled candidate entities in the automatic labeling sample set based on the category of the candidate entity.
6. The method of claim 1, wherein prior to obtaining entity dense vectors for entities in the automatically labeled sample set and candidate entity dense vectors for candidate entities in the unlabeled corpus from the trained entity dense coding model, the method comprises:
constructing an entity dense coding model based on a Transformer neural network structure;
And training the entity dense coding model based on the training sample set to obtain a trained entity dense coding model.
7. The method of claim 6, wherein constructing a dense-entity encoding model based on a Transformer neural network structure comprises:
Constructing an entity encoder in an entity dense coding model based on a Transformer neural network structure, wherein the entity encoder comprises an input layer, a coding layer and an output layer, and the entity encoder is used for coding the entities in the automatic labeling sample set to obtain entity dense vectors;
And constructing a candidate entity encoder in an entity dense coding model based on a Transformer neural network structure, wherein the candidate entity encoder comprises an input layer, a coding layer and an output layer, and the candidate entity encoder is used for encoding the candidate entity in the unlabeled corpus to obtain a candidate entity dense vector.
8. The method of claim 6, wherein training the entity-dense coding model based on the training sample set comprises:
Constructing a loss function of the entity dense coding model based on the training sample set:
L(x_i, y_i^+, y_{i,1}^-, …, y_{i,n}^-) = −log [ exp(sim(x_i, y_i^+)) / ( exp(sim(x_i, y_i^+)) + Σ_{j=1}^n exp(sim(x_i, y_{i,j}^-)) ) ]
wherein the training sample set comprises candidate entity positive example samples and candidate entity negative example samples, i denotes the ith training sample in the training sample set, (x_i, y_i^+) denotes the ith candidate entity positive example sample, y_{i,j}^- denotes the jth element of the ith candidate entity negative example sample, and sim() denotes the similarity calculation function;
And training and updating the parameters of the entity dense coding model through the loss function, so that the entity dense vectors output by the entity dense coding model are more similar to related candidate entity dense vectors and less similar to unrelated candidate entity dense vectors, to obtain the trained entity dense coding model.
9. The method of claim 1, wherein automatically labeling the unlabeled corpus for data based on the entity knowledge base, the obtaining an automatically labeled sample set comprises:
extracting sentence sets from the unlabeled corpus;
and automatically labeling the sentence set through character string matching based on the entity knowledge base, and generating an automatic labeling sample set of the candidate entity.
10. The entity identification automatic labeling optimization system based on dense retrieval is characterized by comprising an automatic labeling module, a sample construction module, a vector generation module and a labeling optimization module;
the automatic labeling module is used for automatically labeling the data of the unlabeled corpus according to the entity knowledge base to obtain an automatic labeling sample set;
the sample construction module is used for generating a training sample set for training the entity dense coding model according to the sample relations in the automatic labeling sample set;
the vector generation module is used for obtaining entity dense vectors of the entities in the automatic labeling sample set and candidate entity dense vectors of the candidate entities in the unlabeled corpus through the trained entity dense coding model;
And the labeling optimization module is used for searching out entity dense vectors similar to the entity dense vectors of each candidate entity, and labeling and optimizing the labeled candidate entities in the automatic labeling sample set based on the searching result.
CN202410537686.1A 2024-04-30 2024-04-30 Automatic labeling optimization method and system for entity identification based on dense retrieval Pending CN118114677A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410537686.1A CN118114677A (en) 2024-04-30 2024-04-30 Automatic labeling optimization method and system for entity identification based on dense retrieval


Publications (1)

Publication Number Publication Date
CN118114677A true CN118114677A (en) 2024-05-31

Family

ID=91219695


Country Status (1)

Country Link
CN (1) CN118114677A (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190220749A1 (en) * 2018-01-17 2019-07-18 Beijing Baidu Netcom Science And Technology Co., Ltd. Text processing method and device based on ambiguous entity words
CN111125378A (en) * 2019-12-25 2020-05-08 同方知网(北京)技术有限公司 Closed-loop entity extraction method based on automatic sample labeling
CN111881264A (en) * 2020-09-28 2020-11-03 北京智源人工智能研究院 Method and electronic equipment for searching long text in question-answering task in open field
WO2021121198A1 (en) * 2020-09-08 2021-06-24 平安科技(深圳)有限公司 Semantic similarity-based entity relation extraction method and apparatus, device and medium
CN113987154A (en) * 2021-11-10 2022-01-28 润联软件***(深圳)有限公司 Similar sentence generation model training method based on UniLM and contrast learning and related equipment
CN117236338A (en) * 2023-08-29 2023-12-15 北京工商大学 Named entity recognition model of dense entity text and training method thereof


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
钱伟中; 邓蔚; 傅; 秦志光: "Protein-protein interaction information extraction method based on joint training" (基于联合训练的蛋白质互作用信息抽取方法), 计算机应用研究 (Application Research of Computers), no. 05, 15 May 2011 (2011-05-15) *


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination