CN111259134A - Entity identification method, equipment and computer readable storage medium - Google Patents

Publication number
CN111259134A
Authority
CN
China
Prior art keywords
entity
library
semi
training
labeled
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010057489.1A
Other languages
Chinese (zh)
Other versions
CN111259134B (en)
Inventor
王东升
范红杰
林凤绿
雷欣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Mobvoi Information Technology Co Ltd
Original Assignee
Mobvoi Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Mobvoi Information Technology Co Ltd
Priority to CN202010057489.1A
Publication of CN111259134A
Application granted
Publication of CN111259134B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/335 Filtering based on additional data, e.g. user or group profiles
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/38 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/382 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using citations

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Library & Information Science (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses an entity identification method, equipment and a computer readable storage medium. The method comprises: a first operation of labeling a specified text through an entity library and determining a training set and a test set corresponding to the specified text, the training set comprising a labeled text set and a semi-labeled text set; a second operation of training a model through the training set, predicting the test set based on the entity recognition model obtained through training, and screening to obtain valid entities; a third operation of adding the valid entities to the entity library and re-determining the semi-labeled text set based on the entity library; and repeating the second operation and the third operation to obtain a target entity library. The method provided by the embodiment of the invention can automatically expand the number of entities in the target entity library and can automatically label a large amount of unlabeled text.

Description

Entity identification method, equipment and computer readable storage medium
Technical Field
The present invention relates to the field of data processing technologies, and in particular, to a method and an apparatus for entity identification, and a computer-readable storage medium.
Background
An entity is something that is distinguishable and exists independently. Machine learning tasks such as supervised learning need a large amount of labeled entity data, but manually labeling entities is time-consuming and labor-intensive. At present, most entity labeling work is done manually; although the accuracy is high, the efficiency is very low. An automatic alternative is to label text using a dictionary, but this approach has poor portability: labeling text in different fields requires different dictionaries, and a complete dictionary is difficult to obtain.
Disclosure of Invention
The embodiment of the invention provides an entity identification method, equipment and a computer readable storage medium, which can automatically label unmarked entities.
One aspect of the present invention provides an entity identification method, including: a first operation of labeling a specified text through an entity library and determining a training set and a test set corresponding to the specified text, the training set comprising a labeled text set and a semi-labeled text set; a second operation of training a model through the training set, predicting the test set based on the entity recognition model obtained through training, and screening to obtain valid entities; a third operation of adding the valid entities to the entity library and re-determining the semi-labeled text set based on the entity library; and repeating the second operation and the third operation to obtain the target entity library.
In an embodiment, the predicting the test set based on the entity recognition model obtained by training, and screening to obtain valid entities includes: predicting the test set through the entity recognition model obtained through training to obtain a predicted entity; filtering the predicted entity based on the entity library to obtain a filtered entity; and screening the filtering entity based on a constraint strategy to obtain an effective entity.
In one embodiment, the constraint policy includes at least one of: a first policy for length constraints, a second policy for character constraints, and a third policy for statistical constraints.
In an embodiment, said re-determining the set of semi-annotated text based on the entity library comprises: labeling the semi-labeled text set through the entity library to obtain a labeled semi-labeled text set; selecting the labeled semi-labeled text set through a strategy network in reinforcement learning to obtain a selected semi-labeled text set; and determining the selected semi-labeled text set as the semi-labeled text set of the third operation.
In one embodiment, the repeating of the second operation and the third operation to obtain the target entity library includes: after the third operation of the current round is finished, judging whether a termination condition is met; and when the judgment result shows that the termination condition is met, determining the entity library obtained in the current round as a target entity library.
In an embodiment, the method further comprises: and when the termination condition is judged not to be met, executing the next round of second operation and third operation.
In one embodiment, the termination condition is whether the loop times of the second operation and the third operation satisfy a loop threshold; or, the termination condition is whether the number of valid entities joining the entity library of the third operation satisfies a number threshold.
One aspect of the present invention provides an entity identification apparatus, including: the first operation module is used for marking the specified text through an entity library and determining a training set and a test set corresponding to the specified text; the training set comprises a labeled text set and a semi-labeled text set; the second operation module is used for training the model through the training set, predicting the test set based on the entity recognition model obtained through training, and screening to obtain an effective entity; a third operation module, configured to add the valid entity to the entity library, and re-determine a semi-labeled text set based on the entity library; and the circulating module is used for repeatedly circulating the second operation and the third operation to obtain the target entity library.
In an embodiment, the second operation module includes: the prediction submodule is used for predicting the test set through the entity recognition model obtained through training to obtain a predicted entity; a filtering submodule, configured to filter the predicted entity based on the entity library to obtain a filtered entity; and the screening submodule is used for screening the filtering entity based on the constraint strategy to obtain an effective entity.
In an embodiment, the third operating module includes: the labeling submodule is used for labeling the semi-labeled text set through the entity library to obtain a labeled semi-labeled text set; the selection submodule is used for selecting the labeled semi-labeled text set through a strategy network in reinforcement learning to obtain a selected semi-labeled text set; and the first determining submodule is used for determining the selected semi-labeled text set as the semi-labeled text set of the third operation.
In one embodiment, the circulation module includes: the judgment submodule is used for judging whether a termination condition is met or not after the third operation of the current round is finished; and the second determining submodule is used for determining the entity library obtained in the current round as the target entity library when the termination condition is judged to be met.
In an embodiment, the apparatus further comprises: and the execution sub-module is used for executing the next round of second operation and third operation when the termination condition is judged not to be met.
In another aspect, the present invention provides a computer-readable storage medium comprising a set of computer-executable instructions which, when executed, perform any of the entity identification methods described above.
The entity identification method, the entity identification equipment and the computer readable storage medium provided by the embodiment of the invention label the specified text using a small number of entities in the entity library, predict the test set through the entity recognition model, and screen the predictions to obtain valid entities. The valid entities are added to the entity library, the semi-labeled text set is re-labeled with the enlarged library, and the process is repeated, so that each round can generate new valid entities and add them to the entity library. By repeating these steps, the entities in the target entity library are expanded automatically, and a large amount of unlabeled text can be labeled automatically.
Drawings
The above and other objects, features and advantages of exemplary embodiments of the present invention will become readily apparent from the following detailed description read in conjunction with the accompanying drawings. Several embodiments of the invention are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings, in which the same or corresponding reference numerals indicate the same or corresponding parts.
Fig. 1 is a schematic flow chart illustrating an implementation of an entity identification method according to an embodiment of the present invention;
FIG. 2 is a block diagram of an entity extraction framework of an entity recognition model according to an embodiment of the present invention;
FIG. 3 is a schematic diagram illustrating an implementation flow of entity prediction screening in an entity identification method according to an embodiment of the present invention;
FIG. 4 is a block diagram of an entity recognition model according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of an implementation flow of determining a semi-annotated text set by an entity identification method according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of a labeling framework of an entity recognition model according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of an implementation module of an entity identification device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, features and advantages of the present invention more obvious and understandable, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Fig. 1 is a schematic flow chart illustrating an implementation of an entity identification method according to an embodiment of the present invention.
Referring to fig. 1, an aspect of the present invention provides an entity identification method, where the method includes: a first operation 101, labeling the specified text through an entity library, and determining a training set and a test set corresponding to the specified text; the training set comprises a labeled text set and a semi-labeled text set; a second operation 102 of training the model through the training set, predicting the test set based on the entity recognition model obtained by training, and screening to obtain an effective entity; a third operation 103, adding the effective entities into the entity library, and re-determining the semi-labeled text set based on the entity library; and operation 104, repeating the second operation and the third operation to obtain the target entity library.
The entity identification method provided by the embodiment of the invention labels the specified text using a small number of entities in the entity library, predicts the test set through the entity recognition model, and screens the predictions to obtain valid entities. The valid entities are added to the entity library, the semi-labeled text set is re-labeled with the enlarged library, and the process is repeated, so that each round can generate new valid entities and add them to the entity library. By repeating these steps, the number of entities in the target entity library is expanded automatically, and a large amount of unlabeled text can be labeled automatically.
The method comprises a first operation of labeling the specified text through an entity library and determining a training set and a test set corresponding to the specified text. In the first operation, the specified text is labeled through the entity library so as to divide the specified text into a training set and a test set. The specified text may be any corpus compiled according to a certain principle. The corpus may consist of various sentences, and it should be understood that the length and completeness of the sentences are not limited here. The principle may be based on language domain, language body, duration, common time, etc. The entity library includes a small number of seed entities corresponding to the specified text, which can be obtained by manual labeling. Labeling the specified text through the entity library yields the training set and the test set. The training set consists of sentences that have been labeled with entities. The test set consists of sentences that have not been labeled with entities. The training set comprises a labeled text set and a semi-labeled text set. Sentences in the labeled text set can be completely labeled using the original seed entities of the entity library. Sentences in the semi-labeled text set are labeled jointly by the original seed entities and the valid entities. Further, the specified text may also include a validation set; the model parameters are adjusted according to the error measured on the validation set, and the optimal model is selected.
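As an illustration of the first operation, the following minimal sketch labels tokenized sentences by matching them against an entity library and separates matched from unmatched sentences. The function names, the B/I/O tag scheme, the greedy longest-match rule, and the rule that any sentence with at least one match goes to the training pool are all assumptions made for the example; the text above does not specify them.

```python
def label_sentence(tokens, entity_library):
    """Tag tokens with B/I/O using greedy longest match (assumed scheme)."""
    # Try longer entities first so that "New York" wins over "New".
    entities = sorted({tuple(e) for e in entity_library}, key=len, reverse=True)
    tags = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        for ent in entities:
            n = len(ent)
            if n and tuple(tokens[i:i + n]) == ent:
                tags[i] = "B"
                for j in range(i + 1, i + n):
                    tags[j] = "I"
                i += n
                break
        else:
            # No entity starts at position i; move to the next token.
            i += 1
    return tags


def split_corpus(sentences, entity_library):
    """Sentences with at least one match form the training pool; the rest form the test set."""
    train, test = [], []
    for tokens in sentences:
        tags = label_sentence(tokens, entity_library)
        (train if any(t != "O" for t in tags) else test).append((tokens, tags))
    return train, test
```

For example, with a library containing ("New", "York"), the sentence ["I", "love", "New", "York"] receives the tags O, O, B, I, and a sentence with no match is placed in the test set.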
The method further comprises a second operation of training the model through the training set, predicting the test set based on the entity recognition model obtained through training, and screening to obtain effective entities.
Fig. 2 is a block diagram of an entity extraction framework of an entity recognition model according to an embodiment of the present invention. Referring to fig. 2, an entity recognition model can be obtained by training the model through a training set. The entity identification model may be selected as a BiLSTM-CRF model as shown in fig. 2, and the prediction entity corresponding to the test set can be obtained by predicting the test set through the BiLSTM-CRF model.
The method further comprises a third operation of adding the effective entities into the entity library and re-determining the semi-labeled text set based on the entity library. And after the effective entities are obtained, adding the effective entities into the entity library in the first operation to increase the number of the entities in the entity library, and re-labeling the semi-labeled text set through the entity library with the increased number of the entities to update the labeling of the sentences in the semi-labeled text set.
The method further comprises repeating the second operation and the third operation to obtain the target entity library. Because the labels of the sentences in the semi-labeled text set are updated, the training set is updated, and so the model trained on it is updated as well. The updated model can then predict the test set to obtain valid entities in the current round that differ from those of the previous round. In this way, starting from only a small number of seed entities in the entity library, a large amount of unlabeled text can be labeled automatically.
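The overall cycle of the second and third operations can be sketched as a simple loop. In this sketch the model training, prediction, and screening steps are passed in as callables; `train_and_predict`, `screen`, and `max_rounds` are hypothetical names introduced for illustration, and a fixed number of rounds is assumed as the stopping rule.

```python
def expand_entity_library(seed_entities, train_and_predict, screen, max_rounds=5):
    """Grow an entity library from seed entities over several rounds.

    train_and_predict(library) -> set of predicted entities (second operation,
        assumed to relabel the training text with `library`, train, and predict).
    screen(predicted, library) -> set of entities judged valid.
    """
    library = set(seed_entities)
    for _ in range(max_rounds):
        predicted = train_and_predict(library)
        # Keep only entities that are new with respect to the current library.
        valid = set(screen(predicted, library)) - library
        # Third operation: add the valid entities; the next round sees the larger library.
        library |= valid
    return library
```

Because each round receives the library enlarged by the previous round, the predictions can differ from round to round even though the loop body is unchanged.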
Fig. 3 is a schematic diagram illustrating an implementation flow of entity prediction screening in an entity identification method according to an embodiment of the present invention. Fig. 4 is a schematic diagram of a framework of an entity recognition model of an entity recognition method according to an embodiment of the present invention.
Referring to fig. 2, fig. 3 and fig. 4, in the embodiment of the present invention, the second operation 102, predicting the test set based on the entity recognition model obtained by training, and screening to obtain valid entities, includes: operation 1021, predicting the test set through the entity recognition model obtained through training, so as to obtain a predicted entity; operation 1022, filter the predicted entity based on the entity library to obtain a filtered entity; and operation 1023, screening the filtering entities based on the constraint strategy to obtain effective entities.
The method comprises predicting the test set through the entity recognition model obtained through training to obtain predicted entities. Labeling entities through the BiLSTM-CRF model amounts to searching for the most probable tag sequence for each sentence in the training set and the test set. When the model is used for prediction, each sentence is embedded using trained word vectors, the word vectors are input into the BiLSTM layer, and hidden layer vectors h are output. The hidden layer vectors are input into the CRF layer, which calculates the conditional probability of the tag sequence corresponding to each sentence. The calculation formula is as follows:
Figure BDA0002373304690000071
where $s$ represents the current sentence, $y$ represents a tag sequence, and $Y_s$ represents all possible tag sequence combinations for the current sentence $s$. The loss function is defined as follows:
$$\mathrm{loss}(\Theta, s, y) = -\log p(y \mid s)$$
where $\Theta$ represents all parameters of the model. The test set is predicted through the BiLSTM-CRF model to obtain the predicted entities corresponding to the test set.
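To make the conditional probability and the loss concrete, the sketch below computes log p(y|s) for a tiny linear-chain CRF by enumerating every tag sequence in Y_s. It assumes the common formulation in which a sequence score is the sum of per-token emission scores and tag-to-tag transition scores; whether the original formula uses exactly this decomposition is an assumption, and brute-force enumeration is only practical for very short sentences.

```python
import itertools
import math


def sequence_score(emissions, transitions, tags):
    """Sum of emission scores plus transition scores along one tag sequence."""
    score = sum(emissions[t][tag] for t, tag in enumerate(tags))
    score += sum(transitions[a][b] for a, b in zip(tags, tags[1:]))
    return score


def crf_log_prob(emissions, transitions, tags):
    """log p(y|s): score of y minus the log-sum-exp over all sequences in Y_s."""
    num_tags = len(transitions)
    all_scores = [
        sequence_score(emissions, transitions, cand)
        for cand in itertools.product(range(num_tags), repeat=len(emissions))
    ]
    m = max(all_scores)  # subtract the maximum for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in all_scores))
    return sequence_score(emissions, transitions, tags) - log_z


def crf_loss(emissions, transitions, tags):
    """loss(Θ, s, y) = -log p(y|s)."""
    return -crf_log_prob(emissions, transitions, tags)
```

Here `emissions[t][k]` stands for the score of tag k at position t (in a BiLSTM-CRF this would be derived from the hidden vector h), and `transitions[a][b]` is the score of moving from tag a to tag b. A practical implementation would replace the enumeration with the forward algorithm.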
The method further comprises filtering the predicted entities based on the entity library to obtain filtered entities. The predicted entities produced by the entity recognition model may include entities that are already in the entity library. Filtering the predicted entities against the entity library removes these duplicates, leaving filtered entities that are not yet in the entity library. It is to be understood that, when the current round is not the first round, the entity library used for filtering is the entity library to which the valid entities of the previous round have been added; when the current round is the first round, the entity library used for filtering is the entity library referred to in the first operation.
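The filtering step can be sketched as a membership test against the current library. The function name is hypothetical, and keeping repeated predictions in the output is an assumption made so that a later step can still count how often each entity was predicted.

```python
def filter_predicted(predicted, entity_library):
    """Drop predictions already present in the entity library.

    Repeated predictions of the same new entity are kept, in order.
    """
    known = set(entity_library)
    return [e for e in predicted if e not in known]
```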
The method also comprises the step of screening the filtering entity based on the constraint strategy to obtain an effective entity.
The quality of the predicted entities produced by the entity recognition model varies, so they cannot be added to the entity library directly; an entity filter is needed to select high-quality valid entities. The entity filter applies three different constraint strategies to obtain the valid entities.
In the embodiment of the present invention, the constraint policy includes at least one of: a first policy for length constraints, a second policy for character constraints, and a third policy for statistical constraints.
The first strategy, for the length constraint, limits the string length of a filtered entity and screens out entities whose length falls outside a preset length range. The preset length range can be determined according to the actual situation. Let min denote the minimum length and max the maximum length of the preset range, and let e denote a filtered entity; the length len(e) of a filtered entity e must satisfy the length constraint min ≤ len(e) ≤ max. Filtered entities that do not satisfy the length constraint are screened out.
The second strategy, for the character constraint, removes predicted entities that contain special characters, such as commas and other non-textual symbols. Filtered entities containing special characters are screened out.
The third strategy, for the statistical constraint, determines the confidence of an entity based on multiple predictions and screens out entities that do not meet the confidence requirement. The more times a filtered entity is predicted, the higher the confidence that it is a valid entity. An occurrence threshold num and a selection probability pi are set: if the number of times an entity occurs satisfies the threshold num, the entity is selected with a higher probability; otherwise it is discarded. Here c represents a counter and p represents a confidence. It should be understood that the threshold num corresponds to the number of loop rounds, and the requirement on the confidence p may be set according to the actual situation. The calculation formula is as follows:
Figure BDA0002373304690000081
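The three constraint strategies can be sketched as small predicates applied in sequence. The default values of `min_len`, `max_len`, `num`, and `pi`, the regular expression used to detect special characters, and the rule "select with probability pi once the count reaches num" are assumptions chosen for illustration; they are one reading of the description above, not a reproduction of the formula.

```python
import random
import re
from collections import Counter

# Assumption: any character that is neither a word character nor whitespace is "special".
SPECIAL = re.compile(r"[^\w\s]")


def passes_length(entity, min_len=2, max_len=10):
    """First strategy: min <= len(e) <= max."""
    return min_len <= len(entity) <= max_len


def passes_characters(entity):
    """Second strategy: reject entities containing special characters."""
    return SPECIAL.search(entity) is None


def passes_statistics(entity, counter, num=2, pi=0.9, rng=random):
    """Third strategy: require at least `num` occurrences, then select with probability `pi`."""
    return counter[entity] >= num and rng.random() < pi


def screen_entities(filtered, counter, min_len=2, max_len=10, num=2, pi=0.9, rng=random):
    """Apply the three strategies to each distinct filtered entity, preserving order."""
    valid = []
    for e in dict.fromkeys(filtered):
        if (passes_length(e, min_len, max_len)
                and passes_characters(e)
                and passes_statistics(e, counter, num, pi, rng)):
            valid.append(e)
    return valid
```

The `counter` argument plays the role of the counter c: a `Counter` mapping each entity to the number of times it has been predicted so far. Setting `pi=1.0` makes the third strategy deterministic, which is convenient for testing.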
FIG. 5 is a schematic diagram of an implementation flow of determining a semi-annotated text set by an entity identification method according to an embodiment of the present invention; fig. 6 is a schematic diagram of a labeling framework of an entity recognition model according to an embodiment of the present invention.
Referring to fig. 5 and 6, in the embodiment of the present invention, the third operation 103, re-determining the semi-labeled text set based on the entity library, includes: operation 1031, labeling the semi-labeled text set through the entity library to obtain a labeled semi-labeled text set; operation 1032, selecting from the labeled semi-labeled text set through a policy network in reinforcement learning to obtain a selected semi-labeled text set; and operation 1033, determining the selected semi-labeled text set as the semi-labeled text set of the third operation.
When the semi-labeled text set is re-determined based on the entity library, the semi-labeled text set is first labeled through the entity library to obtain a labeled semi-labeled text set. It should be understood that the entity library here refers to the entity library after the valid entities have been added in the third operation.
The method further comprises selecting from the labeled semi-labeled text set through a policy network in reinforcement learning to obtain the selected semi-labeled text set. This can be handled by an instance selector, which is a combination of a reward function, a policy function and actions, i.e. a policy network in reinforcement learning. Because the quality of the sentences in the labeled semi-labeled text set varies, the policy network selects among them to decide whether each sentence should be added to the training set of the entity recognition model for the next round of model training. The state, behavior and reward are defined as follows:
State: the state represents the information of the current sentence and its corresponding tags. It is a real-valued vector comprising two parts: 1) the vector representation of the current sentence, obtained from the BiLSTM layer; 2) the tag sequence of the current sentence, obtained from the CRF layer.
Behavior: the behavior $a_i \in \{0, 1\}$ indicates whether the $i$-th sentence in the labeled semi-labeled text set is selected. A custom policy function is used, with the following formula:
$$A_\Theta(s_i, a_i) = a_i\,\sigma(W s_i + b) + (1 - a_i)\,\bigl(1 - \sigma(W s_i + b)\bigr)$$
where $s_i$ is the state feature vector and $\sigma(\cdot)$ is the sigmoid function with parameters $W$ and $b$. $\Theta$ is the set of behavior parameters. Whether the $i$-th sentence is selected for training is determined based on this policy function.
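The policy function can be evaluated directly from its definition. The sketch below treats the state as a plain list of floats and W as a weight vector of the same length; these representations, the function names, and the sampling helper are assumptions for illustration.

```python
import math
import random


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def policy_prob(state, action, W, b):
    """A(s_i, a_i) = a_i * σ(W·s_i + b) + (1 - a_i) * (1 - σ(W·s_i + b))."""
    p_select = sigmoid(sum(w * s for w, s in zip(W, state)) + b)
    return action * p_select + (1 - action) * (1.0 - p_select)


def sample_action(state, W, b, rng=random):
    """Draw a_i in {0, 1}; 1 means the sentence is selected for training."""
    return 1 if rng.random() < policy_prob(state, 1, W, b) else 0
```

For any state, `policy_prob(state, 1, W, b)` and `policy_prob(state, 0, W, b)` sum to 1, so the function defines a valid distribution over the two actions.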
The method further comprises determining the selected semi-labeled text set as the semi-labeled text set of the third operation. The reinforcement-learning policy network determines whether each sentence in the semi-labeled text set, as re-labeled by the entity library, is added to the training of the entity recognition model. Once the selected semi-labeled text set has been determined as the semi-labeled text set of the third operation, that semi-labeled text set and the labeled text set together serve as the training set with which the next second operation updates the model.
Referring to fig. 6, in the embodiment of the present invention, operation 104, repeating the second operation and the third operation to obtain the target entity library, includes: first, judging whether a termination condition is met after the third operation of the current round is finished; and then, when the termination condition is judged to be met, determining the entity library obtained in the current round as the target entity library.
In the second operation and the third operation of the cycle, in order to determine that the cycle is finished, the method needs to judge whether the termination condition is met after each round of the third operation is finished. When the termination condition is judged to be satisfied, the loop may be ended, and the entity library obtained in the current round is determined as the target entity library. And when the termination condition is judged not to be met, continuing to cycle the second operation and the third operation, and executing the next round of the second operation and the third operation. The termination condition may be preset according to actual conditions.
In the embodiment of the present invention, the termination condition is whether the loop times of the second operation and the third operation satisfy a loop threshold; or, the termination condition is whether the number of valid entities of the entity library joining the third operation satisfies a number threshold.
In one case, the termination condition is whether the number of cycles of the second operation and the third operation satisfies a cycle threshold. The number of cycles may be any positive integer, for example three, five or nine. It is to be understood that one cycle consists of performing the second operation and the third operation once each. For example, when the number of cycles reaches the cycle threshold, the termination condition is judged to be met, and the entity library obtained in the current round is determined as the target entity library. When the number of cycles has not reached the cycle threshold, the termination condition is judged not to be met, and the next round of the second operation and the third operation is executed.
In another case, the termination condition is whether the number of valid entities of the entity library added to the third operation satisfies a number threshold, and the number threshold may be 0 or any positive integer above 0. For example: and when the effective entity number of the entity library added in the third operation is smaller than the number threshold, judging that the termination condition is met, and determining the entity library obtained in the current round as the target entity library. And when the number of the effective entities in the entity library added in the third operation is greater than or equal to the number threshold, namely the termination condition is judged not to be met, and the next round of second operation and third operation are executed. It will be appreciated that when the number threshold is chosen to be 0, the termination condition is met when no new valid entities are added to the entity library.
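The two termination conditions can be combined into one predicate, sketched below. The function and parameter names are hypothetical. The comparison for the entity-count condition is written here as "less than or equal to", an assumption chosen so that a number threshold of 0 stops the loop once a round adds no new valid entity.

```python
def should_stop(rounds_done, new_valid_count, loop_threshold=None, number_threshold=None):
    """Return True when either configured termination condition is met.

    rounds_done: completed cycles of the second and third operations.
    new_valid_count: valid entities added to the library in the latest round.
    A threshold left as None disables that condition.
    """
    if loop_threshold is not None and rounds_done >= loop_threshold:
        return True
    if number_threshold is not None and new_valid_count <= number_threshold:
        return True
    return False
```

For example, `should_stop(3, 10, loop_threshold=3)` is true because the cycle threshold has been reached, and `should_stop(1, 0, number_threshold=0)` is true because the latest round contributed nothing new.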
Fig. 7 is a schematic diagram of an implementation module of an entity identification device according to an embodiment of the present invention.
Referring to fig. 7, in an aspect, an embodiment of the present invention provides an entity identification device, where the device includes: a first operation module 701, configured to label a specified text through an entity library, and determine a training set and a test set corresponding to the specified text; the training set comprises a labeled text set and a semi-labeled text set; a second operation module 702, configured to train the model through the training set, predict the test set based on the entity recognition model obtained through the training, and screen to obtain valid entities; a third operation module 703, configured to add the valid entities to the entity library, and re-determine the semi-labeled text set based on the entity library; and a loop module 704, configured to repeat the second operation and the third operation to obtain the target entity library.
In this embodiment of the present invention, the second operation module 702 includes: a prediction submodule 7021, configured to predict the test set through the trained entity recognition model to obtain predicted entities; a filtering submodule 7022, configured to filter the predicted entities based on the entity library to obtain filtered entities; and a screening submodule 7023, configured to screen the filtered entities based on the constraint policy to obtain valid entities.
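The filter-then-screen chain of the second operation module might look roughly like the following. This is a sketch under assumptions, not the patented implementation: it assumes that filtering based on the entity library means discarding predictions already present in the library, and it fills in the constraint policy with one hypothetical rule for each of the length, character, and statistical constraints listed in the claims. The function names, the thresholds (`min_len`, `max_len`, `min_count`), and the sample strings are all illustrative.

```python
import re
from collections import Counter
from typing import Iterable, List, Set


def filter_known(predicted: Iterable[str], entity_library: Set[str]) -> List[str]:
    """Drop predictions already in the library (assumed meaning of 'filter')."""
    seen: Set[str] = set()
    filtered: List[str] = []
    for entity in predicted:
        if entity not in entity_library and entity not in seen:
            seen.add(entity)
            filtered.append(entity)
    return filtered


def screen(candidates: Iterable[str], counts: Counter,
           min_len: int = 2, max_len: int = 10, min_count: int = 2) -> List[str]:
    """Keep candidates that pass all three hypothetical constraints."""
    valid: List[str] = []
    for entity in candidates:
        if not (min_len <= len(entity) <= max_len):
            continue  # length constraint
        if not re.fullmatch(r"\w+", entity):
            continue  # character constraint: word characters only
        if counts[entity] < min_count:
            continue  # statistical constraint: predicted often enough
        valid.append(entity)
    return valid


predicted = ["Beijing", "Beijing", "Mobvoi", "Mobvoi", "x", "a-b", "a-b", "Shanghai"]
filtered = filter_known(predicted, entity_library={"Beijing"})
valid = screen(filtered, Counter(predicted))
```

In this toy run only `"Mobvoi"` survives: `"x"` is too short, `"a-b"` contains a disallowed character, and `"Shanghai"` was predicted only once.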
In this embodiment of the present invention, the third operation module 703 includes: a labeling submodule 7031, configured to label the semi-labeled text set through the entity library to obtain a labeled semi-labeled text set; a selecting submodule 7032, configured to select from the labeled semi-labeled text set through a policy network in reinforcement learning to obtain a selected semi-labeled text set; and a first determining submodule 7033, configured to determine the selected semi-labeled text set as the semi-labeled text set of the third operation.
In an embodiment of the present invention, the loop module 704 includes: a judgment submodule 7041, configured to judge whether a termination condition is met after the third operation of the current round ends; and a second determining submodule 7042, configured to determine the entity library obtained in the current round as the target entity library when the termination condition is judged to be met.
In an embodiment of the present invention, the device further includes an execution submodule 7043, configured to execute the next round of the second operation and the third operation when the termination condition is judged not to be met.
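The control flow of the loop module can be sketched as a small driver that delegates the actual work to callables. This is an illustrative sketch only: `build_target_library`, its parameter names, and the `max_rounds` safety cap are assumptions introduced here, and the training, prediction, and policy-network selection steps are replaced by toy stand-ins.

```python
from typing import Callable, Iterable, Set


def build_target_library(
    seed_library: Set[str],
    second_operation: Callable[[Set[str]], Iterable[str]],
    redetermine_semi_labeled: Callable[[Set[str]], None],
    termination_met: Callable[[int, int], bool],
    max_rounds: int = 100,  # safety cap; not part of the source description
) -> Set[str]:
    library = set(seed_library)
    for round_no in range(1, max_rounds + 1):
        # Second operation: train, predict, filter and screen (all delegated).
        new_entities = set(second_operation(library)) - library
        # Third operation: extend the library, then refresh the semi-labeled set.
        library |= new_entities
        redetermine_semi_labeled(library)
        # Judgment step: stop here, or fall through to the next round.
        if termination_met(round_no, len(new_entities)):
            break
    return library


# Toy stand-ins: each known entity "reveals" at most one new entity.
reveals = {"A": "B", "B": "C"}
target = build_target_library(
    seed_library={"A"},
    second_operation=lambda lib: [reveals[e] for e in lib if e in reveals],
    redetermine_semi_labeled=lambda lib: None,
    termination_met=lambda round_no, added: added == 0,
)
```

Passing the termination check as a callable lets the same driver use either condition described in the text: a limit on the number of rounds or a limit on the number of newly added valid entities.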
Another aspect of the invention provides a computer-readable storage medium comprising a set of computer-executable instructions which, when executed, perform the entity identification method of any one of the above.
In the description herein, reference to the terms "one embodiment," "some embodiments," "an example," "a specific example," or "some examples" means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. In addition, those skilled in the art may combine the various embodiments or examples described in this specification, and the features of different embodiments or examples, provided they do not contradict one another.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "a plurality" means two or more unless specifically defined otherwise.
The above describes only specific embodiments of the present invention, but the scope of protection of the present invention is not limited thereto. Any change or substitution that a person skilled in the art could readily conceive within the technical scope disclosed by the present invention shall fall within the scope of protection of the present invention. Therefore, the scope of protection of the present invention shall be subject to the scope of protection of the claims.

Claims (10)

1. An entity identification method, characterized in that the method comprises:
a first operation of labeling a specified text through an entity library, and determining a training set and a test set corresponding to the specified text; the training set comprises a labeled text set and a semi-labeled text set;
a second operation of training a model through the training set, predicting the test set based on an entity recognition model obtained through training, and screening to obtain a valid entity;
a third operation of adding the valid entity into the entity library, and re-determining a semi-labeled text set based on the entity library;
and repeating the second operation and the third operation to obtain the target entity library.
2. The method of claim 1, wherein predicting the test set based on the trained entity recognition model, and screening for valid entities comprises:
predicting the test set through the entity recognition model obtained through training to obtain a predicted entity;
filtering the predicted entity based on the entity library to obtain a filtered entity;
and screening the filtered entity based on a constraint policy to obtain a valid entity.
3. The method of claim 2, wherein the constraint policy comprises at least one of: a first policy for length constraints, a second policy for character constraints, and a third policy for statistical constraints.
4. The method of claim 3, wherein said re-determining the set of semi-annotated text based on the entity library comprises:
labeling the semi-labeled text set through the entity library to obtain a labeled semi-labeled text set;
selecting from the labeled semi-labeled text set through a policy network in reinforcement learning to obtain a selected semi-labeled text set;
and determining the selected semi-labeled text set as the semi-labeled text set of the third operation.
5. The method of any of claims 1-3, wherein repeating the second and third operations to obtain the target entity library comprises:
after the third operation of the current round is finished, judging whether a termination condition is met;
and when the judgment result shows that the termination condition is met, determining the entity library obtained in the current round as a target entity library.
6. The method of claim 5, further comprising:
and when the termination condition is judged not to be met, executing the next round of second operation and third operation.
7. The method of claim 6, wherein the termination condition is whether the number of rounds of the second and third operations satisfies a loop threshold;
or
the termination condition is whether the number of valid entities added to the entity library in the third operation satisfies a number threshold.
8. An entity identification device, characterized in that the device comprises:
a first operation module, configured to label a specified text through an entity library, and determine a training set and a test set corresponding to the specified text; the training set comprises a labeled text set and a semi-labeled text set;
a second operation module, configured to train a model through the training set, predict the test set based on the entity recognition model obtained through training, and screen to obtain a valid entity;
a third operation module, configured to add the valid entity to the entity library, and re-determine a semi-labeled text set based on the entity library;
and a loop module, configured to repeat the second operation and the third operation to obtain the target entity library.
9. The apparatus of claim 8, wherein the second operational module comprises:
a prediction submodule, configured to predict the test set through the entity recognition model obtained through training to obtain a predicted entity;
a filtering submodule, configured to filter the predicted entity based on the entity library to obtain a filtered entity;
and a screening submodule, configured to screen the filtered entity based on the constraint policy to obtain a valid entity.
10. A computer-readable storage medium comprising a set of computer-executable instructions that, when executed, perform the entity identification method of any of claims 1-7.
CN202010057489.1A 2020-01-19 2020-01-19 Entity identification method, equipment and computer readable storage medium Active CN111259134B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010057489.1A CN111259134B (en) 2020-01-19 2020-01-19 Entity identification method, equipment and computer readable storage medium

Publications (2)

Publication Number Publication Date
CN111259134A true CN111259134A (en) 2020-06-09
CN111259134B CN111259134B (en) 2023-08-08

Family

ID=70948952

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010057489.1A Active CN111259134B (en) 2020-01-19 2020-01-19 Entity identification method, equipment and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN111259134B (en)

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102682763A (en) * 2011-03-10 2012-09-19 北京三星通信技术研究有限公司 Method, device and terminal for correcting named entity vocabularies in voice input text
US20170060835A1 (en) * 2015-08-27 2017-03-02 Xerox Corporation Document-specific gazetteers for named entity recognition
CN106445917A (en) * 2016-09-23 2017-02-22 中国电子科技集团公司第二十八研究所 Bootstrap Chinese entity extracting method based on modes
CN109190110A (en) * 2018-08-02 2019-01-11 厦门快商通信息技术有限公司 A kind of training method of Named Entity Extraction Model, system and electronic equipment
CN110209764A (en) * 2018-09-10 2019-09-06 腾讯科技(北京)有限公司 The generation method and device of corpus labeling collection, electronic equipment, storage medium
CN110134959A (en) * 2019-05-15 2019-08-16 第四范式(北京)技术有限公司 Named Entity Extraction Model training method and equipment, information extraction method and equipment
CN110287480A (en) * 2019-05-27 2019-09-27 广州多益网络股份有限公司 A kind of name entity recognition method, device, storage medium and terminal device
CN110704633A (en) * 2019-09-04 2020-01-17 平安科技(深圳)有限公司 Named entity recognition method and device, computer equipment and storage medium
CN110569366A (en) * 2019-09-09 2019-12-13 腾讯科技(深圳)有限公司 text entity relation extraction method and device and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Liu Tong et al.: "Named Entity Recognition for Emergency Plans Based on Semi-Supervised Learning and CRF" (基于半监督学习与CRF的应急预案命名实体识别), Software Guide (《软件导刊》), no. 03, 2 January 2020 (2020-01-02) *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111832294A (en) * 2020-06-24 2020-10-27 平安科技(深圳)有限公司 Method and device for selecting marking data, computer equipment and storage medium
CN111832294B (en) * 2020-06-24 2022-08-16 平安科技(深圳)有限公司 Method and device for selecting marking data, computer equipment and storage medium
CN113886602A (en) * 2021-10-19 2022-01-04 四川大学 Multi-granularity cognition-based domain knowledge base entity identification method

Also Published As

Publication number Publication date
CN111259134B (en) 2023-08-08

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant