CN114861666A - Entity classification model training method and device and computer readable storage medium

Info

Publication number: CN114861666A
Authority: CN (China)
Prior art keywords: entity, text, template, classification model, original text
Legal status: Pending
Application number: CN202210468468.8A
Other languages: Chinese (zh)
Inventors: 廖佳玲, 都金涛, 祝慧佳
Current Assignee: Alipay Hangzhou Information Technology Co Ltd
Original Assignee: Alipay Hangzhou Information Technology Co Ltd
Application filed by Alipay Hangzhou Information Technology Co Ltd
Priority to CN202210468468.8A
Publication of CN114861666A

Classifications

    • G06F40/295 Named entity recognition
    • G06F40/279 Recognition of textual entities
    • G06F40/186 Templates (text editing)
    • G06F18/214 Generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06F18/217 Validation; performance evaluation; active pattern learning techniques
    • G06F18/2415 Classification techniques based on parametric or probabilistic models
    • G06N20/00 Machine learning

Abstract

The embodiments of this specification provide a training method and apparatus for an entity classification model, and a computer-readable storage medium. The method comprises: acquiring a first sample set, wherein the first sample set comprises a first training sample, the first training sample comprises a first original text and a first template text corresponding to the first original text, and the first template text describes a first entity in the first original text and a plurality of first entity types corresponding to the first entity; inputting the first original text into a first entity classification model to obtain a first prediction result, wherein the first prediction result comprises a first probability score, output sequentially, for each word in the first template text; determining a first prediction loss corresponding to the first training sample according to the first prediction result; and training the first entity classification model based on the prediction loss corresponding to each training sample in the first sample set, so that the entity classification model learns the information of the original text and realizes multi-label classification of entities.

Description

Entity classification model training method and device and computer readable storage medium
Technical Field
One or more embodiments of the present specification relate to the field of machine learning, and more particularly to a method, an apparatus, and a computer-readable storage medium for training an entity classification model.
Background
With the development of deep learning, Artificial Intelligence (AI) models built on deep learning have attracted increasingly wide attention and are widely applied to identifying entity types in text: an AI model identifies the entities in a text and classifies each identified entity to determine the entity type to which it belongs.
In current approaches, each entity in a text is assigned a single entity type. In some scenarios, however, one entity in a text may belong to multiple entity types, and this diversification of the entity types of a single entity poses a challenge for classifying entities in text.
Disclosure of Invention
One or more embodiments of this specification describe a method and an apparatus for training an entity classification model. A template text of an original text is constructed to describe an entity in the original text and the plurality of entity types of that entity, and the entity classification model is trained with the goal of maximizing the probability score of each word in the template text, so that the model learns the information of the original text and can classify an entity into multiple entity types.
According to a first aspect, there is provided a method for training an entity classification model, comprising:
obtaining a first sample set, wherein the first sample set comprises a first training sample, the first training sample comprises a first original text and a first template text corresponding to the first original text, and the first template text is used for describing a first entity in the first original text and a plurality of first entity types corresponding to the first entity;
inputting the first original text into a first entity classification model to obtain a first prediction result, wherein the first prediction result comprises a first probability score, output sequentially, for each word in the first template text;
determining a first prediction loss corresponding to the first training sample according to the first prediction result;
and training the first entity classification model based on the prediction loss corresponding to each training sample in the first sample set.
According to a possible implementation, the first training sample further includes a second template text, where the second template text is used to describe non-entities in the first original text that do not belong to the plurality of first entity types; the first prediction result further includes a second probability score, output sequentially, for each word in the second template text.
According to a possible implementation, the first sample set further includes a second training sample, where the second training sample includes a second original text and a third template text corresponding to the second original text, and the third template text is used to describe a second entity in the second original text and the single second entity type corresponding to the second entity. The method further comprises: inputting the second original text into the first entity classification model to obtain a second prediction result, wherein the second prediction result comprises a third probability score, output sequentially, for each word in the third template text; and determining a second prediction loss corresponding to the second training sample at least according to the second prediction result.
In one embodiment, the second training sample further comprises a fourth template text describing non-entities in the second original text outside the second entity type; the second prediction result further comprises a fourth probability score, output sequentially, for each word in the fourth template text.
In one embodiment, the plurality of first entity types includes the second entity type.
In one embodiment, the method further comprises: obtaining a single-type template, wherein the single-type template comprises an entity slot and an entity type slot; and filling the second entity in the second original text and the single second entity type corresponding to the second entity into the entity slot and the entity type slot of the single-type template, respectively, to obtain the third template text.
According to a possible embodiment, the method further comprises: acquiring a multi-type template corresponding to the first original text, wherein the multi-type template comprises an entity slot and a plurality of entity type slots; and filling the first entity in the first original text and the plurality of first entity types corresponding to the first entity into the entity slot and the plurality of entity type slots of the multi-type template to obtain the first template text, wherein the number of entity type slots is the same as the number of first entity types.
According to a possible embodiment, the method further comprises: obtaining a second sample set, wherein the second sample set comprises a plurality of third training samples, each third training sample comprises a third original text and a corresponding fifth template text, and the fifth template text is used to describe a third entity in the corresponding third original text and the single third entity type corresponding to the third entity; inputting the third original text into a second entity classification model to obtain a third prediction result, wherein the third prediction result comprises a fifth probability score, output sequentially, for each word in the fifth template text; and training the second entity classification model based on the third prediction result corresponding to each third training sample in the second sample set, and using the trained second entity classification model as the first entity classification model.
In one embodiment, the third entity type is any one of the plurality of first entity types; the number of samples in the second set of samples is greater than the number of samples in the first set of samples.
According to a possible implementation, the first entity classification model comprises an encoder and a decoder, and inputting the first original text into the first entity classification model to obtain the first prediction result comprises: encoding the first original text with the encoder to obtain a coding vector; decoding the coding vector with the decoder and outputting a word probability distribution for each of a plurality of times; and, for each target time among the plurality of times, determining the probability of a target word from the word probability distribution corresponding to that target time and including it in the first probability score, the target word being the word in the first template text whose position corresponds to the target time.
According to a possible implementation manner, the plurality of first entity types include a mobile phone number and an account number of an application program.
According to a second aspect, there is provided a training apparatus for an entity classification model, comprising:
a sample set acquisition module configured to acquire a first sample set, wherein the first sample set comprises a first training sample, the first training sample comprises a first original text and a corresponding first template text, and the first template text is used to describe a first entity in the first original text and a plurality of first entity types corresponding to the first entity;
a prediction module configured to input the first original text into a first entity classification model to obtain a first prediction result, wherein the first prediction result comprises a first probability score, output sequentially, for each word in the first template text;
a loss determination module configured to determine a first prediction loss corresponding to the first training sample according to the first prediction result;
a training module configured to train the first entity classification model based on the predicted loss corresponding to each training sample in the first set of samples.
According to a third aspect, there is provided a computer readable storage medium having stored thereon a computer program which, when executed in a computer, causes the computer to perform the method of the first aspect.
According to a fourth aspect, there is provided a computing device comprising a memory and a processor, wherein the memory has stored therein executable code, and wherein the processor, when executing the executable code, implements the method of the first aspect.
In the embodiments of this specification, a template text of an original text is constructed to describe one entity in the original text and the plurality of entity types of that entity, and the entity classification model is trained with the goal of maximizing the probability score of each word in the template text, so that the model learns the information of the original text and realizes multi-entity-type classification of the entity.
Drawings
To illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below are merely some embodiments of the present invention; those skilled in the art can obtain other drawings based on them without creative effort.
FIG. 1 illustrates a schematic diagram of entity classification by an entity classification model in one approach;
FIG. 2 illustrates a flow diagram of a scheme for training an entity classification model in one embodiment;
FIG. 3 illustrates a schematic diagram of determining the probability distribution of a template text in one embodiment;
FIG. 4 illustrates a flow diagram of a method of entity classification model training in one embodiment;
FIG. 5 is a block diagram of an apparatus for entity classification model training in accordance with one embodiment.
Detailed Description
The scheme provided by the specification is described below with reference to the accompanying drawings.
Entity extraction, also known as Named Entity Recognition (NER), aims to identify the text spans of entities in a text and classify them into predefined entity categories; entities may include persons, place names, organizations, dates, currencies, percentages, and so on. NER is the basis of question-answering systems, translation systems, and knowledge graphs.
Currently, an entity in a text is identified by an entity classification model and classified into one of a plurality of preset entity categories. FIG. 1 shows a schematic diagram of entity classification by an entity classification model in one approach. As shown in FIG. 1, the entity classification model includes an encoder and a CRF (Conditional Random Field) layer. In practical applications, the vectorized representation vectors w1, w2, …, wM of the words in a text are determined; w1, w2, …, wM are input into the encoder to obtain the classification probability distributions P1, P2, …, PM of w1, w2, …, wM, where each classification probability distribution contains the probability values of a plurality of preset categories 0, A, B, …, N, with 0 denoting the other category, typically the non-entity category; finally, the M outputs yield the categories corresponding to w1, w2, …, wM, and words with the same category (excluding category 0) are combined into an entity. The representation vector of a word can be understood as the vector obtained by initially encoding the word, thereby quantizing the information of the word. In one example, the representation vector fuses at least a token embedding and a position embedding: for any word in the text, a word vector table (containing the coding vectors of a large number of words) is looked up, and the coding vector of the matched word is the token embedding of the word; a position vector table (containing the coding vectors of a number of sequence positions) is looked up based on the sequence number of the word (obtained by ordering the words in the text), and the coding vector of the matched sequence number is the position embedding of the word. A minimal sketch of this fusion follows.
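To make the vectorization concrete, the following minimal Python sketch fuses a token embedding and a position embedding as described above. It is an illustration under assumptions: the vocabulary size, embedding dimension, and all names are invented here and are not details of the patent.

```python
import torch
import torch.nn as nn

# Assumed sizes, for illustration only.
VOCAB_SIZE, MAX_LEN, DIM = 21128, 512, 768

token_table = nn.Embedding(VOCAB_SIZE, DIM)      # word vector table
position_table = nn.Embedding(MAX_LEN, DIM)      # position vector table

def vectorize(token_ids: torch.Tensor) -> torch.Tensor:
    """Fuse token embedding and position embedding for one text.

    token_ids: LongTensor (seq_len,) holding the word indices w1..wM.
    Returns:   FloatTensor (seq_len, DIM), one representation vector per word.
    """
    positions = torch.arange(token_ids.size(0))  # sequence numbers of the words
    return token_table(token_ids) + position_table(positions)

# Example: a 5-word text.
reps = vectorize(torch.tensor([101, 2769, 4263, 872, 102]))
print(reps.shape)  # torch.Size([5, 768])
```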
However, in some scenarios, while the entities in the large volume of text used to train an entity classification model each belong to a single entity category, the entities in a small number of samples belong to multiple entity categories. With each entity category serving as one label, this raises a small-sample problem for multi-label classification. Moreover, because one of the several entity categories to which an entity belongs may dominate, the other entity categories are difficult to predict, even though they may carry important meaning. The entity classification model therefore needs to learn the information of the small samples to achieve multi-label classification; however, the model described above outputs a single category for each word of a text and cannot predict multiple entity types for one entity. The small-sample multi-label classification problem is thus an urgent one to solve.
For example, the small samples for multi-label classification may be case transcript texts. To make the small-sample multi-label classification problem easier to understand, case transcript texts are used as a detailed example below.
The electronic entry of case data currently depends entirely on manual work: after a case is recorded on paper, a case entry file is obtained, and a worker fills each element of the case (which can also be understood as an entity) from the case entry file into a page form item by item. This has two problems: 1) manual entry is prone to typing errors; 2) for massive numbers of case transcript files, manual entry is inefficient, since the dozens of elements of each case must be filled in one by one, which is time-consuming and error-prone. Extracting the elements (entities) of case transcript texts with an entity classification model therefore brings partial automation and reduces the time spent on manual entry; it also ensures that the elements (entities) are selected from the case transcript text, reducing the possibility of typing errors.
In the process of extracting the elements (entities) of case transcript texts with an entity classification model, a labeled sample set is constructed from a massive number of case transcript texts. The inventors found that a small number of elements (entities) have several entity types at the same time, i.e., match multiple labels, while a large number of entities have only one entity type, i.e., a single label; in addition, only a small amount of manually labeled data annotates all of the entity types an entity belongs to, which gives rise to the multi-label small-sample problem. For example, a case transcript text may contain "suspect account: 17712345678". Here "17712345678" may be the suspect's Alipay account or the suspect's mobile phone number, and the suspect's account must be classified, so "17712345678" needs to be mapped by the entity classification model to multiple labels such as [suspect's mobile phone number] and [suspect's Alipay account]. In this mapping, whether it is a "mobile phone number" is easy to judge, but whether it is an "Alipay account" is difficult to judge from the context; as a result, the model predicts it as the suspect's mobile phone number and essentially never predicts the suspect's Alipay account. For the actual business, however, when a mobile phone number in a case transcript text really is an Alipay account, predicting the label [suspect's Alipay account] is highly significant. The entity classification model therefore needs to learn the information of the small samples so as to predict an entity together with its multiple labels.
To solve the above small-sample multi-label classification problem, in some embodiments of this specification a template text of an original text is constructed (describing one entity in the original text and the multiple labels of that entity), the number of outputs of the entity classification model is designed to match the number of words in the template text (i.e., one output per word), and the entity classification model is trained with the goal of maximizing the probability score of each word in the template text, so that the model learns the information of the small number of original texts containing multi-labeled entities and realizes multi-label classification of entities.
It should be noted that although some entities in original texts have multiple labels (for ease of description and distinction, such texts are called multi-label original texts), the number of multi-label original texts is small, while texts whose entities have only single labels (called single-label original texts) are plentiful. The embodiments of this specification therefore rest on the following technical idea: first train the entity classification model on a large number of single-label original texts, so that it learns the contextual information of single labels; then train it again on a small number of multi-label original texts, so that, on top of the learned single-label context, it fully learns the relationships among multiple labels. In this way the model makes full use of the information in the small samples, predicts an entity and its multiple labels more accurately, and solves the small-sample multi-label classification problem. Notably, a model trained with this idea can also predict an entity and its single label.
The following describes a training scheme of the entity classification model provided in the embodiments of the present specification.
FIG. 2 illustrates a schematic diagram of training the entity classification model in one embodiment. As shown in FIG. 2, the process is as follows. Template texts of each single-label original text are constructed (each describing a single entity in the text and the label of that entity), and the single-label original text together with its template texts forms a single-label training sample, yielding a plurality of single-label training samples; FIG. 2 shows single-label training samples 1, 2, …. Template texts of each multi-label original text are constructed (each describing a single entity in the text and the multiple labels of that entity), and the multi-label original text together with its template texts forms a multi-label training sample, yielding a plurality of multi-label training samples; FIG. 2 shows multi-label training samples 1, 2, …. Two data sets are then built from these training samples: one (called the single-label sample set for distinction) contains only single-label training samples, and the other (called the multi-label sample set) contains at least multi-label training samples. The entity classification model is first trained on the single-label sample set, so that it learns the contextual information of single labels; the trained model is then further trained on the multi-label sample set, so that, on top of the learned single-label context, it fully learns the relationships among the multiple labels, yielding the final entity classification model. A minimal sketch of this two-stage schedule is given below.
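In this sketch, the train_stage helper, the loss callback, and all hyperparameters are assumptions for illustration, not the patent's API:

```python
import torch
from typing import Callable, List, Tuple

Sample = Tuple[str, List[str]]  # (original text, its template texts) -- assumed shape

def train_stage(model: torch.nn.Module, samples: List[Sample],
                loss_fn: Callable, epochs: int, lr: float = 1e-5) -> torch.nn.Module:
    """Run one training stage over one sample set."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for text, templates in samples:
            opt.zero_grad()
            loss = loss_fn(model, text, templates)  # e.g. averaged formula (1) below
            loss.backward()
            opt.step()
    return model

# Stage 1: large single-label sample set -> learn single-label context.
# model = train_stage(model, single_label_set, template_set_loss, epochs=3)
# Stage 2: small multi-label sample set -> learn relations among labels.
# model = train_stage(model, multi_label_set, template_set_loss, epochs=10)
```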
It should be noted that the number of template texts for a single-label or multi-label original text is not specifically limited; it should account for the number of labeled entities in the text, and further, to ensure that the model learns the difference between entities and non-entities, it may also need to account for the number of non-entities (selected from the unlabeled content of the text).
Here, the multi-label and single-label original texts are the texts on which entity recognition and classification are to be performed. They may be entered manually or obtained by recognizing pictures containing the texts through OCR (Optical Character Recognition); the embodiments of this specification do not limit this. For example, a plurality of original texts can be obtained by performing OCR on case entry files.
For example, the multiple labels of an entity in a multi-label original text may include a mobile phone number and the account number of an application program; the application may be, for example, Alipay or DingTalk. It should be understood that these multi-labels are merely exemplary, not limiting, and need to be determined for the actual scenario.
A multi-label original text may also include single-label entities.
A template text can be a multi-label template text, a single-label template text, or a non-entity template text. A multi-label template text describes an entity and its multiple labels; a single-label template text describes an entity and its single label; a non-entity template text describes a non-entity that belongs to none of the entity types.
Specifically, a template library may be designed whose templates include single-label templates (with one entity slot and one entity type slot), multi-label templates (with one entity slot and several entity type slots), and non-entity templates (with one entity slot). Correspondingly, for any labeled entity in a multi-label original text, a matching template is selected from the library based on the entity's number of labels, and the entity and each of its labels are automatically filled into the entity slot and entity type slots of the template, yielding a template text of the multi-label original text (called a positive template text for ease of distinction); a positive template text is constructed for every labeled entity. Further, for a multi-label original text, a non-entity template may also be selected, and an unlabeled non-entity in the text (for example, several consecutive unlabeled words, usually determined by random sampling) is filled into its entity slot, yielding a negative template text. In practical applications, the number of negative template texts of an original text may be 1.5 times the number of positive template texts. Template texts of single-label original texts are obtained similarly and are not described again.
For example, a multi-label template: [candidate span] is both an [entity type] element and an [entity type] element; a single-label template: [candidate span] is an [entity type] element; a non-entity template: [candidate span] is not an element. Here [candidate span] denotes the entity slot and [entity type] denotes an entity type slot. A sketch of this slot filling follows.
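The slot filling could be realized as below, assuming English renderings of the three templates; the template strings, the helper's name, and the two-label limit are illustrative assumptions:

```python
# Assumed English renderings of the templates listed above.
SINGLE = "{span} is a {t0} element"
MULTI = "{span} is both a {t0} element and a {t1} element"
NON_ENTITY = "{span} is not an element"

def build_template_text(span: str, types: list) -> str:
    """Pick a template by label count and fill its entity/entity-type slots."""
    if not types:                 # negative (non-entity) template
        return NON_ENTITY.format(span=span)
    if len(types) == 1:           # single-label positive template
        return SINGLE.format(span=span, t0=types[0])
    # Multi-label positive template (two labels shown; more labels need a wider template).
    return MULTI.format(span=span, t0=types[0], t1=types[1])

print(build_template_text("153×00",
                          ["suspect's mobile phone number", "suspect's Alipay account"]))
# 153×00 is both a suspect's mobile phone number element and a suspect's Alipay account element
```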
Example 1, a single-label original text: "Li called the victim Wang (mobile phone number 17733300000)", where the label of [17733300000] is [suspect's mobile phone number]. Correspondingly, there is one positive template text for this original text: "17733300000 is a suspect's mobile phone number element"; there may be several negative template texts, such as "Li called is not an element", "called the victim is not an element", and "the victim Wang is not an element".
Example 2, a multi-label original text: "The victim transferred 5000 yuan to Wang's Alipay account 153×00", where the two labels of [153×00] are [suspect's mobile phone number] and [suspect's Alipay account]. Correspondingly, there is one positive template text: "153×00 is both a suspect's mobile phone number element and a suspect's Alipay account element"; there may be several negative template texts, for example: "to Wang is not an element", "transferred is not an element", "the victim is not an element".
Example 3, a multi-label original text (which also includes a single-label entity): "Li called the victim Wang (mobile phone number 17733300000) and transferred 10000 yuan to the Alipay account abc@163.com", where the label of [17733300000] is [suspect's mobile phone number], and the labels of [abc@163.com] are [suspect's Alipay account] and [suspect's mailbox]. Correspondingly, there are two positive template texts: "17733300000 is a suspect's mobile phone number element"; "abc@163.com is both a suspect's Alipay account element and a suspect's mailbox element". There may be several negative template texts, for example: "Li called is not an element"; "to the victim is not an element"; "transferred is not an element".
It should be noted that the embodiments of this specification use two labels only as an example, not as a limitation; in practice the number of labels of a single entity is not limited and may be 3, 4, or even more, depending on the actual situation.
Some or all of the training samples in the multi-label sample set are multi-label training samples. Considering that the number of multi-label original texts is small and may not guarantee the model's performance, in practical applications part of the training samples in the multi-label sample set are multi-label training samples and part are single-label training samples; for example, multi-label training samples may account for 25% of the multi-label sample set and single-label training samples for 75%, as in the sketch below.
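Such a mix could be assembled as below; the 25%/75% ratio comes from the example above, while the sampling scheme, seed, and names are assumptions:

```python
import random

def build_multi_label_set(multi_samples, single_samples, multi_ratio=0.25, seed=0):
    """Mix samples so multi-label samples make up `multi_ratio` of the set.

    Assumes single_samples is large enough to supply the remainder.
    """
    rng = random.Random(seed)
    n_single = round(len(multi_samples) * (1 - multi_ratio) / multi_ratio)
    mixed = list(multi_samples) + rng.sample(list(single_samples), n_single)
    rng.shuffle(mixed)
    return mixed

# e.g. 100 multi-label samples + 300 sampled single-label samples -> 25% / 75%.
```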
Notably, to ensure that the entity classification model can learn the relationships between different labels, the multiple labels of the entities in the multi-label original texts preferably include the single labels that entities in the single-label original texts have.
Every training sample in the single-label sample set is a single-label training sample, drawn from the training samples outside the multi-label sample set. The set of all labels of the single-label sample set may be completely different from, partially the same as, or completely the same as that of the multi-label sample set. For example, the labels of the single-label sample set may include mobile phone number, mailbox, and Alipay account; the labels of the multi-label sample set may include mobile phone number, mailbox, Alipay account, mobile phone number + mailbox, mobile phone number + Alipay account, and mobile phone number + mailbox + Alipay account.
The training process of the entity classification model is described in detail next.
The entity classification model is first trained on the single-label sample set. In practical applications, for any single-label training sample in the set, the single-label original text of the sample is fed into the entity classification model, which sequentially outputs a word probability distribution for each of a plurality of time steps (called the first probability distributions for ease of description and distinction). Then, matching the order of the time steps to the order of the words of each template text in the sample, a word probability distribution of each template text is obtained (called a second probability distribution). Each first probability distribution contains the probability scores of a plurality of preset words; each second probability distribution contains the probability score of every word in the corresponding template text.
FIG. 3 illustrates a schematic diagram of determining the probability distribution of a template text in one embodiment. As shown in FIG. 3, the entity classification model is an end-to-end model comprising an encoder and a decoder. Suppose a single-label training sample includes a single-label original text and N template texts. The original text is input into the model, which sequentially outputs the first probability distributions P1, P2, … at times t1, t2, …; based on this output, the second probability distributions Q1, Q2, …, QN of the N template texts are determined. Here the order of the times t1, t2, … matches the order of the words in each template text; for example, time t1 corresponds to the first word c11 of template text 1, and time t2 to its second word c12. Each first probability distribution contains the probability scores of a plurality of preset words X1, X2, X3, …, represented as a vector whose dimension equals the number of preset words. How the second probability distribution Q1 of template text 1 is determined is explained below; the other template texts are handled similarly. For the first word c11 of template text 1, the probability score of c11 in the first probability distribution P1 is taken as the probability score Q11 of c11; by analogy, the probability scores Q12, … of the subsequent words c12, … are determined, yielding the second probability distribution Q1. Note that the number of outputs of the entity classification model, i.e., the number of first probability distributions, must cover the word count of the longest template text; for example, if a template text has 20 words, the model outputs at least 20 time steps, and 20 first probability distributions are obtained sequentially in practical applications. It should be understood that what is actually input to the encoder (not shown in the figure) is the vectorized representation of each word of the single-label original text; the vectorization method is described above with reference to FIG. 1 and is not repeated. A sketch of this gathering step follows.
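The gathering can be sketched as below, assuming the decoder's per-step distributions P1, P2, … are stacked into one tensor; shapes and names are assumptions:

```python
import torch

def template_scores(step_dists: torch.Tensor, template_ids: torch.Tensor) -> torch.Tensor:
    """Read each template word's probability off the matching decoding step.

    step_dists:   (T, V) -- T decoding steps over V preset words; row t is P_t.
    template_ids: (N,)   -- word indices c1..cN of one template text, N <= T.
    Returns:      (N,)   -- Q: the probability score of each template word.
    """
    return step_dists[: template_ids.size(0)].gather(
        1, template_ids.unsqueeze(1)).squeeze(1)

# Example: 4 decoding steps over a 10-word vocabulary, one 3-word template text.
dists = torch.softmax(torch.randn(4, 10), dim=-1)
print(template_scores(dists, torch.tensor([2, 7, 5])))
```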
Then, for each template text, the template prediction loss of that template text is determined from its second probability distribution; the model prediction loss is then determined from the template prediction losses of all template texts in the single-label sample set, and the entity classification model is trained with the goal of minimizing the model prediction loss, thereby maximizing the probability score of each word in each template text and yielding the trained model. The model prediction loss may be the average of the template prediction losses of the template texts in the single-label sample set. It should be noted that the trained model has thereby learned the contextual information of single labels. In one example, the template prediction loss of a template text can be calculated with the loss function shown in the following formula (1):
$$L_i = -\frac{1}{N}\sum_{j=1}^{N}\log P_{i,j} \qquad (1)$$

where $P_{i,j}$ denotes the probability score of the j-th word in the i-th template text, and $N$ denotes the number of words in the i-th template text.
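In code, formula (1) and the averaging of template losses described above might look as follows; this is a sketch consistent with those definitions, not a verbatim implementation:

```python
import torch

def template_loss(word_scores: torch.Tensor) -> torch.Tensor:
    """Formula (1): negative mean log probability score of one template's words."""
    return -torch.log(word_scores).mean()

def model_loss(score_lists) -> torch.Tensor:
    """Model prediction loss: average of the template losses over the sample set."""
    return torch.stack([template_loss(s) for s in score_lists]).mean()

# Example: loss of one 3-word template text.
print(template_loss(torch.tensor([0.9, 0.8, 0.7])))
```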
It should be noted that formula (1) is merely an example of the loss function, not a limitation; the loss function may be designed appropriately for the specific entity classification model.
The trained entity classification model is then further trained on the multi-label sample set, yielding the final entity classification model. This training process is similar to the training on the single-label sample set, differing only in the training samples, and is not repeated here. The final model has learned the relationships among multiple labels and can predict a single entity together with its single label or multiple labels.
In summary, with the training method for an entity classification model provided by the embodiments of this specification, the entity classification model can learn the information of the multi-label small samples and predict the multiple labels, or a single label, of a single entity.
FIG. 4 illustrates a flowchart of a method for training an entity classification model according to one embodiment; the method may be performed by any apparatus, device, platform, or device cluster with computing and processing capabilities. For ease of description and distinction, the ordinals "first", "second", … are prefixed to terms such as training sample, original text, entity, entity type, sample set, probability score, prediction result, template text, and entity classification model merely to distinguish them; they carry no special meaning. In the embodiment shown in FIG. 4, the method includes the following steps:
step 41, obtaining a first sample set, where the first sample set includes a first training sample, where the first training sample includes a first original text and a first template text corresponding to the first original text, and the first template text is used to describe a first entity in the first original text and a plurality of first entity types corresponding to the first entity.
The first sample set may correspond to the aforementioned multi-label sample set and includes at least a first training sample (corresponding to a multi-label training sample). The first training sample includes a first original text (corresponding to a multi-label original text) and a corresponding first template text (corresponding to a positive template text), which describes a first entity in the first original text and the plurality of first entity types corresponding to it. In one example, the first template text may be obtained as follows: determine a first entity in the first original text and the plurality of first entity types corresponding to it; acquire a multi-type template matching the number of first entity types, the multi-type template comprising an entity slot and a plurality of entity type slots; and fill the first entity and its first entity types into the entity slot and entity type slots of the multi-type template to obtain the first template text. See examples 2 and 3 above.
In some embodiments, the first training sample may also include a negative template text of the first original text, referred to as a second template text, which describes a non-entity in the first original text that belongs to none of the plurality of first entity types. It can be generated from a corresponding template in the template library: acquire a non-entity template containing an entity slot, and fill a non-entity of the first original text (other than the first entity) into that slot, obtaining the second template text. See the negative template texts in examples 1, 2, and 3 above.
In addition, in some possible embodiments, the first original text further includes an entity with a single entity type, and the first training sample further includes a corresponding positive template text describing that entity and its single entity type. A first training sample may thus include multiple template texts.
Next, in step 42, the first original text is input into the first entity classification model to obtain a first prediction result, which includes a first probability score, output sequentially, for each word of the first template text.
According to one possible embodiment, the first entity classification model includes an encoder and a decoder: the encoder encodes the input text to obtain a coding vector, and the decoder decodes the coding vector to output a word probability distribution for each of a plurality of time steps (corresponding to the aforementioned first probability distributions).
Correspondingly, when the first original text is input into the first entity classification model, the encoder encodes it into a coding vector; the decoder decodes the coding vector and outputs a word probability distribution for each of a plurality of time steps. The first probability score can be determined from these distributions. Specifically, the order of the time steps corresponds to the order of the words in the first template text: for each target time step, the probability of the target word, i.e., the word of the first template text whose position corresponds to that time step, is read from the word probability distribution of that time step and included in the first probability score. In this way, the first probability score, output sequentially, of each word in the first template text is obtained and included in the first prediction result.
When the first training sample also includes the second (negative) template text, a second probability score, output sequentially, for each word in the second template text can likewise be determined from the decoder's per-step word probability distributions and included in the first prediction result. The second probability score is determined like the first, except that the target word of any target time step is the word of the second template text whose position corresponds to that time step.
When the first training sample also includes other template text, a probability score for the template text may be determined in a similar manner, and included in the first prediction result.
Then, in step 43, a first prediction loss corresponding to the first training sample is determined according to the first prediction result. The predicted loss may be determined in the form of various loss functions, which are not limited herein.
The above describes how the prediction loss of a single (multi-label) training sample is determined; the prediction loss of every sample in the first sample set can be determined in this way. Then, in step 44, the first entity classification model is trained based on the prediction loss corresponding to each training sample in the first sample set.
Specifically, the model prediction loss may be determined from the prediction losses of the training samples in the first sample set, for example as their sum; the entity classification model is then trained with the goal of minimizing the model prediction loss, thereby maximizing the probability score of each word in each template text and yielding the trained model.
The above describes the processing of the multi-label samples of the multi-label sample set. As previously described, in some embodiments the first sample set may also include a second training sample (corresponding to the aforementioned single-label training sample). The second training sample includes a second original text (corresponding to a single-label original text) and a corresponding third template text (a positive template text) describing the second entity in the second original text and its single second entity type. In one example, the third template text may be determined as follows: acquire a single-type template comprising an entity slot and an entity type slot; then fill the second entity of the second original text and its single second entity type into the entity slot and entity type slot of the single-type template to obtain the third template text. See examples 1 and 3 above.
In some examples, the second training sample may also include a fourth template text (corresponding to the negative template text described above), which is text describing non-entities in the second original text other than the second entity type. The obtaining method is similar to the process of obtaining the second template text, and is not repeated.
When the second training sample is included in the first sample set, the training process further includes: inputting the second original text into the first entity classification model to obtain a second prediction result, which includes a third probability score, output sequentially, for each word in the third template text.
Similar to obtaining the first prediction result, when the second original text is input into the first entity classification model, the decoder outputs a word probability distribution for each of a plurality of time steps for the second original text, from which the second prediction result is determined. Specifically, with each word of the third template text as the target word of the corresponding time step, the third probability score, output sequentially, of each word in the third template text is determined from the per-step distributions and included in the second prediction result. When the second training sample also includes a fourth template text, each word of the fourth template text is likewise taken as the target word of its time step, and the fourth probability score, output sequentially, of each word in the fourth template text is determined and included in the second prediction result.
In this way, the second prediction loss corresponding to the second training sample can be determined according to the second prediction result.
In this case, the prediction losses corresponding to the training samples in the first sample set in step 44 include not only the first prediction losses of the first training samples but also the second prediction losses of the second training samples; the first entity classification model is trained on a combined sample set comprising both multi-label and single-label samples.
According to one possible implementation, the first entity classification model is a model pre-trained with a large number of single-label samples. In this implementation, the following process is also performed before the aforementioned step 41.
A second sample set (corresponding to the aforementioned single-label sample set) is acquired. Each third training sample it contains (corresponding to a single-label training sample of the single-label sample set) includes a third original text and a corresponding fifth template text (a positive template text) describing a third entity in that original text and the single third entity type corresponding to it; it is obtained similarly to the third template text. In some examples, the third training sample may also include a sixth template text (a negative template text) describing a non-entity of the third original text outside the third entity type, obtained similarly to the second template text. Then, each third original text of the second sample set is input into a second entity classification model (the initial model) to obtain a third prediction result, in a manner similar to the first prediction result; this result includes at least a fifth probability score, output sequentially, for each word of the fifth template text, and, if the third original text also has a sixth template text, a sixth probability score, output sequentially, for each word of the sixth template text. The second entity classification model is then trained based on the third prediction results of the third training samples of the second sample set, and the trained second entity classification model is used as the first entity classification model; it corresponds to the entity classification model trained on the single-label sample set. Here, the first and second entity classification models have the same model structure but different model parameters.
Reviewing the above process: in the embodiments of this specification, a template text of an original text is constructed (describing one entity in the original text and the multiple labels of that entity), the number of outputs of the entity classification model is designed to match the number of words in the template text (one output per word), and the model is trained with the goal of maximizing the probability score of each word in the template text, so that it learns the information of the small number of original texts containing multi-labeled entities and realizes multi-label classification of entities.
According to an embodiment of another aspect, an apparatus for training an entity classification model is also provided. FIG. 5 shows a schematic structural diagram of the apparatus according to an embodiment; it can be deployed in any device, platform, or device cluster with data storage, computing, and processing capabilities. As shown in FIG. 5, the apparatus 500 includes:
a sample set obtaining module 51, configured to obtain a first sample set, where the first sample set includes a first training sample, where the first training sample includes a first original text and a first template text corresponding to the first original text, and the first template text is used to describe a first entity in the first original text and a plurality of first entity types corresponding to the first entity;
a prediction module 52 configured to input the first original text into a first entity classification model to obtain a first prediction result, wherein the first prediction result includes a first probability score, output sequentially, for each word in the first template text;
a loss determining module 53, configured to determine a first prediction loss corresponding to the first training sample according to the first prediction result;
a training module 54 configured to train the first entity classification model based on the predicted loss for each training sample in the first set of samples.
In each embodiment, these modules are specifically configured to perform the corresponding steps of the method described above with reference to FIG. 4, which are not repeated here.
With this apparatus, a multi-label template text of an original text is constructed (describing one entity in the original text and the multiple labels of that entity), the constructed multi-label template text serves as the output of the entity classification model, the probability score of each word in the multi-label template text is predicted, and the model is trained with the goal of maximizing those probability scores, so that the entity classification model learns the information of the small number of original texts (containing multi-labeled entities) and realizes multi-label classification of entities.
According to an embodiment of another aspect, there is also provided a computer-readable storage medium having stored thereon a computer program which, when executed in a computer, causes the computer to perform the method described in connection with fig. 3.
According to an embodiment of yet another aspect, there is also provided a computing device comprising a memory and a processor, the memory having stored therein executable code, the processor, when executing the executable code, implementing the method described in connection with fig. 3.
Those skilled in the art will recognize that, in one or more of the examples described above, the functions described in this invention may be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, the functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium.
The objects, technical solutions, and advantages of the present invention have been described in further detail through the above specific embodiments. It should be understood that the above are merely exemplary embodiments of the present invention and are not intended to limit the scope of the present invention; any modification, equivalent substitution, improvement, or the like made on the basis of the technical solutions of the present invention shall be included in the scope of the present invention.

Claims (14)

1. A method for training an entity classification model comprises the following steps:
obtaining a first sample set, wherein the first sample set comprises a first training sample, the first training sample comprises a first original text and a first template text corresponding to the first original text, and the first template text is used for describing a first entity in the first original text and a plurality of first entity types corresponding to the first entity;
inputting the first original text into a first entity classification model to obtain a first prediction result, wherein the first prediction result comprises a first probability score of each word in the first template text which is sequentially output;
determining a first prediction loss corresponding to the first training sample at least according to the first prediction result;
and training the first entity classification model based on the prediction loss corresponding to each training sample in the first sample set.
2. The method of claim 1, wherein the first training sample further comprises a second template text describing non-entities of the first original text that do not belong to the plurality of first entity types;
the first prediction result further comprises a second probability score of each word in the second template text which is sequentially output.
3. The method of claim 1, wherein the first sample set further comprises a second training sample, the second training sample comprising a second original text and a corresponding third template text describing a second entity in the second original text and a single second entity type to which the second entity corresponds;
the method further comprises the following steps: inputting the second original text into the first entity classification model to obtain a second prediction result, wherein the second prediction result comprises a third probability score of each word in a third template text which is sequentially output;
and determining a second prediction loss corresponding to the second training sample at least according to the second prediction result.
4. The method of claim 3, wherein the second training sample further comprises a fourth template text describing non-entities in the second original text other than the second entity type;
the second prediction result further comprises a fourth probability score of each word in the fourth template text which is sequentially output.
5. The method of claim 3, wherein the plurality of first entity types includes the second entity type.
6. The method of claim 3, wherein the method further comprises:
obtaining a single-type template, wherein the single-type template comprises an entity slot position and an entity type slot position;
and filling the second entity in the second original text and the single second entity type corresponding to the second entity into the entity slot position and the entity type slot position in the single-type template, respectively, to obtain the third template text.
7. The method of claim 1, wherein the method further comprises:
acquiring a multi-type template corresponding to the first original text, wherein the multi-type template comprises an entity slot position and a plurality of entity type slot positions;
filling a first entity in the first original text and a plurality of first entity types corresponding to the first entity into the entity slot position and the plurality of entity type slot positions in the multi-type template, respectively, to obtain the first template text; the number of the plurality of entity type slot positions is the same as the number of the plurality of first entity types.
8. The method of claim 1, wherein the method further comprises:
obtaining a second sample set, wherein the second sample set comprises a plurality of third training samples, each third training sample comprises a third original text and a corresponding fifth template text, and the fifth template text is used for describing a third entity in the corresponding third original text and a single third entity type corresponding to the third entity;
inputting the third original text into a second entity classification model to obtain a third prediction result, wherein the third prediction result comprises a fifth probability score of each word in a fifth template text which is sequentially output;
and training the second entity classification model based on a third prediction result corresponding to each third training sample in the second sample set, and taking the trained second entity classification model as the first entity classification model.
9. The method of claim 8, wherein the third entity type is any one of the plurality of first entity types;
the number of samples in the second set of samples is greater than the number of samples in the first set of samples.
10. The method of claim 1, wherein the first entity classification model comprises an encoder and a decoder, and the inputting the first original text into a first entity classification model to obtain a first prediction result comprises:
encoding the first original text by using the encoder to obtain an encoding vector;
decoding the coding vector by using the decoder, and outputting a word probability distribution corresponding to each of a plurality of times;
for each target time in the plurality of times, determining, from the word probability distribution corresponding to the target time, the probability of a target word, the target word being the word in the first template text whose sequential position corresponds to the target time, and including the probability of the target word in the first probability score.
11. The method of claim 1, wherein the plurality of first entity types comprise a mobile phone number and an account number of an application.
12. An apparatus for training an entity classification model, comprising:
the system comprises a sample set acquisition module, a first training sample acquisition module and a second training sample acquisition module, wherein the first training sample comprises a first original text and a corresponding first template text, and the first template text is used for describing a first entity in the first original text and a plurality of first entity types corresponding to the first entity;
the prediction module is configured to input the first original text into a first entity classification model to obtain a first prediction result, wherein the first prediction result comprises a first probability score of each word in a first template text which is sequentially output;
a loss determination module configured to determine a first prediction loss corresponding to the first training sample according to at least the first prediction result;
a training module configured to train the first entity classification model based on the predicted loss corresponding to each training sample in the first set of samples.
13. A computer-readable storage medium, on which a computer program is stored which, when executed in a computer, causes the computer to carry out the method of any one of claims 1-11.
14. A computing device, comprising a memory and a processor, wherein the memory has executable code stored therein, and the processor, when executing the executable code, implements the method of any one of claims 1-11.
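To illustrate the slot filling recited in claims 6 and 7, the hypothetical helper below fills an entity slot and one or more entity-type slots; the template wording and the example values are illustrative English stand-ins, not the patent's actual template.

    def build_template_text(entity, entity_types):
        # Claim 6 case: a single entity-type slot (single-type template).
        # Claim 7 case: as many entity-type slots as there are first entity
        # types (multi-type template).
        filled_types = ", ".join(entity_types)  # one slot per entity type
        return f"{entity} is a {filled_types} entity"

    # single second entity type (claim 6)
    print(build_template_text("13800138000", ["mobile phone number"]))
    # plurality of first entity types (claim 7)
    print(build_template_text(
        "13800138000",
        ["mobile phone number", "account number of an application"]))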
CN202210468468.8A 2022-04-29 2022-04-29 Entity classification model training method and device and computer readable storage medium Pending CN114861666A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210468468.8A CN114861666A (en) 2022-04-29 2022-04-29 Entity classification model training method and device and computer readable storage medium

Publications (1)

Publication Number Publication Date
CN114861666A 2022-08-05

Family

ID=82635555

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210468468.8A Pending CN114861666A (en) 2022-04-29 2022-04-29 Entity classification model training method and device and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN114861666A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115859988A (en) * 2023-02-08 2023-03-28 成都无糖信息技术有限公司 Entity account extraction method and system for social text
CN115859988B (en) * 2023-02-08 2023-10-03 成都无糖信息技术有限公司 Entity account extraction method and system for social text

Similar Documents

Publication Publication Date Title
CN111177393B (en) Knowledge graph construction method and device, electronic equipment and storage medium
CN110196982B (en) Method and device for extracting upper-lower relation and computer equipment
CN110795938B (en) Text sequence word segmentation method, device and storage medium
CN111738169B (en) Handwriting formula recognition method based on end-to-end network model
CN112287069B (en) Information retrieval method and device based on voice semantics and computer equipment
CN111985229A (en) Sequence labeling method and device and computer equipment
CN111783471B (en) Semantic recognition method, device, equipment and storage medium for natural language
CN113158656B (en) Ironic content recognition method, ironic content recognition device, electronic device, and storage medium
CN112528029A (en) Text classification model processing method and device, computer equipment and storage medium
CN116245097A (en) Method for training entity recognition model, entity recognition method and corresponding device
CN114780701B (en) Automatic question-answer matching method, device, computer equipment and storage medium
CN112949320A (en) Sequence labeling method, device, equipment and medium based on conditional random field
CN112084779A (en) Entity acquisition method, device, equipment and storage medium for semantic recognition
CN114861666A (en) Entity classification model training method and device and computer readable storage medium
CN112434746B (en) Pre-labeling method based on hierarchical migration learning and related equipment thereof
CN114385694A (en) Data processing method and device, computer equipment and storage medium
WO2024055864A1 (en) Training method and apparatus for implementing ia classification model using rpa and ai
CN115730237B (en) Junk mail detection method, device, computer equipment and storage medium
CN112784015B (en) Information identification method and device, apparatus, medium, and program
CN115294581A (en) Method and device for identifying error characters, electronic equipment and storage medium
CN114882874A (en) End-to-end model training method and device, computer equipment and storage medium
CN110399615B (en) Transaction risk monitoring method and device
CN114299510A (en) Handwritten English line recognition system
CN114692603A (en) Sensitive data identification method, system, device and medium based on CRF
CN113988223A (en) Certificate image recognition method and device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination