CN116070632A

CN116070632A - Informal text entity tag identification method and device

Info

Publication number: CN116070632A
Application number: CN202211490287.1A
Authority: CN
Inventors: 徐星晨; 朱亮
Original assignee: Jinmao Cloud Technology Service Beijing Co ltd
Current assignee: Jinmao Cloud Technology Service Beijing Co ltd
Priority date: 2022-11-25
Filing date: 2022-11-25
Publication date: 2023-05-05

Abstract

The invention discloses a method and a device for identifying entity tags in informal text. The method comprises the following steps: dividing an informal text to be identified into short sentences, and carrying out qualitative tag identification by referring to a basic word stock; constructing and training a universal field pre-training model by using a universal field corpus; designing a corresponding entity dictionary by referring to the entity tags to be identified; carrying out sequence labeling on small sample field text according to the entity tags contained in the designed entity dictionary to generate a small sample data set; carrying out fine tuning training with the small sample data set; and inputting the informal text to be identified into the fine-tuned pre-training model and outputting the entity sequence obtained by analyzing the informal text to be identified. Through innovation of natural language processing technology, the invention can reduce the labor cost of entity identification technology in the real estate industry; through fine tuning training of the universal language pre-training model, field adaptation of the model is realized and the named entities of informal text are identified.

Description

Informal text entity tag identification method and device

Technical Field

The present disclosure relates to the field of natural language processing technologies, and in particular, to a method and an apparatus for recognizing an informal text entity tag.

Background

User portrait labels currently used in the real estate industry are mostly derived from business data. Extracting new user labels from stock text data to enrich the CDP (customer data platform) has become a breakthrough point for improving enterprises' customer insight. Obtaining labels from informal text cannot bypass named entity recognition technology; however, named entity recognition identifies objective entities in general-field text and cannot distinguish the expressions of the same entity word under different business requirements in the informal text of an industry field.

Named entity recognition, also known as sequence labeling, aims to extract the desired content, which may be a word or a phrase, from a piece of text. Depending on the approach, it can be classified into three categories: (1) dictionary- and rule-based methods; (2) machine-learning-based methods; (3) methods based on deep learning semantic coding.

Dictionary- and rule-based methods construct a dictionary from prior knowledge, identify potential entities in a sentence by word matching, and then either screen them with rules or identify entities through sentence pattern templates. Machine-learning-based methods treat named entity recognition as a sequence labeling problem and mainly use a supervised algorithm model to predict the entity label of each word. Methods based on deep learning semantic coding generally train a word-level self-coding neural network on a large corpus so that the network learns the semantics of each word and can encode text; the encoded, sequence-labeled text is then used to train a deep learning or machine learning algorithm model so that it can identify the corresponding named entities.

Each of the three methods has advantages and disadvantages. The dictionary- and rule-based method has the strongest interpretability, very stable performance and high accuracy, but it depends on the volume of the dictionary or sentence pattern templates: as words and text structures diversify, the recall rate drops rapidly, and as the dictionary and templates keep growing, the maintenance cost keeps increasing; moreover, the rules of one field do not carry over to another. The machine-learning-based method can identify entities in text from the vocabulary resources, vocabulary models and statistical data obtained from a huge corpus, and its effect is, to a certain extent, proportional to the richness of the corpus; the problem is that, in industry fields outside the general field, a large amount of manpower must be invested in sequence labeling of field text to reach a usable standard. The method based on deep learning semantic coding avoids constructing a large number of artificial features and achieves an end-to-end model, but it likewise requires a large investment of manpower in sequence labeling.

The three methods achieve good recognition results on formal text but are still immature on informal text. The named entities they recognize are objective entity words, and they cannot distinguish the label expression of the same entity word under different business requirements in industry-field text. For example, when one text mentions an amount of "3 ten thousand" (30,000) and another mentions "3000" in connection with renting a house, both amounts are recognized only as the objective entity "amount", and the business task of telling which business label each amount belongs to cannot be accomplished.

In conclusion, among the three methods, the dictionary- and rule-based method has high accuracy but low recall, depends on the volume of the dictionary or sentence pattern templates, and lacks field adaptation capability; the schemes based on machine learning and deep learning algorithm models are not suitable for informal text, require manual sequence labeling of text, and depend on corpus accumulation; and none of the three methods can identify more complex entity tags.

Disclosure of Invention

Based on the above technical problems, a method and a device for identifying informal text entity tags are provided, to solve the technical problems that existing entity recognition methods require a large investment of manpower and are not suitable for informal text.

In order to achieve the above object, the present application provides the following technical solutions:

In a first aspect, a method for identifying an informal text entity tag includes:

S1, dividing an informal text to be identified into short sentences, referring to a basic word stock, matching the informal text to be identified, and carrying out qualitative tag identification;

S2, constructing and training a universal field pre-training model by using a universal field corpus;

S3, designing a corresponding entity dictionary by referring to the entity labels to be identified, and setting a using mode of the entity dictionary;

S4, carrying out sequence labeling on the text in the small sample field according to the entity labels contained in the entity dictionary designed in the step S3, and generating a small sample data set;

and S5, performing fine tuning training on the universal field pre-training model by adopting the small sample data set, inputting the informal text to be identified into the fine-tuned pre-training model, and outputting an entity sequence obtained by analyzing the informal text to be identified.

Optionally, the step S1 includes:

loading a compound word stock, preprocessing the informal text to be identified, and dividing the informal text to be identified into short sentences according to punctuation;

inputting the short sentences obtained by segmentation into a text matrix, and referring to a basic word stock to match the word scaling class labels in the short sentences to obtain text data T1 finishing preliminary labeling;

performing secondary matching on the text data T1 by using word filter class labels by referring to a basic word stock to obtain text data T2 with secondary labeling completed;

taking the dialogue text with the label in the text data T2 and a plurality of pieces of dialogue text around the dialogue text as target text T3;

constructing a sliding window in the target text T3, continuously generating long text T4 consisting of short sentences in the window, and calculating the similarity of each long text T4 and each long text sentence vector of the corresponding label in the basic word stock;

and for each tag contained in each long text T4, if the similarity between at least one long text sentence vector and the long text T4 in the basic word stock reaches a preset similarity threshold, judging that the current tag is a qualitative tag L1 of the long text T4.

Further optionally, the sliding window has a length of 5.
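
The segmentation and windowing described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the invention: the delimiter set, the window step of one short sentence, and the function names `split_short_sentences` and `sliding_windows` are all hypothetical, since the text only specifies splitting "according to punctuation" and a window length of 5.

```python
import re

# Assumed delimiter set: the source only says the text is split "according to punctuation".
_DELIMITERS = re.compile(r"[，。！？；,!?;\n]+")


def split_short_sentences(text):
    """Split an informal note into non-empty short sentences at punctuation."""
    return [part.strip() for part in _DELIMITERS.split(text) if part.strip()]


def sliding_windows(short_sentences, size=5):
    """Return every run of `size` consecutive short sentences (one long text T4 each).

    Assumption: the window advances one short sentence at a time. If fewer than
    `size` short sentences exist, the whole list forms a single window.
    """
    if not short_sentences:
        return []
    if len(short_sentences) <= size:
        return [list(short_sentences)]
    return [
        short_sentences[start:start + size]
        for start in range(len(short_sentences) - size + 1)
    ]


# Toy input: seven placeholder short sentences separated by mixed punctuation.
note = "a，b。c！d？e；f,g"
windows = sliding_windows(split_short_sentences(note))
```

With seven short sentences and a window length of 5, the sketch produces three overlapping windows, each of which would then be compared against the word stock.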

Optionally, step S2 includes:

S21, loading a general field corpus;

S22, acquiring texts from the corpus of the general field;

S23, performing MASK operation on the random characters in the acquired text;

S24, predicting the text subjected to MASK operation by using the basic model;

S25, after the general field corpus completes traversal, calculating the difference between the predicted result and the actual result;

S26, if the difference value does not reach the preset corresponding threshold value, reloading the universal field corpus to continue training the model, and repeating the steps S22-S26;

and S27, if the difference value reaches a preset corresponding threshold value, training is finished, and a universal field pre-training model is obtained.

Optionally, the base model is a RoBERTa model.

Optionally, the entity dictionary includes 4 entity tags, which are area, price, proportion and cell name respectively; each entity tag has a corresponding serial number.

Optionally, step S4 includes:

S41, acquiring the entity dictionary;

S42, inputting entity information into a marking platform;

S43, making a labeling standard;

S44, extracting data from the small sample field text and importing the small sample field text into a marking platform;

S45, manually marking data in the text in the small sample field;

S46, preprocessing the marked data by referring to the using mode of the entity dictionary, and generating a small sample data set with marks completed.

Optionally, step S5 includes:

S51, loading the universal field pre-training model and a small sample data set, and constructing training data by using the small sample data set;

S52, defining an optimizer and a learning rate, and training the universal field pre-training model through the training data;

S53, inputting the informal text to be identified into a pre-training model when the model evaluation index reaches a preset corresponding threshold; when the model evaluation index does not reach the preset corresponding threshold, repeating the step S52;

S54, removing redundant dimensions of the output vector of the pre-training model, and removing incomplete entity sequences of the output vector of the pre-training model;

S55, positioning the entity sequence index of the complete recognition of the output vector of the pre-training model, analyzing the entity sequence according to the entity dictionary, and carrying out structural output.

Further optionally, the method further comprises:

inputting the obtained long text T4 into a pre-training model for fine-tuning training to perform entity identification, and combining with a corresponding qualitative label to perform redundancy elimination;

extracting a sample result for artificial verification, and judging whether the accuracy of the total sample, the recall rate of the total sample and the accuracy of each entity label respectively reach corresponding standards;

and if the accuracy of the total sample, the recall rate of the total sample and the accuracy of each entity label respectively reach corresponding standards, judging that the pretrained model for fine adjustment training reaches the usable degree.
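
The redundancy elimination step can be illustrated with a short sketch. It assumes that each qualitative label admits a fixed set of entity tags and that recognized entities outside that set are discarded; the mapping `ALLOWED_ENTITY_TAGS`, the entity record layout and the function name are hypothetical and are not specified in the text.

```python
# Hypothetical mapping from qualitative labels to the entity tags they admit.
ALLOWED_ENTITY_TAGS = {
    "cell information": {"cell name"},
    "house": {"area", "price"},
}


def remove_redundant_entities(entities, qualitative_labels, allowed=ALLOWED_ENTITY_TAGS):
    """Keep only entities whose tag is admitted by at least one qualitative label."""
    keep = set()
    for label in qualitative_labels:
        keep |= allowed.get(label, set())
    return [entity for entity in entities if entity["tag"] in keep]


# Toy entities recognized in one long text T4 (placeholder values).
entities = [
    {"tag": "cell name", "text": "X"},
    {"tag": "area", "text": "130"},
    {"tag": "proportion", "text": "30%"},
]
kept = remove_redundant_entities(entities, ["house"])
```

In this toy case only the "area" entity survives, because the long text carries the qualitative label "house" alone.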

In a second aspect, an informal text entity tag recognition apparatus includes:

the qualitative tag recognition module is used for dividing the informal text to be recognized into short sentences, referring to a basic word stock to match the informal text to be recognized, and carrying out qualitative tag recognition;

the universal field pre-training model training module is used for constructing and training the universal field pre-training model by using the universal field corpus;

the entity dictionary design module is used for designing a corresponding entity dictionary by referring to entity labels to be identified and setting a use mode of the entity dictionary;

the small sample data set generation module is used for carrying out sequence labeling on the text in the small sample field according to the entity labels contained in the entity dictionary designed by the entity dictionary design module to generate a small sample data set;

and the entity sequence analysis module is used for carrying out fine tuning training on the universal field pre-training model by adopting the small sample data set, inputting the informal text to be identified into the fine-tuned pre-training model, and outputting an entity sequence obtained by analyzing the informal text to be identified.

The invention has at least the following beneficial effects:

An embodiment of the invention provides an informal text entity tag identification method, which comprises: dividing an informal text to be identified into short sentences and carrying out qualitative tag identification by referring to a basic word stock; constructing and training a universal field pre-training model by using a universal field corpus; designing an entity dictionary; carrying out sequence labeling on small sample field text according to the entity tags contained in the designed entity dictionary to generate a small sample data set; carrying out fine tuning training with the small sample data set; and inputting the informal text to be identified into the fine-tuned pre-training model and outputting the entity sequence obtained by analyzing the informal text to be identified. Through innovation of natural language processing technology, the labor cost of entity identification technology in the real estate industry can be reduced; through fine tuning training of the universal language pre-training model, field adaptation of the model is realized and the named entities of informal text are identified; and, combined with the qualitative tag identification model, the task target of identifying entity tags according to business requirements is realized.

The informal text entity tag identification method provided by the embodiment of the invention can be used not only in the real estate industry but is also suitable for identifying entity tags in texts of other fields that lack large sequence-labeled corpora; the choice of pre-training model is diverse and flexible.

Drawings

FIG. 1 is a flowchart of a method for recognizing an informal text entity tag according to an embodiment of the present invention;

FIG. 2 is a schematic diagram of a specific flow of qualitative label labeling in an embodiment of the present invention;

FIG. 3 is a schematic flow diagram of an MLM task in accordance with one embodiment of the invention;

FIG. 4 is a schematic diagram of input and output of a pre-training model according to an embodiment of the present invention;

FIG. 5 is a schematic diagram of an entity dictionary in accordance with one embodiment of the present invention;

FIG. 6 is a diagram illustrating the use of an entity dictionary in one embodiment of the present invention;

FIG. 7 is a schematic diagram of a small sample dataset construction flow in accordance with an embodiment of the present invention;

FIG. 8 is a schematic diagram of a fine tuning training and model output flow in accordance with one embodiment of the present invention;

FIG. 9 is a block diagram of a module architecture of an apparatus for recognizing an informal text entity tag according to an embodiment of the present invention;

FIG. 10 is an internal structural diagram of a computer device according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application.

In one embodiment, an informal text entity tag identification method is provided. It aims to mine high-value entity tags from the stock remark information generated by business and, with the aid of a business-customized tag system and natural language processing algorithms, to extract from informal text the entity tags that meet business requirements. The original business data is the remark text left in the visiting system by real estate consultants after each reception of a visitor; this remark text is informal text. The preceding data acquisition is based on the intelligent case field platform of China Jinmao and the text tag labeling model of Jinmao Cloud Technology; the data is stored in the big data processing platform of Jinmao Cloud Technology, and the text data is read in the form of a two-dimensional table. As shown in fig. 1, the method comprises the steps of:

s1, dividing the informal text to be recognized into short sentences, referring to a basic word stock, matching the informal text to be recognized, and performing qualitative tag recognition.

Qualitative label identification is coarse-grained identification of the semantic points of a text. For example, the text "a 4-room unit of more than 130 square meters in the 'atlas of the middle sea' community, purchased in 2019" hits the qualitative labels "cell information" and "house", and its entity labels are "cell name: atlas of the middle sea", "housing area: more than 130 square meters" and "housing type: 4 rooms". Identifying the short sentences that carry qualitative labels first effectively reduces the complexity of the subsequent entity label identification algorithm and improves accuracy.

Further, as shown in fig. 2, step S1 includes:

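loading a compound word stock, preprocessing the informal text to be identified, and dividing the informal text to be identified into short sentences according to punctuation;

inputting the short sentences obtained by segmentation into a text matrix, and referring to the basic word stock to match the word scaling class labels in the short sentences to obtain text data T1 with preliminary labeling completed;

performing secondary matching on the text data T1 by using word filter class labels by referring to the basic word stock to obtain text data T2 with secondary labeling completed;

taking the dialogue text with a label in the text data T2 and a plurality of pieces of dialogue text around it as target text T3;

constructing a sliding window in the target text T3, continuously generating long text T4 consisting of the short sentences in the window, and calculating the similarity between each long text T4 and each long text sentence vector of the corresponding label in the basic word stock;

and for each tag contained in each long text T4, if the similarity between at least one long text sentence vector in the basic word stock and the long text T4 reaches a preset similarity threshold, determining that the current tag is a qualitative tag L1 of the long text T4.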

In other words, the invention performs qualitative tag recognition with the compound-word-stock-based text tag labeling model previously proposed by Jinmao Cloud Technology. The workflow of the model is shown in fig. 2. First, the text is preprocessed: long text is segmented into short sentences according to punctuation, and the short sentences are input into the model as basic units. Word scaling class labels are recognized first according to the basic word stock, and the word stock for labels other than word scaling class labels is then used for secondary matching. The matched short sentences are located, the dialogue text containing the current short sentence and the 5 dialogue texts above and below it are obtained, and a moving window of length 5 is constructed to slide over the short sentences of these dialogue texts. Sentence vectors of the text in the window are generated based on a word2vec model, and their similarity to the long text sentence vectors in the word stock corpus of the matched label is calculated continuously; if the similarity to any one long text corpus entry under the same label reaches the threshold, the label is judged to be matched. A matching log is then recorded and the result is output.
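
The similarity test in this workflow can be sketched as below. The sketch makes several assumptions that the text does not confirm: a sentence vector is taken as the average of its word vectors, similarity is cosine similarity, the threshold is 0.8, and a small hand-written `word_vectors` table stands in for a trained word2vec model. The label "rental" and all function names are hypothetical.

```python
import math


def sentence_vector(tokens, word_vectors):
    """Average the vectors of known tokens (assumed way of forming a sentence vector)."""
    vectors = [word_vectors[token] for token in tokens if token in word_vectors]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(vec[i] for vec in vectors) / len(vectors) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def match_labels(window_tokens, label_corpus, word_vectors, threshold=0.8):
    """Return labels for which at least one corpus sentence reaches the threshold."""
    window_vec = sentence_vector(window_tokens, word_vectors)
    if window_vec is None:
        return []
    matched = []
    for label, corpus_sentences in label_corpus.items():
        for tokens in corpus_sentences:
            reference = sentence_vector(tokens, word_vectors)
            if reference is not None and cosine(window_vec, reference) >= threshold:
                matched.append(label)
                break  # one matching corpus sentence is enough for this label
    return matched


# Toy stand-in for word2vec output and for the word stock corpus (both hypothetical).
word_vectors = {
    "bought": [1.0, 0.0],
    "flat": [0.9, 0.1],
    "rent": [0.0, 1.0],
    "monthly": [0.1, 0.9],
}
label_corpus = {"house": [["bought", "flat"]], "rental": [["rent", "monthly"]]}
labels = match_labels(["flat", "bought"], label_corpus, word_vectors)
```

Stopping at the first corpus sentence that reaches the threshold mirrors the rule that one matching long text under a label is sufficient to judge the label matched.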

The model can output qualitative labels and corresponding sentences existing in the text, then analyze the matching log into key value pairs (keys: long text T4, values: qualitative labels) through text preprocessing, and store the key value pairs in a two-dimensional table.

S2, constructing and training a universal field pre-training model by using the universal field corpus.

Further, as shown in fig. 3, step S2 includes:

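S21, loading a general field corpus;

S22, acquiring texts from the corpus of the general field;

S23, performing MASK operation on the random characters in the acquired text;

S24, predicting the text subjected to MASK operation by using the basic model;

S25, after the general field corpus completes traversal, calculating the difference between the predicted result and the actual result;

S26, if the difference value does not reach the preset corresponding threshold value, reloading the general field corpus to continue training the model, and repeating the steps S22-S26;

and S27, if the difference value reaches the preset corresponding threshold value, training is finished, and a universal field pre-training model is obtained.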

In other words, to reduce labor input and avoid the lack of sequence-labeled corpora in the industry field, the method selects a public general field corpus as the training corpus and constructs and trains a general text pre-training model, using the RoBERTa model as the basic model and the MLM (masked language model) task as the training target. The MLM task flow is shown in fig. 3.
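
The MASK operation on random characters can be sketched as follows. This is a simplified illustration, not the actual pre-training code: the masking probability of 0.15 is an assumption borrowed from common BERT-style practice and is not stated in the text, every selected character is replaced by a `[MASK]` token, and `masked_accuracy` is a hypothetical stand-in for "the difference between the predicted result and the actual result".

```python
import random

MASK_TOKEN = "[MASK]"


def mask_random_characters(text, mask_prob=0.15, rng=None):
    """Replace randomly chosen characters with MASK_TOKEN.

    Returns the masked character list and a {position: original character}
    mapping that serves as the prediction target.
    """
    rng = rng or random.Random()
    masked, targets = [], {}
    for position, char in enumerate(text):
        if rng.random() < mask_prob:
            masked.append(MASK_TOKEN)
            targets[position] = char
        else:
            masked.append(char)
    return masked, targets


def masked_accuracy(predictions, targets):
    """Share of masked positions whose predicted character equals the original."""
    if not targets:
        return 0.0
    hits = sum(1 for pos, char in targets.items() if predictions.get(pos) == char)
    return hits / len(targets)


# Seeded generator so the toy run is repeatable.
masked, targets = mask_random_characters("abcdefghij", mask_prob=0.5, rng=random.Random(0))
```

A training loop in the spirit of S22 to S27 would feed `masked` to the basic model, compare its predictions at the masked positions with `targets`, and repeat over the corpus until the chosen measure reaches its threshold.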

Through this step, a Chinese model (available semantic coding model) that understands the semantics of the generic text can be trained, the input and output of which is shown in FIG. 4, for example.

S3, designing a corresponding entity dictionary by referring to the entity labels to be identified, and setting a using mode of the entity dictionary.

The qualitative labels in step S1 can be decomposed into finer-grained entity labels according to business requirements, but entity labels in character-string form cannot be input into the pre-training model. The invention therefore designs a corresponding entity dictionary with reference to the BIO labeling format and the entity labels to be identified. As shown in fig. 5, 4 entity labels are designed in total, namely area, price, proportion and cell name, and each entity label has a corresponding serial number. Fig. 6 illustrates the manner in which the entity dictionary is used.
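
One possible shape of such an entity dictionary is sketched below. The actual serial numbers are given in fig. 5, which is not reproduced here, so the numbering, the identifiers `AREA`, `PRICE`, `PROPORTION` and `CELL_NAME`, and the function name are assumptions made for illustration only.

```python
# Hypothetical identifiers for the four entity labels named in the text.
ENTITY_TYPES = ["AREA", "PRICE", "PROPORTION", "CELL_NAME"]


def build_entity_dictionary(entity_types):
    """Map BIO labels to serial numbers: "O" first, then B-/I- pairs per entity type."""
    label2id = {"O": 0}
    for name in entity_types:
        label2id[f"B-{name}"] = len(label2id)  # first character of an entity
        label2id[f"I-{name}"] = len(label2id)  # subsequent characters of the entity
    return label2id


label2id = build_entity_dictionary(ENTITY_TYPES)
id2label = {serial: label for label, serial in label2id.items()}
```

Under this scheme four entity labels yield nine serial numbers, which would be the size of the classification layer used during fine-tuning.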

S4, carrying out sequence labeling on the small sample field text according to the entity labels contained in the entity dictionary designed in the step S3, and generating a small sample data set of the real estate industry.

Further, as shown in fig. 7, step S4 includes:

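S41, acquiring an entity dictionary;

S42, inputting entity information into a marking platform;

S43, making a labeling standard;

S44, extracting data from the small sample field text and importing it into the marking platform;

S45, manually marking data in the text in the small sample field;

S46, preprocessing the marked data by referring to the using mode of the entity dictionary, and generating a small sample data set with marks completed.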

In other words, the pre-training model can only produce text encodings. A small amount of manpower is required to perform sequence labeling on the small sample field text with reference to the entity labels set in step S3; the labeled data is used for fine-tuning training of the pre-training model so that it can perform the entity identification task. The data set construction flow is shown in fig. 7.
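
The preprocessing of manually marked data into a training sample can be sketched as follows. The export format of the marking platform is not described, so the sketch assumes each annotation is a `(start, end, entity_type)` character span with an exclusive end; the small `label2id` table and the function name are likewise hypothetical.

```python
# Hypothetical excerpt of an entity dictionary covering one entity label.
label2id = {"O": 0, "B-AREA": 1, "I-AREA": 2}


def spans_to_bio_ids(text, spans, label2id):
    """Convert character spans into one serial number per character (BIO scheme)."""
    labels = ["O"] * len(text)
    for start, end, entity_type in spans:  # end is exclusive
        labels[start] = f"B-{entity_type}"
        for position in range(start + 1, end):
            labels[position] = f"I-{entity_type}"
    return [label2id[label] for label in labels]


# Toy annotation: the first four characters ("130平") are marked as an area.
sample_text = "130平4居"
sample_ids = spans_to_bio_ids(sample_text, [(0, 4, "AREA")], label2id)
```

Each `(text, ids)` pair produced this way would be one record of the small sample data set.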

And S5, performing fine tuning training on the universal field pre-training model by adopting a small sample data set, inputting the informal text to be identified into the fine-tuned pre-training model, and outputting an entity sequence obtained by analyzing the informal text to be identified.

Further, as shown in fig. 8, step S5 includes:

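S51, loading a general field pre-training model and a small sample data set, and constructing training data by using the small sample data set;

S52, defining an optimizer and a learning rate, and training the general field pre-training model through the training data;

S53, inputting the informal text to be identified into the pre-training model when the model evaluation index reaches a preset corresponding threshold; when the model evaluation index does not reach the preset corresponding threshold, repeating the step S52;

S54, removing redundant dimensions of the output vector of the pre-training model, and removing incomplete entity sequences of the output vector of the pre-training model;

S55, positioning the indexes of the completely recognized entity sequences in the output vector of the pre-training model, analyzing the entity sequences according to the entity dictionary, and carrying out structured output.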

In other words, the pre-training model trained in step S2 is fine-tuned with the small sample data set constructed in step S4, and the model output is then processed; the training and output flow of the entity recognition model is shown in fig. 8.
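
The processing of the model output can be sketched as below. The sketch assumes the redundant dimensions (for example a batch dimension) have already been removed, so that the output is a flat list with one serial number per character, and it treats an "I-" label without a preceding matching "B-" label as an incomplete entity sequence to be dropped. The `id2label` table and the output record layout are hypothetical.

```python
# Hypothetical excerpt of an entity dictionary, inverted for decoding.
id2label = {0: "O", 1: "B-AREA", 2: "I-AREA"}


def decode_entities(text, label_ids, id2label):
    """Turn per-character serial numbers into structured entity records."""
    entities, start, current = [], None, None
    # A trailing "O" acts as a sentinel that closes an entity ending at the last character.
    labels = [id2label[i] for i in label_ids[:len(text)]] + ["O"]
    for position, label in enumerate(labels):
        if label.startswith("I-") and current == label[2:]:
            continue  # still inside the current entity
        if current is not None:
            entities.append(
                {"tag": current, "text": text[start:position], "start": start, "end": position}
            )
            start, current = None, None
        if label.startswith("B-"):
            start, current = position, label[2:]
        # An "I-" label with no open entity is an incomplete sequence and is skipped.
    return entities


decoded = decode_entities("130平4居", [1, 2, 2, 2, 0, 0], id2label)
```

The decoded records carry both the entity tag and the character offsets, which is one way to produce the structured output mentioned for the final step.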

Further, the method also comprises a model test evaluation flow, which specifically comprises the following steps:

That is, the long text T4 generated in step S1 is input into the model from step S5 for entity recognition, and redundancy is then removed in combination with the qualitative label corresponding to the text, so that only the entity labels corresponding to the qualitative label are kept. Sample results are extracted for manual verification. The accuracy of the total sample (total number of correctly recognized entity labels / total number of recognized entity labels, i.e. precision) is 88.26%, the recall rate of the total sample (total number of correctly recognized entity labels / total number of entity expressions actually present) is 72.59%, and the average accuracy over the entity labels is 90.09%, so the model reaches a usable level. If the model evaluation indexes need to be improved, the amount of small sample data in step S4 can be increased.
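
The evaluation indexes defined above can be computed as in the following sketch. The counts used in the example are hypothetical placeholders chosen for easy arithmetic; they are not the verification data behind the percentages reported in the text.

```python
def precision(correct, recognized):
    """Correctly recognized entity labels / all recognized entity labels."""
    return correct / recognized if recognized else 0.0


def recall(correct, actual):
    """Correctly recognized entity labels / entity expressions actually present."""
    return correct / actual if actual else 0.0


def average_tag_precision(per_tag_counts):
    """Mean of per-label precision; values are (correct, recognized) pairs."""
    values = [precision(c, r) for c, r in per_tag_counts.values()]
    return sum(values) / len(values) if values else 0.0


# Hypothetical counts from a manually verified sample.
total_precision = precision(correct=8, recognized=10)
total_recall = recall(correct=8, actual=16)
tag_average = average_tag_precision({"area": (9, 10), "price": (4, 5)})
```

Comparing these three values against preset standards gives the usability check described for the fine-tuned model.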

The method for identifying the entity tag of the informal text provided by the embodiment of the invention can reduce the labor cost investment of entity identification and is also suitable for identifying the entity tag of the informal text: through innovation of natural language processing technology, the labor cost of entity identification technology in the real estate industry can be reduced; by carrying out fine tuning training on the universal language pre-training model, the field adaptation of the model can be realized, and the named entity of the informal text can be identified; and the task target of identifying the entity tag according to the service requirement is realized by combining the use of the qualitative tag identification model.

The informal text entity tag identification method can be used in the real estate industry and is also suitable for identifying entity tags in texts of other fields that lack large sequence-labeled corpora; the choice of pre-training model is diverse and flexible.

The method can improve the effect of informal text entity recognition when the labeled field corpus is limited; the qualitative labels act as weak classifiers, improving the model's recognition of entity labels while also giving it the ability to recognize business labels.

In summary, the invention can reduce the labor cost invested in entity identification, identify named entities in informal text, and identify entity tags according to business requirements.

It should be understood that, although the steps in the flowcharts of FIGS. 1-8 are shown in the order indicated by the arrows, these steps are not necessarily performed in that order. Unless explicitly stated herein, the steps are not strictly limited in their order of execution and may be executed in other orders. Moreover, at least some of the steps in FIGS. 1-8 may include multiple sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times, and which are not necessarily performed in sequence but may be performed in turn or alternately with other steps or with at least a portion of the sub-steps or stages of other steps.

In one embodiment, as shown in fig. 9, there is provided an informal text entity tag recognition apparatus including the following program modules:

the qualitative tag recognition module 901 is configured to segment an informal text to be recognized into short sentences, match the informal text to be recognized with reference to a basic word stock, and perform qualitative tag recognition;

a general domain pre-training model training module 902, configured to construct and train a general domain pre-training model using the general domain corpus;

the entity dictionary design module 903 is configured to design a corresponding entity dictionary with reference to the entity labels to be identified, and set a usage mode of the entity dictionary;

the small sample data set generating module 904 is configured to perform sequence labeling on the small sample field text according to the entity tag included in the entity dictionary designed by the entity dictionary design module, so as to generate a small sample data set;

the entity sequence analysis module 905 is configured to perform fine tuning training on the universal field pre-training model by using the small sample data set, input the informal text to be identified into the fine-tuned pre-training model, and output an entity sequence obtained by analyzing the informal text to be identified.

For specific limitations of the informal text entity tag recognition apparatus, reference may be made to the above limitations of the informal text entity tag recognition method, which are not repeated here. Each of the above modules in the informal text entity tag recognition apparatus may be implemented in whole or in part by software, hardware, or a combination thereof. The modules may be embedded in, or independent of, a processor of the computer device in hardware form, or may be stored in a memory of the computer device in software form, so that the processor can call and execute the operations corresponding to each module.

In one embodiment, a computer device is provided, which may be a terminal, and an internal structure diagram thereof may be as shown in fig. 10. The computer device includes a processor, a memory, a communication interface, a display screen, and an input device connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The communication interface of the computer device is used for carrying out wired or wireless communication with an external terminal, and the wireless mode can be realized through WIFI, an operator network, NFC (near field communication) or other technologies. The computer program is executed by a processor to implement a method of informal text entity tag identification. The display screen of the computer equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer equipment can be a touch layer covered on the display screen, can also be keys, a track ball or a touch pad arranged on the shell of the computer equipment, and can also be an external keyboard, a touch pad or a mouse and the like.

It will be appreciated by those skilled in the art that the structure shown in fig. 10 is merely a block diagram of some of the structures associated with the present application and is not limiting of the computer device to which the present application may be applied, and that a particular computer device may include more or fewer components than shown, or may combine certain components, or have a different arrangement of components.

In one embodiment, a computer device is provided, including a memory and a processor, the memory storing a computer program; when the processor executes the computer program, all or part of the flow of the methods of the above embodiments is implemented.

In one embodiment, a computer readable storage medium is provided, on which a computer program is stored; when the computer program is executed by a processor, all or part of the flow of the methods of the above embodiments is implemented.

Those skilled in the art will appreciate that all or part of the above described methods may be implemented by a computer program stored on a non-transitory computer readable storage medium, and the computer program, when executed, may comprise the steps of the method embodiments described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include at least one of non-volatile and volatile memory. The non-volatile memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, or the like. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration, and not limitation, RAM can take a variety of forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM).

The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; however, as long as a combination of technical features involves no contradiction, it should be considered to be within the scope of this description.

The above examples merely represent a few embodiments of the present application, which are described in more detail and are not to be construed as limiting the scope of the invention. It should be noted that it would be apparent to those skilled in the art that various modifications and improvements could be made without departing from the spirit of the present application, which would be within the scope of the present application. Accordingly, the scope of protection of the present application is to be determined by the claims appended hereto.

Claims

1. A method for identifying an informal text entity tag, comprising:

2. The method for recognizing an informal text entity tag according to claim 1, wherein step S1 comprises:

3. The method of informal text entity tag identification of claim 2, wherein the sliding window has a length of 5.

4. The method for recognizing an informal text entity tag according to claim 1, wherein step S2 comprises:

S21, loading a general field corpus;

S22, acquiring texts from the corpus of the general field;

S23, performing MASK operation on the random characters in the acquired text;

S24, predicting the text subjected to MASK operation by using the basic model;

5. The method of informal text entity tag identification of claim 4, wherein the base model is a RoBERTa model.

6. The method for recognizing entity tags of informal text according to claim 1, wherein the entity dictionary comprises 4 entity tags, which are area, price, proportion and cell name, respectively; each entity tag has a corresponding serial number.

7. The method for recognizing an informal text entity tag according to claim 1, wherein step S4 comprises:

S41, acquiring the entity dictionary;

S42, inputting entity information into a marking platform;

S43, making a labeling standard;

S45, manually marking data in the text in the small sample field;

8. The method for recognizing an informal text entity tag according to claim 1, wherein step S5 comprises:

9. The method of informal text entity tag identification of claim 2, the method further comprising:

10. An informal text entity tag recognition apparatus, comprising: