CN116070632A - Informal text entity tag identification method and device - Google Patents

Publication number
CN116070632A
Authority
CN
China
Prior art keywords
text
entity
training
informal
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211490287.1A
Other languages
Chinese (zh)
Inventor
徐星晨
朱亮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jinmao Cloud Technology Service Beijing Co ltd
Original Assignee
Jinmao Cloud Technology Service Beijing Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jinmao Cloud Technology Service Beijing Co ltd filed Critical Jinmao Cloud Technology Service Beijing Co ltd
Priority to CN202211490287.1A priority Critical patent/CN116070632A/en
Publication of CN116070632A publication Critical patent/CN116070632A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/242 Dictionaries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a method and a device for recognizing entity tags in informal text. The method comprises the following steps: dividing an informal text to be identified into short sentences, and carrying out qualitative tag identification with reference to a basic word stock; constructing and training a general-domain pre-training model by using a general-domain corpus; designing a corresponding entity dictionary with reference to the entity tags to be identified; carrying out sequence labeling on small-sample domain text according to the entity tags contained in the designed entity dictionary to generate a small sample data set; and carrying out fine-tuning training with the small sample data set, inputting the informal text to be identified into the fine-tuned pre-training model, and outputting the entity sequence obtained by analyzing the informal text to be identified. Through innovation in natural language processing technology, the invention can reduce the labor cost of entity recognition in the real estate industry; and through fine-tuning of the general-domain language pre-training model, it adapts the model to the domain and recognizes named entities in informal text.

Description

Informal text entity tag identification method and device
Technical Field
The present disclosure relates to the field of natural language processing technologies, and in particular, to a method and an apparatus for recognizing an informal text entity tag.
Background
User portrait labels currently used in the real estate industry are mostly derived from business data. Extracting new user labels from existing (stock) text data to enrich the CDP (customer data platform) has become a breakthrough point for improving an enterprise's customer insight. Obtaining labels from informal text cannot bypass named entity recognition technology; however, named entity recognition can only recognize objective entities in general-domain text, and cannot distinguish the different meanings that the same entity word takes on under different business requirements in informal, industry-specific text.
Named entity recognition, commonly formulated as a sequence labeling task, aims to extract the desired content from a piece of text, which may be a word or a phrase. Depending on the method, it can be classified into three categories: (1) dictionary- and rule-based methods; (2) machine learning-based methods; (3) methods based on deep learning semantic coding.
The dictionary- and rule-based method constructs a dictionary from prior knowledge, identifies potential entities in sentences by word matching, and then either screens them with rules or identifies entities through sentence pattern templates. The machine learning-based method treats named entity recognition as a sequence labeling problem and mainly uses a supervised algorithm model to predict the entity label of each word. The method based on deep learning semantic coding generally trains a word-level self-coding neural network on a large corpus so that the network learns the semantics of each word and can encode text; the encoded, sequence-labeled text is then used to train a deep learning or machine learning model so that it can identify the corresponding named entities.
Each of the three methods has advantages and disadvantages. The dictionary- and rule-based method has the strongest interpretability, very stable performance, and high accuracy, but it depends on the size of the dictionary or sentence pattern templates: as vocabulary and text structures diversify, the recall rate drops rapidly, while the dictionary and templates keep growing, so the maintenance cost keeps rising; moreover, rules are not transferable between domains. The machine learning-based method can identify entities in text using lexical resources, lexical models, and statistics obtained from a huge corpus, and its effect is, to a certain extent, proportional to the richness of the corpus; the problem is that, in industry domains outside the general domain, a large amount of manpower must be invested in sequence labeling of domain text to reach a usable standard. The method based on deep learning semantic coding avoids constructing a large number of hand-crafted features and achieves an end-to-end model, but realizing the end-to-end model still requires a large investment of manpower in sequence labeling.
The three methods achieve good recognition results on formal text but are still immature on informal text. The named entities they recognize are objective entity words, and they cannot distinguish the label meanings of the same entity word under different business requirements in industry-domain text. For example, in a remark such as "3 ten thousand in hand, 3000 in rent", both "3 ten thousand" and "3000" are recognized as the objective entity "amount", but the business-specific labels that the task requires for each of the two amounts cannot be given.
In conclusion, among the three methods, the dictionary- and rule-based method has high accuracy but low recall, depends on the size of the dictionary or sentence pattern templates, and lacks domain adaptation capability; the schemes based on machine learning and deep learning models are not suited to informal text, require manual sequence labeling of the text, and depend on corpus accumulation; and none of the three methods can identify more complex entity tags.
Disclosure of Invention
In view of the above technical problems, a method and a device for recognizing informal text entity tags are provided, which solve the problems that existing entity recognition methods require a large investment of manpower and are not suited to informal text.
In order to achieve the above object, the present application provides the following technical solutions:
In a first aspect, a method for identifying an informal text entity tag includes:
S1, dividing an informal text to be identified into short sentences, matching the informal text to be identified against a basic word stock, and carrying out qualitative tag identification;
S2, constructing and training a general-domain pre-training model by using a general-domain corpus;
S3, designing a corresponding entity dictionary with reference to the entity tags to be identified, and setting the usage mode of the entity dictionary;
S4, carrying out sequence labeling on small-sample domain text according to the entity tags contained in the entity dictionary designed in step S3, and generating a small sample data set;
S5, performing fine-tuning training on the general-domain pre-training model with the small sample data set, inputting the informal text to be identified into the fine-tuned pre-training model, and outputting the entity sequence obtained by analyzing the informal text to be identified.
Optionally, the step S1 includes:
loading a compound word stock, preprocessing the informal text to be identified, and dividing the informal text to be identified into short sentences at punctuation;
inputting the short sentences obtained by segmentation into a text matrix, and matching the word-scaling class labels in the short sentences against a basic word stock to obtain text data T1 with preliminary labeling completed;
performing secondary matching on the text data T1 using word-filter class labels with reference to the basic word stock to obtain text data T2 with secondary labeling completed;
taking each labeled dialogue text in the text data T2, together with several dialogue texts surrounding it, as target text T3;
constructing a sliding window over the target text T3, continuously generating long texts T4 each consisting of the short sentences in the window, and calculating the similarity between each long text T4 and each long text sentence vector of the corresponding label in the basic word stock;
and, for each label contained in each long text T4, if the similarity between the long text T4 and at least one long text sentence vector of that label in the basic word stock reaches a preset similarity threshold, judging that the current label is a qualitative label L1 of the long text T4.
Further optionally, the sliding window has a length of 5.
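The segmentation and windowing steps above can be illustrated with a minimal Python sketch. It is not the patented implementation: the punctuation set, the function names, and the choice to join short sentences with a space are assumptions made for illustration, and only the window length of 5 comes from the text.

```python
import re

# Assumed punctuation set; the source only says "according to punctuation".
_PUNCT = r"[，。！？；,.!?;\n]+"

def split_short_sentences(text):
    """Split a remark into short sentences at punctuation marks."""
    return [s.strip() for s in re.split(_PUNCT, text) if s.strip()]

def sliding_windows(short_sentences, size=5):
    """Yield long texts (T4), each joining `size` consecutive short sentences."""
    if not short_sentences:
        return
    if len(short_sentences) <= size:
        # Fewer sentences than the window: emit them as a single long text.
        yield " ".join(short_sentences)
        return
    for start in range(len(short_sentences) - size + 1):
        yield " ".join(short_sentences[start:start + size])
```

Each long text produced by `sliding_windows` would then be compared against the sentence vectors stored for the matched label.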
Optionally, step S2 includes:
S21, loading a general-domain corpus;
S22, acquiring texts from the general-domain corpus;
S23, performing a MASK operation on random characters in the acquired text;
S24, predicting the masked text using the base model;
S25, after the traversal of the general-domain corpus is complete, calculating the difference between the predicted results and the actual results;
S26, if the difference does not reach the preset corresponding threshold, reloading the general-domain corpus to continue training the model, and repeating steps S22-S26;
S27, if the difference reaches the preset corresponding threshold, ending the training to obtain the general-domain pre-training model.
Optionally, the base model is a RoBERTa model.
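As an illustration of the MASK operation in step S23, the following Python sketch replaces random characters with a mask token and records the originals as prediction targets. The masking probability of 0.15 and the `[MASK]` token string follow common BERT-style practice and are assumptions here; the source only states that random characters are masked.

```python
import random

MASK = "[MASK]"  # assumed mask token string

def mask_random_chars(text, mask_prob=0.15, rng=None):
    """Mask random characters; return (tokens, {position: original character})."""
    rng = rng or random.Random()
    tokens = list(text)          # character-level tokens
    targets = {}
    for i, ch in enumerate(tokens):
        if rng.random() < mask_prob:
            targets[i] = ch      # what the model should predict at position i
            tokens[i] = MASK
    return tokens, targets
```

The base model would be trained to recover `targets` from `tokens`, and the mismatch between its predictions and the recorded characters is the difference evaluated in step S25.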
Optionally, the entity dictionary includes 4 entity tags, namely area, price, proportion and residential community name; each entity tag has a corresponding serial number.
Optionally, step S4 includes:
S41, acquiring the entity dictionary;
S42, entering the entity information into a labeling platform;
S43, formulating a labeling standard;
S44, extracting data from the small-sample domain text and importing it into the labeling platform;
S45, manually labeling the data from the small-sample domain text;
S46, preprocessing the labeled data according to the usage mode of the entity dictionary, and generating a fully labeled small sample data set.
Optionally, step S5 includes:
S51, loading the general-domain pre-training model and the small sample data set, and constructing training data from the small sample data set;
S52, defining an optimizer and a learning rate, and training the general-domain pre-training model on the training data;
S53, when the model evaluation index reaches the preset corresponding threshold, inputting the informal text to be identified into the pre-training model; when the model evaluation index does not reach the preset corresponding threshold, repeating step S52;
S54, removing redundant dimensions from the output vector of the pre-training model, and removing incomplete entity sequences from the output vector;
S55, locating the indices of the completely recognized entity sequences in the output vector of the pre-training model, parsing the entity sequences according to the entity dictionary, and producing structured output.
Further optionally, the method further comprises:
inputting the obtained long text T4 into the fine-tuned pre-training model for entity recognition, and removing redundancy in combination with the corresponding qualitative label;
sampling the results for manual verification, and judging whether the accuracy of the total sample, the recall rate of the total sample and the accuracy of each entity label each reach the corresponding standard;
and if the accuracy of the total sample, the recall rate of the total sample and the accuracy of each entity label each reach the corresponding standard, judging that the fine-tuned pre-training model has reached a usable level.
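The redundancy-removal step can be sketched in Python as a filter that keeps only the entity labels permitted by the qualitative labels of the same long text. The mapping `ALLOWED` and the label strings below are hypothetical; the source does not list which entity labels belong to which qualitative label.

```python
# Hypothetical mapping from qualitative labels to the entity labels they permit.
ALLOWED = {
    "community information": {"community name"},
    "house": {"area", "price"},
}

def remove_redundant(entities, qualitative_labels, allowed=ALLOWED):
    """Keep only (label, text) entities permitted by the text's qualitative labels."""
    permitted = set()
    for q in qualitative_labels:
        permitted |= allowed.get(q, set())
    return [(label, text) for label, text in entities if label in permitted]
```

Under this assumption, an amount recognized in a long text whose qualitative label does not permit a price would be discarded rather than reported.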
In a second aspect, an informal text entity tag recognition apparatus includes:
a qualitative tag recognition module, used for dividing the informal text to be recognized into short sentences, matching the informal text to be recognized against a basic word stock, and carrying out qualitative tag recognition;
a general-domain pre-training model training module, used for constructing and training the general-domain pre-training model by using a general-domain corpus;
an entity dictionary design module, used for designing a corresponding entity dictionary with reference to the entity tags to be identified, and setting the usage mode of the entity dictionary;
a small sample data set generation module, used for carrying out sequence labeling on small-sample domain text according to the entity tags contained in the entity dictionary designed by the entity dictionary design module, to generate a small sample data set;
and an entity sequence analysis module, used for performing fine-tuning training on the general-domain pre-training model with the small sample data set, inputting the informal text to be identified into the fine-tuned pre-training model, and outputting the entity sequence obtained by analyzing the informal text to be identified.
The invention has at least the following beneficial effects:
The embodiment of the invention provides an informal text entity tag identification method, which comprises: dividing an informal text to be identified into short sentences and carrying out qualitative tag identification with reference to a basic word stock; constructing and training a general-domain pre-training model by using a general-domain corpus; designing an entity dictionary; carrying out sequence labeling on small-sample domain text according to the entity tags contained in the designed entity dictionary to generate a small sample data set; carrying out fine-tuning training with the small sample data set; inputting the informal text to be identified into the fine-tuned pre-training model; and outputting the entity sequence obtained by analyzing the informal text to be identified. Through innovation in natural language processing technology, the labor cost of entity recognition in the real estate industry can be reduced; through fine-tuning of the general-domain language pre-training model, the model is adapted to the domain and named entities in informal text are recognized; and, in combination with the qualitative tag recognition model, the task goal of identifying entity tags according to business requirements is achieved.
The informal text entity tag identification method provided by the embodiment of the invention can be used not only in the real estate industry but also for entity tag recognition on text in other domains that lack large sequence-labeled corpora; the choice of pre-training model is flexible and can be varied.
Drawings
FIG. 1 is a flowchart of a method for recognizing an informal text entity tag according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a specific flow of qualitative label labeling in an embodiment of the present invention;
FIG. 3 is a schematic flow diagram of an MLM task in accordance with one embodiment of the invention;
FIG. 4 is a schematic diagram of input and output of a pre-training model according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of an entity dictionary in accordance with one embodiment of the present invention;
FIG. 6 is a diagram illustrating the use of an entity dictionary in one embodiment of the present invention;
FIG. 7 is a schematic diagram of a small sample dataset construction flow in accordance with an embodiment of the present invention;
FIG. 8 is a schematic diagram of a fine tuning training and model output flow in accordance with one embodiment of the present invention;
FIG. 9 is a block diagram of a module architecture of an apparatus for recognizing an informal text entity tag according to an embodiment of the present invention;
FIG. 10 is an internal structural diagram of a computer device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application.
In one embodiment, an informal text entity tag identification method is provided. It aims to mine high-value entity tags from the stock remark information generated by the business and, assisted by a business-customized tag system and natural language processing algorithms, to extract from informal text the entity tags that meet business requirements. The original business data are the remarks that real estate consultants leave in the visit system after each reception of a visitor; these remarks are informal text. The upstream data acquisition is based on the China Jinmao smart sales-office platform and the Jinmao Cloud Technology text label labeling model; the data are stored in the Jinmao Cloud Technology big data processing platform, and the text data are read in the form of a two-dimensional table. As shown in FIG. 1, the method comprises the following steps:
S1, dividing the informal text to be recognized into short sentences, matching the informal text to be recognized against a basic word stock, and performing qualitative tag recognition.
Qualitative tag recognition is coarse-grained recognition of the semantic points of a text. For example, for a remark stating that a 4-room home of more than 130 square meters in a named residential community was purchased in 2019, the qualitative labels hit are "community information" and "house", and the entity labels are community name (the community named in the remark), housing area (more than 130 square meters) and housing type (4 rooms). Identifying the short text segments that carry qualitative labels first effectively reduces the complexity of the subsequent entity tag recognition algorithm and improves its accuracy.
Further, as shown in fig. 2, step S1 includes:
loading a compound word stock, preprocessing the informal text to be identified, and dividing the informal text to be identified into short sentences at punctuation;
inputting the short sentences obtained by segmentation into a text matrix, and matching the word-scaling class labels in the short sentences against a basic word stock to obtain text data T1 with preliminary labeling completed;
performing secondary matching on the text data T1 using word-filter class labels with reference to the basic word stock to obtain text data T2 with secondary labeling completed;
taking each labeled dialogue text in the text data T2, together with several dialogue texts surrounding it, as target text T3;
constructing a sliding window over the target text T3, continuously generating long texts T4 each consisting of the short sentences in the window, and calculating the similarity between each long text T4 and each long text sentence vector of the corresponding label in the basic word stock;
for each label contained in each long text T4, if the similarity between the long text T4 and at least one long text sentence vector of that label in the basic word stock reaches a preset similarity threshold, determining that the current label is the qualitative label L1 of the long text T4.
In other words, the invention uses the text label labeling model based on a compound word stock, previously proposed by Jinmao Cloud Technology, to perform qualitative tag recognition. The workflow of the model is shown in FIG. 2. First, the text is preprocessed: long text is segmented into short sentences at punctuation, and the short sentences are input into the model as basic units. Word-scaling class labels are then recognized against the basic word stock, after which the word stock of labels other than the word-scaling class is used for secondary matching. The matched short sentences are located, the dialogue text containing the current short sentence and the 5 dialogues above and below it are obtained, and a moving window of length 5 is slid over the short sentences in these dialogue texts. Sentence vectors of the text in the window are generated with a word2vec model, and their similarity to the long text sentence vectors in the word stock corpus of the matched label is computed continuously. If the similarity to at least one long text corpus entry under a label reaches the threshold, the label is judged to be matched; a matching log is then recorded and the result is output.
The model outputs the qualitative labels present in the text and the corresponding sentences. The matching log is then parsed through text preprocessing into key-value pairs (key: long text T4, value: qualitative label), which are stored in a two-dimensional table.
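The similarity test at the end of this workflow can be sketched in Python as follows. The source names word2vec but does not say how word vectors are combined into a sentence vector or which similarity measure is used; averaging word vectors and cosine similarity are assumptions here, as are the threshold value and all function names. A plain dictionary stands in for the trained word2vec lookup.

```python
import math

def sentence_vector(words, word_vectors):
    """Average the vectors of known words (assumed way to build a sentence vector)."""
    known = [word_vectors[w] for w in words if w in word_vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]

def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def match_labels(window_vec, label_corpus_vecs, threshold=0.8):
    """Return labels for which at least one corpus vector reaches the threshold."""
    return [label for label, vecs in label_corpus_vecs.items()
            if any(cosine(window_vec, v) >= threshold for v in vecs)]
```

The result of `match_labels` for a window corresponds to the value side of the key-value pairs described above, with the long text T4 as the key.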
S2, constructing and training a general-domain pre-training model by using a general-domain corpus.
Further, as shown in fig. 3, step S2 includes:
S21, loading a general-domain corpus;
S22, acquiring texts from the general-domain corpus;
S23, performing a MASK operation on random characters in the acquired text;
S24, predicting the masked text using the base model;
S25, after the traversal of the general-domain corpus is complete, calculating the difference between the predicted results and the actual results;
S26, if the difference does not reach the preset corresponding threshold, reloading the general-domain corpus to continue training the model, and repeating steps S22-S26;
S27, if the difference reaches the preset corresponding threshold, ending the training to obtain the general-domain pre-training model.
In other words, to reduce labor input and to work around the lack of sequence-labeled corpora in the industry domain, the method selects a public general-domain corpus as the training corpus and constructs and trains a general text pre-training model, using the RoBERTa model as the base model and the MLM (masked language modeling) task as the training objective. The MLM task flow is shown in FIG. 3.
Through this step, a Chinese model that understands the semantics of general text (a usable semantic coding model) can be trained; an example of its input and output is shown in FIG. 4.
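The control flow of steps S21 to S27 can be sketched in Python as a loop that repeats passes over the corpus until the measured difference reaches the threshold. The callable `train_epoch` is a hypothetical stand-in for one masked-prediction pass with the base model, `max_epochs` is a safety cap that does not appear in the source, and reading "reaches the threshold" as "falls to or below it" is an assumption.

```python
def pretrain(corpus, train_epoch, threshold, max_epochs=100):
    """Repeat passes over the corpus until the loss reaches the threshold.

    Returns (number of passes run, last loss).
    """
    loss = float("inf")
    for epoch in range(1, max_epochs + 1):
        # S22-S25: mask, predict, and measure the difference over the corpus.
        loss = train_epoch(corpus)
        if loss <= threshold:
            # S27: the difference reached the threshold, so training ends.
            return epoch, loss
        # S26: otherwise reload the corpus and continue with another pass.
    return max_epochs, loss
```

In a real setup `train_epoch` would also update the model weights; here it only needs to return the difference measured for the pass.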
S3, designing a corresponding entity dictionary with reference to the entity tags to be identified, and setting the usage mode of the entity dictionary.
The qualitative labels of step S1 can be decomposed into finer-grained entity tags according to business requirements, but entity tags in string form cannot be input into the pre-training model. The invention therefore designs a corresponding entity dictionary with reference to the BIO tagging format and the entity tags to be identified. As shown in FIG. 5, 4 entity tags are designed in total, namely area, price, proportion and residential community name, and each entity tag has a corresponding serial number. FIG. 6 illustrates how the entity dictionary is used.
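A minimal Python sketch of such a dictionary is given below. It maps BIO-style tag strings for the four entity labels (area, price, proportion, and the residential community or "cell" name) to integer serial numbers. The actual numbering and layout in FIG. 5 are not reproduced in the text, so the scheme used here (O as 0, then a B-/I- pair per label) is an assumption.

```python
ENTITY_LABELS = ["area", "price", "proportion", "community name"]

def build_entity_dictionary(labels):
    """Map BIO tags to integer ids: O -> 0, then a B-/I- pair per entity label."""
    tag_to_id = {"O": 0}
    for label in labels:
        tag_to_id[f"B-{label}"] = len(tag_to_id)  # first character of an entity
        tag_to_id[f"I-{label}"] = len(tag_to_id)  # subsequent characters
    return tag_to_id
```

With four labels this yields nine tags, which gives the model integer targets in place of string labels.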
S4, carrying out sequence labeling on the small sample field text according to the entity labels contained in the entity dictionary designed in the step S3, and generating a small sample data set of the real estate industry.
Further, as shown in fig. 7, step S4 includes:
S41, acquiring the entity dictionary;
S42, entering the entity information into a labeling platform;
S43, formulating a labeling standard;
S44, extracting data from the small-sample domain text and importing it into the labeling platform;
S45, manually labeling the data from the small-sample domain text;
S46, preprocessing the labeled data according to the usage mode of the entity dictionary, and generating a fully labeled small sample data set.
In other words, the pre-training model can only produce text encodings. A small amount of manual work is needed to sequence-label the small-sample domain text with the entity tags set in step S3; the labeled data are then used to fine-tune the pre-training model so that it can perform the entity recognition task. The data set construction flow is shown in FIG. 7.
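The preprocessing of step S46 can be illustrated in Python by converting manually marked character spans into a per-character sequence of tag ids. Character-level BIO tagging, the span format `(start, end, label)` with an exclusive end, and the small `TAG_TO_ID` table are assumptions made for this sketch.

```python
# Assumed miniature entity dictionary for the example.
TAG_TO_ID = {"O": 0, "B-price": 1, "I-price": 2}

def spans_to_bio_ids(text, spans, tag_to_id=TAG_TO_ID):
    """Convert (start, end, label) character spans to per-character tag ids."""
    tags = ["O"] * len(text)
    for start, end, label in spans:
        tags[start] = f"B-{label}"          # entity begins here
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # entity continues
    return [tag_to_id[t] for t in tags]

# One labeled sample: the characters at positions 5..8 ("3000") are a price.
sample_ids = spans_to_bio_ids("rent 3000", [(5, 9, "price")])
```

Pairs of text and id sequence of this kind would make up the small sample data set used for fine-tuning.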
S5, performing fine-tuning training on the general-domain pre-training model with the small sample data set, inputting the informal text to be identified into the fine-tuned pre-training model, and outputting the entity sequence obtained by analyzing the informal text to be identified.
Further, as shown in fig. 8, step S5 includes:
S51, loading the general-domain pre-training model and the small sample data set, and constructing training data from the small sample data set;
S52, defining an optimizer and a learning rate, and training the general-domain pre-training model on the training data;
S53, when the model evaluation index reaches the preset corresponding threshold, inputting the informal text to be identified into the pre-training model; when the model evaluation index does not reach the preset corresponding threshold, repeating step S52;
S54, removing redundant dimensions from the output vector of the pre-training model, and removing incomplete entity sequences from the output vector;
S55, locating the indices of the completely recognized entity sequences in the output vector of the pre-training model, parsing the entity sequences according to the entity dictionary, and producing structured output.
In other words, the pre-training model obtained in step S2 is fine-tuned with the small sample data set constructed in step S4, and the model output is post-processed. The continued training flow of the entity recognition model is shown in FIG. 8.
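Steps S54 and S55 can be sketched in Python as a decoder over per-character tag ids. The sketch assumes BIO tags, treats positions beyond the text length as the redundant (padding) dimension, and treats an I- tag with no preceding B- tag of the same label as an incomplete entity sequence; these readings, the function name, and the tag table are assumptions.

```python
ID_TO_TAG = {0: "O", 1: "B-price", 2: "I-price"}  # assumed miniature dictionary

def decode_entities(text, tag_ids, id_to_tag=ID_TO_TAG):
    """Parse tag ids into (label, text) pairs, dropping incomplete sequences."""
    tags = [id_to_tag[t] for t in tag_ids[:len(text)]]  # drop padded positions
    entities, start, label = [], None, None
    for i, tag in enumerate(tags + ["O"]):  # trailing "O" closes a final entity
        if tag.startswith("I-") and label == tag[2:]:
            continue  # still inside the current entity
        if label is not None:
            entities.append((label, text[start:i]))
        # Only a B- tag opens an entity; an orphan I- tag is discarded.
        start, label = (i, tag[2:]) if tag.startswith("B-") else (None, None)
    return entities
```

The returned list of label and text pairs is one possible form of the structured output mentioned in step S55.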
Further, the method also comprises a model test evaluation flow, which specifically comprises the following steps:
inputting the obtained long text T4 into the fine-tuned pre-training model for entity recognition, and removing redundancy in combination with the corresponding qualitative label;
sampling the results for manual verification, and judging whether the accuracy of the total sample, the recall rate of the total sample and the accuracy of each entity label each reach the corresponding standard;
and if the accuracy of the total sample, the recall rate of the total sample and the accuracy of each entity label each reach the corresponding standard, judging that the fine-tuned pre-training model has reached a usable level.
That is, the long text T4 generated in step S1 is input into the model of step S5 for entity recognition, and redundancy is then removed using the qualitative label corresponding to the text, so that only the entity labels corresponding to the qualitative label remain. Sampled results were verified manually: the accuracy of the total sample (that is, precision: number of correctly recognized entity labels / total number of recognized entity labels) is 88.26%, the recall rate of the total sample (number of correctly recognized entity labels / total number of entity expressions actually present) is 72.59%, and the average accuracy over the entity labels reaches 90.09%, so the model reaches a usable level. If the model evaluation indices need to be improved, the amount of small sample data in step S4 can be increased.
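The two ratios defined above can be computed as in the following Python sketch. Representing each entity as a `(sample_id, label, text)` tuple and comparing predictions with manually verified results as sets are assumptions about the bookkeeping, not details taken from the source.

```python
def precision_recall(predicted, gold):
    """Return (precision, recall) over collections of (sample_id, label, text)."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)  # correctly recognized entity labels
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall
```

Computing the same ratio separately per entity label and averaging the results would give the per-label figure reported above.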
The method for identifying the entity tag of the informal text provided by the embodiment of the invention can reduce the labor cost investment of entity identification and is also suitable for identifying the entity tag of the informal text: through innovation of natural language processing technology, the labor cost of entity identification technology in the real estate industry can be reduced; by carrying out fine tuning training on the universal language pre-training model, the field adaptation of the model can be realized, and the named entity of the informal text can be identified; and the task target of identifying the entity tag according to the service requirement is realized by combining the use of the qualitative tag identification model.
The informal text entity tag recognition method can be used in the real estate industry and is also suitable for recognizing entity tags in texts from other fields that lack large sequence-labeled corpora; the choice of pre-training model is flexible and can be varied;
the method can improve informal text entity recognition when domain-labeled corpora are limited; and the qualitative labels act as weak classifiers, improving the model's recognition of entity labels while also giving it the ability to recognize business labels.
In summary, the invention can reduce the labor cost invested in entity recognition, recognize named entities in informal text, and recognize entity tags according to business requirements.
It should be understood that, although the steps in the flowcharts of FIGS. 1-8 are shown in the order indicated by the arrows, these steps are not necessarily performed in that order. Unless explicitly stated herein, the steps are not strictly limited to this order of execution and may be executed in other orders. Moreover, at least some of the steps in FIGS. 1-8 may include multiple sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times, and which are not necessarily performed sequentially but may be performed in turn or alternately with other steps or with at least part of the sub-steps or stages of other steps.
In one embodiment, as shown in fig. 9, there is provided an informal text entity tag recognition apparatus including the following program modules:
the qualitative tag recognition module 901 is configured to segment an informal text to be recognized into short sentences, match the informal text to be recognized with reference to a basic word stock, and perform qualitative tag recognition;
a general domain pre-training model training module 902 for constructing and training a general domain pre-training model using the general domain corpus;
the entity dictionary design module 903 is configured to design a corresponding entity dictionary with reference to the entity labels to be identified, and set a usage mode of the entity dictionary;
the small sample data set generating module 904 is configured to perform sequence labeling on the small sample field text according to the entity tag included in the entity dictionary designed by the entity dictionary design module, so as to generate a small sample data set;
the entity sequence analysis module 905 is configured to perform fine tuning training on the universal field pre-training model by using the small sample data set, input the informal text to be identified into the fine-tuned pre-training model, and output an entity sequence obtained by analyzing the informal text to be identified.
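As an illustration of what "outputting an entity sequence obtained by analyzing the informal text" could involve, the sketch below turns per-token tags into structured entities and drops incomplete entity sequences. It assumes a BIO tagging scheme with tags such as `B-AREA` and `I-AREA`, which the text does not specify; the function name `parse_entities`, the tag names, and the demo tokens are likewise hypothetical.

```python
from typing import Dict, List, Optional, Tuple


def parse_entities(tokens: List[str], tags: List[str]) -> List[Dict]:
    """Collect complete B-/I- runs; orphan I- tags are treated as incomplete."""
    spans: List[Tuple[str, int, int]] = []      # (label, start, end)
    current: Optional[Tuple[str, int]] = None   # (label, start) of open entity

    for i, tag in enumerate(tags):
        if tag.startswith("B-"):
            if current is not None:             # close the previous entity
                spans.append((current[0], current[1], i))
            current = (tag[2:], i)
        elif (tag.startswith("I-") and current is not None
              and current[0] == tag[2:]):
            continue                            # entity keeps growing
        else:
            # "O" tag, or an I- tag with no matching B- before it.
            if current is not None:
                spans.append((current[0], current[1], i))
            current = None
    if current is not None:
        spans.append((current[0], current[1], len(tags)))

    return [{"label": label, "text": " ".join(tokens[start:end]),
             "start": start, "end": end}
            for label, start, end in spans]


# Illustrative input: the I-PRICE at index 4 has no preceding B-PRICE.
tokens = ["about", "90", "sqm", ",", "asking", "3", "million"]
tags = ["O", "B-AREA", "I-AREA", "O", "I-PRICE", "B-PRICE", "I-PRICE"]
entities = parse_entities(tokens, tags)
```

Tokens are joined with spaces here because the demo is in English; for character-level Chinese input the join separator would be an empty string.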
For specific limitations of the informal text entity tag recognition apparatus, reference may be made to the limitations of the informal text entity tag recognition method above, which are not repeated here. Each of the above modules in the informal text entity tag recognition apparatus may be implemented in whole or in part by software, hardware, or a combination thereof. The above modules may be embedded in, or independent of, a processor in the computer device in hardware form, or may be stored in a memory of the computer device in software form, so that the processor can call and execute the operations corresponding to each module.
In one embodiment, a computer device is provided, which may be a terminal, and whose internal structure may be as shown in FIG. 10. The computer device includes a processor, a memory, a communication interface, a display screen, and an input device connected by a system bus. The processor of the computer device provides computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for running the operating system and the computer program stored in the non-volatile storage medium. The communication interface of the computer device is used for wired or wireless communication with an external terminal; the wireless mode can be realized through Wi-Fi, an operator network, NFC (near field communication), or other technologies. The computer program, when executed by the processor, implements an informal text entity tag recognition method. The display screen of the computer device may be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer device may be a touch layer covering the display screen, a key, trackball, or touch pad arranged on the housing of the computer device, or an external keyboard, touch pad, mouse, or the like.
It will be appreciated by those skilled in the art that the structure shown in fig. 10 is merely a block diagram of some of the structures associated with the present application and is not limiting of the computer device to which the present application may be applied, and that a particular computer device may include more or fewer components than shown, or may combine certain components, or have a different arrangement of components.
In one embodiment, a computer device is provided, including a memory and a processor, the memory storing a computer program; when the processor executes the computer program, all or part of the flow of the methods of the embodiments described above is implemented.
In one embodiment, a computer readable storage medium having a computer program stored thereon is provided; when the computer program is executed by a processor, all or part of the flow of the methods of the embodiments described above is implemented.
Those skilled in the art will appreciate that all or part of the methods described above may be implemented by a computer program stored on a non-transitory computer readable storage medium, and that the program, when executed, may comprise the steps of the method embodiments described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include at least one of non-volatile and volatile memory. Non-volatile memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, or the like. Volatile memory may include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM).
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; however, as long as a combination of technical features involves no contradiction, it should be considered within the scope of this description.
The above examples represent only a few embodiments of the present application; although they are described in some detail, they are not to be construed as limiting the scope of the invention. It should be noted that those skilled in the art could make various modifications and improvements without departing from the spirit of the present application, and these would fall within the scope of the present application. Accordingly, the scope of protection of the present application is to be determined by the appended claims.

Claims (10)

1. A method for identifying an informal text entity tag, comprising:
S1, dividing an informal text to be identified into short sentences, referring to a basic word stock, matching the informal text to be identified, and carrying out qualitative tag identification;
S2, constructing and training a universal field pre-training model by using a universal field corpus;
S3, designing a corresponding entity dictionary by referring to the entity labels to be identified, and setting a using mode of the entity dictionary;
S4, carrying out sequence labeling on the text in the small sample field according to the entity labels contained in the entity dictionary designed in the step S3, and generating a small sample data set;
and S5, performing fine tuning training on the universal field pre-training model by adopting the small sample data set, inputting the informal text to be identified into the fine-tuned pre-training model, and outputting an entity sequence obtained by analyzing the informal text to be identified.
2. The method for recognizing an informal text entity tag according to claim 1, wherein step S1 comprises:
loading a compound word stock, preprocessing the informal text to be identified, and dividing the informal text to be identified into short sentences according to punctuation;
inputting the short sentences obtained by segmentation into a text matrix, and referring to a basic word stock to match the word scaling class labels in the short sentences to obtain text data T1 finishing preliminary labeling;
performing secondary matching on the text data T1 by using word filter class labels by referring to a basic word stock to obtain text data T2 with secondary labeling completed;
taking the dialogue text with the label in the text data T2 and a plurality of pieces of dialogue text around the dialogue text as target text T3;
constructing a sliding window in the target text T3, continuously generating long text T4 consisting of short sentences in the window, and calculating the similarity of each long text T4 and each long text sentence vector of the corresponding label in the basic word stock;
and for each tag contained in each long text T4, if the similarity between at least one long text sentence vector and the long text T4 in the basic word stock reaches a preset similarity threshold, judging that the current tag is a qualitative tag L1 of the long text T4.
3. The method of informal text entity tag identification of claim 2, wherein the sliding window has a length of 5.
4. The method for recognizing an informal text entity tag according to claim 1, wherein step S2 comprises:
S21, loading a general field corpus;
S22, acquiring texts from the corpus of the general field;
S23, performing MASK operation on the random characters in the acquired text;
S24, predicting the text subjected to MASK operation by using the basic model;
S25, after the general field corpus completes traversal, calculating the difference between the predicted result and the actual result;
S26, if the difference value does not reach the preset corresponding threshold value, reloading the universal field corpus to continue training the model, and repeating the steps S22-S26;
and S27, if the difference value reaches a preset corresponding threshold value, training is finished, and a universal field pre-training model is obtained.
5. The method of informal text entity tag identification of claim 4, wherein the base model is a RoBERTa model.
6. The method for recognizing entity tags of informal text according to claim 1, wherein the entity dictionary comprises 4 entity tags, which are area, price, proportion and residential community name, respectively; each entity tag has a corresponding serial number.
7. The method for recognizing an informal text entity tag according to claim 1, wherein step S4 comprises:
S41, acquiring the entity dictionary;
S42, inputting entity information into a marking platform;
S43, making a labeling standard;
S44, extracting data from the small sample field text and importing the small sample field text into a marking platform;
S45, manually marking data in the text in the small sample field;
and S46, preprocessing the marked data by referring to the using mode of the entity dictionary, and generating a small sample data set with marks completed.
8. The method for recognizing an informal text entity tag according to claim 1, wherein step S5 comprises:
S51, loading the universal field pre-training model and a small sample data set, and constructing training data by using the small sample data set;
S52, defining an optimizer and a learning rate, and training the universal field pre-training model through the training data;
S53, inputting the informal text to be identified into a pre-training model when the model evaluation index reaches a preset corresponding threshold; when the model evaluation index does not reach the preset corresponding threshold, repeating the step S52;
S54, removing redundant dimensions of the output vector of the pre-training model, and removing incomplete entity sequences of the output vector of the pre-training model;
and S55, positioning the entity sequence index of the complete recognition of the output vector of the pre-training model, analyzing the entity sequence according to the entity dictionary, and carrying out structural output.
9. The method of informal text entity tag identification of claim 2, the method further comprising:
inputting the obtained long text T4 into the fine-tuned pre-training model to perform entity identification, and combining the result with the corresponding qualitative label to perform redundancy elimination;
extracting sample results for manual verification, and judging whether the accuracy of the total sample, the recall rate of the total sample, and the accuracy of each entity label each reach the corresponding standard;
and if the accuracy of the total sample, the recall rate of the total sample, and the accuracy of each entity label each reach the corresponding standard, judging that the fine-tuned pre-training model is usable.
10. An informal text entity tag recognition apparatus, comprising:
the qualitative tag recognition module is used for dividing the informal text to be recognized into short sentences, referring to a basic word stock to match the informal text to be recognized, and carrying out qualitative tag recognition;
a training module of the universal field pre-training model for constructing and training the universal field pre-training model by using the corpus of the universal field;
the entity dictionary design module is used for designing a corresponding entity dictionary by referring to entity labels to be identified and setting a use mode of the entity dictionary;
the small sample data set generation module is used for carrying out sequence labeling on the text in the small sample field according to the entity labels contained in the entity dictionary designed by the entity dictionary design module to generate a small sample data set;
and the entity sequence analysis module is used for carrying out fine tuning training on the universal field pre-training model by adopting the small sample data set, inputting the informal text to be identified into the fine-tuned pre-training model, and outputting an entity sequence obtained by analyzing the informal text to be identified.
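The sliding-window step recited in claims 2 and 3 can be sketched as follows. This is an illustrative reading only: the claims state a window length of 5 and a "preset similarity threshold" but do not say which similarity measure or sentence-vector model is used, so the cosine similarity, the toy two-dimensional vectors, the label names, and the function names `sliding_windows`, `cosine`, and `qualitative_labels` are all assumptions.

```python
import math
from typing import Dict, List, Sequence


def sliding_windows(sentences: List[str], size: int = 5) -> List[str]:
    """Build one long text per window of `size` consecutive short sentences."""
    if len(sentences) <= size:
        return [" ".join(sentences)] if sentences else []
    return [" ".join(sentences[i:i + size])
            for i in range(len(sentences) - size + 1)]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def qualitative_labels(vector: Sequence[float],
                       references: Dict[str, List[Sequence[float]]],
                       threshold: float) -> List[str]:
    """Keep a label if at least one of its reference vectors is similar enough."""
    return [label for label, vectors in references.items()
            if any(cosine(vector, ref) >= threshold for ref in vectors)]


# Toy data: seven short sentences and hand-written 2-d "sentence vectors".
sentences = [f"s{i}" for i in range(7)]
windows = sliding_windows(sentences)
references = {"price_inquiry": [[1.0, 0.0]], "viewing": [[0.0, 1.0]]}
labels = qualitative_labels([1.0, 0.0], references, threshold=0.8)
```

In practice the vector passed to `qualitative_labels` would come from whatever sentence encoder is applied to each long text T4; that encoder is outside the scope of this sketch.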
CN202211490287.1A 2022-11-25 2022-11-25 Informal text entity tag identification method and device Pending CN116070632A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211490287.1A CN116070632A (en) 2022-11-25 2022-11-25 Informal text entity tag identification method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211490287.1A CN116070632A (en) 2022-11-25 2022-11-25 Informal text entity tag identification method and device

Publications (1)

Publication Number Publication Date
CN116070632A true CN116070632A (en) 2023-05-05

Family

ID=86169011

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211490287.1A Pending CN116070632A (en) 2022-11-25 2022-11-25 Informal text entity tag identification method and device

Country Status (1)

Country Link
CN (1) CN116070632A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116541535A (en) * 2023-05-19 2023-08-04 北京理工大学 Automatic knowledge graph construction method, system, equipment and medium
CN116702787A (en) * 2023-08-07 2023-09-05 四川隧唐科技股份有限公司 Long text entity identification method, device, computer equipment and medium
CN116702747A (en) * 2023-05-30 2023-09-05 珠海盈米基金销售有限公司 PDF online reader design method, device, computer equipment and medium

Similar Documents

Publication Publication Date Title
CN110096570B (en) Intention identification method and device applied to intelligent customer service robot
CN112818093B (en) Evidence document retrieval method, system and storage medium based on semantic matching
CN116070632A (en) Informal text entity tag identification method and device
CN110222330B (en) Semantic recognition method and device, storage medium and computer equipment
CN111401058B (en) Attribute value extraction method and device based on named entity recognition tool
CN111462752B (en) Attention mechanism, feature embedding and BI-LSTM based customer intention recognition method
CN112016313B (en) Spoken language element recognition method and device and warning analysis system
CN115357719B (en) Power audit text classification method and device based on improved BERT model
CN110334186A (en) Data query method, apparatus, computer equipment and computer readable storage medium
CN112036184A (en) Entity identification method, device, computer device and storage medium based on BiLSTM network model and CRF model
CN111274817A (en) Intelligent software cost measurement method based on natural language processing technology
CN113486178B (en) Text recognition model training method, text recognition method, device and medium
CN111079432A (en) Text detection method and device, electronic equipment and storage medium
CN113204967B (en) Resume named entity identification method and system
CN114298035A (en) Text recognition desensitization method and system thereof
CN114580424A (en) Labeling method and device for named entity identification of legal document
CN114036950A (en) Medical text named entity recognition method and system
CN115525757A (en) Contract abstract generation method and device and contract key information extraction model training method
CN110968661A (en) Event extraction method and system, computer readable storage medium and electronic device
CN112347780B (en) Judicial fact finding generation method, device and medium based on deep neural network
CN114416991A (en) Method and system for analyzing text emotion reason based on prompt
CN117725211A (en) Text classification method and system based on self-constructed prompt template
CN116089605A (en) Text emotion analysis method based on transfer learning and improved word bag model
CN116362247A (en) Entity extraction method based on MRC framework
CN111339760A (en) Method and device for training lexical analysis model, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination