CN114169332A - Deep learning model-based address named entity identification tuning method - Google Patents

Deep learning model-based address named entity identification tuning method

Info

Publication number
CN114169332A
CN114169332A (application CN202111443614.3A)
Authority
CN
China
Prior art keywords
entity
model
industry
data
dictionary
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111443614.3A
Other languages
Chinese (zh)
Inventor
冯纯博
卫海智
李钊辉
黄洋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Kexun Jialian Information Technology Co ltd
Original Assignee
Kexun Jialian Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Kexun Jialian Information Technology Co ltd filed Critical Kexun Jialian Information Technology Co ltd
Priority to CN202111443614.3A
Publication of CN114169332A
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295Named entity recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/237Lexical tools
    • G06F40/242Dictionaries
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/047Probabilistic or stochastic networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to natural language recognition, and in particular to an address named entity recognition tuning method based on a deep learning model. The method comprises: collecting industry corpora of related fields and constructing an industry entity dictionary; collecting online Chinese data and labelling it manually according to the task target to generate templates; performing data enhancement on the templates and the entity names in the industry entity dictionary, followed by data expansion; optimizing the masking mechanism in the pre-training stage of a neural network language model using the unlabelled industry corpora and the entity dictionary; fine-tuning the neural network language model for the downstream recognition task and selecting the model with the highest test accuracy as the output model; and collecting online real-time data and saving entities whose predicted confidence falls below a confidence threshold to a log file. The technical scheme provided by the invention can effectively overcome the defects of the prior art, in which model optimization depends on a large amount of labelled data and the model recognition effect is poor.

Description

Deep learning model-based address named entity identification tuning method
Technical Field
The invention relates to natural language recognition, in particular to an address named entity recognition tuning method based on a deep learning model.
Background
The named entity recognition task is very common in the field of natural language processing; its purpose is to recognize entities of specific types in natural language text. Named entity recognition is applied very widely: the express delivery industry needs to identify the sender's name, telephone number, item and detailed address; the news media industry needs to identify person names, place names and organization names; the medical industry needs to identify patient and doctor names, pathology names, symptoms, drug names and medication instructions; and the bioinformatics field uses it to extract information such as proteins and DNA.
Named entity recognition tasks are typically modeled as character-level sequence tagging tasks: for an input character sequence, a named entity recognition model must predict the named entity tag corresponding to each character. Two named entity recognition models are typical in current practical applications of natural language processing technology.
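For illustration, a character-level tag sequence under a BIO-style scheme might look as follows; the tag set (B-PROV, I-PROV, B-CITY, I-CITY) and the example address are hypothetical:

```python
# Illustrative character-level BIO tagging for a Chinese address.
text = "安徽省合肥市"   # "Anhui Province, Hefei City"
tags = ["B-PROV", "I-PROV", "I-PROV", "B-CITY", "I-CITY", "I-CITY"]

for char, tag in zip(text, tags):   # the model predicts one tag per character
    print(char, tag)
```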
The first approach is based on LSTM (long short-term memory) and CRF (conditional random field) models. During training, the input Chinese sequence is embedded (word-vector encoded) character by character and fed into the LSTM network; a bidirectional LSTM is generally adopted to take context information into account, and a CRF layer is usually stacked on top to constrain the relations between entities and the rules of state transition between entity tags. A named entity model based on LSTM and CRF has a simple structure, a small number of parameters, a small computing-resource footprint and fast inference, but suffers from low prediction accuracy and weak generalization ability.
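For concreteness, a minimal sketch of this first approach follows, written in PyTorch. The hyperparameters are illustrative, and the CRF layer is taken from the third-party pytorch-crf package; this is an assumption about tooling, not something prescribed by the text above.

```python
import torch.nn as nn
from torchcrf import CRF  # pytorch-crf package (assumed tooling)

class BiLSTMCRF(nn.Module):
    """Character-level BiLSTM encoder with a CRF layer, as in the first approach."""
    def __init__(self, vocab_size, num_tags, embed_dim=128, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)   # character embedding
        self.lstm = nn.LSTM(embed_dim, hidden_dim // 2,
                            bidirectional=True, batch_first=True)
        self.emit = nn.Linear(hidden_dim, num_tags)        # per-character tag scores
        self.crf = CRF(num_tags, batch_first=True)         # tag-transition constraints

    def loss(self, chars, tags, mask):
        emissions = self.emit(self.lstm(self.embed(chars))[0])
        return -self.crf(emissions, tags, mask=mask)       # negative log-likelihood

    def predict(self, chars, mask):
        emissions = self.emit(self.lstm(self.embed(chars))[0])
        return self.crf.decode(emissions, mask=mask)       # best tag sequence per sample
```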
The second approach is based on a deeper pre-trained model, which essentially replaces the LSTM of the first approach with a pre-trained model in order to characterize richer and more complex semantic relationships. The current mainstream pre-trained models are the Transformer-based series, such as BERT and GPT.
Entity recognition with a BERT model first performs pre-training, consisting of two tasks: the Masked Language Model (MLM) and Next Sentence Prediction (NSP). The model is then fine-tuned for the specific task: BERT extracts a vector feature for each character of the Chinese sequence to classify its label, after which a CRF stage constrains the label values to a reasonable range. With a training process based on BERT and CRF, prior knowledge such as grammar and semantics can be acquired from massive unlabelled data through pre-training.
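As a sketch of the fine-tuning interface in this second approach, the following uses the HuggingFace Transformers library; the checkpoint name, label count and label ids are illustrative assumptions, and the CRF stage is noted but omitted:

```python
import torch
from transformers import BertForTokenClassification, BertTokenizerFast

# Checkpoint, label count and label ids are illustrative assumptions.
tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")
model = BertForTokenClassification.from_pretrained("bert-base-chinese", num_labels=7)

batch = tokenizer("安徽省合肥市", return_tensors="pt")
# One label id per token; -100 marks the [CLS]/[SEP] positions to be ignored.
labels = torch.tensor([[-100, 0, 1, 1, 2, 3, 3, -100]])
out = model(**batch, labels=labels)
out.loss.backward()   # fine-tuning step (optimizer omitted)
# out.logits holds per-character label scores; a CRF layer can be stacked on top.
```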
However, the BERT model only considers the features of each character and ignores the relations between Chinese entities. To improve the effect, model fine-tuning must rely on a large amount of corpora, and an effect bottleneck still exists.
Both approaches have obvious defects in model effect and training process: both extract Chinese vector features at the character level and cannot take the relations and features of Chinese words into account; meanwhile, improving model accuracy depends on a large amount of labelled data and on extensive manual analysis of the model's misrecognized data. In the first approach, although the model structure is simple and training time is short, the effect is difficult to improve; the second approach uses a pre-trained model and incorporates Chinese prior knowledge at the model initialization stage, but the relations it considers are only character-level, information between entities is ignored, the improvement on the named entity recognition task is slight, and a large amount of data is still required.
Disclosure of Invention
Technical problem to be solved
Aiming at the defects in the prior art, the invention provides an address named entity recognition tuning method based on a deep learning model, which can effectively overcome the defects that model optimization in the prior art depends on a large amount of labelled data and that the model recognition effect is poor.
(II) technical scheme
In order to achieve the purpose, the invention is realized by the following technical scheme:
a tuning method for address named entity recognition based on a deep learning model comprises the following steps:
s1, collecting industry corpora of related fields, and constructing an industry entity dictionary;
s2, collecting online Chinese data, manually marking according to a task target to generate a template, performing data enhancement on the template and an entity name in an industry entity dictionary, and then performing data expansion;
s3, performing mask mechanism optimization in the pre-training stage of the neural network language model by using the unmarked industry linguistic data and the entity dictionary;
s4, performing model fine tuning on the neural network language model aiming at the downstream recognition task, and selecting the neural network language model with the highest test precision as an output model;
and S5, collecting online real-time data, storing the entity with the output model prediction result lower than the confidence coefficient threshold value in a log file, and optimizing the output model by using the log file.
Preferably, collecting the industry corpora of related fields and constructing the industry entity dictionary in S1 comprises:
S1, integrating the existing public entity dictionaries in the field to form a 'public entity dictionary';
S2, having experts in the field construct a series of entity-matching rules based on experience, performing expert-experience matching on the collected public corpora by string matching or pattern matching combined with entity features such as keywords, special words or structural rules, extracting the entities and constructing an expert entity dictionary;
S3, integrating the public entity dictionary and the expert entity dictionary to construct an experience entity dictionary;
S4, counting word occurrence frequencies in an unsupervised manner, recalling a large number of candidate entities by word frequency, calculating the degree of freedom and compactness of the candidates, and screening out entities by setting thresholds to form an unsupervised entity dictionary;
S5, selecting a small amount of corpora, recalling candidate words by word frequency, screening the candidates by frequency, completeness, information content and co-occurrence degree, and taking the screened candidates that intersect the experience entity dictionary as the positive sample set for training;
S6, randomly sampling other words by negative sampling to form a negative sample set, and training a BERT model with the positive and negative sample sets;
S7, scoring the quality of all entities recalled from the corpora with the trained BERT model, and selecting the effective entities;
S8, predicting the types of the words with an AutoNER model to form a supervised entity dictionary;
S9, integrating the unsupervised entity dictionary and the supervised entity dictionary to construct a mined entity dictionary.
Preferably, the data enhancement of the templates and the entity names in the industry entity dictionary in S2 comprises:
performing fission sampling on the entity names in the templates and the industry entity dictionary: selecting a small number of samples at random and, according to the characteristics of these samples, selecting related and similar content from a user dictionary for replacement so as to generate new small-sample sets, with data enhancement proceeding in the same manner.
Preferably, the data expansion in S2 comprises:
expanding the data according to the industry corpus templates, the expansion methods adopted including a pseudo-label strategy, upsampling, slot replacement and synonym replacement.
Preferably, optimizing the masking mechanism in the pre-training stage of the neural network language model by using the unlabelled industry corpora and the entity dictionary in S3 comprises:
upgrading the neural network language model from a random word-masking strategy to a target masking strategy based on actual entities.
Preferably, the target masking strategy degenerates to the random masking strategy when no actual "entity" is present.
Preferably, fine-tuning the neural network language model for the downstream recognition task in S4 comprises:
performing downstream business training on the neural network language model with the data after enhancement and expansion, introducing the CRF layer of the neural network language model to capture the characteristics of transitions between label types, and fine-tuning the learning rates of the model and the CRF layer.
Preferably, saving the entities whose output-model prediction results fall below the confidence threshold to a log file in S5 comprises:
after a SoftMax function, the output model obtains a confidence over the label categories for each character of the Chinese sequence to be recognized; the confidence of each entity in the sequence is the mean of the confidences of its characters; meanwhile, a confidence threshold is set manually according to the data distribution characteristics, and entities whose confidence falls below the threshold are saved to the log file.
(III) advantageous effects
Compared with the prior art, the address named entity recognition tuning method based on a deep learning model integrates prior knowledge of Chinese entities and makes full use of the feature distribution of the data and of the convenience of iterative model optimization; the model fine-tuning process does not need to rely on a large amount of labelled data, making this a technical scheme convenient for model tuning and iterative optimization.
Drawings
In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are only some embodiments of the invention; for a person skilled in the art, other drawings can be derived from them without inventive effort.
FIG. 1 is a schematic flow diagram of the present invention;
FIG. 2 is a schematic flow chart of construction of an industry entity dictionary in the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention. It is to be understood that the embodiments described are only a few embodiments of the present invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
An address named entity recognition tuning method based on a deep learning model is disclosed. As shown in FIG. 1, industry corpora of related fields are collected and an industry entity dictionary is constructed.
Named entity recognition in many professional fields faces persistent problems, for example the diversity of entities: Chinese contains synonyms, abbreviations and the like, so the same entity often has several surface forms (for example, 'Industrial and Commercial Bank of China' is commonly shortened to '工行'); drugs in the medical industry generally have a trade name, a generic name and a chemical name (a 'cold medicine' may be called 'Gankang', 'Compound Paracetamol' or 'the N-(4-hydroxyphenyl)acetamide molecule'); and meanings are strongly ambiguous, with one Chinese expression representing several entity meanings (for example, 'apple' may refer to Apple Inc. or to the fruit). Therefore, industry corpora of the related field are first collected, rules and a user dictionary are specified, and an industry entity dictionary is constructed.
The industry entity dictionary can be constructed in supervised, unsupervised, distantly supervised and other manners, and the industry corpus sources can include public data sets, encyclopedia entries, self-built information databases, user search logs and unstructured user comments within the industry.
As shown in FIG. 2, the existing public entity dictionaries in the field are first integrated to form a 'public entity dictionary'. Experts in the field then construct a series of entity-matching rules based on experience; string matching or pattern matching, combined with entity features such as keywords, special words or structural rules, is applied to the collected public corpora to perform expert-experience matching, extract entities and build an expert entity dictionary; the public entity dictionary and the expert entity dictionary are integrated to construct an experience entity dictionary. Next, word occurrence frequencies are counted in an unsupervised manner, a large number of candidate entities are recalled by word frequency, their degree of freedom and compactness are calculated, and entities are screened out by setting thresholds, forming an unsupervised entity dictionary. Then a small amount of corpora is selected, candidate words are recalled by word frequency and screened by frequency, completeness, information content and co-occurrence degree; the screened candidates that intersect the experience entity dictionary serve as the positive sample set for training, other words are randomly drawn by negative sampling to form the negative sample set, and a BERT model is trained with the two sets.
The trained BERT model is used to score the quality of all entities recalled from the corpora, and the effective entities are selected; finally, the types of the words are predicted by an AutoNER model to form a supervised entity dictionary, and the unsupervised entity dictionary and the supervised entity dictionary are integrated to construct the mined entity dictionary.
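The unsupervised mining step lends itself to a compact illustration. The following is a minimal sketch, assuming a plain list of Chinese strings as the corpus and illustrative thresholds; compactness is approximated here by the pointwise mutual information of a candidate's weakest internal split, and degree of freedom by the entropy of its neighbouring characters:

```python
import math
from collections import Counter

def mine_entities(corpus, max_len=4, min_freq=5, pmi_min=3.0, ent_min=1.5):
    """Unsupervised candidate mining sketch: recall n-grams by frequency, then
    keep those whose compactness (PMI) and freedom (boundary entropy) pass
    the thresholds. All thresholds are illustrative."""
    ngrams, left, right = Counter(), {}, {}
    for text in corpus:
        for n in range(1, max_len + 1):
            for i in range(len(text) - n + 1):
                w = text[i:i + n]
                ngrams[w] += 1
                if n > 1:  # record boundary characters for multi-char candidates
                    left.setdefault(w, Counter())[text[i - 1] if i > 0 else "^"] += 1
                    j = i + n
                    right.setdefault(w, Counter())[text[j] if j < len(text) else "$"] += 1
    total = sum(c for w, c in ngrams.items() if len(w) == 1)

    def entropy(counter):
        s = sum(counter.values())
        return -sum((c / s) * math.log(c / s) for c in counter.values())

    entities = []
    for w, c in ngrams.items():
        if len(w) < 2 or c < min_freq:
            continue
        # compactness: PMI of the candidate against its least cohesive split
        pmi = min(math.log(c * total / (ngrams[w[:k]] * ngrams[w[k:]]))
                  for k in range(1, len(w)))
        # freedom: lower of the left/right neighbour entropies
        freedom = min(entropy(left[w]), entropy(right[w]))
        if pmi >= pmi_min and freedom >= ent_min:
            entities.append((w, c))
    return entities
```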
Online Chinese data are then collected and labelled manually according to the task target to generate templates; data enhancement is performed on the templates and the entity names in the industry entity dictionary, followed by data expansion.
The data enhancement of the templates and the entity names in the industry entity dictionary comprises:
performing fission sampling on the entity names in the templates and the industry entity dictionary: selecting a small number of samples at random and, according to the characteristics of these samples, selecting related and similar content from a user dictionary for replacement so as to generate new small-sample sets, with data enhancement proceeding in the same manner.
The data expansion comprises:
expanding the data according to the industry corpus templates, the expansion methods adopted including a pseudo-label strategy, upsampling, slot replacement and synonym replacement.
Generally, a pre-trained model is fine-tuned for a specific task with one's own training set; however, that training set may have shortcomings, such as an unbalanced label-class distribution or scarce training data.
For the problem of unbalanced label-class distribution, the original data can be downsampled, selecting high-quality data in a suitable proportion for each class as the training set. This method is only suitable when the data volume is large; otherwise the downsampled training set becomes too small to achieve a good fine-tuning effect.
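A minimal sketch of such balanced downsampling, assuming flat lists of samples and labels and an illustrative per-class cap:

```python
import random
from collections import defaultdict

def downsample(samples, labels, per_class=1000, seed=42):
    """Balanced downsampling sketch: keep at most `per_class` samples per
    label class. Only sensible when the majority classes are large, as the
    text above notes."""
    by_label = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_label[label].append(sample)
    rng = random.Random(seed)
    train_set = []
    for label, group in by_label.items():
        rng.shuffle(group)
        train_set += [(sample, label) for sample in group[:per_class]]
    rng.shuffle(train_set)
    return train_set
```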
In nuclear fission, a uranium-235 nucleus struck by a thermal neutron splits and releases two to three neutrons; the released neutrons strike further uranium-235 nuclei, and so on, forming a chain reaction. The concept of 'fission' sampling derives from this process: it is a sampling method for finding, within a sparse population, target members that match characteristics set in advance.
Such scenarios are common in real life, for example people who have attended a certain meeting, people engaged in a certain professional direction, or members of a small minority; the target group makes up only a tiny share of the population, perhaps one in ten thousand or less in a given area. Obtaining such a sample with conventional sampling would require screening tens of thousands of people, which is often impractical and costly. The specific procedure of fission sampling is: first randomly select some people from the population as survey subjects and screen out the small number of required samples; then continue the survey along the clues provided by those samples, and so on, forming a chain reaction and thereby obtaining a large number of rare samples.
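Applied to data enhancement, the same chain reaction can be sketched as follows; the {ENT} slot format, the entity type and the dictionary contents are illustrative assumptions, not part of the method's specification:

```python
import random

def fission_sample(seed_templates, entity_dict, rounds=3, children=4, seed=7):
    """'Fission' augmentation sketch: each template carrying an {ENT} slot
    spawns several concrete samples by substituting related entity names from
    the user dictionary; each offspring is re-templated and bred in the next
    round, forming a chain reaction of new samples."""
    rng = random.Random(seed)
    corpus, generation = [], list(seed_templates)
    for _ in range(rounds):
        next_gen = []
        for template, ent_type in generation:
            for _ in range(children):
                ent = rng.choice(entity_dict[ent_type])
                sample = template.replace("{ENT}", ent)
                corpus.append((sample, ent, ent_type))                      # new labelled sample
                next_gen.append((sample.replace(ent, "{ENT}"), ent_type))   # re-template and breed
        generation = next_gen
    return corpus

# Usage sketch (names and addresses hypothetical):
seeds = [("收件地址：{ENT}金寨路96号", "DISTRICT")]
entities = {"DISTRICT": ["合肥市蜀山区", "芜湖市镜湖区", "安庆市迎江区"]}
print(len(fission_sample(seeds, entities)))  # 4 + 16 + 64 = 84 samples
```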
Using the unlabelled industry corpora and the entity dictionary, the masking mechanism is optimized in the pre-training stage of the neural network language model, specifically:
upgrading the neural network language model from a random word-masking strategy to a target masking strategy based on actual entities, and degrading the target masking strategy back to the random masking strategy when no actual entity appears.
Existing deep learning language models do not take the features of words into account: during pre-training, characters are randomly selected as mask labels with a certain probability, so the minimal element of the method is the single character. Words that originally belong together are thereby split into single characters, and the inherent information of Chinese words is not fully utilized in the pre-training stage.
In the technical scheme, entities of various types are extracted from the industry corpora according to the rule templates, and the entity names are used as the objects of the mask labels. This increases the difficulty of predicting the masked content in the pre-training stage and makes full use of word-level information, so the pre-training effect can be effectively improved.
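A minimal sketch of this target masking strategy, assuming a plain list of characters and an entity dictionary given as an iterable of strings; the greedy substring lookup and the masking probability are illustrative simplifications:

```python
import random

MASK = "[MASK]"

def target_mask(chars, entities, mask_prob=0.15, rng=None):
    """Target masking sketch: if a dictionary entity occurs in the character
    sequence, mask one whole entity span; otherwise degrade to random
    character-level masking."""
    rng = rng or random.Random(0)
    text = "".join(chars)
    spans = []
    for ent in sorted(entities, key=len, reverse=True):  # prefer longer entities
        i = text.find(ent)
        if i >= 0:
            spans.append((i, i + len(ent)))
    masked = list(chars)
    if spans:
        start, end = rng.choice(spans)        # mask the whole entity span
        for i in range(start, end):
            masked[i] = MASK
    else:
        for i in range(len(masked)):          # degraded: random masking
            if rng.random() < mask_prob:
                masked[i] = MASK
    return masked

# Usage sketch (entities hypothetical):
print(target_mask(list("寄往合肥市蜀山区黄山路"), {"合肥市", "蜀山区"}))
```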
The neural network language model is then fine-tuned for the downstream recognition task, and the neural network language model with the highest test accuracy is selected as the output model.
Fine-tuning the neural network language model for the downstream recognition task comprises:
performing downstream business training on the neural network language model with the data after enhancement and expansion, introducing the CRF layer of the neural network language model to capture the characteristics of transitions between label types, and fine-tuning the learning rates of the model and the CRF layer.
Online real-time data are collected, entities whose output-model prediction results fall below the confidence threshold are saved to a log file, and the output model is optimized with the log file.
Saving the entities whose output-model prediction results fall below the confidence threshold to the log file comprises:
after a SoftMax function, the output model obtains a confidence over the label categories for each character of the Chinese sequence to be recognized; the confidence of each entity in the sequence is the mean of the confidences of its characters (averaged over the character length of the entity); meanwhile, a confidence threshold is set manually according to the data distribution characteristics, and entities whose confidence falls below the threshold are saved to the log file.
In the technical scheme, after the model is deployed online, a data feedback mechanism is introduced: a confidence threshold is set manually according to the data distribution characteristics, and entities below the threshold are recorded and collected for iterating, optimizing and updating the model, realizing closed-loop processing of the whole data flow.
The above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not depart from the spirit and scope of the corresponding technical solutions.

Claims (8)

1. A tuning method for address named entity recognition based on a deep learning model, characterized in that the method comprises the following steps:
s1, collecting industry corpora of related fields, and constructing an industry entity dictionary;
s2, collecting online Chinese data, manually marking according to a task target to generate a template, performing data enhancement on the template and an entity name in an industry entity dictionary, and then performing data expansion;
s3, performing mask mechanism optimization in the pre-training stage of the neural network language model by using the unmarked industry linguistic data and the entity dictionary;
s4, performing model fine tuning on the neural network language model aiming at the downstream recognition task, and selecting the neural network language model with the highest test precision as an output model;
and S5, collecting online real-time data, storing the entity with the output model prediction result lower than the confidence coefficient threshold value in a log file, and optimizing the output model by using the log file.
2. The address named entity recognition tuning method based on a deep learning model according to claim 1, characterized in that collecting industry corpora of related fields and constructing an industry entity dictionary in S1 comprises:
S1, integrating the existing public entity dictionaries in the field to form a 'public entity dictionary';
S2, having experts in the field construct a series of entity-matching rules based on experience, performing expert-experience matching on the collected public corpora by string matching or pattern matching combined with entity features such as keywords, special words or structural rules, extracting the entities and constructing an expert entity dictionary;
S3, integrating the public entity dictionary and the expert entity dictionary to construct an experience entity dictionary;
S4, counting word occurrence frequencies in an unsupervised manner, recalling a large number of candidate entities by word frequency, calculating the degree of freedom and compactness of the candidates, and screening out entities by setting thresholds to form an unsupervised entity dictionary;
S5, selecting a small amount of corpora, recalling candidate words by word frequency, screening the candidates by frequency, completeness, information content and co-occurrence degree, and taking the screened candidates that intersect the experience entity dictionary as the positive sample set for training;
S6, randomly sampling other words by negative sampling to form a negative sample set, and training a BERT model with the positive and negative sample sets;
S7, scoring the quality of all entities recalled from the corpora with the trained BERT model, and selecting the effective entities;
S8, predicting the types of the words with an AutoNER model to form a supervised entity dictionary;
S9, integrating the unsupervised entity dictionary and the supervised entity dictionary to construct a mined entity dictionary.
3. The address named entity recognition tuning method based on a deep learning model according to claim 1, characterized in that the data enhancement of the templates and the entity names in the industry entity dictionary in S2 comprises:
performing fission sampling on the entity names in the templates and the industry entity dictionary: selecting a small number of samples at random and, according to the characteristics of these samples, selecting related and similar content from a user dictionary for replacement so as to generate new small-sample sets, with data enhancement proceeding in the same manner.
4. The address named entity recognition tuning method based on a deep learning model according to claim 3, characterized in that the data expansion in S2 comprises:
expanding the data according to the industry corpus templates, the expansion methods adopted including a pseudo-label strategy, upsampling, slot replacement and synonym replacement.
5. The address named entity recognition tuning method based on a deep learning model according to claim 1, characterized in that optimizing the masking mechanism in the pre-training stage of the neural network language model by using the unlabelled industry corpora and the entity dictionary in S3 comprises:
upgrading the neural network language model from a random word-masking strategy to a target masking strategy based on actual entities.
6. The address named entity recognition tuning method based on a deep learning model according to claim 5, characterized in that when no actual "entity" is present, the target masking strategy degenerates to the random masking strategy.
7. The address named entity recognition tuning method based on a deep learning model according to claim 1, characterized in that fine-tuning the neural network language model for the downstream recognition task in S4 comprises:
performing downstream business training on the neural network language model with the data after enhancement and expansion, introducing the CRF layer of the neural network language model to capture the characteristics of transitions between label types, and fine-tuning the learning rates of the model and the CRF layer.
8. The address named entity recognition tuning method based on a deep learning model according to claim 1, characterized in that saving the entities whose output-model prediction results fall below the confidence threshold to a log file in S5 comprises:
after a SoftMax function, the output model obtains a confidence over the label categories for each character of the Chinese sequence to be recognized; the confidence of each entity in the sequence is the mean of the confidences of its characters; meanwhile, a confidence threshold is set manually according to the data distribution characteristics, and entities whose confidence falls below the threshold are saved to the log file.
CN202111443614.3A 2021-11-30 2021-11-30 Deep learning model-based address named entity identification tuning method Pending CN114169332A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111443614.3A CN114169332A (en) 2021-11-30 2021-11-30 Deep learning model-based address named entity identification tuning method


Publications (1)

Publication Number Publication Date
CN114169332A true CN114169332A (en) 2022-03-11

Family

ID=80481755

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111443614.3A Pending CN114169332A (en) 2021-11-30 2021-11-30 Deep learning model-based address named entity identification tuning method

Country Status (1)

Country Link
CN (1) CN114169332A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117171653A (en) * 2023-11-02 2023-12-05 成方金融科技有限公司 Method, device, equipment and storage medium for identifying information relationship
CN117171653B (en) * 2023-11-02 2024-01-23 成方金融科技有限公司 Method, device, equipment and storage medium for identifying information relationship


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination