CN114970536A

CN114970536A - Combined lexical analysis method for word segmentation, part of speech tagging and named entity recognition

Info

Publication number: CN114970536A
Application number: CN202210715424.0A
Authority: CN
Inventors: 线岩团; 朱叶芬; 文永华; 王红斌
Original assignee: Kunming University of Science and Technology
Current assignee: Kunming University of Science and Technology
Priority date: 2022-06-22
Filing date: 2022-06-22
Publication date: 2022-08-30

Abstract

The invention discloses a joint lexical analysis method for word segmentation, part of speech tagging and named entity recognition, which comprises the steps of decomposing a word segmentation task and a part of speech tagging task into two subtasks of candidate word detection and part of speech category prediction, decomposing an entity recognition task into two subtasks of candidate entity detection and entity category prediction, and performing joint learning on the four tasks by adopting a unified neural network model; and meanwhile, parameters among different tasks are shared. The invention improves the word boundary detection problem in the part of speech tagging task and the entity recognition task by using high-accuracy word segmentation, and can improve the word segmentation precision by using part of speech tagging information. And joint learning is performed by utilizing high relevance among word segmentation, part-of-speech tagging and named entity recognition, so that the model performance is improved.

Description

Combined lexical analysis method for word segmentation, part of speech tagging and named entity recognition

Technical Field

The invention relates to a joint lexical analysis method for word segmentation, part of speech tagging and named entity recognition, and relates to the technical field of natural language processing.

Background

In natural language processing, lexical analysis is a basic key task in natural language processing, and word segmentation, part of speech tagging and named entity recognition in the lexical analysis are all the bases of downstream tasks such as text classification, information retrieval and machine translation.

Although the existing Chinese word segmentation model, part of speech tagging model and named entity recognition model all have certain progress, the model for performing multi-task combination aiming at the three tasks has not yet been developed. The N-grams statistical language model is used for selecting the word combination with the maximum occurrence probability by utilizing the related information between the adjacent words of the context to realize automatic word segmentation. The model has language independence and strong spelling error compatibility, can better process Chinese and English and complex and simple texts, and is a common statistical language model for word segmentation tasks. The type of model is not limited by the text field, but the recognition speed is still to be improved. In past research, word segmentation, part-of-speech tagging, and named entity recognition were generally viewed as separate tasks, generally viewed as sequence tagging tasks,

in recent years, a batch of methods based on a deep learning algorithm appear, the mainstream model is an Encoder-Decoder model, the most representative model is a BilSTM-CRF model, the method inherits the advantages of the deep learning method in the aspect of characteristics, and good effect can be achieved by using word vectors and character vectors without characteristic engineering. BilSTM can capture the semantics of each word in the context, but from the perspective of part-of-speech tagging and named entity recognition, overlong sequences have little predictive significance for word parts-of-speech and entities, and the training using CRF decoding is costly and complex. The tasks in the existing joint lexical analysis method are usually mutually independent in structure, but are sequentially dependent in the data processing process, and the method can introduce error transmission among the tasks and influence the performance of a model. Different from the prior method, the method of the invention provides a brand-new paradigm, and provides a detection-classification framework by a multi-label classification idea. The segmentation is firstly graded, then classification is carried out, the denominator does not need to be recursively calculated like CRF during training, dynamic planning is not needed during prediction, the tasks of word segmentation, part of speech tagging and named entity recognition are carried out simultaneously, data sharing is achieved, and the performance of each task is improved to different degrees.

Disclosure of Invention

The invention aims to provide a joint lexical analysis method for word segmentation, part of speech tagging and named entity recognition, which decomposes a word segmentation task and a part of speech tagging task into two subtasks of candidate word detection and part of speech category prediction, decomposes an entity recognition task into two subtasks of candidate entity detection and entity category prediction, and adopts a unified neural network model to carry out joint learning on the four tasks, thereby realizing the multi-task joint learning of word segmentation, part of speech tagging and named entity recognition.

In order to achieve the technical purpose and achieve the technical effect, the invention is realized by the following technical scheme:

a joint lexical analysis method for word segmentation, part of speech tagging and named entity recognition is disclosed, and the analysis method comprises the following steps: the word segmentation and part-of-speech tagging task is decomposed into two subtasks of candidate word detection and part-of-speech category prediction, the entity recognition task is decomposed into two subtasks of candidate entity detection and entity category prediction, and a unified neural network model is adopted to carry out joint learning on the four tasks; and meanwhile, parameters among different tasks are shared.

A joint lexical analysis method for word segmentation, part of speech tagging and named entity recognition comprises the following steps:

s1: performing data preprocessing on the text obtained from the PFR1998, and matching each character segment with its corresponding label category;

s2: and sequentially obtaining information of each sentence from the data preprocessed in the S1 as input, performing feature coding on the input by using a BERT pre-training language model, obtaining the context semantic vector representation of each word in the sentence, performing candidate word detection and candidate entity detection on all continuous subsequences in the sentence, and predicting word sequence labels, part of speech sequence labels and entity sequence labels in the current sequence by calculating the score probability of the candidate words and the candidate entities.

Further, the preprocessing of the data by the S1 includes:

constructing a part-of-speech tag dictionary and an entity tag dictionary for the words according to the training set;

labeling label types for the character segments by combining position information of the characters in the sentences;

then, each sentence takes the character as an input unit, each character is endowed with a fixed id number through a word segmentation device of a BERT pre-training language model, and a segmentation sequence [ w ] of the sentence is obtained ₁ ,w ₂ ,...,w _n ]Wherein w is _i Indicating the number of the segmentation term in the BERT vocabulary.

Further, the S2 obtains context semantic vector representations of each word in the sentence:

the segmented sequence [ w ] ₁ ,w ₂ ,...,w _n ]After the BERT pre-training language model code is input, the vector representation h is obtained ₁ ,h ₂ ,...,h _n ]Wherein h is _i Is w _i Corresponding vector representation, wherein the vector dimension d is 768;

further, the S2 performs candidate word detection and candidate entity detection on all consecutive subsequences in the sentence, including:

vector sequence [ h ] obtained after coding ₁ ,h ₂ ,...,h _n ]By transforming q _i ＝W _q h _i +b _q And k _i ＝W _k h _i +b _k Wherein W is _q And W _k Are parameters of the model;

obtain a vector sequence [ q ] ₁ ，q ₂ ，...，q _n ]And [ k ] ₁ ，k ₂ ，...，k _n ]Is a feature vector for word segmentation, by q _i And k _j Inner product of (3) calculating a continuous subsequence x _[i：j] Word score s ₁ (i，j)；

Optimal solutions are obtained using a greedy algorithm:

max(s ₁ (i，j)，s ₁ (i，j+1))

similarly, the vector sequence [ h ] obtained after coding ₁ ，h ₂ ，...，h _n ]By transforming r _i ＝W _r h _i +b _r And u _i ＝W _u h _i +b _u Wherein W is _r And W _u Are parameters of the model; obtain the vector sequence [ r ₁ ，r ₂ ，...，r _n ]And [ u ] ₁ ，u ₂ ，...，u _n ]Is to determine whether the feature vector is used by the entity by r _i And u _j Inner product of (3) calculating a continuous subsequence x _[i：j] Word score s ₂ (i，j)，

Further, after candidate word detection and candidate entity detection are performed on all the continuous subsequences in the sentence, part-of-speech category prediction and entity category prediction are performed:

the vector sequence [ h ] obtained after coding ₁ ，h ₂ ，...，h _n ]By passing

The predicted dependency label (label),

wherein the content of the first and second substances,U ⁽¹⁾ is dimension R ^m×d×d The higher order tensor of (m is the number of tags, d is the Biaffine input dimension),

is known at the same time as the posterior probability in the case of i as dep and j as head,

is the posterior probability that i or j is known to be both ends of the dependency (arc);

further, the word sequence tags, part of speech sequence tags, and entity sequence tags in the predicted current sequence are:

by the score s obtained ₁ (i, j) calculating a loss function loss _ ws of the participle:

wherein P is the beginning and end set of words for the sample, and Q is the beginning and end set of all non-words for the sample;

and respectively calculating loss functions (loss _ pos and loss _ ner) of part-of-speech tagging and named entity identification by adopting multi-classification cross entropy:

where C1 is a set of part-of-speech categories,

as a word x _[i：j] The label in the category C1 is,

is the predicted value of the model on category c;

where C2 is a set of entity classes,

is an entity x _[i：j] The label on the category C2 that is,

is the predicted value of the model on category c.

The invention has the beneficial effects that:

the invention relates to a combined lexical analysis method for word segmentation, part of speech tagging and named entity recognition, which is characterized in that a word segmentation task and a part of speech tagging task are decomposed into two subtasks of candidate word detection and part of speech category prediction, an entity recognition task is decomposed into two subtasks of candidate entity detection and entity category prediction, and a unified neural network model is adopted to carry out combined learning on the four tasks; meanwhile, parameters among different tasks are shared;

the combined lexical analysis method for word segmentation, part of speech tagging and named entity recognition is different from a sequence tagging method, the part of speech tagging is regarded as two subtasks of word detection and part of speech classification, and the entity recognition is regarded as two subtasks of entity detection and entity classification. On the basis of obtaining word sequence representation, the method adopts four neural network layers to realize multi-task combination of word segmentation, part of speech tagging and named entity identification.

The invention jointly learns and predicts the three tasks of word segmentation, part of speech tagging and named entity recognition, and implicitly shares one data. By sharing parameters among different tasks, the network capability is weakened to a certain extent, and the generalization capability of each task can be improved.

The invention improves the word boundary detection problem in the part of speech tagging task and the entity recognition task by using high-accuracy word segmentation, and can improve the word segmentation precision by using part of speech tagging information. And joint learning is performed by utilizing high relevance among word segmentation, part-of-speech tagging and named entity recognition, so that the model performance is improved.

Of course, it is not necessary for any product in which the invention is practiced to achieve all of the above-described advantages at the same time.

Drawings

FIG. 1 is a diagram illustrating a method for joint lexical analysis of segmented words, part-of-speech tagging and named entity recognition according to an embodiment of the present invention;

Detailed Description

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the embodiments of the present invention will be described in detail below with reference to the accompanying drawings.

Example 1

As shown in fig. 1, the joint lexical analysis method of word segmentation, part of speech tagging and named entity recognition:

s1: and constructing a part-of-speech tag dictionary and an entity tag dictionary for the words according to the training set. Wherein, the part of speech tags are 45, and the entity tags are 5;

labeling the label type of the character segment according to the position information of the character in the sentence;

then, each sentence takes the character as an input unit, each character is endowed with a fixed id number through a word segmentation device of a BERT pre-training language model, and a segmentation sequence [ w ] of the sentence is obtained ₁ ，w ₂ ，...，w _n ]Wherein w is _i Representing the number of the segmentation item in the BERT vocabulary;

Obtaining a context semantic vector representation for each word in the sentence:

the segmented sequence [ w ] ₁ ，w ₂ ，...，w _n ]After the BERT pre-training language model code is input, the vector representation h is obtained ₁ ，h ₂ ，...，h _n ]Wherein h is _i Is w _i Corresponding vector representation, wherein the vector dimension d is 768;

vector sequence [ h ] obtained after coding ₁ ，h ₂ ，...，h _n ]By transforming q _i ＝W _q h _i +b _q And k _i ＝W _k h _i +b _k Wherein W is _q And W _k Are parameters of the model;

obtain the vector sequence [ q ] ₁ ，q ₂ ，...，q _n ]And [ k ] ₁ ，k ₂ ，...，k _n ]Is a feature vector for word segmentation, by q _i And k _j Inner product of (3) calculating a continuous subsequence x _[i：j] Word score s ₁ (i，j)；

Using a greedy algorithm to obtain an optimal solution:

max(s ₁ (i，j)，s ₁ (i，j+1))

After candidate word detection and candidate entity detection are carried out on all continuous subsequences in the sentence, part of speech category prediction and entity category prediction are carried out:

The predicted dependency label (label),

wherein, U ⁽¹⁾ Is dimension R ^m×d×d The higher order tensor of (m is the number of tags, d is the Biaffine input dimension),

the word sequence tags, part of speech sequence tags and entity sequence tags in the predicted current sequence are as follows:

where C1 is a set of part-of-speech categories,

as a word x _[i：j] The label on the category C1 that is,

is the predicted value of the model in category c;

where C2 is a set of entity classes,

is an entity x _[i：j] The label on the category C2 that is,

is the predicted value of the model on category c.

Example 2

Based on the joint lexical analysis method of word segmentation, part of speech tagging and named entity recognition in embodiment 1, in order to illustrate the effect of the invention, a 1-group comparison experiment is set. The first group of experiments verify the validity of the participle, the part of speech tag and the named entity identification, the second group of experiments verify the validity of the part of speech tag, and the third group of experiments verify the validity of the named entity identification.

Word segmentation performance improvement verification

The method inputs characters in a text into a model, utilizes a BERT pre-training language model to perform characteristic coding on the input to obtain a representation, and further predicts a dependency relationship and a dependency label in a current sequence. In order to verify the effectiveness of the model on word segmentation, the model is compared and analyzed with a plurality of other related models under a people daily report labeling corpus (PFR). The results of the experiment are shown in table 1:

TABLE 1 Chinese participle method Performance comparison

As can be seen from the analysis of Table 1, the F1 value of the method of the invention is higher than that of all other methods, and the effectiveness of the model in the Chinese word segmentation task is proved. The BERT pre-training language model in the method structure can effectively relieve the problem of unknown words during prediction. And the part-of-speech tagging information is also effective for improving the Chinese word segmentation performance.

Part-of-speech tagging performance improvement verification

In order to verify the validity of the proposed model for Chinese part-of-speech tagging, a comparison experiment is carried out on the model and a plurality of related models below under a people daily report tagging corpus (PFR). The results of the experiment are shown in table 2:

TABLE 2 Chinese part-of-speech tagging method Performance comparison

Analysis of table 2 reveals that the F1 values of the model of the present invention exceed all mainstream models in the human daily annotated corpus (PFR). The CNN-based method F1 has the lowest value, which indicates that the neural network model can obtain more deep-level input information in language compared with the conventional machine learning-based method. And the F1 value of the model is higher than that of other two neural network-based models, probably because other models lack text morphological information and internal structure information, so that the model part-of-speech tag prediction is biased. And because the Chinese text lacks word separators and the identification of word boundaries in the Chinese part-of-speech tagging is very challenging, the model performs joint learning on the segmented words so as to improve the capacity of the Chinese part-of-speech model in identifying word boundaries and improve the performance of the part-of-speech tagging.

Named entity identification performance enhancement verification

In order to verify the effectiveness of the proposed model in identifying the Chinese named entity, the invention carries out comparison experiments with a plurality of related models below under a people daily report annotation corpus (PFR). The results of the experiment are shown in table 3:

TABLE 3 Performance comparison of the Chinese named entity recognition methods

Analysis of table 3 reveals that the F1 values of the model of the present invention exceed all mainstream models in the human daily annotated corpus (PFR). The present invention treats entity recognition as a continuous subsequence classification problem in sentences while predicting all entities contained in the sentence. Different entity categories are identified by adopting independent prediction parameters, so that the performance of model named entity identification is improved to a certain extent.

The experimental data prove that the combined task of word segmentation, part of speech tagging and named entity recognition is regarded as a multi-classification problem of continuous subsequences in sentences, and the model performance is improved to a certain extent. The word boundary detection problem in the part of speech tagging task and the entity recognition task is improved by word segmentation with high accuracy, and the word segmentation precision can be improved by using part of speech tagging information. Performing joint learning by utilizing high relevance among word segmentation, part-of-speech tagging and named entity recognition, so that a Chinese lexical analysis model obtains better performance; the problem of unknown words during prediction can be effectively relieved by selecting a proper pre-training language model; a multi-task joint learning mode is adopted, and implicit data sharing is performed, which is equivalent to implicit data enhancement. The method can share parameters, weaken the network capacity to a certain extent and prevent overfitting. The invention provides a combined model and a device for identifying participles, parts of speech and named entities, which are used for marking and naming Chinese participles and parts of speech.

In the description herein, references to the description of "one embodiment," "an example," "a specific example" or the like are intended to mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.

The preferred embodiments of the invention disclosed above are intended to be illustrative only. The preferred embodiments are not intended to be exhaustive or to limit the invention to the precise embodiments disclosed. Obviously, many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles of the invention and the practical application, to thereby enable others skilled in the art to best understand the invention for and utilize the invention. The invention is limited only by the claims and their full scope and equivalents.

Claims

1. A joint lexical analysis method for word segmentation, part of speech tagging and named entity recognition is characterized by comprising the following steps: the word segmentation and part-of-speech tagging task is decomposed into two subtasks of candidate word detection and part-of-speech category prediction, the entity recognition task is decomposed into two subtasks of candidate entity detection and entity category prediction, and a unified neural network model is adopted to carry out joint learning on the four tasks; and meanwhile, parameters among different tasks are shared.

2. The method of joint lexical analysis of segmentation, part of speech tagging and named entity recognition of claim 1, comprising the steps of:

s1: pre-processing the data of the text obtained from the PFR1998 to match each character segment with its corresponding label category;

3. The method of joint lexical analysis of segmentation, part of speech tagging and named entity recognition of claim 2, comprising the steps of: the S1 preprocessing the data includes:

4. The method of joint lexical analysis of segmentation, part-of-speech tagging and named entity recognition of claim 3, comprising the steps of:

for the preprocessed data, obtaining context semantic vector representation of each word in the sentence:

the segmented sequence [ w ] ₁ ,w ₂ ,...,w _n ]After the BERT pre-training language model code is input, the vector representation h is obtained ₁ ,h ₂ ,...,h _n ]Wherein h is _i Is w _i The corresponding vector represents, and the vector dimension d equals 768.

5. The method of joint lexical analysis of segmentation, part of speech tagging and named entity recognition of claim 4, comprising the steps of:

performing candidate word detection and candidate entity detection on all continuous subsequences in the sentence, including:

obtain a vector sequence [ q ] ₁ ,q ₂ ,…,q _n ]And [ k ] ₁ ,k ₂ ,...,k _n ]Is a feature vector for word segmentation, by q _i And k _j Inner product of (3) calculating a continuous subsequence x _[i：j] Word score s ₁ (i，j)；

Using a greedy algorithm to obtain an optimal solution:

max(s ₁ (i，j)，s ₁ (i，j+1))

similarly, the vector sequence [ h ] obtained after coding ₁ ,h ₂ ,...,h _n ]By transforming r _i ＝W _r h _i +b _r And u _i ＝W _u h _i +b _u Wherein W is _r And W _u Are parameters of the model; obtain the vector sequence [ r ₁ ,r ₂ ,...,r _n ]And [ u ] ₁ ,u ₂ ,...,u _n ]Is to determine whether the feature vector is used by the entity by r _i And u _j Inner product of (3) calculating a continuous subsequence x _[i：j] Word score s ₂ (i，j)，

6. The method of joint lexical analysis of segmentation, part of speech tagging and named entity recognition of claim 5, comprising the steps of:

the vector sequence [ h ] obtained after coding ₁ ,h ₂ ,...,h _n ]By passing

The predicted dependency label (label),

is the posterior probability that i or j is known to be both ends of the dependency (arc).

7. The method of joint lexical analysis of segmentation, part-of-speech tagging and named entity recognition of claim 6, comprising the steps of:

after the part of speech category prediction and the entity category prediction, predicting word sequence labels, part of speech sequence labels and entity sequence labels in the current sequence as follows:

wherein the content of the first and second substances,c1 is a set of part-of-speech categories,

as a word x _[i：j] The label in the category C1 is,

is the predicted value of the model in category c;

where C2 is a set of entity classes,

is an entity x _[i：j] The label on the category X2 that is,

is the predicted value of the model on category c.