CN114970536A - Combined lexical analysis method for word segmentation, part of speech tagging and named entity recognition - Google Patents
- Publication number: CN114970536A (application CN202210715424.0A)
- Authority
- CN
- China
- Prior art keywords: word, entity, speech, speech tagging, sequence
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06F40/279 — Recognition of textual entities
- G06F40/289 — Phrasal analysis, e.g. finite state techniques or chunking
- G06F40/295 — Named entity recognition
- G06F16/3344 — Query execution using natural language analysis
- G06F16/35 — Clustering; Classification
- G06F16/374 — Thesaurus
- G06F18/241 — Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F40/211 — Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
- G06F40/242 — Dictionaries
- G06F40/30 — Semantic analysis
- G06N3/04 — Architecture, e.g. interconnection topology
- G06N3/08 — Learning methods
Abstract
The invention discloses a joint lexical analysis method for word segmentation, part of speech tagging and named entity recognition, which comprises the steps of decomposing a word segmentation task and a part of speech tagging task into two subtasks of candidate word detection and part of speech category prediction, decomposing an entity recognition task into two subtasks of candidate entity detection and entity category prediction, and performing joint learning on the four tasks by adopting a unified neural network model; and meanwhile, parameters among different tasks are shared. The invention improves the word boundary detection problem in the part of speech tagging task and the entity recognition task by using high-accuracy word segmentation, and can improve the word segmentation precision by using part of speech tagging information. And joint learning is performed by utilizing high relevance among word segmentation, part-of-speech tagging and named entity recognition, so that the model performance is improved.
Description
Technical Field
The invention relates to a joint lexical analysis method for word segmentation, part of speech tagging and named entity recognition, and relates to the technical field of natural language processing.
Background
Lexical analysis is a fundamental task in natural language processing; word segmentation, part-of-speech tagging and named entity recognition within lexical analysis form the basis of downstream tasks such as text classification, information retrieval and machine translation.
Although existing Chinese word segmentation models, part-of-speech tagging models and named entity recognition models have each made progress, no model has yet combined the three tasks in a multi-task framework. The N-gram statistical language model realizes automatic word segmentation by using the contextual information between adjacent words to select the word combination with the maximum occurrence probability. This model is language-independent and tolerant of spelling errors, handles Chinese and English as well as traditional and simplified text well, and is a common statistical language model for word segmentation tasks. It is not limited to a particular text domain, but its recognition speed still needs improvement. In past research, word segmentation, part-of-speech tagging and named entity recognition were generally treated as separate tasks, each formulated as a sequence tagging task.
In recent years a batch of methods based on deep learning algorithms has appeared. The mainstream architecture is the Encoder-Decoder model, the most representative being the BiLSTM-CRF model, which inherits the advantages of deep learning with respect to features: good results can be achieved with word vectors and character vectors alone, without feature engineering. A BiLSTM can capture the semantics of each word in context, but from the perspective of part-of-speech tagging and named entity recognition, overly long sequences have little predictive value for word parts of speech and entities, and training with CRF decoding is costly and complex. The tasks in existing joint lexical analysis methods are usually structurally independent of each other but sequentially dependent in the data processing pipeline, which introduces error propagation between tasks and degrades model performance. Different from prior methods, the method of the invention proposes a brand-new paradigm: a detection-classification framework derived from the idea of multi-label classification. Segments are first scored and then classified, so the partition function need not be computed recursively as in a CRF during training, and no dynamic programming is needed during prediction; word segmentation, part-of-speech tagging and named entity recognition are performed simultaneously with data sharing, improving the performance of each task to different degrees.
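The detection-classification paradigm described above can be sketched in a few lines. This is a minimal illustration, not the patented implementation: the span scores and the three-label set are random stand-ins for a trained model's outputs, and the zero threshold for detection is an assumption.

```python
import numpy as np

# Detection-classification sketch: spans are first scored (detection), then
# each detected span is classified independently. No CRF partition function
# or dynamic-programming decoding is involved at any step.
rng = np.random.default_rng(42)
n = 6
span_score = rng.standard_normal((n, n))       # s(i, j) for subsequence x[i:j]
label_logits = rng.standard_normal((n, n, 3))  # per-span logits over 3 labels

# Detection: keep spans whose score clears a (hypothetical) threshold of 0.
detected = [(i, j) for i in range(n) for j in range(i, n)
            if span_score[i, j] > 0.0]

# Classification: label each detected span independently by argmax.
labels = {span: int(np.argmax(label_logits[span])) for span in detected}

print(all(0 <= lab < 3 for lab in labels.values()))  # True
```

Because every span is scored and classified independently, training and prediction are simple per-span operations, which is the efficiency argument the paragraph above makes against CRF decoding.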
Disclosure of Invention
The invention aims to provide a joint lexical analysis method for word segmentation, part of speech tagging and named entity recognition, which decomposes a word segmentation task and a part of speech tagging task into two subtasks of candidate word detection and part of speech category prediction, decomposes an entity recognition task into two subtasks of candidate entity detection and entity category prediction, and adopts a unified neural network model to carry out joint learning on the four tasks, thereby realizing the multi-task joint learning of word segmentation, part of speech tagging and named entity recognition.
In order to achieve the technical purpose and achieve the technical effect, the invention is realized by the following technical scheme:
a joint lexical analysis method for word segmentation, part of speech tagging and named entity recognition is disclosed, and the analysis method comprises the following steps: the word segmentation and part-of-speech tagging task is decomposed into two subtasks of candidate word detection and part-of-speech category prediction, the entity recognition task is decomposed into two subtasks of candidate entity detection and entity category prediction, and a unified neural network model is adopted to carry out joint learning on the four tasks; and meanwhile, parameters among different tasks are shared.
A joint lexical analysis method for word segmentation, part of speech tagging and named entity recognition comprises the following steps:
s1: performing data preprocessing on the text obtained from the PFR1998, and matching each character segment with its corresponding label category;
s2: and sequentially obtaining information of each sentence from the data preprocessed in the S1 as input, performing feature coding on the input by using a BERT pre-training language model, obtaining the context semantic vector representation of each word in the sentence, performing candidate word detection and candidate entity detection on all continuous subsequences in the sentence, and predicting word sequence labels, part of speech sequence labels and entity sequence labels in the current sequence by calculating the score probability of the candidate words and the candidate entities.
Further, the preprocessing of the data by the S1 includes:
constructing a part-of-speech tag dictionary and an entity tag dictionary for the words according to the training set;
labeling label types for the character segments by combining position information of the characters in the sentences;
then, each sentence takes the character as the input unit; each character is assigned a fixed id number by the tokenizer of the BERT pre-training language model, yielding the segmentation sequence [w_1, w_2, ..., w_n] of the sentence, where w_i denotes the index of the segmentation item in the BERT vocabulary.
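The character-to-id step can be illustrated as follows. The real method uses the BERT tokenizer's vocabulary; the toy vocabulary and its id numbers below are hypothetical stand-ins.

```python
# Sketch of the S1 preprocessing step: map each character of a sentence to a
# fixed vocabulary id, as a BERT tokenizer would. The vocab here is a toy.
def to_id_sequence(sentence, vocab, unk_id=100):
    """Return the segmentation sequence [w_1, ..., w_n] of id numbers."""
    return [vocab.get(ch, unk_id) for ch in sentence]

vocab = {"我": 2769, "爱": 4263, "北": 1266, "京": 776}  # hypothetical ids
ids = to_id_sequence("我爱北京", vocab)
print(ids)  # [2769, 4263, 1266, 776]
```

Characters outside the vocabulary fall back to a single unknown-token id, mirroring how subword tokenizers handle rare characters.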
Further, the S2 obtains context semantic vector representations of each word in the sentence:
the segmentation sequence [w_1, w_2, ..., w_n] is fed into the BERT pre-training language model encoder to obtain the vector representations [h_1, h_2, ..., h_n], where h_i is the vector representation corresponding to w_i and the vector dimension d is 768;
further, the S2 performs candidate word detection and candidate entity detection on all consecutive subsequences in the sentence, including:
the encoded vector sequence [h_1, h_2, ..., h_n] is transformed by q_i = W_q h_i + b_q and k_i = W_k h_i + b_k, where W_q and W_k are parameters of the model;
this yields the vector sequences [q_1, q_2, ..., q_n] and [k_1, k_2, ..., k_n], the feature vectors for word segmentation; the word score s_1(i, j) of a continuous subsequence x_[i:j] is computed as the inner product of q_i and k_j;
the optimal solution is obtained using a greedy algorithm:
max(s_1(i, j), s_1(i, j+1))
similarly, the encoded vector sequence [h_1, h_2, ..., h_n] is transformed by r_i = W_r h_i + b_r and u_i = W_u h_i + b_u, where W_r and W_u are parameters of the model; this yields the vector sequences [r_1, r_2, ..., r_n] and [u_1, u_2, ..., u_n], the feature vectors for entity detection; the entity score s_2(i, j) of a continuous subsequence x_[i:j] is computed as the inner product of r_i and u_j.
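The detection transforms and span scores above can be sketched with matrix operations. The encoder outputs and weight matrices below are random stand-ins (and the hidden size is shrunk from 768 for brevity); only the algebra follows the description.

```python
import numpy as np

# Candidate-word detection: q_i = W_q h_i + b_q, k_i = W_k h_i + b_k, and
# span score s1(i, j) = <q_i, k_j>, computed for all (i, j) at once.
rng = np.random.default_rng(0)
n, d = 5, 8                      # sentence length, hidden size (768 in the text)
H = rng.standard_normal((n, d))  # stand-in for BERT outputs [h_1, ..., h_n]
W_q, b_q = rng.standard_normal((d, d)), np.zeros(d)
W_k, b_k = rng.standard_normal((d, d)), np.zeros(d)

Q = H @ W_q.T + b_q              # [q_1, ..., q_n]
K = H @ W_k.T + b_k              # [k_1, ..., k_n]
S1 = Q @ K.T                     # S1[i, j] = s1(i, j) for subsequence x[i:j]

# Greedy boundary search: from start i, extend the span to j+1 only while
# s1(i, j+1) beats s1(i, j), i.e. repeatedly take max(s1(i,j), s1(i,j+1)).
def greedy_end(S, i):
    j = i
    while j + 1 < S.shape[1] and S[i, j + 1] > S[i, j]:
        j += 1
    return j

print(S1.shape)  # (5, 5)
```

The entity-detection branch is identical in shape, with W_r/b_r and W_u/b_u producing the score matrix s_2.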
Further, after candidate word detection and candidate entity detection are performed on all the continuous subsequences in the sentence, part-of-speech category prediction and entity category prediction are performed:
the encoded vector sequence [h_1, h_2, ..., h_n] is passed through the Biaffine transformation
s(i, j) = h_i^T U^(1) h_j + U^(2)(h_i ⊕ h_j) + b
to predict the label,
where U^(1) is a higher-order tensor of dimension R^(m×d×d) (m is the number of tags, d is the Biaffine input dimension); h_i^T U^(1) h_j is the posterior probability of the label given that i is the dependent (dep) and j is the head, and U^(2)(h_i ⊕ h_j) is the posterior probability given that i and j are known to be the two ends of the dependency (arc);
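A minimal sketch of the Biaffine label scorer follows. The bilinear-plus-linear form is reconstructed from the dimensions stated above (U^(1) in R^(m×d×d)); the tensors are random stand-ins for learned parameters, and the sizes are shrunk for brevity.

```python
import numpy as np

# Biaffine span labeling: score(i, j) = h_i^T U1 h_j + U2 (h_i ⊕ h_j) + b,
# returning one score per label for the pair (i, j).
rng = np.random.default_rng(1)
m, d = 4, 8                        # number of labels, Biaffine input dim
U1 = rng.standard_normal((m, d, d))
U2 = rng.standard_normal((m, 2 * d))
b = np.zeros(m)

def biaffine_scores(h_i, h_j):
    """Return an m-vector of label scores for the span ends (i, j)."""
    bilinear = np.einsum("mde,d,e->m", U1, h_i, h_j)  # h_i^T U1 h_j, per label
    linear = U2 @ np.concatenate([h_i, h_j])          # U2 (h_i ⊕ h_j)
    return bilinear + linear + b

h_i, h_j = rng.standard_normal(d), rng.standard_normal(d)
scores = biaffine_scores(h_i, h_j)
print(scores.shape)  # (4,)
```

A softmax over the m scores would give the per-label posterior probabilities the text refers to.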
further, the word sequence tags, part-of-speech sequence tags and entity sequence tags in the current sequence are predicted as follows:
the loss function loss_ws of word segmentation is computed from the obtained score s_1(i, j):
loss_ws = -Σ_{(i,j)∈P} log σ(s_1(i, j)) - Σ_{(i,j)∈Q} log(1 - σ(s_1(i, j)))
where P is the set of start-end pairs of words in the sample, Q is the set of start-end pairs of all non-words in the sample, and σ is the sigmoid function;
the loss functions loss_pos and loss_ner of part-of-speech tagging and named entity recognition are computed with multi-class cross entropy:
loss_pos = -Σ_{c∈C1} y_c log(ŷ_c),
where C1 is the set of part-of-speech categories, y_c is the label of the word x_[i:j] on category c, and ŷ_c is the predicted value of the model on category c;
loss_ner = -Σ_{c∈C2} y_c log(ŷ_c),
where C2 is the set of entity categories, y_c is the label of the entity x_[i:j] on category c, and ŷ_c is the predicted value of the model on category c.
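The three losses can be sketched as follows. Reading the segmentation loss as a binary logistic loss over the positive set P and negative set Q is an assumption consistent with the score s_1; the POS/NER losses are standard multi-class cross entropy. All inputs below are toy values.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Segmentation loss over positive spans P and negative spans Q (assumed
# binary logistic form over the raw span scores s1).
def loss_ws(S1, P, Q):
    pos = sum(-np.log(sigmoid(S1[i, j])) for i, j in P)
    neg = sum(-np.log(1.0 - sigmoid(S1[i, j])) for i, j in Q)
    return pos + neg

# Multi-class cross entropy for one span: y_true one-hot, y_pred a
# probability distribution over the category set (C1 for POS, C2 for NER).
def cross_entropy(y_true, y_pred):
    return -float(np.sum(y_true * np.log(y_pred)))

S1 = np.array([[2.0, -1.0],
               [0.5, 3.0]])               # toy span-score matrix
ws = loss_ws(S1, P=[(0, 0), (1, 1)], Q=[(0, 1)])
pos_loss = cross_entropy(np.array([0, 1, 0]), np.array([0.2, 0.7, 0.1]))
```

In training, loss_ws, loss_pos and loss_ner would be summed (possibly weighted) so that the shared encoder receives gradients from all four subtasks at once.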
The invention has the beneficial effects that:
the invention relates to a combined lexical analysis method for word segmentation, part of speech tagging and named entity recognition, which is characterized in that a word segmentation task and a part of speech tagging task are decomposed into two subtasks of candidate word detection and part of speech category prediction, an entity recognition task is decomposed into two subtasks of candidate entity detection and entity category prediction, and a unified neural network model is adopted to carry out combined learning on the four tasks; meanwhile, parameters among different tasks are shared;
the combined lexical analysis method for word segmentation, part of speech tagging and named entity recognition is different from a sequence tagging method, the part of speech tagging is regarded as two subtasks of word detection and part of speech classification, and the entity recognition is regarded as two subtasks of entity detection and entity classification. On the basis of obtaining word sequence representation, the method adopts four neural network layers to realize multi-task combination of word segmentation, part of speech tagging and named entity identification.
The invention jointly learns and predicts the three tasks of word segmentation, part-of-speech tagging and named entity recognition, implicitly sharing one set of data. Sharing parameters among different tasks reduces network capacity to a certain extent and improves the generalization ability of each task.
The invention improves the word boundary detection problem in the part of speech tagging task and the entity recognition task by using high-accuracy word segmentation, and can improve the word segmentation precision by using part of speech tagging information. And joint learning is performed by utilizing high relevance among word segmentation, part-of-speech tagging and named entity recognition, so that the model performance is improved.
Of course, it is not necessary for any product in which the invention is practiced to achieve all of the above-described advantages at the same time.
Drawings
FIG. 1 is a diagram illustrating a method for joint lexical analysis of segmented words, part-of-speech tagging and named entity recognition according to an embodiment of the present invention;
Detailed Description
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the embodiments of the present invention will be described in detail below with reference to the accompanying drawings.
Example 1
As shown in fig. 1, the joint lexical analysis method of word segmentation, part of speech tagging and named entity recognition:
s1: and constructing a part-of-speech tag dictionary and an entity tag dictionary for the words according to the training set. Wherein, the part of speech tags are 45, and the entity tags are 5;
labeling the label type of the character segment according to the position information of the character in the sentence;
then, each sentence takes the character as the input unit; each character is assigned a fixed id number by the tokenizer of the BERT pre-training language model, yielding the segmentation sequence [w_1, w_2, ..., w_n] of the sentence, where w_i denotes the index of the segmentation item in the BERT vocabulary;
s2: and sequentially obtaining information of each sentence from the data preprocessed in the S1 as input, performing feature coding on the input by using a BERT pre-training language model, obtaining the context semantic vector representation of each word in the sentence, performing candidate word detection and candidate entity detection on all continuous subsequences in the sentence, and predicting word sequence labels, part of speech sequence labels and entity sequence labels in the current sequence by calculating the score probability of the candidate words and the candidate entities.
Obtaining a context semantic vector representation for each word in the sentence:
the segmentation sequence [w_1, w_2, ..., w_n] is fed into the BERT pre-training language model encoder to obtain the vector representations [h_1, h_2, ..., h_n], where h_i is the vector representation corresponding to w_i and the vector dimension d is 768;
further, the S2 performs candidate word detection and candidate entity detection on all consecutive subsequences in the sentence, including:
the encoded vector sequence [h_1, h_2, ..., h_n] is transformed by q_i = W_q h_i + b_q and k_i = W_k h_i + b_k, where W_q and W_k are parameters of the model;
this yields the vector sequences [q_1, q_2, ..., q_n] and [k_1, k_2, ..., k_n], the feature vectors for word segmentation; the word score s_1(i, j) of a continuous subsequence x_[i:j] is computed as the inner product of q_i and k_j;
the optimal solution is obtained using a greedy algorithm:
max(s_1(i, j), s_1(i, j+1))
similarly, the encoded vector sequence [h_1, h_2, ..., h_n] is transformed by r_i = W_r h_i + b_r and u_i = W_u h_i + b_u, where W_r and W_u are parameters of the model; this yields the vector sequences [r_1, r_2, ..., r_n] and [u_1, u_2, ..., u_n], the feature vectors for entity detection; the entity score s_2(i, j) of a continuous subsequence x_[i:j] is computed as the inner product of r_i and u_j.
After candidate word detection and candidate entity detection are carried out on all continuous subsequences in the sentence, part of speech category prediction and entity category prediction are carried out:
the encoded vector sequence [h_1, h_2, ..., h_n] is passed through the Biaffine transformation
s(i, j) = h_i^T U^(1) h_j + U^(2)(h_i ⊕ h_j) + b
to predict the label,
where U^(1) is a higher-order tensor of dimension R^(m×d×d) (m is the number of tags, d is the Biaffine input dimension); h_i^T U^(1) h_j is the posterior probability of the label given that i is the dependent (dep) and j is the head, and U^(2)(h_i ⊕ h_j) is the posterior probability given that i and j are known to be the two ends of the dependency (arc);
the word sequence tags, part-of-speech sequence tags and entity sequence tags in the current sequence are predicted as follows:
the loss function loss_ws of word segmentation is computed from the obtained score s_1(i, j):
loss_ws = -Σ_{(i,j)∈P} log σ(s_1(i, j)) - Σ_{(i,j)∈Q} log(1 - σ(s_1(i, j)))
where P is the set of start-end pairs of words in the sample, Q is the set of start-end pairs of all non-words in the sample, and σ is the sigmoid function;
the loss functions loss_pos and loss_ner of part-of-speech tagging and named entity recognition are computed with multi-class cross entropy:
loss_pos = -Σ_{c∈C1} y_c log(ŷ_c),
where C1 is the set of part-of-speech categories, y_c is the label of the word x_[i:j] on category c, and ŷ_c is the predicted value of the model on category c;
loss_ner = -Σ_{c∈C2} y_c log(ŷ_c),
where C2 is the set of entity categories, y_c is the label of the entity x_[i:j] on category c, and ŷ_c is the predicted value of the model on category c.
Example 2
Based on the joint lexical analysis method of word segmentation, part of speech tagging and named entity recognition in embodiment 1, three groups of comparison experiments are set to illustrate the effect of the invention. The first group verifies the effectiveness of word segmentation, the second group verifies the effectiveness of part-of-speech tagging, and the third group verifies the effectiveness of named entity recognition.
Word segmentation performance improvement verification
The method inputs the characters of a text into the model, performs feature coding on the input with the BERT pre-training language model to obtain a representation, and then predicts the dependency relationships and dependency labels in the current sequence. To verify the effectiveness of the model on word segmentation, it is compared and analyzed against several related models on the People's Daily annotated corpus (PFR). The experimental results are shown in Table 1:
TABLE 1 Chinese participle method Performance comparison
As can be seen from the analysis of Table 1, the F1 value of the method of the invention is higher than that of all other methods, and the effectiveness of the model in the Chinese word segmentation task is proved. The BERT pre-training language model in the method structure can effectively relieve the problem of unknown words during prediction. And the part-of-speech tagging information is also effective for improving the Chinese word segmentation performance.
Part-of-speech tagging performance improvement verification
To verify the effectiveness of the proposed model for Chinese part-of-speech tagging, comparison experiments are carried out between the model and several related models below on the People's Daily annotated corpus (PFR). The experimental results are shown in Table 2:
TABLE 2 Chinese part-of-speech tagging method Performance comparison
Analysis of Table 2 shows that the F1 value of the model of the invention exceeds all mainstream models on the People's Daily annotated corpus (PFR). The CNN-based method has the lowest F1 value, indicating that neural network models capture deeper linguistic information from the input than conventional machine-learning-based methods. The F1 value of the model is also higher than that of the other two neural-network-based models, probably because those models lack textual morphological information and internal structure information, which biases their part-of-speech label predictions. Because Chinese text lacks word separators, identifying word boundaries in Chinese part-of-speech tagging is very challenging; the model therefore learns word segmentation jointly, improving its ability to identify word boundaries and thus the performance of part-of-speech tagging.
Named entity identification performance enhancement verification
To verify the effectiveness of the proposed model for Chinese named entity recognition, comparison experiments are carried out between the model and several related models below on the People's Daily annotated corpus (PFR). The experimental results are shown in Table 3:
TABLE 3 Performance comparison of the Chinese named entity recognition methods
Analysis of table 3 reveals that the F1 values of the model of the present invention exceed all mainstream models in the human daily annotated corpus (PFR). The present invention treats entity recognition as a continuous subsequence classification problem in sentences while predicting all entities contained in the sentence. Different entity categories are identified by adopting independent prediction parameters, so that the performance of model named entity identification is improved to a certain extent.
The above experimental data demonstrate that treating the joint task of word segmentation, part-of-speech tagging and named entity recognition as a multi-class classification problem over continuous subsequences in sentences improves model performance to a certain extent. High-accuracy word segmentation improves the word-boundary detection problem in the part-of-speech tagging and entity recognition tasks, and part-of-speech tagging information in turn improves word segmentation precision. Joint learning that exploits the high relevance among word segmentation, part-of-speech tagging and named entity recognition gives the Chinese lexical analysis model better performance. Choosing a suitable pre-trained language model effectively alleviates the out-of-vocabulary problem during prediction. Multi-task joint learning performs implicit data sharing, which is equivalent to implicit data augmentation; sharing parameters reduces network capacity to a certain extent and prevents overfitting. The invention thus provides a joint model for word segmentation, part-of-speech tagging and named entity recognition of Chinese text.
In the description herein, references to the description of "one embodiment," "an example," "a specific example" or the like are intended to mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
The preferred embodiments of the invention disclosed above are intended to be illustrative only. The preferred embodiments are not intended to be exhaustive or to limit the invention to the precise embodiments disclosed. Obviously, many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles of the invention and the practical application, to thereby enable others skilled in the art to best understand the invention for and utilize the invention. The invention is limited only by the claims and their full scope and equivalents.
Claims (7)
1. A joint lexical analysis method for word segmentation, part of speech tagging and named entity recognition is characterized by comprising the following steps: the word segmentation and part-of-speech tagging task is decomposed into two subtasks of candidate word detection and part-of-speech category prediction, the entity recognition task is decomposed into two subtasks of candidate entity detection and entity category prediction, and a unified neural network model is adopted to carry out joint learning on the four tasks; and meanwhile, parameters among different tasks are shared.
2. The method of joint lexical analysis of segmentation, part of speech tagging and named entity recognition of claim 1, comprising the steps of:
s1: pre-processing the data of the text obtained from the PFR1998 to match each character segment with its corresponding label category;
s2: and sequentially obtaining information of each sentence from the data preprocessed in the S1 as input, performing feature coding on the input by using a BERT pre-training language model, obtaining the context semantic vector representation of each word in the sentence, performing candidate word detection and candidate entity detection on all continuous subsequences in the sentence, and predicting word sequence labels, part of speech sequence labels and entity sequence labels in the current sequence by calculating the score probability of the candidate words and the candidate entities.
3. The method of joint lexical analysis of segmentation, part of speech tagging and named entity recognition of claim 2, comprising the steps of: the S1 preprocessing the data includes:
constructing a part-of-speech tag dictionary and an entity tag dictionary for the words according to the training set;
labeling the label type of the character segment according to the position information of the character in the sentence;
then, with characters as input units, each character of a sentence is assigned a fixed id number by the tokenizer of the BERT pre-trained language model, yielding the segmented sequence [w1, w2, ..., wn], where wi denotes the index of the i-th token in the BERT vocabulary.
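The character-to-id step of S1 can be sketched as follows. This is a minimal illustration with a hypothetical toy vocabulary built on the fly; in the actual method, the tokenizer of a pre-trained BERT model (and its fixed vocabulary) supplies the id numbers.

```python
# Toy sketch of the character-to-id mapping. The vocabulary here is a
# hypothetical mini-vocabulary, not the real BERT one.
def build_vocab(sentences):
    vocab = {"[PAD]": 0, "[UNK]": 1}
    for sent in sentences:
        for ch in sent:
            vocab.setdefault(ch, len(vocab))
    return vocab

def encode(sentence, vocab):
    # Each character is an input unit and receives a fixed id number.
    return [vocab.get(ch, vocab["[UNK]"]) for ch in sentence]

vocab = build_vocab(["我爱北京"])
ids = encode("我爱北京", vocab)  # a segmented sequence [w1, w2, ..., wn]
```

In practice the same call shape applies, except the ids come from the BERT vocabulary rather than a per-corpus dictionary.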
4. The method of joint lexical analysis of segmentation, part-of-speech tagging and named entity recognition of claim 3, comprising the steps of:
for the preprocessed data, obtaining context semantic vector representation of each word in the sentence:
the segmented sequence [w1, w2, ..., wn] is fed into the BERT pre-trained language model for encoding, yielding the vector representation [h1, h2, ..., hn], where hi is the vector representation corresponding to wi and the vector dimension d equals 768.
5. The method of joint lexical analysis of segmentation, part of speech tagging and named entity recognition of claim 4, comprising the steps of:
performing candidate word detection and candidate entity detection on all continuous subsequences in the sentence, including:
the vector sequence [h1, h2, ..., hn] obtained after encoding is transformed by qi = Wq·hi + bq and ki = Wk·hi + bk, where Wq and Wk are parameters of the model;
the resulting vector sequences [q1, q2, ..., qn] and [k1, k2, ..., kn] are the feature vectors for word segmentation; the word score s1(i, j) of a continuous subsequence x[i:j] is computed as the inner product of qi and kj;
a greedy algorithm is then used to obtain the optimal solution:
max(s1(i, j), s1(i, j+1))
similarly, the vector sequence [h1, h2, ..., hn] obtained after encoding is transformed by ri = Wr·hi + br and ui = Wu·hi + bu, where Wr and Wu are parameters of the model; the resulting vector sequences [r1, r2, ..., rn] and [u1, u2, ..., un] are the feature vectors for entity determination, and the entity score s2(i, j) of a continuous subsequence x[i:j] is computed as the inner product of ri and uj.
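The two span-scoring transforms of claim 5 can be sketched in plain NumPy. The matrix H stands in for the BERT outputs [h1, ..., hn] (d is shrunk from 768 to 8 for readability) and the weights are random placeholders, not trained parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 8                        # sentence length; hidden size (768 in the patent)

H = rng.normal(size=(n, d))        # stand-in for the encoded vectors [h_1..h_n]
W_q, b_q = rng.normal(size=(d, d)), np.zeros(d)
W_k, b_k = rng.normal(size=(d, d)), np.zeros(d)

Q = H @ W_q.T + b_q                # q_i = W_q h_i + b_q
K = H @ W_k.T + b_k                # k_i = W_k h_i + b_k
S1 = Q @ K.T                       # s_1(i, j) = <q_i, k_j>: word score of span x[i:j]
```

The entity score s2(i, j) follows the same pattern with a second pair of projections (Wr, Wu) in place of (Wq, Wk).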
6. The method of joint lexical analysis of segmentation, part of speech tagging and named entity recognition of claim 5, comprising the steps of:
after candidate word detection and candidate entity detection are carried out on all continuous subsequences in the sentence, part of speech category prediction and entity category prediction are carried out:
the vector sequence [h1, h2, ..., hn] obtained after encoding is passed through a Biaffine layer to predict the label, with score s(i, j) = hiᵀ · U(1) · hj,
where U(1) is a higher-order tensor of dimension R^(m×d×d) (m is the number of labels, d is the Biaffine input dimension); one term gives the posterior probability of the label when i is known to be the dependent (dep) and j the head, and the other gives the posterior probability of the label when i and j are known to be the two ends of the dependency (arc).
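The Biaffine scoring described here can be sketched as follows, assuming the standard biaffine form score(c, i, j) = hiᵀ U(1)_c hj; the inputs are random stand-ins for the encoder outputs, and the label posterior is obtained by a softmax over the label axis.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 4, 8, 3                  # tokens, Biaffine input dim, number of labels

H = rng.normal(size=(n, d))        # stand-in for the encoded vectors [h_1..h_n]
U1 = rng.normal(size=(m, d, d))    # U^(1) in R^{m x d x d}

# scores[c, i, j] = h_i^T U^(1)_c h_j : biaffine score of label c for the pair (i, j)
scores = np.einsum("id,cde,je->cij", H, U1, H)

# softmax over the label axis: posterior P(label | i is dep, j is head)
shifted = scores - scores.max(axis=0)
probs = np.exp(shifted) / np.exp(shifted).sum(axis=0)
```

The einsum contracts both hidden dimensions at once, which is equivalent to computing H[i] @ U1[c] @ H[j] for every (c, i, j) triple.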
7. The method of joint lexical analysis of segmentation, part-of-speech tagging and named entity recognition of claim 6, comprising the steps of:
after the part of speech category prediction and the entity category prediction, predicting word sequence labels, part of speech sequence labels and entity sequence labels in the current sequence as follows:
the word segmentation loss function loss_ws is computed from the obtained score s1(i, j):
loss_ws = -Σ(i,j)∈P log σ(s1(i, j)) - Σ(i,j)∈Q log(1 - σ(s1(i, j)))
where P is the set of start-end pairs of the words in the sample, Q is the set of start-end pairs of all non-words in the sample, and σ is the sigmoid function;
the loss functions of part-of-speech tagging and named entity recognition, loss_pos and loss_ner, are each computed with multi-class cross entropy:
loss = -Σ(i,j) Σc∈C1 y^c_[i:j] · log p^c_[i:j]
where C1 is the set of part-of-speech categories, y^c_[i:j] is the label of the word x[i:j] in category c ∈ C1, and p^c_[i:j] is the model's predicted value for category c;
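A sketch of the two loss computations, assuming the binary cross-entropy form implied by the positive set P and negative set Q for loss_ws, and standard multi-class cross entropy for loss_pos / loss_ner; the function names are illustrative, not from the patent.

```python
import numpy as np

def span_loss(s1, P, Q):
    # loss_ws: binary cross-entropy pushing word spans (P) toward high scores
    # and non-word spans (Q) toward low scores, via the sigmoid of the score.
    sig = lambda x: 1.0 / (1.0 + np.exp(-x))
    pos = -sum(np.log(sig(s1[i, j])) for i, j in P)
    neg = -sum(np.log(1.0 - sig(s1[i, j])) for i, j in Q)
    return pos + neg

def multiclass_ce(y_true, p_pred):
    # loss_pos / loss_ner: multi-class cross entropy for one candidate span,
    # with y_true a one-hot label over categories and p_pred the model output.
    return -np.sum(y_true * np.log(p_pred))
```

Summing multiclass_ce over all candidate spans gives the per-task loss; the joint training objective would combine loss_ws, loss_pos and loss_ner.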
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210715424.0A CN114970536A (en) | 2022-06-22 | 2022-06-22 | Combined lexical analysis method for word segmentation, part of speech tagging and named entity recognition |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114970536A true CN114970536A (en) | 2022-08-30 |
Family
ID=82966410
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210715424.0A Pending CN114970536A (en) | 2022-06-22 | 2022-06-22 | Combined lexical analysis method for word segmentation, part of speech tagging and named entity recognition |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114970536A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115879421A (en) * | 2023-02-16 | 2023-03-31 | 之江实验室 | Sentence ordering method and device for enhancing BART pre-training task |
CN116776887A (en) * | 2023-08-18 | 2023-09-19 | 昆明理工大学 | Negative sampling remote supervision entity identification method based on sample similarity calculation |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9202176B1 (en) * | 2011-08-08 | 2015-12-01 | Gravity.Com, Inc. | Entity analysis system |
CN109871538A (en) * | 2019-02-18 | 2019-06-11 | 华南理工大学 | A kind of Chinese electronic health record name entity recognition method |
CN111259987A (en) * | 2020-02-20 | 2020-06-09 | 民生科技有限责任公司 | Method for extracting event main body based on BERT (belief-based regression analysis) multi-model fusion |
CN111695053A (en) * | 2020-06-12 | 2020-09-22 | 上海智臻智能网络科技股份有限公司 | Sequence labeling method, data processing device and readable storage medium |
CN112101040A (en) * | 2020-08-20 | 2020-12-18 | 淮阴工学院 | Ancient poetry semantic retrieval method based on knowledge graph |
CN113011189A (en) * | 2021-03-26 | 2021-06-22 | 深圳壹账通智能科技有限公司 | Method, device and equipment for extracting open entity relationship and storage medium |
CN113360667A (en) * | 2021-05-31 | 2021-09-07 | 安徽大学 | Biomedical trigger word detection and named entity identification method based on multitask learning |
US20210357585A1 (en) * | 2017-03-13 | 2021-11-18 | Arizona Board Of Regents On Behalf Of The University Of Arizona | Methods for extracting and assessing information from literature documents |
CN113806646A (en) * | 2020-06-12 | 2021-12-17 | 上海智臻智能网络科技股份有限公司 | Sequence labeling system and training system of sequence labeling model |
Non-Patent Citations (3)
Title |
---|
ZHICHANG ZHANG et al.: "A Joint Learning Framework with BERT for Spoken Language Understanding", IEEE ACCESS, 20 November 2019 (2019-11-20), pages 49-58 *
WU Jun; CHENG; HAO Han; Eliyar Eziz; LIU Feixue; SU Yipo: "Research on Chinese Terminology Extraction Based on the BERT-Embedded BiLSTM-CRF Model", Journal of the China Society for Scientific and Technical Information, no. 04, 24 April 2020 (2020-04-24), pages 409-418 *
ZHU Yefen et al.: "A Joint Model for Thai Word Segmentation and Part-of-Speech Tagging Based on Local Transformer", 智能***学报, vol. 19, no. 2, 16 November 2023 (2023-11-16), pages 401-410 *
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115879421A (en) * | 2023-02-16 | 2023-03-31 | 之江实验室 | Sentence ordering method and device for enhancing BART pre-training task |
CN115879421B (en) * | 2023-02-16 | 2024-01-09 | 之江实验室 | Sentence ordering method and device for enhancing BART pre-training task |
CN116776887A (en) * | 2023-08-18 | 2023-09-19 | 昆明理工大学 | Negative sampling remote supervision entity identification method based on sample similarity calculation |
CN116776887B (en) * | 2023-08-18 | 2023-10-31 | 昆明理工大学 | Negative sampling remote supervision entity identification method based on sample similarity calculation |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US8527262B2 (en) | Systems and methods for automatic semantic role labeling of high morphological text for natural language processing applications | |
Collobert et al. | A unified architecture for natural language processing: Deep neural networks with multitask learning | |
Belinkov et al. | Arabic diacritization with recurrent neural networks | |
CN112541356B (en) | Method and system for recognizing biomedical named entities | |
CN112115238A (en) | Question-answering method and system based on BERT and knowledge base | |
US20230069935A1 (en) | Dialog system answering method based on sentence paraphrase recognition | |
Carbonell et al. | Joint recognition of handwritten text and named entities with a neural end-to-end model | |
CN114970536A (en) | Combined lexical analysis method for word segmentation, part of speech tagging and named entity recognition | |
Gao et al. | Named entity recognition method of Chinese EMR based on BERT-BiLSTM-CRF | |
Szarvas et al. | A highly accurate Named Entity corpus for Hungarian | |
CN113191148A (en) | Rail transit entity identification method based on semi-supervised learning and clustering | |
CN111783461A (en) | Named entity identification method based on syntactic dependency relationship | |
CN113177102B (en) | Text classification method and device, computing equipment and computer readable medium | |
CN112905736A (en) | Unsupervised text emotion analysis method based on quantum theory | |
Moeng et al. | Canonical and surface morphological segmentation for nguni languages | |
CN114491024A (en) | Small sample-based specific field multi-label text classification method | |
Wosiak | Automated extraction of information from Polish resume documents in the IT recruitment process | |
CN115481635A (en) | Address element analysis method and system | |
Seeha et al. | ThaiLMCut: Unsupervised pretraining for Thai word segmentation | |
CN114579695A (en) | Event extraction method, device, equipment and storage medium | |
CN114416991A (en) | Method and system for analyzing text emotion reason based on prompt | |
Ahmad et al. | Machine and deep learning methods with manual and automatic labelling for news classification in bangla language | |
CN116562291A (en) | Chinese nested named entity recognition method based on boundary detection | |
CN115759102A (en) | Chinese poetry wine culture named entity recognition method | |
Tolegen et al. | Voted-perceptron approach for Kazakh morphological disambiguation |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||