CN114970536A - Combined lexical analysis method for word segmentation, part of speech tagging and named entity recognition - Google Patents


Info

Publication number
CN114970536A
Authority
CN
China
Prior art keywords
word
entity
speech
speech tagging
sequence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210715424.0A
Other languages
Chinese (zh)
Inventor
线岩团
朱叶芬
文永华
王红斌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Kunming University of Science and Technology
Original Assignee
Kunming University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Kunming University of Science and Technology
Priority to CN202210715424.0A
Publication of CN114970536A
Legal status: Pending

Classifications

    • G06F40/295 Named entity recognition
    • G06F16/3344 Query execution using natural language analysis
    • G06F16/35 Clustering; Classification
    • G06F16/374 Thesaurus
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G06F40/242 Dictionaries
    • G06F40/30 Semantic analysis
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/08 Learning methods


Abstract

The invention discloses a joint lexical analysis method for word segmentation, part-of-speech tagging and named entity recognition. The word segmentation and part-of-speech tagging tasks are decomposed into two subtasks, candidate word detection and part-of-speech category prediction; the entity recognition task is decomposed into two subtasks, candidate entity detection and entity category prediction; and a unified neural network model jointly learns the four tasks while sharing parameters among them. High-accuracy word segmentation improves word boundary detection in the part-of-speech tagging and entity recognition tasks, and part-of-speech tagging information in turn improves segmentation precision. Joint learning exploits the strong correlation among word segmentation, part-of-speech tagging and named entity recognition, improving the performance of the model.

Description

Combined lexical analysis method for word segmentation, part of speech tagging and named entity recognition
Technical Field
The invention relates to a joint lexical analysis method for word segmentation, part-of-speech tagging and named entity recognition, and belongs to the technical field of natural language processing.
Background
In natural language processing, lexical analysis is a fundamental task: word segmentation, part-of-speech tagging and named entity recognition within it form the basis of downstream tasks such as text classification, information retrieval and machine translation.
Although existing Chinese word segmentation, part-of-speech tagging and named entity recognition models have each made progress, no model jointly trains all three tasks. N-gram statistical language models achieve automatic word segmentation by using contextual information between adjacent words to select the word combination with the highest probability. Such models are language-independent, tolerant of spelling errors, and handle Chinese, English, and both simplified and traditional text well, making them a common statistical approach to word segmentation. They are not restricted to a particular text domain, but their recognition speed leaves room for improvement. In past research, word segmentation, part-of-speech tagging and named entity recognition were generally treated as separate tasks, each cast as sequence labeling.
In recent years a number of deep-learning methods have appeared. The mainstream architecture is the Encoder-Decoder model, the most representative being the BiLSTM-CRF model, which inherits the feature-learning advantages of deep learning: using word vectors and character vectors without feature engineering, it achieves good results. A BiLSTM can capture the contextual semantics of each word, but from the perspective of part-of-speech tagging and named entity recognition, overly long sequences contribute little to predicting parts of speech and entities, and training with CRF decoding is costly and complex. In existing joint lexical analysis methods, the tasks are usually structurally independent but sequentially dependent in data processing, which introduces error propagation between tasks and hurts model performance. Unlike prior methods, the method of the invention offers a new paradigm: a detection-then-classification framework based on a multi-label classification idea. Spans are first scored and then classified; training does not require recursively computing a partition function as CRF does, and prediction needs no dynamic programming. Word segmentation, part-of-speech tagging and named entity recognition are performed simultaneously with shared data, improving the performance of each task to varying degrees.
Disclosure of Invention
The invention aims to provide a joint lexical analysis method for word segmentation, part of speech tagging and named entity recognition, which decomposes a word segmentation task and a part of speech tagging task into two subtasks of candidate word detection and part of speech category prediction, decomposes an entity recognition task into two subtasks of candidate entity detection and entity category prediction, and adopts a unified neural network model to carry out joint learning on the four tasks, thereby realizing the multi-task joint learning of word segmentation, part of speech tagging and named entity recognition.
In order to achieve the technical purpose and achieve the technical effect, the invention is realized by the following technical scheme:
a joint lexical analysis method for word segmentation, part of speech tagging and named entity recognition is disclosed, and the analysis method comprises the following steps: the word segmentation and part-of-speech tagging task is decomposed into two subtasks of candidate word detection and part-of-speech category prediction, the entity recognition task is decomposed into two subtasks of candidate entity detection and entity category prediction, and a unified neural network model is adopted to carry out joint learning on the four tasks; and meanwhile, parameters among different tasks are shared.
A joint lexical analysis method for word segmentation, part of speech tagging and named entity recognition comprises the following steps:
s1: performing data preprocessing on the text obtained from the PFR1998, and matching each character segment with its corresponding label category;
s2: and sequentially obtaining information of each sentence from the data preprocessed in the S1 as input, performing feature coding on the input by using a BERT pre-training language model, obtaining the context semantic vector representation of each word in the sentence, performing candidate word detection and candidate entity detection on all continuous subsequences in the sentence, and predicting word sequence labels, part of speech sequence labels and entity sequence labels in the current sequence by calculating the score probability of the candidate words and the candidate entities.
Further, the preprocessing of the data by the S1 includes:
constructing a part-of-speech tag dictionary and an entity tag dictionary for the words according to the training set;
labeling label types for the character segments by combining position information of the characters in the sentences;
then, each sentence takes the character as the input unit; each character is assigned a fixed id by the tokenizer of the BERT pre-trained language model, yielding the segmentation sequence [w_1, w_2, …, w_n] of the sentence, where w_i denotes the index of the i-th token in the BERT vocabulary.
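The per-character id mapping described above can be sketched as follows. To keep the sketch self-contained, the BERT tokenizer is replaced by a small hypothetical vocabulary (the specific characters and id numbers are illustrative assumptions); in practice one would use the tokenizer of a Chinese BERT checkpoint, e.g. `BertTokenizer.from_pretrained("bert-base-chinese")` from the `transformers` library.

```python
# Sketch of the S1 preprocessing step: each sentence is split into
# characters and each character is mapped to a fixed id, as a BERT
# word-piece tokenizer would do. TOY_VOCAB is a toy stand-in for the
# real BERT vocabulary; unknown characters fall back to [UNK].
TOY_VOCAB = {"[UNK]": 100, "我": 2769, "爱": 4263, "中": 704, "国": 1744}

def encode_sentence(sentence: str) -> list[int]:
    """Map each character of the sentence to its vocabulary id."""
    return [TOY_VOCAB.get(ch, TOY_VOCAB["[UNK]"]) for ch in sentence]

ids = encode_sentence("我爱中国")
print(ids)  # [2769, 4263, 704, 1744]
```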
Further, the S2 obtains context semantic vector representations of each word in the sentence:
the segmented sequence [ w ] 1 ,w 2 ,...,w n ]After the BERT pre-training language model code is input, the vector representation h is obtained 1 ,h 2 ,...,h n ]Wherein h is i Is w i Corresponding vector representation, wherein the vector dimension d is 768;
further, the S2 performs candidate word detection and candidate entity detection on all consecutive subsequences in the sentence, including:
the encoded vector sequence [h_1, h_2, …, h_n] is transformed by q_i = W_q·h_i + b_q and k_i = W_k·h_i + b_k, where W_q and W_k are model parameters, giving the vector sequences [q_1, q_2, …, q_n] and [k_1, k_2, …, k_n], the feature vectors for word segmentation; the word score s_1(i, j) of a continuous subsequence x_[i:j] is computed as the inner product of q_i and k_j:

s_1(i, j) = q_i^T · k_j

An optimal segmentation is obtained with a greedy algorithm that compares adjacent candidate spans:

max(s_1(i, j), s_1(i, j+1))

similarly, the encoded vector sequence [h_1, h_2, …, h_n] is transformed by r_i = W_r·h_i + b_r and u_i = W_u·h_i + b_u, where W_r and W_u are model parameters, giving the vector sequences [r_1, r_2, …, r_n] and [u_1, u_2, …, u_n], the feature vectors for entity detection; the entity score s_2(i, j) of a continuous subsequence x_[i:j] is computed as the inner product of r_i and u_j:

s_2(i, j) = r_i^T · u_j
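The two span detectors described above can be sketched with NumPy. The sentence length, hidden size and random parameter initialization are illustrative assumptions (the patent uses d = 768 from BERT); computing all inner products at once gives an n×n score matrix whose entry (i, j) scores the candidate span x_[i:j].

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 6, 8                      # toy sentence length and hidden size

H = rng.normal(size=(n, d))      # encoder outputs [h_1, ..., h_n]

# Word detector: q_i = W_q h_i + b_q and k_i = W_k h_i + b_k
W_q, b_q = rng.normal(size=(d, d)), rng.normal(size=d)
W_k, b_k = rng.normal(size=(d, d)), rng.normal(size=d)
Q = H @ W_q.T + b_q
K = H @ W_k.T + b_k

# s_1(i, j) = <q_i, k_j>: one inner product per candidate span x_[i:j]
S1 = Q @ K.T                     # S1[i, j] is the word score of span (i, j)
assert S1.shape == (n, n)

# Entity detector: same structure with its own parameters r_i, u_i
W_r, b_r = rng.normal(size=(d, d)), rng.normal(size=d)
W_u, b_u = rng.normal(size=(d, d)), rng.normal(size=d)
S2 = (H @ W_r.T + b_r) @ (H @ W_u.T + b_u).T   # s_2(i, j)

# Greedy comparison of two adjacent candidate spans:
# max(s_1(i, j), s_1(i, j+1))
i, j = 0, 2
better = max(S1[i, j], S1[i, j + 1])
```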
Further, after candidate word detection and candidate entity detection are performed on all the continuous subsequences in the sentence, part-of-speech category prediction and entity category prediction are performed:
the vector sequence [ h ] obtained after coding 1 ,h 2 ,...,h n ]By passing
Figure BDA0003708619180000033
The predicted dependency label (label),
Figure BDA0003708619180000034
wherein the content of the first and second substances,U (1) is dimension R m×d×d The higher order tensor of (m is the number of tags, d is the Biaffine input dimension),
Figure BDA0003708619180000035
is known at the same time as the posterior probability in the case of i as dep and j as head,
Figure BDA0003708619180000036
is the posterior probability that i or j is known to be both ends of the dependency (arc);
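A minimal NumPy sketch of a Biaffine scorer of this kind, assuming the standard bilinear form with one d×d slice of U^(1) per label (the toy sizes and random initialization are assumptions); a softmax over the m label scores of a span then gives its posterior label distribution.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, m = 5, 8, 4                 # toy sizes: n tokens, input dim d, m labels

H = rng.normal(size=(n, d))       # encoder outputs [h_1, ..., h_n]
U = rng.normal(size=(m, d, d))    # U^(1): one bilinear slice per label

# scores[c, i, j] = h_i^T U^(1)_c h_j for every label c and span (i, j)
scores = np.einsum("id,cde,je->cij", H, U, H)

# Posterior label distribution for one span (i, j): softmax over labels
i, j = 1, 3
logits = scores[:, i, j]
probs = np.exp(logits - logits.max())
probs /= probs.sum()
assert abs(probs.sum() - 1.0) < 1e-9
```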
Further, the word sequence labels, part-of-speech sequence labels and entity sequence labels of the current sequence are predicted as follows:
the word segmentation loss function loss_ws is computed from the obtained scores s_1(i, j):

loss_ws = log(1 + Σ_{(i,j)∈Q} e^{s_1(i,j)}) + log(1 + Σ_{(i,j)∈P} e^{-s_1(i,j)})

where P is the set of (start, end) pairs of the words in the sample and Q is the set of (start, end) pairs of all non-word spans in the sample;
the loss functions of part-of-speech tagging and named entity recognition (loss_pos and loss_ner) are computed with multi-class cross entropy:

loss_pos = -Σ_{x_[i:j]} Σ_{c∈C1} y_c^{[i:j]} log(ŷ_c^{[i:j]})

where C1 is the set of part-of-speech categories, y_c^{[i:j]} is the label of the word x_[i:j] on category c ∈ C1, and ŷ_c^{[i:j]} is the model's predicted value for category c;

loss_ner = -Σ_{x_[i:j]} Σ_{c∈C2} y_c^{[i:j]} log(ŷ_c^{[i:j]})

where C2 is the set of entity categories, y_c^{[i:j]} is the label of the entity x_[i:j] on category c ∈ C2, and ŷ_c^{[i:j]} is the model's predicted value for category c.
The invention has the beneficial effects that:
the invention relates to a combined lexical analysis method for word segmentation, part of speech tagging and named entity recognition, which is characterized in that a word segmentation task and a part of speech tagging task are decomposed into two subtasks of candidate word detection and part of speech category prediction, an entity recognition task is decomposed into two subtasks of candidate entity detection and entity category prediction, and a unified neural network model is adopted to carry out combined learning on the four tasks; meanwhile, parameters among different tasks are shared;
the combined lexical analysis method for word segmentation, part of speech tagging and named entity recognition is different from a sequence tagging method, the part of speech tagging is regarded as two subtasks of word detection and part of speech classification, and the entity recognition is regarded as two subtasks of entity detection and entity classification. On the basis of obtaining word sequence representation, the method adopts four neural network layers to realize multi-task combination of word segmentation, part of speech tagging and named entity identification.
The invention jointly learns and predicts the three tasks of word segmentation, part of speech tagging and named entity recognition, and implicitly shares one data. By sharing parameters among different tasks, the network capability is weakened to a certain extent, and the generalization capability of each task can be improved.
The invention improves the word boundary detection problem in the part of speech tagging task and the entity recognition task by using high-accuracy word segmentation, and can improve the word segmentation precision by using part of speech tagging information. And joint learning is performed by utilizing high relevance among word segmentation, part-of-speech tagging and named entity recognition, so that the model performance is improved.
Of course, it is not necessary for any product in which the invention is practiced to achieve all of the above-described advantages at the same time.
Drawings
FIG. 1 is a diagram illustrating a method for joint lexical analysis of segmented words, part-of-speech tagging and named entity recognition according to an embodiment of the present invention;
Detailed Description
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the embodiments of the present invention will be described in detail below with reference to the accompanying drawings.
Example 1
As shown in fig. 1, the joint lexical analysis method of word segmentation, part-of-speech tagging and named entity recognition proceeds as follows:
s1: a part-of-speech tag dictionary and an entity tag dictionary are constructed for the words according to the training set, with 45 part-of-speech tags and 5 entity tags;
labeling the label type of the character segment according to the position information of the character in the sentence;
then, each sentence takes the character as the input unit; each character is assigned a fixed id by the tokenizer of the BERT pre-trained language model, yielding the segmentation sequence [w_1, w_2, …, w_n] of the sentence, where w_i denotes the index of the i-th token in the BERT vocabulary;
s2: and sequentially obtaining information of each sentence from the data preprocessed in the S1 as input, performing feature coding on the input by using a BERT pre-training language model, obtaining the context semantic vector representation of each word in the sentence, performing candidate word detection and candidate entity detection on all continuous subsequences in the sentence, and predicting word sequence labels, part of speech sequence labels and entity sequence labels in the current sequence by calculating the score probability of the candidate words and the candidate entities.
Obtaining a context semantic vector representation for each word in the sentence:
the segmentation sequence [w_1, w_2, …, w_n] is fed into the BERT pre-trained language model for encoding, giving the vector representations [h_1, h_2, …, h_n], where h_i is the vector corresponding to w_i and the vector dimension d is 768;
further, the S2 performs candidate word detection and candidate entity detection on all consecutive subsequences in the sentence, including:
the encoded vector sequence [h_1, h_2, …, h_n] is transformed by q_i = W_q·h_i + b_q and k_i = W_k·h_i + b_k, where W_q and W_k are model parameters, giving the vector sequences [q_1, q_2, …, q_n] and [k_1, k_2, …, k_n], the feature vectors for word segmentation; the word score s_1(i, j) of a continuous subsequence x_[i:j] is computed as the inner product of q_i and k_j:

s_1(i, j) = q_i^T · k_j

An optimal solution is obtained with a greedy algorithm:

max(s_1(i, j), s_1(i, j+1))

similarly, the encoded vector sequence [h_1, h_2, …, h_n] is transformed by r_i = W_r·h_i + b_r and u_i = W_u·h_i + b_u, where W_r and W_u are model parameters, giving the vector sequences [r_1, r_2, …, r_n] and [u_1, u_2, …, u_n], the feature vectors for entity detection; the entity score s_2(i, j) of a continuous subsequence x_[i:j] is computed as the inner product of r_i and u_j:

s_2(i, j) = r_i^T · u_j
After candidate word detection and candidate entity detection are carried out on all continuous subsequences in the sentence, part of speech category prediction and entity category prediction are carried out:
the encoded vector sequence [h_1, h_2, …, h_n] is passed through the Biaffine transformation

s^label(i, j) = h_i^T · U^(1) · h_j

to predict the label of each candidate span, where U^(1) is a higher-order tensor of dimension R^(m×d×d) (m is the number of labels, d is the Biaffine input dimension); the resulting scores yield the posterior probability of each label given that i is the dependent (dep) and j is the head, i.e. given that i and j are the two ends of the dependency (arc);
the word sequence labels, part-of-speech sequence labels and entity sequence labels of the current sequence are predicted as follows:
the word segmentation loss function loss_ws is computed from the obtained scores s_1(i, j):

loss_ws = log(1 + Σ_{(i,j)∈Q} e^{s_1(i,j)}) + log(1 + Σ_{(i,j)∈P} e^{-s_1(i,j)})

where P is the set of (start, end) pairs of the words in the sample and Q is the set of (start, end) pairs of all non-word spans in the sample;
the loss functions of part-of-speech tagging and named entity recognition (loss_pos and loss_ner) are computed with multi-class cross entropy:

loss_pos = -Σ_{x_[i:j]} Σ_{c∈C1} y_c^{[i:j]} log(ŷ_c^{[i:j]})

where C1 is the set of part-of-speech categories, y_c^{[i:j]} is the label of the word x_[i:j] on category c ∈ C1, and ŷ_c^{[i:j]} is the model's predicted value for category c;

loss_ner = -Σ_{x_[i:j]} Σ_{c∈C2} y_c^{[i:j]} log(ŷ_c^{[i:j]})

where C2 is the set of entity categories, y_c^{[i:j]} is the label of the entity x_[i:j] on category c ∈ C2, and ŷ_c^{[i:j]} is the model's predicted value for category c.
Example 2
Based on the joint lexical analysis method of word segmentation, part-of-speech tagging and named entity recognition in embodiment 1, three groups of comparison experiments are set to illustrate the effect of the invention. The first group verifies the effectiveness of word segmentation, the second group verifies the effectiveness of part-of-speech tagging, and the third group verifies the effectiveness of named entity recognition.
Word segmentation performance improvement verification
The method feeds the characters of a text into the model, encodes the input with a BERT pre-trained language model to obtain representations, and then predicts the dependency relations and dependency labels in the current sequence. To verify the effectiveness of the model on word segmentation, it is compared with several related models on the People's Daily annotated corpus (PFR). The experimental results are shown in Table 1:
TABLE 1 Chinese participle method Performance comparison
Analysis of Table 1 shows that the F1 value of the proposed method is higher than that of all the other methods, demonstrating the effectiveness of the model on the Chinese word segmentation task. The BERT pre-trained language model in the architecture effectively alleviates the out-of-vocabulary problem during prediction, and part-of-speech tagging information is likewise effective in improving Chinese word segmentation performance.
Part-of-speech tagging performance improvement verification
In order to verify the effectiveness of the proposed model for Chinese part-of-speech tagging, comparison experiments against several related models are carried out on the People's Daily annotated corpus (PFR). The experimental results are shown in Table 2:
TABLE 2 Chinese part-of-speech tagging method Performance comparison
Analysis of Table 2 reveals that the F1 value of the model of the invention exceeds all mainstream models on the People's Daily annotated corpus (PFR). The CNN-based method has the lowest F1 value, indicating that neural network models capture deeper linguistic information from the input than conventional machine-learning methods. The model's F1 value is also higher than that of the other two neural network-based models, possibly because those models lack textual morphological information and internal structure information, biasing their part-of-speech label predictions. Moreover, because Chinese text lacks word separators, identifying word boundaries in Chinese part-of-speech tagging is very challenging; the model therefore learns word segmentation jointly, improving its ability to identify word boundaries and thereby the performance of part-of-speech tagging.
Named entity identification performance enhancement verification
In order to verify the effectiveness of the proposed model for Chinese named entity recognition, the invention carries out comparison experiments against several related models on the People's Daily annotated corpus (PFR). The experimental results are shown in Table 3:
TABLE 3 Performance comparison of the Chinese named entity recognition methods
Analysis of Table 3 reveals that the F1 value of the model of the invention exceeds all mainstream models on the People's Daily annotated corpus (PFR). The invention treats entity recognition as a classification problem over continuous subsequences of a sentence and predicts all entities contained in the sentence simultaneously. Using independent prediction parameters for different entity categories improves the performance of named entity recognition to a certain extent.
The experimental data above demonstrate that treating the joint task of word segmentation, part-of-speech tagging and named entity recognition as a multi-class classification problem over continuous subsequences of sentences improves model performance to a certain extent. High-accuracy word segmentation improves word boundary detection in the part-of-speech tagging and entity recognition tasks, and part-of-speech tagging information improves segmentation precision. Joint learning exploits the strong correlation among the three tasks, giving the Chinese lexical analysis model better performance; choosing a suitable pre-trained language model effectively alleviates the out-of-vocabulary problem during prediction; and multi-task joint learning implicitly shares data, which amounts to implicit data augmentation. Sharing parameters constrains network capacity to a certain extent and prevents overfitting. The invention thus provides a joint model for Chinese word segmentation, part-of-speech tagging and named entity recognition.
In the description herein, references to the description of "one embodiment," "an example," "a specific example" or the like are intended to mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
The preferred embodiments of the invention disclosed above are intended to be illustrative only. The preferred embodiments are not intended to be exhaustive or to limit the invention to the precise embodiments disclosed. Obviously, many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles of the invention and its practical application, to thereby enable others skilled in the art to best understand and utilize the invention. The invention is limited only by the claims and their full scope and equivalents.

Claims (7)

1. A joint lexical analysis method for word segmentation, part of speech tagging and named entity recognition is characterized by comprising the following steps: the word segmentation and part-of-speech tagging task is decomposed into two subtasks of candidate word detection and part-of-speech category prediction, the entity recognition task is decomposed into two subtasks of candidate entity detection and entity category prediction, and a unified neural network model is adopted to carry out joint learning on the four tasks; and meanwhile, parameters among different tasks are shared.
2. The method of joint lexical analysis of segmentation, part of speech tagging and named entity recognition of claim 1, comprising the steps of:
s1: pre-processing the data of the text obtained from the PFR1998 to match each character segment with its corresponding label category;
s2: and sequentially obtaining information of each sentence from the data preprocessed in the S1 as input, performing feature coding on the input by using a BERT pre-training language model, obtaining the context semantic vector representation of each word in the sentence, performing candidate word detection and candidate entity detection on all continuous subsequences in the sentence, and predicting word sequence labels, part of speech sequence labels and entity sequence labels in the current sequence by calculating the score probability of the candidate words and the candidate entities.
3. The method of joint lexical analysis of segmentation, part of speech tagging and named entity recognition of claim 2, comprising the steps of: the S1 preprocessing the data includes:
constructing a part-of-speech tag dictionary and an entity tag dictionary for the words according to the training set;
labeling the label type of the character segment according to the position information of the character in the sentence;
then, each sentence takes the character as the input unit; each character is assigned a fixed id number by the tokenizer of the BERT pre-trained language model, and a segmentation sequence [w_1, w_2, ..., w_n] of the sentence is obtained, wherein w_i denotes the number of the i-th segmentation term in the BERT vocabulary.
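For illustration only (outside the claim language), the id assignment above can be sketched as follows. The toy vocabulary and id numbers are hypothetical stand-ins; real ids come from the BERT vocabulary file and its tokenizer.

```python
# Sketch of claim 3's id assignment: each character of a sentence is mapped
# to a fixed id, as a BERT tokenizer would do against its vocabulary.
# The vocabulary and the unknown-token id below are illustrative only.
def to_id_sequence(sentence, vocab, unk_id=100):
    """Map each character w_i of the sentence to its vocabulary id."""
    return [vocab.get(ch, unk_id) for ch in sentence]

toy_vocab = {"我": 2769, "爱": 4263, "北": 1266, "京": 776}
ids = to_id_sequence("我爱北京", toy_vocab)  # the segmentation sequence [w_1, ..., w_n]
```

Characters outside the vocabulary fall back to the unknown id, mirroring how a real WordPiece tokenizer handles out-of-vocabulary characters.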
4. The method of joint lexical analysis of segmentation, part-of-speech tagging and named entity recognition of claim 3, comprising the steps of:
for the preprocessed data, obtaining the contextual semantic vector representation of each word in the sentence:
the segmentation sequence [w_1, w_2, ..., w_n] is input into the BERT pre-trained language model for encoding, obtaining the vector representations [h_1, h_2, ..., h_n], wherein h_i is the vector representation corresponding to w_i and the vector dimension d equals 768.
5. The method of joint lexical analysis of segmentation, part of speech tagging and named entity recognition of claim 4, comprising the steps of:
performing candidate word detection and candidate entity detection on all continuous subsequences in the sentence, including:
the vector sequence [h_1, h_2, ..., h_n] obtained after encoding is transformed by q_i = W_q h_i + b_q and k_i = W_k h_i + b_k, wherein W_q and W_k are parameters of the model;
the resulting vector sequences [q_1, q_2, ..., q_n] and [k_1, k_2, ..., k_n] are the feature vectors for word segmentation; the word score s_1(i,j) of a continuous subsequence x_[i:j] is calculated by the inner product of q_i and k_j:
s_1(i,j) = q_i^T k_j
an optimal solution is then obtained using a greedy algorithm:
max(s 1 (i,j),s 1 (i,j+1))
similarly, the vector sequence [h_1, h_2, ..., h_n] obtained after encoding is transformed by r_i = W_r h_i + b_r and u_i = W_u h_i + b_u, wherein W_r and W_u are parameters of the model; the resulting vector sequences [r_1, r_2, ..., r_n] and [u_1, u_2, ..., u_n] are the feature vectors for entity detection, and the entity score s_2(i,j) of a continuous subsequence x_[i:j] is calculated by the inner product of r_i and u_j:
s_2(i,j) = r_i^T u_j
6. The method of joint lexical analysis of segmentation, part of speech tagging and named entity recognition of claim 5, comprising the steps of:
after candidate word detection and candidate entity detection are carried out on all continuous subsequences in the sentence, part of speech category prediction and entity category prediction are carried out:
the vector sequence [h_1, h_2, ..., h_n] obtained after encoding is passed through a Biaffine transformation
s(i, j) = h_i^T U^(1) h_j
to predict the label, wherein U^(1) is a higher-order tensor of dimension R^(m×d×d) (m is the number of labels, d is the Biaffine input dimension); P(label | i, j) denotes the posterior probability of the label when i is known as dep and j as head, and P(i, j) denotes the posterior probability that i and j are known to be the two ends of the dependency (arc).
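For illustration only: the Biaffine label scorer of claim 6 can be sketched as below. The score of each of the m labels is a bilinear form h_i^T U^(1)[l] h_j; the softmax at the end is one common way to turn the m scores into a posterior distribution over labels (the claim's exact normalization is not reproduced here). All weights are random stand-ins.

```python
import numpy as np

# Sketch of a Biaffine label scorer: U1 is the R^{m x d x d} tensor of
# claim 6 (m labels, d input dim); h_i, h_j are span-boundary vectors.
rng = np.random.default_rng(1)
m, d = 4, 8
U1 = rng.normal(size=(m, d, d))         # one d x d bilinear form per label
h_i, h_j = rng.normal(size=d), rng.normal(size=d)

# score for each label l: h_i^T U1[l] h_j
scores = np.einsum("d,lde,e->l", h_i, U1, h_j)

# softmax over the m label classes gives a posterior over labels
probs = np.exp(scores - scores.max())
probs /= probs.sum()
```

In practice the same einsum is batched over all candidate spans, so label scores for every (i, j) pair are produced in one tensor contraction.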
7. The method of joint lexical analysis of segmentation, part-of-speech tagging and named entity recognition of claim 6, comprising the steps of:
after the part of speech category prediction and the entity category prediction, predicting word sequence labels, part of speech sequence labels and entity sequence labels in the current sequence as follows:
the obtained score s_1(i,j) is used to calculate the loss function loss_ws of word segmentation:
loss_ws = -Σ_{(i,j)∈P} log σ(s_1(i,j)) - Σ_{(i,j)∈Q} log(1 - σ(s_1(i,j)))
wherein σ is the sigmoid function, P is the set of start-end spans of the words of the sample, and Q is the set of start-end spans of all non-words of the sample;
the loss functions loss_pos and loss_ner of part-of-speech tagging and named entity recognition are calculated respectively using multi-class cross entropy:
loss_pos = -Σ_{x_[i:j]} Σ_{c∈C1} y_[i:j]^c log p_[i:j]^c
wherein C1 is the set of part-of-speech categories, y_[i:j]^c is the label of the word x_[i:j] on category c in C1, and p_[i:j]^c is the predicted value of the model on category c;
loss_ner = -Σ_{x_[i:j]} Σ_{c∈C2} y_[i:j]^c log p_[i:j]^c
wherein C2 is the set of entity categories, y_[i:j]^c is the label of the entity x_[i:j] on category c in C2, and p_[i:j]^c is the predicted value of the model on category c.
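For illustration only, the losses of claim 7 can be computed on toy numbers as below. The binary cross-entropy form for loss_ws (sigmoid over s_1, positives in P, negatives in Q) is an assumed instantiation, since the patent's formula images are not reproduced here; the multi-class cross entropy matches the claim's description. All numeric values are illustrative.

```python
import numpy as np

# Toy computation of the word-segmentation loss and the multi-class
# cross-entropy losses described in claim 7.
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

S1 = np.array([[2.0, -1.0],
               [0.5,  3.0]])              # span scores s_1(i, j)
P  = [(0, 0), (1, 1)]                     # spans labelled as words
Qn = [(0, 1), (1, 0)]                     # spans labelled as non-words

# assumed binary cross-entropy over word / non-word spans
loss_ws = -sum(np.log(sigmoid(S1[i, j])) for i, j in P) \
          -sum(np.log(1.0 - sigmoid(S1[i, j])) for i, j in Qn)

# multi-class cross entropy for one span over the POS categories C1
y = np.array([0.0, 1.0, 0.0])             # one-hot gold label y^c
p = np.array([0.2, 0.7, 0.1])             # model's predicted distribution p^c
loss_pos = -np.sum(y * np.log(p))
```

loss_ner has the same shape as loss_pos, only summed over the entity categories C2; in training the three losses are combined so all four subtasks share gradients through the common encoder.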
CN202210715424.0A 2022-06-22 2022-06-22 Combined lexical analysis method for word segmentation, part of speech tagging and named entity recognition Pending CN114970536A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210715424.0A CN114970536A (en) 2022-06-22 2022-06-22 Combined lexical analysis method for word segmentation, part of speech tagging and named entity recognition

Publications (1)

Publication Number Publication Date
CN114970536A true CN114970536A (en) 2022-08-30

Family

ID=82966410

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210715424.0A Pending CN114970536A (en) 2022-06-22 2022-06-22 Combined lexical analysis method for word segmentation, part of speech tagging and named entity recognition

Country Status (1)

Country Link
CN (1) CN114970536A (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9202176B1 (en) * 2011-08-08 2015-12-01 Gravity.Com, Inc. Entity analysis system
CN109871538A (en) * 2019-02-18 2019-06-11 华南理工大学 A kind of Chinese electronic health record name entity recognition method
CN111259987A (en) * 2020-02-20 2020-06-09 民生科技有限责任公司 Method for extracting event main body based on BERT (belief-based regression analysis) multi-model fusion
CN111695053A (en) * 2020-06-12 2020-09-22 上海智臻智能网络科技股份有限公司 Sequence labeling method, data processing device and readable storage medium
CN112101040A (en) * 2020-08-20 2020-12-18 淮阴工学院 Ancient poetry semantic retrieval method based on knowledge graph
CN113011189A (en) * 2021-03-26 2021-06-22 深圳壹账通智能科技有限公司 Method, device and equipment for extracting open entity relationship and storage medium
CN113360667A (en) * 2021-05-31 2021-09-07 安徽大学 Biomedical trigger word detection and named entity identification method based on multitask learning
US20210357585A1 (en) * 2017-03-13 2021-11-18 Arizona Board Of Regents On Behalf Of The University Of Arizona Methods for extracting and assessing information from literature documents
CN113806646A (en) * 2020-06-12 2021-12-17 上海智臻智能网络科技股份有限公司 Sequence labeling system and training system of sequence labeling model

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
ZHICHANG ZHANG et al.: "A Joint Learning Framework with BERT for Spoken Language Understanding", IEEE Access, 20 November 2019 (2019-11-20), pages 49 - 58 *
吴俊; 程; 郝瀚; 艾力亚尔・艾则孜; 刘菲雪; 苏亦坡: "Research on Chinese technical term extraction based on a BERT-embedded BiLSTM-CRF model" (in Chinese), 情报学报, no. 04, 24 April 2020 (2020-04-24), pages 409 - 418 *
朱叶芬 et al.: "A joint model for Thai word segmentation and part-of-speech tagging based on a local Transformer" (in Chinese), 智能***学报, vol. 19, no. 2, 16 November 2023 (2023-11-16), pages 401 - 410 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115879421A (en) * 2023-02-16 2023-03-31 之江实验室 Sentence ordering method and device for enhancing BART pre-training task
CN115879421B (en) * 2023-02-16 2024-01-09 之江实验室 Sentence ordering method and device for enhancing BART pre-training task
CN116776887A (en) * 2023-08-18 2023-09-19 昆明理工大学 Negative sampling remote supervision entity identification method based on sample similarity calculation
CN116776887B (en) * 2023-08-18 2023-10-31 昆明理工大学 Negative sampling remote supervision entity identification method based on sample similarity calculation

Similar Documents

Publication Publication Date Title
US8527262B2 (en) Systems and methods for automatic semantic role labeling of high morphological text for natural language processing applications
Collobert et al. A unified architecture for natural language processing: Deep neural networks with multitask learning
Belinkov et al. Arabic diacritization with recurrent neural networks
CN112541356B (en) Method and system for recognizing biomedical named entities
CN112115238A (en) Question-answering method and system based on BERT and knowledge base
US20230069935A1 (en) Dialog system answering method based on sentence paraphrase recognition
Carbonell et al. Joint recognition of handwritten text and named entities with a neural end-to-end model
CN114970536A (en) Combined lexical analysis method for word segmentation, part of speech tagging and named entity recognition
Gao et al. Named entity recognition method of Chinese EMR based on BERT-BiLSTM-CRF
Szarvas et al. A highly accurate Named Entity corpus for Hungarian
CN113191148A (en) Rail transit entity identification method based on semi-supervised learning and clustering
CN111783461A (en) Named entity identification method based on syntactic dependency relationship
CN113177102B (en) Text classification method and device, computing equipment and computer readable medium
CN112905736A (en) Unsupervised text emotion analysis method based on quantum theory
Moeng et al. Canonical and surface morphological segmentation for nguni languages
CN114491024A (en) Small sample-based specific field multi-label text classification method
Wosiak Automated extraction of information from Polish resume documents in the IT recruitment process
CN115481635A (en) Address element analysis method and system
Seeha et al. ThaiLMCut: Unsupervised pretraining for Thai word segmentation
CN114579695A (en) Event extraction method, device, equipment and storage medium
CN114416991A (en) Method and system for analyzing text emotion reason based on prompt
Ahmad et al. Machine and deep learning methods with manual and automatic labelling for news classification in bangla language
CN116562291A (en) Chinese nested named entity recognition method based on boundary detection
CN115759102A (en) Chinese poetry wine culture named entity recognition method
Tolegen et al. Voted-perceptron approach for Kazakh morphological disambiguation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination