CN114818717A - Chinese named entity recognition method and system fusing vocabulary and syntax information - Google Patents
- Publication number
- CN114818717A (application number CN202210575509.3A)
- Authority
- CN
- China
- Prior art keywords
- word
- vector
- information
- syntax
- word set
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
- G06F40/295—Named entity recognition
- G06F40/237—Lexical tools
- G06F40/242—Dictionaries
- G06F40/253—Grammatical analysis; Style critique
- G06N3/045—Combinations of networks
- G06N3/049—Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
- G06N3/08—Learning methods
Abstract
The invention discloses a Chinese named entity recognition method and system fusing vocabulary and syntax information, comprising the following steps: step 1, mapping an original input text into word vectors, introducing external vocabulary information by using an improved word-set matching algorithm, and integrating the external vocabulary information into the input representation of each word; step 2, extracting context information by using a bidirectional LSTM according to the input representation of each word; step 3, obtaining part-of-speech tags and syntactic constituents from the original input text by using an NLP tool, constructing a syntax vector by using a key-value memory network, and performing weighted fusion of the context vector and the syntax vector by using a gating mechanism to obtain a feature vector; and step 4, inputting the feature vector into the CRF (conditional random field) of the label prediction layer to realize Chinese named entity recognition. The invention addresses the problem of insufficient entity-boundary information in Chinese named entities and fuses the syntactic information of the input text.
Description
Technical Field
The invention relates to the field of information extraction of natural language processing, in particular to a Chinese named entity recognition method and system fusing vocabulary and syntax information.
Background
Named Entity Recognition (NER) aims to recognize entities in text and classify them into categories such as person names, place names, and organization names. NER is an important task in NLP and has been widely applied in relation extraction, question answering, machine translation, knowledge base construction, and other fields, so research and breakthroughs in NER are of great significance.
Chinese named entity recognition differs from English: each English word can express complete semantic information, whereas in Chinese a complete meaning is in most cases expressed only by a multi-character word or phrase, and Chinese lacks obvious cues such as word-boundary delimiters and capitalization, making entity boundaries difficult to identify. Word boundaries, however, usually coincide with entity boundaries, so word-boundary information plays an important role in Chinese Named Entity Recognition (CNER).
The difficulty of recognizing word boundaries can be alleviated by introducing external features. Among these features, lexical information and syntactic information are particularly meaningful and can help a CNER model locate the corresponding entities. Existing CNER models, however, rarely distinguish or filter such external features when using them, and the noise they contain may degrade model performance. Finding a suitable way to integrate external feature information into a CNER model therefore remains a challenge. In most cases it is desirable for a CNER model to incorporate a variety of additional features, so an effective mechanism is needed to weight and combine these features and limit the noisy information.
Meanwhile, the existing SoftLexicon word-set matching method depends on static word-frequency statistics over the data set, using word frequency to measure the contribution of different words to the Chinese named entity recognition task. Since data sets vary in scale, word frequencies can be too low on small data sets, and frequency does not always reflect the importance of a word well. A more reasonable method is therefore needed to weigh the words in a word set.
Disclosure of Invention
The invention mainly aims to overcome the defects of the prior art and provides a Chinese named entity recognition method and system fusing vocabulary and syntax information (abbreviated as the LSF-CNER model). Specifically, external vocabulary information and syntactic information of the input text are fused by a gating unit, and an attention mechanism is introduced to construct the Chinese named entity recognition model, so as to improve the accuracy of Chinese named entity recognition. The problems solved by the invention lie in two aspects. On one hand, the improved word-set matching algorithm concatenates, for each word of the input text sequence, the matched static word-set vector, dynamic word-set vector, and initial word vector, thereby integrating external vocabulary information into the word vector and addressing the lack of word-boundary features in Chinese text. On the other hand, the syntactic information of the input text is extracted with an NLP tool and integrated, via a gating mechanism, with the context vectors extracted by a bidirectional LSTM, enriching the feature-vector representation and fusing deeper syntactic information.
The invention adopts the following technical scheme:
on one hand, the Chinese named entity recognition method fusing vocabulary and syntax information comprises the following steps:
step 1, mapping an original input text into a word vector, introducing external vocabulary information by using an improved word set matching algorithm, and integrating the external vocabulary information into the input representation of each word;
step 2, extracting context information by using a bidirectional LSTM according to the input representation of the character;
step 3, obtaining part-of-speech tags and syntactic constituents from the original input text by using an NLP tool, constructing a syntax vector by using a key-value memory network, and performing weighted fusion of the context vector and the syntax vector by using a gating mechanism to obtain a feature vector;
and step 4, inputting the feature vector into the CRF (conditional random field) of the label prediction layer to realize Chinese named entity recognition.
Preferably, the step 1 specifically includes:
step 1.1, regarding the input text as a sentence represented as a sequence $x = (x_1, x_2, \ldots, x_n)$, where $x_i$ denotes the $i$-th character in the sentence $x$ of length $n$; to better utilize the lexical information, the words matched for each character against the dictionary are divided into the following four "BIES" word sets:
(1) word set $B(x_i)$: all matched words that begin with $x_i$;
(2) word set $I(x_i)$: all matched words in which $x_i$ occurs in the middle;
(3) word set $E(x_i)$: all matched words that end with $x_i$;
(4) word set $S(x_i)$: the single-character word consisting of $x_i$ alone;
step 1.2, after the "BIES" word sets corresponding to each character are obtained, each word set is compressed into a fixed-dimension vector; the improved word-set matching algorithm comprises a static word-set algorithm and a dynamic word-set algorithm. To ensure computational efficiency, the static word-set algorithm uses the occurrence frequency of each word as its weight; the static word-set vector of a single word set is computed as:
$$\tau_i^T = \sum_{w \in T(x_i)} \frac{z(w)}{Z_T}\, e^w(w), \qquad Z_T = \sum_{w \in T(x_i)} z(w)$$

where $z(w)$ is the number of occurrences of the word $w$ in the corpus; $Z_T$ is the total number of occurrences of the words in the word set $T$; $e^w(w)$ maps the word $w$ to its word vector; $T$ is one of the four "BIES" word sets; and $\tau_i^T$ is the vector representation of the word set $T$ for the character $x_i$;
for better information retention, four static word sets are represented as a whole and integrated into a vector with a fixed dimension by splicing:
wherein, tau i Representing a word x i A corresponding static word set vector;
the dynamic vocabulary set algorithm uses an attention mechanism to measure information between characters and matching words, calculates attention weights of different matching words, enhances useful vocabularies and suppresses vocabularies with insignificant effects as follows:
$$u_j = q^\top \tanh\!\left(W\, e^w(w_j^T) + b\right), \qquad a_j = \frac{\exp(u_j)}{\sum_{k=1}^{m}\exp(u_k)}$$

where $e^w(w_j^T)$ maps the $j$-th matched word of the word set $T$ to its word vector; $q$ is a trainable vector with the same dimension as $e^w(w_j^T)$, and $W$, $b$ are the weight and bias of the scoring layer; $u_j$ is the word attention score obtained by the attention mechanism; $a_j$ is the normalized word attention weight; and $m$ is the number of words matched in the word set $T$ for the character $x_i$;
The dynamic word-set vector of a single word set is obtained by this attention-weighted summation, and the four dynamic word sets are then represented as a whole and concatenated into a fixed-dimension vector:

$$A\tau_i^T = \sum_{j=1}^{m} a_j\, e^w(w_j^T), \qquad A\tau_i = \left[A\tau_i^B; A\tau_i^I; A\tau_i^E; A\tau_i^S\right]$$

where $A\tau_i$ is the dynamic word-set vector corresponding to the character $x_i$;
step 1.3, to fully consider the importance of each word in the two word sets, the dynamic word-set vector $A\tau_i$ and the static word-set vector $\tau_i$ are combined by dynamic weighting; an evaluation function $\lambda_i^{w}$ measures the contribution of the static word-set vector and the dynamic word-set vector to the entity recognition task:

$$\lambda_i^{w} = \sigma\!\left(W_1 \tau_i + W_2 A\tau_i + b_w\right)$$

The word vector is then concatenated with the gated combination of the static word-set vector $\tau_i$ and the dynamic word-set vector $A\tau_i$, yielding the input representation that finally contains the external vocabulary information:

$$x_i^{final} = \left[e^x(x_i);\ \lambda_i^{w} \odot \tau_i + (\mathbf{1}_l - \lambda_i^{w}) \odot A\tau_i\right]$$

where $x_i^{final}$ is the final vector representation of the character $x_i$; $\mathbf{1}_l$ is the all-ones vector whose dimension $l$ matches $\tau_i$; $e^x(x_i)$ converts the character $x_i$ into its character vector; $W_1$, $W_2$ are trainable matrices and $b_w$ is a bias term; $\odot$ denotes element-wise product; and $[\cdot;\cdot]$ denotes vector concatenation;
preferably, the step 2 specifically includes:
the sequence coding layer adopts a bidirectional LSTM to obtain a context vector of each word, wherein the bidirectional LSTM is a combination of a forward LSTM and a backward LSTM; use ofRepresenting hidden layer states of the LSTM in the forward direction at time i, usingA hidden layer state representing the inverted LSTM at time i; by concatenating the corresponding forward and backward LSTM states, a final context vector is obtained
Preferably, the step 3 specifically includes:
step 3.1, segmenting the original text into words with the Stanford CoreNLP tool, and extracting two kinds of syntactic information of the input text, namely part-of-speech tags and syntactic constituents, with the Berkeley Neural Parser; the part-of-speech tag carries label information of a single word, while the syntactic constituent carries structural grouping information of a text span;
step 3.2, for each $x_i$ in the input sequence, mapping its context features and syntax information to the keys and values of the key-value memory network (KVMN), respectively represented as $K_i = (k_{i,1}, \ldots, k_{i,m_i})$ and $V_i = (v_{i,1}, \ldots, v_{i,m_i})$, where $m_i$ is the number of features associated with $x_i$;
step 3.3, using two embedding matrices to map $k_{i,j}$ and $v_{i,j}$ to the embeddings $e_{i,j}^{k}$ and $e_{i,j}^{v}$ respectively; for the context features $K_i$ and syntax information $V_i$ associated with each $x_i$, the weight $\gamma_{i,j}$ assigned to the syntax information is:

$$\gamma_{i,j} = \frac{\exp\!\left(h_i^\top e_{i,j}^{k}\right)}{\sum_{j=1}^{m_i} \exp\!\left(h_i^\top e_{i,j}^{k}\right)}$$

where $h_i$ is the hidden vector of $x_i$ obtained from the sequence encoding layer and $h_i^\top$ is its transpose;
The weight $\gamma_{i,j}$ is then applied to the corresponding syntax information $v_{i,j}$ as follows:

$$\alpha_i = \sum_{j=1}^{m_i} \gamma_{i,j}\, e_{i,j}^{v}$$

where $\alpha_i$ is the weighted syntax information of the KVMN for $x_i$; the KVMN thus weights the syntax information according to the corresponding context features, so that important information is distinguished and exploited;
step 3.4, to better utilize the syntax information encoded by the KVMN, the syntax vector $\alpha_i$ and the context vector $h_i$ are combined by dynamic weighting, with an evaluation function $\lambda_i$ measuring the contribution of the context vector $h_i$ and the syntax vector $\alpha_i$ to the sentence:
$$\lambda_i = \sigma\!\left(W_{\lambda 1} h_i + W_{\lambda 2} \alpha_i + b_\lambda\right)$$

where $W_{\lambda 1}$ and $W_{\lambda 2}$ are trainable matrices; $b_\lambda$ is a bias term; and $\sigma$ denotes the sigmoid activation function;
The syntax vector $\alpha_i$ and the context vector $h_i$ are then combined:

$$O_i = \lambda_i \odot h_i + (\mathbf{1}_l - \lambda_i) \odot \alpha_i$$

where $\mathbf{1}_l$ is the all-ones vector whose dimension $l$ matches $h_i$; the resulting $O_i$ is the feature vector fusing the context information and the syntax information.
Preferably, the step 4 specifically includes:
for an input sequence x ═ x 1 ,x 2 ,..,x n ) Given a prediction sequence y ═ y (y) 1 ,y 2 ,y 3 ,...,y n ) The score for the predicted sequence is calculated as follows:
where, M is the transition matrix,indicating slave label y i To the label y i+1 The transfer fraction of (a);y for i word in sentence i ' probability scores of tags; y is 0 And y n+1 Respectively representing a start and end tag; n represents the length of the input sequence;
the probability p (y | x) of the predicted sequence y generation is calculated using the softmax function as follows:
wherein, Y x Representing the set of all possible predicted sequences of y for the solution space;is an example of the entire sequence tagged solution space,a score representing the instance;
during the training process, the likelihood probability log (p (x, y)) of a correct predicted sequence is maximized as follows:
after continuous iterative training and back propagation, the obtained y * The result of labeling with the CRF sequence, i.e. the output of the final model:
In another aspect, a system for Chinese named entity recognition incorporating lexical and syntactic information, comprising:
the input representation acquisition module is used for mapping an original input text into a word vector, introducing external vocabulary information by using an improved word set matching algorithm and integrating the external vocabulary information into the input representation of each word;
a context vector acquisition module for extracting context information using a bidirectional LSTM according to the input representation of the word;
the context information and syntax information fusion module is used for obtaining part-of-speech tags and syntactic constituents from the original input text by using an NLP tool, constructing a syntax vector by using a key-value memory network, and performing weighted fusion of the context vector and the syntax vector by using a gating mechanism to obtain a feature vector;
and the sequence labeling module is used for inputting the feature vectors into the CRF (conditional random field) of the label prediction layer to realize Chinese named entity recognition.
As can be seen from the above description of the present invention, compared with the prior art, the present invention has the following advantages:
the invention introduces external vocabulary information in the word vector through an improved word set matching algorithm, and integrates the context vector and the syntactic vector in a syntactic information layer so as to solve the problem of insufficient entity boundary information in the Chinese named entity and fuse the syntactic information of the input text.
Drawings
FIG. 1 is a flow chart of a method for Chinese named entity recognition incorporating lexical and syntactic information in accordance with an embodiment of the present invention;
FIG. 2 is a model diagram of a method for identifying a named entity in Chinese incorporating lexical and syntactic information according to an embodiment of the present invention;
FIG. 3 is a diagram illustrating a dictionary matching method corresponding to Chinese linguistics according to an embodiment of the present invention;
FIG. 4 is a diagram of syntax information acquisition according to an embodiment of the present invention;
FIG. 5 is a block diagram of a system for Chinese named entity recognition incorporating lexical and syntactic information according to an embodiment of the present invention.
Detailed Description
The invention will be further illustrated with reference to the following specific examples. It should be understood that these examples are for illustrative purposes only and are not intended to limit the scope of the present invention. Further, it should be understood that various changes or modifications of the present invention may be made by those skilled in the art after reading the teaching of the present invention, and such equivalents may fall within the scope of the present invention as defined in the appended claims.
Referring to fig. 1, the method for recognizing a named entity in chinese, which fuses vocabulary and syntax information, according to the present invention, includes:
step 1, mapping an original input text into a word vector, introducing external vocabulary information by using an improved word set matching algorithm, and integrating the external vocabulary information into the input representation of each word;
step 2, extracting context information by using a bidirectional LSTM according to the input representation of the character;
step 3, obtaining part-of-speech tags and syntactic constituents from the original input text by using an NLP tool, constructing a syntax vector by using a key-value memory network, and performing weighted fusion of the context vector and the syntax vector by using a gating mechanism to obtain a feature vector;
and step 4, inputting the feature vector into the CRF (conditional random field) of the label prediction layer to realize Chinese named entity recognition.
Specifically, a model diagram of the method of the present invention is shown in fig. 2, and the model is divided into four parts: the input represents the layer, sequence coding layer, syntax information layer and label prediction layer.
The invention relates to a Chinese named entity recognition method fusing vocabulary and syntax information, which comprises the following specific implementation steps:
(1) acquisition of an input representation containing external vocabulary information
The invention provides a word-set matching method that introduces an attention mechanism on the basis of SoftLexicon, combines the matched static and dynamic word-set information, and then fuses the two with a gating mechanism.
1.1) static word set Algorithm
In the Chinese NER model, a sentence of original input text is treated as a sequence $x = (x_1, x_2, \ldots, x_n)$, where $x_i$ represents the $i$-th character in the sentence $x$ of length $n$. To better utilize the lexical information, the words matched for each character against the dictionary are divided into four word sets:
a. word set $B(x_i)$: all matched words that begin with $x_i$;
b. word set $I(x_i)$: all matched words in which $x_i$ occurs in the middle;
c. word set $E(x_i)$: all matched words that end with $x_i$;
d. word set $S(x_i)$: the single-character word consisting of $x_i$ alone.
These four word sets are collectively referred to as "BIES" and are respectively defined as:

$$B(x_i) = \{w_{i,k} \mid w_{i,k} \in L,\ i < k \le n\}$$
$$I(x_i) = \{w_{j,k} \mid w_{j,k} \in L,\ 1 \le j < i < k \le n\}$$
$$E(x_i) = \{w_{j,i} \mid w_{j,i} \in L,\ 1 \le j < i\}$$
$$S(x_i) = \{x_i \mid x_i \in L\}$$

where $L$ denotes the external dictionary, and $w_{j,k}$ denotes a word present in the external dictionary that begins with the character $x_j$ of the sequence and ends with the character $x_k$. When the input of the input layer is "Chinese linguistics", the corresponding dictionary matching is as shown in fig. 3.
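The "BIES" matching above can be sketched as follows. The toy lexicon stands in for the external dictionary $L$ and is purely illustrative; the sentence is the "Chinese linguistics" (中国语言学) example of fig. 3:

```python
# Sketch of BIES word-set matching: for each character x_i, collect the
# dictionary words in which x_i begins (B), lies inside (I), ends (E),
# or forms a single-character word (S).
def match_bies(sentence, lexicon):
    n = len(sentence)
    max_len = max(map(len, lexicon))
    sets = [{"B": set(), "I": set(), "E": set(), "S": set()} for _ in range(n)]
    for j in range(n):
        for k in range(j + 1, min(n, j + max_len) + 1):
            w = sentence[j:k]
            if w not in lexicon:
                continue
            if len(w) == 1:
                sets[j]["S"].add(w)
            else:
                sets[j]["B"].add(w)
                sets[k - 1]["E"].add(w)
                for mid in range(j + 1, k - 1):
                    sets[mid]["I"].add(w)
    return sets

lexicon = {"中国", "中国语言学", "语言", "语言学", "学"}  # toy dictionary
bies = match_bies("中国语言学", lexicon)
```

Empty word sets would be filled with a special "NONE" word in practice so that every character has all four vectors.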
After the "BIES" word sets corresponding to each character are obtained, each word set is compressed into a fixed-dimension vector. The static word-set algorithm is faster than the dynamic word-set algorithm that uses the attention mechanism, and to ensure computational efficiency it uses the occurrence frequency of each word as its weight. Since the frequency of a word is a static value, a high frequency can reflect the importance of the word, which speeds up the weight calculation. For the character $x_i$, the corresponding word set is $T(x_i) = (w_1^T, w_2^T, \ldots, w_m^T)$, where $w_j^T$ is the $j$-th word in the word set $T$ of length $m$ corresponding to $x_i$. Following this idea, the static word-set vector is obtained as:

$$\tau_i^T = \sum_{w \in T(x_i)} \frac{z(w)}{Z_T}\, e^w(w), \qquad Z_T = \sum_{w \in T(x_i)} z(w)$$

where $z(w)$ is the number of occurrences of the word $w$ in the corpus; $Z_T$ is the total number of occurrences of the words in the word set $T$; $e^w(w)$ maps the word $w$ to its word vector; $T$ is one of the four "BIES" word sets (the superscript marking the word set); and $\tau_i^T$ is the vector representation of the word set $T$ for the character $x_i$. For better information retention, the four static word sets are represented as a whole and concatenated into a fixed-dimension vector $\tau_i = [\tau_i^B; \tau_i^I; \tau_i^E; \tau_i^S]$.
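The frequency weighting of a single word set can be sketched as below; the counts and 3-dimensional embeddings are toy values, not corpus statistics:

```python
# Static word-set vector: tau_i^T = sum_w (z(w) / Z_T) * e^w(w),
# i.e., a frequency-weighted average of the matched words' embeddings.
def static_set_vector(word_set, freq, embed):
    total = sum(freq[w] for w in word_set)          # Z_T
    dim = len(next(iter(embed.values())))
    vec = [0.0] * dim
    for w in word_set:
        weight = freq[w] / total                    # z(w) / Z_T
        for d in range(dim):
            vec[d] += weight * embed[w][d]
    return vec

freq = {"语言": 3, "语言学": 1}                     # toy corpus counts
embed = {"语言": [1.0, 0.0, 0.0], "语言学": [0.0, 1.0, 0.0]}
tau = static_set_vector({"语言", "语言学"}, freq, embed)
```

The full static representation would concatenate the four such vectors for B, I, E, and S.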
1.2) dynamic word set Algorithm
The attention mechanism can assign different weights to different words: larger weights to the few key words and smaller weights to irrelevant words. The dynamic word-set algorithm uses attention to measure the information between characters and their matched words, computing the importance of the different matched words so as to enhance useful words and suppress words with little effect. For the character $x_i$, the corresponding word set is $T(x_i) = (w_1^T, w_2^T, \ldots, w_m^T)$, where $w_j^T$ is the $j$-th word in the word set $T$ of length $m$ corresponding to $x_i$. Taking the word vectors $e^w(w_j^T)$ as input, a three-layer neural network produces word importance scores of dimension $1 \times m$ with values in $(0, 1)$:

$$u_j = q^\top \tanh\!\left(W\, e^w(w_j^T) + b\right), \qquad a_j = \frac{\exp(u_j)}{\sum_{k=1}^{m}\exp(u_k)}$$

where $q$ is a trainable vector with the same dimension as $e^w(w_j^T)$, $W$ and $b$ are the weight and bias of the scoring layer, $u_j$ is the word importance score obtained by the attention mechanism, and $a_j$ is the normalized word importance score. The dynamic word-set vector is then obtained from the importance scores, and the four dynamic word sets are represented as a whole and concatenated into a fixed-dimension vector:

$$A\tau_i^T = \sum_{j=1}^{m} a_j\, e^w(w_j^T), \qquad A\tau_i = \left[A\tau_i^B; A\tau_i^I; A\tau_i^E; A\tau_i^S\right]$$

where $A\tau_i^T$ is the dynamic word-set vector of the word set $T$ for the character $x_i$; $A\tau_i$ is the dynamic word-set vector corresponding to $x_i$; and $m$ is the number of words matched in the word set $T$.
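A minimal sketch of the attention weighting for one word set, with $W$ taken as the identity and $b = 0$ purely to keep the example small; all numbers are toy values:

```python
import math

# Dynamic word-set vector: scores u_j = q . tanh(e_j) (W = I, b = 0 here),
# softmax-normalized into weights a_j, then a weighted sum of word vectors.
def dynamic_set_vector(word_vecs, q):
    scores = [sum(qd * math.tanh(ed) for qd, ed in zip(q, e)) for e in word_vecs]
    mx = max(scores)                                # stabilize the softmax
    exps = [math.exp(s - mx) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]                 # normalized attention a_j
    dim = len(q)
    out = [sum(w * e[d] for w, e in zip(weights, word_vecs)) for d in range(dim)]
    return out, weights

vecs = [[2.0, 0.0], [0.0, 2.0]]                     # toy matched-word embeddings
q = [1.0, 0.0]                                      # "trainable" vector, fixed here
atau, a = dynamic_set_vector(vecs, q)
```

The first word is aligned with $q$, so it receives the larger attention weight.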
1.3) gating mechanism
To fully consider the importance of each word in each word set, the dynamic word-set vector $A\tau_i$ and the static word-set vector $\tau_i$ are combined by dynamic weighting. An evaluation function $\lambda_i^{w}$ measures their respective contributions to the entity recognition task:

$$\lambda_i^{w} = \sigma\!\left(W_1 \tau_i + W_2 A\tau_i + b_w\right)$$

where $W_1$ and $W_2$ are trainable matrices and $b_w$ is a bias term. The static word-set vector $\tau_i$ and the dynamic word-set vector $A\tau_i$ are then combined and concatenated with the character vector:

$$x_i^{final} = \left[e^x(x_i);\ \lambda_i^{w} \odot \tau_i + (\mathbf{1}_l - \lambda_i^{w}) \odot A\tau_i\right]$$

where $e^x(x_i)$ converts the character $x_i$ into its character vector; $x_i^{final}$ is the final vector representation of $x_i$; $\mathbf{1}_l$ is the all-ones vector whose dimension $l$ matches $\tau_i$; $\odot$ denotes element-wise product; and $[\cdot;\cdot]$ denotes vector concatenation. In this way the pre-trained word vectors are embedded into the character representation, and the external dictionary information can be used reasonably.
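The gating step can be sketched element-wise as below; $W_1$ and $W_2$ are reduced to per-dimension scalars (diagonal matrices) and the numbers are toy values, not trained parameters:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Gated fusion of static (tau) and dynamic (atau) word-set vectors:
# lam = sigma(w1*tau + w2*atau + b); fused = lam*tau + (1-lam)*atau.
def gate_fuse(tau, atau, w1, w2, b):
    fused = []
    for t, a in zip(tau, atau):
        lam = sigmoid(w1 * t + w2 * a + b)
        fused.append(lam * t + (1.0 - lam) * a)
    return fused

fused = gate_fuse([1.0, 0.0], [0.0, 1.0], w1=0.5, w2=0.5, b=0.0)
```

Each output dimension is a convex combination of the two inputs, so the gate interpolates rather than adds, limiting noise from either source.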
(2) Acquisition of context vectors
Long short-term memory (LSTM) networks are a variant of recurrent neural networks widely used in NLP tasks such as NER, text classification, and sentiment analysis. The LSTM introduces a cell state and maintains and controls information through an input gate, a forget gate, and an output gate, effectively mitigating the gradient explosion and vanishing gradients caused by long-range dependencies in RNN models. The LSTM model is expressed mathematically as follows:
$$i_t = \sigma(W_i [h_{t-1}, \tau_t] + b_i) \tag{15}$$
$$f_t = \sigma(W_f [h_{t-1}, \tau_t] + b_f) \tag{16}$$
$$o_t = \sigma(W_o [h_{t-1}, \tau_t] + b_o) \tag{17}$$
$$\tilde{c}_t = \tanh(W_c [h_{t-1}, \tau_t] + b_c) \tag{18}$$
$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \tag{19}$$
$$h_t = o_t \odot \tanh(c_t) \tag{20}$$

where $\sigma$ denotes the sigmoid activation function and $\tanh$ the hyperbolic tangent; $\tau_t$ is the unit input; $i_t$, $f_t$, $o_t$ are the input, forget, and output gates respectively; $W_i$, $W_f$, $W_o$ and $b_i$, $b_f$, $b_o$ are the weights and biases of the input, forget, and output gates ($W_c$, $b_c$ those of the candidate state); $\tilde{c}_t$ is the current candidate input state; and $c_t$ is the cell state at the current step.
To use the context information of each word in both directions, the model adopts a bidirectional LSTM, a combination of a forward LSTM and a backward LSTM, to obtain the context vector of each word. Let $\overrightarrow{h_i}$ denote the hidden state of the forward LSTM at time $i$ and $\overleftarrow{h_i}$ that of the backward LSTM at time $i$; concatenating the corresponding forward and backward states gives the final context vector $h_i = [\overrightarrow{h_i}; \overleftarrow{h_i}]$.
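Equations (15)–(20) for a single scalar unit can be sketched as follows; the weights and biases are toy constants, not trained values:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# One LSTM cell step following Eqs. (15)-(20), with scalar states:
# each gate reads the pair [h_{t-1}, tau_t] through its own (w_h, w_x) weights.
def lstm_step(h_prev, c_prev, x, W, b):
    i = sigmoid(W["i"][0] * h_prev + W["i"][1] * x + b["i"])   # input gate (15)
    f = sigmoid(W["f"][0] * h_prev + W["f"][1] * x + b["f"])   # forget gate (16)
    o = sigmoid(W["o"][0] * h_prev + W["o"][1] * x + b["o"])   # output gate (17)
    c_tilde = math.tanh(W["c"][0] * h_prev + W["c"][1] * x + b["c"])  # (18)
    c = f * c_prev + i * c_tilde                                # cell state (19)
    h = o * math.tanh(c)                                        # hidden state (20)
    return h, c

W = {g: (0.5, 0.5) for g in "ifoc"}
b = {g: 0.0 for g in "ifoc"}
h, c = lstm_step(h_prev=0.0, c_prev=0.0, x=1.0, W=W, b=b)
```

A bidirectional LSTM would run this recurrence left-to-right and right-to-left and concatenate the two hidden states per position.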
(3) Fusion of contextual information and syntactic information
The context information is output by the bidirectional LSTM of the sequence encoding layer; the syntax information is first extracted from the input sequence by an NLP tool and then mapped through the key-value memory network; finally, a gating mechanism integrates the two kinds of information.
3.1) syntax information acquisition
The invention uses the Stanford CoreNLP tool to segment the text into words and then extracts two kinds of syntactic information, namely part-of-speech tags and syntactic constituents, with the Berkeley Neural Parser. The part-of-speech tags carry label information for individual words, while the syntactic constituents carry structural grouping information for text spans. Taking the example sentence glossed as "liberation avenue road-surface water" as an example, fig. 4 shows its part-of-speech tags and syntactic constituents. For each $x_i$ in the input sequence $x = (x_1, x_2, \ldots, x_n)$, the context features and syntax information are extracted as follows.
3.1.1) Part-of-speech tags: taking each x_i as the core word, a window of ±1 words is used to extract the context words on both sides of x_i together with their part-of-speech tags. As shown in fig. 4(a), the context features obtained for the word "avenue" are (liberation, avenue, road surface), and the combination of these words with their corresponding part-of-speech tags serves as the part-of-speech information for the NER task, i.e. the corresponding syntax information is (liberation_NN, avenue_NN, road surface_NN).
3.1.2) Syntactic components: starting from the leaf node of x_i in the syntax tree, search upwards along the tree to find the first syntax node; all words under that node are then taken as context features, and the combination of these words with the corresponding syntactic label serves as the syntactic component information for the NER task. As shown in fig. 4(b), for the word "avenue", searching upwards finds the syntax node "NP", which covers the two words "liberation" and "avenue". The context features are therefore (liberation, avenue), and their combination with the "NP" label serves as the syntactic component information, i.e. the corresponding syntax information is (liberation_NP, avenue_NP).
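Assuming the segmentation, POS tags, and parse tree are already available (CoreNLP and the Berkeley parser are not invoked here), the two extraction rules can be sketched as follows. The nested-tuple tree encoding and the English token names are hypothetical stand-ins for the parse in fig. 4.

```python
def pos_features(tokens, pos_tags, i):
    """Rule 3.1.1: window of +/-1 around core word i -> 'word_TAG' context features."""
    lo, hi = max(0, i - 1), min(len(tokens), i + 2)
    return [f"{tokens[j]}_{pos_tags[j]}" for j in range(lo, hi)]

def constituent_features(tree, word):
    """Rule 3.1.2: climb from the word's leaf to the first syntax node above it,
    then tag every word under that node with the node's label.
    tree: nested tuples (label, children); a leaf is a plain string."""
    def find_path(node, path):
        if isinstance(node, str):
            return path if node == word else None
        for child in node[1]:
            found = find_path(child, path + [node])
            if found is not None:
                return found
        return None

    def leaves(node):
        if isinstance(node, str):
            return [node]
        return [w for child in node[1] for w in leaves(child)]

    path = find_path(tree, [])
    if not path:
        return []
    phrase = path[-1]                      # lowest syntax node above the leaf
    return [f"{w}_{phrase[0]}" for w in leaves(phrase)]
```

On a toy parse mirroring fig. 4, both rules reproduce the feature tuples given in the text.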
3.2) KVMN construction
The syntax information extracted by these tools contains a certain amount of noise, and using it improperly may hurt model performance. Variants of the key-value memory network (KVMN) have proven effective at incorporating context features and provide an appropriate way to exploit context features together with their corresponding dependency types.
When constructing the KVMN, the parsing results corresponding to the input sequence x = (x_1, x_2, ..., x_n) are first obtained. For each x_i in the input sequence, its context features and syntax information are mapped to keys and values in the KVMN, denoted K_i = [k_{i,1}, ..., k_{i,j}, ..., k_{i,m_i}] and V_i = [v_{i,1}, ..., v_{i,j}, ..., v_{i,m_i}] respectively, where m_i denotes the number of context features of x_i, and K and V represent the context features and syntax information. Next, two embedding matrices map k_{i,j} and v_{i,j} to the embeddings e^k_{i,j} and e^v_{i,j}. For each x_i with associated context features K_i and syntax information V_i, the weight assigned to the syntax information is:

γ_{i,j} = exp(h_i^T · e^k_{i,j}) / Σ_{j'=1}^{m_i} exp(h_i^T · e^k_{i,j'})

where h_i is the hidden vector of x_i obtained from the sequence encoding layer. The weight γ_{i,j} is then applied to the corresponding syntax information v_{i,j}:

α_i = Σ_{j=1}^{m_i} γ_{i,j} · e^v_{i,j}

where α_i is the weighted syntax information the KVMN produces for x_i. In this way, the KVMN ensures that the syntax information is weighted according to its corresponding context features, so that the important information is distinguished and used.
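The key-value read described above amounts to a softmax attention over key embeddings followed by a weighted sum of value embeddings. A minimal NumPy sketch, using random toy embeddings rather than the learned ones:

```python
import numpy as np

def kvmn_attend(h_i, key_embs, val_embs):
    """gamma_j = softmax_j(h_i . e^k_j); alpha_i = sum_j gamma_j * e^v_j.
    key_embs/val_embs: (m, d) arrays of key/value embeddings for one position."""
    scores = key_embs @ h_i              # one score per context-feature key
    scores = scores - scores.max()       # shift for numerical stability
    gamma = np.exp(scores)
    gamma /= gamma.sum()                 # attention weights over the m keys
    alpha = gamma @ val_embs             # weighted syntax information alpha_i
    return gamma, alpha
```

Keys that align with the hidden vector receive the largest weights, so their syntax values dominate alpha_i, which is the "distinguish and use the important information" behavior described above.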
3.3) gating mechanism
To make better use of the KVMN-encoded syntax information, the syntax vector α_i and the context vector h_i are combined by dynamic weighting. An evaluation function λ_i is used to measure the contributions of the context vector h_i and the syntax vector α_i to the sentence:
λ_i = σ(W_λ1 · h_i + W_λ2 · α_i + b_λ)    (23)
where W_λ1 and W_λ2 are trainable matrices and b_λ is a bias term. The syntax vector α_i and the context vector h_i are then combined:

O_i = (λ_i ∘ h_i) ⊕ ((1 − λ_i) ∘ α_i)

where O_i is the output of the KVMN corresponding to x_i; 1 is an all-ones vector whose dimension matches h_i; ∘ denotes the matrix dot product calculation; and ⊕ denotes vector concatenation.
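A minimal sketch of the gate in eq. (23) and the fusion step. Whether the gated halves are summed or concatenated is a modelling choice the text leaves partly open; concatenation is used here to match the "vector splicing" wording, and the weights are random placeholders, not trained parameters.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gate_fuse(h_i, alpha_i, W1, W2, b):
    """lambda_i = sigma(W1 h_i + W2 alpha_i + b); keep lambda*h and (1-lambda)*alpha."""
    lam = sigmoid(W1 @ h_i + W2 @ alpha_i + b)       # per-dimension gate in (0, 1)
    return np.concatenate([lam * h_i, (1.0 - lam) * alpha_i])
```

The gate lets each dimension of the output lean toward either the context vector or the syntax vector, which is the dynamic weighting described above.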
(4) Sequence annotation using CRF
Compared with the HMM, the CRF imposes no strict independence assumptions and can effectively exploit the sequence and external observation information, avoiding the label bias produced by directly assuming the labels. The CRF can also capture dependencies between labels: for example, an "I-ORG" tag cannot follow "B-LOC". In CNER, the input of the CRF is the feature vector O_i learned by the syntax information layer. For an input sequence x = (x_1, x_2, ..., x_n), let P_{i,j} denote the probability score of the j-th tag for the i-th word in the sentence. For a predicted sequence y = (y_1, y_2, y_3, ..., y_n), the score of the predicted sequence is calculated as:

score(x, y) = Σ_{i=0}^{n} M_{y_i, y_{i+1}} + Σ_{i=1}^{n} P_{i, y_i}

where M is the transition matrix and M_{i,j} represents the transfer score from tag i to tag j; y_0 and y_{n+1} represent the start and end tags respectively. The probability that the predicted sequence y is generated is calculated using the softmax function:

p(y | x) = exp(score(x, y)) / Σ_{ỹ ∈ Y_x} exp(score(x, ỹ))

where Y_x represents the set of all possible predicted sequences. During training, the log-likelihood of the correct predicted sequence is maximized:

log p(y | x) = score(x, y) − log Σ_{ỹ ∈ Y_x} exp(score(x, ỹ))

After training, the highest-scoring sequence is output as the prediction:

y* = argmax_{ỹ ∈ Y_x} score(x, ỹ)
(5) Evaluation of effects
To verify the effectiveness of the method of the present invention and compare it with other models, experiments were carried out on three Chinese named entity datasets: the Weibo dataset, the Resume dataset, and the MSRA dataset. The experimental metrics are precision, recall, and the F1 value.
The Weibo dataset is drawn from posts on the Weibo social network and contains four entity types: person name, place name, organization name, and geopolitical entity. The MSRA dataset comes from the news domain and includes only a training set and a test set, with three entity types: person name, place name, and organization. The Resume dataset consists of resume information from Sina Finance and contains eight entity types: city, educational institution, place name, person name, institution name, proper noun, professional background, and job title. The details of each dataset are shown in table 1.
Table 1 data set details
To verify the effectiveness of the method herein, the invention compares against the following 3 baseline models:
(1) Lattice-LSTM: introduces a word-cell structure and fuses the information of all words ending with the current character;
(2) FLAT: improves on the Transformer structure, designing an ingenious position encoding to fuse the lattice structure and improving absolute position encoding to make it better suited to NER tasks;
(3) SoftLexicon: exploits the lexicon with a simple method in the input representation layer, offering high portability.
The evaluation criteria are as follows:
The evaluation metrics of the experiments are precision (Precision), recall (Recall), and the F1 value (F1 score). Precision is the proportion of recognized entity words that are correct; recall is the proportion of all entities in the dataset that are correctly recognized. Since precision and recall often trade off against each other, F1 is taken as their harmonic mean. The evaluation functions are:

P = TP / (TP + FP)
R = TP / (TP + FN)
F1 = 2 · P · R / (P + R)
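As a concrete check of these metrics, entity-level precision, recall, and F1 over predicted versus gold spans can be computed as below. This is a generic sketch, not the evaluation script used in the experiments; the span encoding (type, start, end) is an assumption.

```python
def prf1(predicted, gold):
    """Entity-level precision, recall, and harmonic-mean F1 over span sets."""
    pred, ref = set(predicted), set(gold)
    tp = len(pred & ref)                         # correctly recognized entities
    p = tp / len(pred) if pred else 0.0          # correct share of predictions
    r = tp / len(ref) if ref else 0.0            # recovered share of gold entities
    f1 = 2 * p * r / (p + r) if p + r else 0.0   # harmonic mean of p and r
    return p, r, f1
```

For example, predicting three entities of which two match the three gold entities gives P = R = F1 = 2/3.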
the hyper-parameters used by the model are shown in table 2:
TABLE 2 hyper-parameter configuration
The results of the experiments on the three baseline models and the model of the invention on the respective data sets are shown in table 3, with the best results for each data set shown in bold.
Table 3 experimental results for different data sets
As the results in Table 3 show, the LSF-CNER proposed herein outperforms the other methods on the Weibo and Resume datasets.
On the noisier, loosely formatted Weibo dataset, LSF-CNER achieves a better result, improving the F1 value by 0.71% over SoftLexicon.
On the Resume dataset, whose format is relatively fixed and which contains less noise, the F1 value of LSF-CNER is 0.11% higher than that of SoftLexicon.
On the MSRA dataset, which consists largely of formal text, the F1 value of LSF-CNER is 0.01% lower than that of SoftLexicon, indicating that the model has saturated on this dataset. The reason is that SoftLexicon already fits the large-scale MSRA and Resume datasets well, so adding feature information cannot obviously improve the model; on small-scale datasets, where feature information is lacking, syntactic and lexical information can be used to improve model performance.
Further, to verify the generality of the LSF-CNER model, the vocabulary information of the words is combined with BERT word vectors as the output of the input representation layer and fed to the sequence encoding layer. Table 4 shows the experimental results with BERT, reported as F1 values. The average F1 of the method proposed herein across the datasets is 4.68% higher than BERT-Tagger, 2.48% higher than BERT+BiLSTM+CRF, and 0.89% higher than SoftLexicon+BERT. The improvement is especially notable on the Weibo dataset.
TABLE 4 Experimental results (%)
The experimental results show that combining a pre-trained model further improves the input representation layer. The method herein outperforms traditional CNER methods on the Resume, Weibo, and MSRA datasets, verifying that fusing lexical information and syntactic information is effective.
To study the separate contributions of the dynamic word sets and the syntactic information to the entity recognition task, ablation experiments were performed on the Resume dataset with the model configured in the following five ways.
(1) BERT+BiLSTM+CRF: the initial model, without external vocabulary information or syntax information;
(2) SoftLexicon+BERT: includes external vocabulary information;
(3) SoftLexicon+Attention+BERT: introduces an attention mechanism on top of SoftLexicon to dynamically adjust the weights of different words in a word set;
(4) Syntactic+BERT: contains syntactic information but no external vocabulary information;
(5) LSF+BERT: contains syntactic information together with the improved word set information.
Table 5 ablation test results (%)
Model | P | R | F1 |
BERT+BiLSTM+CRF | 95.75 | 95.28 | 95.51 |
SoftLexicon+BERT | 96.08 | 96.13 | 96.11 |
SoftLexicon+Attention+BERT | 96.19 | 96.35 | 96.27 |
Syntactic+BERT | 95.86 | 96.08 | 95.97 |
LSF+BERT | 96.73 | 96.45 | 96.59 |
As shown in table 5, introducing the attention mechanism into the SoftLexicon+BERT model to adjust the word set vector weights raises the average F1 value by 0.16%; introducing syntax information into the BERT+BiLSTM+CRF model raises the average F1 value by 0.46%; and including both the syntax information and the static and dynamic word set information raises the average F1 value by 0.48% over SoftLexicon+BERT. These results show that the static and dynamic word set information and the syntax information each help improve model performance, and that using both together further improves Chinese named entity recognition. Dynamically adjusting the weights of words in a word set and introducing syntactic information help the model judge the importance of different words in a sentence, thereby improving the accuracy of Chinese named entity recognition.
In summary, the invention provides a Chinese named entity recognition method that fuses vocabulary and syntax information. The new word set matching method integrates static and dynamic word set information in the input representation layer, and a gating mechanism then dynamically weights the output of the sequence encoding layer against the syntax information. This considers both the potential boundaries of Chinese named entities and the latent syntactic information in the sentence, fusing the two kinds of information in a more balanced way and improving the model's performance. Experimental results on three CNER datasets show that the new method performs well compared with mainstream methods.
Referring to fig. 5, the present invention further includes a system for recognizing a named entity in chinese that fuses lexical and syntactic information, including:
an input representation obtaining module 501, configured to map an original input text into a word vector, introduce external vocabulary information using an improved word set matching algorithm, and integrate the external vocabulary information into an input representation of each word;
a context vector obtaining module 502, configured to extract context information using a bidirectional LSTM according to an input representation of a word;
a context information and syntax information fusion module 503, configured to obtain part-of-speech tags and syntactic components from the original input text using an NLP tool, construct a syntax vector using a key-value memory network, and perform weighted fusion of the context vector and the syntax vector through a gating mechanism to obtain a feature vector;
a sequence labeling module 504, configured to input the feature vectors into the CRF of the label prediction layer to implement Chinese named entity recognition.
The Chinese named entity recognition system fusing vocabulary and syntax information concretely implements the Chinese named entity recognition method fusing vocabulary and syntax information described above, and the description is not repeated in this embodiment.
The above description is only an embodiment of the present invention, but the design concept of the present invention is not limited thereto, and any insubstantial modification made using this design concept shall fall within the protection scope of the present invention.
Claims (6)
1. A Chinese named entity recognition method fusing vocabulary and syntax information is characterized by comprising the following steps:
step 1, mapping an original input text into a word vector, introducing external vocabulary information by using an improved word set matching algorithm, and integrating the external vocabulary information into the input representation of each word;
step 2, extracting context information by using a bidirectional LSTM according to the input representation of the character;
step 3, obtaining part-of-speech tags and syntactic components from the original input text using an NLP tool, constructing a syntax vector using a key-value memory network, and performing weighted fusion of the context vector and the syntax vector through a gating mechanism to obtain a feature vector;
and step 4, inputting the feature vector into the CRF (conditional random field) of the label prediction layer to realize Chinese named entity recognition.
2. The method for recognizing Chinese named entities fusing lexical and syntactic information according to claim 1, wherein said step 1 specifically comprises:
step 1.1, regarding the input text as a sentence, expressed as the sequence x = (x_1, x_2, ..., x_n), where x_i represents the i-th word in a sentence x of length n; to better utilize the lexical information, the result of matching each word against the dictionary is divided into the following four "BIES" word sets:
(1) word set B(x_i), containing all matched words in x that begin with x_i;
(2) word set I(x_i), containing all matched words in x in which x_i lies in the middle;
(3) word set E(x_i), containing all matched words in x that end with x_i;
(4) word set S(x_i), containing the single-character word consisting of x_i;
step 1.2, after the BIES word sets corresponding to each word are obtained, each word set is compressed into a vector of fixed dimension; the improved word set matching algorithm comprises a static word set algorithm and a dynamic word set algorithm; to ensure computational efficiency, the static word set algorithm represents the weight of a word by its frequency of occurrence, and the static word set vector of a single word set is calculated as:

v^s(T) = (1/Z) Σ_{w∈T} z(w) · e_w(w),  with Z = Σ_{w∈T} z(w)

where z(w) represents the number of occurrences of word w in the corpus; Z represents the total number of occurrences of the words in word set T; e_w(w) represents mapping word w into a word vector; T represents one of the four "BIES" word sets; and v^s(T) represents the vector representation of word set T corresponding to word x_i;

for better information retention, the four static word sets are represented as a whole and integrated by concatenation into a vector of fixed dimension:

τ_i = [v^s(B(x_i)); v^s(I(x_i)); v^s(E(x_i)); v^s(S(x_i))]

where τ_i represents the static word set vector corresponding to word x_i;

the dynamic word set algorithm uses an attention mechanism to measure the information between characters and matched words, calculating attention weights for the different matched words so as to enhance useful words and suppress words of little effect, as follows:

s_j = q^T · e_w(w_j),  a_j = exp(s_j) / Σ_{j'=1}^{m} exp(s_{j'}),  v^d(T) = Σ_{j=1}^{m} a_j · e_w(w_j)

where e_w(w_j) represents mapping word w_j into a word vector; q is a trainable vector of the same dimension as e_w(w_j); s_j is the word attention score obtained by the attention mechanism; a_j is the normalized word attention weight; v^d(T) represents the dynamic word set vector of a single word set; and m represents the number of words matched in word set T for word x_i;

the dynamic word set vector is obtained by this attention-weighted summation, and the four dynamic word sets are then represented as a whole and compressed into a vector of fixed dimension:

Aτ_i = [v^d(B(x_i)); v^d(I(x_i)); v^d(E(x_i)); v^d(S(x_i))]

where Aτ_i represents the dynamic word set vector corresponding to word x_i;
step 1.3, to fully consider the importance of each word in the two word sets, the dynamic word set vector Aτ_i and the static word set vector τ_i are combined by dynamic weighting; an evaluation function θ_i is used to measure the roles of the static word set vector τ_i and the dynamic word set vector Aτ_i in the entity recognition task:

θ_i = σ(W_θ1 · τ_i + W_θ2 · Aτ_i + b_θ)

where W_θ1 and W_θ2 are trainable matrices and b_θ is a bias term;

the word vector, the static word set vector τ_i, and the dynamic word set vector Aτ_i are then combined as the input representation that finally contains the external vocabulary information:

x̃_i = e_x ⊕ (θ_i ∘ τ_i + (1 − θ_i) ∘ Aτ_i)

where x̃_i represents the final vector representation of word x_i; 1 is an all-ones vector of dimension l matching τ_i; e_x represents the word vector into which word x_i is converted; ∘ denotes the dot product calculation; and ⊕ denotes vector concatenation.
3. The method for recognizing Chinese named entities fusing lexical and syntactic information according to claim 2, wherein said step 2 specifically comprises:
the sequence encoding layer adopts a bidirectional LSTM to obtain the context vector of each word, the bidirectional LSTM being a combination of a forward LSTM and a backward LSTM; h→_i denotes the hidden state of the forward LSTM at time i, and h←_i denotes the hidden state of the backward LSTM at time i; the final context vector h_i = [h→_i ; h←_i] is obtained by concatenating the corresponding forward and backward LSTM states.
4. The method for recognizing Chinese named entities fusing lexical and syntactic information according to claim 3, wherein said step 3 specifically comprises:
step 3.1, segmenting the original text into words using the Stanford CoreNLP tool, and extracting two kinds of syntactic information of the input text, namely part-of-speech tags and syntactic component information, using the Berkeley Neural Parser; the part-of-speech tags represent the tag information of individual words, and the syntactic components represent the structural grouping information of text spans;
step 3.2, for each x_i in the input sequence, mapping its context features and syntax information to keys and values in the key-value memory network KVMN, denoted K_i = [k_{i,1}, ..., k_{i,j}, ..., k_{i,m_i}] and V_i = [v_{i,1}, ..., v_{i,j}, ..., v_{i,m_i}] respectively;
Step 3.3, using two embedding matrices, k_{i,j} and v_{i,j} are mapped to the embeddings e^k_{i,j} and e^v_{i,j} respectively; for each x_i with associated context features K_i and syntax information V_i, the weight γ_{i,j} assigned to the syntax information is:

γ_{i,j} = exp(h_i^T · e^k_{i,j}) / Σ_{j'=1}^{m_i} exp(h_i^T · e^k_{i,j'})

where h_i is the hidden vector of x_i obtained from the sequence encoding layer, and h_i^T is the transpose of h_i;

the weight γ_{i,j} is applied to the corresponding syntax information v_{i,j} as follows:

α_i = Σ_{j=1}^{m_i} γ_{i,j} · e^v_{i,j}

where α_i is the weighted syntax information of the KVMN corresponding to x_i; the KVMN thus ensures that the syntax information is weighted according to its corresponding context features, so that the important information is distinguished and used;
step 3.4, to better utilize the syntax information encoded by the KVMN, the syntax vector α_i and the context vector h_i are combined by dynamic weighting, using an evaluation function λ_i to measure the contributions of the context vector h_i and the syntax vector α_i to the sentence:

λ_i = σ(W_λ1 · h_i + W_λ2 · α_i + b_λ)

where W_λ1 and W_λ2 are trainable matrices; b_λ is a bias term; and σ represents the sigmoid activation function;
the syntax vector α_i and the context vector h_i are then combined:

O_i = (λ_i ∘ h_i) ⊕ ((1 − λ_i) ∘ α_i)

where 1 is an all-ones vector of dimension l matching h_i, ∘ denotes the dot product calculation, ⊕ denotes vector concatenation, and the resulting O_i is the feature vector fusing the context information and the syntax information.
5. The method for recognizing Chinese named entities fusing lexical and syntactic information according to claim 4, wherein said step 4 specifically comprises:
for an input sequence x = (x_1, x_2, ..., x_n) and a given predicted sequence y = (y_1, y_2, y_3, ..., y_n), the score of the predicted sequence is calculated as follows:

score(x, y) = Σ_{i=0}^{n} M_{y_i, y_{i+1}} + Σ_{i=1}^{n} P_{i, y_i}

where M is the transition matrix and M_{y_i, y_{i+1}} denotes the transfer score from tag y_i to tag y_{i+1}; P_{i, y_i} is the probability score of tag y_i for the i-th word in the sentence; y_0 and y_{n+1} represent the start and end tags respectively; and n represents the length of the input sequence;

the probability p(y|x) that the predicted sequence y is generated is calculated using the softmax function as follows:

p(y | x) = exp(score(x, y)) / Σ_{ỹ ∈ Y_x} exp(score(x, ỹ))

where Y_x is the solution space, i.e. the set of all possible predicted sequences for y; ỹ is an instance of the whole-sequence tagging solution space, and score(x, ỹ) represents the score of that instance;

during training, the log-likelihood log(p(y|x)) of the correct predicted sequence is maximized as follows:

log p(y | x) = score(x, y) − log Σ_{ỹ ∈ Y_x} exp(score(x, ỹ))

after iterative training and back propagation, the obtained y* is the CRF sequence labeling result, i.e. the output of the final model:

y* = argmax_{ỹ ∈ Y_x} score(x, ỹ)
6. A Chinese named entity recognition system fusing lexical and syntactic information, comprising:
the input representation acquisition module is used for mapping an original input text into a word vector, introducing external vocabulary information by using an improved word set matching algorithm and integrating the external vocabulary information into the input representation of each word;
a context vector acquisition module for extracting context information using a bidirectional LSTM according to the input representation of the word;
the context information and syntax information fusion module is used for obtaining part-of-speech tags and syntactic components from the original input text using an NLP tool, constructing a syntax vector using a key-value memory network, and performing weighted fusion of the context vector and the syntax vector through a gating mechanism to obtain a feature vector;
and the sequence labeling module is used for inputting the feature vectors into the CRF (conditional random field) of the label prediction layer to realize Chinese named entity recognition.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210575509.3A CN114818717A (en) | 2022-05-25 | 2022-05-25 | Chinese named entity recognition method and system fusing vocabulary and syntax information |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210575509.3A CN114818717A (en) | 2022-05-25 | 2022-05-25 | Chinese named entity recognition method and system fusing vocabulary and syntax information |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114818717A true CN114818717A (en) | 2022-07-29 |
Family
ID=82517680
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210575509.3A Pending CN114818717A (en) | 2022-05-25 | 2022-05-25 | Chinese named entity recognition method and system fusing vocabulary and syntax information |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114818717A (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115270803A (en) * | 2022-09-30 | 2022-11-01 | 北京道达天际科技股份有限公司 | Entity extraction method based on BERT and fused with N-gram characteristics |
CN115774993A (en) * | 2022-12-29 | 2023-03-10 | 广东南方网络信息科技有限公司 | Conditional error identification method and device based on syntactic analysis |
CN117077672A (en) * | 2023-07-05 | 2023-11-17 | 哈尔滨理工大学 | Chinese naming entity recognition method based on vocabulary enhancement and TCN-BILSTM model |
CN117521639A (en) * | 2024-01-05 | 2024-02-06 | 湖南工商大学 | Text detection method combined with academic text structure |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109726389A (en) * | 2018-11-13 | 2019-05-07 | 北京邮电大学 | A kind of Chinese missing pronoun complementing method based on common sense and reasoning |
WO2021082366A1 (en) * | 2019-10-28 | 2021-05-06 | 南京师范大学 | Interactive and iterative learning-based intelligent construction method for geographical name tagging corpus |
CN112883732A (en) * | 2020-11-26 | 2021-06-01 | 中国电子科技网络信息安全有限公司 | Method and device for identifying Chinese fine-grained named entities based on associative memory network |
CN113095074A (en) * | 2021-03-22 | 2021-07-09 | 北京工业大学 | Word segmentation method and system for Chinese electronic medical record |
CN113609859A (en) * | 2021-08-04 | 2021-11-05 | 浙江工业大学 | Special equipment Chinese named entity recognition method based on pre-training model |
CN114528840A (en) * | 2022-01-21 | 2022-05-24 | 深圳大学 | Chinese entity identification method, terminal and storage medium fusing context information |
-
2022
- 2022-05-25 CN CN202210575509.3A patent/CN114818717A/en active Pending
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109726389A (en) * | 2018-11-13 | 2019-05-07 | 北京邮电大学 | A kind of Chinese missing pronoun complementing method based on common sense and reasoning |
WO2021082366A1 (en) * | 2019-10-28 | 2021-05-06 | 南京师范大学 | Interactive and iterative learning-based intelligent construction method for geographical name tagging corpus |
CN112883732A (en) * | 2020-11-26 | 2021-06-01 | 中国电子科技网络信息安全有限公司 | Method and device for identifying Chinese fine-grained named entities based on associative memory network |
CN113095074A (en) * | 2021-03-22 | 2021-07-09 | 北京工业大学 | Word segmentation method and system for Chinese electronic medical record |
CN113609859A (en) * | 2021-08-04 | 2021-11-05 | 浙江工业大学 | Special equipment Chinese named entity recognition method based on pre-training model |
CN114528840A (en) * | 2022-01-21 | 2022-05-24 | 深圳大学 | Chinese entity identification method, terminal and storage medium fusing context information |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115270803A (en) * | 2022-09-30 | 2022-11-01 | 北京道达天际科技股份有限公司 | Entity extraction method based on BERT and fused with N-gram characteristics |
CN115774993A (en) * | 2022-12-29 | 2023-03-10 | 广东南方网络信息科技有限公司 | Conditional error identification method and device based on syntactic analysis |
CN115774993B (en) * | 2022-12-29 | 2023-09-08 | 广东南方网络信息科技有限公司 | Condition type error identification method and device based on syntactic analysis |
CN117077672A (en) * | 2023-07-05 | 2023-11-17 | 哈尔滨理工大学 | Chinese naming entity recognition method based on vocabulary enhancement and TCN-BILSTM model |
CN117077672B (en) * | 2023-07-05 | 2024-04-26 | 哈尔滨理工大学 | Chinese naming entity recognition method based on vocabulary enhancement and TCN-BILSTM model |
CN117521639A (en) * | 2024-01-05 | 2024-02-06 | 湖南工商大学 | Text detection method combined with academic text structure |
CN117521639B (en) * | 2024-01-05 | 2024-04-02 | 湖南工商大学 | Text detection method combined with academic text structure |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110442760B (en) | Synonym mining method and device for question-answer retrieval system | |
CN111738003B (en) | Named entity recognition model training method, named entity recognition method and medium | |
CN111931506B (en) | Entity relationship extraction method based on graph information enhancement | |
CN106599032B (en) | Text event extraction method combining sparse coding and structure sensing machine | |
CN112183094B (en) | Chinese grammar debugging method and system based on multiple text features | |
CN111160031A (en) | Social media named entity identification method based on affix perception | |
CN114818717A (en) | Chinese named entity recognition method and system fusing vocabulary and syntax information | |
CN112231472B (en) | Judicial public opinion sensitive information identification method integrated with domain term dictionary | |
CN108874896B (en) | Humor identification method based on neural network and humor characteristics | |
CN113392209B (en) | Text clustering method based on artificial intelligence, related equipment and storage medium | |
US20230069935A1 (en) | Dialog system answering method based on sentence paraphrase recognition | |
CN115357719B (en) | Power audit text classification method and device based on improved BERT model | |
CN115048447B (en) | Database natural language interface system based on intelligent semantic completion | |
CN112069312B (en) | Text classification method based on entity recognition and electronic device | |
CN112818698B (en) | Fine-grained user comment sentiment analysis method based on dual-channel model | |
CN111581943A (en) | Chinese-over-bilingual multi-document news viewpoint sentence identification method based on sentence association graph | |
CN113919366A (en) | Semantic matching method and device for power transformer knowledge question answering | |
CN115146629A (en) | News text and comment correlation analysis method based on comparative learning | |
CN111984782A (en) | Method and system for generating text abstract of Tibetan language | |
CN114036246A (en) | Commodity map vectorization method and device, electronic equipment and storage medium | |
CN111581365B (en) | Predicate extraction method | |
CN113705207A (en) | Grammar error recognition method and device | |
CN112749566B (en) | Semantic matching method and device for English writing assistance | |
CN116910251A (en) | Text classification method, device, equipment and medium based on BERT model | |
CN116562291A (en) | Chinese nested named entity recognition method based on boundary detection |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination |