CN114818717A - Chinese named entity recognition method and system fusing vocabulary and syntax information

Info

Publication number
CN114818717A
CN114818717A (application CN202210575509.3A)
Authority
CN
China
Prior art keywords
word, vector, information, syntax, word set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210575509.3A
Other languages
Chinese (zh)
Inventor
李弼程
刘其龙
张敏
皮慧娟
王华珍
王成
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huaqiao University
Original Assignee
Huaqiao University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huaqiao University filed Critical Huaqiao University
Priority to CN202210575509.3A priority Critical patent/CN114818717A/en
Publication of CN114818717A publication Critical patent/CN114818717A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295Named entity recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/237Lexical tools
    • G06F40/242Dictionaries
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/253Grammatical analysis; Style critique
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/049Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a Chinese named entity recognition method and system fusing vocabulary and syntax information, comprising the following steps: step 1, mapping the original input text into character vectors, introducing external vocabulary information with an improved word set matching algorithm, and integrating it into the input representation of each character; step 2, extracting context information with a bidirectional LSTM from the input representation of each character; step 3, obtaining part-of-speech tags and syntactic constituents from the original input text with an NLP tool, constructing a syntax vector with a key-value memory network, and weighted-fusing the context vector and the syntax vector through a gating mechanism to obtain a feature vector; and step 4, feeding the feature vector into the conditional random field (CRF) of the label prediction layer to realize Chinese named entity recognition. The invention addresses the problem of insufficient entity boundary information in Chinese named entities and fuses the syntactic information of the input text.

Description

Chinese named entity recognition method and system fusing vocabulary and syntax information
Technical Field
The invention relates to the field of information extraction in natural language processing, and in particular to a Chinese named entity recognition method and system fusing vocabulary and syntax information.
Background
Named Entity Recognition (NER) aims at recognizing entities in text and classifying them into different categories, such as person names, place names, and organization names. NER is an important task in NLP and has been widely applied to relation extraction, question answering, machine translation, knowledge base construction, and other fields, so research and breakthroughs in NER are of great significance.
Chinese named entity recognition differs from English: in English, each word can express complete semantic information, whereas in Chinese a complete meaning is in most cases expressed only by a multi-character word or phrase, and Chinese lacks obvious cues such as word boundary markers and capitalization, which makes entity boundaries difficult to identify. Since word boundaries usually coincide with entity boundaries, word boundary information plays an important role in Chinese Named Entity Recognition (CNER).
The difficulty of recognizing word boundaries can be alleviated by introducing external features. Among these features, lexical information and syntactic information are especially meaningful and can help a CNER model locate the corresponding entities. Existing CNER models, however, rarely distinguish or filter such external features when using them, and noise in the features may harm model performance. Finding a suitable way to integrate external feature information into a CNER model therefore remains a challenge. In most cases it is desirable for a CNER model to incorporate a variety of additional features, so an effective mechanism is needed to weight and combine these features and limit the noisy information.
Meanwhile, the existing SoftLexicon word set matching method depends on static word frequency statistics over the data set, using word frequency to measure the effect of different words on the Chinese named entity recognition task. Since data sets differ in scale, word frequencies can be too low on small-scale data sets, and in some cases frequency does not properly reflect the importance of a word. A more reasonable method is therefore needed to weigh the words in a word set.
Disclosure of Invention
The invention mainly aims to overcome the defects of the prior art, and provides a Chinese named entity recognition method and system fusing lexical and syntactic information (abbreviated as the LSF-CNER model). Specifically, external vocabulary information and the syntactic information of the input text are fused by a gating unit, and an attention mechanism is introduced into the model, so as to construct a Chinese named entity recognition model that is expected to improve recognition accuracy. The problems solved by the invention are embodied in two aspects. On the one hand, for each character of the input text sequence, the improved word set matching algorithm splices the matched static word set vector, dynamic word set vector and initial character vector, thereby integrating external vocabulary information into the character vector and addressing the insufficient word boundary features of Chinese text. On the other hand, the syntactic information of the input text is extracted with an NLP tool and integrated, through a gating mechanism, with the context vectors extracted by a bidirectional LSTM, thereby enriching the feature vector representation and fusing deeper syntactic information.
The invention adopts the following technical scheme:
in one aspect, the Chinese named entity recognition method fusing vocabulary and syntax information comprises the following steps:
step 1, mapping the original input text into character vectors, introducing external vocabulary information with an improved word set matching algorithm, and integrating it into the input representation of each character;
step 2, extracting context information with a bidirectional LSTM from the input representation of each character;
step 3, obtaining part-of-speech tags and syntactic constituents from the original input text with an NLP tool, constructing a syntax vector with a key-value memory network, and weighted-fusing the context vector and the syntax vector through a gating mechanism to obtain a feature vector;
and step 4, feeding the feature vector into the conditional random field (CRF) of the label prediction layer to realize Chinese named entity recognition.
Preferably, the step 1 specifically includes:
step 1.1, regarding the input text as a sentence expressed as a sequence $x=(x_1,x_2,\ldots,x_n)$, where $x_i$ denotes the i-th character of sentence x of length n; to better utilize lexical information, the result of matching each character against the dictionary is divided into the following four "BIES" word sets:
(1) word set $B(x_i)$, containing all words in x that begin with $x_i$;
(2) word set $I(x_i)$, containing all words in x in which $x_i$ occurs in the middle;
(3) word set $E(x_i)$, containing all words in x that end with $x_i$;
(4) word set $S(x_i)$, containing the single-character word composed of $x_i$;
step 1.2, after the "BIES" word sets of each character are obtained, compressing each word set into a vector of fixed dimension; the improved word set matching algorithm comprises a static word set algorithm and a dynamic word set algorithm, wherein, to ensure computational efficiency, the static word set algorithm uses the frequency of word occurrence to represent the corresponding weight, and the static word set vector of a single word set is computed as:

$$v^{ws}(T)=\frac{1}{Z}\sum_{j=1}^{m} z(w_j^T)\,e^w(w_j^T)$$

where $z(w_j^T)$ denotes the number of occurrences of word $w_j^T$ in the corpus; $Z=\sum_{j=1}^{m}z(w_j^T)$ denotes the total number of occurrences of the words in the word set T; $e^w(w_j^T)$ maps word $w_j^T$ into a word vector; T denotes one of the four "BIES" word sets; and $v^{ws}(T)$ denotes the vector representation of the word set T corresponding to character $x_i$;
for better information retention, the four static word sets are represented as a whole and integrated into a fixed-dimensional vector by splicing:

$$\tau_i=[v^{ws}(B);v^{ws}(I);v^{ws}(E);v^{ws}(S)]$$

where $\tau_i$ denotes the static word set vector corresponding to character $x_i$;
the dynamic word set algorithm uses an attention mechanism to measure the information between characters and matched words, computing the attention weights of the different matched words so as to strengthen useful words and suppress words whose effect is not obvious, as follows:

$$u_j=q\cdot e^w(w_j^T)$$

$$a_j=\frac{\exp(u_j)}{\sum_{k=1}^{m}\exp(u_k)}$$

$$Av^{ws}(T)=\sum_{j=1}^{m}a_j\,e^w(w_j^T)$$

where $e^w(w_j^T)$ maps word $w_j^T$ into a word vector; q is a trainable vector with the same dimension as $e^w(w_j^T)$; $u_j$ is the word attention score obtained by the attention mechanism; $a_j$ is the normalized word attention weight; $Av^{ws}(T)$ denotes the dynamic word set vector of a single word set; and m denotes the number of words matched in the word set T corresponding to character $x_i$;
the dynamic word set vector is obtained by weighted summation over the attention weights, and the four dynamic word sets are represented as a whole and compressed into a fixed-dimensional vector:

$$A\tau_i=[Av^{ws}(B);Av^{ws}(I);Av^{ws}(E);Av^{ws}(S)]$$

where $A\tau_i$ denotes the dynamic word set vector corresponding to character $x_i$;
step 1.3, to fully consider the importance of each word in the two word sets, combining the dynamic word set vector $A\tau_i$ and the static word set vector $\tau_i$ by dynamic weighting, using an evaluation function $\theta_i$ to measure the contributions of the static word set vector $\tau_i$ and the dynamic word set vector $A\tau_i$ to the entity recognition task:

$$\theta_i=\sigma(W_{\theta 1}\tau_i+W_{\theta 2}A\tau_i+b_\theta)$$

where $W_{\theta 1}$, $W_{\theta 2}$ are trainable matrices and $b_\theta$ is a bias term;
combining the character vector with the static word set vector $\tau_i$ and the dynamic word set vector $A\tau_i$ as the input representation finally containing the external vocabulary information:

$$x_i^{final}=e^x(x_i)\oplus\big(\theta_i\odot\tau_i+(1_l-\theta_i)\odot A\tau_i\big)$$

where $x_i^{final}$ denotes the final vector representation of character $x_i$; $1_l$ is an all-ones vector whose dimension l matches $\theta_i$; $e^x$ converts character $x_i$ into the corresponding character vector; $\odot$ denotes element-wise (dot) product; and $\oplus$ denotes vector splicing.
preferably, the step 2 specifically includes:
the sequence encoding layer adopts a bidirectional LSTM to obtain the context vector of each character, where the bidirectional LSTM is the combination of a forward LSTM and a backward LSTM; $\overrightarrow{h_i}$ denotes the hidden state of the forward LSTM at time i, and $\overleftarrow{h_i}$ denotes the hidden state of the backward LSTM at time i; the final context vector $h_i=[\overrightarrow{h_i};\overleftarrow{h_i}]$ is obtained by concatenating the corresponding forward and backward LSTM states.
Preferably, the step 3 specifically includes:
step 3.1, segmenting words of an original text by using a Stanford CoreNLP tool, and extracting two syntactic information, namely part-of-speech labels and syntactic component information of an input text by using a Berkely Neural Parse tool; wherein, the part-of-speech tag represents the tag information of a single word, and the syntactic component represents the structure grouping information of the text span;
step 3.2, for each x in the input sequence i Mapping its context characteristics and syntax information to keys and values in the key-value memory network KVMN, respectively represented as
Figure BDA0003662008680000044
And
Figure BDA0003662008680000045
step 3.3, k is embedded using two embedding matrices i,j And v i,j Respectively map to
Figure BDA0003662008680000046
And
Figure BDA0003662008680000047
for each x i Associated context feature K i And syntax information V i Weight γ assigned to syntax information i,j Comprises the following steps:
Figure BDA0003662008680000048
wherein h is i Is x i A concealment vector obtained from a sequence coding layer;
Figure BDA0003662008680000049
is h i Transposed form of (1);
weight gamma i,j Applied to corresponding syntax information v i,j Above, as follows:
Figure BDA00036620086800000410
wherein alpha is i For KVMN model correspondence x i Weighted syntax information of (1); therefore, the KVMN can ensure that the grammar information is weighted according to the corresponding context characteristics, so that the important information is distinguished and used;
step 3.4, sentence-wise Normal vector α for better utilization of syntax information encoded by KVMN i And a context vector h i Dynamic weighted combination using an evaluation function lambda i To measure the context vector h i And syntax vector alpha i Contribution to the sentence:
λ i =σ(W λ1 .h i +W λ2i +b λ )
wherein, W λ1 And W λ2 Is a trainable matrix; b is a mixture of λ Is a bias term; sigma represents a sigmoid activation function;
and then the syntactic vector alpha is expressed i And a context vector h i Combining together:
Figure BDA00036620086800000411
where l is the vector dimension and h i Matched 1 vector, and the resulting O i Namely, the feature vector fusing the context information and the syntax information.
Preferably, the step 4 specifically includes:
for an input sequence $x=(x_1,x_2,\ldots,x_n)$ and a given predicted sequence $y=(y_1,y_2,\ldots,y_n)$, the score of the predicted sequence is calculated as follows:

$$s(x,y)=\sum_{i=0}^{n}M_{y_i,y_{i+1}}+\sum_{i=1}^{n}P_{i,y_i}$$

where M is the transition matrix and $M_{y_i,y_{i+1}}$ denotes the transition score from tag $y_i$ to tag $y_{i+1}$; $P_{i,y_i}$ is the probability score of tag $y_i$ for the i-th character of the sentence; $y_0$ and $y_{n+1}$ denote the start and end tags, respectively; and n denotes the length of the input sequence;
the probability $p(y\mid x)$ of generating the predicted sequence y is calculated with the softmax function as follows:

$$p(y\mid x)=\frac{\exp(s(x,y))}{\sum_{\tilde{y}\in Y_x}\exp(s(x,\tilde{y}))}$$

where $Y_x$ is the solution space, i.e. the set of all possible predicted sequences; $\tilde{y}$ is one instance of the sequence-tagging solution space, and $s(x,\tilde{y})$ denotes the score of that instance;
during training, the log-likelihood $\log(p(y\mid x))$ of the correct predicted sequence is maximized, as follows:

$$\log(p(y\mid x))=s(x,y)-\log\Big(\sum_{\tilde{y}\in Y_x}\exp(s(x,\tilde{y}))\Big)$$

after continuous iterative training and back propagation, the CRF sequence labeling result $y^*$, i.e. the output of the final model, is obtained by maximizing the objective function:

$$y^*=\arg\max_{\tilde{y}\in Y_x}s(x,\tilde{y})$$
In another aspect, a system for Chinese named entity recognition fusing lexical and syntactic information comprises:
an input representation acquisition module, for mapping the original input text into character vectors, introducing external vocabulary information with an improved word set matching algorithm, and integrating it into the input representation of each character;
a context vector acquisition module, for extracting context information with a bidirectional LSTM from the input representation of each character;
a context information and syntax information fusion module, for obtaining part-of-speech tags and syntactic constituents from the original input text with an NLP tool, constructing a syntax vector with a key-value memory network, and weighted-fusing the context vector and the syntax vector through a gating mechanism to obtain a feature vector;
and a sequence labeling module, for feeding the feature vector into the conditional random field (CRF) of the label prediction layer to realize Chinese named entity recognition.
As can be seen from the above description of the present invention, compared with the prior art, the present invention has the following advantages:
the invention introduces external vocabulary information in the word vector through an improved word set matching algorithm, and integrates the context vector and the syntactic vector in a syntactic information layer so as to solve the problem of insufficient entity boundary information in the Chinese named entity and fuse the syntactic information of the input text.
Drawings
FIG. 1 is a flow chart of a method for Chinese named entity recognition incorporating lexical and syntactic information in accordance with an embodiment of the present invention;
FIG. 2 is a model diagram of a method for identifying a named entity in Chinese incorporating lexical and syntactic information according to an embodiment of the present invention;
FIG. 3 is a diagram illustrating a dictionary matching method corresponding to Chinese linguistics according to an embodiment of the present invention;
FIG. 4 is a diagram of syntax information acquisition according to an embodiment of the present invention;
FIG. 5 is a block diagram of a system for Chinese named entity recognition incorporating lexical and syntactic information according to an embodiment of the present invention.
Detailed Description
The invention will be further illustrated with reference to the following specific examples. It should be understood that these examples are for illustrative purposes only and are not intended to limit the scope of the present invention. Further, it should be understood that various changes or modifications of the present invention may be made by those skilled in the art after reading the teaching of the present invention, and such equivalents may fall within the scope of the present invention as defined in the appended claims.
Referring to fig. 1, the method for recognizing Chinese named entities fusing vocabulary and syntax information according to the present invention comprises:
step 1, mapping the original input text into character vectors, introducing external vocabulary information with an improved word set matching algorithm, and integrating it into the input representation of each character;
step 2, extracting context information with a bidirectional LSTM from the input representation of each character;
step 3, obtaining part-of-speech tags and syntactic constituents from the original input text with an NLP tool, constructing a syntax vector with a key-value memory network, and weighted-fusing the context vector and the syntax vector through a gating mechanism to obtain a feature vector;
and step 4, feeding the feature vector into the conditional random field (CRF) of the label prediction layer to realize Chinese named entity recognition.
Specifically, a model diagram of the method of the present invention is shown in fig. 2. The model is divided into four parts: the input representation layer, the sequence encoding layer, the syntax information layer, and the label prediction layer.
The invention relates to a Chinese named entity recognition method fusing vocabulary and syntax information, which comprises the following specific implementation steps:
(1) acquisition of an input representation containing external vocabulary information
The invention provides a word set matching method that introduces an attention mechanism on the basis of SoftLexicon, combines the matched static and dynamic word set information, and then fuses the static and dynamic word set information with a gating mechanism.
1.1) Static word set algorithm
In the Chinese NER model, a sentence of the original input text is treated as a sequence $x=(x_1,x_2,\ldots,x_n)$, where $x_i$ denotes the i-th character of sentence x of length n. To better utilize lexical information, the result of matching each character against the dictionary is divided into four word sets:
a. word set $B(x_i)$ contains all words in x that begin with $x_i$;
b. word set $I(x_i)$ contains all words in x in which $x_i$ occurs in the middle;
c. word set $E(x_i)$ contains all words in x that end with $x_i$;
d. word set $S(x_i)$ contains the single-character word composed of $x_i$.
These four word sets are collectively referred to as "BIES" and are respectively defined as:

$$B(x_i)=\{w_{i,k}\mid w_{i,k}\in L,\ i<k\le n\}$$

$$I(x_i)=\{w_{j,k}\mid w_{j,k}\in L,\ 1\le j<i<k\le n\}$$

$$E(x_i)=\{w_{j,i}\mid w_{j,i}\in L,\ 1\le j<i\}$$

$$S(x_i)=\{x_i\mid x_i\in L\}$$

where L denotes the external dictionary, and $w_{j,k}$ denotes the word occurring in the dictionary that begins with $x_j$ and ends with $x_k$ in the sequence. When the input of the input layer is "Chinese linguistics", the corresponding dictionary matching is as shown in fig. 3.
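The sketch below illustrates this BIES matching against an external dictionary; it is a minimal assumption-level example, and the function and variable names (match_bies, lexicon, max_word_len) are illustrative, not taken from the patent.

```python
# Sketch of "BIES" word-set matching: every substring of the sentence
# found in the external dictionary L is assigned to the B/I/E/S sets of
# the characters it covers, following the definitions above.

def match_bies(sentence, lexicon, max_word_len=8):
    """Return per-character B/I/E/S word sets for `sentence`."""
    n = len(sentence)
    sets = [{"B": set(), "I": set(), "E": set(), "S": set()} for _ in range(n)]
    for j in range(n):
        for k in range(j, min(n, j + max_word_len)):
            word = sentence[j:k + 1]
            if word not in lexicon:
                continue
            if j == k:                      # single-character word
                sets[j]["S"].add(word)
            else:
                sets[j]["B"].add(word)      # word begins at position j
                sets[k]["E"].add(word)      # word ends at position k
                for m in range(j + 1, k):   # interior characters
                    sets[m]["I"].add(word)
    return sets

# Toy example with an assumed dictionary:
lexicon = {"中国", "中国语言学", "语言", "语言学"}
print(match_bies("中国语言学", lexicon))
```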
After the "BIES" word sets of each character are obtained, each word set is compressed into a fixed-dimensional vector. The static word set algorithm is superior in processing speed to the dynamic word set algorithm that uses the attention mechanism; to ensure computational efficiency, the frequency of word occurrence is used to represent the corresponding weight. Since word frequency is a static value, a high frequency can reflect the importance of a word, which speeds up the calculation of the word weights. For character $x_i$, the corresponding word set T is

$$T=\{w_1^T,w_2^T,\ldots,w_m^T\}$$

where $w_j^T$ denotes the j-th word in the word set T of length m corresponding to character $x_i$. Following this idea, the weights are updated to obtain the static word set vector:

$$v^{ws}(T)=\frac{1}{Z}\sum_{j=1}^{m} z(w_j^T)\,e^w(w_j^T)$$

where $z(w_j^T)$ denotes the number of occurrences of word $w_j^T$ in the corpus; $Z=\sum_{j=1}^{m}z(w_j^T)$ denotes the total number of occurrences of the words in word set T; $e^w(w_j^T)$ maps word $w_j^T$ into a word vector; T denotes one of the four "BIES" word sets; the superscript ws denotes a word set; and $v^{ws}(T)$ represents the vector of the word set T corresponding to character $x_i$. For better information retention, the four static word sets are represented as a whole and compressed into a fixed-dimensional vector:

$$\tau_i=[v^{ws}(B);v^{ws}(I);v^{ws}(E);v^{ws}(S)]$$
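A minimal sketch of the frequency-weighted static word set vector follows; the lookup tables `emb` (word to vector) and `freq` (word to corpus count) and the function names are assumptions for illustration, not structures defined by the patent.

```python
import torch

# v_ws(T) = (1/Z) * sum_j z(w_j) * e(w_j): a frequency-weighted sum of
# the matched words' embeddings, with Z the total count within the set.

def static_set_vector(words, emb, freq, dim):
    if not words:
        return torch.zeros(dim)                    # empty set -> zero vector
    z = torch.tensor([float(freq.get(w, 1)) for w in words])
    E = torch.stack([emb[w] for w in words])       # (m, dim) word vectors
    return (z / z.sum()) @ E                       # frequency-weighted sum

def static_tau(bies, emb, freq, dim):
    # tau_i: concatenation of the four static set vectors for character x_i
    return torch.cat([static_set_vector(sorted(bies[t]), emb, freq, dim)
                      for t in ("B", "I", "E", "S")])
```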
1.2) Dynamic word set algorithm
The attention mechanism can assign different weights to different words: greater weights to the few key words and smaller weights to irrelevant words. The dynamic word set algorithm uses the attention mechanism to measure the information between characters and matched words, computes the importance of the different matched words, strengthens useful words, and suppresses words whose effect is not obvious. For character $x_i$, the corresponding word set T is

$$T=\{w_1^T,w_2^T,\ldots,w_m^T\}$$

where $w_j^T$ denotes the j-th word in the word set T of length m corresponding to character $x_i$. Taking the word vectors $e^w(w_j^T)$ as input, a three-layer neural network is adopted to obtain word importance scores of dimension 1×m with values in (0, 1):

$$u_j=q\cdot e^w(w_j^T)$$

$$a_j=\frac{\exp(u_j)}{\sum_{k=1}^{m}\exp(u_k)}$$

where q is a trainable vector with the same dimension as $e^w(w_j^T)$, $u_j$ is the word importance score obtained through the attention mechanism, and $a_j$ is the normalized word importance score. The dynamic word set vector is then obtained from the word importance scores, and the four dynamic word sets are represented as a whole and compressed into a fixed-dimensional vector:

$$Av^{ws}(T)=\sum_{j=1}^{m}a_j\,e^w(w_j^T)$$

$$A\tau_i=[Av^{ws}(B);Av^{ws}(I);Av^{ws}(E);Av^{ws}(S)]$$

where $Av^{ws}(T)$ denotes the dynamic word set vector of character $x_i$ for the word set T, $A\tau_i$ denotes the dynamic word set vector corresponding to character $x_i$, and m is the number of words matched in the word set T corresponding to character $x_i$.
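The attention-weighted dynamic set vector can be sketched as follows; the names (`dynamic_set_vector`, `E`, `q`) and dimensions are illustrative assumptions.

```python
import torch

# Attention over matched words: a trainable query q scores each word
# embedding, softmax normalizes the scores, and the embeddings are
# summed with those weights to form the dynamic word-set vector.

def dynamic_set_vector(E, q):
    """E: (m, dim) embeddings of matched words; q: (dim,) trainable query."""
    u = E @ q                          # attention scores u_j = q . e(w_j)
    a = torch.softmax(u, dim=0)        # normalized attention weights a_j
    return a @ E                       # weighted sum -> dynamic set vector

m, dim = 4, 50
E = torch.randn(m, dim)
q = torch.nn.Parameter(torch.randn(dim))
v = dynamic_set_vector(E, q)           # (dim,) vector for one word set
```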
1.3) Gating mechanism
To fully consider the importance of each word in each word set, the dynamic word set vector $A\tau_i$ and the static word set vector $\tau_i$ are combined by dynamic weighting. An evaluation function $\theta_i$ measures the contributions of the static word set vector $\tau_i$ and the dynamic word set vector $A\tau_i$ to the entity recognition task:

$$\theta_i=\sigma(W_{\theta 1}\tau_i+W_{\theta 2}A\tau_i+b_\theta)$$

where $W_{\theta 1}$, $W_{\theta 2}$ are trainable matrices and $b_\theta$ is the bias term. The static word set vector $\tau_i$ and the dynamic word set vector $A\tau_i$ are then combined:

$$x_i^{final}=e^x(x_i)\oplus\big(\theta_i\odot\tau_i+(1_l-\theta_i)\odot A\tau_i\big)$$

where $e^x(x_i)$ denotes the character vector of $x_i$, $x_i^{final}$ denotes the final representation of character $x_i$, $1_l$ is an all-ones vector whose dimension l matches $\theta_i$, $\odot$ denotes element-wise (dot) product, and $\oplus$ denotes vector splicing. In this way, the pre-trained word vectors are embedded into the character representation, and the external dictionary information can be used reasonably.
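A minimal sketch of this gate, under the formulation above; the module and attribute names are assumptions for illustration.

```python
import torch
import torch.nn as nn

# Gate theta = sigmoid(W1*tau + W2*Atau + b) mixes the static and dynamic
# word-set vectors, and the mix is concatenated with the character
# embedding to form the final input representation.

class WordSetGate(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.w1 = nn.Linear(dim, dim, bias=False)   # W_theta1
        self.w2 = nn.Linear(dim, dim, bias=True)    # W_theta2 (+ b_theta)

    def forward(self, char_emb, tau, a_tau):
        theta = torch.sigmoid(self.w1(tau) + self.w2(a_tau))   # gate in (0,1)
        mixed = theta * tau + (1.0 - theta) * a_tau            # weighted mix
        return torch.cat([char_emb, mixed], dim=-1)            # final input
```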
(2) Acquisition of context vectors
Long short-term memory networks (LSTM) are a variant of recurrent neural networks and are widely used in NLP tasks such as NER, text classification and sentiment analysis. The LSTM introduces a cell state, maintains and controls information through an input gate, a forget gate and an output gate, and effectively overcomes the gradient explosion and gradient vanishing caused by the long-distance dependencies of RNN models. The mathematical expression of the LSTM model is as follows:

$$i_t=\sigma(W_i[h_{t-1};\tau_t]+b_i)\quad(15)$$

$$f_t=\sigma(W_f[h_{t-1};\tau_t]+b_f)\quad(16)$$

$$o_t=\sigma(W_o[h_{t-1};\tau_t]+b_o)\quad(17)$$

$$\tilde{c}_t=\tanh(W_c[h_{t-1};\tau_t]+b_c)\quad(18)$$

$$c_t=f_t\odot c_{t-1}+i_t\odot\tilde{c}_t\quad(19)$$

$$h_t=o_t\odot\tanh(c_t)\quad(20)$$

where $\sigma$ denotes the sigmoid activation function and tanh denotes the hyperbolic tangent; $\tau_t$ denotes the unit input; $i_t$, $f_t$, $o_t$ denote the input gate, forget gate and output gate, respectively; $W_i$, $W_f$, $W_o$ denote the weights of the input, forget and output gates; $b_i$, $b_f$, $b_o$ denote their offsets; $\tilde{c}_t$ denotes the current input state; and $c_t$ denotes the output of the current state.
To use the context information of characters in both directions simultaneously, the model uses a bidirectional LSTM, the combination of a forward LSTM and a backward LSTM, to obtain the context vector of each character. $\overrightarrow{h_i}$ denotes the hidden state of the forward LSTM at time i, and $\overleftarrow{h_i}$ denotes the hidden state of the backward LSTM at time i. The final context vector $h_i=[\overrightarrow{h_i};\overleftarrow{h_i}]$ is obtained by concatenating the corresponding forward and backward LSTM states.
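A minimal bidirectional-LSTM encoder sketch follows; the dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Bidirectional LSTM: forward and backward hidden states are produced
# jointly and concatenated per position, matching h_i = [h_fwd ; h_bwd].

encoder = nn.LSTM(input_size=100, hidden_size=128,
                  batch_first=True, bidirectional=True)

x = torch.randn(2, 20, 100)     # (batch, sentence length, input dim)
h, _ = encoder(x)               # h: (2, 20, 256) = fwd (128) ++ bwd (128)
```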
(3) Fusion of contextual information and syntactic information
Context information is output by the bidirectional LSTM of the sequence encoding layer; syntax information is first extracted from the input sequence with an NLP tool and then mapped with a key-value memory network; finally, the two kinds of information are integrated with a gating mechanism.
3.1) Syntax information acquisition
The invention uses the Stanford CoreNLP tool to segment the text, and then extracts two kinds of syntactic information, namely part-of-speech tags and syntactic constituents, with the Berkeley Neural Parser. The part-of-speech tags represent the tag information of individual words, and the syntactic constituents represent the structural grouping information of text spans. Taking an example sentence containing the words "liberation", "avenue", "road surface" and "water", fig. 4 shows its part-of-speech tags and syntactic constituents. For each $x_i$ in the input sequence $x=(x_1,x_2,\ldots,x_n)$, the context features and syntax information are extracted as follows.
3.1.1) Part-of-speech tags: each $x_i$ is taken as the core word, and a window of ±1 words is used to extract the context words on both sides of $x_i$ together with their part-of-speech tags. As shown in fig. 4(a), the context features obtained for the word "avenue" are (liberation, avenue, road surface), and the combination of these words with the corresponding part-of-speech tags is used as the part-of-speech information of the NER task, i.e. the corresponding syntax information is (liberation_NN, avenue_NN, road surface_NN).
3.1.2) Syntactic constituents: starting from the leaf node of $x_i$ in the syntax tree, search upward along the tree for the first syntactic node, then take all words under that node as the context features, and take the combination of these words with the corresponding syntactic labels as the syntactic constituent information of the NER task. As shown in fig. 4(b), for the word "avenue", searching upward for a node, the syntactic node "NP" contains the two words "liberation" and "avenue". The context features are therefore (liberation, avenue), combined with the "NP" tag as the syntactic constituent information of the NER task, i.e. the corresponding syntax information is (liberation_NP, avenue_NP).
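The patent names Stanford CoreNLP and the Berkeley Neural Parser for this step. As an illustration only, the sketch below uses the Stanford `stanza` library, a related but different tool, to obtain Chinese segmentation and part-of-speech tags; both the library choice and the example sentence are assumptions, not the patent's configuration.

```python
# Illustration with `stanza` (a swapped-in tool, not the CoreNLP/Berkeley
# pipeline the patent names). Run stanza.download("zh") once beforehand
# to fetch the Chinese models.
import stanza

nlp = stanza.Pipeline(lang="zh", processors="tokenize,pos")
doc = nlp("中国语言学")               # example input from the description
for word in doc.sentences[0].words:
    print(word.text, word.xpos)       # segmented word and its POS tag
```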
3.2) KVMN construction
Since the syntax information extracted by the tools contains a certain amount of noise, the performance of the model may suffer if this noise is not handled properly. Variants of the KVMN have proven effective in incorporating context features and provide an appropriate way to exploit context features and their corresponding dependency types.
When constructing the KVMN, the analysis result corresponding to the input sequence $x=(x_1,x_2,\ldots,x_n)$ is obtained first. For each $x_i$ in the input sequence, its context features and syntax information are mapped to the keys and values of the KVMN, denoted $K_i=[k_{i,1},\ldots,k_{i,j},\ldots,k_{i,m_i}]$ and $V_i=[v_{i,1},\ldots,v_{i,j},\ldots,v_{i,m_i}]$ respectively, where $m_i$ denotes the number of entries for $x_i$, and K and V denote the context features and the syntax information. Next, two embedding matrices map $k_{i,j}$ and $v_{i,j}$ to $e^k_{i,j}$ and $e^v_{i,j}$, respectively. For the context features $K_i$ and syntax information $V_i$ associated with each $x_i$, the weight assigned to the syntax information is:

$$\gamma_{i,j}=\frac{\exp(h_i^{\top}e^k_{i,j})}{\sum_{j=1}^{m_i}\exp(h_i^{\top}e^k_{i,j})}\quad(21)$$

where $h_i$ is the hidden vector of $x_i$ obtained from the sequence encoding layer. The weight $\gamma_{i,j}$ is applied to the corresponding syntax information $v_{i,j}$:

$$\alpha_i=\sum_{j=1}^{m_i}\gamma_{i,j}\,e^v_{i,j}\quad(22)$$

where $\alpha_i$ is the weighted syntax information of the KVMN corresponding to $x_i$. The KVMN thus ensures that the syntax information is weighted according to its corresponding context features, so that the important information is distinguished and used.
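The key-value weighting can be sketched as follows; the function name and dimensions are illustrative assumptions.

```python
import torch

# Key-value attention: key embeddings (context features) are scored
# against the hidden state h_i, softmax gives gamma_ij, and the value
# embeddings (syntax information) are summed with those weights.

def kvmn_attend(h_i, keys, values):
    """h_i: (d,); keys, values: (m, d) embedded context features / syntax."""
    gamma = torch.softmax(keys @ h_i, dim=0)   # gamma_ij over the m entries
    return gamma @ values                      # alpha_i: weighted syntax

d, m = 256, 3
alpha = kvmn_attend(torch.randn(d), torch.randn(m, d), torch.randn(m, d))
```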
3.3) Gating mechanism
To make better use of the syntax information encoded by the KVMN, the syntax vector $\alpha_i$ and the context vector $h_i$ are combined by dynamic weighting, using an evaluation function $\lambda_i$ to measure the contributions of the context vector $h_i$ and the syntax vector $\alpha_i$ to the sentence:

$$\lambda_i=\sigma(W_{\lambda 1}h_i+W_{\lambda 2}\alpha_i+b_\lambda)\quad(23)$$

where $W_{\lambda 1}$, $W_{\lambda 2}$ are trainable matrices and $b_\lambda$ is the bias term. The syntax vector $\alpha_i$ and the context vector $h_i$ are then combined:

$$O_i=\lambda_i\odot h_i+(1_l-\lambda_i)\odot\alpha_i\quad(24)$$

where $O_i$ is the feature vector of the KVMN corresponding to $x_i$, $1_l$ is an all-ones vector whose dimension l matches $h_i$, and $\odot$ denotes element-wise (dot) product; $O_i$ fuses the context information with the syntax information.
(4) Sequence annotation using CRF
Compared with the HMM, the CRF imposes no strict independence assumptions, can effectively use the sequence and external observation information, and avoids the label bias produced by directly assuming the labels. The CRF can also capture dependencies between labels: for example, the "I-ORG" tag cannot follow "B-LOC". In CNER, the input of the CRF is the context feature vector $O_i$ learned by the syntax information layer. For an input sequence $x=(x_1,x_2,\ldots,x_n)$, let $P_{i,j}$ denote the probability score of the j-th tag for the i-th character of the sentence. For a predicted sequence $y=(y_1,y_2,\ldots,y_n)$, the score of the predicted sequence is calculated as:

$$s(x,y)=\sum_{i=0}^{n}M_{y_i,y_{i+1}}+\sum_{i=1}^{n}P_{i,y_i}\quad(25)$$

where M is the transition matrix, $M_{i,j}$ denotes the transition score from tag i to tag j, and $y_0$ and $y_{n+1}$ denote the start and end tags, respectively. The probability that the predicted sequence y is generated is calculated with the softmax function:

$$p(y\mid x)=\frac{\exp(s(x,y))}{\sum_{\tilde{y}\in Y_x}\exp(s(x,\tilde{y}))}\quad(26)$$

where $Y_x$ denotes the set of all possible predicted sequences. During training, the log-likelihood of the correct predicted sequence is maximized:

$$\log(p(y\mid x))=s(x,y)-\log\Big(\sum_{\tilde{y}\in Y_x}\exp(s(x,\tilde{y}))\Big)\quad(27)$$

After training, the labeling result is the highest-scoring predicted sequence:

$$y^*=\arg\max_{\tilde{y}\in Y_x}s(x,\tilde{y})\quad(28)$$
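A hand-rolled sketch of the linear-chain CRF quantities above follows: the sequence score sums transition and emission terms, and the log of the softmax denominator is computed with the forward algorithm. START/END tag handling is simplified, and all names are illustrative assumptions.

```python
import torch

def sequence_score(P, y, M):
    """P: (n, k) emission scores; y: (n,) tag ids; M: (k, k) transitions."""
    s = P[torch.arange(len(y)), y].sum()                 # emission part
    s += M[y[:-1], y[1:]].sum()                          # transition part
    return s

def log_partition(P, M):
    """log sum over y' of exp(s(x, y')) via the forward algorithm."""
    alpha = P[0]                                         # (k,) at position 0
    for t in range(1, P.size(0)):
        alpha = torch.logsumexp(alpha.unsqueeze(1) + M, dim=0) + P[t]
    return torch.logsumexp(alpha, dim=0)

n, k = 5, 4
P, M = torch.randn(n, k), torch.randn(k, k)
y = torch.randint(0, k, (n,))
loss = log_partition(P, M) - sequence_score(P, y, M)     # negative log-lik.
```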
(5) Evaluation of effects
To verify the effectiveness of the method of the present invention and compare it with other models, three Chinese named entity data sets were used for the experiments: the Weibo data set, the Resume data set, and the MSRA data set. The experimental indexes are precision, recall, and F1 value.
The Weibo data set comes from social-network microblogs and comprises four entity types: person name, place name, organization name, and geopolitical entity. The MSRA data set comes from the news domain, includes only a training set and a test set, and comprises three entity types: person name, place name, and organization. The Resume data set consists of resume information from Sina Finance and contains eight entity types, including country, educational institution, place name, person name, organization name, profession, ethnic background, and job title. The details of each data set are shown in Table 1.
Table 1 Data set details
To verify the effectiveness of the method, the invention compares the following three models as baselines:
(1) Lattice-LSTM: introduces a word-cell structure and fuses the information of all words ending with the current character;
(2) FLAT: improves on the Transformer structure, designs an ingenious position encoding to fuse the lattice structure, and improves absolute position encoding to make it more suitable for the NER task;
(3) SoftLexicon: simply utilizes the lexicon at the input representation layer and has high portability.
The evaluation criteria are as follows: the experiments use precision (Precision), recall (Recall) and the F1 value (F1 score). Precision is the proportion of recognized entity words that are correct; Recall is the proportion of all entities in the data set that are correctly recognized. Since precision and recall are often inversely related, F1 is their harmonic mean. The evaluation functions are:

$$Precision=\frac{TP}{TP+FP}$$

$$Recall=\frac{TP}{TP+FN}$$

$$F1=\frac{2\times Precision\times Recall}{Precision+Recall}$$

where TP, FP and FN denote true positives, false positives and false negatives, respectively.
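Entity-level computation of these metrics can be sketched as follows; the entity representation (type, start, end) and the function name are illustrative assumptions.

```python
# Precision/recall/F1 over predicted and gold entity sets, following the
# formulas above; entities are (type, start, end) triples.

def prf1(pred_entities, gold_entities):
    tp = len(pred_entities & gold_entities)
    p = tp / len(pred_entities) if pred_entities else 0.0
    r = tp / len(gold_entities) if gold_entities else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

gold = {("ORG", 0, 3), ("PER", 5, 7)}
pred = {("ORG", 0, 3), ("LOC", 9, 11)}
print(prf1(pred, gold))   # (0.5, 0.5, 0.5)
```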
the hyper-parameters used by the model are shown in table 2:
TABLE 2 hyper-parameter configuration
Figure BDA0003662008680000125
The experimental results of the three baseline models and the model of the invention on the respective data sets are shown in Table 3, with the best result for each data set shown in bold.
Table 3 Experimental results on the different data sets
As the results in Table 3 show, the proposed LSF-CNER outperforms the other methods on the Weibo and Resume data sets.
On the Weibo data set, which is noisier and has no fixed format, LSF-CNER achieves a better effect: compared with SoftLexicon, the F1 value is improved by 0.71%.
On the Resume data set, which has a relatively fixed format and less noise, the F1 value of LSF-CNER is improved by 0.11% compared with SoftLexicon.
On the MSRA data set, which contains more formal text, the F1 value of LSF-CNER is 0.01% lower than SoftLexicon, indicating that the model has reached saturation: SoftLexicon already fits the large-scale MSRA and Resume data sets well, so adding feature information cannot obviously improve the model, whereas on small-scale data sets that lack feature information, syntactic and lexical information can be used to improve model performance.
Further, to verify the general utility of the LSF-CNER model, the vocabulary information of each character is combined with BERT word vectors as the output of the input representation layer and fed to the sequence encoding layer. Table 4 shows the results of the experiments with BERT, reported as F1 values. The average F1 value of the method on the different data sets is improved by 4.68% compared with BERT-Tagger, by 2.48% compared with BERT+BiLSTM+CRF, and by 0.89% compared with SoftLexicon+BERT. On the Weibo data set in particular, the improvement is more remarkable.
Table 4 Experimental results with BERT (F1, %)
The experimental results show that combining the pre-training model can further improve the effect of the input representation layer. The method outperforms the traditional CNER methods on the Resume, Weibo and MSRA data sets, verifying that fusing lexical information and syntactic information is effective.
To study the separate contributions of the dynamic word set and the syntactic information to the entity recognition task, ablation experiments were performed on the Resume data set with the model set to five cases:
(1) BERT + LSTM + CRF: an initial model that does not include external vocabulary information and syntax information;
(2) SoftLexicon + BERT: including external vocabulary information;
(3) SoftLexicon + Attention + BERT: introduces an attention mechanism on the basis of SoftLexicon to dynamically adjust the weights of different words in the word set;
(4) syntact + BERT: contains syntactic information but no external vocabulary information;
(5) LSF + BERT: containing syntactic information, and improved word set information.
Table 5 Ablation test results (%)

Model                         P      R      F1
BERT+BiLSTM+CRF               95.75  95.28  95.51
SoftLexicon+BERT              96.08  96.13  96.11
SoftLexicon+Attention+BERT    96.19  96.35  96.27
Syntactic+BERT                95.86  96.08  95.97
LSF+BERT                      96.73  96.45  96.59
As shown in Table 5, when the attention mechanism is introduced into the SoftLexicon+BERT model to adjust the weights of the word set vectors, the average F1 value rises by 0.16%; when syntax information is introduced into the BERT+BiLSTM+CRF model, the average F1 value is improved by 0.46%; and when the syntax information and the static and dynamic word set information are included at the same time, the average F1 value is improved by 0.48% over SoftLexicon+BERT. These results show that the static and dynamic word set information and the syntax information each help to improve model performance, and that using both together further improves the recognition precision of Chinese named entities. Dynamically adjusting the weights of words in a word set and introducing syntactic information thus help the model recognize the importance of different words in a sentence, improving the accuracy of Chinese named entity recognition.
In summary, the invention provides a Chinese named entity recognition method fusing vocabulary and syntax information. The new word set matching method integrates static and dynamic word set information at the input representation layer, and a gating mechanism then dynamically weights the output of the sequence encoding layer against the syntax information. This considers both the potential boundaries of Chinese named entities and the potential syntactic information in the sentence, fusing the two kinds of information in a more balanced way and improving the performance of the model. Experimental results on three CNER data sets show that the new method performs well compared with mainstream methods.
Referring to fig. 5, the present invention further includes a system for recognizing a named entity in chinese that fuses lexical and syntactic information, including:
an input representation acquisition module 501, configured to map the original input text into character vectors, introduce external vocabulary information with an improved word set matching algorithm, and integrate it into the input representation of each character;
a context vector acquisition module 502, configured to extract context information with a bidirectional LSTM from the input representation of each character;
a context information and syntax information fusion module 503, configured to obtain part-of-speech tags and syntactic constituents from the original input text with an NLP tool, construct a syntax vector with a key-value memory network, and weighted-fuse the context vector and the syntax vector through a gating mechanism to obtain a feature vector;
a sequence labeling module 504, configured to feed the feature vector into the conditional random field (CRF) of the label prediction layer to realize Chinese named entity recognition.
The system for Chinese named entity recognition fusing vocabulary and syntax information concretely implements the above method for Chinese named entity recognition fusing vocabulary and syntax information; the details are not repeated in this embodiment.
The above description is only an embodiment of the present invention, but the design concept of the present invention is not limited thereto; any insubstantial modification made to the present invention using this concept shall fall within the scope of protection of the present invention.

Claims (6)

1. A Chinese named entity recognition method fusing vocabulary and syntax information, characterized by comprising the following steps:
step 1, mapping the original input text into character vectors, introducing external vocabulary information with an improved word set matching algorithm, and integrating it into the input representation of each character;
step 2, extracting context information with a bidirectional LSTM from the input representation of each character;
step 3, obtaining part-of-speech tags and syntactic constituents from the original input text with an NLP tool, constructing a syntax vector with a key-value memory network, and weighted-fusing the context vector and the syntax vector through a gating mechanism to obtain a feature vector;
and step 4, feeding the feature vector into the conditional random field (CRF) of the label prediction layer to realize Chinese named entity recognition.
2. The method for Chinese named entity recognition fusing vocabulary and syntax information according to claim 1, wherein said step 1 specifically comprises:
step 1.1, regarding the input text as a sentence expressed as a sequence $x=(x_1,x_2,\ldots,x_n)$, where $x_i$ denotes the i-th character of sentence x of length n; to better utilize lexical information, dividing the result of matching each character against the dictionary into the following four "BIES" word sets:
(1) word set $B(x_i)$, containing all words in x that begin with $x_i$;
(2) word set $I(x_i)$, containing all words in x in which $x_i$ occurs in the middle;
(3) word set $E(x_i)$, containing all words in x that end with $x_i$;
(4) word set $S(x_i)$, containing the single-character word composed of $x_i$;
step 1.2, after the "BIES" word sets of each character are obtained, compressing each word set into a fixed-dimensional vector; the improved word set matching algorithm comprises a static word set algorithm and a dynamic word set algorithm, wherein, to ensure computational efficiency, the static word set algorithm uses the frequency of word occurrence to represent the corresponding weight, and the static word set vector of a single word set is computed as:

$$v^{ws}(T)=\frac{1}{Z}\sum_{j=1}^{m} z(w_j^T)\,e^w(w_j^T)$$

where $z(w_j^T)$ denotes the number of occurrences of word $w_j^T$ in the corpus; $Z=\sum_{j=1}^{m}z(w_j^T)$ denotes the total number of occurrences of the words in the word set T; $e^w(w_j^T)$ maps word $w_j^T$ into a word vector; T denotes one of the four "BIES" word sets; and $v^{ws}(T)$ denotes the vector representation of the word set T corresponding to character $x_i$;
for better information retention, the four static word sets are represented as a whole and integrated into a fixed-dimensional vector by splicing:

$$\tau_i=[v^{ws}(B);v^{ws}(I);v^{ws}(E);v^{ws}(S)]$$

where $\tau_i$ denotes the static word set vector corresponding to character $x_i$;
the dynamic word set algorithm uses an attention mechanism to measure the information between characters and matched words, computing the attention weights of the different matched words so as to strengthen useful words and suppress words whose effect is not obvious, as follows:

$$u_j=q\cdot e^w(w_j^T)$$

$$a_j=\frac{\exp(u_j)}{\sum_{k=1}^{m}\exp(u_k)}$$

$$Av^{ws}(T)=\sum_{j=1}^{m}a_j\,e^w(w_j^T)$$

where $e^w(w_j^T)$ maps word $w_j^T$ into a word vector; q is a trainable vector with the same dimension as $e^w(w_j^T)$; $u_j$ is the word attention score obtained by the attention mechanism; $a_j$ is the normalized word attention weight; $Av^{ws}(T)$ denotes the dynamic word set vector of a single word set; and m denotes the number of words matched in the word set T corresponding to character $x_i$;
the dynamic word set vector is obtained by weighted summation over the attention weights, and the four dynamic word sets are represented as a whole and compressed into a fixed-dimensional vector:

$$A\tau_i=[Av^{ws}(B);Av^{ws}(I);Av^{ws}(E);Av^{ws}(S)]$$

where $A\tau_i$ denotes the dynamic word set vector corresponding to character $x_i$;
step 1.3, to fully consider the importance of each word in the two word sets, combining the dynamic word set vector $A\tau_i$ and the static word set vector $\tau_i$ by dynamic weighting, using an evaluation function $\theta_i$ to measure the contributions of the static word set vector $\tau_i$ and the dynamic word set vector $A\tau_i$ to the entity recognition task:

$$\theta_i=\sigma(W_{\theta 1}\tau_i+W_{\theta 2}A\tau_i+b_\theta)$$

where $W_{\theta 1}$, $W_{\theta 2}$ are trainable matrices and $b_\theta$ is a bias term;
combining the character vector with the static word set vector $\tau_i$ and the dynamic word set vector $A\tau_i$ as the input representation finally containing the external vocabulary information:

$$x_i^{final}=e^x(x_i)\oplus\big(\theta_i\odot\tau_i+(1_l-\theta_i)\odot A\tau_i\big)$$

where $x_i^{final}$ denotes the final vector representation of character $x_i$; $1_l$ is an all-ones vector whose dimension l matches $\theta_i$; $e^x$ converts character $x_i$ into the corresponding character vector; $\odot$ denotes element-wise (dot) product; and $\oplus$ denotes vector splicing.
3. The method for Chinese named entity recognition fusing vocabulary and syntax information according to claim 2, wherein said step 2 specifically comprises:
adopting a bidirectional LSTM in the sequence encoding layer to obtain the context vector of each character, where the bidirectional LSTM is the combination of a forward LSTM and a backward LSTM; $\overrightarrow{h_i}$ denotes the hidden state of the forward LSTM at time i, and $\overleftarrow{h_i}$ denotes the hidden state of the backward LSTM at time i; the final context vector $h_i=[\overrightarrow{h_i};\overleftarrow{h_i}]$ is obtained by concatenating the corresponding forward and backward LSTM states.
4. The method for Chinese named entity recognition fusing vocabulary and syntax information according to claim 3, wherein said step 3 specifically comprises:
step 3.1, segmenting the original text with the Stanford CoreNLP tool, and extracting two kinds of syntactic information, namely the part-of-speech tags and the syntactic constituent information of the input text, with the Berkeley Neural Parser; the part-of-speech tag represents the tag information of a single word, and the syntactic constituent represents the structural grouping information of a text span;
step 3.2, for each $x_i$ in the input sequence, mapping its context features and syntax information to the keys and values of the key-value memory network KVMN, denoted $K_i=[k_{i,1},\ldots,k_{i,j},\ldots,k_{i,m_i}]$ and $V_i=[v_{i,1},\ldots,v_{i,j},\ldots,v_{i,m_i}]$, respectively;
step 3.3, mapping $k_{i,j}$ and $v_{i,j}$ to $e^k_{i,j}$ and $e^v_{i,j}$ with two embedding matrices; for the context features $K_i$ and syntax information $V_i$ associated with each $x_i$, the weight $\gamma_{i,j}$ assigned to the syntax information is:

$$\gamma_{i,j}=\frac{\exp(h_i^{\top}e^k_{i,j})}{\sum_{j=1}^{m_i}\exp(h_i^{\top}e^k_{i,j})}$$

where $h_i$ is the hidden vector of $x_i$ obtained from the sequence encoding layer and $h_i^{\top}$ is its transposed form;
the weight $\gamma_{i,j}$ is applied to the corresponding syntax information $v_{i,j}$ as follows:

$$\alpha_i=\sum_{j=1}^{m_i}\gamma_{i,j}\,e^v_{i,j}$$

where $\alpha_i$ is the weighted syntax information of the KVMN corresponding to $x_i$; the KVMN can thus ensure that the syntax information is weighted according to the corresponding context features, so that the important information is distinguished and used;
step 3.4, to make better use of the syntax information encoded by the KVMN, combining the syntax vector $\alpha_i$ and the context vector $h_i$ by dynamic weighting, using an evaluation function $\lambda_i$ to measure the contributions of the context vector $h_i$ and the syntax vector $\alpha_i$ to the sentence:

$$\lambda_i=\sigma(W_{\lambda 1}h_i+W_{\lambda 2}\alpha_i+b_\lambda)$$

where $W_{\lambda 1}$ and $W_{\lambda 2}$ are trainable matrices; $b_\lambda$ is a bias term; $\sigma$ denotes the sigmoid activation function;
the syntax vector $\alpha_i$ and the context vector $h_i$ are then combined:

$$O_i=\lambda_i\odot h_i+(1_l-\lambda_i)\odot\alpha_i$$

where $1_l$ is an all-ones vector whose dimension l matches $h_i$, and the resulting $O_i$ is the feature vector fusing the context information and the syntax information.
5. The method for recognizing named entities in chinese with fused lexical and syntactic information as claimed in claim 4, wherein said step 4 specifically comprises:
for an input sequence x ═ x 1 ,x 2 ,..,x n ) Given a prediction sequence y ═ y (y) 1 ,y 2 ,y 3 ,...,y n ) The score for the predicted sequence is calculated as follows:
Figure FDA0003662008670000041
where, M is the transition matrix,
Figure FDA0003662008670000042
indicating slave label y i To the label y i+1 The transfer fraction of (a);
Figure FDA0003662008670000043
y for i word in sentence i ' probability scores of tags; y is 0 And y n+1 Respectively representing a start and end tag; n represents the length of the input sequence;
the probability p (y | x) of the predicted sequence y generation is calculated using the softmax function as follows:
Figure FDA0003662008670000044
wherein, Y x For solution space, represent the set of all possible predicted sequences for y;
Figure FDA0003662008670000045
is an example of the entire sequence tagged solution space,
Figure FDA0003662008670000046
a score representing the instance;
during the training process, the log-likelihood $\log(p(y \mid x))$ of the correct predicted sequence is maximized as follows:

$$\log(p(y \mid x)) = \mathrm{score}(x, y) - \log\Big(\sum_{\tilde{y} \in Y_x} \exp(\mathrm{score}(x, \tilde{y}))\Big)$$
after continuous iterative training and back propagation, the obtained $y^*$ is the CRF sequence labeling result, i.e. the output of the final model:

$$y^* = \arg\max_{\tilde{y} \in Y_x} \mathrm{score}(x, \tilde{y})$$

where the training process continuously maximizes the objective function $\log(p(y \mid x))$.
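As a rough illustration of the scoring in step 4, the sketch below computes $\mathrm{score}(x, y)$ by summing transition scores along the label path (padded with start and end labels) and the per-position tag scores; the emissions, transitions, and label ids are random stand-ins, an assumption for the sake of a runnable example, not the trained model:

```python
# Toy sketch of the CRF path score: transitions M plus emissions P,
# with explicit start/end labels appended to the path (stand-in data).
import numpy as np

def crf_score(P, M, y, start, end):
    """P: (n, L) emission scores; M: transition matrix; y: length-n label ids."""
    path = [start] + list(y) + [end]
    trans = sum(M[path[i], path[i + 1]] for i in range(len(path) - 1))
    emit = sum(P[t, label] for t, label in enumerate(y))
    return trans + emit

rng = np.random.default_rng(1)
n, L = 6, 4                                  # labels 0..3, start=4, end=5
P = rng.normal(size=(n, L))
M = rng.normal(size=(L + 2, L + 2))
print(crf_score(P, M, [0, 1, 1, 2, 3, 0], start=L, end=L + 1))
```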
6. A Chinese named entity recognition system fusing lexical and syntactic information, comprising:
the input representation acquisition module is used for mapping an original input text into a word vector, introducing external vocabulary information by using an improved word set matching algorithm and integrating the external vocabulary information into the input representation of each word;
a context vector acquisition module for extracting context information using a bidirectional LSTM according to the input representation of the word;
the context information and syntax information fusion module is used for acquiring part-of-speech tags and syntactic components from the original input text by using NLP tools, constructing a syntax vector by using a key-value memory network, and performing weighted fusion of the context vector and the syntax vector by using a gating mechanism to obtain a feature vector;
and the sequence labeling module is used for inputting the feature vectors into a conditional random field (CRF) of the label prediction layer to realize Chinese named entity recognition.
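A hypothetical end-to-end skeleton mirroring the four claimed modules might look as follows; this assumes PyTorch, all names and sizes are illustrative, and the word-set matching, KVMN, and CRF components are abstracted behind their inputs and outputs rather than implemented:

```python
# Hypothetical module skeleton for the claimed system (assumption: PyTorch;
# names, sizes, and wiring are illustrative, not the patent's actual code).
import torch
import torch.nn as nn

class LexSynNER(nn.Module):
    def __init__(self, vocab_size=5000, embed_dim=100, hidden_dim=64, num_labels=9):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)     # input representation
        self.encoder = nn.LSTM(embed_dim, hidden_dim,
                               bidirectional=True, batch_first=True)  # context vectors
        self.gate = nn.Linear(4 * hidden_dim, 2 * hidden_dim)    # fusion gate
        self.emit = nn.Linear(2 * hidden_dim, num_labels)        # scores fed to a CRF

    def forward(self, chars, syntax):
        # chars: (B, T) token ids; syntax: (B, T, 2*hidden_dim) KVMN output alpha_i
        h, _ = self.encoder(self.embedding(chars))
        lam = torch.sigmoid(self.gate(torch.cat([h, syntax], dim=-1)))
        fused = lam * h + (1 - lam) * syntax                     # O_i per position
        return self.emit(fused)                                  # CRF decoding follows
```

Note that a single linear layer over the concatenation $[h_i; \alpha_i]$ is algebraically equivalent to the claim's $W_{\lambda 1} h_i + W_{\lambda 2} \alpha_i$ formulation.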
CN202210575509.3A 2022-05-25 2022-05-25 Chinese named entity recognition method and system fusing vocabulary and syntax information Pending CN114818717A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210575509.3A CN114818717A (en) 2022-05-25 2022-05-25 Chinese named entity recognition method and system fusing vocabulary and syntax information

Publications (1)

Publication Number Publication Date
CN114818717A true CN114818717A (en) 2022-07-29

Family

ID=82517680

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210575509.3A Pending CN114818717A (en) 2022-05-25 2022-05-25 Chinese named entity recognition method and system fusing vocabulary and syntax information

Country Status (1)

Country Link
CN (1) CN114818717A (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109726389A (en) * 2018-11-13 2019-05-07 北京邮电大学 A kind of Chinese missing pronoun complementing method based on common sense and reasoning
WO2021082366A1 (en) * 2019-10-28 2021-05-06 南京师范大学 Interactive and iterative learning-based intelligent construction method for geographical name tagging corpus
CN112883732A (en) * 2020-11-26 2021-06-01 中国电子科技网络信息安全有限公司 Method and device for identifying Chinese fine-grained named entities based on associative memory network
CN113095074A (en) * 2021-03-22 2021-07-09 北京工业大学 Word segmentation method and system for Chinese electronic medical record
CN113609859A (en) * 2021-08-04 2021-11-05 浙江工业大学 Special equipment Chinese named entity recognition method based on pre-training model
CN114528840A (en) * 2022-01-21 2022-05-24 深圳大学 Chinese entity identification method, terminal and storage medium fusing context information

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115270803A (en) * 2022-09-30 2022-11-01 北京道达天际科技股份有限公司 Entity extraction method based on BERT and fused with N-gram characteristics
CN115774993A (en) * 2022-12-29 2023-03-10 广东南方网络信息科技有限公司 Conditional error identification method and device based on syntactic analysis
CN115774993B (en) * 2022-12-29 2023-09-08 广东南方网络信息科技有限公司 Condition type error identification method and device based on syntactic analysis
CN117077672A (en) * 2023-07-05 2023-11-17 哈尔滨理工大学 Chinese naming entity recognition method based on vocabulary enhancement and TCN-BILSTM model
CN117077672B (en) * 2023-07-05 2024-04-26 哈尔滨理工大学 Chinese naming entity recognition method based on vocabulary enhancement and TCN-BILSTM model
CN117521639A (en) * 2024-01-05 2024-02-06 湖南工商大学 Text detection method combined with academic text structure
CN117521639B (en) * 2024-01-05 2024-04-02 湖南工商大学 Text detection method combined with academic text structure

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination