CN114818717A - Chinese named entity recognition method and system fusing vocabulary and syntax information

Info

Publication number
CN114818717A
CN114818717A (application CN202210575509.3A)
Authority
CN
China
Prior art keywords
word, vector, information, syntax, word set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210575509.3A
Other languages
Chinese (zh)
Inventor
李弼程
刘其龙
张敏
皮慧娟
王华珍
王成
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huaqiao University
Original Assignee
Huaqiao University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huaqiao University filed Critical Huaqiao University
Priority to CN202210575509.3A priority Critical patent/CN114818717A/en
Publication of CN114818717A publication Critical patent/CN114818717A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295Named entity recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/237Lexical tools
    • G06F40/242Dictionaries
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/253Grammatical analysis; Style critique
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/049Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a Chinese named entity recognition method and system fusing vocabulary and syntax information, comprising the following steps: step 1, mapping the original input text into character vectors, introducing external vocabulary information with an improved word set matching algorithm, and integrating it into the input representation of each character; step 2, extracting context information with a bidirectional LSTM from the input representation of each character; step 3, obtaining part-of-speech tags and syntactic constituents from the original input text with an NLP tool, constructing a syntax vector with a key-value memory network, and weighted-fusing the context vector and the syntax vector through a gating mechanism to obtain a feature vector; and step 4, feeding the feature vector into the conditional random field (CRF) of the label prediction layer to realize Chinese named entity recognition. The invention addresses the problem of insufficient entity boundary information in Chinese named entities and fuses the syntactic information of the input text.

Description

Chinese named entity recognition method and system fusing vocabulary and syntax information
Technical Field
The invention relates to the field of information extraction in natural language processing, and in particular to a Chinese named entity recognition method and system fusing vocabulary and syntax information.
Background
Named Entity Recognition (NER) aims at recognizing entities in text and classifying them into different categories, such as person names, place names, and organization names. NER is an important task in NLP and has been widely applied to relation extraction, question answering, machine translation, knowledge base construction, and other fields, so research and breakthroughs in NER are of great significance.
Chinese named entity recognition differs from English: in English, each word can express complete semantic information, whereas in Chinese a complete meaning is in most cases expressed only by a multi-character word or phrase, and Chinese lacks obvious cues such as word boundary markers and capitalization, which makes entity boundaries difficult to identify. Since word boundaries usually coincide with entity boundaries, word boundary information plays an important role in Chinese Named Entity Recognition (CNER).
The difficulty of recognizing word boundaries can be alleviated by introducing external features. Among these features, lexical information and syntactic information are especially meaningful and can help a CNER model locate the corresponding entities. Existing CNER models, however, rarely distinguish or filter such external features when using them, and noise in the features may harm model performance. Finding a suitable way to integrate external feature information into a CNER model therefore remains a challenge. In most cases it is desirable for a CNER model to incorporate a variety of additional features, so an effective mechanism is needed to weight and combine these features and limit the noisy information.
Meanwhile, the existing SoftLexicon word set matching method depends on static word frequency statistics over the data set, using word frequency to measure the effect of different words on the Chinese named entity recognition task. Since data sets differ in scale, word frequencies can be too low on small-scale data sets, and in some cases frequency does not properly reflect the importance of a word. A more reasonable method is therefore needed to weigh the words in a word set.
Disclosure of Invention
The invention mainly aims to overcome the defects of the prior art, and provides a Chinese named entity recognition method and system fusing lexical and syntactic information (abbreviated as the LSF-CNER model). Specifically, external vocabulary information and the syntactic information of the input text are fused by a gating unit, and an attention mechanism is introduced into the model, so as to construct a Chinese named entity recognition model that is expected to improve recognition accuracy. The problems solved by the invention are embodied in two aspects. On the one hand, for each character of the input text sequence, the improved word set matching algorithm splices the matched static word set vector, dynamic word set vector and initial character vector, thereby integrating external vocabulary information into the character vector and addressing the insufficient word boundary features of Chinese text. On the other hand, the syntactic information of the input text is extracted with an NLP tool and integrated, through a gating mechanism, with the context vectors extracted by a bidirectional LSTM, thereby enriching the feature vector representation and fusing deeper syntactic information.
The invention adopts the following technical scheme:
in one aspect, the Chinese named entity recognition method fusing vocabulary and syntax information comprises the following steps:
step 1, mapping the original input text into character vectors, introducing external vocabulary information with an improved word set matching algorithm, and integrating it into the input representation of each character;
step 2, extracting context information with a bidirectional LSTM from the input representation of each character;
step 3, obtaining part-of-speech tags and syntactic constituents from the original input text with an NLP tool, constructing a syntax vector with a key-value memory network, and weighted-fusing the context vector and the syntax vector through a gating mechanism to obtain a feature vector;
and step 4, feeding the feature vector into the conditional random field (CRF) of the label prediction layer to realize Chinese named entity recognition.
Preferably, the step 1 specifically includes:
step 1.1, regarding the input text as a sentence expressed as a sequence $x=(x_1,x_2,\ldots,x_n)$, where $x_i$ denotes the i-th character of sentence x of length n; to better utilize lexical information, the result of matching each character against the dictionary is divided into the following four "BIES" word sets:
(1) word set $B(x_i)$, containing all words in x that begin with $x_i$;
(2) word set $I(x_i)$, containing all words in x in which $x_i$ occurs in the middle;
(3) word set $E(x_i)$, containing all words in x that end with $x_i$;
(4) word set $S(x_i)$, containing the single-character word composed of $x_i$;
step 1.2, after the "BIES" word sets of each character are obtained, compressing each word set into a vector of fixed dimension; the improved word set matching algorithm comprises a static word set algorithm and a dynamic word set algorithm, wherein, to ensure computational efficiency, the static word set algorithm uses the frequency of word occurrence to represent the corresponding weight, and the static word set vector of a single word set is computed as:

$$v^{ws}(T)=\frac{1}{Z}\sum_{j=1}^{m} z(w_j^T)\,e^w(w_j^T)$$

where $z(w_j^T)$ denotes the number of occurrences of word $w_j^T$ in the corpus; $Z=\sum_{j=1}^{m}z(w_j^T)$ denotes the total number of occurrences of the words in the word set T; $e^w(w_j^T)$ maps word $w_j^T$ into a word vector; T denotes one of the four "BIES" word sets; and $v^{ws}(T)$ denotes the vector representation of the word set T corresponding to character $x_i$;
for better information retention, the four static word sets are represented as a whole and integrated into a fixed-dimensional vector by splicing:

$$\tau_i=[v^{ws}(B);v^{ws}(I);v^{ws}(E);v^{ws}(S)]$$

where $\tau_i$ denotes the static word set vector corresponding to character $x_i$;
the dynamic word set algorithm uses an attention mechanism to measure the information between characters and matched words, computing the attention weights of the different matched words so as to strengthen useful words and suppress words whose effect is not obvious, as follows:

$$u_j=q\cdot e^w(w_j^T)$$

$$a_j=\frac{\exp(u_j)}{\sum_{k=1}^{m}\exp(u_k)}$$

$$Av^{ws}(T)=\sum_{j=1}^{m}a_j\,e^w(w_j^T)$$

where $e^w(w_j^T)$ maps word $w_j^T$ into a word vector; q is a trainable vector with the same dimension as $e^w(w_j^T)$; $u_j$ is the word attention score obtained by the attention mechanism; $a_j$ is the normalized word attention weight; $Av^{ws}(T)$ denotes the dynamic word set vector of a single word set; and m denotes the number of words matched in the word set T corresponding to character $x_i$;
the dynamic word set vector is obtained by weighted summation over the attention weights, and the four dynamic word sets are represented as a whole and compressed into a fixed-dimensional vector:

$$A\tau_i=[Av^{ws}(B);Av^{ws}(I);Av^{ws}(E);Av^{ws}(S)]$$

where $A\tau_i$ denotes the dynamic word set vector corresponding to character $x_i$;
step 1.3, to fully consider the importance of each word in the two word sets, combining the dynamic word set vector $A\tau_i$ and the static word set vector $\tau_i$ by dynamic weighting, using an evaluation function $\theta_i$ to measure the contributions of the static word set vector $\tau_i$ and the dynamic word set vector $A\tau_i$ to the entity recognition task:

$$\theta_i=\sigma(W_{\theta 1}\tau_i+W_{\theta 2}A\tau_i+b_\theta)$$

where $W_{\theta 1}$, $W_{\theta 2}$ are trainable matrices and $b_\theta$ is a bias term;
combining the character vector with the static word set vector $\tau_i$ and the dynamic word set vector $A\tau_i$ as the input representation finally containing the external vocabulary information:

$$x_i^{final}=e^x(x_i)\oplus\big(\theta_i\odot\tau_i+(1_l-\theta_i)\odot A\tau_i\big)$$

where $x_i^{final}$ denotes the final vector representation of character $x_i$; $1_l$ is an all-ones vector whose dimension l matches $\theta_i$; $e^x$ converts character $x_i$ into the corresponding character vector; $\odot$ denotes element-wise (dot) product; and $\oplus$ denotes vector splicing.
preferably, the step 2 specifically includes:
the sequence encoding layer adopts a bidirectional LSTM to obtain the context vector of each character, where the bidirectional LSTM is the combination of a forward LSTM and a backward LSTM; $\overrightarrow{h_i}$ denotes the hidden state of the forward LSTM at time i, and $\overleftarrow{h_i}$ denotes the hidden state of the backward LSTM at time i; the final context vector $h_i=[\overrightarrow{h_i};\overleftarrow{h_i}]$ is obtained by concatenating the corresponding forward and backward LSTM states.
Preferably, the step 3 specifically includes:
step 3.1, segmenting words of an original text by using a Stanford CoreNLP tool, and extracting two syntactic information, namely part-of-speech labels and syntactic component information of an input text by using a Berkely Neural Parse tool; wherein, the part-of-speech tag represents the tag information of a single word, and the syntactic component represents the structure grouping information of the text span;
step 3.2, for each x in the input sequence i Mapping its context characteristics and syntax information to keys and values in the key-value memory network KVMN, respectively represented as
Figure BDA0003662008680000044
And
Figure BDA0003662008680000045
step 3.3, k is embedded using two embedding matrices i,j And v i,j Respectively map to
Figure BDA0003662008680000046
And
Figure BDA0003662008680000047
for each x i Associated context feature K i And syntax information V i Weight γ assigned to syntax information i,j Comprises the following steps:
Figure BDA0003662008680000048
wherein h is i Is x i A concealment vector obtained from a sequence coding layer;
Figure BDA0003662008680000049
is h i Transposed form of (1);
weight gamma i,j Applied to corresponding syntax information v i,j Above, as follows:
Figure BDA00036620086800000410
wherein alpha is i For KVMN model correspondence x i Weighted syntax information of (1); therefore, the KVMN can ensure that the grammar information is weighted according to the corresponding context characteristics, so that the important information is distinguished and used;
step 3.4, sentence-wise Normal vector α for better utilization of syntax information encoded by KVMN i And a context vector h i Dynamic weighted combination using an evaluation function lambda i To measure the context vector h i And syntax vector alpha i Contribution to the sentence:
λ i =σ(W λ1 .h i +W λ2i +b λ )
wherein, W λ1 And W λ2 Is a trainable matrix; b is a mixture of λ Is a bias term; sigma represents a sigmoid activation function;
and then the syntactic vector alpha is expressed i And a context vector h i Combining together:
Figure BDA00036620086800000411
where l is the vector dimension and h i Matched 1 vector, and the resulting O i Namely, the feature vector fusing the context information and the syntax information.
Preferably, the step 4 specifically includes:
for an input sequence $x=(x_1,x_2,\ldots,x_n)$ and a given predicted sequence $y=(y_1,y_2,\ldots,y_n)$, the score of the predicted sequence is calculated as follows:

$$s(x,y)=\sum_{i=0}^{n}M_{y_i,y_{i+1}}+\sum_{i=1}^{n}P_{i,y_i}$$

where M is the transition matrix and $M_{y_i,y_{i+1}}$ denotes the transition score from tag $y_i$ to tag $y_{i+1}$; $P_{i,y_i}$ is the probability score of tag $y_i$ for the i-th character of the sentence; $y_0$ and $y_{n+1}$ denote the start and end tags, respectively; and n denotes the length of the input sequence;
the probability $p(y\mid x)$ of generating the predicted sequence y is calculated with the softmax function as follows:

$$p(y\mid x)=\frac{\exp(s(x,y))}{\sum_{\tilde{y}\in Y_x}\exp(s(x,\tilde{y}))}$$

where $Y_x$ is the solution space, i.e. the set of all possible predicted sequences; $\tilde{y}$ is one instance of the sequence-tagging solution space, and $s(x,\tilde{y})$ denotes the score of that instance;
during training, the log-likelihood $\log(p(y\mid x))$ of the correct predicted sequence is maximized, as follows:

$$\log(p(y\mid x))=s(x,y)-\log\Big(\sum_{\tilde{y}\in Y_x}\exp(s(x,\tilde{y}))\Big)$$

after continuous iterative training and back propagation, the CRF sequence labeling result $y^*$, i.e. the output of the final model, is obtained by maximizing the objective function:

$$y^*=\arg\max_{\tilde{y}\in Y_x}s(x,\tilde{y})$$
In another aspect, a system for Chinese named entity recognition fusing lexical and syntactic information comprises:
an input representation acquisition module, for mapping the original input text into character vectors, introducing external vocabulary information with an improved word set matching algorithm, and integrating it into the input representation of each character;
a context vector acquisition module, for extracting context information with a bidirectional LSTM from the input representation of each character;
a context information and syntax information fusion module, for obtaining part-of-speech tags and syntactic constituents from the original input text with an NLP tool, constructing a syntax vector with a key-value memory network, and weighted-fusing the context vector and the syntax vector through a gating mechanism to obtain a feature vector;
and a sequence labeling module, for feeding the feature vector into the conditional random field (CRF) of the label prediction layer to realize Chinese named entity recognition.
As can be seen from the above description of the present invention, compared with the prior art, the present invention has the following advantages:
the invention introduces external vocabulary information in the word vector through an improved word set matching algorithm, and integrates the context vector and the syntactic vector in a syntactic information layer so as to solve the problem of insufficient entity boundary information in the Chinese named entity and fuse the syntactic information of the input text.
Drawings
FIG. 1 is a flow chart of a method for Chinese named entity recognition incorporating lexical and syntactic information in accordance with an embodiment of the present invention;
FIG. 2 is a model diagram of a method for identifying a named entity in Chinese incorporating lexical and syntactic information according to an embodiment of the present invention;
FIG. 3 is a diagram illustrating a dictionary matching method corresponding to Chinese linguistics according to an embodiment of the present invention;
FIG. 4 is a diagram of syntax information acquisition according to an embodiment of the present invention;
FIG. 5 is a block diagram of a system for Chinese named entity recognition incorporating lexical and syntactic information according to an embodiment of the present invention.
Detailed Description
The invention will be further illustrated with reference to the following specific examples. It should be understood that these examples are for illustrative purposes only and are not intended to limit the scope of the present invention. Further, it should be understood that various changes or modifications of the present invention may be made by those skilled in the art after reading the teaching of the present invention, and such equivalents may fall within the scope of the present invention as defined in the appended claims.
Referring to fig. 1, the method for recognizing Chinese named entities fusing vocabulary and syntax information according to the present invention comprises:
step 1, mapping the original input text into character vectors, introducing external vocabulary information with an improved word set matching algorithm, and integrating it into the input representation of each character;
step 2, extracting context information with a bidirectional LSTM from the input representation of each character;
step 3, obtaining part-of-speech tags and syntactic constituents from the original input text with an NLP tool, constructing a syntax vector with a key-value memory network, and weighted-fusing the context vector and the syntax vector through a gating mechanism to obtain a feature vector;
and step 4, feeding the feature vector into the conditional random field (CRF) of the label prediction layer to realize Chinese named entity recognition.
Specifically, a model diagram of the method of the present invention is shown in fig. 2. The model is divided into four parts: the input representation layer, the sequence encoding layer, the syntax information layer, and the label prediction layer.
The invention relates to a Chinese named entity recognition method fusing vocabulary and syntax information, which comprises the following specific implementation steps:
(1) acquisition of an input representation containing external vocabulary information
The invention provides a word set matching method that introduces an attention mechanism on the basis of SoftLexicon, combines the matched static and dynamic word set information, and then fuses the static and dynamic word set information with a gating mechanism.
1.1) Static word set algorithm
In the Chinese NER model, a sentence of the original input text is treated as a sequence $x=(x_1,x_2,\ldots,x_n)$, where $x_i$ denotes the i-th character of sentence x of length n. To better utilize lexical information, the result of matching each character against the dictionary is divided into four word sets:
a. word set $B(x_i)$ contains all words in x that begin with $x_i$;
b. word set $I(x_i)$ contains all words in x in which $x_i$ occurs in the middle;
c. word set $E(x_i)$ contains all words in x that end with $x_i$;
d. word set $S(x_i)$ contains the single-character word composed of $x_i$.
These four word sets are collectively referred to as "BIES" and are respectively defined as:

$$B(x_i)=\{w_{i,k}\mid w_{i,k}\in L,\ i<k\le n\}$$

$$I(x_i)=\{w_{j,k}\mid w_{j,k}\in L,\ 1\le j<i<k\le n\}$$

$$E(x_i)=\{w_{j,i}\mid w_{j,i}\in L,\ 1\le j<i\}$$

$$S(x_i)=\{x_i\mid x_i\in L\}$$

where L denotes the external dictionary, and $w_{j,k}$ denotes the word occurring in the dictionary that begins with $x_j$ and ends with $x_k$ in the sequence. When the input of the input layer is "Chinese linguistics", the corresponding dictionary matching is as shown in fig. 3.
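The sketch below illustrates this BIES matching against an external dictionary; it is a minimal assumption-level example, and the function and variable names (match_bies, lexicon, max_word_len) are illustrative, not taken from the patent.

```python
# Sketch of "BIES" word-set matching: every substring of the sentence
# found in the external dictionary L is assigned to the B/I/E/S sets of
# the characters it covers, following the definitions above.

def match_bies(sentence, lexicon, max_word_len=8):
    """Return per-character B/I/E/S word sets for `sentence`."""
    n = len(sentence)
    sets = [{"B": set(), "I": set(), "E": set(), "S": set()} for _ in range(n)]
    for j in range(n):
        for k in range(j, min(n, j + max_word_len)):
            word = sentence[j:k + 1]
            if word not in lexicon:
                continue
            if j == k:                      # single-character word
                sets[j]["S"].add(word)
            else:
                sets[j]["B"].add(word)      # word begins at position j
                sets[k]["E"].add(word)      # word ends at position k
                for m in range(j + 1, k):   # interior characters
                    sets[m]["I"].add(word)
    return sets

# Toy example with an assumed dictionary:
lexicon = {"中国", "中国语言学", "语言", "语言学"}
print(match_bies("中国语言学", lexicon))
```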
After the "BIES" word sets of each character are obtained, each word set is compressed into a fixed-dimensional vector. The static word set algorithm is superior in processing speed to the dynamic word set algorithm that uses the attention mechanism; to ensure computational efficiency, the frequency of word occurrence is used to represent the corresponding weight. Since word frequency is a static value, a high frequency can reflect the importance of a word, which speeds up the calculation of the word weights. For character $x_i$, the corresponding word set T is

$$T=\{w_1^T,w_2^T,\ldots,w_m^T\}$$

where $w_j^T$ denotes the j-th word in the word set T of length m corresponding to character $x_i$. Following this idea, the weights are updated to obtain the static word set vector:

$$v^{ws}(T)=\frac{1}{Z}\sum_{j=1}^{m} z(w_j^T)\,e^w(w_j^T)$$

where $z(w_j^T)$ denotes the number of occurrences of word $w_j^T$ in the corpus; $Z=\sum_{j=1}^{m}z(w_j^T)$ denotes the total number of occurrences of the words in word set T; $e^w(w_j^T)$ maps word $w_j^T$ into a word vector; T denotes one of the four "BIES" word sets; the superscript ws denotes a word set; and $v^{ws}(T)$ represents the vector of the word set T corresponding to character $x_i$. For better information retention, the four static word sets are represented as a whole and compressed into a fixed-dimensional vector:

$$\tau_i=[v^{ws}(B);v^{ws}(I);v^{ws}(E);v^{ws}(S)]$$
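A minimal sketch of the frequency-weighted static word set vector follows; the lookup tables `emb` (word to vector) and `freq` (word to corpus count) and the function names are assumptions for illustration, not structures defined by the patent.

```python
import torch

# v_ws(T) = (1/Z) * sum_j z(w_j) * e(w_j): a frequency-weighted sum of
# the matched words' embeddings, with Z the total count within the set.

def static_set_vector(words, emb, freq, dim):
    if not words:
        return torch.zeros(dim)                    # empty set -> zero vector
    z = torch.tensor([float(freq.get(w, 1)) for w in words])
    E = torch.stack([emb[w] for w in words])       # (m, dim) word vectors
    return (z / z.sum()) @ E                       # frequency-weighted sum

def static_tau(bies, emb, freq, dim):
    # tau_i: concatenation of the four static set vectors for character x_i
    return torch.cat([static_set_vector(sorted(bies[t]), emb, freq, dim)
                      for t in ("B", "I", "E", "S")])
```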
1.2) Dynamic word set algorithm
The attention mechanism can assign different weights to different words: greater weights to the few key words and smaller weights to irrelevant words. The dynamic word set algorithm uses the attention mechanism to measure the information between characters and matched words, computes the importance of the different matched words, strengthens useful words, and suppresses words whose effect is not obvious. For character $x_i$, the corresponding word set T is

$$T=\{w_1^T,w_2^T,\ldots,w_m^T\}$$

where $w_j^T$ denotes the j-th word in the word set T of length m corresponding to character $x_i$. Taking the word vectors $e^w(w_j^T)$ as input, a three-layer neural network is adopted to obtain word importance scores of dimension 1×m with values in (0, 1):

$$u_j=q\cdot e^w(w_j^T)$$

$$a_j=\frac{\exp(u_j)}{\sum_{k=1}^{m}\exp(u_k)}$$

where q is a trainable vector with the same dimension as $e^w(w_j^T)$, $u_j$ is the word importance score obtained through the attention mechanism, and $a_j$ is the normalized word importance score. The dynamic word set vector is then obtained from the word importance scores, and the four dynamic word sets are represented as a whole and compressed into a fixed-dimensional vector:

$$Av^{ws}(T)=\sum_{j=1}^{m}a_j\,e^w(w_j^T)$$

$$A\tau_i=[Av^{ws}(B);Av^{ws}(I);Av^{ws}(E);Av^{ws}(S)]$$

where $Av^{ws}(T)$ denotes the dynamic word set vector of character $x_i$ for the word set T, $A\tau_i$ denotes the dynamic word set vector corresponding to character $x_i$, and m is the number of words matched in the word set T corresponding to character $x_i$.
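The attention-weighted dynamic set vector can be sketched as follows; the names (`dynamic_set_vector`, `E`, `q`) and dimensions are illustrative assumptions.

```python
import torch

# Attention over matched words: a trainable query q scores each word
# embedding, softmax normalizes the scores, and the embeddings are
# summed with those weights to form the dynamic word-set vector.

def dynamic_set_vector(E, q):
    """E: (m, dim) embeddings of matched words; q: (dim,) trainable query."""
    u = E @ q                          # attention scores u_j = q . e(w_j)
    a = torch.softmax(u, dim=0)        # normalized attention weights a_j
    return a @ E                       # weighted sum -> dynamic set vector

m, dim = 4, 50
E = torch.randn(m, dim)
q = torch.nn.Parameter(torch.randn(dim))
v = dynamic_set_vector(E, q)           # (dim,) vector for one word set
```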
1.3) Gating mechanism
To fully consider the importance of each word in each word set, the dynamic word set vector $A\tau_i$ and the static word set vector $\tau_i$ are combined by dynamic weighting. An evaluation function $\theta_i$ measures the contributions of the static word set vector $\tau_i$ and the dynamic word set vector $A\tau_i$ to the entity recognition task:

$$\theta_i=\sigma(W_{\theta 1}\tau_i+W_{\theta 2}A\tau_i+b_\theta)$$

where $W_{\theta 1}$, $W_{\theta 2}$ are trainable matrices and $b_\theta$ is the bias term. The static word set vector $\tau_i$ and the dynamic word set vector $A\tau_i$ are then combined:

$$x_i^{final}=e^x(x_i)\oplus\big(\theta_i\odot\tau_i+(1_l-\theta_i)\odot A\tau_i\big)$$

where $e^x(x_i)$ denotes the character vector of $x_i$, $x_i^{final}$ denotes the final representation of character $x_i$, $1_l$ is an all-ones vector whose dimension l matches $\theta_i$, $\odot$ denotes element-wise (dot) product, and $\oplus$ denotes vector splicing. In this way, the pre-trained word vectors are embedded into the character representation, and the external dictionary information can be used reasonably.
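A minimal sketch of this gate, under the formulation above; the module and attribute names are assumptions for illustration.

```python
import torch
import torch.nn as nn

# Gate theta = sigmoid(W1*tau + W2*Atau + b) mixes the static and dynamic
# word-set vectors, and the mix is concatenated with the character
# embedding to form the final input representation.

class WordSetGate(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.w1 = nn.Linear(dim, dim, bias=False)   # W_theta1
        self.w2 = nn.Linear(dim, dim, bias=True)    # W_theta2 (+ b_theta)

    def forward(self, char_emb, tau, a_tau):
        theta = torch.sigmoid(self.w1(tau) + self.w2(a_tau))   # gate in (0,1)
        mixed = theta * tau + (1.0 - theta) * a_tau            # weighted mix
        return torch.cat([char_emb, mixed], dim=-1)            # final input
```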
(2) Acquisition of context vectors
Long short-term memory networks (LSTM) are a variant of recurrent neural networks and are widely used in NLP tasks such as NER, text classification and sentiment analysis. The LSTM introduces a cell state, maintains and controls information through an input gate, a forget gate and an output gate, and effectively overcomes the gradient explosion and gradient vanishing caused by the long-distance dependencies of RNN models. The mathematical expression of the LSTM model is as follows:

$$i_t=\sigma(W_i[h_{t-1};\tau_t]+b_i)\quad(15)$$

$$f_t=\sigma(W_f[h_{t-1};\tau_t]+b_f)\quad(16)$$

$$o_t=\sigma(W_o[h_{t-1};\tau_t]+b_o)\quad(17)$$

$$\tilde{c}_t=\tanh(W_c[h_{t-1};\tau_t]+b_c)\quad(18)$$

$$c_t=f_t\odot c_{t-1}+i_t\odot\tilde{c}_t\quad(19)$$

$$h_t=o_t\odot\tanh(c_t)\quad(20)$$

where $\sigma$ denotes the sigmoid activation function and tanh denotes the hyperbolic tangent; $\tau_t$ denotes the unit input; $i_t$, $f_t$, $o_t$ denote the input gate, forget gate and output gate, respectively; $W_i$, $W_f$, $W_o$ denote the weights of the input, forget and output gates; $b_i$, $b_f$, $b_o$ denote their offsets; $\tilde{c}_t$ denotes the current input state; and $c_t$ denotes the output of the current state.
To use the context information of characters in both directions simultaneously, the model uses a bidirectional LSTM, the combination of a forward LSTM and a backward LSTM, to obtain the context vector of each character. $\overrightarrow{h_i}$ denotes the hidden state of the forward LSTM at time i, and $\overleftarrow{h_i}$ denotes the hidden state of the backward LSTM at time i. The final context vector $h_i=[\overrightarrow{h_i};\overleftarrow{h_i}]$ is obtained by concatenating the corresponding forward and backward LSTM states.
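A minimal bidirectional-LSTM encoder sketch follows; the dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Bidirectional LSTM: forward and backward hidden states are produced
# jointly and concatenated per position, matching h_i = [h_fwd ; h_bwd].

encoder = nn.LSTM(input_size=100, hidden_size=128,
                  batch_first=True, bidirectional=True)

x = torch.randn(2, 20, 100)     # (batch, sentence length, input dim)
h, _ = encoder(x)               # h: (2, 20, 256) = fwd (128) ++ bwd (128)
```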
(3) Fusion of contextual information and syntactic information
Context information is output by the bidirectional LSTM of the sequence encoding layer; syntax information is first extracted from the input sequence with an NLP tool and then mapped with a key-value memory network; finally, the two kinds of information are integrated with a gating mechanism.
3.1) Syntax information acquisition
The invention uses the Stanford CoreNLP tool to segment the text, and then extracts two kinds of syntactic information, namely part-of-speech tags and syntactic constituents, with the Berkeley Neural Parser. The part-of-speech tags represent the tag information of individual words, and the syntactic constituents represent the structural grouping information of text spans. Taking an example sentence containing the words "liberation", "avenue", "road surface" and "water", fig. 4 shows its part-of-speech tags and syntactic constituents. For each $x_i$ in the input sequence $x=(x_1,x_2,\ldots,x_n)$, the context features and syntax information are extracted as follows.
3.1.1) Part-of-speech tags: each $x_i$ is taken as the core word, and a window of ±1 words is used to extract the context words on both sides of $x_i$ together with their part-of-speech tags. As shown in fig. 4(a), the context features obtained for the word "avenue" are (liberation, avenue, road surface), and the combination of these words with the corresponding part-of-speech tags is used as the part-of-speech information of the NER task, i.e. the corresponding syntax information is (liberation_NN, avenue_NN, road surface_NN).
3.1.2) Syntactic constituents: starting from the leaf node of $x_i$ in the syntax tree, search upward along the tree for the first syntactic node, then take all words under that node as the context features, and take the combination of these words with the corresponding syntactic labels as the syntactic constituent information of the NER task. As shown in fig. 4(b), for the word "avenue", searching upward for a node, the syntactic node "NP" contains the two words "liberation" and "avenue". The context features are therefore (liberation, avenue), combined with the "NP" tag as the syntactic constituent information of the NER task, i.e. the corresponding syntax information is (liberation_NP, avenue_NP).
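The patent names Stanford CoreNLP and the Berkeley Neural Parser for this step. As an illustration only, the sketch below uses the Stanford `stanza` library, a related but different tool, to obtain Chinese segmentation and part-of-speech tags; both the library choice and the example sentence are assumptions, not the patent's configuration.

```python
# Illustration with `stanza` (a swapped-in tool, not the CoreNLP/Berkeley
# pipeline the patent names). Run stanza.download("zh") once beforehand
# to fetch the Chinese models.
import stanza

nlp = stanza.Pipeline(lang="zh", processors="tokenize,pos")
doc = nlp("中国语言学")               # example input from the description
for word in doc.sentences[0].words:
    print(word.text, word.xpos)       # segmented word and its POS tag
```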
3.2) KVMN construction
Since the syntax information extracted by the tools contains a certain amount of noise, the performance of the model may suffer if this noise is not handled properly. Variants of the KVMN have proven effective in incorporating context features and provide an appropriate way to exploit context features and their corresponding dependency types.
When constructing the KVMN, the analysis result corresponding to the input sequence $x=(x_1,x_2,\ldots,x_n)$ is obtained first. For each $x_i$ in the input sequence, its context features and syntax information are mapped to the keys and values of the KVMN, denoted $K_i=[k_{i,1},\ldots,k_{i,j},\ldots,k_{i,m_i}]$ and $V_i=[v_{i,1},\ldots,v_{i,j},\ldots,v_{i,m_i}]$ respectively, where $m_i$ denotes the number of entries for $x_i$, and K and V denote the context features and the syntax information. Next, two embedding matrices map $k_{i,j}$ and $v_{i,j}$ to $e^k_{i,j}$ and $e^v_{i,j}$, respectively. For the context features $K_i$ and syntax information $V_i$ associated with each $x_i$, the weight assigned to the syntax information is:

$$\gamma_{i,j}=\frac{\exp(h_i^{\top}e^k_{i,j})}{\sum_{j=1}^{m_i}\exp(h_i^{\top}e^k_{i,j})}\quad(21)$$

where $h_i$ is the hidden vector of $x_i$ obtained from the sequence encoding layer. The weight $\gamma_{i,j}$ is applied to the corresponding syntax information $v_{i,j}$:

$$\alpha_i=\sum_{j=1}^{m_i}\gamma_{i,j}\,e^v_{i,j}\quad(22)$$

where $\alpha_i$ is the weighted syntax information of the KVMN corresponding to $x_i$. The KVMN thus ensures that the syntax information is weighted according to its corresponding context features, so that the important information is distinguished and used.
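The key-value weighting can be sketched as follows; the function name and dimensions are illustrative assumptions.

```python
import torch

# Key-value attention: key embeddings (context features) are scored
# against the hidden state h_i, softmax gives gamma_ij, and the value
# embeddings (syntax information) are summed with those weights.

def kvmn_attend(h_i, keys, values):
    """h_i: (d,); keys, values: (m, d) embedded context features / syntax."""
    gamma = torch.softmax(keys @ h_i, dim=0)   # gamma_ij over the m entries
    return gamma @ values                      # alpha_i: weighted syntax

d, m = 256, 3
alpha = kvmn_attend(torch.randn(d), torch.randn(m, d), torch.randn(m, d))
```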
3.3) Gating mechanism
To make better use of the syntax information encoded by the KVMN, the syntax vector $\alpha_i$ and the context vector $h_i$ are combined by dynamic weighting, using an evaluation function $\lambda_i$ to measure the contributions of the context vector $h_i$ and the syntax vector $\alpha_i$ to the sentence:

$$\lambda_i=\sigma(W_{\lambda 1}h_i+W_{\lambda 2}\alpha_i+b_\lambda)\quad(23)$$

where $W_{\lambda 1}$, $W_{\lambda 2}$ are trainable matrices and $b_\lambda$ is the bias term. The syntax vector $\alpha_i$ and the context vector $h_i$ are then combined:

$$O_i=\lambda_i\odot h_i+(1_l-\lambda_i)\odot\alpha_i\quad(24)$$

where $O_i$ is the feature vector of the KVMN corresponding to $x_i$, $1_l$ is an all-ones vector whose dimension l matches $h_i$, and $\odot$ denotes element-wise (dot) product; $O_i$ fuses the context information with the syntax information.
(4) Sequence annotation using CRF
Compared with the HMM, the CRF imposes no strict independence assumptions, can effectively use the sequence and external observation information, and avoids the label bias produced by directly assuming the labels. The CRF can also capture dependencies between labels: for example, the "I-ORG" tag cannot follow "B-LOC". In CNER, the input of the CRF is the context feature vector $O_i$ learned by the syntax information layer. For an input sequence $x=(x_1,x_2,\ldots,x_n)$, let $P_{i,j}$ denote the probability score of the j-th tag for the i-th character of the sentence. For a predicted sequence $y=(y_1,y_2,\ldots,y_n)$, the score of the predicted sequence is calculated as:

$$s(x,y)=\sum_{i=0}^{n}M_{y_i,y_{i+1}}+\sum_{i=1}^{n}P_{i,y_i}\quad(25)$$

where M is the transition matrix, $M_{i,j}$ denotes the transition score from tag i to tag j, and $y_0$ and $y_{n+1}$ denote the start and end tags, respectively. The probability that the predicted sequence y is generated is calculated with the softmax function:

$$p(y\mid x)=\frac{\exp(s(x,y))}{\sum_{\tilde{y}\in Y_x}\exp(s(x,\tilde{y}))}\quad(26)$$

where $Y_x$ denotes the set of all possible predicted sequences. During training, the log-likelihood of the correct predicted sequence is maximized:

$$\log(p(y\mid x))=s(x,y)-\log\Big(\sum_{\tilde{y}\in Y_x}\exp(s(x,\tilde{y}))\Big)\quad(27)$$

After training, the labeling result is the highest-scoring predicted sequence:

$$y^*=\arg\max_{\tilde{y}\in Y_x}s(x,\tilde{y})\quad(28)$$
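A hand-rolled sketch of the linear-chain CRF quantities above follows: the sequence score sums transition and emission terms, and the log of the softmax denominator is computed with the forward algorithm. START/END tag handling is simplified, and all names are illustrative assumptions.

```python
import torch

def sequence_score(P, y, M):
    """P: (n, k) emission scores; y: (n,) tag ids; M: (k, k) transitions."""
    s = P[torch.arange(len(y)), y].sum()                 # emission part
    s += M[y[:-1], y[1:]].sum()                          # transition part
    return s

def log_partition(P, M):
    """log sum over y' of exp(s(x, y')) via the forward algorithm."""
    alpha = P[0]                                         # (k,) at position 0
    for t in range(1, P.size(0)):
        alpha = torch.logsumexp(alpha.unsqueeze(1) + M, dim=0) + P[t]
    return torch.logsumexp(alpha, dim=0)

n, k = 5, 4
P, M = torch.randn(n, k), torch.randn(k, k)
y = torch.randint(0, k, (n,))
loss = log_partition(P, M) - sequence_score(P, y, M)     # negative log-lik.
```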
(5) Evaluation of effects
To verify the effectiveness of the method of the present invention and compare it with other models, three Chinese named entity data sets were used for the experiments: the Weibo data set, the Resume data set, and the MSRA data set. The experimental indexes are precision, recall, and F1 value.
The Weibo data set comes from social-network microblogs and comprises four entity types: person name, place name, organization name, and geopolitical entity. The MSRA data set comes from the news domain, includes only a training set and a test set, and comprises three entity types: person name, place name, and organization. The Resume data set consists of resume information from Sina Finance and contains eight entity types, including country, educational institution, place name, person name, organization name, profession, ethnic background, and job title. The details of each data set are shown in Table 1.
Table 1 Data set details
To verify the effectiveness of the method, the invention compares the following three models as baselines:
(1) Lattice-LSTM: introduces a word-cell structure and fuses the information of all words ending with the current character;
(2) FLAT: improves on the Transformer structure, designs an ingenious position encoding to fuse the lattice structure, and improves absolute position encoding to make it more suitable for the NER task;
(3) SoftLexicon: simply utilizes the lexicon at the input representation layer and has high portability.
The evaluation criteria are as follows: the experiments use precision (Precision), recall (Recall) and the F1 value (F1 score). Precision is the proportion of recognized entity words that are correct; Recall is the proportion of all entities in the data set that are correctly recognized. Since precision and recall are often inversely related, F1 is their harmonic mean. The evaluation functions are:

$$Precision=\frac{TP}{TP+FP}$$

$$Recall=\frac{TP}{TP+FN}$$

$$F1=\frac{2\times Precision\times Recall}{Precision+Recall}$$

where TP, FP and FN denote true positives, false positives and false negatives, respectively.
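Entity-level computation of these metrics can be sketched as follows; the entity representation (type, start, end) and the function name are illustrative assumptions.

```python
# Precision/recall/F1 over predicted and gold entity sets, following the
# formulas above; entities are (type, start, end) triples.

def prf1(pred_entities, gold_entities):
    tp = len(pred_entities & gold_entities)
    p = tp / len(pred_entities) if pred_entities else 0.0
    r = tp / len(gold_entities) if gold_entities else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

gold = {("ORG", 0, 3), ("PER", 5, 7)}
pred = {("ORG", 0, 3), ("LOC", 9, 11)}
print(prf1(pred, gold))   # (0.5, 0.5, 0.5)
```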
the hyper-parameters used by the model are shown in table 2:
TABLE 2 hyper-parameter configuration
Figure BDA0003662008680000125
The experimental results of the three baseline models and the model of the invention on the respective data sets are shown in Table 3, with the best result for each data set shown in bold.
Table 3 Experimental results on the different data sets
As the results in Table 3 show, the proposed LSF-CNER outperforms the other methods on the Weibo and Resume data sets.
On the Weibo data set, which is noisier and has no fixed format, LSF-CNER achieves a better effect: compared with SoftLexicon, the F1 value is improved by 0.71%.
On the Resume data set, which has a relatively fixed format and less noise, the F1 value of LSF-CNER is improved by 0.11% compared with SoftLexicon.
On the MSRA data set, which contains more formal text, the F1 value of LSF-CNER is 0.01% lower than SoftLexicon, indicating that the model has reached saturation: SoftLexicon already fits the large-scale MSRA and Resume data sets well, so adding feature information cannot obviously improve the model, whereas on small-scale data sets that lack feature information, syntactic and lexical information can be used to improve model performance.
Further, to verify the general utility of the LSF-CNER model, the vocabulary information of each character is combined with BERT word vectors as the output of the input representation layer and fed to the sequence encoding layer. Table 4 shows the results of the experiments with BERT, reported as F1 values. The average F1 value of the method on the different data sets is improved by 4.68% compared with BERT-Tagger, by 2.48% compared with BERT+BiLSTM+CRF, and by 0.89% compared with SoftLexicon+BERT. On the Weibo data set in particular, the improvement is more remarkable.
Table 4 Experimental results with BERT (F1, %)
The experimental results show that combining the pre-training model can further improve the effect of the input representation layer. The method outperforms the traditional CNER methods on the Resume, Weibo and MSRA data sets, verifying that fusing lexical information and syntactic information is effective.
To study the separate contributions of the dynamic word set and the syntactic information to the entity recognition task, ablation experiments were performed on the Resume data set with the model set to five cases:
(1) BERT + LSTM + CRF: an initial model that does not include external vocabulary information and syntax information;
(2) SoftLexicon + BERT: including external vocabulary information;
(3) SoftLexicon + Attention + BERT: introduces an attention mechanism on the basis of SoftLexicon to dynamically adjust the weights of different words in the word set;
(4) syntact + BERT: contains syntactic information but no external vocabulary information;
(5) LSF + BERT: containing syntactic information, and improved word set information.
Table 5 Ablation test results (%)

Model                         P      R      F1
BERT+BiLSTM+CRF               95.75  95.28  95.51
SoftLexicon+BERT              96.08  96.13  96.11
SoftLexicon+Attention+BERT    96.19  96.35  96.27
Syntactic+BERT                95.86  96.08  95.97
LSF+BERT                      96.73  96.45  96.59
As shown in Table 5, when the attention mechanism is introduced into the SoftLexicon+BERT model to adjust the weights of the word set vectors, the average F1 value rises by 0.16%; when syntax information is introduced into the BERT+BiLSTM+CRF model, the average F1 value is improved by 0.46%; and when the syntax information and the static and dynamic word set information are included at the same time, the average F1 value is improved by 0.48% over SoftLexicon+BERT. These results show that the static and dynamic word set information and the syntax information each help to improve model performance, and that using both together further improves the recognition precision of Chinese named entities. Dynamically adjusting the weights of words in a word set and introducing syntactic information thus help the model recognize the importance of different words in a sentence, improving the accuracy of Chinese named entity recognition.
In summary, the invention provides a Chinese named entity recognition method fusing vocabulary and syntax information. The new word set matching method integrates static and dynamic word set information at the input representation layer, and a gating mechanism then dynamically weights the output of the sequence encoding layer against the syntax information. This considers both the potential boundaries of Chinese named entities and the potential syntactic information in the sentence, fusing the two kinds of information in a more balanced way and improving the performance of the model. Experimental results on three CNER data sets show that the new method performs well compared with mainstream methods.
Referring to fig. 5, the present invention further includes a system for recognizing a named entity in chinese that fuses lexical and syntactic information, including:
an input representation acquisition module 501, configured to map the original input text into character vectors, introduce external vocabulary information with an improved word set matching algorithm, and integrate it into the input representation of each character;
a context vector acquisition module 502, configured to extract context information with a bidirectional LSTM from the input representation of each character;
a context information and syntax information fusion module 503, configured to obtain part-of-speech tags and syntactic constituents from the original input text with an NLP tool, construct a syntax vector with a key-value memory network, and weighted-fuse the context vector and the syntax vector through a gating mechanism to obtain a feature vector;
a sequence labeling module 504, configured to feed the feature vector into the conditional random field (CRF) of the label prediction layer to realize Chinese named entity recognition.
The system for Chinese named entity recognition fusing vocabulary and syntax information concretely implements the above method for Chinese named entity recognition fusing vocabulary and syntax information; the details are not repeated in this embodiment.
The above description is only an embodiment of the present invention, but the design concept of the present invention is not limited thereto; any insubstantial modification made to the present invention using this concept shall fall within the scope of protection of the present invention.

Claims (6)

1. A Chinese named entity recognition method fusing vocabulary and syntax information, characterized by comprising the following steps:
step 1, mapping the original input text into character vectors, introducing external vocabulary information with an improved word set matching algorithm, and integrating it into the input representation of each character;
step 2, extracting context information with a bidirectional LSTM from the input representation of each character;
step 3, obtaining part-of-speech tags and syntactic constituents from the original input text with an NLP tool, constructing a syntax vector with a key-value memory network, and weighted-fusing the context vector and the syntax vector through a gating mechanism to obtain a feature vector;
and step 4, feeding the feature vector into the conditional random field (CRF) of the label prediction layer to realize Chinese named entity recognition.
2. The method for Chinese named entity recognition fusing vocabulary and syntax information according to claim 1, wherein said step 1 specifically comprises:
step 1.1, regarding the input text as a sentence expressed as a sequence $x=(x_1,x_2,\ldots,x_n)$, where $x_i$ denotes the i-th character of sentence x of length n; to better utilize lexical information, dividing the result of matching each character against the dictionary into the following four "BIES" word sets:
(1) word set $B(x_i)$, containing all words in x that begin with $x_i$;
(2) word set $I(x_i)$, containing all words in x in which $x_i$ occurs in the middle;
(3) word set $E(x_i)$, containing all words in x that end with $x_i$;
(4) word set $S(x_i)$, containing the single-character word composed of $x_i$;
step 1.2, after the "BIES" word sets of each character are obtained, compressing each word set into a fixed-dimensional vector; the improved word set matching algorithm comprises a static word set algorithm and a dynamic word set algorithm, wherein, to ensure computational efficiency, the static word set algorithm uses the frequency of word occurrence to represent the corresponding weight, and the static word set vector of a single word set is computed as:

$$v^{ws}(T)=\frac{1}{Z}\sum_{j=1}^{m} z(w_j^T)\,e^w(w_j^T)$$

where $z(w_j^T)$ denotes the number of occurrences of word $w_j^T$ in the corpus; $Z=\sum_{j=1}^{m}z(w_j^T)$ denotes the total number of occurrences of the words in the word set T; $e^w(w_j^T)$ maps word $w_j^T$ into a word vector; T denotes one of the four "BIES" word sets; and $v^{ws}(T)$ denotes the vector representation of the word set T corresponding to character $x_i$;
for better information retention, the four static word sets are represented as a whole and integrated into a fixed-dimensional vector by splicing:

$$\tau_i=[v^{ws}(B);v^{ws}(I);v^{ws}(E);v^{ws}(S)]$$

where $\tau_i$ denotes the static word set vector corresponding to character $x_i$;
the dynamic word set algorithm uses an attention mechanism to measure the information between characters and matched words, computing the attention weights of the different matched words so as to strengthen useful words and suppress words whose effect is not obvious, as follows:

$$u_j=q\cdot e^w(w_j^T)$$

$$a_j=\frac{\exp(u_j)}{\sum_{k=1}^{m}\exp(u_k)}$$

$$Av^{ws}(T)=\sum_{j=1}^{m}a_j\,e^w(w_j^T)$$

where $e^w(w_j^T)$ maps word $w_j^T$ into a word vector; q is a trainable vector with the same dimension as $e^w(w_j^T)$; $u_j$ is the word attention score obtained by the attention mechanism; $a_j$ is the normalized word attention weight; $Av^{ws}(T)$ denotes the dynamic word set vector of a single word set; and m denotes the number of words matched in the word set T corresponding to character $x_i$;
the dynamic word set vector is obtained by weighted summation over the attention weights, and the four dynamic word sets are represented as a whole and compressed into a fixed-dimensional vector:

$$A\tau_i=[Av^{ws}(B);Av^{ws}(I);Av^{ws}(E);Av^{ws}(S)]$$

where $A\tau_i$ denotes the dynamic word set vector corresponding to character $x_i$;
step 1.3, to fully consider the importance of each word in the two word sets, combining the dynamic word set vector $A\tau_i$ and the static word set vector $\tau_i$ by dynamic weighting, using an evaluation function $\theta_i$ to measure the contributions of the static word set vector $\tau_i$ and the dynamic word set vector $A\tau_i$ to the entity recognition task:

$$\theta_i=\sigma(W_{\theta 1}\tau_i+W_{\theta 2}A\tau_i+b_\theta)$$

where $W_{\theta 1}$, $W_{\theta 2}$ are trainable matrices and $b_\theta$ is a bias term;
combining the character vector with the static word set vector $\tau_i$ and the dynamic word set vector $A\tau_i$ as the input representation finally containing the external vocabulary information:

$$x_i^{final}=e^x(x_i)\oplus\big(\theta_i\odot\tau_i+(1_l-\theta_i)\odot A\tau_i\big)$$

where $x_i^{final}$ denotes the final vector representation of character $x_i$; $1_l$ is an all-ones vector whose dimension l matches $\theta_i$; $e^x$ converts character $x_i$ into the corresponding character vector; $\odot$ denotes element-wise (dot) product; and $\oplus$ denotes vector splicing.
3. The method for Chinese named entity recognition fusing vocabulary and syntax information according to claim 2, wherein said step 2 specifically comprises:
adopting a bidirectional LSTM in the sequence encoding layer to obtain the context vector of each character, where the bidirectional LSTM is the combination of a forward LSTM and a backward LSTM; $\overrightarrow{h_i}$ denotes the hidden state of the forward LSTM at time i, and $\overleftarrow{h_i}$ denotes the hidden state of the backward LSTM at time i; the final context vector $h_i=[\overrightarrow{h_i};\overleftarrow{h_i}]$ is obtained by concatenating the corresponding forward and backward LSTM states.
4. The method for Chinese named entity recognition fusing vocabulary and syntax information according to claim 3, wherein said step 3 specifically comprises:
step 3.1, segmenting the original text with the Stanford CoreNLP tool, and extracting two kinds of syntactic information, namely the part-of-speech tags and the syntactic constituent information of the input text, with the Berkeley Neural Parser; the part-of-speech tag represents the tag information of a single word, and the syntactic constituent represents the structural grouping information of a text span;
step 3.2, for each $x_i$ in the input sequence, mapping its context features and syntax information to the keys and values of the key-value memory network KVMN, denoted $K_i=[k_{i,1},\ldots,k_{i,j},\ldots,k_{i,m_i}]$ and $V_i=[v_{i,1},\ldots,v_{i,j},\ldots,v_{i,m_i}]$, respectively;
step 3.3, mapping $k_{i,j}$ and $v_{i,j}$ to $e^k_{i,j}$ and $e^v_{i,j}$ with two embedding matrices; for the context features $K_i$ and syntax information $V_i$ associated with each $x_i$, the weight $\gamma_{i,j}$ assigned to the syntax information is:

$$\gamma_{i,j}=\frac{\exp(h_i^{\top}e^k_{i,j})}{\sum_{j=1}^{m_i}\exp(h_i^{\top}e^k_{i,j})}$$

where $h_i$ is the hidden vector of $x_i$ obtained from the sequence encoding layer and $h_i^{\top}$ is its transposed form;
the weight $\gamma_{i,j}$ is applied to the corresponding syntax information $v_{i,j}$ as follows:

$$\alpha_i=\sum_{j=1}^{m_i}\gamma_{i,j}\,e^v_{i,j}$$

where $\alpha_i$ is the weighted syntax information of the KVMN corresponding to $x_i$; the KVMN can thus ensure that the syntax information is weighted according to the corresponding context features, so that the important information is distinguished and used;
step 3.4, to make better use of the syntax information encoded by the KVMN, combining the syntax vector $\alpha_i$ and the context vector $h_i$ by dynamic weighting, using an evaluation function $\lambda_i$ to measure the contributions of the context vector $h_i$ and the syntax vector $\alpha_i$ to the sentence:

$$\lambda_i=\sigma(W_{\lambda 1}h_i+W_{\lambda 2}\alpha_i+b_\lambda)$$

where $W_{\lambda 1}$ and $W_{\lambda 2}$ are trainable matrices; $b_\lambda$ is a bias term; $\sigma$ denotes the sigmoid activation function;
the syntax vector $\alpha_i$ and the context vector $h_i$ are then combined:

$$O_i=\lambda_i\odot h_i+(1_l-\lambda_i)\odot\alpha_i$$

where $1_l$ is an all-ones vector whose dimension l matches $h_i$, and the resulting $O_i$ is the feature vector fusing the context information and the syntax information.
5. The method for recognizing named entities in chinese with fused lexical and syntactic information as claimed in claim 4, wherein said step 4 specifically comprises:
for an input sequence x ═ x 1 ,x 2 ,..,x n ) Given a prediction sequence y ═ y (y) 1 ,y 2 ,y 3 ,...,y n ) The score for the predicted sequence is calculated as follows:
Figure FDA0003662008670000041
where, M is the transition matrix,
Figure FDA0003662008670000042
indicating slave label y i To the label y i+1 The transfer fraction of (a);
Figure FDA0003662008670000043
y for i word in sentence i ' probability scores of tags; y is 0 And y n+1 Respectively representing a start and end tag; n represents the length of the input sequence;
the probability p (y | x) of the predicted sequence y generation is calculated using the softmax function as follows:
Figure FDA0003662008670000044
wherein, Y x For solution space, represent the set of all possible predicted sequences for y;
Figure FDA0003662008670000045
is an example of the entire sequence tagged solution space,
Figure FDA0003662008670000046
a score representing the instance;
during the training process, the log-likelihood $\log(p(y \mid x))$ of the correct predicted sequence is maximized as follows:

$$\log(p(y \mid x)) = \mathrm{score}(x, y) - \log\Big(\sum_{\tilde{y} \in Y_x} \exp(\mathrm{score}(x, \tilde{y}))\Big)$$
after continuous iterative training and back propagation, the obtained $y^*$ is the CRF sequence labeling result, i.e. the output of the final model:

$$y^* = \arg\max_{\tilde{y} \in Y_x} \mathrm{score}(x, \tilde{y})$$

where the training process continuously maximizes the objective function $\log(p(y \mid x))$.
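As a rough illustration of the scoring in step 4, the sketch below computes $\mathrm{score}(x, y)$ by summing transition scores along the label path (padded with start and end labels) and the per-position tag scores; the emissions, transitions, and label ids are random stand-ins, an assumption for the sake of a runnable example, not the trained model:

```python
# Toy sketch of the CRF path score: transitions M plus emissions P,
# with explicit start/end labels appended to the path (stand-in data).
import numpy as np

def crf_score(P, M, y, start, end):
    """P: (n, L) emission scores; M: transition matrix; y: length-n label ids."""
    path = [start] + list(y) + [end]
    trans = sum(M[path[i], path[i + 1]] for i in range(len(path) - 1))
    emit = sum(P[t, label] for t, label in enumerate(y))
    return trans + emit

rng = np.random.default_rng(1)
n, L = 6, 4                                  # labels 0..3, start=4, end=5
P = rng.normal(size=(n, L))
M = rng.normal(size=(L + 2, L + 2))
print(crf_score(P, M, [0, 1, 1, 2, 3, 0], start=L, end=L + 1))
```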
6. A Chinese named entity recognition system fusing lexical and syntactic information, comprising:
the input representation acquisition module is used for mapping an original input text into a word vector, introducing external vocabulary information by using an improved word set matching algorithm and integrating the external vocabulary information into the input representation of each word;
a context vector acquisition module for extracting context information using a bidirectional LSTM according to the input representation of the word;
the context information and syntax information fusion module is used for acquiring part-of-speech tags and syntactic components from the original input text by using NLP tools, constructing a syntax vector by using a key-value memory network, and performing weighted fusion of the context vector and the syntax vector by using a gating mechanism to obtain a feature vector;
and the sequence labeling module is used for inputting the feature vectors into a conditional random field (CRF) of the label prediction layer to realize Chinese named entity recognition.
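A hypothetical end-to-end skeleton mirroring the four claimed modules might look as follows; this assumes PyTorch, all names and sizes are illustrative, and the word-set matching, KVMN, and CRF components are abstracted behind their inputs and outputs rather than implemented:

```python
# Hypothetical module skeleton for the claimed system (assumption: PyTorch;
# names, sizes, and wiring are illustrative, not the patent's actual code).
import torch
import torch.nn as nn

class LexSynNER(nn.Module):
    def __init__(self, vocab_size=5000, embed_dim=100, hidden_dim=64, num_labels=9):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)     # input representation
        self.encoder = nn.LSTM(embed_dim, hidden_dim,
                               bidirectional=True, batch_first=True)  # context vectors
        self.gate = nn.Linear(4 * hidden_dim, 2 * hidden_dim)    # fusion gate
        self.emit = nn.Linear(2 * hidden_dim, num_labels)        # scores fed to a CRF

    def forward(self, chars, syntax):
        # chars: (B, T) token ids; syntax: (B, T, 2*hidden_dim) KVMN output alpha_i
        h, _ = self.encoder(self.embedding(chars))
        lam = torch.sigmoid(self.gate(torch.cat([h, syntax], dim=-1)))
        fused = lam * h + (1 - lam) * syntax                     # O_i per position
        return self.emit(fused)                                  # CRF decoding follows
```

Note that a single linear layer over the concatenation $[h_i; \alpha_i]$ is algebraically equivalent to the claim's $W_{\lambda 1} h_i + W_{\lambda 2} \alpha_i$ formulation.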
CN202210575509.3A 2022-05-25 2022-05-25 Chinese named entity recognition method and system fusing vocabulary and syntax information Pending CN114818717A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210575509.3A CN114818717A (en) 2022-05-25 2022-05-25 Chinese named entity recognition method and system fusing vocabulary and syntax information

Publications (1)

Publication Number Publication Date
CN114818717A true CN114818717A (en) 2022-07-29

Family

ID=82517680

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210575509.3A Pending CN114818717A (en) 2022-05-25 2022-05-25 Chinese named entity recognition method and system fusing vocabulary and syntax information

Country Status (1)

Country Link
CN (1) CN114818717A (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109726389A (en) * 2018-11-13 2019-05-07 北京邮电大学 A kind of Chinese missing pronoun complementing method based on common sense and reasoning
WO2021082366A1 (en) * 2019-10-28 2021-05-06 南京师范大学 Interactive and iterative learning-based intelligent construction method for geographical name tagging corpus
CN112883732A (en) * 2020-11-26 2021-06-01 中国电子科技网络信息安全有限公司 Method and device for identifying Chinese fine-grained named entities based on associative memory network
CN113095074A (en) * 2021-03-22 2021-07-09 北京工业大学 Word segmentation method and system for Chinese electronic medical record
CN113609859A (en) * 2021-08-04 2021-11-05 浙江工业大学 Special equipment Chinese named entity recognition method based on pre-training model
CN114528840A (en) * 2022-01-21 2022-05-24 深圳大学 Chinese entity identification method, terminal and storage medium fusing context information

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115270803A (en) * 2022-09-30 2022-11-01 北京道达天际科技股份有限公司 Entity extraction method based on BERT and fused with N-gram characteristics
CN115774993A (en) * 2022-12-29 2023-03-10 广东南方网络信息科技有限公司 Conditional error identification method and device based on syntactic analysis
CN115774993B (en) * 2022-12-29 2023-09-08 广东南方网络信息科技有限公司 Condition type error identification method and device based on syntactic analysis
CN117077672A (en) * 2023-07-05 2023-11-17 哈尔滨理工大学 Chinese naming entity recognition method based on vocabulary enhancement and TCN-BILSTM model
CN117077672B (en) * 2023-07-05 2024-04-26 哈尔滨理工大学 Chinese naming entity recognition method based on vocabulary enhancement and TCN-BILSTM model
CN117521639A (en) * 2024-01-05 2024-02-06 湖南工商大学 Text detection method combined with academic text structure
CN117521639B (en) * 2024-01-05 2024-04-02 湖南工商大学 Text detection method combined with academic text structure

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination