CN113609861B - Multi-dimensional feature named entity recognition method and system based on food literature data

Info

Publication number: CN113609861B
Application number: CN202110913799.3A
Authority: CN
Inventors: 雷雪; 方德英; 张青川; 蔡圆媛
Original assignee: Beijing Technology and Business University
Current assignee: Beijing Technology and Business University
Priority date: 2021-08-10
Filing date: 2021-08-10
Publication date: 2024-02-23
Anticipated expiration: 2041-08-10
Also published as: CN113609861A

Abstract

The invention relates to a multi-dimensional feature named entity recognition method and system based on food literature data. The method comprises the following steps: S1: obtaining a corpus of food-field literature; S2: obtaining the character components and character pinyin of the food-field documents, and inputting them separately into a BiLSTM model to obtain a character component feature vector S and a character pinyin feature vector P; S3: pre-training a BERT model to obtain a trained pre-training model, and inputting the corpus obtained in step S1 into the trained pre-training model to obtain a character dimension feature vector; S4: inputting the character dimension feature vector, the character component feature vector and the character pinyin feature vector into a BiLSTM-based neural network model to obtain a feature vector fused with full-text semantic information; S5: inputting the feature vector fused with full-text semantic information into a CRF model to obtain the named entity recognition result. By adding character component features and pinyin features to the character dimension vector representation, the invention improves the accuracy of named entity recognition on food-field literature data.

Description

Multi-dimensional feature named entity recognition method and system based on food literature data

Technical Field

The invention relates to the field of natural language processing, in particular to a method and a system for identifying multidimensional feature named entities based on food literature data.

Background

As attention to the food field grows, food-related literature resources are increasing rapidly. Food-field literature is one of the main channels for presenting scientific research results; its content covers research purposes, research methods, experimental processes, research results, research significance and the like. Academic literature is a knowledge resource of high professional value: it takes a relatively standardized textual form and contains the technical terms, concepts and authoritative data of the food field. This text exists in unstructured form and contains a large number of food-specific entities. By modeling food-field document data, key entities can be extracted automatically and effective semantic knowledge obtained. The results can be applied to food research tasks such as entity relation extraction, automatic question answering, semantic web annotation and knowledge graph construction, and serve as a cornerstone for further research in natural language processing.

Early named entity recognition methods were mainly rule-based and dictionary-based. As corpora grow, however, the rules become increasingly cumbersome, and rule-based and dictionary-based methods become too time-consuming and labor-intensive. With the advent of the big data era, researchers applied traditional machine learning methods such as HMM, SVM and CRF to named entity recognition. Later, deep learning methods introduced neural network models for the task; more recently, attention mechanisms and transfer learning have been applied, and language pre-training with models such as BERT has been used to improve recognition accuracy. Because Chinese is diverse and lacks explicit delimiters between words, whether an entity can be accurately recognized in text depends mainly on two things: whether the entity boundary can be accurately determined, and whether the entity category can be accurately judged. How to better extract text features from Chinese corpora and obtain effective entity information has therefore become a technical problem in Chinese named entity recognition.

Disclosure of Invention

In order to solve the technical problems, the invention provides a multi-dimensional feature named entity identification method and system based on food literature data.

The technical scheme of the invention is as follows: a multi-dimensional feature named entity recognition method based on food literature data comprises the following steps:

step S1: acquiring food-field literature abstracts from the network by using crawler technology, and processing the abstracts by combining manual work with algorithms to obtain a corpus of food-field literature;

step S2: acquiring the character components and character pinyin of the food-field documents from the network by using crawler technology, and inputting the character components and the character pinyin separately into a BiLSTM model for encoding to obtain a character component feature vector S and a character pinyin feature vector P;

step S3: pre-training the BERT model on an open-domain corpus to obtain a trained pre-training model; inputting the corpus of food-field literature into the trained pre-training model for incremental training to obtain a character dimension feature vector Z;

step S4: inputting the character dimension feature vector Z, the character component feature vector S and the character pinyin feature vector P into a BiLSTM-based neural network model to obtain a feature vector fused with full-text semantic information;

step S5: inputting the feature vector fused with full-text semantic information into a CRF model, calculating the label result, and finally obtaining the named entity recognition result.

Compared with the prior art, the invention has the following advantages:

The invention discloses a multi-dimensional feature named entity recognition method based on food literature data. The method obtains an enhanced semantic representation of food-field literature through a BERT model and generates a character dimension vector from the context of that representation. It also makes full use of the characteristics of Chinese characters, namely the semantic information carried by a character's pinyin and by its components, to obtain a component feature representation and a pinyin feature representation of each character. The three character-level representations are combined as the input of the named entity recognition model, so that corpus information is mined thoroughly at the level of individual characters. This avoids the incomplete feature extraction and loss of accuracy that result from unstructured text corpora lacking a standard form. The invention further combines BiLSTM and CRF to perform entity recognition on food-field document data. Because the invention fully considers the semantic information of Chinese food-field literature, its named entity recognition accuracy is high.

Drawings

FIG. 1 is a flowchart of a multi-dimensional feature named entity recognition method based on food literature data in an embodiment of the invention;

FIG. 2 is a flow chart of an entity identification method according to an embodiment of the invention;

FIG. 3 is a block diagram of a multi-dimensional feature named entity recognition system based on food literature data according to an embodiment of the invention.

Detailed Description

The invention provides a multi-dimensional feature named entity recognition method based on food literature data. It makes full use of the characteristic attributes of Chinese characters: by constructing a new named entity recognition model, it adds the component (radical) features and pinyin features of each character to the character dimension vector representation, thereby improving the accuracy of named entity recognition on food-field literature data.

The present invention will be further described in detail below with reference to the accompanying drawings by way of specific embodiments in order to make the objects, technical solutions and advantages of the present invention more apparent.

Example 1

As shown in fig. 1, the method for identifying a multi-dimensional feature named entity based on food literature data provided by the embodiment of the invention comprises the following steps:

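step S1: acquiring food-field literature abstracts from the network by using crawler technology, and processing the abstracts by combining manual work with algorithms to obtain a corpus of food-field literature;

step S2: acquiring the character components and character pinyin of the food-field documents from the network by using crawler technology, and inputting the character components and the character pinyin separately into a BiLSTM model for encoding to obtain a character component feature vector S and a character pinyin feature vector P;

step S3: pre-training the BERT model on an open-domain corpus to obtain a trained pre-training model; inputting the corpus of food-field literature into the trained pre-training model for incremental training to obtain a character dimension feature vector Z;

step S4: inputting the character dimension feature vector Z, the character component feature vector S and the character pinyin feature vector P into a BiLSTM-based neural network model to obtain a feature vector fused with full-text semantic information;

step S5: inputting the feature vector fused with full-text semantic information into a CRF model, calculating the label result, and finally obtaining the named entity recognition result.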

In one embodiment, step S1 described above: obtaining a food field literature abstract on a network by utilizing a crawler technology, and performing data processing work on the food field literature abstract in a mode of combining manpower and algorithm to obtain corpus of food field literature, wherein the method specifically comprises the following steps:

On a number of academic websites such as China National Knowledge Infrastructure (CNKI), Python crawler technology is used to crawl the abstracts of documents on food-related topics such as food nutrition, food traceability, food logistics and the food cold chain. The abstracts are then processed by combining manual work with automated processing, and a database of food-field literature is established, thereby obtaining the corpus of food-field literature.
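The text does not say which processing rules are applied to the crawled abstracts. As a minimal sketch of what the automated part of that step might look like, the function below normalizes characters, strips leftover markup, and drops duplicates and very short entries. The function name `clean_abstracts`, the `min_chars` threshold and the rules themselves are assumptions made for illustration, not part of the described method.

```python
import re
import unicodedata


def clean_abstracts(raw_abstracts, min_chars=20):
    """Normalize, filter and de-duplicate crawled abstracts (hypothetical rules)."""
    seen = set()
    corpus = []
    for text in raw_abstracts:
        # NFKC folds full-width letters and digits into their ASCII forms.
        text = unicodedata.normalize("NFKC", text)
        # Drop leftover HTML tags, then collapse runs of whitespace.
        text = re.sub(r"<[^>]+>", "", text)
        text = re.sub(r"\s+", " ", text).strip()
        # Skip entries that are too short or already seen.
        if len(text) < min_chars or text in seen:
            continue
        seen.add(text)
        corpus.append(text)
    return corpus
```

Entries that survive this pass would still go through the manual review that the step describes.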

In one embodiment, step S2 above: the method comprises the steps of obtaining character components and character pinyin of food field documents on a network by utilizing a crawler technology, respectively inputting the character components and the character pinyin into a BiLSTM model for encoding to obtain character component feature vectors S and character pinyin feature vectors P, and specifically comprises the following steps:

Chinese characters carry multi-dimensional features: the meaning of a character is related to its components and to its pinyin. Python crawler technology is therefore first used to obtain the character components and character pinyin of the food-field documents from websites such as Baidu's online dictionary, and the character components and the character pinyin are each input into a separate BiLSTM model for encoding to obtain the character component feature vector S and the character pinyin feature vector P;

the character component feature vector S = [s_1, s_2, s_3, ..., s_n] uses the components of Chinese characters to represent their meaning indirectly, yielding the character components related to food, wherein s_i is a character component vector associated with food; for example, the components associated with food that are obtained include the "mouth" and "strikethrough" character components;

the character pinyin feature vector P = [p_1, p_2, p_3, ..., p_m] captures the effective semantic information contained in Chinese pinyin, wherein p_i is a character pinyin vector associated with food. Introducing pinyin is equivalent to introducing additional information related to food; for example, "food" can be divided into "shi2" and "pin3" (the digits denote the tones).
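Before a BiLSTM can encode pinyin or components, the symbols have to be turned into integer ids. The sketch below shows one possible preparation for tone-numbered pinyin; the helper names (`split_pinyin`, `build_vocab`, `encode`), the use of tone 0 for a missing tone, and the reserved padding id are assumptions for illustration. Character components could be indexed the same way.

```python
def split_pinyin(token):
    """Split a tone-numbered syllable such as 'shi2' into ('shi', 2)."""
    if token and token[-1].isdigit():
        return token[:-1], int(token[-1])
    return token, 0  # assumption: 0 marks a missing or neutral tone


def build_vocab(items):
    """Map each distinct item to an integer id; id 0 is reserved for padding."""
    vocab = {"<pad>": 0}
    for item in items:
        vocab.setdefault(item, len(vocab))
    return vocab


def encode(sequence, vocab):
    """Turn a sequence of symbols into the id sequence an encoder would consume."""
    return [vocab[s] for s in sequence]


# Pinyin of the two characters of "food", written with tone numbers.
pinyin = ["shi2", "pin3"]
syllables, tones = zip(*(split_pinyin(p) for p in pinyin))
syllable_vocab = build_vocab(syllables)
print(encode(syllables, syllable_vocab), list(tones))
```

The id sequences would then be embedded and passed to the pinyin BiLSTM; keeping the tone as a separate feature is a design choice, not something the text prescribes.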

In one embodiment, the step S3: pre-training the Bert model by using the open field corpus to obtain a trained pre-training model; inputting data in a database of food field documents into the trained Bert model for incremental training to obtain a character dimension feature vector Z, wherein the method specifically comprises the following steps of:

step S31: pre-training the Bert model by using corpus in the open field to obtain a 'Bert-Base-Uncased' pre-training model;

step S32: performing incremental training on the pre-training model by using the corpus of food-field literature from step S1, adding additional Chinese food-field features, and obtaining, based on the Bert model, the character dimension feature vector Z = [z_1, z_2, z_3, ..., z_k] of the food-field literature corpus.

The input to the trained "Bert-Base-Uncased" pre-training model consists of three parts: word vectors, text vectors, and position vectors. The pre-training model converts each character in the text into a one-dimensional vector by looking it up in a word vector table. The value of the text vector is learned automatically during model training and describes the global semantic information of the text. The position vector adds a different vector to characters at different positions so that they can be distinguished. The pre-training model takes these three vectors as input; its output is the feature vector Z, which for each input character fuses the semantic information of the full text.
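In BERT, the three input vectors for a position are added element-wise before entering the encoder. The following sketch makes that combination concrete with toy two-dimensional lookup tables; the table contents and the function names are made up for illustration, and a real model would use learned tables with hundreds of dimensions.

```python
def add_vectors(*vectors):
    """Element-wise sum of equally sized vectors."""
    return [sum(values) for values in zip(*vectors)]


def bert_input(token_ids, segment_ids, token_table, segment_table, position_table):
    """Sum the word, text (segment) and position vectors for each position."""
    return [
        add_vectors(token_table[tok], segment_table[seg], position_table[pos])
        for pos, (tok, seg) in enumerate(zip(token_ids, segment_ids))
    ]


# Toy 2-dimensional lookup tables (made-up numbers, for illustration only).
token_table = {0: [0.1, 0.2], 1: [0.3, 0.4]}
segment_table = {0: [0.0, 0.5]}
position_table = [[1.0, 0.0], [0.0, 1.0]]

inputs = bert_input([1, 0], [0, 0], token_table, segment_table, position_table)
```

Each entry of `inputs` is the vector the encoder receives for one character position.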

In one embodiment, step S4 above: inputting a character dimension feature vector Z, a character component feature vector S and the character pinyin feature vector P into a neural network model based on BiLSTM to obtain a feature vector fusing full-text semantic information, wherein the method specifically comprises the following steps of:

concatenating the character dimension vector representation Z, the character component feature vector S and the character pinyin feature vector P to obtain X = concat(Z, S, P), and inputting X into the BiLSTM neural network model given by the following formulas (1) to (6):

$$i_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1} + W_{ci} c_{t-1} + b_i) \tag{1}$$

$$f_t = \sigma(W_{xf} x_t + W_{hf} h_{t-1} + W_{cf} c_{t-1} + b_f) \tag{2}$$

$$g_t = \tanh(W_{xc} x_t + W_{hc} h_{t-1} + W_{cc} c_{t-1} + b_c) \tag{3}$$

$$c_t = i_t g_t + f_t c_{t-1} \tag{4}$$

$$o_t = \sigma(W_{xo} x_t + W_{ho} h_{t-1} + W_{co} c_{t-1} + b_o) \tag{5}$$

$$h_t = o_t \tanh(c_t) \tag{6}$$

wherein W and b denote the weight matrix and bias vector parameters; x_t is the input variable at time t; h_{t-1} is the hidden layer state at time t-1; h_t is the hidden layer state at time t; c_t is the state of the cell layer at time t; g_t is the candidate cell state computed from the tanh activation; i_t, f_t, o_t and c_t denote the activation vectors of the input gate, the forget gate, the output gate and the cell, respectively; σ is the sigmoid function; o_t is the output of the output gate of the BiLSTM at time t.
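To make the data flow of formulas (1) to (6) concrete, the sketch below computes one time step with plain Python lists. It follows the formulas as written, including the c_{t-1} terms in every gate; the parameter dictionary layout, the helper names and the all-zero parameters in the usage lines are assumptions for illustration only.

```python
import math


def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-a)) for a in v]


def tanh(v):
    return [math.tanh(a) for a in v]


def matvec(W, x):
    return [sum(w * a for w, a in zip(row, x)) for row in W]


def add(*vs):
    return [sum(t) for t in zip(*vs)]


def mul(a, b):
    return [p * q for p, q in zip(a, b)]


def lstm_step(x_t, h_prev, c_prev, p):
    """One step of formulas (1)-(6); p maps names such as 'W_xi' to parameters."""
    def pre(gate):  # W_x* x_t + W_h* h_{t-1} + W_c* c_{t-1} + b_*
        return add(matvec(p["W_x" + gate], x_t), matvec(p["W_h" + gate], h_prev),
                   matvec(p["W_c" + gate], c_prev), p["b_" + gate])
    i_t = sigmoid(pre("i"))                      # (1) input gate
    f_t = sigmoid(pre("f"))                      # (2) forget gate
    g_t = tanh(pre("c"))                         # (3) candidate cell state
    c_t = add(mul(i_t, g_t), mul(f_t, c_prev))   # (4) new cell state
    o_t = sigmoid(pre("o"))                      # (5) output gate
    h_t = mul(o_t, tanh(c_t))                    # (6) new hidden state
    return h_t, c_t


def zero_params(input_dim, hidden_dim):
    """All-zero parameters, used here only to make the arithmetic easy to follow."""
    p = {}
    for gate in "ifco":
        p["W_x" + gate] = [[0.0] * input_dim for _ in range(hidden_dim)]
        p["W_h" + gate] = [[0.0] * hidden_dim for _ in range(hidden_dim)]
        p["W_c" + gate] = [[0.0] * hidden_dim for _ in range(hidden_dim)]
        p["b_" + gate] = [0.0] * hidden_dim
    return p


h, c = lstm_step([1.0, -1.0], [0.0], [1.0], zero_params(2, 1))
```

With all-zero parameters every gate equals 0.5 and the candidate is 0, so the new cell state is half of the previous one. A trained model would of course use learned parameters.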

In this step, the character dimension vector serves as the primary feature and the character component and character pinyin features serve as auxiliary features. Together they form the input X that is used to train the BiLSTM neural network model. Because the model is bidirectional, the output at time t depends both on the characters before a given character in the text sequence and on the characters after it. The output of the model is the feature vector O fused with full-text semantic information, which fully represents the context of each character and effectively addresses the long-range dependence between two entities.
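The following sketch illustrates the two ideas in this step: building X by joining the three feature vectors of each character, and producing an output that depends on both the left and the right context. It assumes that Z, S and P have been aligned to one vector per character, and it uses a running sum as a stand-in recurrence so the data flow is easy to follow; a real model would use an LSTM step with separate weights for each direction.

```python
def concat_features(Z, S, P):
    """Per-character concatenation x_t = [z_t; s_t; p_t]."""
    return [z + s + p for z, s, p in zip(Z, S, P)]


def run_direction(xs, step, state):
    """Run a recurrent step over xs, returning one output per position."""
    outputs = []
    for x in xs:
        state = step(x, state)
        outputs.append(state)
    return outputs


def bidirectional(xs, step, init_state):
    """Concatenate left-to-right and right-to-left outputs at each position."""
    forward = run_direction(xs, step, init_state)
    backward = run_direction(xs[::-1], step, init_state)[::-1]
    return [f + b for f, b in zip(forward, backward)]


def running_sum(x, state):
    """Stand-in recurrence (not an LSTM): accumulates the sum of the inputs."""
    return [state[0] + sum(x)]


Z = [[1.0], [2.0], [3.0]]   # character dimension features (toy values)
S = [[0.0], [0.0], [0.0]]   # character component features (toy values)
P = [[0.0], [0.0], [0.0]]   # character pinyin features (toy values)
X = concat_features(Z, S, P)
O = bidirectional(X, running_sum, [0.0])
print(O)  # [[1.0, 6.0], [3.0, 5.0], [6.0, 3.0]]
```

The first half of each output summarizes everything up to that character and the second half summarizes everything from that character onward, which is the property the paragraph above relies on.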

The output O of the BiLSTM neural network model is input to a conditional random field (CRF). The CRF is a probabilistic model that uses the BIOES tagging scheme (B-begin, I-inside, E-end, S-single, O-outside) to score and label the feature vector O fused with full-text semantic information, yielding the named entity recognition result.
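As an illustration of the BIOES scheme, the sketch below converts entity spans over character positions into tags and back. The span format `(start, end_exclusive, label)` and the label name `food` are assumptions chosen for the example.

```python
def spans_to_bioes(length, spans):
    """spans: list of (start, end_exclusive, label) over character positions."""
    tags = ["O"] * length
    for start, end, label in spans:
        if end - start == 1:
            tags[start] = "S-" + label          # single-character entity
        else:
            tags[start] = "B-" + label          # first character
            for i in range(start + 1, end - 1):
                tags[i] = "I-" + label          # interior characters
            tags[end - 1] = "E-" + label        # last character
    return tags


def bioes_to_spans(tags):
    """Inverse of spans_to_bioes for well-formed tag sequences."""
    spans, start = [], None
    for i, tag in enumerate(tags):
        prefix, _, label = tag.partition("-")
        if prefix == "S":
            spans.append((i, i + 1, label))
        elif prefix == "B":
            start = i
        elif prefix == "E" and start is not None:
            spans.append((start, i + 1, label))
            start = None
    return spans


sentence = "食品冷链物流质量管理"
tags = spans_to_bioes(len(sentence), [(0, 6, "food")])
```

Here the first six characters (food cold-chain logistics) form one entity, so they receive one B tag, four I tags and one E tag, and the remaining characters are tagged O.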

In one embodiment, the calculation formula (7) for calculating the label result using the CRF model in S5 is as follows:

wherein Score(X, y) is the score of the output sequence, X is the input sentence, and y = [y_1, y_2, y_3, ..., y_n] is the corresponding output label sequence. The CRF model consists of two parts, matrix A and matrix B: the BiLSTM output O = [o_1, o_2, o_3, ..., o_n] is passed through a fully connected layer to obtain the output matrix A = [a_1, a_2, a_3, ..., a_n], and B is the transition matrix between the tag y_t at time t and the tag y_{t+1} at time t+1;
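The body of formula (7) is not reproduced in this text. A form that is consistent with the definitions of A and B given here, and that is commonly used for BiLSTM-CRF taggers, is shown below. It is an assumed reconstruction, not the formula as filed; in it, A_{t, y_t} denotes the entry of a_t for tag y_t and B_{y_t, y_{t+1}} denotes the transition score from tag y_t to tag y_{t+1}.

$$\mathrm{Score}(X, y) = \sum_{t=1}^{n} A_{t, y_t} + \sum_{t=1}^{n-1} B_{y_t, y_{t+1}}$$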

Finally, the output sequence is computed with the Viterbi algorithm to obtain the predicted label result y, and thus the named entity recognition result.
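A minimal sketch of Viterbi decoding is given below. It assumes that the sequence score is the sum of per-position entries of A and of transition entries of B between adjacent tags, which is the usual BiLSTM-CRF form; the function name and the toy matrices are illustrative only.

```python
def viterbi(A, B):
    """A[t][j]: score of tag j at position t; B[i][j]: score of moving from i to j.

    Returns the tag index sequence y that maximizes
    sum_t A[t][y_t] + sum_t B[y_t][y_{t+1}], together with that score.
    """
    n_tags = len(A[0])
    best = list(A[0])          # best score of any path ending in each tag
    backpointers = []
    for t in range(1, len(A)):
        new_best, pointers = [], []
        for j in range(n_tags):
            # Best previous tag i for arriving at tag j.
            i = max(range(n_tags), key=lambda k: best[k] + B[k][j])
            pointers.append(i)
            new_best.append(best[i] + B[i][j] + A[t][j])
        best = new_best
        backpointers.append(pointers)
    # Trace back from the best final tag.
    last = max(range(n_tags), key=lambda k: best[k])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return path[::-1], best[last]


# Toy example with two tags; the transition 0 -> 1 is heavily penalized.
A = [[2.0, 0.0], [0.0, 1.0], [0.0, 3.0]]
B = [[0.0, -5.0], [0.0, 0.0]]
path, score = viterbi(A, B)
print(path, score)  # [1, 1, 1] 4.0
```

Although tag 0 scores highest at the first position, the penalty on switching from tag 0 to tag 1 makes the all-ones path the best overall, which shows why decoding considers the whole sequence rather than each position separately.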

In this embodiment of the invention, named entities are divided into four types according to the characteristics of food-field literature data: food object, model, method, and result. For example, consider the input sentence: "An innovative research approach, the comprehensive system intervention method, is selected to address the problems in the quality management of food cold-chain logistics, so as to arrive at a research proposal for improving the quality management system of food cold-chain logistics." Using the multi-dimensional feature named entity recognition method, three types of named entities are recognized: the method is "comprehensive system intervention method", the food object is "food cold-chain logistics", and the result is "research proposal".

As shown in fig. 2, a flowchart of a named entity recognition method is illustrated.

Example 2

As shown in fig. 3, an embodiment of the present invention provides a multi-dimensional feature named entity recognition system based on food literature data, which includes the following modules:

the literature corpus acquisition module 21, which is used for acquiring food-field literature abstracts from the network by using crawler technology, and processing the abstracts by combining manual work with algorithms to obtain a corpus of food-field literature;

the character component and character pinyin feature vector acquisition module 22, which is used for acquiring the character components and character pinyin of the food-field documents from the network by using crawler technology, and inputting the character components and the character pinyin into the BiLSTM model for encoding to obtain a character component feature vector S and a character pinyin feature vector P;

the character dimension feature vector acquisition module 23, which is used for pre-training the BERT model on an open-domain corpus to obtain a trained pre-training model, and inputting the corpus of food-field literature into the trained pre-training model for incremental training to obtain a character dimension feature vector Z;

the fused full-text semantic feature vector acquisition module 24, which is used for inputting the character dimension feature vector Z, the character component feature vector S and the character pinyin feature vector P into a BiLSTM-based neural network model to obtain a feature vector fused with full-text semantic information;

the named entity recognition result acquisition module 25, which is used for inputting the feature vector fused with full-text semantic information into the CRF model, calculating the label result, and finally obtaining the named entity recognition result.

The above examples are provided for the purpose of describing the present invention only and are not intended to limit the scope of the present invention. The scope of the invention is defined by the appended claims. Various equivalents and modifications that do not depart from the spirit and principles of the invention are intended to be included within the scope of the invention.

Claims

1. A multi-dimensional feature named entity recognition method based on food literature data is characterized by comprising the following steps:

step S2: the method comprises the steps of obtaining character components and character pinyin of food field documents on a network by utilizing a crawler technology, respectively inputting the character components and the character pinyin into a BiLSTM model for encoding to obtain character component feature vectors S and character pinyin feature vectors P, and specifically comprises the following steps:

obtaining the character component feature vector S = [s_1, s_2, s_3, ..., s_n], which uses the components of Chinese characters to represent their meaning indirectly and yields the character components related to food, wherein s_i is a character component vector associated with food;

obtaining the character pinyin feature vector P = [p_1, p_2, p_3, ..., p_m], which captures the effective semantic information contained in Chinese pinyin, wherein p_i is a character pinyin vector associated with food;

step S4: inputting the character dimension feature vector Z, the character component feature vector S and the character pinyin feature vector P into a neural network model based on BiLSTM to obtain a feature vector fusing full-text semantic information, wherein the method specifically comprises the following steps of:

$$i_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1} + W_{ci} c_{t-1} + b_i) \tag{1}$$

$$f_t = \sigma(W_{xf} x_t + W_{hf} h_{t-1} + W_{cf} c_{t-1} + b_f) \tag{2}$$

$$g_t = \tanh(W_{xc} x_t + W_{hc} h_{t-1} + W_{cc} c_{t-1} + b_c) \tag{3}$$

$$c_t = i_t g_t + f_t c_{t-1} \tag{4}$$

$$o_t = \sigma(W_{xo} x_t + W_{ho} h_{t-1} + W_{co} c_{t-1} + b_o) \tag{5}$$

$$h_t = o_t \tanh(c_t) \tag{6}$$

wherein W and b denote the weight matrix and bias vector parameters; x_t is the input variable at time t, x_t depending on the word preceding it in the text sequence and also on the word following it; h_{t-1} is the hidden layer state at time t-1; h_t is the hidden layer state at time t; c_t is the state of the cell layer at time t; i_t, f_t, o_t and c_t denote the activation vectors of the input gate, the forget gate, the output gate and the cell, respectively; σ is the sigmoid function; o_t is the output of the output gate of the BiLSTM at time t;

2. The method for identifying multi-dimensional feature named entities based on food literature data according to claim 1, wherein the step S3: pre-training the Bert model by using the open field corpus to obtain a trained pre-training model; inputting data in a database of the food field literature into the trained Bert model for incremental training to obtain a character dimension feature vector Z, wherein the method specifically comprises the following steps of:

step S32: performing incremental training on the pre-training model by using the corpus of food-field literature from step S1, adding additional Chinese food-field features, and obtaining, based on the Bert model, the character dimension feature vector Z = [z_1, z_2, z_3, ..., z_k] of the food-field literature corpus.

3. The method for identifying multi-dimensional feature named entity based on food literature data according to claim 1, wherein the calculation formula (7) for calculating the label result by using the CRF model in the step S5 is as follows:

wherein Score(X, y) is the score of the output sequence, X is the input sentence, and y = [y_1, y_2, y_3, ..., y_n] is the corresponding output label sequence; the CRF model consists of two parts, matrix A and matrix B: the BiLSTM output O = [o_1, o_2, o_3, ..., o_n] is passed through a fully connected layer to obtain the output matrix A = [a_1, a_2, a_3, ..., a_n], and B is the transition matrix between the tag y_t at time t and the tag y_{t+1} at time t+1;

and finally, calculating an output sequence by using a Viterbi algorithm to obtain a predicted tag result y.

4. A multi-dimensional feature named entity recognition system based on food literature data, comprising the following modules:

the document corpus acquisition module is used for acquiring a food field document abstract on a network by utilizing a crawler technology, and carrying out data processing work on the food field document abstract in a mode of combining manpower and algorithm to obtain the corpus of the food field document;

the module for acquiring character components and character pinyin feature vectors is used for acquiring character components and character pinyin of food field documents on a network by utilizing a crawler technology, inputting the character components and the character pinyin feature vectors into a BiLSTM model for encoding to acquire character component feature vectors S and character pinyin feature vectors P, and specifically comprises the following steps:

the character dimension feature vector acquisition module is used for pre-training the Bert model by using the open field corpus to obtain a trained pre-training model; inputting the corpus of the food field literature into the trained pre-training model for incremental training to obtain a character dimension feature vector Z;

the feature vector module for acquiring the fused full text semantic information is used for inputting the feature vector Z of the word dimension, the word component feature vector S and the word pinyin feature vector P into a neural network model based on BiLSTM to acquire the feature vector fused full text semantic information, and specifically comprises the following steps:

$$i_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1} + W_{ci} c_{t-1} + b_i) \tag{1}$$

$$f_t = \sigma(W_{xf} x_t + W_{hf} h_{t-1} + W_{cf} c_{t-1} + b_f) \tag{2}$$

$$g_t = \tanh(W_{xc} x_t + W_{hc} h_{t-1} + W_{cc} c_{t-1} + b_c) \tag{3}$$

$$c_t = i_t g_t + f_t c_{t-1} \tag{4}$$

$$o_t = \sigma(W_{xo} x_t + W_{ho} h_{t-1} + W_{co} c_{t-1} + b_o) \tag{5}$$

$$h_t = o_t \tanh(c_t) \tag{6}$$

wherein W and b denote the weight matrix and bias vector parameters; x_t is the input variable at time t, x_t depending on the word preceding it in the text sequence and also on the word following it; h_{t-1} is the hidden layer state at time t-1; h_t is the hidden layer state at time t; c_t is the state of the cell layer at time t; i_t, f_t, o_t and c_t denote the activation vectors of the input gate, the forget gate, the output gate and the cell, respectively; σ is the sigmoid function; o_t is the output of the output gate of the BiLSTM at time t;

and the named entity recognition result acquisition module is used for inputting the feature vector fused with the full-text semantic information into a CRF model, calculating a label result and finally obtaining the named entity recognition result.