CN112541356A - Method and system for recognizing biomedical named entities - Google Patents

Method and system for recognizing biomedical named entities

Info

Publication number
CN112541356A
CN112541356A (application CN202011519249.5A; granted as CN112541356B)
Authority
CN
China
Prior art keywords
attention
named entity
embedding
word embedding
biomedical
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011519249.5A
Other languages
Chinese (zh)
Other versions
CN112541356B (en)
Inventor
徐卫志 (Xu Weizhi)
范胜玉 (Fan Shengyu)
曹洋 (Cao Yang)
于惠 (Yu Hui)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong Normal University
Original Assignee
Shandong Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong Normal University
Priority to CN202011519249.5A
Publication of CN112541356A
Application granted
Publication of CN112541356B
Active legal status
Anticipated expiration

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
        • G06F — ELECTRIC DIGITAL DATA PROCESSING
            • G06F40/00 — Handling natural language data
            • G06F40/20 — Natural language analysis
            • G06F40/279 — Recognition of textual entities
            • G06F40/289 — Phrasal analysis, e.g. finite state techniques or chunking
            • G06F40/295 — Named entity recognition
        • G06N — COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
            • G06N3/00 — Computing arrangements based on biological models
            • G06N3/02 — Neural networks
            • G06N3/04 — Architecture, e.g. interconnection topology
            • G06N3/044 — Recurrent networks, e.g. Hopfield networks
            • G06N3/045 — Combinations of networks
            • G06N3/08 — Learning methods


Abstract

The present disclosure provides a method and system for biomedical named entity recognition, comprising: sampling character- and word-level features with an attention mechanism to obtain extensions of the word embeddings, then extracting word embeddings with a max-pooling layer; fusing the word embeddings of the different levels with an attention mechanism to obtain multi-level word embeddings; training a named entity recognition neural network model on the multi-level word embeddings to obtain a trained named entity recognition neural network model; and feeding the biomedical text to be recognized into the trained named entity recognition neural network model to obtain entity recognition results.

Description

Method and system for recognizing biomedical named entities
Technical Field
The present disclosure belongs to the technical fields of natural language processing and deep learning, and particularly relates to a method and a system for recognizing biomedical named entities.
Background
The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.
Natural Language Processing (NLP) is a branch of artificial intelligence and linguistics, and remains one of the most difficult problems in artificial intelligence. NLP refers to the computer manipulation and processing of the form, sound, and meaning of natural language: the input, output, recognition, analysis, understanding, and generation of characters, words, sentences, and documents. It strongly shapes the way computers and humans interact. Its basic tasks include speech recognition, information retrieval, question answering, and machine translation; models frequently used in NLP include recurrent neural networks and naive Bayes. "Language processing" here refers to computer technology capable of handling spoken and written language; with such techniques, massive data can be retrieved and stored efficiently and quickly. As deep learning has advanced in many fields, natural language processing has likewise made major breakthroughs.
The Attention Mechanism has in recent years become an important tool for improving task performance in natural language processing. An attention score is first computed for each position in a sentence; each dimension of the sentence's word embedding vectors is then weighted by these scores, yielding attention-weighted word embedding vectors. Using an attention mechanism to explore the word embedding information in sentences has become a mature technique in the field of named entity recognition.
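As a concrete illustration, the weighting step described above can be sketched in a few lines of NumPy. This is a minimal, hypothetical self-attention over one sentence's embedding matrix (scaled dot-product scores, softmax, weighted sum), not the patent's exact formulation:

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(E):
    """E: (seq_len, dim) word-embedding matrix of one sentence.
    Scores every pair of positions, normalizes the scores, and
    returns attention-weighted embeddings of the same shape."""
    d = E.shape[-1]
    scores = E @ E.T / np.sqrt(d)        # pairwise attention scores
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ E                   # score-weighted embeddings

E = np.random.randn(5, 8)                # 5 tokens, 8-dim embeddings
A = self_attention(E)
```

Each output row mixes the whole sentence's embeddings in proportion to the attention scores, which is the weighting described above.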
Named Entity Recognition (NER) is a basic task in NLP and an important building block for most NLP tasks, such as question answering, machine translation, and syntactic analysis. Earlier approaches were mainly dictionary-based and rule-based. Dictionary-based methods perform fuzzy search or exact string matching, but as new entity names keep emerging, the quality and size of the dictionary become the limiting factor. Rule-based methods manually specify rules and expand the rule set using the characteristics of entity names and common phrase collocations, but they consume enormous human effort and time, are generally effective only in one specific field, and migrate poorly to other domains. Named entity recognition therefore mostly adopts machine learning: model training is continuously optimized, and the trained model shows better performance in test evaluation. The most widely applied models include Hidden Markov Models (HMMs), Support Vector Machines (SVMs), Maximum Entropy Markov Models (MEMMs), and Conditional Random Fields (CRFs). The conditional random field model effectively handles the influence of adjacent labels on the predicted sequence, so it is widely applied to entity recognition with good results. Sequence labeling now generally uses deep learning algorithms, which remove the manual feature-engineering step of traditional algorithms and can effectively extract discriminative features.
In recent years, with the rapid growth of the internet, information arrives in many storage forms. In the biomedical field, literature resources multiply every year, and most of this information is stored as unstructured text. Biomedical named entity recognition aims to convert unstructured text into structured text, recognizing and classifying specific entity names such as genes, proteins, and diseases in biomedical text. Retrieving relevant information quickly and efficiently from such huge data volumes is currently a major challenge.
Disclosure of Invention
To solve the above problems, the present disclosure provides a method and a system for recognizing biomedical named entities, divided into two main parts: multi-level attention embedding vector calculation and cross attention fusion. The multi-level attention embedding vector calculation comprises character-based local attention calculation, character-based global attention calculation, and word-based local attention calculation.
According to some embodiments, the following technical scheme is adopted in the disclosure:
in a first aspect, the present disclosure provides a method of biomedical named entity identification;
a method of biomedical named entity identification, comprising:
sampling character- and word-level features with an attention mechanism to obtain extensions of the word embeddings, then extracting word embeddings with a max-pooling layer;
fusing the word embeddings of the different levels with an attention mechanism to obtain multi-level word embeddings;
training a named entity recognition neural network model on the multi-level word embeddings to obtain a trained named entity recognition neural network model;
and feeding the biomedical text to be recognized into the trained named entity recognition neural network model to obtain an entity recognition result.
In a second aspect, the present disclosure provides a system for biomedical named entity identification;
a system for biomedical named entity recognition, comprising:
a word embedding module configured to: sample character- and word-level features with an attention mechanism to obtain extensions of the word embeddings, then extract word embeddings with a max-pooling layer;
a feature fusion module configured to: fuse the word embeddings of the different levels with an attention mechanism to obtain multi-level word embeddings;
a model training module configured to: train a named entity recognition neural network model on the multi-level word embeddings to obtain a trained named entity recognition neural network model;
an output module configured to: feed the biomedical text to be recognized into the trained named entity recognition neural network model to obtain an entity recognition result.
In a third aspect, the present disclosure provides a computer-readable storage medium;
the present disclosure provides a computer readable storage medium for storing computer instructions which, when executed by a processor, perform a method of biomedical named entity identification as described in the first aspect.
Compared with the prior art, the beneficial effects of this disclosure are:
1. When processing biomedical named entity recognition, the method adopts a named entity recognition neural network model combined with algorithms such as multi-level attention embedding vector calculation and cross attention fusion, improving the accuracy of named entity recognition.
2. During the named entity recognition task, sequence-structured data are labeled and partitioned by a Conditional Random Field (CRF), achieving an accurate final sequence labeling result.
Advantages of additional aspects of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.
Drawings
The accompanying drawings, which are included to provide a further understanding of the disclosure, illustrate embodiments of the disclosure and together with the description serve to explain the disclosure; they do not limit the disclosure.
FIG. 1 is a flow chart of a method of biomedical named entity identification of the present disclosure;
FIG. 2 is a schematic diagram of a character-based local attention mechanism in an embodiment of the present disclosure;
FIG. 3 is a schematic diagram of a character-based global attention mechanism in an embodiment of the present disclosure;
FIG. 4 shows experimental results for character-based local attention in an embodiment of the present disclosure;
FIG. 5 shows experimental results of the cross attention fusion method applied to character-based local attention in an embodiment of the present disclosure;
FIG. 6 shows experimental results for character-based global attention in an embodiment of the present disclosure;
FIG. 7 shows experimental results of the cross attention fusion method applied to character-based global attention in an embodiment of the present disclosure;
FIG. 8 shows experimental results for word-based local attention in an embodiment of the present disclosure.
Detailed Description:
the present disclosure is further described with reference to the following drawings and examples.
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure belongs.
It is noted that the terminology used herein describes particular embodiments only and is not intended to limit example embodiments according to the present disclosure. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, and the terms "comprises" and/or "comprising" specify the presence of stated features, steps, operations, devices, and components, and/or combinations thereof, unless the context clearly indicates otherwise.
Interpretation of terms:
natural Language Processing (NLP) is a branch of the fields of artificial intelligence and linguistics, and is one of the most difficult problems in artificial intelligence. NLP refers to the operation and processing of information such as the form, sound, meaning of natural language, i.e., the input, output, recognition, analysis, understanding, generation, etc., of characters, words, sentences, chapters, etc., by a computer. It has many important effects on the way computers and humans interact. The basic tasks of the method comprise voice recognition, information retrieval, question-answering systems, machine translation and the like, and a model frequently used by NLP is a recurrent neural network and naive Bayes. The term language processing of natural language processing refers to computer technology capable of processing spoken and written languages. By using the related technology, massive data can be efficiently and quickly retrieved and stored. With the development of deep learning technology in many fields, natural language processing has also made a great breakthrough.
Attention Mechanism (Attention Mechanism) is an important tool to improve task performance in the field of natural language processing in recent years. And finally, weighting each dimension value of the word embedding vector of the sentence according to the attention score to finally obtain the word embedding vector subjected to attention calculation. The use of an attention mechanism for attention exploration of word-embedded information in sentences has become a mature technique in the field of named entity recognition.
Named Entity Recognition (NER) is a basic task in the field of NLP, and is also an important basic tool for most NLP tasks such as question and answer systems, machine translation, syntactic analysis, and the like.
As described in the Background, with the development of science and technology, unstructured biomedical data keeps emerging, and biomedical named entity recognition currently faces many difficulties: entity names carry many modifiers, making entity boundaries harder to determine; multiple entity names share words; strict naming standards are lacking; and abbreviations are ambiguous. To address these problems, adopting a convolutional neural network with multiple filters can greatly improve system performance and recognition accuracy.
Example one
Fig. 1 is a flowchart of the method for recognizing a biomedical named entity according to the present embodiment. As shown in Fig. 1, the present embodiment provides a method of biomedical named entity recognition, comprising:
sampling character- and word-level features with an attention mechanism to obtain extensions of the word embeddings, then extracting word embeddings with a max-pooling layer;
specifically, an attention mechanism is used to extract features from the word embeddings in the sentence;
fusing the word embeddings of the different levels with an attention mechanism to obtain multi-level word embeddings;
training a named entity recognition neural network model on the multi-level word embeddings to obtain a trained named entity recognition neural network model;
and feeding the biomedical text to be recognized into the trained named entity recognition neural network model to obtain an entity recognition result.
As another embodiment, sampling character and word features with an attention mechanism to obtain extensions of the word embeddings comprises: using multi-level attention embedding vector calculation to perform attention exploration within local characters, global characters, and local words respectively, extracting word embedding information at different levels.
The multi-level attention embedding vector calculation comprises character-based local attention calculation, character-based global attention calculation, and word-based local attention calculation.
The character-based local attention calculation models the characters inside each word using one-hot coding, then performs attention calculation on the resulting character embedding matrices, and finally samples the attention-weighted character embeddings with a pooling layer to select appropriate dimension information.
The character-based global attention calculation uses a Bi-GRU to explore context information over sentence characters on the modeled character embedding matrix, then performs attention calculation, and finally samples with a pooling layer to form the corresponding word embeddings.
The word-based local attention calculation performs attention distribution calculation on the word embeddings and extracts the attention distribution among them.
Notably, before the attention distribution is calculated, context exploration is performed on the word embeddings in the sentence to extract context information; the word embedding vectors entering the attention calculation therefore contain contextual information from within the sentence.
As another embodiment, fusing the word embeddings of different levels with an attention mechanism to obtain multi-level word embeddings comprises: using cross attention fusion to weight the attention between the two levels into the corresponding embedding information and fusing them to obtain multi-level word embeddings.
The cross attention fusion algorithm replaces the traditional approach, in which the embeddings obtained from different sampling methods are directly concatenated before the next processing step: in this embodiment, attention is computed between the two sides, the mutual attention is weighted into the corresponding embedding information, and the results are finally concatenated for the next processing step.
As another implementation, before sampling character and word features with an attention mechanism to obtain extensions of the word embeddings, the method further comprises labeling and partitioning the biomedical named entities with a conditional random field.
Specifically, the present embodiment further provides a more detailed implementation: the method of biomedical named entity recognition can be divided into the following processes:
(1) Word embedding. This embodiment uses a multi-level attention form to perform attention exploration within local characters, global characters, and local words respectively, extracting word embedding information at different levels with an attention mechanism; the word embeddings of the different levels are finally fused by attention to generate the embedding vectors required by downstream tasks, and this scheme stably improves the training performance of the model. In many NLP tasks, extracting features from word embedding information has proven effective, e.g., in recent sentence similarity calculation and part-of-speech tagging; word embedding of text improves system performance, and word-level representations greatly enlarge the vocabulary a model can handle.
(2) Multi-level attention feature extraction. For medical texts, pre-trained word embedding vectors are usually used for subsequent model training; however, commonly used pre-trained embeddings offer limited support for specialized vocabulary, i.e., many words are out of vocabulary (OOV). This embodiment therefore uses multi-dimensional attention calculation to explore word embedding information, compensating for the missing embeddings of specialized vocabulary.
(3) Context information extraction. In biomedical texts, extracting effective and useful entity names requires considering a word's position in the sentence and the semantics of its neighbors; context information is thus very beneficial to the NER task. This embodiment mainly adopts a bidirectional long short-term memory network (BiLSTM), composed of a forward LSTM and a backward LSTM, which effectively mitigates the vanishing and exploding gradient problems.
(4) Label annotation and partitioning. During the named entity recognition task, sequence-structured data are labeled and partitioned by a Conditional Random Field (CRF), achieving a more accurate final sequence labeling result. The CRF, a variant of the Markov random field, is built on top of the BiLSTM; it models the conditional probability of the output label sequence given the observation sequence and performs global normalization over all features, which gives it an advantage over other machine learning methods.
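The CRF decoding in step (4) can be illustrated with a small Viterbi decoder. This is a hedged sketch, not the patent's implementation: `emissions` stands in for the per-token tag scores a BiLSTM would output, and `transitions` for the learned CRF transition scores.

```python
import numpy as np

def viterbi_decode(emissions, transitions):
    """emissions: (seq_len, n_tags) per-token tag scores;
    transitions: (n_tags, n_tags) score of moving from tag i to tag j.
    Returns the globally best tag sequence, as a CRF layer would."""
    seq_len, n_tags = emissions.shape
    score = emissions[0].copy()
    backptr = np.zeros((seq_len, n_tags), dtype=int)
    for t in range(1, seq_len):
        # total[i, j] = best score of ending at tag j via previous tag i
        total = score[:, None] + transitions + emissions[t][None, :]
        backptr[t] = total.argmax(axis=0)
        score = total.max(axis=0)
    best = [int(score.argmax())]
    for t in range(seq_len - 1, 0, -1):   # follow back-pointers
        best.append(int(backptr[t][best[-1]]))
    return best[::-1]

em = np.array([[2.0, 0.0],
               [0.0, 1.0]])               # per-token scores for 2 tags
T = np.array([[0.0, -10.0],
              [-10.0, 0.0]])             # heavy penalty for switching tags
path = viterbi_decode(em, T)
```

With the transition penalty, the decoder keeps tag 0 at both positions even though per-token argmax would switch to tag 1 at the second position, illustrating why global CRF decoding beats independent label selection.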
In recent years, neural network methods combining bidirectional long short-term memory (BiLSTM) and Conditional Random Fields (CRF) have achieved strong results on various NER datasets. Although BiLSTM explores a great deal of context information, medical terms occur rarely in existing pre-trained word embeddings, so accurate word senses cannot be obtained and the word labels cannot always be predicted correctly. Pre-trained models, represented by BioBERT and SciBERT, use the BERT architecture trained on specialized medical corpora to obtain higher-level embedding information, thereby improving downstream task performance.
Although pre-trained models converge faster and perform stably, they consume large computing resources, and the cost of training an excellent model is huge. A multi-level attention mechanism is therefore used: a simple, low-cost approach that requires no pre-training and makes the character-level and word-level encoders more informative for specific words.
In the NER task, vanishing or exploding gradients are common problems, but by using a bidirectional long short-term memory network (BiLSTM), the named entity recognition neural network model of this embodiment can obtain context information on both sides of any biomedical text sentence, removing the limited-context problem of feed-forward neural networks. The CRF, as a variant of the Markov random field, effectively handles the probabilistic problem of labeling and partitioning sequence-structured data.
Example two
The purpose of the present disclosure is to improve the accuracy of biomedical named entity recognition. So that the invention may be more clearly understood, it is now described in detail with reference to the accompanying drawings and specific examples.
Previous research shows that sampling character features with a convolutional neural network as an extension of word embedding can improve the performance of the named entity recognition task. This embodiment introduces two character-based techniques, a local attention mechanism and a global attention mechanism, plus a word-based word-embedding attention mechanism; finally, a multi-level cross attention information fusion mechanism, called multi-dimensional fusion, is introduced.
The character-based Local Attention Mechanism (LAM) is shown in Fig. 2. An attention mechanism mines the key components of local characters to embed the characters into words, and max pooling then extracts the word embedding. As an extension of the native word embedding, it increases the information content of the embedded word. The details of LAM are as follows:
(The LAM equations appear only as an image in the original patent and are not reproduced here.)
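Since the LAM details survive only as an image, here is a minimal NumPy sketch of the procedure as the text describes it: one-hot characters, attention within the word, then max pooling into a word vector. The alphabet, projection matrix `W_proj`, and dimensions are illustrative assumptions, not the patent's parameters.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

ALPHABET = "abcdefghijklmnopqrstuvwxyz"

def one_hot_chars(word):
    """(n_chars, |alphabet|) one-hot matrix for the characters of a word."""
    M = np.zeros((len(word), len(ALPHABET)))
    for i, ch in enumerate(word):
        M[i, ALPHABET.index(ch)] = 1.0
    return M

def char_word_embedding(word, W_proj):
    """Attention over a word's characters, then max pooling across
    the character axis to form one word-level vector."""
    C = one_hot_chars(word) @ W_proj             # project chars to dense space
    scores = C @ C.T / np.sqrt(C.shape[-1])      # intra-word attention scores
    attended = softmax(scores, axis=-1) @ C      # attention-weighted chars
    return attended.max(axis=0)                  # max pool -> (dim,)

rng = np.random.default_rng(0)
W = rng.standard_normal((len(ALPHABET), 16))
vec = char_word_embedding("protein", W)
```

The max pool keeps, per dimension, the strongest character-level signal, which is the "extraction" role the pooling layer plays above.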
EXAMPLE III
The character-based Global Attention Mechanism (GAM) is shown in Fig. 3. During training, the characters of all sentences in each batch are combined, and word embeddings are then extracted at the global character level using an attention mechanism. Applying attention directly to the global character set may lose context information; in previous work, character context information was first extracted using a BiLSTM and then processed by an attention mechanism. In our experiments, we found that a BiGRU not only captures context information just as well but is also more computationally efficient. The specific GAM algorithm is as follows:
(The GAM equations appear only as images in the original patent and are not reproduced here.)
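The GAM equations are likewise only images, but the Bi-GRU encoder the text describes can be sketched as follows. This uses one common GRU gating convention, not necessarily the patent's exact formulation; the parameter dictionary `P` and the dimensions are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h_prev, P):
    """One GRU step: update gate z, reset gate r, candidate state."""
    z = sigmoid(P["Wz"] @ x + P["Uz"] @ h_prev + P["bz"])
    r = sigmoid(P["Wr"] @ x + P["Ur"] @ h_prev + P["br"])
    h_tilde = np.tanh(P["Wh"] @ x + P["Uh"] @ (r * h_prev) + P["bh"])
    return (1 - z) * h_prev + z * h_tilde

def bigru(X, P_fwd, P_bwd):
    """Run a forward and a backward GRU over X (seq_len, dim) and
    concatenate their hidden states per position, as a Bi-GRU does."""
    d = P_fwd["bz"].shape[0]
    hf, hb = np.zeros(d), np.zeros(d)
    fwd, bwd = [], []
    for x in X:                      # left-to-right pass
        hf = gru_step(x, hf, P_fwd)
        fwd.append(hf)
    for x in X[::-1]:                # right-to-left pass
        hb = gru_step(x, hb, P_bwd)
        bwd.append(hb)
    return np.concatenate([np.array(fwd), np.array(bwd)[::-1]], axis=-1)

rng = np.random.default_rng(0)
d = 4
P = {k: rng.standard_normal((d, d)) for k in ["Wz", "Uz", "Wr", "Ur", "Wh", "Uh"]}
P.update({k: np.zeros(d) for k in ["bz", "br", "bh"]})
H = bigru(rng.standard_normal((3, d)), P, P)
```

Each position of `H` carries both left and right context, which is the context exploration the attention calculation then operates on.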
Word-level local attention mechanisms have been used many times in past research. A word attention mechanism can accurately extract the attention distribution between word embeddings. In addition, studies have shown that extracting features with a BiLSTM after the attention calculation is not ideal; this embodiment therefore uses a BiGRU to extract context information.
(The word-level attention equations appear only as images in the original patent and are not reproduced here.)
Multi-level feature fusion for the NER task is a powerful and efficient strategy that exploits the most important features for better results. This embodiment does not simply concatenate the multi-dimensional feature information directly: when connecting features of two different levels, a cross attention mechanism is introduced for the first time. For the features of the two levels, an attention mechanism computes attention scores for both sides, which are then fused to obtain the multi-level word embedding. Notably, so that attention can be computed between these two levels of features, a BiLSTM or BiGRU is used to normalize their dimensions. The specific calculation process is as follows:
f1 = BiRNN(f1)
f2 = BiRNN(f2)
m1, m2: cross attention scores between f1 and f2 (the exact equations appear only as images in the original patent)
n1 = softmax(m1)
n2 = softmax(m2)
o1, o2: attention outputs derived from n1 and n2 (equations likewise shown only as images in the original)
a1 = o1 ⊙ f1
a2 = o2 ⊙ f2
Att = [a1, a2]
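Because the m and o equations survive only as images, the NumPy sketch below fills them in with one plausible reading: dot-product cross scores m_i, softmax weights n_i, and attended summaries o_i gathered from the opposite side. These assumed forms are consistent with the surviving lines a1 = o1 ⊙ f1, a2 = o2 ⊙ f2, and Att = [a1, a2], but they are a reconstruction, not the patent's exact formulas.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_fuse(f1, f2):
    """f1, f2: (seq_len, d) features from two levels of the same sentence,
    already brought to a common dimension (the text uses BiLSTM/BiGRU
    for this). Each side attends to the other; the attended summary is
    weighted into its own features and the two halves are concatenated."""
    m1 = f1 @ f2.T                   # f1's attention scores over f2 (assumed)
    m2 = f2 @ f1.T                   # f2's attention scores over f1 (assumed)
    n1, n2 = softmax(m1), softmax(m2)
    o1 = n1 @ f2                     # f2 content gathered for f1 (assumed)
    o2 = n2 @ f1                     # f1 content gathered for f2 (assumed)
    a1 = o1 * f1                     # element-wise weighting: a1 = o1 ⊙ f1
    a2 = o2 * f2                     # a2 = o2 ⊙ f2
    return np.concatenate([a1, a2], axis=-1)   # Att = [a1, a2]

f1 = np.random.randn(5, 8)
f2 = np.random.randn(5, 8)
Att = cross_attention_fuse(f1, f2)
```

The design choice matches the text: rather than plain concatenation, each level's features are first modulated by what the other level attends to, and only then concatenated.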
A bidirectional long short-term memory (BiLSTM) layer has three control gates, input, forget, and output, to protect and control the cell state, capture bidirectional semantic dependencies, and govern how strongly context information influences the predicted object by adjusting the weights of the related information. The hidden layer uses a sigmoid function. The control structure of a single LSTM unit is as follows:
it = σ(Wiht-1 + UiXt + bi)
ft = σ(Wfht-1 + UfXt + bf)
c̃t = tanh(Wcht-1 + UcXt + bc)
ct = ft ⊙ ct-1 + it ⊙ c̃t
ot = σ(Woht-1 + UoXt + bo)
ht = ot ⊙ tanh(ct)
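The LSTM gate equations above translate directly into code. A minimal NumPy sketch follows; the parameter dictionary `P` and the dimensions are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, P):
    """One LSTM step matching the gate equations: P maps names to
    recurrent weights W*, input weights U*, and biases b*."""
    i = sigmoid(P["Wi"] @ h_prev + P["Ui"] @ x_t + P["bi"])        # input gate
    f = sigmoid(P["Wf"] @ h_prev + P["Uf"] @ x_t + P["bf"])        # forget gate
    c_tilde = np.tanh(P["Wc"] @ h_prev + P["Uc"] @ x_t + P["bc"])  # candidate
    c = f * c_prev + i * c_tilde                                   # new cell state
    o = sigmoid(P["Wo"] @ h_prev + P["Uo"] @ x_t + P["bo"])        # output gate
    h = o * np.tanh(c)                                             # new hidden state
    return h, c

rng = np.random.default_rng(0)
d = 4
P = {k: rng.standard_normal((d, d)) for k in ["Wi", "Ui", "Wf", "Uf", "Wc", "Uc", "Wo", "Uo"]}
P.update({k: np.zeros(d) for k in ["bi", "bf", "bc", "bo"]})
h, c = lstm_step(rng.standard_normal(d), np.zeros(d), np.zeros(d), P)
```

A BiLSTM simply runs this step left-to-right and right-to-left with separate parameters and concatenates the two hidden states per position.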
In the biomedical field, when labeling genes, diseases, and proteins, entities are generally annotated with tagging schemes such as {B, I, O} or {B, I, O, E, S}, where B marks the beginning of an entity, I the inside of an entity, E the end of an entity, S a single-token entity, and O a non-entity token. For example, "B-GENE" is the start-position tag of a gene mention. The BiLSTM outputs label scores; simply selecting the highest-scoring label independently at each position is inaccurate, so a CRF layer is required to ensure the legality of the label sequence.
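The {B, I, O} scheme above can be made concrete with a small decoder that collects entity spans from a tag sequence. The tokens and tags here are hypothetical examples:

```python
def extract_entities(tokens, tags):
    """Collect (entity_type, text) spans from BIO tags such as
    B-GENE / I-GENE / O. An I- tag only continues an entity of
    the same type; anything else closes the current span."""
    entities, current, ctype = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                entities.append((ctype, " ".join(current)))
            current, ctype = [tok], tag[2:]
        elif tag.startswith("I-") and ctype == tag[2:]:
            current.append(tok)
        else:                # "O" or an illegal I- transition
            if current:
                entities.append((ctype, " ".join(current)))
            current, ctype = [], None
    if current:
        entities.append((ctype, " ".join(current)))
    return entities

tokens = ["the", "BRCA1", "gene", "causes", "breast", "cancer"]
tags = ["O", "B-GENE", "O", "O", "B-DISEASE", "I-DISEASE"]
spans = extract_entities(tokens, tags)
```

This also shows why per-token argmax is risky: a stray I- tag with no matching B- would simply be dropped here, whereas a CRF layer prevents such illegal sequences from being predicted in the first place.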
Example four
The embodiment provides a system for biomedical named entity recognition;
a system for biomedical named entity recognition, comprising:
a word embedding module configured to: sample character- and word-level features with an attention mechanism to obtain extensions of the word embeddings, then extract word embeddings with a max-pooling layer;
a feature fusion module configured to: fuse the word embeddings of the different levels with an attention mechanism to obtain multi-level word embeddings;
a model training module configured to: train a named entity recognition neural network model on the multi-level word embeddings to obtain a trained named entity recognition neural network model;
an output module configured to: feed the biomedical text to be recognized into the trained named entity recognition neural network model to obtain an entity recognition result.
It should be noted here that the word embedding module, feature fusion module, model training module, and output module correspond to the specific steps of the first embodiment; the modules share the same examples and application scenarios as the corresponding steps, but are not limited to the contents disclosed in the first embodiment. The modules described above, as part of a system, may be implemented in a computer system such as a set of computer-executable instructions.
EXAMPLE five
A computer-readable storage medium storing computer instructions which, when executed by a processor, perform the method of biomedical named entity recognition described in the above embodiments.
EXAMPLE six
Fig. 4 shows the character-based local attention experiment in an embodiment of the present disclosure: as shown in Fig. 4, the effect of different numbers of attention heads on the character-level data is examined, i.e., the best-performing head count is explored by increasing the number of heads.
EXAMPLE seven
FIG. 5 shows the results of the cross-attention fusion method in an embodiment of the present disclosure for the character-level local attention experiment. As shown in FIG. 5, when the character attention is obtained, feeding it into step three by direct concatenation is compared with attention cross fusion, in which the cross-fused word embedding is concatenated with the original embedding.
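A minimal reading of the cross-attention fusion compared here: each level's embedding attends over the other level, and the fused result is concatenated with the original embedding. The NumPy sketch below is an illustrative assumption about the mechanism, not the patent's exact formulation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_fuse(char_emb, word_emb):
    """char_emb: (n_c, d); word_emb: (n_w, d).

    Each side attends over the other; the cross-fused output is
    concatenated with the original embedding, doubling the width.
    """
    d = char_emb.shape[1]
    char_fused = softmax(char_emb @ word_emb.T / np.sqrt(d)) @ word_emb
    word_fused = softmax(word_emb @ char_emb.T / np.sqrt(d)) @ char_emb
    return (np.concatenate([char_emb, char_fused], axis=-1),   # (n_c, 2d)
            np.concatenate([word_emb, word_fused], axis=-1))   # (n_w, 2d)

rng = np.random.default_rng(2)
c, w = cross_attention_fuse(rng.normal(size=(6, 16)), rng.normal(size=(4, 16)))
print(c.shape, w.shape)  # (6, 32) (4, 32)
```

Direct concatenation, the baseline in FIG. 5, would instead stack the two embeddings without the attention weighting, so neither level can reweight the other's features.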
Example eight
FIG. 6 shows the results of the character-based global attention experiment in an embodiment of the present disclosure. As shown in FIG. 6, the effect of the number of attention heads on performance is tested at different word levels.
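In the global attention path, a Bi-GRU first explores the context of the character sequence before attention is applied (claim 4). A compact NumPy sketch of a bidirectional GRU with randomly initialized weights (illustrative only, not the trained parameters) is:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_pass(xs, Wz, Uz, Wr, Ur, Wh, Uh):
    """One-direction GRU over xs: (seq_len, d_in); returns (seq_len, d_h)."""
    h = np.zeros(Uz.shape[0])
    states = []
    for x in xs:
        z = sigmoid(x @ Wz + h @ Uz)              # update gate
        r = sigmoid(x @ Wr + h @ Ur)              # reset gate
        h_tilde = np.tanh(x @ Wh + (r * h) @ Uh)  # candidate state
        h = (1 - z) * h + z * h_tilde
        states.append(h)
    return np.stack(states)

def bi_gru(xs, d_h, rng):
    def weights():
        ws = []
        for _ in range(3):  # z, r, h gates: one input and one recurrent matrix each
            ws.append(rng.normal(size=(xs.shape[1], d_h)) / np.sqrt(xs.shape[1]))
            ws.append(rng.normal(size=(d_h, d_h)) / np.sqrt(d_h))
        return ws
    fwd = gru_pass(xs, *weights())
    bwd = gru_pass(xs[::-1], *weights())[::-1]    # run backwards, re-reverse
    return np.concatenate([fwd, bwd], axis=-1)    # (seq_len, 2 * d_h)

rng = np.random.default_rng(3)
char_seq = rng.normal(size=(12, 16))  # 12 characters of a sentence
ctx = bi_gru(char_seq, d_h=8, rng=rng)
print(ctx.shape)  # (12, 16)
```

Each output position thus carries both left and right context, which is what the subsequent attention and max-pooling layers consume.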
Example nine
FIG. 7 shows the results of the cross-attention fusion method on the character-level global attention experiment in an embodiment of the present disclosure. As shown in FIG. 7, the influence of the word-level data is compared.
Example ten
FIG. 8 shows the results of the word-based local attention experiment in an embodiment of the present disclosure. As shown in FIG. 8, after the influence of character information is added, three ways of handling word-level information are contrasted: directly using attention embedding, applying a BiLSTM followed by attention extraction, and using cross attention.
The above description is only a preferred embodiment of the present disclosure and is not intended to limit the present disclosure, and various modifications and changes may be made to the present disclosure by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present disclosure should be included in the protection scope of the present disclosure.
Although the present disclosure has been described with reference to specific embodiments, it should be understood that the scope of the present disclosure is not limited thereto, and those skilled in the art will appreciate that various modifications and changes can be made without departing from the spirit and scope of the present disclosure.

Claims (10)

1. A method of biomedical named entity recognition, comprising:
performing feature sampling on characters and words with an attention mechanism to obtain the respective word-embedding expansions, and then extracting word embeddings through a max-pooling layer;
fusing word embeddings of different levels with an attention mechanism to obtain multi-level word embeddings;
feeding the multi-level word embeddings into a named entity recognition neural network model for training to obtain a trained named entity recognition neural network model;
and inputting the biomedical named entity to be recognized into the trained named entity recognition neural network model to obtain an entity recognition result.
2. The method of biomedical named entity recognition according to claim 1, wherein performing feature sampling on characters and words with an attention mechanism to obtain the respective word-embedding expansions comprises: performing attention exploration within local characters, global characters, and local words respectively by means of multi-level attention embedding vector computation, and extracting word embedding information at the different levels.
3. The method of biomedical named entity recognition of claim 2, wherein the multi-level attention embedding vector computation comprises character-based local attention computation, which models the characters inside a word with one-hot encoding, performs attention computation on each modeled character embedding matrix, computes the finally output attended character embeddings, and samples suitable dimension information with a max-pooling layer.
4. The method of biomedical named entity recognition of claim 3, wherein the multi-level attention embedding vector computation comprises character-based global attention computation, which explores the context information of the characters of a sentence by applying a Bi-GRU to the modeled character embedding matrix, then performs attention computation, and finally samples with a max-pooling layer to form the corresponding word embeddings.
5. The method of biomedical named entity recognition of claim 2, wherein the multi-level attention embedding vector computation further comprises word-based local attention computation, which performs attention distribution computation on word embeddings to extract the attention distribution between word embeddings.
6. The method of biomedical named entity recognition according to claim 4 or 5, wherein before the attention distribution is computed, context information is extracted by context exploration of the word embeddings inside the sentence.
7. The method of biomedical named entity recognition of claim 1, wherein fusing word embeddings of different levels with an attention mechanism to obtain multi-level word embeddings comprises: weighting the attention of the two different levels into the corresponding embedded information by cross-attention fusion, thereby obtaining multi-level word embeddings.
8. The method of biomedical named entity recognition of claim 1, further comprising labeling and partitioning the biomedical named entities with conditional random fields before performing the feature sampling of characters as an extension of word embedding.
9. A system for biomedical named entity recognition, comprising:
a word embedding module configured to: perform feature sampling on characters and words with an attention mechanism to obtain the respective word-embedding expansions, and then extract word embeddings through a max-pooling layer;
a feature fusion module configured to: fuse word embeddings of different levels with an attention mechanism to obtain multi-level word embeddings;
a model training module configured to: feed the multi-level word embeddings into a named entity recognition neural network model for training, to obtain a trained named entity recognition neural network model;
an output module configured to: input the biomedical named entity to be recognized into the trained named entity recognition neural network model to obtain an entity recognition result.
10. A computer-readable storage medium storing computer instructions which, when executed by a processor, perform the method of biomedical named entity recognition according to any one of claims 1 to 8.
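At decoding time, the conditional-random-field labeling of claim 8 amounts to a Viterbi search for the best tag sequence under emission and transition scores. The sketch below uses toy scores and plain NumPy; it is an illustrative assumption about the decoding step, not the patented training procedure:

```python
import numpy as np

def viterbi_decode(emissions, transitions):
    """emissions: (seq_len, n_tags) per-token tag scores;
    transitions[i, j]: score of moving from tag i to tag j.
    Returns the highest-scoring tag index sequence."""
    n, t = emissions.shape
    score = emissions[0].copy()
    back = np.zeros((n, t), dtype=int)
    for i in range(1, n):
        # Score of every (previous tag, current tag) pair at position i.
        total = score[:, None] + transitions + emissions[i][None, :]
        back[i] = total.argmax(axis=0)
        score = total.max(axis=0)
    # Backtrack from the best final tag.
    best = [int(score.argmax())]
    for i in range(n - 1, 0, -1):
        best.append(int(back[i, best[-1]]))
    return best[::-1]

# Toy example: tags 0 = O (outside), 1 = B-ENT (entity begin).
emissions = np.array([[5.0, 0.0], [0.0, 5.0], [5.0, 0.0]])
path = viterbi_decode(emissions, np.zeros((2, 2)))
print(path)  # [0, 1, 0]
```

With a learned transition matrix, the same search also penalizes illegal tag sequences (e.g. an inside tag with no preceding begin tag), which is the usual reason for placing a CRF on top of a neural NER model.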
CN202011519249.5A 2020-12-21 2020-12-21 Method and system for recognizing biomedical named entities Active CN112541356B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011519249.5A CN112541356B (en) 2020-12-21 2020-12-21 Method and system for recognizing biomedical named entities


Publications (2)

Publication Number Publication Date
CN112541356A true CN112541356A (en) 2021-03-23
CN112541356B CN112541356B (en) 2022-12-06

Family

ID=75019343

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011519249.5A Active CN112541356B (en) 2020-12-21 2020-12-21 Method and system for recognizing biomedical named entities

Country Status (1)

Country Link
CN (1) CN112541356B (en)


Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110059217A (en) * 2019-04-29 2019-07-26 广西师范大学 A kind of image text cross-media retrieval method of two-level network
CN110334213A (en) * 2019-07-09 2019-10-15 昆明理工大学 The Chinese based on bidirectional crossed attention mechanism gets over media event sequential relationship recognition methods
CN110675860A (en) * 2019-09-24 2020-01-10 山东大学 Voice information identification method and system based on improved attention mechanism and combined with semantics
CN110750992A (en) * 2019-10-09 2020-02-04 吉林大学 Named entity recognition method, device, electronic equipment and medium
CN111813907A (en) * 2020-06-18 2020-10-23 浙江工业大学 Question and sentence intention identification method in natural language question-answering technology
CN111914097A (en) * 2020-07-13 2020-11-10 吉林大学 Entity extraction method and device based on attention mechanism and multi-level feature fusion


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
徐凯等: "基于结合多头注意力机制BiGRU网络的生物医学命名实体识别", 《计算机应用与软件》 *
程名等: "融合注意力机制和BiLSTM+CRF的渔业标准命名实体识别", 《大连海洋大学学报》 *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113779993A (en) * 2021-06-09 2021-12-10 北京理工大学 Medical entity identification method based on multi-granularity text embedding
CN113779993B (en) * 2021-06-09 2023-02-28 北京理工大学 Medical entity identification method based on multi-granularity text embedding
CN113486666A (en) * 2021-07-07 2021-10-08 济南超级计算技术研究院 Medical named entity recognition method and system
CN113723051A (en) * 2021-08-26 2021-11-30 泰康保险集团股份有限公司 Text labeling method and device, electronic equipment and storage medium
CN113723051B (en) * 2021-08-26 2023-09-15 泰康保险集团股份有限公司 Text labeling method and device, electronic equipment and storage medium
CN113838524A (en) * 2021-09-27 2021-12-24 电子科技大学长三角研究院(衢州) S-nitrosylation site prediction method, model training method and storage medium
CN113838524B (en) * 2021-09-27 2024-04-26 电子科技大学长三角研究院(衢州) S-nitrosylation site prediction method, model training method and storage medium
CN114282539A (en) * 2021-12-14 2022-04-05 重庆邮电大学 Named entity recognition method based on pre-training model in biomedical field
CN116451690A (en) * 2023-03-21 2023-07-18 麦博(上海)健康科技有限公司 Medical field named entity identification method
CN116611436A (en) * 2023-04-18 2023-08-18 广州大学 Threat information-based network security named entity identification method

Also Published As

Publication number Publication date
CN112541356B (en) 2022-12-06

Similar Documents

Publication Publication Date Title
CN112541356B (en) Method and system for recognizing biomedical named entities
CN108460013B (en) Sequence labeling model and method based on fine-grained word representation model
Yao et al. An improved LSTM structure for natural language processing
CN112733541A (en) Named entity identification method of BERT-BiGRU-IDCNN-CRF based on attention mechanism
CN111666758B (en) Chinese word segmentation method, training device and computer readable storage medium
CN112818118B (en) Reverse translation-based Chinese humor classification model construction method
CN111414481A (en) Chinese semantic matching method based on pinyin and BERT embedding
CN111738007A (en) Chinese named entity identification data enhancement algorithm based on sequence generation countermeasure network
Gao et al. Named entity recognition method of Chinese EMR based on BERT-BiLSTM-CRF
Rendel et al. Using continuous lexical embeddings to improve symbolic-prosody prediction in a text-to-speech front-end
CN113360667B (en) Biomedical trigger word detection and named entity identification method based on multi-task learning
CN112784604A (en) Entity linking method based on entity boundary network
CN112765956A (en) Dependency syntax analysis method based on multi-task learning and application
CN111191464A (en) Semantic similarity calculation method based on combined distance
CN114818717A (en) Chinese named entity recognition method and system fusing vocabulary and syntax information
CN110134950A (en) A kind of text auto-collation that words combines
CN115600597A (en) Named entity identification method, device and system based on attention mechanism and intra-word semantic fusion and storage medium
CN115169349A (en) Chinese electronic resume named entity recognition method based on ALBERT
CN109815497B (en) Character attribute extraction method based on syntactic dependency
CN114970536A (en) Combined lexical analysis method for word segmentation, part of speech tagging and named entity recognition
Göker et al. Neural text normalization for turkish social media
CN116680407A (en) Knowledge graph construction method and device
CN115759102A (en) Chinese poetry wine culture named entity recognition method
CN115510230A (en) Mongolian emotion analysis method based on multi-dimensional feature fusion and comparative reinforcement learning mechanism
CN109960782A (en) A kind of Tibetan language segmenting method and device based on deep neural network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant