CN111126040B

CN111126040B - Biomedical named entity recognition method based on depth boundary combination

Info

Publication number: CN111126040B
Application number: CN201911362019.XA
Authority: CN
Inventors: 黄瑞章; 扈应; 秦永彬; 武乐飞; 陈艳平
Original assignee: Guizhou University
Current assignee: Guizhou University
Priority date: 2019-12-26
Filing date: 2019-12-26
Publication date: 2023-06-20
Anticipated expiration: 2039-12-26
Also published as: CN111126040A

Abstract

The invention discloses a biomedical named entity recognition method based on depth boundary combination, which comprises the following steps: step one, modeling discontinuous entities among biomedical entities as a nested entity structure; step two, representing biomedical vocabulary information using character-level Embedding and word-level Embedding; step three, based on the word vectors obtained in step two, identifying biomedical entity boundaries using a neural network model; step four, generating a candidate entity set using a boundary combination strategy; and step five, constructing a neural network classifier and screening the candidate entity set. Aiming at the characteristics of biomedical named entities, the invention adopts a combination framework based on deep boundaries and uses available external resources to represent biomedical vocabulary more accurately. The problem of identifying discontinuous entities in biomedical texts is thereby solved and the BioNER task is completed. The invention provides stronger theoretical and technical support for BioNER, provides a convenient and efficient entity recognition tool for researchers in the biomedical field, and effectively improves the performance of biomedical entity recognition.

Description

Biomedical named entity recognition method based on depth boundary combination

Technical Field

The invention relates to a biomedical named entity recognition method, in particular to a biomedical named entity recognition method based on deep boundary combination, and belongs to the technical field of natural language processing and machine learning.

Background

Currently, biomedical research capable of preventing, modifying or treating diseases in a timely and effective manner receives wide attention, and its social and commercial value is increasingly prominent. Much of this research requires a large investment and a long study period, and efficient retrieval of biomedical literature is an important means of ensuring that the research progresses. However, a large amount of biomedical knowledge is stored in databases in the form of unstructured text. According to statistics, the PubMed Central literature database contains more than 29 million literature references, covering almost all knowledge in the biomedical field. Even when focusing on a very specialized research area, most biologists have difficulty keeping pace with the research progress in that area. Accurate extraction of knowledge from a large number of documents therefore becomes critical. Biomedical text mining is expected to achieve this goal and in some cases may also reduce costs, making the required knowledge available in time and revealing explicit and implicit associations between pieces of knowledge.

Biomedical information extraction provides a content-oriented approach to processing biomedical documents, rather than simply ranking related biomedical documents by document similarity. Biomedical named entity recognition (BioNER) is one of the basic tasks of biomedical text mining. It aims to identify text spans that refer to specific entities of interest, and it plays a key role in tasks such as disease-treatment relation extraction and gene function recognition. Named entity recognition usually refers to identifying persons, places, organizations and the like in text; in the biomedical field, however, biologists are more concerned with entities such as DNA, RNA and proteins. Because BioNER is the first step in processing biomedical literature, errors produced at this stage cascade and affect subsequent tasks such as relation recognition and event recognition. In view of the important linguistic and semantic roles played by biomedical named entities, identifying and classifying them more efficiently is of great theoretical and practical importance for biomedical research.

Compared to named entities in the general field, biomedical named entities (BioNEs) mainly have the following characteristics: (1) BioNEs have many pre-modifiers, as in "Major Histocompatibility (MHC) class II genes" (DNA), which make entity length vary widely and entity boundaries difficult to determine. (2) Many conjunctions or disjunctions occur in BioNEs, i.e., two or more entity names share the same prefix (or suffix) noun through a conjunction or disjunction. For example, the phrase "human T and natural killer cells" contains two named entities, "human T cell" (cell_type) and "human natural killer cell" (cell_type). (3) Entity nesting is widespread. For example, in the entity "Duffy antigen/chemokine receptor gene" (DNA), "Duffy antigen/chemokine receptor" is also a protein that needs to be recognized. (4) BioNEs include many abbreviated entities. These abbreviations may be ambiguous, which hinders neural network models from obtaining semantic information. For example, "TCF" may refer either to T cell factor or to tissue culture fluid. Such entities are also difficult to identify from existing dictionaries, and their types must be inferred accurately from context. (5) There is no strict naming convention in biomedical literature, so the same entity may have different representations. For example, cholesterol, 5-Cholesten-3beta-ol and (3 beta)-cholest-5-en-3-ol all denote the same chemical. Many existing works directly apply general-purpose named entity recognition methods to the biomedical field; however, because of the particular nature of biomedical named entities, these methods rarely achieve satisfactory results, and biomedical named entity recognition (BioNER) remains a challenging task. For this reason, the present invention studies methods for BioNER.

Named Entity Recognition (NER) tasks are generally regarded as sequence labeling problems, in which each word in a sentence is assigned a tag (beginning of an entity (B), inside an entity (I), or outside any entity (O)) that represents its semantic information. Over years of development, BioNER has gone through three major stages: dictionary-based methods, rule-based methods and machine learning-based methods.
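The B/I/O labeling scheme described above can be illustrated with a minimal sketch. The function name `spans_to_bio`, the inclusive `(start, end)` span convention and the example sentence are assumptions made for illustration; they do not appear in the original description.

```python
from typing import List, Tuple


def spans_to_bio(tokens: List[str], spans: List[Tuple[int, int]]) -> List[str]:
    """Convert non-overlapping entity spans into B/I/O tags.

    Each span is (start, end) with both token indices inclusive
    (an assumed convention for this sketch).
    """
    tags = ["O"] * len(tokens)          # every token starts outside any entity
    for start, end in spans:
        tags[start] = "B"               # first token of the entity
        for i in range(start + 1, end + 1):
            tags[i] = "I"               # remaining tokens of the entity
    return tags


# Hypothetical sentence with one flat entity, "K562 cells".
example_tokens = ["the", "K562", "cells", "were", "cultured"]
example_tags = spans_to_bio(example_tokens, [(1, 2)])
print(list(zip(example_tokens, example_tags)))
```

Because each token receives exactly one tag, a single B/I/O sequence of this kind cannot mark two entities that overlap, which is the limitation that nested and discontinuous entities expose.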

Dictionary-based methods store all known named entities in a database and use it to perform simple exact (or fuzzy) matching against text. However, given the rapid growth of biomedical literature, it is not possible to build a dictionary containing entities of all classes. Rule-based methods match named entities using manually designed heuristic rules. Budi et al. use rules consisting of grammatical features (e.g., part of speech), syntactic features and orthographic patterns (e.g., case) to identify named entities. Fukuda et al. extract proteins using rules based on case, symbols, digits and the like. Etzioni et al. propose a semi-supervised framework that divides the named entity recognition process into three steps: pattern learning, subclass extraction and list extraction. New extraction rules are generated automatically with this framework to complete the named entity recognition task. However, formulating these rules requires a great deal of manpower and material resources. Machine learning-based methods have the advantage of automatically learning decision boundaries from annotated data, and they are widely used to solve the NER problem. Typically, NER is treated as a multi-class classification task or a sequence labeling task. Many supervised algorithms have been applied to NER, such as Decision Tree (DT), Maximum Entropy (ME), Support Vector Machine (SVM), Hidden Markov Model (HMM) and Conditional Random Field (CRF). With machine learning-based algorithms, researchers do not have to write complex rules by hand. In addition, these algorithms can identify new named entities and categories that do not appear in standard dictionaries.

In recent years, with the development of neural networks, Natural Language Processing (NLP) tasks have gained greater development potential, and deep neural networks have been applied to various NLP tasks with great success. Compared with traditional machine learning methods based on hand-crafted features, a neural network can automatically extract high-order abstract features from the raw input. It can also combine different layers (such as convolution layers, recurrent layers, pooling layers and fully connected layers) to achieve complex nonlinear feature transformations. Many neural network models have been applied to NER tasks, such as Convolutional Neural Networks (CNNs), Long Short-Term Memory networks (LSTMs), LSTM-CNNs and LSTM-CRFs. On biomedical datasets (the JNLPBA corpus and the BioCreAtIvE II Gene Mention (GM) corpus), Gridach et al. combined deep neural networks with a CRF, word embeddings and character-level word representations, and demonstrated good performance in BioNER. However, these approaches can hardly recognize the nested entities that are widespread among BioNEs, which greatly impedes improvement of BioNER performance.

There are fewer studies of named entity recognition with nested structures than of flat entity recognition. The earliest study of nested NEs was by Alex et al., who compared three classical nested named entity recognition methods: layering, cascading and joint labeling. On the same dataset (the GENIA corpus), Finkel et al. used a parse tree to identify nested NEs. In this model, rules are used to attach entity candidates to the parse tree, and a CRF model is then applied to the tree to output a normalized tag sequence. Chen et al. use a cascading framework to identify nested named entities, in which the process is divided into three steps: boundary detection, boundary combination and entity screening. In this model, a CRF model is used to detect entity boundaries, and after the entity candidate set is built, a maximum entropy model is used to find the positive entity examples. Lu et al. devised a mention hypergraph method to identify nested named entities. A hypergraph is a compact representation of all possible combinations of candidate entities. Based on this representation, each sub-hypergraph is labeled using a log-linear method to identify nested NEs. Because that model requires a large number of manually defined features, Muis et al. built on it and proposed a neural network-based hypergraph model for nested NE recognition. Ju et al. identify nested entities by generating a flat NER layer from the output of the previous LSTM layer; the model dynamically stacks flat NER layers until no outer entities are extracted. Even though BioNER has been widely studied, there is still much room for improvement in its performance.

Disclosure of Invention

The technical problem to be solved by the invention is as follows: a biomedical named entity recognition method based on depth boundary combination is provided. Discontinuous entities in biomedical texts are first modeled as nested structures; a boundary detection classifier is then constructed with a neural network model to identify the start and end boundaries of entities; and a candidate entity set is generated through a boundary combination strategy. Finally, a classifier is trained to screen the candidate named entities, which effectively addresses the poor recognition performance on biomedical entities.

The technical scheme of the invention is as follows: a biomedical named entity recognition method based on depth boundary combination, the method comprising the following steps: step one, modeling discontinuous entities among biomedical entities as a nested entity structure; step two, representing biomedical vocabulary information using character-level Embedding and word-level Embedding; step three, based on the word vectors obtained in step two, identifying biomedical entity boundaries using a neural network model; step four, generating a candidate entity set using a boundary combination strategy; and step five, constructing a neural network classifier and screening the candidate entity set.

In the third step, the neural network model is a Bi-LSTM+CRF model.

In the fourth step, the boundary combining policy is a greedy matching policy.

In the fifth step, the sentence is divided into four parts centered on the candidate entity: the left part of the entity, the forward sequence of the entity, the reverse sequence of the entity and the right part of the entity. These four parts are fed into a neural network as four channels carrying local semantic information. A convolutional neural network model further mines the latent local semantic information, and a fully connected layer is then attached to obtain sentence-level global information and complete the recognition of named entities.

The beneficial effects of the invention are as follows: compared with the prior art, and aiming at the characteristics of biomedical named entities, the invention adopts a combination framework based on deep boundaries and uses available external resources to represent biomedical vocabulary more accurately, so that the problem of identifying discontinuous entities in biomedical texts is solved and the BioNER task is completed. The invention provides stronger theoretical and technical support for BioNER, provides a convenient and efficient entity recognition tool for researchers in the biomedical field, and effectively improves the performance of biomedical entity recognition.

The depth boundary combination framework has the following advantages: (1) Entity boundaries are fine-grained and do not depend on any other NLP task. Boundary information is clear and easier to identify than whole NEs. (2) The framework is flexible. It is a cascade framework, and different models can be used for boundary detection, boundary combination and entity screening. (3) External resources can be used effectively. Pre-trained word embeddings can be obtained from large-scale raw data, which helps the neural network model better understand semantic information, so experimental performance can be improved by using external resources.

Drawings

FIG. 1 is a diagram of an exemplary nested entity and discontinuous entity of the present invention;

FIG. 2 is a diagram of a rule-optimized boundary detection model of the present invention;

FIG. 3 is a depth boundary combined model diagram of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail with reference to the accompanying drawings of the present specification.

Example 1: as shown in FIGS. 1-3, a biomedical named entity recognition method based on depth boundary combination comprises the following steps: step one, modeling discontinuous entities among biomedical entities as a nested entity structure; step two, representing biomedical vocabulary information using character-level Embedding and word-level Embedding; step three, based on the word vectors obtained in step two, identifying biomedical entity boundaries using a neural network model; step four, generating a candidate entity set using a boundary combination strategy; and step five, constructing a neural network classifier and screening the candidate entity set.

Step one is intended to model the representation of discontinuous entities. An example of a discontinuous entity is shown in FIG. 1. Discontinuous entities are difficult to represent in biomedical text, and almost all related studies ignore them because the process of identifying discontinuous entities is difficult to model. The invention converts a discontinuous entity into a nested structure. For example, the short phrase "HEL, KU812 and K562 cells" contains three BioNEs: "HEL cells", "KU812 cells" and "K562 cells". Using the nested representation, this example is converted into three nested named entities: "HEL, KU812 and K562 cells", "KU812 and K562 cells" and "K562 cells".

In the second step, aiming at the characteristics of biomedical words, more accurate word vectors are used to represent their semantic and syntactic information so that biomedical text mining tasks can be carried out effectively. The invention concatenates the character-level Embedding vector and the word-level Embedding vector to better represent the semantic information of biomedical vocabulary. Character-level Embedding vectors are generated by training a Bi-LSTM over the characters of each word, and word-level Embedding vectors use the GloVe vector representation trained by Stanford University on 6 billion words.

In the third step, the neural network model is a Bi-LSTM+CRF model. A Bi-LSTM+CRF model is constructed to identify biomedical entities in sentences. Based on the characteristics of entity boundaries, a generalized, accurate and unified entity boundary representation is sought. Rules reflecting the characteristics of entities in the biomedical field are added to the neural network model to optimize entity boundary detection, so that entity boundary information is preserved as far as possible while the raw corpus is converted into high-level features, and boundary semantic information is extracted efficiently and fully used.

In the fourth step, the boundary combination strategy is a greedy matching strategy. On the basis of entity boundary recognition, a boundary assembly strategy is applied to convert an entity structure containing multiple nested layers into mutually independent, flattened entity structures, so that the nested or discontinuous entities contained in a sentence are represented accurately. The uniformly represented entity boundary information is combined in a suitable manner to generate candidate entities, from which the nested and discontinuous entities are found.

In the fifth step, on the basis of the boundary information combination, correct entities are screened out using models such as convolutional neural networks and LSTMs, with precision (P), recall (R) and F1 value as the performance indexes. The sentence is divided into four parts centered on the candidate entity: the left part of the entity, the forward sequence of the entity, the reverse sequence of the entity and the right part of the entity. These four parts are fed into a neural network as four channels carrying local semantic information. A convolutional neural network model further mines the latent local semantic information, and a fully connected layer is then attached to obtain sentence-level global information and complete the recognition of named entities.
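The three performance indexes named above can be computed as in the following minimal sketch. It assumes exact-match scoring over sets of `(start, end)` spans, which the description does not specify; the function name and the example spans are hypothetical.

```python
from typing import Iterable, Tuple

Span = Tuple[int, int]


def precision_recall_f1(predicted: Iterable[Span],
                        gold: Iterable[Span]) -> Tuple[float, float, float]:
    """Exact-match precision, recall and F1 over entity spans."""
    predicted_set, gold_set = set(predicted), set(gold)
    true_positives = len(predicted_set & gold_set)
    # Guard against empty sets so the sketch never divides by zero.
    precision = true_positives / len(predicted_set) if predicted_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical screening result: two of three predicted spans are correct.
p, r, f1 = precision_recall_f1(
    predicted=[(0, 5), (2, 5), (1, 3)],
    gold=[(0, 5), (2, 5), (4, 5)],
)
print(f"P={p:.3f} R={r:.3f} F1={f1:.3f}")
```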

The invention is further illustrated in the following examples:

To carry out the method of the present invention, step one is performed first: discontinuous entities among biomedical entities are modeled as nested structures. Discontinuous entities are difficult to represent in biomedical text, and almost all related studies ignore them because the process of identifying discontinuous entities is difficult to model. The invention converts a discontinuous entity into a nested structure. For example, the short phrase "HEL, KU812 and K562 cells" contains three BioNEs: "HEL cells", "KU812 cells" and "K562 cells". Using the nested representation, this example is converted into three nested named entities: "HEL, KU812 and K562 cells", "KU812 and K562 cells" and "K562 cells".
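The conversion in step one can be sketched as follows. The sketch assumes that each discontinuous entity is replaced by the contiguous span running from its own head token to the shared final token; the function name `discontinuous_to_nested` and the inclusive index convention are hypothetical and are not taken from the original description.

```python
from typing import List, Tuple


def discontinuous_to_nested(head_starts: List[int],
                            shared_end: int) -> List[Tuple[int, int]]:
    """Pair every head position with the shared end position.

    Returns inclusive (start, end) spans, outermost first. Assumption:
    all entities in the coordinated phrase share one final token.
    """
    return [(start, shared_end) for start in sorted(head_starts)
            if start <= shared_end]


tokens = ["HEL", ",", "KU812", "and", "K562", "cells"]
# Heads "HEL", "KU812", "K562" sit at positions 0, 2, 4; "cells" is at 5.
spans = discontinuous_to_nested(head_starts=[0, 2, 4], shared_end=5)
nested = [" ".join(tokens[start:end + 1]) for start, end in spans]
for span, text in zip(spans, nested):
    print(span, text)
```

Each resulting span is contiguous and shares its end boundary with the others, so the three discontinuous entities become one start boundary each plus a single common end boundary.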

Further, step two is executed to obtain the semantic information of biomedical vocabulary. The invention concatenates the character-level Embedding vector and the word-level Embedding vector to represent biomedical vocabulary. The word-level Embedding vector is generated by a lookup table, which may be initialized randomly or with pre-trained values. In the present invention, the GloVe word vectors trained by Stanford University on 6 billion words are used for initialization. The character-level Embedding vector is trained with a Bi-LSTM model. Each letter of a word (the length of each word is fixed at 20 letters) is mapped to a 30-dimensional random vector, the sequence is processed by the Bi-LSTM model, and the output of the model is taken as the character-level vector of that word. Finally, the generated character-level Embedding vector and the word-level Embedding vector are concatenated to form the final word vector representation of the word.
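The data flow of step two (fixing each word at 20 characters, mapping each character to a 30-dimensional random vector, and concatenating the character-level and word-level vectors) can be sketched as below. This is only a shape-level illustration under stated assumptions: the Bi-LSTM encoder is replaced by a simple average as a stand-in, the word-level table is randomly initialized rather than loaded from GloVe, and `WORD_DIM = 100` is an assumed dimension that the description does not give.

```python
import random
from typing import Dict, List

MAX_CHARS = 20   # fixed word length, from the description
CHAR_DIM = 30    # character vector size, from the description
WORD_DIM = 100   # assumption: the description does not state this value
PAD = "\0"       # assumed padding character for short words

_rng = random.Random(0)
_char_table: Dict[str, List[float]] = {}


def _random_vector(dim: int) -> List[float]:
    return [_rng.uniform(-0.1, 0.1) for _ in range(dim)]


def char_level_embedding(word: str) -> List[float]:
    """Pad or truncate to MAX_CHARS, look up random character vectors,
    then average them (a stand-in for the Bi-LSTM, NOT the real encoder)."""
    chars = list(word[:MAX_CHARS])
    chars += [PAD] * (MAX_CHARS - len(chars))
    vectors = [_char_table.setdefault(c, _random_vector(CHAR_DIM)) for c in chars]
    return [sum(column) / MAX_CHARS for column in zip(*vectors)]


def word_level_embedding(word: str, table: Dict[str, List[float]]) -> List[float]:
    """Lookup table; unseen words get a random vector (pre-trained
    values would be loaded here instead)."""
    if word not in table:
        table[word] = _random_vector(WORD_DIM)
    return table[word]


def represent(word: str, table: Dict[str, List[float]]) -> List[float]:
    """Final representation: word-level vector followed by character-level vector."""
    return word_level_embedding(word, table) + char_level_embedding(word)


word_table: Dict[str, List[float]] = {}
vector = represent("K562", word_table)
print(len(vector))  # WORD_DIM + CHAR_DIM
```

In a trained system the averaging step would be replaced by the final states of a Bi-LSTM run over the 20 character vectors, and the word table would be filled from the pre-trained vectors.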

After the vector representation of the biomedical vocabulary is obtained, step three is executed: a Bi-LSTM+CRF+rule model is constructed to detect entity boundaries. The model framework is shown in FIG. 3 (boundary classifier). The boundary detection model used in the invention is a classical Bi-LSTM+CRF structure. The vector representation of the biomedical vocabulary obtained in step two is fed into the neural network model, followed by a fully connected layer and a CRF layer, and the tag sequence with the maximum probability is output. In addition, the invention introduces rules from the biomedical field into the boundary detection model in two ways. In the first way, after the model produces its output, a series of rules (e.g., words with three or more consecutive uppercase letters, words containing special symbols such as "-" or "/", and words with a prefix such as "DNA" or "RNA") is used to filter the possible entity boundaries. In the second way, a series of rules from the biomedical field is mapped into a lookup table, a rule vector is generated for each word through the lookup table, and the rule vector is concatenated with the word vector to produce a word vector of larger dimension, which is fed into the model to complete the detection of entity boundaries.
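The three example rules listed above can be expressed as a small feature function, sketched here. The names `rule_vector` and `with_rules` are hypothetical, and the sketch simplifies the second way of introducing rules by appending the binary rule flags directly to the word vector, whereas the description maps the rules through a lookup table first.

```python
import re
from typing import List

UPPER_RUN = re.compile(r"[A-Z]{3,}")   # three or more consecutive uppercase letters
SPECIAL_SYMBOLS = ("-", "/")           # symbols named in the description
PREFIXES = ("DNA", "RNA")              # prefixes named in the description


def rule_vector(word: str) -> List[int]:
    """Binary flags, one per rule, in the order the rules are listed."""
    return [
        int(bool(UPPER_RUN.search(word))),
        int(any(symbol in word for symbol in SPECIAL_SYMBOLS)),
        int(word.startswith(PREFIXES)),
    ]


def with_rules(word: str, word_vector: List[float]) -> List[float]:
    """Simplified second mode: append the rule flags to the word vector."""
    return word_vector + [float(flag) for flag in rule_vector(word)]


for example in ["MHC", "NF-kappa", "DNA-binding", "cells"]:
    print(example, rule_vector(example))
```

The same flags could serve the first way as well, for instance by keeping a predicted boundary only when the word at that position triggers at least one rule, although the exact filtering criterion is not given in the description.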

Further, step four is performed: a candidate entity set is generated using the boundary combination strategy. The present invention uses a greedy matching strategy, in which each end boundary is matched with the first n (n = 1, 2, 3, …) possible start boundaries in the range between that end boundary and the left end boundary. By this strategy, the possible flat entities, nested entities and discontinuous entities (modeled as nested structures) present in the sentence are found, yielding a candidate entity set.
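One possible reading of the greedy matching strategy is sketched below. It assumes that "the first n possible start boundaries" means the n start boundaries nearest to each end boundary on its left; the description does not state the search order explicitly, and the function name `combine_boundaries` is hypothetical.

```python
from typing import List, Tuple


def combine_boundaries(starts: List[int], ends: List[int],
                       n: int) -> List[Tuple[int, int]]:
    """Pair each end boundary with up to n start boundaries to its left.

    Positions are token indices; returned spans are inclusive (start, end).
    Assumption: start boundaries are tried from nearest to farthest.
    """
    candidates: List[Tuple[int, int]] = []
    starts_nearest_first = sorted(starts, reverse=True)
    for end in sorted(ends):
        reachable = [s for s in starts_nearest_first if s <= end]
        for start in reachable[:n]:
            candidates.append((start, end))
    return candidates


# Boundaries for "HEL , KU812 and K562 cells": starts at 0, 2, 4; end at 5.
print(combine_boundaries(starts=[0, 2, 4], ends=[5], n=3))
print(combine_boundaries(starts=[0, 2, 4], ends=[5], n=1))
```

Under this reading, a larger n recovers more of the nested spans that share an end boundary at the cost of more incorrect candidates, which the classifier in step five is then expected to filter out.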

Further, step five is executed: an entity classifier is constructed using a neural network model to screen the correct entities in the candidate entity set and filter out the incorrect ones. Many models can be used in this process, such as the convolutional neural network (CNN), the long short-term memory network (LSTM), the conditional random field (CRF) and maximum entropy (ME); the convolutional neural network (CNN) model is used in the present invention. The input to this step is a sentence containing labeled candidate entities, each with a tag indicating whether the entity is correct. Thus, the input may be represented as a set in which each element pairs a sentence S_k with a candidate entity formed by the tokens from the i-th position to an end position in S_k, together with the label L_k of that candidate entity.

Briefly, this step takes as input a sentence containing a labeled candidate entity and requires training a classifier to decide whether the current candidate is a correct entity. The specific method is as follows: the sentence is divided into four channels with the entity as the boundary: the left part of the entity, the forward sequence of the entity, the reverse sequence of the entity and the right part of the entity. The length of each channel is fixed at 80. Each channel is processed by a neural network consisting of an Embedding layer, a convolution layer and a max pooling layer. The BERT model is used at the Embedding layer, so that each word in a channel is mapped to a 768-dimensional word vector. Finally, a softmax activation function outputs a one-hot vector representing the predicted category.
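The four-channel split described above can be sketched as follows. The padding token, the choice to pad on the right and the choice to truncate from the right are assumptions; the description only fixes the channel length at 80. The embedding, convolution and pooling layers are not shown.

```python
from typing import Dict, List

CHANNEL_LEN = 80   # fixed channel length, from the description
PAD = "<pad>"      # assumed padding token


def fit(tokens: List[str], length: int = CHANNEL_LEN) -> List[str]:
    """Truncate or right-pad a token list to exactly `length` items."""
    clipped = tokens[:length]
    return clipped + [PAD] * (length - len(clipped))


def four_channels(tokens: List[str], start: int, end: int) -> Dict[str, List[str]]:
    """Split a sentence around the candidate entity tokens[start..end] (inclusive)."""
    entity = tokens[start:end + 1]
    return {
        "left": fit(tokens[:start]),
        "entity_forward": fit(entity),
        "entity_reverse": fit(entity[::-1]),
        "right": fit(tokens[end + 1:]),
    }


# Hypothetical sentence; the candidate entity is "K562 cells" at positions 4..5.
sentence = ["HEL", ",", "KU812", "and", "K562", "cells", "were", "cultured"]
channels = four_channels(sentence, start=4, end=5)
for name, channel in channels.items():
    print(name, channel[:4])
```

Each of the four fixed-length token lists would then be embedded and passed through its own convolution and max pooling layers before the outputs are merged for classification.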

Finally, the validity of the present invention is verified on a real dataset, the GENIA dataset. The GENIA corpus was built by the GENIA project for developing and evaluating information retrieval and text mining systems for molecular biology. The dataset consists of biomedical literature retrieved from PubMed using three Medical Subject Headings (MeSH) terms: human, blood cells and transcription factors, for a total of 2000 MEDLINE abstracts. The dataset contains 36 fine-grained entity categories and 94584 entities in total, of which nested and discontinuous entities account for 35.27%. Table 1 shows the performance of the depth boundary combination method in identifying entities in the GENIA dataset. The Layering method recognizes the innermost-layer and outermost-layer entities separately and merges the results of the two recognition passes; it can identify two-layer nested entities, but it cannot capture the semantic information provided by different categories. The Cascade method uses an LSTM sequence model to identify one category of entity at a time: 10 mutually independent models are constructed, and the overall performance is obtained from the 10 recognition results. This method cannot take the relations between different categories into account and, to a certain extent, cannot identify multi-layer nested entities.

Table 1: Performance for each entity type on the GENIA dataset

To compare the present invention with related work, the experimental setup is the same as that of Lu et al. Table 2 gives an experimental comparison between the present invention and related work.

Table 2: Comparison of experimental performance

From Tables 1 and 2, it can be seen that the present invention effectively models the representation of discontinuous entities and can accurately identify the discontinuous entities present in biomedical literature. In addition, the invention overcomes the shortcomings of the traditional sequence labeling method and identifies nested entities more efficiently. In summary, the biomedical named entity recognition method based on depth boundary combination has excellent performance.

Matters not described in detail in the present application are well known to those skilled in the art. Finally, it is noted that the above embodiments are only intended to illustrate the technical solution of the present invention and not to limit it. Although the present invention has been described in detail with reference to the preferred embodiments, those skilled in the art should understand that modifications and equivalent substitutions may be made to the technical solution of the present invention without departing from its spirit and scope, and all such modifications and substitutions are intended to be covered by the scope of the claims of the present invention.

Claims

1. A biomedical named entity recognition method based on depth boundary combination is characterized in that: the method comprises the following steps: step one, modeling a discontinuous entity in a biomedical entity as a nested entity structure; step two, representing biomedical vocabulary information by using character level Embedding and word level Embedding; step three, based on the word vector obtained in the step two, identifying the biomedical entity boundary by using a neural network model; step four, using a boundary combination strategy to generate a candidate entity set; step five, constructing a neural network classifier, and screening a candidate entity set;

in the fourth step, the boundary combination strategy is a greedy matching strategy; based on entity boundary identification, implementing a boundary assembly strategy, converting an entity structure containing a multi-layer nested structure into a mutually independent flattened entity structure, and accurately representing nested entities or discontinuous entities contained in sentences;

in step five, correct entities are screened out using convolutional neural network and LSTM models on the basis of the boundary information combination, with precision, recall and F1 value as the performance indexes; the sentence is divided into four parts centered on the candidate entity: the left part of the entity, the forward sequence of the entity, the reverse sequence of the entity and the right part of the entity; these four parts are fed into a neural network as four channels carrying local semantic information, a convolutional neural network model further mines the latent local semantic information, and a fully connected layer is then attached to obtain sentence-level global information and complete the recognition of named entities;

in the third step, the neural network model is a Bi-LSTM+CRF model.