CN111126040B - Biomedical named entity recognition method based on depth boundary combination - Google Patents

Biomedical named entity recognition method based on depth boundary combination

Info

Publication number
CN111126040B
Authority
CN
China
Prior art keywords
entity
biomedical
boundary
entities
neural network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911362019.XA
Other languages
Chinese (zh)
Other versions
CN111126040A (en
Inventor
黄瑞章
扈应
秦永彬
武乐飞
陈艳平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guizhou University
Original Assignee
Guizhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guizhou University
Priority to CN201911362019.XA
Publication of CN111126040A
Application granted
Publication of CN111126040B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00 Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10 Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a biomedical named entity recognition method based on depth boundary combination, comprising the following steps: step one, modeling discontinuous entities among biomedical entities as a nested entity structure; step two, representing biomedical vocabulary information using character-level embeddings and word-level embeddings; step three, based on the word vectors obtained in step two, identifying biomedical entity boundaries using a neural network model; step four, generating a candidate entity set using a boundary combination strategy; and step five, constructing a neural network classifier to screen the candidate entity set. Targeting the characteristics of biomedical named entities, the invention adopts a depth boundary combination framework and draws on available external resources to represent biomedical vocabulary more accurately. This addresses the problem of recognizing discontinuous entities in biomedical text and completes the BioNER task, provides stronger theoretical and technical support for BioNER, offers researchers in the biomedical field a convenient and efficient entity recognition tool, and improves the performance of biomedical entity recognition.

Description

Biomedical named entity recognition method based on depth boundary combination
Technical Field
The invention relates to a biomedical named entity recognition method, in particular to a biomedical named entity recognition method based on deep boundary combination, and belongs to the technical field of natural language processing and machine learning.
Background
Biomedical research that can prevent, modify, or treat disease in a timely and effective way attracts wide attention, and its social and commercial value is increasingly prominent. Much of this research requires large investment and long study periods, and efficient retrieval of the biomedical literature is an important means of keeping it on track. However, a large amount of biomedical knowledge is stored in databases as unstructured text. According to statistics, the PubMed Central literature database contains more than 29 million literature citations, covering almost all knowledge in the biomedical field. Even when focusing on a very specialized research area, most biologists have difficulty keeping pace with progress in that area. Accurately extracting knowledge from a large body of literature is therefore critical. Biomedical text mining is expected to achieve this goal and in some cases may also reduce costs, allowing the required knowledge to be used in a timely manner and explicit and implicit associations between pieces of knowledge to be discovered.
Biomedical information extraction provides a content-oriented way of processing biomedical documents, rather than simply ranking related documents by similarity. Biomedical Named Entity Recognition (BioNER) is one of the basic tasks of biomedical text mining. It aims to identify text spans that refer to specific entities of interest and plays a key role in tasks such as disease-treatment relation extraction and gene function recognition. Named entity recognition usually means identifying person, place, and organization names in text; in the biomedical field, however, biologists care more about entities such as DNA, RNA, and proteins. Because BioNER is the first step in biomedical literature processing, its errors cascade and affect subsequent tasks such as relation recognition and event recognition. Given the important linguistic and semantic roles played by BioNEs, recognizing and classifying them more effectively is of great theoretical and practical importance for biomedical research.
Compared with named entities in the general domain, biomedical named entities (BioNEs) have the following main characteristics: (1) BioNEs often carry many pre-modifiers, as in "Major Histocompatibility (MHC) class II genes" (DNA), so entity length varies widely and entity boundaries are hard to determine. (2) Conjunctions and disjunctions are common in BioNEs, that is, two or more entity names share the same head noun (prefix or suffix) through a conjunction or disjunction. For example, the phrase "human T and natural killer cells" contains two named entities, "human T cells" (cell_type) and "human natural killer cells" (cell_type). (3) Entity nesting is widespread. For example, within the entity "Duffy antigen/chemokine receptor gene" (DNA), "Duffy antigen/chemokine receptor" is itself a protein that needs to be recognized. (4) BioNEs include many abbreviations. These may be ambiguous, which makes it harder for a neural network model to obtain semantic information. For example, "TCF" may refer to either T cell factor or tissue culture fluid. Such entities are also hard to identify from existing dictionaries, and their types must be inferred from context. (5) Biomedical literature has no strict naming convention, so the same entity may have different surface forms. For example, cholesterol, 5-Cholesten-3beta-ol, and (3beta)-cholest-5-en-3-ol all denote the same chemical. Many existing works directly apply general-purpose named entity recognition methods to the biomedical field, but because of these particularities they rarely achieve satisfactory results, and BioNER remains a challenging task. For this reason, the present invention investigates methods for BioNER.
Named Entity Recognition (NER) is generally treated as a sequence labeling problem in which each word in a sentence is assigned a tag, Beginning of an entity (B), Inside an entity (I), or Outside any entity (O), that represents its semantic role. Over years of development, BioNER has gone through three major stages: dictionary-based methods, rule-based methods, and machine-learning-based methods.
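The B/I/O labeling described above can be illustrated with a minimal sketch. The sentence, the span positions, and the helper name `spans_to_bio` are hypothetical and chosen only for illustration; the sketch assumes flat, non-overlapping spans, which is exactly the case a plain B/I/O scheme can encode.

```python
def spans_to_bio(tokens, spans):
    """Convert flat entity spans into B/I/O tags.

    spans: list of (start, end_exclusive, entity_type) over token indices.
    Spans are assumed not to overlap (a flat tag sequence cannot encode nesting).
    """
    tags = ["O"] * len(tokens)
    for start, end, etype in spans:
        tags[start] = f"B-{etype}"          # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{etype}"          # remaining tokens of the entity
    return tags


# Hypothetical example sentence and annotations.
tokens = ["IL-2", "gene", "expression", "requires", "NF-kappa", "B"]
spans = [(0, 2, "DNA"), (4, 6, "protein")]
tags = spans_to_bio(tokens, spans)
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

Because every token receives exactly one tag, two entities that share a token cannot both be written down in this scheme, which is why nested entities need a different treatment.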
Dictionary-based methods store all known named entities in a database and match text against it, either exactly or approximately. However, given the rapid growth of biomedical literature, it is not possible to build a dictionary containing every entity of every class. Rule-based methods match named entities using manually designed heuristic rules. Budi et al. use rules built from grammatical (e.g., part of speech), syntactic, and orthographic (e.g., capitalization) patterns to identify named entities. Fukuda et al. extract proteins using rules over capitalization, symbols, digits, and similar cues. Etzioni et al. propose a semi-supervised framework that divides named entity recognition into three steps: pattern learning, subclass extraction, and list extraction; the framework automatically generates new extraction rules to complete the recognition task. However, formulating these rules requires a great deal of manpower and material resources. Machine-learning-based methods have the advantage of learning decision boundaries automatically from annotated data and are widely used for NER, which is typically treated as a multi-class classification task or a sequence tagging task. Many supervised algorithms have been applied to NER, such as Decision Tree (DT), Maximum Entropy (ME), Support Vector Machine (SVM), Hidden Markov Model (HMM), and Conditional Random Field (CRF). With these algorithms, researchers do not have to write complex rules by hand, and the algorithms can also recognize new named entities and categories that do not appear in a standard dictionary.
In recent years, with the development of neural networks, Natural Language Processing (NLP) has gained greater potential, and deep neural networks have been applied to a variety of NLP tasks with great success. Compared with traditional machine learning methods based on hand-crafted features, a neural network can automatically extract high-order abstract features from the raw input. It can also combine different layers (such as convolutional, recurrent, pooling, and fully connected layers) to realize complex nonlinear feature transformations. Many neural network models have been applied to NER, such as Convolutional Neural Networks (CNNs), Long Short-Term Memory networks (LSTMs), LSTM-CNNs, and LSTM-CRFs. On biomedical datasets (the JNLPBA corpus and the BioCreAtivE II Gene Mention (GM) corpus), Gridach et al. combined a deep neural network with a CRF, word embeddings, and character-level word representations and demonstrated good BioNER performance. However, these approaches can hardly recognize the nested entities that are widespread among BioNEs, which is a major obstacle to improving BioNER performance.
Named entity recognition with nested structures has been studied less than flat entity recognition. The earliest study of nested NEs is by Alex et al., who compared three classical approaches to nested named entity recognition: layering, cascading, and joint labeling. On the same dataset (the GENIA corpus), Finkel et al. used a parse tree to identify nested NEs. In their model, rules attach entity candidates to the parse tree, and a CRF model over the tree then outputs a normalized tag sequence. Chen et al. use a cascading framework to identify nested named entities in three steps: boundary detection, boundary combination, and entity screening. In this model, a CRF detects entity boundaries, and once the entity candidate set has been built, a maximum entropy model picks out the positive entity instances. Lu et al. devised a mention hypergraph method for nested named entities. A hypergraph compactly represents all possible combinations of candidate entities, and on top of this representation each sub-hypergraph is labeled with a log-linear method to identify nested NEs. Because that model needs a large number of manually defined features, Muis et al. built on it and proposed a neural-network-based hypergraph model for nested NE recognition. Ju et al. identify nested entities by stacking flat NER layers, each generated from the output of the previous LSTM layer, until no further outer entities are extracted. Although BioNER has been studied widely, there is still much room to improve its performance.
Disclosure of Invention
The technical problem to be solved by the invention is as follows: discontinuous entities in biomedical text are first modeled as nested structures; a boundary detection classifier built with a neural network model then identifies the start and end boundaries of entities; and a candidate entity set is generated through a boundary combination strategy. Finally, a classifier is trained to screen the candidate named entities, which effectively addresses the poor recognition performance for biomedical entities.
The technical scheme of the invention is as follows: a biomedical named entity recognition method based on depth boundary combination, comprising the following steps: step one, modeling discontinuous entities among biomedical entities as a nested entity structure; step two, representing biomedical vocabulary information using character-level embeddings and word-level embeddings; step three, based on the word vectors obtained in step two, identifying biomedical entity boundaries using a neural network model; step four, generating a candidate entity set using a boundary combination strategy; and step five, constructing a neural network classifier to screen the candidate entity set.
In the third step, the neural network model is a Bi-LSTM+CRF model.
In the fourth step, the boundary combining policy is a greedy matching policy.
In step five, the sentence is divided into four parts centered on the candidate entity: the part to the left of the entity, the entity in forward order, the entity in reverse order, and the part to the right of the entity. The four parts are fed into the neural network as four channels. A convolutional neural network model further mines the latent local semantic information of each part, and a fully connected layer then captures sentence-level global information to complete named entity recognition.
The beneficial effects of the invention are as follows. Compared with the prior art, and targeting the characteristics of biomedical named entities, the invention adopts a depth boundary combination framework and draws on available external resources to represent biomedical vocabulary more accurately. This addresses the problem of recognizing discontinuous entities in biomedical text and completes the BioNER task. It provides stronger theoretical and technical support for BioNER, offers researchers in the biomedical field a convenient and efficient entity recognition tool, and effectively improves the performance of biomedical entity recognition.
The depth boundary combination framework has the following advantages: (1) Entity boundaries are fine-grained and do not depend on any other NLP task. Boundary information is clear and easier to identify than complete NEs. (2) The framework is flexible. It is a cascade framework, so different models can be used for boundary detection, boundary combination, and entity screening. (3) External resources can be used effectively. Pre-trained word embeddings can be obtained from large-scale raw data, which helps the neural network model understand semantic information better, so experimental performance can be improved by using external resources.
Drawings
FIG. 1 is a diagram of an exemplary nested entity and discontinuous entity of the present invention;
FIG. 2 is a diagram of a rule-optimized boundary detection model of the present invention;
FIG. 3 is a depth boundary combined model diagram of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail with reference to the accompanying drawings of the present specification.
Example 1: As shown in FIGS. 1-3, a biomedical named entity recognition method based on depth boundary combination comprises the following steps: step one, modeling discontinuous entities among biomedical entities as a nested entity structure; step two, representing biomedical vocabulary information using character-level embeddings and word-level embeddings; step three, based on the word vectors obtained in step two, identifying biomedical entity boundaries using a neural network model; step four, generating a candidate entity set using a boundary combination strategy; and step five, constructing a neural network classifier to screen the candidate entity set.
Step one models the representation of discontinuous entities. An example of a discontinuous entity is shown in FIG. 1. Discontinuous entities are difficult to represent in biomedical text, and almost all related studies ignore them because the process of recognizing them is hard to model. The invention converts a discontinuous entity into a nested structure. For example, the phrase "HEL, KU812 and K562 cells" contains three BioNEs: "HEL cells", "KU812 cells" and "K562 cells". Under this representation, the example is converted into three nested named entities: "HEL, KU812 and K562 cells", "KU812 and K562 cells", and "K562 cells".
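The conversion of a coordinated phrase into nested spans can be sketched as follows. This is only an illustration under a stated assumption: each conjunct opens a span, and every span closes at the shared head noun, so the spans are nested inside one another. The function name `nested_spans` and the token positions are hypothetical.

```python
def nested_spans(conjunct_starts, shared_end):
    """Return one span per conjunct, all ending at the shared head noun.

    conjunct_starts: token indices where each coordinated name begins.
    shared_end: exclusive token index just past the shared head noun.
    """
    return [(start, shared_end) for start in sorted(conjunct_starts)]


tokens = ["HEL", ",", "KU812", "and", "K562", "cells"]
spans = nested_spans(conjunct_starts=[0, 2, 4], shared_end=len(tokens))
surface = [" ".join(tokens[s:e]) for s, e in spans]
for text in surface:
    print(text)
```

All three spans share the same end boundary and differ only in their start boundary, which is what lets a later step recover them by pairing start and end boundaries.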
In step two, more accurate word vectors are used to represent the semantic and syntactic information of biomedical words, reflecting their particular characteristics, so that biomedical text mining tasks can be carried out effectively. The invention concatenates a character-level embedding vector with a word-level embedding vector to better represent the semantics of biomedical vocabulary. The character-level embedding is produced by a Bi-LSTM trained over the individual characters of each word, and the word-level embedding uses GloVe vectors trained by Stanford University on 6 billion words.
In step three, the neural network model is a Bi-LSTM+CRF model, which is built to identify biomedical entity boundaries in sentences. Based on the characteristics of entity boundaries, a general, accurate, and unified representation of entity boundaries is sought. Rules that capture entity characteristics of the biomedical field are added on top of the neural network model to optimize boundary detection, so that boundary information is preserved as far as possible when the raw corpus is converted into high-level features, and boundary semantic information is extracted efficiently and used fully.
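To make the role of the CRF layer concrete, the following is a minimal sketch of Viterbi decoding, the standard way a linear-chain CRF selects its highest-scoring label sequence. It is not the patent's implementation: the label set (`S` for start boundary, `E` for end boundary, `O` for other), the emission scores (standing in for Bi-LSTM outputs), and the transition scores are all made-up values for illustration.

```python
def viterbi(emissions, transitions, labels):
    """Return the highest-scoring label path.

    emissions: one dict per token mapping label -> score.
    transitions: dict mapping (previous_label, label) -> score; missing pairs score 0.
    """
    best = [{lab: emissions[0][lab] for lab in labels}]
    back = []
    for t in range(1, len(emissions)):
        scores, pointers = {}, {}
        for cur in labels:
            # Best previous label for reaching `cur` at position t.
            prev = max(labels, key=lambda p: best[-1][p] + transitions.get((p, cur), 0.0))
            scores[cur] = best[-1][prev] + transitions.get((prev, cur), 0.0) + emissions[t][cur]
            pointers[cur] = prev
        best.append(scores)
        back.append(pointers)
    # Trace the best path backwards from the best final label.
    path = [max(labels, key=lambda lab: best[-1][lab])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


labels = ["S", "E", "O"]                      # hypothetical boundary labels
emissions = [                                 # toy scores for "K562 cells were"
    {"S": 2.0, "E": 0.5, "O": 1.0},
    {"S": 0.2, "E": 1.5, "O": 1.4},
    {"S": 0.1, "E": 0.3, "O": 2.0},
]
transitions = {("S", "E"): 1.0, ("E", "E"): -2.0}
path = viterbi(emissions, transitions, labels)
print(path)
```

The transition scores are what distinguish a CRF from independent per-token classification: here the bonus for `S` followed by `E` tips the second token toward an end boundary.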
In step four, the boundary combination strategy is a greedy matching strategy. On the basis of entity boundary recognition, a boundary assembly strategy converts an entity structure containing multiple nested layers into flattened, mutually independent entity structures, so that the nested or discontinuous entities contained in a sentence are represented accurately. The uniformly represented boundary information is then combined in a suitable way to generate candidate entities, from which the nested and discontinuous entities are found.
In step five, on the basis of the combined boundary information, correct entities are screened out using convolutional neural networks, LSTM, and other models, with precision (P), recall (R), and F1 as performance metrics. The sentence is divided into four parts centered on the candidate entity: the part to the left of the entity, the entity in forward order, the entity in reverse order, and the part to the right of the entity. The four parts are fed into the neural network as four channels. A convolutional neural network model further mines the latent local semantic information of each part, and a fully connected layer then captures sentence-level global information to complete named entity recognition.
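The three metrics named above follow their usual definitions: P = TP / (number of predicted entities), R = TP / (number of gold entities), and F1 = 2PR / (P + R). A minimal sketch, assuming exact-match scoring over (start, end, type) triples; the entity type and the example spans are hypothetical.

```python
def prf1(predicted, gold):
    """Precision, recall, and F1 over sets of (start, end, type) triples."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical gold and predicted entities for one sentence.
gold = {(0, 6, "cell_line"), (2, 6, "cell_line"), (4, 6, "cell_line")}
predicted = {(0, 6, "cell_line"), (4, 6, "cell_line"),
             (3, 6, "cell_line"), (1, 6, "cell_line")}
p, r, f1 = prf1(predicted, gold)
print(f"P={p:.4f} R={r:.4f} F1={f1:.4f}")
```

Two of the four predictions match a gold entity, so precision is 2/4 and recall is 2/3.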
The invention is further illustrated in the following examples:
To carry out the method of the present invention, step one is performed first: discontinuous entities present among biomedical entities are modeled as nested structures. Discontinuous entities are difficult to represent in biomedical text, and almost all related studies ignore them because the process of recognizing them is hard to model. The invention converts a discontinuous entity into a nested structure. For example, the phrase "HEL, KU812 and K562 cells" contains three BioNEs: "HEL cells", "KU812 cells" and "K562 cells". Under this representation, the example is converted into three nested named entities: "HEL, KU812 and K562 cells", "KU812 and K562 cells", and "K562 cells".
Next, step two is performed to obtain the semantic information of biomedical vocabulary. The invention represents each word by concatenating a character-level embedding vector with a word-level embedding vector. The word-level embedding vector is produced by a lookup table, which may be initialized randomly or with pre-trained values. In the present invention it is initialized with GloVe word vectors trained by Stanford University on 6 billion words. The character-level embedding vector is trained with a Bi-LSTM model: each letter of a word (with every word fixed at a length of 20 letters) is mapped to a 30-dimensional random vector, the letter vectors are passed through the Bi-LSTM, and the output of the model serves as the character-level vector of the word. Finally, the character-level and word-level embedding vectors are concatenated to form the final vector representation of the word.
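The input preparation and the final concatenation can be sketched in plain Python. This sketch deliberately does not implement the Bi-LSTM: `encode_chars` is a placeholder that averages the letter vectors, standing in for the trained encoder. The 20-letter word length and the 30-dimensional letter vectors come from the text; the 100-dimensional word vector, the padding symbol, and all function names are assumptions.

```python
import random

MAX_CHARS = 20   # fixed word length in letters (from the text)
CHAR_DIM = 30    # size of each random letter vector (from the text)
WORD_DIM = 100   # assumed word-vector size; the text does not state it

rng = random.Random(0)
char_table = {}  # letter -> random vector, created on first use


def char_vectors(word):
    """Truncate or pad a word to MAX_CHARS letters and look up each letter."""
    letters = list(word[:MAX_CHARS]) + ["<pad>"] * max(0, MAX_CHARS - len(word))
    vectors = []
    for letter in letters:
        if letter not in char_table:
            char_table[letter] = [rng.uniform(-0.25, 0.25) for _ in range(CHAR_DIM)]
        vectors.append(char_table[letter])
    return vectors


def encode_chars(vectors):
    """Placeholder for the Bi-LSTM encoder: a plain average over letter positions."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def word_representation(word, word_table):
    """Concatenate the character-level vector with the word-level lookup vector."""
    word_vector = word_table.get(word.lower(), [0.0] * WORD_DIM)
    return encode_chars(char_vectors(word)) + word_vector


word_table = {"cells": [0.1] * WORD_DIM}   # stand-in for a pre-trained lookup table
vector = word_representation("cells", word_table)
print(len(vector))
```

The resulting vector length is simply the sum of the two parts, so changing either embedding size changes the input width of the boundary detection model.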
After the vector representations of the biomedical vocabulary are obtained, step three is performed: a Bi-LSTM+CRF+rule model is built to detect entity boundaries. The model framework is shown in FIG. 3 (boundary classifier). The boundary detection model used in the invention has the classical Bi-LSTM+CRF structure. The word vectors obtained in step two are fed into the neural network model, followed by a fully connected layer and a CRF layer, which outputs the normalized label sequence with the maximum probability. In addition, the invention introduces biomedical domain rules into the boundary detection model in two ways. In the first, after the model produces its output, a series of rules (e.g., words with three or more consecutive uppercase letters, words containing special symbols such as "-" or "/", and words with prefixes such as "DNA" or "RNA") is used to filter the possible entity boundaries. In the second, the biomedical domain rules are mapped into a lookup table that yields a rule vector for each word; the rule vector is concatenated with the word vector to form a higher-dimensional vector, which is fed into the model to complete entity boundary detection.
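The three example rules mentioned above can be sketched as simple token tests. This is a simplification under stated assumptions: the text maps rules through a lookup table to a learned rule vector, whereas the sketch emits a fixed binary vector, and the exact rule set and function name are hypothetical.

```python
import re


def rule_features(token):
    """Binary features for the three example rules given in the text."""
    return [
        int(bool(re.search(r"[A-Z]{3,}", token))),   # three or more consecutive uppercase letters
        int("-" in token or "/" in token),           # contains a special symbol
        int(token.startswith(("DNA", "RNA"))),       # begins with a known prefix
    ]


for token in ["HEL", "KU812", "NF-kappa", "DNA-binding", "cells"]:
    print(token, rule_features(token))
```

In the second mode described in the text, a vector like this would be concatenated with the word vector before it enters the model; in the first mode, the same tests could be applied to the model's output to accept or reject a proposed boundary.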
Next, step four is performed: a candidate entity set is generated using the boundary combination strategy. The invention uses a greedy matching strategy. Each end boundary is matched with the first n (n = 1, 2, 3, ...) possible start boundaries in the range between that end boundary and the left end boundary. This strategy finds the flat entities, nested entities, and discontinuous entities (modeled as nested structures) that may be present in the sentence and yields the candidate entity set.
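One possible reading of the greedy matching strategy is sketched below. The text does not fully specify the search range or the order in which start boundaries are visited, so the sketch assumes that each end boundary is paired with the n nearest start boundaries found when scanning leftward from it. The function name and the use of inclusive token positions are also assumptions.

```python
def combine_boundaries(starts, ends, n):
    """Pair each end boundary with the n nearest start boundaries at or before it.

    starts, ends: token positions predicted as start and end boundaries.
    Returns candidate entities as (start, end) pairs with inclusive positions.
    """
    candidates = []
    for end in sorted(ends):
        # Start boundaries to the left of (or at) this end, nearest first.
        left = [s for s in sorted(starts, reverse=True) if s <= end]
        candidates.extend((s, end) for s in left[:n])
    return candidates


tokens = ["HEL", ",", "KU812", "and", "K562", "cells"]
starts, ends = [0, 2, 4], [5]
for start, end in combine_boundaries(starts, ends, n=3):
    print(" ".join(tokens[start:end + 1]))
```

With n = 3 this recovers all three nested candidates of the example phrase; a smaller n produces fewer candidates for the screening step at the cost of possibly missing outer entities.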
Next, step five is performed: an entity classifier is built with a neural network model to keep the correct entities in the candidate entity set and filter out the wrong ones. Many models can be used for this, such as the Convolutional Neural Network (CNN), Long Short-Term Memory network (LSTM), Conditional Random Field (CRF), Maximum Entropy (ME), and Support Vector Machine (SVM); the invention uses the CNN. The input to this step is a sentence containing labeled candidate entities, each with a label indicating whether the entity is correct. The input can therefore be represented as a set:
[Equation image not reproduced in the source text]
where
[Equation image not reproduced in the source text]
denotes the candidate entity in sentence S_k that spans from the i-th position to the k-th position, and L_k is its label. In brief, the input to this step is a sentence containing a labeled candidate entity, and a classifier is trained to decide whether the current candidate is a correct entity. The specific method is as follows. The sentence is divided into four channels with the entity as the boundary: the part to the left of the entity, the entity in forward order, the entity in reverse order, and the part to the right of the entity. The length of each channel is fixed at 80. Each channel is processed by a neural network consisting of an embedding layer, a convolution layer, and a max pooling layer. The embedding layer uses the BERT model to map each channel into 768-dimensional word vectors. Finally, a softmax activation function outputs a vector over the categories, which are encoded as one-hot vectors.
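The four-channel split can be sketched as follows, covering only the input preparation and not the BERT embedding or the convolutional layers. The channel length of 80 comes from the text; the padding symbol, the function names, and the choice to keep the words nearest the entity when the left context is too long are assumptions.

```python
CHANNEL_LEN = 80   # fixed channel length (from the text)
PAD = "<pad>"      # assumed padding symbol


def fit(tokens, keep_tail=False):
    """Truncate or pad a token list to exactly CHANNEL_LEN items."""
    tokens = tokens[-CHANNEL_LEN:] if keep_tail else tokens[:CHANNEL_LEN]
    return tokens + [PAD] * (CHANNEL_LEN - len(tokens))


def four_channels(tokens, start, end):
    """Split a sentence around the candidate entity tokens[start:end]."""
    entity = tokens[start:end]
    return {
        "left": fit(tokens[:start], keep_tail=True),   # keep words closest to the entity
        "entity": fit(entity),
        "entity_reversed": fit(entity[::-1]),
        "right": fit(tokens[end:]),
    }


tokens = ["The", "K562", "cells", "were", "treated"]
channels = four_channels(tokens, start=1, end=3)
for name, channel in channels.items():
    print(name, [t for t in channel if t != PAD])
```

Each channel would then be embedded and convolved separately before the pooled features are joined for the final classification.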
Finally, the invention is validated on a real dataset, the GENIA dataset. The GENIA corpus was built by the GENIA project for developing and evaluating information retrieval and text mining systems for molecular biology. It is drawn from the biomedical literature in PubMed, selected with three Medical Subject Headings terms (human, blood cells, and transcription factors), and comprises 2000 MEDLINE abstracts in total. The dataset contains 36 fine-grained entity categories and 94584 entities in total, of which nested and discontinuous entities account for 35.27%. Table 1 shows the performance of the depth boundary combination method in recognizing entities in the GENIA dataset. The Layering method computes performance on the innermost and outermost layers separately and combines the two sets of recognition results; it can recognize two layers of nested entities but cannot capture the semantic information provided by different categories. The Cascade method recognizes one entity category at a time with an LSTM sequence model: 10 mutually independent models are built, and overall performance is computed from the 10 sets of recognition results. This method clearly cannot take the relationships between categories into account and, to some extent, cannot recognize multi-layer nested entities.
table 1: performance of various entities on a GENIA dataset
Figure BDA0002337420230000081
Figure BDA0002337420230000091
To compare the present invention with related work, the experimental setup follows that of Lu et al. Table 2 gives the experimental comparison between the present invention and related work.
Table 2: comparison of Experimental Properties
Figure BDA0002337420230000092
Tables 1 and 2 show that the invention models discontinuous entities effectively and can accurately recognize the discontinuous entities present in biomedical literature. In addition, the invention overcomes shortcomings of traditional sequence labeling methods and recognizes nested entities more efficiently. Taken together, the biomedical named entity recognition method based on depth boundary combination performs well.
Matters not described in detail in the present application are well known to those skilled in the art. Finally, it is noted that the above embodiments only illustrate the technical solution of the present invention and do not limit it. Although the present invention has been described in detail with reference to the preferred embodiments, those skilled in the art should understand that modifications and equivalent substitutions may be made to the technical solution without departing from its spirit and scope, and all such changes are intended to be covered by the claims of the present invention.

Claims (1)

1. A biomedical named entity recognition method based on depth boundary combination, characterized in that the method comprises the following steps: step one, modeling discontinuous entities among biomedical entities as nested entity structures; step two, representing biomedical vocabulary information using character-level embedding and word-level embedding; step three, based on the word vectors obtained in step two, identifying biomedical entity boundaries using a neural network model; step four, generating a candidate entity set using a boundary combination strategy; step five, constructing a neural network classifier and screening the candidate entity set;
in step four, the boundary combination strategy is a greedy matching strategy; on the basis of entity boundary identification, a boundary assembly strategy is applied to convert an entity structure containing multiple nested layers into mutually independent, flattened entity structures, so that the nested or discontinuous entities contained in a sentence are represented accurately;
in step five, on the basis of the combined boundary information, the correct entities are screened out using convolutional neural network and LSTM models, with precision, recall, and F1 value as the performance indexes; each sentence is divided into four parts centered on the candidate entity: the left part of the entity, the entity in forward order, the entity in reverse order, and the right part of the entity; the four parts are fed into the neural network as four channels carrying local semantic information, a convolutional neural network model further mines the latent local semantic information, and a fully connected layer is then attached to obtain sentence-level global information and complete the recognition of named entities;
in step three, the neural network model is a Bi-LSTM+CRF model.
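The four-part split of a sentence around a candidate entity, and the three performance indexes named in step five, can be illustrated with a short Python sketch. This is an assumption-laden illustration rather than the claimed implementation: the function names `four_channels` and `precision_recall_f1` are hypothetical, spans are taken to be inclusive token indices, and scoring uses exact span matching, which the claim does not specify.

```python
from typing import Iterable, List, Sequence, Tuple

Span = Tuple[int, int]  # (start, end) token indices, both inclusive (assumed)


def four_channels(tokens: Sequence[str], span: Span) -> List[List[str]]:
    """Split a sentence around a candidate entity into four token sequences."""
    start, end = span
    entity = list(tokens[start:end + 1])
    return [
        list(tokens[:start]),     # left part of the entity
        entity,                   # entity in forward order
        entity[::-1],             # entity in reverse order
        list(tokens[end + 1:]),   # right part of the entity
    ]


def precision_recall_f1(predicted: Iterable[Span],
                        gold: Iterable[Span]) -> Tuple[float, float, float]:
    """Exact-match precision, recall, and F1 over sets of spans."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


sentence = ["the", "IL-2", "gene", "is", "expressed"]
channels = four_channels(sentence, (1, 2))
scores = precision_recall_f1([(1, 2), (0, 4)], [(1, 2), (3, 3)])
```

In a full model each of the four token sequences would be embedded and passed to the convolutional layers as a separate channel; here they are left as plain token lists so the split itself is easy to inspect.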
CN201911362019.XA 2019-12-26 2019-12-26 Biomedical named entity recognition method based on depth boundary combination Active CN111126040B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911362019.XA CN111126040B (en) 2019-12-26 2019-12-26 Biomedical named entity recognition method based on depth boundary combination

Publications (2)

Publication Number Publication Date
CN111126040A CN111126040A (en) 2020-05-08
CN111126040B CN111126040B (en) 2023-06-20

Family

ID=70502739

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8626700B1 (en) * 2010-04-30 2014-01-07 The Intellisis Corporation Context aware device execution for simulating neural networks in compute unified device architecture
CN107644014A (en) * 2017-09-25 2018-01-30 南京安链数据科技有限公司 A named entity recognition method based on bidirectional LSTM and CRF
CN108229582A (en) * 2018-02-01 2018-06-29 浙江大学 A multi-task named entity recognition adversarial training method for the medical domain
WO2018136308A1 (en) * 2017-01-18 2018-07-26 Microsoft Technology Licensing, Llc Organization of signal segments supporting sensed features
CN108628970A (en) * 2018-04-17 2018-10-09 大连理工大学 A biomedical event joint extraction method based on a new tagging scheme
CN110032737A (en) * 2019-04-10 2019-07-19 贵州大学 A neural-network-based boundary combination named entity recognition method

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
AU2019278845B2 (en) * 2018-05-21 2024-06-13 Leverton Holding Llc Post-filtering of named entities with machine learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Mourad Gridach; Character-level neural network for biomedical named entity recognition; https://www.sciencedirect.com/science/article/pii/S1532046417300977; 2017-07-31; Section 3 of the main text *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant