CN113486666A - Medical named entity recognition method and system

Info

Publication number: CN113486666A
Application number: CN202110770186.9A
Authority: CN
Inventors: 潘景山; 徐卫志; 范胜玉; 涂阳
Original assignee: Jinan Supercomputing Technology Research Institute; Shandong Normal University
Current assignee: Jinan Supercomputing Technology Research Institute; Shandong Normal University
Priority date: 2021-07-07
Filing date: 2021-07-07
Publication date: 2021-10-08

Abstract

The invention discloses a medical named entity recognition method and system. The method comprises the following steps: acquiring text data to be recognized; and performing named entity recognition on the text data to be recognized based on a medical named entity recognition model, wherein the medical named entity recognition model comprises an input layer, a feature extraction layer and a labeling layer which are sequentially connected, and the feature extraction layer comprises a character embedding module and a word embedding module. The method considers the sentences in the text at both the character level and the word level, captures more fully the information content and meaning of the embedded words, and thereby helps improve the precision of named entity recognition.

Description

Medical named entity recognition method and system

Technical Field

The invention belongs to the technical field of medical text processing, and particularly relates to a medical named entity recognition method and system.

Background

The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.

Named Entity Recognition (NER) is a basic task in the field of natural language processing (NLP), and is also an important basic tool for most NLP tasks such as question answering systems, machine translation, syntactic analysis, and the like. Earlier approaches were primarily dictionary-based and rule-based. The dictionary-based method performs fuzzy search or exact matching on character strings, but because new entity names continuously emerge, it is limited by the quality and size of the dictionary. The rule-based method manually specifies rules and expands the rule set using the characteristics of entity names and the common collocations of phrases, but it consumes huge human resources and time; the rules are generally effective only in a specific field, manual migration is costly, and the rules are not very portable. Named entity recognition is now mostly carried out with machine learning methods, in which model training is continuously optimized so that the trained model performs well in test evaluation. Currently, the most widely applied models include Hidden Markov Models (HMMs), Support Vector Machines (SVMs), Maximum Entropy Markov Models (MEMMs), Conditional Random Fields (CRFs), and the like. The conditional random field model can effectively handle the influence of adjacent labels on the predicted sequence, so it is widely applied to entity recognition with good results. At present, deep learning algorithms are generally adopted for sequence labeling problems. Compared with traditional algorithms, deep learning algorithms eliminate the step of manually extracting features and can effectively extract discriminative features.

In the biomedical field, literature resources increase by thousands of times every year, and the information is mostly stored as unstructured text. Biomedical named entity recognition aims to convert the unstructured text into structured text by recognizing and classifying specific entity names such as genes, proteins, and diseases in biomedical texts. At present, biomedical named entity recognition faces many difficulties: entity names carry multiple modifiers, which makes entity boundaries harder to distinguish; multiple entity names share a word; strict naming standards are lacking; abbreviations are ambiguous; and so on. In recent years, neural network methods combining bidirectional long short-term memory (BiLSTM) and Conditional Random Fields (CRF) have achieved good results on various NER data sets. Although BiLSTM explores a great deal of context information, medical professional vocabulary occurs infrequently in existing pre-trained word embeddings, so accurate word senses cannot be obtained and the label of each word cannot always be predicted correctly.

Disclosure of Invention

In order to overcome the defects of the prior art, the invention provides a medical named entity recognition method and system. A multidimensional Transformer is adopted to explore word embedding information, thereby supplementing the word embedding information of professional vocabulary and improving the accuracy of named entity recognition.

In order to achieve the above object, one or more embodiments of the present invention provide the following technical solutions:

a medical named entity recognition method, comprising the steps of:

acquiring text data to be recognized;

performing named entity recognition on the text data to be recognized based on a medical named entity recognition model,

wherein the medical named entity recognition model comprises an input layer, a feature extraction layer and a labeling layer which are sequentially connected, and the feature extraction layer comprises a character embedding module and a word embedding module.

Further, the character embedding module firstly carries out local Transformer feature extraction and global Transformer feature extraction on the text data to be recognized respectively, and then fuses character features.

Further, the global Transformer feature extraction includes:

combining characters of all sentences in the text data to be recognized;

extracting character context information by using a bidirectional long-short term memory neural network;

and performing global Transformer feature extraction.

Further, fusing the character features comprises:

and splicing and fusing character features obtained by extracting the local Transformer features and the global Transformer features.

Further, the word embedding module adopts a BERT model for feature extraction.

Further, the labeling layer uses a conditional random field for labeling and segmentation.

One or more embodiments provide a medical named entity recognition system, comprising:

the data acquisition module is configured to acquire text data to be recognized;

a named entity recognition module configured to perform named entity recognition on the text data to be recognized based on a medical named entity recognition model, wherein the medical named entity recognition model comprises an input layer, a feature extraction layer and a labeling layer which are sequentially connected, and the feature extraction layer comprises a character embedding module and a word embedding module.

One or more embodiments provide an electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the medical named entity recognition method when executing the program.

One or more embodiments provide a computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the medical named entity recognition method.

The above one or more technical solutions have the following beneficial effects:

Multi-dimensional embedding information is acquired for the sentences in the text at both the character level and the word level, which supplements the word embedding information of professional vocabulary and improves the accuracy of named entity recognition.

Word embedding information is explored through a local Transformer and a global Transformer, and word-level feature information is obtained through BERT. Finally, the word embedding feature information of different dimensions is combined into an embedding vector by concatenation, which improves the training performance of the model and greatly expands the vocabulary that the model can handle.

Before global Transformer feature extraction, BiLSTM is first used to extract character context information, and global Transformer feature extraction is then performed, which avoids the loss of context information and improves feature extraction efficiency.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, are included to provide a further understanding of the invention; they illustrate exemplary embodiments of the invention and, together with the description, serve to explain the invention and not to limit it.

Fig. 1 is a flowchart of a medical named entity recognition method according to one or more embodiments of the present disclosure.

Detailed Description

It is to be understood that the following detailed description is exemplary and is intended to provide further explanation of the invention as claimed. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.

It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit the exemplary embodiments according to the invention. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should further be understood that the terms "comprises" and/or "comprising", when used in this specification, specify the presence of the stated features, steps, operations, devices, components, and/or combinations thereof.

The embodiments and features of the embodiments of the present invention may be combined with each other without conflict.

The Transformer has been an important tool for improving task performance in natural language processing in recent years. It extracts features within a sentence by using a multi-head attention mechanism and position encoding. The position of each word in the sentence is encoded with sine and cosine functions to obtain position encoding information, which is then combined with the original word embedding by mask fusion to form a new word embedding. The multi-head attention mechanism then performs feature extraction on the words in the sentence and strengthens their important feature information. Using a Transformer to extract word embedding information from sentences has become a new technique in the field of named entity recognition.
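The sine and cosine position encoding mentioned above can be illustrated with a minimal sketch. It assumes the formulation of the standard Transformer (base 10000, sine on even dimensions, cosine on odd dimensions) and fuses the position codes with the embeddings by element-wise addition; the description does not specify these details, and the function names are hypothetical.

```python
import math


def sinusoidal_position_encoding(seq_len, d_model):
    """Return a seq_len x d_model table of sine/cosine position codes."""
    table = []
    for pos in range(seq_len):
        row = []
        for i in range(d_model):
            # Each pair of dimensions (2k, 2k+1) shares one frequency.
            angle = pos / (10000 ** ((2 * (i // 2)) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


def add_position(embeddings):
    """Assumed fusion step: add the position codes to the embeddings."""
    codes = sinusoidal_position_encoding(len(embeddings), len(embeddings[0]))
    return [
        [e + p for e, p in zip(emb_row, code_row)]
        for emb_row, code_row in zip(embeddings, codes)
    ]
```

Because the codes depend only on position and dimension, the same table can be reused for every sentence of the same length.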

In recent years, Bidirectional Encoder Representations from Transformers (BERT) has become a mature feature extraction tool in natural language processing. BERT is pre-trained on a corpus from a specialized domain and then used for downstream task modeling. Recent studies have shown that fine-tuning the model on downstream tasks can achieve superior performance on each task.

Example one

This embodiment discloses a medical named entity recognition method in which deep word meaning information is mined by a neural network model based on a multi-level Transformer and BERT, thereby improving the accuracy of named entity recognition. As shown in FIG. 1, the method comprises the following steps:

Step 1: acquiring text data to be recognized, and preprocessing the text data;

the preprocessing specifically includes operations such as word segmentation.

Step 2: performing named entity recognition on the text data to be recognized based on the medical named entity recognition model.

The construction method of the medical named entity recognition model comprises the following steps:

Step A: acquiring a training data set;

the training data set is text data that has been word-segmented and pre-labeled, and a professional medical corpus is adopted as the training data set in this embodiment;

Step B: training the named entity recognition model based on the training data set to obtain the medical named entity recognition model.

The medical named entity recognition model architecture comprises an input layer, a feature extraction layer and a labeling layer which are sequentially connected, wherein the feature extraction layer comprises a character embedding module and a word embedding module.

Specifically, the character embedding module comprises a local Transformer feature extraction sub-module, a global Transformer feature extraction sub-module and a character feature fusion sub-module.

The local Transformer feature extraction (LTT) sub-module uses a Transformer to mine the key components of the local characters within each word, and then uses max-pooling to extract a word embedding from the character representations. As an extension of the native word embedding, it increases the amount of information in the embedded word. The details of LTT are as follows:

The global Transformer feature extraction (GTT) sub-module first merges the characters of all sentences in each batch, and then extracts word embeddings at the global character level using Transformer feature extraction. However, applying Transformer feature extraction directly at the global character level may lose context information. Therefore, in this embodiment, BiLSTM is first used to extract the character context information, and global Transformer feature extraction is then performed. Experiments have found that using BiLSTM yields not only better context information but also better computational efficiency. The specific algorithm of GTT is described as follows:

Character-based local Transformer feature extraction models the characters in a word as one-hot codes, applies position encoding to the resulting character embedding matrix, fuses the position encoding information with the original character feature information, performs multi-head attention on the fused feature information, and finally samples the attention-weighted character embeddings with a pooling layer to select information of suitable dimension. Character-based global Transformer feature extraction first uses a Bi-GRU to explore context information over the sentence characters in the one-hot character matrix, then performs Transformer feature extraction, and finally samples with a pooling layer to form the corresponding word embedding.
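The character-level pipeline described here (one-hot coding, attention over characters, pooling into a word vector) can be sketched as follows. This is a deliberately simplified illustration and not the patented implementation: it assumes a single attention head with no learned projection matrices, omits position encoding and the recurrent context step, and uses max-pooling as the pooling layer. All function names are hypothetical.

```python
import math


def one_hot(chars, alphabet):
    """Encode each character as a one-hot vector over the alphabet."""
    index = {c: i for i, c in enumerate(alphabet)}
    size = len(alphabet)
    return [[1.0 if index[c] == j else 0.0 for j in range(size)] for c in chars]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x):
    """Single-head scaled dot-product attention with Q = K = V = x."""
    d = len(x[0])
    out = []
    for query in x:
        scores = [
            sum(a * b for a, b in zip(query, key)) / math.sqrt(d) for key in x
        ]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, x)) for j in range(d)])
    return out


def max_pool(x):
    """Keep the largest value in each dimension across all characters."""
    return [max(column) for column in zip(*x)]


def word_vector(word, alphabet):
    """Characters -> one-hot -> attention -> max-pooling -> one word vector."""
    return max_pool(self_attention(one_hot(word, alphabet)))
```

For example, `word_vector("gene", "egn")` returns one value per alphabet symbol, regardless of how many characters the word has, which is what lets the pooled output serve as a fixed-size word embedding.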

The character feature fusion sub-module fuses the word embeddings of different dimensions by concatenation to generate the embedding vector required by the downstream task.
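Fusion by concatenation can be sketched in a few lines. The example assumes each feature extractor returns one vector per word and that the fused vector is simply the extractors' vectors joined end to end; the function name `concat_fuse` and the sample values are hypothetical.

```python
def concat_fuse(*feature_sets):
    """Concatenate per-word vectors produced by several feature extractors.

    Each feature set is a list with one vector per word; the vector sizes
    may differ between feature sets, but the word counts must match.
    """
    lengths = {len(fs) for fs in feature_sets}
    if len(lengths) != 1:
        raise ValueError("all feature sets must cover the same number of words")
    fused = []
    for vectors in zip(*feature_sets):
        merged = []
        for vec in vectors:
            merged.extend(vec)
        fused.append(merged)
    return fused


# Two words, with made-up local (2-d) and global (1-d) character features.
local_features = [[0.1, 0.2], [0.3, 0.4]]
global_features = [[1.0], [2.0]]
fused = concat_fuse(local_features, global_features)
```

The fused dimension is the sum of the input dimensions, so the downstream layer must be sized accordingly.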

In medical texts, pre-trained word embedding vectors are usually used for the subsequent model training. However, commonly used pre-trained word embeddings have limited support for specialized vocabulary; that is, a large number of words are out-of-vocabulary (OOV). Therefore, in this embodiment, a multidimensional Transformer is used to explore word embedding information, so as to supplement the word embedding information of professional vocabulary.

In the biomedical field, when recognizing named entities such as genes, diseases and proteins, entities are generally labeled using tag schemes such as {B, I, O} and {B, I, O, E, S}, where B marks the beginning of an entity, I the inside of an entity, E the end of an entity, S a single-unit entity, and O a non-entity component. For example, "B-GENE" is the start-position tag of a gene entity. BiLSTM outputs a score for each label; simply selecting the highest-scoring label for each unit independently can be inaccurate, so a CRF layer is needed to ensure that the label sequence is valid.
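The {B, I, O} scheme and the notion of a valid label sequence can be made concrete with a small sketch. The helper names are hypothetical, and the validity rule shown (an I tag must follow a B or I tag of the same type) is the usual BIO convention, assumed here rather than quoted from the description.

```python
def bio_to_spans(tags):
    """Convert a BIO tag sequence into (start, end, label) spans, end exclusive."""
    spans = []
    start, label = None, None
    for i, tag in enumerate(tags):
        if tag.startswith("B-"):
            # A new entity begins; close any entity that was still open.
            if start is not None:
                spans.append((start, i, label))
            start, label = i, tag[2:]
        elif tag.startswith("I-") and start is not None and tag[2:] == label:
            # The current entity continues.
            continue
        else:
            # "O" or an inconsistent "I-" tag closes the open entity.
            if start is not None:
                spans.append((start, i, label))
            start, label = None, None
    if start is not None:
        spans.append((start, len(tags), label))
    return spans


def is_legal_transition(prev_tag, curr_tag):
    """Under BIO, I-X may only follow B-X or I-X; everything else is allowed."""
    if curr_tag.startswith("I-"):
        entity_type = curr_tag[2:]
        return prev_tag in ("B-" + entity_type, "I-" + entity_type)
    return True
```

A sequence such as `["O", "I-GENE"]` is what independent per-unit selection can produce; `is_legal_transition("O", "I-GENE")` is false, which is the kind of error the CRF layer is meant to rule out.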

The word embedding module, namely BERT-based feature extraction, obtains word embedding information from a mature BERT-based pre-trained model. The traditional word feature extraction process uses previously trained word embeddings, but for large amounts of medical text this may fail to provide embeddings for specialized words.

When performing the named entity recognition task, the labeling layer labels and segments the sequence data through a Conditional Random Field (CRF), so that a more accurate final sequence labeling result can be achieved. The CRF is a variant of the Markov random field and is built on top of the Transformer. It generally models the conditional probability of the output label sequence given the observation sequence, and performs global normalization over all features, which gives it advantages over other machine learning methods.
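How a CRF layer picks a globally consistent label sequence can be illustrated with Viterbi decoding, the standard inference procedure for linear-chain CRFs. The sketch assumes per-position emission scores from the feature extraction layer and a tag-to-tag transition score matrix; the scores below are made up for illustration, and the description does not state which decoding procedure is used.

```python
NEG_INF = float("-inf")


def viterbi(emissions, transitions):
    """Return the highest-scoring sequence of tag indices.

    emissions[t][j]   : score of tag j at position t
    transitions[i][j] : score of moving from tag i to tag j
    """
    n_tags = len(emissions[0])
    score = list(emissions[0])
    backpointers = []
    for emit in emissions[1:]:
        new_score, pointers = [], []
        for j in range(n_tags):
            # Best previous tag for arriving at tag j.
            best_i = max(range(n_tags), key=lambda i: score[i] + transitions[i][j])
            new_score.append(score[best_i] + transitions[best_i][j] + emit[j])
            pointers.append(best_i)
        score = new_score
        backpointers.append(pointers)
    best_last = max(range(n_tags), key=lambda j: score[j])
    path = [best_last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


TAGS = ["O", "B-GENE", "I-GENE"]
# Forbid O -> I-GENE; all other transitions are neutral.
TRANSITIONS = [
    [0.0, 0.0, NEG_INF],
    [0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0],
]
EMISSIONS = [
    [2.0, 0.5, 0.1],
    [0.2, 1.0, 1.5],
]

greedy = [max(range(len(TAGS)), key=lambda j: row[j]) for row in EMISSIONS]
decoded = viterbi(EMISSIONS, TRANSITIONS)
```

Here the per-position choice `greedy` yields O followed by I-GENE, which is an invalid sequence, while `decoded` yields O followed by B-GENE because the forbidden transition removes the invalid path from consideration.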

Example two

It is an object of the present embodiment to provide a medical named entity recognition system. The system comprises:

Example three

The embodiment aims at providing an electronic device.

An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the following steps when executing the program, comprising:

acquiring text data to be identified;

Example four

An object of the present embodiment is to provide a computer-readable storage medium.

A computer-readable storage medium, on which a computer program is stored which, when executed by a processor, performs the steps of:

The steps involved in the second to fourth embodiments correspond to the first embodiment of the method, and the detailed description thereof can be found in the relevant description of the first embodiment. The term "computer-readable storage medium" should be taken to include a single medium or multiple media containing one or more sets of instructions; it should also be understood to include any medium that is capable of storing, encoding or carrying a set of instructions for execution by a processor and that causes the processor to perform any of the methods of the present invention.

In one or more of the above embodiments, when the named entity recognition task is carried out, the sentences in the text are considered at both the character level and the word level. A Transformer is used to extract features from local characters and global characters respectively, and BERT is used to obtain word-level feature information. Finally, the word embedding feature information of different dimensions is fused by concatenation to generate the embedding vector required by the downstream task. This scheme can stably improve the training performance of the model, and the word-level representations can greatly expand the vocabulary that the model can handle.

Those skilled in the art will appreciate that the modules or steps of the present invention described above can be implemented using general purpose computer means, or alternatively, they can be implemented using program code that is executable by computing means, such that they are stored in memory means for execution by the computing means, or they are separately fabricated into individual integrated circuit modules, or multiple modules or steps of them are fabricated into a single integrated circuit module. The present invention is not limited to any specific combination of hardware and software.

Although the embodiments of the present invention have been described with reference to the accompanying drawings, it is not intended to limit the scope of the present invention, and it should be understood by those skilled in the art that various modifications and variations can be made without inventive efforts by those skilled in the art based on the technical solution of the present invention.

Claims

1. A medical named entity recognition method, comprising the steps of:

acquiring text data to be identified;

2. The medical named entity recognition method of claim 1, wherein the character embedding module first performs local Transformer feature extraction and global Transformer feature extraction on text data to be recognized respectively, and then fuses character features.

3. The medical named entity recognition method of claim 2, wherein the global Transformer feature extraction comprises:

combining characters of all sentences in the text data to be recognized;

and performing global Transformer feature extraction.

4. The medical named entity recognition method of claim 2, wherein the fusing character features comprises:

5. The medical named entity recognition method of claim 1, wherein the word embedding module employs a BERT model for feature extraction.

6. The medical named entity recognition method of claim 1, wherein the labeling layer employs a conditional random field for labeling and segmentation.

7. A medical named entity recognition system, comprising:

8. The medical named entity recognition system of claim 7, wherein the character embedding module first performs local Transformer feature extraction and global Transformer feature extraction on text data to be recognized, respectively, and then fuses character features.

9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the medical named entity recognition method according to any one of claims 1 to 6 when executing the program.

10. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out a medical named entity recognition method according to any one of claims 1 to 6.