CN113779993B - Medical entity identification method based on multi-granularity text embedding - Google Patents

Medical entity identification method based on multi-granularity text embedding

Info

Publication number
CN113779993B
Authority
CN
China
Prior art keywords
embedding, medical, word, character, text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110890112.9A
Other languages
Chinese (zh)
Other versions
CN113779993A (en)
Inventor
道捷
张春霞
彭成
薛晓军
王瞳
徐天祥
郭贵锁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Institute of Technology BIT
Original Assignee
Beijing Institute of Technology BIT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Institute of Technology BIT filed Critical Beijing Institute of Technology BIT
Publication of CN113779993A
Application granted
Publication of CN113779993B
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G06F40/295 Named entity recognition
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F16/35 Clustering; Classification
    • G06F40/242 Dictionaries
    • G06N3/045 Combinations of networks


Abstract

The invention relates to a medical entity identification method based on multi-granularity text embedding, belonging to the technical field of information extraction and knowledge graph construction. The method comprises the following steps. Constructing multi-granularity text embedding: multi-granularity text embeddings, comprising character embedding, word embedding, part-of-speech embedding, substring embedding, and phrase embedding, are constructed through a pre-trained language model. Generating pattern weights: pattern weights of all characters in a Chinese sentence are generated according to medical term composition patterns. Node embedding representation learning: node embedding representations are learned using a graph attention network and a pattern-enhanced attention mechanism. Outputting the medical entity recognition result: entity category labels of the medical text are generated with a conditional random field, and the medical entity recognition result is output. The method addresses the under-utilization of graph representation information and the single embedding granularity of distributed text representations in medical entity recognition, and improves recognition performance.

Description

Medical entity identification method based on multi-granularity text embedding
Technical Field
The invention relates to a medical entity identification method based on multi-granularity text embedding, and belongs to the technical field of information extraction and knowledge graph construction.
Background
Medical entity identification is an important research topic in information extraction and medical knowledge graph construction. It refers to identifying entities or terms of the medical domain in unstructured medical text. Medical entity recognition technology provides technical and knowledge support for medical question-answering systems, computer-aided diagnosis, precision medical knowledge services, and related fields.
Medical entity identification methods mainly comprise rule-based methods, statistical machine learning methods, and deep learning methods. The basic idea of rule-based methods is to identify medical entities in unstructured text according to hand-constructed entity composition rules, whose constituent elements include keywords, word categories, and the like.
Statistical machine learning methods recognize medical entities with models such as the maximum entropy model, hidden Markov model, conditional random field, and support vector machine, casting medical entity identification as a classification or sequence labeling problem. For example, one method combines conditional random fields and rules to identify named entities in Chinese electronic medical records: a conditional random field first performs recognition from token, suffix, keyword, dictionary, and length features; rules then refine the recognition result.
Deep learning based methods comprise distributed (embedded) encoding of the unstructured input text, contextual semantic encoding, and tag decoding. Embedded encoding of the input text mainly comprises character embedding and word embedding; contextual semantic encoding models include the convolutional neural network, bidirectional long short-term memory (BiLSTM) network, recurrent neural network, and the like. For example, one approach performs medical entity recognition on Chinese electronic medical records with a BiLSTM and a conditional random field model: it first generates a low-dimensional vector representation of each word, then recognizes medical entities with an attention-augmented BiLSTM and a conditional random field model.
The graph attention network builds on the graph convolutional neural network by introducing an attention mechanism, and has been applied to answer extraction in question-answering systems, information recommendation, relation extraction, and the like.
Existing medical entity identification methods mainly have the following problems. First, they mainly construct character, word, and part-of-speech embeddings of the text, and rarely introduce phrase and substring embeddings. Second, few methods model medical text with a graph attention network for entity recognition. Third, few methods fuse pattern- or rule-based approaches with deep learning approaches so as to fully and efficiently integrate the advantages of both: pattern- or rule-based methods achieve high precision, while deep learning methods need no time-consuming, labor-intensive feature engineering and support end-to-end nonlinear learning.
Disclosure of Invention
The invention aims to solve the under-utilization of graph representation information and the single embedding granularity of distributed text representations in medical entity recognition, and provides a medical entity recognition method based on multi-granularity text embedding. The method first constructs multi-granularity text embeddings, comprising character embedding, word embedding, part-of-speech embedding, substring embedding, and phrase embedding, realizing multi-granularity representation learning over the characters, words, parts of speech, phrases, and substrings of a medical text. It then performs medical entity recognition with a graph attention network, a pattern-enhanced attention mechanism, and a conditional random field, specifically: first, the graph attention network model builds a graph-embedded representation of the medical text; second, a pattern-enhanced attention mechanism is introduced into the graph attention network to strengthen the attention weights of the nodes, improving medical entity recognition performance.
In order to achieve the purpose, the invention adopts the following technical scheme:
the medical entity identification method based on multi-granularity text embedding comprises the following steps:
step 1: the method for constructing the multi-granularity text embedding through the pre-training language model comprises the following steps:
step 1.1: for unstructured Chinese medical texts, constructing multi-granularity text embedding;
step 1.1.1: learning of symbol embedding, segmentation embedding and covering embedding is carried out by utilizing a pre-training language model MC-Bert, and character embedding of an unstructured Chinese medical text is generated;
MC-Bert is a pre-training model generated according to Chinese medical data training;
for an unstructured Chinese medical text, the input of a pre-training language model MC-Bert consists of three kinds of embedding, namely symbol embedding, segmentation embedding and covering embedding;
wherein symbol embedding refers to the vector representation of each character; segmentation embedding distinguishes two natural language sentences, and since the medical entity recognition task recognizes entities sentence by sentence, every character has the same segmentation embedding; in the covering (mask) embedding, a position holding a character of the input sentence is assigned the value 1, and a position that holds no character of the input sentence is assigned the value 0;
for a Chinese sentence CS, CS = (cc_1, cc_2, ..., cc_n), where cc_1, cc_2, ..., cc_n are characters, the pre-trained language model MC-Bert generates the character embedding representation E_cc of the sentence CS, as shown in (1):

E_cc = (e_cc1, e_cc2, ..., e_ccn), E_cc ∈ R^{n×m} (1)

wherein n is the sentence length, i.e. the number of characters in the sentence, padded with 0 if the sentence has fewer than n characters; m is the dimension set by the pre-trained model MC-Bert; e_cci (i = 1, 2, ..., n) is the embedding of character cc_i; the character embedding representation E_cc has n rows and m columns; R^{n×m} denotes the set of real matrices with n rows and m columns;
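The padding and mask construction of step 1.1.1 can be sketched as follows (a minimal illustration in Python, the development language named in the embodiment; the function name and the `[PAD]` token are assumptions, not part of the patent):

```python
def build_mcbert_inputs(chars, n):
    """Pad a character sequence to length n and build the index sequences
    described in step 1.1.1: padded tokens, segment ids (all 0, since
    entities are recognized sentence by sentence, so every character has
    the same segmentation embedding), and the covering/mask values
    (1 for real characters, 0 for padding positions)."""
    padded = chars + ["[PAD]"] * (n - len(chars))
    segment_ids = [0] * n
    mask = [1] * len(chars) + [0] * (n - len(chars))
    return padded, segment_ids, mask

# Four real characters padded to a fixed sentence length of 6.
tokens, segments, mask = build_mcbert_inputs(list("头部疼痛"), 6)
# mask -> [1, 1, 1, 1, 0, 0]
```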
step 1.1.2: generating word embedding, part-of-speech embedding and phrase embedding of the Chinese medical text;
firstly, for a Chinese medical text, the jieba word segmentation tool obtains the words of the text, the Stanford POS tagger obtains the part-of-speech tags of those words, and the Stanford parser obtains the phrase tags of the text;
then, word embedding, part of speech embedding and phrase embedding are generated by using a word2vec tool;
for the Chinese sentence CS, CS = (cc_1, cc_2, ..., cc_n), where cc_1, cc_2, ..., cc_n are characters: word embedding, specifically: the words of the Chinese medical text are obtained with the jieba word segmentation tool, and word2vec generates the embedding of the word each character belongs to; part-of-speech embedding, specifically: based on the part-of-speech tags of the words of the Chinese medical text, word2vec generates the embedding of each word's part of speech; phrase embedding, specifically: based on the obtained phrase tags of the Chinese medical text, word2vec generates the embedding of the type of phrase each character belongs to;

wherein the word embedding is E_cw = (e_cw1, e_cw2, ..., e_cwn), where the words that characters cc_1, cc_2, ..., cc_n belong to are, in order, cw_1, cw_2, ..., cw_n, and e_cwi (i = 1, 2, ..., n) is the embedding of word cw_i;

the part-of-speech embedding is E_cpos = (e_cpo1, e_cpo2, ..., e_cpon), where the parts of speech of words cw_1, cw_2, ..., cw_n are, in order, cpo_1, cpo_2, ..., cpo_n, and e_cpoi (i = 1, 2, ..., n) is the embedding of part of speech cpo_i;

the phrase embedding is E_cph = (e_cph1, e_cph2, ..., e_cphn), where the phrase types that characters cc_1, cc_2, ..., cc_n belong to are, in order, cph_1, cph_2, ..., cph_n, and e_cphi (i = 1, 2, ..., n) is the embedding of phrase type cph_i;
step 1.1.3: for an unstructured Chinese medical text, generating substring embedding by using a word2vec tool, specifically:
firstly, collecting a medical term dictionary, and constructing a medical term substring set, specifically comprising the following steps: for any two terms, extracting the longest common substring of the two terms, and adding the substring to the medical term substring set; if two terms have a plurality of longest common substrings with the same length, the first longest common substring is taken and added into the medical term substring set;
secondly, generating embedded expressions of all substrings in the substring set of the medical terms by using a word2vec tool;
then, for each word cw_i (i = 1, 2, ..., n), judge whether cw_i contains substrings from the medical term substring set; for the Chinese sentence CS = (cc_1, cc_2, ..., cc_n), with characters cc_1, cc_2, ..., cc_n belonging, in order, to words cw_1, cw_2, ..., cw_n: suppose word cw_i contains the substrings csubs_1, csubs_2, ..., csubs_p of the medical term substring set, whose embeddings are e_cs1, e_cs2, ..., e_csp respectively; then the substring embedding representation e_cssi of word cw_i is the sum of e_cs1, e_cs2, ..., e_csp divided by the number of substrings p; if word cw_i contains no substring of the medical term substring set, a custom default value is output;

finally, for each character cc_i (i = 1, 2, ..., n) in sentence CS, its corresponding substring embedding is generated as above; the substring embedding of sentence CS is E_css = (e_css1, e_css2, ..., e_cssn);
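The longest-common-substring extraction of step 1.1.3 and the averaging rule above can be sketched as follows (a minimal illustration; the helper names and the fallback default vector are assumptions, not the patent's implementation):

```python
def longest_common_substring(a, b):
    # Dynamic programming over character pairs; strict ">" keeps the first
    # longest common substring found, matching the tie-breaking rule.
    best, best_end = 0, 0
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
                if dp[i][j] > best:
                    best, best_end = dp[i][j], i
    return a[best_end - best:best_end]

def substring_embedding(word, subs_vectors, default):
    # Average the vectors of all known substrings contained in the word;
    # fall back to a custom default vector if none match.
    hits = [v for s, v in subs_vectors.items() if s in word]
    if not hits:
        return default
    dim = len(hits[0])
    return [sum(v[d] for v in hits) / len(hits) for d in range(dim)]
```

For example, `longest_common_substring("abcdef", "zcdezz")` yields `"cde"`, and a word containing two known substrings receives the mean of their two vectors.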
Step 1.1.4: character embedding, word embedding, part of speech embedding, phrase embedding and substring embedding of the Chinese medical text are spliced to construct multi-granularity text embedding, and the method specifically comprises the following steps:
for the Chinese sentence CS = (cc_1, cc_2, ..., cc_n), with characters cc_1, cc_2, ..., cc_n, the multi-granularity text embedding of character cc_i (i = 1, 2, ..., n) is constructed by concatenating the character embedding E_cc, word embedding E_cw, part-of-speech embedding E_cpos, phrase embedding E_cph, and substring embedding E_css of sentence CS, yielding the multi-granularity text embedding of CS, as shown in (2):

E_cme = Concate(E_cc, E_cw, E_cpos, E_cph, E_css) (2)

wherein Concate denotes the concatenation operation; accordingly, the dimension of E_cme equals the sum of the dimensions of E_cc, E_cw, E_cpos, E_cph, and E_css;

thus, through steps 1.1.1 to 1.1.4, the multi-granularity text embedding E_cme of Chinese character cc_i (i = 1, 2, ..., n) is constructed;
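The concatenation of formula (2) amounts to joining the five per-character vectors end to end, so the output dimension is the sum of the five input dimensions; a minimal sketch (function name and example dimensions are illustrative only):

```python
def concat_multigranularity(e_cc, e_cw, e_cpos, e_cph, e_css):
    """Concatenate the five per-character embeddings of formula (2);
    with plain Python lists, concatenation is list addition, so the
    result dimension is the sum of the five input dimensions."""
    return e_cc + e_cw + e_cpos + e_cph + e_css

# Illustrative dimensions: 4 + 3 + 2 + 2 + 2 = 13.
e = concat_multigranularity([0.1] * 4, [0.2] * 3, [0.3] * 2, [0.4] * 2, [0.5] * 2)
assert len(e) == 4 + 3 + 2 + 2 + 2
```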
Step 1.2: for unstructured English medical text, multi-granularity text embedding is constructed, and the method comprises the following steps:
step 1.2.1: learning of symbol embedding, segmentation embedding and covering embedding is carried out by utilizing a pre-training language model BioBert, and word embedding of an unstructured English medical text is generated;
BioBert is a pre-training model generated by training according to English medical data;
for an English sentence ES, ES = (ew_1, ew_2, ..., ew_n), where ew_1, ew_2, ..., ew_n are words, the pre-trained language model BioBert generates the word embedding representation of the sentence ES, as shown in (3):

E_ew = (e_ew1, e_ew2, ..., e_ewn), E_ew ∈ R^{n×m} (3)

wherein n is the sentence length, i.e. the number of words in the sentence, padded with 0 if the sentence has fewer than n words; m is the dimension set by the pre-trained model BioBert; e_ewi (i = 1, 2, ..., n) is the embedding of word ew_i;
step 1.2.2: generating character embedding, part-of-speech embedding and phrase embedding of English medical texts;
for English medical texts, word2vec tools are used for generating character embedding, part of speech embedding and phrase embedding;
for an English sentence ES = (ew_1, ew_2, ..., ew_n) of words, the character embedding of word ew_i (i = 1, 2, ..., n) is the average of the embeddings of all characters of ew_i, the part-of-speech embedding is the embedding of the part of speech of ew_i, and the phrase embedding is the embedding of the type of phrase ew_i belongs to;

wherein the character embedding is E_ec = (e_ec1, e_ec2, ..., e_ecn), where e_eci (i = 1, 2, ..., n) is the character embedding of word ew_i;

the part-of-speech embedding is E_epos = (e_epo1, e_epo2, ..., e_epon), where the parts of speech of words ew_1, ew_2, ..., ew_n are, in order, epo_1, epo_2, ..., epo_n, and e_epoi (i = 1, 2, ..., n) is the embedding of part of speech epo_i;

the phrase embedding is E_eph = (e_eph1, e_eph2, ..., e_ephn), where the phrase types that words ew_1, ew_2, ..., ew_n belong to are, in order, eph_1, eph_2, ..., eph_n, and e_ephi (i = 1, 2, ..., n) is the embedding of phrase type eph_i;
step 1.2.3: for unstructured English medical texts, generating substring embedding by using a word2vec tool;
firstly, collecting an English medical term dictionary and constructing a medical term substring set; for any two terms, extracting the longest common substring of the two terms, and adding the substring to the medical term substring set; if two terms have a plurality of longest common substrings with the same length, the first longest common substring is taken and added into the medical term substring set;
secondly, generating embedded representations of all substrings by using a word2vec tool;
then, for an English sentence ES = (ew_1, ew_2, ..., ew_n) of words, for each word ew_i (i = 1, 2, ..., n), judge whether ew_i contains substrings from the medical term substring set; suppose word ew_i contains the substrings esubs_1, esubs_2, ..., esubs_q of the medical term substring set, whose embeddings are e_es1, e_es2, ..., e_esq respectively; then the substring embedding representation e_essi of word ew_i is the sum of e_es1, e_es2, ..., e_esq divided by the number of substrings q; if word ew_i contains no substring of the medical term substring set, a custom default value is output;

finally, for each word ew_i (i = 1, 2, ..., n) in sentence ES, its corresponding substring embedding is generated as above; the substring embedding of sentence ES is E_ess = (e_ess1, e_ess2, ..., e_essn);
Step 1.2.4: splicing character embedding, word embedding, part of speech embedding, phrase embedding and substring embedding of the English medical text to construct multi-granularity text embedding, specifically comprising the following steps of:
for an English sentence ES = (ew_1, ew_2, ..., ew_n) of words, the multi-granularity text embedding of word ew_i (i = 1, 2, ..., n) is constructed by concatenating the character embedding E_ec, word embedding E_ew, part-of-speech embedding E_epos, phrase embedding E_eph, and substring embedding E_ess of sentence ES, yielding the multi-granularity text embedding of ES, as shown in (4):

E_eme = Concate(E_ec, E_ew, E_epos, E_eph, E_ess) (4)

wherein Concate denotes the concatenation operation; accordingly, the dimension of E_eme equals the sum of the dimensions of E_ec, E_ew, E_epos, E_eph, and E_ess;

thus, through steps 1.2.1 to 1.2.4, the multi-granularity text embedding E_eme of English word ew_i (i = 1, 2, ..., n) is constructed;
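Step 1.2.2's rule that an English word's character embedding is the average of its characters' embeddings can be sketched as follows (character vectors and the zero fallback for unseen characters are illustrative assumptions):

```python
def word_char_embedding(word, char_vectors, dim=2):
    # Per step 1.2.2, a word's character embedding is the mean of the
    # embeddings of all of its characters; characters without a known
    # vector fall back to zeros here (an assumption for illustration).
    vecs = [char_vectors.get(c, [0.0] * dim) for c in word]
    return [sum(v[d] for v in vecs) / len(vecs) for d in range(dim)]

# Toy 2-dimensional character vectors.
emb = word_char_embedding("ab", {"a": [1.0, 3.0], "b": [3.0, 1.0]})
# -> [2.0, 2.0]
```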
step 2: generate the pattern weights of all characters in a Chinese sentence according to the medical entity composition patterns, comprising the following steps:
step 2.1: construct the Chinese medical entity composition patterns;
the medical entity constitution mode has the constitution form: "Y" is 1 +Y 2 +Y 3 +...+Y k ”;
Wherein, Y 1 ,Y 2 ,Y 3 ,...,Y k Represents the category of the word, "+" represents the link operation of the character string;
the category of the words comprises negative words, clinical manifestations, anatomical parts, modifiers, disease names, physical examination, numerical values, quantifiers, and drugs;
step 2.2: generate the pattern weights of the characters in the Chinese sentence;
for the Chinese sentence CS = (cc_1, cc_2, ..., cc_n) of characters, judge whether the Chinese sentence CS matches a medical entity composition pattern, and construct the pattern matching weight vector (w_1, w_2, ..., w_n):

case 1: if a character string cc_i, cc_{i+1}, ..., cc_j satisfies an "anatomical part", "disease", or "drug" pattern, each of the characters cc_i, cc_{i+1}, ..., cc_j is given a pattern weight of 2;

case 2: if a character string cc_i, cc_{i+1}, ..., cc_j satisfies any other pattern, each of the characters cc_i, cc_{i+1}, ..., cc_j is given a pattern weight of 1.5;

case 3: if a character string cc_i, cc_{i+1}, ..., cc_j satisfies no pattern, each of the characters cc_i, cc_{i+1}, ..., cc_j is given a pattern weight of 1;
step 3: perform node embedding representation learning using the graph attention network and the pattern-enhanced attention mechanism, comprising the following steps:
step 3.1: transform the embedding dimension of Chinese character nodes or English word nodes with a fully-connected layer;

the multi-granularity text embedding of each character in the Chinese sentence CS is input to a fully-connected layer, which converts its embedding dimension; the conversion is needed because the multi-granularity text embedding dimension must match the node vector input dimension of the graph attention network used in step 3.2;

similarly, the multi-granularity text embedding of each word in the English sentence ES is input to the fully-connected layer, which converts its embedding dimension;

in the fully-connected layer, a linear layer first converts the dimension; dropout is then applied to prevent overfitting; finally, the ReLU activation function mitigates vanishing gradients;
step 3.2: for Chinese medical text, the attention weights of the Chinese character nodes in the graph attention network are multiplied by the pattern weights of the characters in the Chinese sentence; for English medical text, the pattern weight of every English word node is set to 1;

for Chinese medical text, the node embeddings of the graph attention network are the character embeddings generated in step 3.1; for English medical text, the node embeddings of the graph attention network are the word embeddings generated in step 3.1;
step 3.2.1: calculating attention weights of nodes in the graph attention network;
first, a multi-granular text embedding h of a sentence is input into an attention layer in an attention network, wherein,
Figure BDA0003195630660000071
embedding Chinese character nodes or English word nodes of a sentence of a medical text, wherein M is the number of the nodes, and H is the dimension of character embedding, as shown in (5);
Figure BDA0003195630660000072
performing linear transformation on the input node embedding, and converting the node embedding into the dimension of the number of all category labels; and calculating attention weight by using LeakyRelu function, namely calculating importance degree e of node v to node u uv As shown in (6);
Figure BDA0003195630660000073
wherein, W 1 A shared weight matrix is represented that is,
Figure BDA0003195630660000074
the embedding of a chinese character node or an english word node u representing a sentence of medical text,
Figure BDA0003195630660000075
embedding Chinese character nodes or English word nodes v representing one sentence of the medical text;
then, using Softmax function to pair e uv Normalization is carried out to obtain alpha uv As shown in (7);
Figure BDA0003195630660000081
wherein e is uk Represents the degree of importance of node u to node k, α uv Denotes e uv Normalized value of (1), N u A neighbor node representing node u;
finally, the attention weight alpha of the node u is generated u As shown in (8);
Figure BDA0003195630660000082
wherein, W 2 Representing a weight matrix;
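The attention-weight computation of step 3.2.1 can be sketched as follows (a minimal, fully-connected-graph illustration in the standard graph-attention style; the pairwise score function `W1_dot` stands in for the learned shared transformation, which is an assumption here, not the patent's trained model):

```python
import math

def leaky_relu(x, slope=0.01):
    # LeakyRelu as used when scoring node importance.
    return x if x > 0 else slope * x

def attention_scores(h, W1_dot):
    """Compute importance scores e_uv for every node pair via LeakyRelu,
    then Softmax-normalize each row over the node's neighbours (here:
    all nodes) to obtain alpha_uv, as in formula (7)."""
    M = len(h)
    e = [[leaky_relu(W1_dot(h[u], h[v])) for v in range(M)] for u in range(M)]
    alpha = []
    for row in e:
        exps = [math.exp(x) for x in row]
        z = sum(exps)
        alpha.append([x / z for x in exps])
    return alpha

# Toy score function: plain dot product of 2-dimensional node embeddings.
dot = lambda a, b: sum(x * y for x, y in zip(a, b))
alpha = attention_scores([[1.0, 0.0], [0.0, 1.0]], dot)
```

Each row of `alpha` sums to 1, as the Softmax normalization in formula (7) requires.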
step 3.2.2: updating attention weights of nodes in the graph attention network;
first, for node u, the attention weight alpha_u of node u is updated with the pattern weight w_u of the Chinese character or English word that node u represents, as shown in (9):

alpha_u = alpha_u × w_u (9)
second, an attention weight vector attention_l (1 ≤ l ≤ k) is constructed for the sentence, as shown in (10):

attention_l = (alpha_1, alpha_2, ..., alpha_M) (10)
then, a multi-head attention mechanism is introduced into the graph attention network, specifically: k attention weight vectors are computed, and each is multiplied by the input h to generate a sentence feature h'_l, as shown in (11):

h'_l = attention_l × h (11)

the per-head outputs elu(h'_1), elu(h'_2), ..., elu(h'_k) are generated with the elu activation function;
thirdly, the outputs of the k heads are concatenated to generate h', as shown in (12):

h' = Concat(elu(h'_1), elu(h'_2), ..., elu(h'_k)) (12)

finally, the log_softmax function generates the final output h_final, as shown in (13):

h_final = log_softmax(h') (13)
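Formulas (9), (11), and (12) of step 3.2.2 can be sketched together as follows (a minimal illustration with toy numbers; the representation of heads as plain weight vectors is an assumption for clarity, not the patent's tensor implementation):

```python
import math

def elu(x):
    # elu activation applied to each head's output.
    return x if x > 0 else math.exp(x) - 1.0

def multi_head_output(head_weights, h, pattern_w):
    """For each head, rescale its attention weights alpha by the
    per-character pattern weights (formula (9)), apply them to the node
    embeddings h (formula (11)), pass the result through elu, and
    concatenate the k head outputs (formula (12))."""
    outputs = []
    for alpha in head_weights:                              # one alpha per head
        scaled = [a * w for a, w in zip(alpha, pattern_w)]  # (9)
        dim = len(h[0])
        feat = [sum(s * node[d] for s, node in zip(scaled, h))  # (11)
                for d in range(dim)]
        outputs.extend(elu(x) for x in feat)                # per-head elu
    return outputs                                          # (12): concatenation

# Two heads, two nodes with 2-dimensional embeddings, pattern weights (1, 2).
out = multi_head_output([[0.5, 0.5], [1.0, 0.0]],
                        [[1.0, 2.0], [3.0, 4.0]], [1.0, 2.0])
# -> [3.5, 5.0, 1.0, 2.0]
```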
step 4: generate the entity category labels of the medical text with a conditional random field and output the medical entity recognition result, specifically: generate the entity category label of each Chinese character or English word;

based on the conditional random field, the conditional probability of each character belonging to each entity category label is computed, the label with the highest probability is assigned to the corresponding character as its entity category label, and the medical entity recognition result is output;

the conditional random field performs sequence labeling on the sentences of the medical text, generates the entity category labels of the Chinese characters or English words, and outputs the medical text entity recognition result.
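The decoding a linear-chain conditional random field performs in step 4 can be sketched as Viterbi decoding over per-character label scores (a minimal illustration; the emission and transition values and the label set are toy assumptions, not trained parameters):

```python
def viterbi_decode(emissions, transitions, labels):
    """Viterbi decoding: for each position and label, keep the best-scoring
    previous label, then backtrack to recover the highest-scoring label
    sequence, which the CRF assigns to the characters of the sentence."""
    n_labels = len(labels)
    score = list(emissions[0])
    back = []
    for emit in emissions[1:]:
        new_score, pointers = [], []
        for j in range(n_labels):
            best_i = max(range(n_labels),
                         key=lambda i: score[i] + transitions[i][j])
            new_score.append(score[best_i] + transitions[best_i][j] + emit[j])
            pointers.append(best_i)
        score, back = new_score, back + [pointers]
    best = max(range(n_labels), key=lambda j: score[j])
    path = [best]
    for pointers in reversed(back):
        best = pointers[best]
        path.append(best)
    return [labels[i] for i in reversed(path)]

# Two characters, two labels, neutral transitions.
tags = viterbi_decode(
    emissions=[[2.0, 0.0], [0.0, 2.0]],
    transitions=[[0.0, 0.0], [0.0, 0.0]],
    labels=["O", "B-Disease"])
# -> ['O', 'B-Disease']
```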
Advantageous effects
Compared with the traditional medical entity identification method, the medical entity identification method based on multi-granularity text embedding provided by the invention has the following beneficial effects:
1. the identification method is portable and robust and is not restricted to a particular corpus source; it performs graph representation modeling of medical text with a graph attention network, is not restricted to a particular corpus language, and can process both Chinese and English text;
2. the method constructs multi-granularity text embeddings of unstructured medical text, comprising character embedding, word embedding, part-of-speech embedding, substring embedding, and phrase embedding; by introducing multi-granularity text embedding, it mines the character, word, part-of-speech, phrase, and substring features of the medical text, realizes distributed representation learning at the string, lexical, and syntactic levels, enriches the entity feature information of the medical text, and improves the accuracy of medical entity recognition;
3. the method performs medical entity recognition with a graph attention network, a pattern-enhanced attention mechanism, and a conditional random field: first, the graph attention network model realizes graph representation modeling of the medical text and captures the graph structure information among its Chinese characters or English words; second, the pattern-enhanced attention mechanism introduces medical entity composition pattern features into the node attention weights of the graph attention network, effectively integrating the pattern-based and deep learning-based medical entity identification methods, fully exploiting the characteristics and advantages of both, and improving medical entity recognition performance;
4. the method can identify the medical entities of the unstructured Chinese medical text and the English medical text, and has wide application prospects in the fields of information retrieval, text classification, question-answering systems and the like.
Drawings
Fig. 1 is a flowchart illustrating a medical entity recognition based on multi-granular text embedding according to an embodiment of the present invention.
Detailed Description
The medical entity recognition system based on the method uses PyCharm as the development tool, Python as the development language and PyTorch as the development framework.
Preferred embodiments of the method of the present invention will be described in detail with reference to examples.
Examples
This embodiment describes a process of using the method for medical entity recognition based on multi-granularity text embedding according to the present invention, as shown in fig. 1.
First, character embeddings of Chinese medical texts and word embeddings of English medical texts are generated with the pre-trained language models MC-BERT and BioBERT. Word embedding, part-of-speech embedding, phrase embedding and substring embedding of the Chinese medical text, and character embedding, part-of-speech embedding, phrase embedding and substring embedding of the English medical text, are generated with the word2vec tool; these embeddings are concatenated to construct the final multi-granularity text embedding for the Chinese and English medical texts. Second, pattern weights of all characters in a Chinese sentence are generated according to the medical entity composition patterns, while the pattern weights of all words in an English sentence are set to 1. Then, node embedding representation learning is performed with the graph attention network and the pattern-enhanced attention mechanism, and the attention weights of the nodes in the graph attention network are updated with the pattern weights. Finally, a conditional random field predicts the entity label of each character in the Chinese medical text, or of each word in the English medical text, and the medical entity recognition result is output.

Experiments were performed on the CCKS2019 dataset. First, the multi-granularity text embedding of each sentence of medical text in the CCKS2019 dataset is generated. Second, the pattern weights of all characters in each sentence are generated according to the medical entity composition patterns. Then, the multi-granularity text embedding and the pattern matching weights are fed into the graph attention network; the pattern matching weights are multiplied by the attention weights of the nodes, and the final embedded representation of the input text is computed. Finally, the conditional random field outputs the final predicted entity labels according to the computed probabilities. The experimental results demonstrate the effectiveness of the invention.

The method can also be applied to the English medical text dataset NCBI Disease, the biochemistry dataset BC5CDR, and others. The process applied to NCBI Disease is broadly the same as for CCKS2019; the difference is that when the attention coefficients are calculated in the graph attention network, the pattern weights of the English word nodes are all set to 1. The process applied to BC5CDR differs as follows: when the multi-granularity text embedding is constructed, a term dictionary for the biochemistry domain must be used to generate the substring embedding, which is concatenated with the character, word, part-of-speech and phrase embeddings, and the concatenated embedding is fed into the graph attention network; when the graph attention network computes the attention weights, entity composition pattern matching weights for the biochemistry domain are added; the output of the graph attention network is fed into the conditional random field, which outputs the final predicted entity recognition result according to the probabilities.
As can be seen from fig. 1, the method specifically includes the following steps:
step 1: the method for constructing the multi-granularity text embedding through the pre-training language model comprises the following steps:
step 1.1: for unstructured Chinese medical texts, constructing multi-granularity text embedding;
step 1.1.1: learning of token embedding, segment embedding and mask embedding is performed with the pre-trained language model MC-BERT to generate the character embedding of the unstructured Chinese medical text;
MC-BERT is a pre-trained model trained on Chinese medical data;
for unstructured Chinese medical text, the input of the pre-trained language model MC-BERT consists of three kinds of embedding: token embedding, segment embedding and mask embedding. Token embedding is the vector representation of each character. Segment embedding distinguishes two natural language sentences; since the medical entity recognition task recognizes entities in units of single sentences, every character has the same segment embedding. In the mask embedding, if the current position holds a character of the input sentence, the value is 1; if the current position is padding, i.e. not a character of the input sentence, the value is 0;
for a Chinese sentence CS, CS = (cc_1, cc_2, ..., cc_n), where cc_1, cc_2, ..., cc_n are characters, the character embedding representation E_cc of the sentence CS is generated with the pre-trained language model MC-BERT, as shown in (1);

E_cc = (e_cc1, e_cc2, ..., e_ccn), E_cc ∈ R^(n×m)   (1)

where n is the sentence length 512, i.e. the number of characters in the sentence, padded with 0 if the number of characters is less than 512; m is the dimensionality 768 set by the pre-trained model MC-BERT; e_cci (i = 1, 2, ..., n) is the 768-dimensional embedding of the character cc_i; the character embedding representation E_cc has dimension 512 × 768; R^(n×m) denotes a real matrix with n rows and m columns;
for example, in the sentence "the patient has yellow staining of the skin and sclera, with decreased appetite 4 months ago, obvious after dinner, with paroxysmal abdominal pain, nausea, no diarrhea, vomiting, chest tightness, shortness of breath, dizziness, headache, no rotation of objects, no fever, cough, chest pain, asthma, and light-colored stool.", the characters are separated by "\t", and the markers "[CLS]" and "[SEP]" are added at the beginning and end of the sentence. To make the dimensions of the character embedding representations of different sentences consistent, the sentence is padded with 0 to length 512. The character embedding representation of the sentence generated with the pre-trained model MC-BERT is:
E_cc = [[x_11, x_12, ..., x_1m], [x_21, x_22, ..., x_2m], ..., [x_n1, x_n2, ..., x_nm]]

where n is the sentence length 512 and m is the character embedding dimension 768. The character embedding vector of the character "sick" is (x_11, x_12, ..., x_1m);
Step 1.1.2: generating word embedding, part of speech embedding and phrase embedding of the Chinese medical text;
firstly, for a Chinese medical text, the jieba word segmentation tool is used to obtain its words, the Stanford POS tagger is used to obtain the part-of-speech tags of the words, and the Stanford parser is used to obtain the phrase tags of the Chinese medical text;
then, word embedding, part-of-speech embedding and phrase embedding are generated with the word2vec tool; all three embeddings have dimension 200;
for the Chinese sentence CS, CS = (cc_1, cc_2, ..., cc_n), where cc_1, cc_2, ..., cc_n are characters. Word embedding, specifically: the words of the Chinese medical text are obtained with the jieba word segmentation tool, and the word embedding of the word each character belongs to is obtained with word2vec. Part-of-speech embedding, specifically: based on the part-of-speech tags of the words of the Chinese medical text, the part-of-speech embedding of the word each character belongs to is obtained with word2vec. Phrase embedding, specifically: based on the obtained phrase tags of the Chinese medical text, the type embedding of the phrase each character belongs to is obtained with word2vec;

the word embedding is E_cw = (e_cw1, e_cw2, ..., e_cwn), where the characters cc_1, cc_2, ..., cc_n belong in turn to the words cw_1, cw_2, ..., cw_n, and e_cwi (i = 1, 2, ..., n) is the embedding of the word cw_i;

the part-of-speech embedding is E_cpos = (e_cpo1, e_cpo2, ..., e_cpon), where the parts of speech of the words cw_1, cw_2, ..., cw_n are in turn cpo_1, cpo_2, ..., cpo_n, and e_cpoi (i = 1, 2, ..., n) is the embedding of the part of speech cpo_i;

the phrase embedding is E_cph = (e_cph1, e_cph2, ..., e_cphn), where the phrase types the characters cc_1, cc_2, ..., cc_n belong to are in turn cph_1, cph_2, ..., cph_n, and e_cphi (i = 1, 2, ..., n) is the embedding of the phrase type cph_i;
for example, for the Chinese sentence "the patient found yellow staining of the skin and sclera 4 months ago", the segmentation result obtained with the jieba tool is "patient / 4 / months / ago / found / skin / sclera / yellow staining". The segmentation result is then expanded to character level, so that the word each character in the sentence belongs to is given. For example, the character "suffer" belongs to the word "patient", so the word embedding of the character "suffer" is the embedding of the word "patient";

the part-of-speech tags of the expanded Chinese sentence obtained with the Stanford POS tagger are "NN NN CD NN LC VV NN NN PU NN NN NR NR NR", and the phrase tags of the expanded Chinese sentence obtained with the Stanford parser are "NP NP NP NP NP NP NP NP NP LCP VP NP NP PU NP NP NP NP NP NP". For example, the character "suffer" belongs to the word "patient"; the part-of-speech embedding of the character "suffer" is the embedding of the part of speech "NN" of the word "patient", and its phrase embedding is the embedding of the type tag "NP" of the phrase it belongs to;
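The expansion of a word segmentation to character level, so that each character inherits the embedding of the word it belongs to, can be sketched as below. The function name and the toy 3-dimensional vectors (standing in for 200-dimensional word2vec vectors) are illustrative assumptions.

```python
# Minimal sketch of step 1.1.2's character-level expansion (illustrative).
def char_level_word_embeddings(words, word_vectors):
    """Return one embedding per character: each character of a word
    receives that word's embedding."""
    per_char = []
    for w in words:
        for _ in w:                       # repeat once per character
            per_char.append(word_vectors[w])
    return per_char

words = ["患者", "4", "月", "前"]           # jieba-style segmentation
vecs = {"患者": [0.1, 0.2, 0.3], "4": [0.4, 0.4, 0.4],
        "月": [0.5, 0.1, 0.0], "前": [0.0, 0.9, 0.2]}
emb = char_level_word_embeddings(words, vecs)
# the two characters of "患者" both receive the embedding of "患者"
```

The same expansion applies to part-of-speech and phrase embeddings, since those are also attached to the word each character belongs to.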
step 1.1.3: for an unstructured Chinese medical text, generating substring embedding by using a word2vec tool, wherein the embedding dimension is 200, and specifically comprises the following steps:
firstly, collecting a medical term dictionary and constructing a medical term substring set;
for any two terms, the longest common substring of the two terms is extracted and added to the medical term substring set. If the two terms have several longest common substrings of the same length, the first one is taken and added to the medical term substring set;
secondly, generating embedded expressions of all substrings in the substring set of the medical terms by using a word2vec tool;
then, for each word cw_i (i = 1, 2, ..., n), it is judged whether the word cw_i contains substrings in the medical term substring set. For the Chinese sentence CS, CS = (cc_1, cc_2, ..., cc_n), where cc_1, cc_2, ..., cc_n are characters belonging in turn to the words cw_1, cw_2, ..., cw_n. Suppose the word cw_i contains the substrings csubs_1, csubs_2, ..., csubs_p of the medical term substring set, whose embeddings are denoted e_cs1, e_cs2, ..., e_csp. Then the substring embedding representation e_cssi of the word cw_i is: the sum of e_cs1, e_cs2, ..., e_csp divided by the number of substrings p. If the word cw_i contains no substring of the medical term substring set, a custom default value is output;

finally, for each character cc_i (i = 1, 2, ..., n) in the sentence CS, its corresponding substring embedding is generated according to the above steps; the substring embedding of the sentence CS is E_css = (e_css1, e_css2, ..., e_cssn);
For example, a medical dictionary is collected: the Systematized Nomenclature of Medicine, Clinical Terms (SNOMED CT), and the medical term substring set is constructed from it. For the word "digestive tract" (消化道) in a Chinese sentence, the word contains six substrings of the medical term substring set: the three single characters "消", "化" and "道", and the substrings "消化" ("digestion"), "化道" and "消化道" ("digestive tract"). The substring embedding of the string "digestive tract" is therefore the result of summing the embeddings of these six substrings and dividing by the number of substrings, 6;
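Step 1.1.3 can be sketched as follows: a classic dynamic-programming longest-common-substring routine builds the substring set, and contained-substring embeddings are averaged. This is an illustrative sketch, not the patent's code; the function names, the English example terms and the toy vectors are assumptions.

```python
# Minimal sketch of step 1.1.3 (illustrative).
from itertools import combinations

def longest_common_substring(a: str, b: str) -> str:
    """Classic DP; on ties, the first (leftmost in a) longest match wins."""
    best, best_len = "", 0
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
                if dp[i][j] > best_len:          # strict >: keep first tie
                    best_len = dp[i][j]
                    best = a[i - best_len:i]
    return best

def build_substring_set(terms):
    """Collect the longest common substring of every pair of terms."""
    subs = set()
    for t1, t2 in combinations(terms, 2):
        s = longest_common_substring(t1, t2)
        if s:
            subs.add(s)
    return subs

def substring_embedding(word, subs, vectors, default=None):
    """Average the embeddings of the set substrings contained in `word`;
    return the custom default value when no substring matches."""
    hits = [vectors[s] for s in subs if s in word and s in vectors]
    if not hits:
        return default
    dim = len(hits[0])
    return [sum(v[k] for v in hits) / len(hits) for k in range(dim)]

subs = build_substring_set(["digestive tract", "digestive system"])
emb = substring_embedding("abc", {"ab", "bc"}, {"ab": [1.0, 2.0], "bc": [3.0, 4.0]})
# emb averages the two contained substrings' vectors: [2.0, 3.0]
```

In the patented method the averaged vectors are 200-dimensional word2vec embeddings of the substrings.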
step 1.1.4: character embedding, word embedding, part of speech embedding, phrase embedding and substring embedding of the Chinese medical text are spliced to construct multi-granularity text embedding:
for the Chinese sentence CS, CS = (cc_1, cc_2, ..., cc_n), where cc_1, cc_2, ..., cc_n are characters, the multi-granularity text embedding of each character cc_i (i = 1, 2, ..., n) is constructed, i.e. the character embedding E_cc, word embedding E_cw, part-of-speech embedding E_cpos, phrase embedding E_cph and substring embedding E_css of the sentence CS are concatenated to construct the multi-granularity text embedding of the sentence CS, as shown in (2);

E_cme = Concate(E_cc, E_cw, E_cpos, E_cph, E_css)   (2)

where Concate denotes the concatenation operation. The dimension of E_cme is 1568, i.e. 1568 (dimension of E_cme) = 768 (dimension of E_cc) + 200 (dimension of E_cw) + 200 (dimension of E_cpos) + 200 (dimension of E_cph) + 200 (dimension of E_css);
thus, from step 1.1.1 to step 1.1.4, the multi-granularity text embedding E_cme of the Chinese characters cc_i (i = 1, 2, ..., n) is constructed;
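The concatenation of equation (2) and its dimension arithmetic can be checked with a small numpy sketch (illustrative; the toy sentence length replaces the padded length 512):

```python
# Minimal sketch of equation (2): concatenating the five per-character
# embeddings into the 1568-d multi-granularity embedding.
import numpy as np

n = 4                                    # toy sentence length (512 in the patent)
E_cc   = np.zeros((n, 768))              # character embedding (MC-BERT)
E_cw   = np.zeros((n, 200))              # word embedding (word2vec)
E_cpos = np.zeros((n, 200))              # part-of-speech embedding
E_cph  = np.zeros((n, 200))              # phrase embedding
E_css  = np.zeros((n, 200))              # substring embedding

E_cme = np.concatenate([E_cc, E_cw, E_cpos, E_cph, E_css], axis=1)
# per-character dimension: 768 + 4 * 200 = 1568
```

The same concatenation applies to the English embeddings of equation (4).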
Step 1.2: for unstructured English medical text, generating multi-granularity text embedding, comprising the steps of:
step 1.2.1: learning of token embedding, segment embedding and mask embedding is performed with the pre-trained language model BioBERT to generate the word embedding of the unstructured English medical text;
BioBert is a pre-training model generated according to English medical data training;
for an English sentence ES, ES = (ew_1, ew_2, ..., ew_n), where ew_1, ew_2, ..., ew_n are words, the word embedding representation of the sentence ES is generated with the pre-trained language model BioBERT, as shown in (3);

E_ew = (e_ew1, e_ew2, ..., e_ewn), E_ew ∈ R^(n×m)   (3)

where n is the sentence length 512, i.e. the number of words in the sentence, padded with 0 if the number of words is less than 512; m is the dimensionality 768 set by the pre-trained model BioBERT; e_ewi (i = 1, 2, ..., n) is the 768-dimensional embedding of the word ew_i; the word embedding representation E_ew has dimension 512 × 768; R^(n×m) denotes a real matrix with n rows and m columns;
step 1.2.2: generating character embedding, part of speech embedding and phrase embedding of English medical texts;
for English medical texts, word2vec tools are used for generating character embedding, part of speech embedding and phrase embedding, and the embedding dimension is 200;
for the English sentence ES, ES = (ew_1, ew_2, ..., ew_n), where ew_1, ew_2, ..., ew_n are words: the character embedding of a word ew_i (i = 1, 2, ..., n) is the average of the embeddings of all its characters; its part-of-speech embedding is the embedding of the part of speech of ew_i; its phrase embedding is the embedding of the type of the phrase it belongs to;

the character embedding is E_ec = (e_ec1, e_ec2, ..., e_ecn), where e_eci (i = 1, 2, ..., n) is the character embedding of the word ew_i;

the part-of-speech embedding is E_epos = (e_epo1, e_epo2, ..., e_epon), where the parts of speech of the words ew_1, ew_2, ..., ew_n are in turn epo_1, epo_2, ..., epo_n, and e_epoi (i = 1, 2, ..., n) is the embedding of the part of speech epo_i;

the phrase embedding is E_eph = (e_eph1, e_eph2, ..., e_ephn), where the phrase types the words ew_1, ew_2, ..., ew_n belong to are in turn eph_1, eph_2, ..., eph_n, and e_ephi (i = 1, 2, ..., n) is the embedding of the phrase type eph_i;
step 1.2.3: for an unstructured English medical text, generating substring embedding by using a word2vec tool, wherein the embedding dimension is 200, and specifically comprises the following steps:
firstly, collecting an English medical term dictionary and constructing a medical term substring set; for any two terms, extracting the longest common substring of the two terms, and adding the substring to the medical term substring set; if two terms have a plurality of longest common substrings with the same length, the first longest common substring is taken and added into the medical term substring set;
secondly, generating embedded representations of all the substrings by using a word2vec tool;
then, for the English sentence ES, ES = (ew_1, ew_2, ..., ew_n), where ew_1, ew_2, ..., ew_n are words: for each word ew_i (i = 1, 2, ..., n), it is judged whether ew_i contains substrings in the medical term substring set. Suppose the word ew_i contains the substrings esubs_1, esubs_2, ..., esubs_q of the medical term substring set, whose embeddings are denoted e_es1, e_es2, ..., e_esq. Then the substring embedding representation e_essi of the word ew_i is: the sum of e_es1, e_es2, ..., e_esq divided by the number of substrings q. If the word ew_i contains no substring of the medical term substring set, a custom default value is output;

finally, for each word ew_i (i = 1, 2, ..., n) in the sentence ES, its corresponding substring embedding is generated according to the above steps; the substring embedding of the sentence ES is E_ess = (e_ess1, e_ess2, ..., e_essn);
Step 1.2.4: character embedding, word embedding, part of speech embedding, phrase embedding and substring embedding of the English medical text are spliced to construct multi-granularity text embedding:
for the English sentence ES, ES = (ew_1, ew_2, ..., ew_n), where ew_1, ew_2, ..., ew_n are words, the multi-granularity text embedding of each word ew_i (i = 1, 2, ..., n) is constructed, i.e. the character embedding E_ec, word embedding E_ew, part-of-speech embedding E_epos, phrase embedding E_eph and substring embedding E_ess of the sentence ES are concatenated to construct the multi-granularity text embedding of the sentence ES, as shown in (4);

E_eme = Concate(E_ec, E_ew, E_epos, E_eph, E_ess)   (4)

where Concate denotes the concatenation operation. The dimension of E_eme is 1568, i.e. 1568 (dimension of E_eme) = 768 (dimension of E_ew) + 200 (dimension of E_ec) + 200 (dimension of E_epos) + 200 (dimension of E_eph) + 200 (dimension of E_ess);

thus, from step 1.2.1 to step 1.2.4, the multi-granularity text embedding E_eme of the English words ew_i (i = 1, 2, ..., n) is constructed;
Step 2: generating the mode weight of all characters in the Chinese sentence according to the medical entity constitution mode, comprising the following steps:
step 2.1: constructing a Chinese medical entity composition mode;
the medical entity composition patterns have the form "Y_1 + Y_2 + Y_3 + ... + Y_k", where Y_1, Y_2, Y_3, ..., Y_k denote word categories and "+" denotes the string concatenation operation. The word categories comprise negation words, clinical manifestations, anatomical parts, modifiers, disease names, physical examinations, numerical values, quantifiers and medicines;
for example, negation words include none, absent, etc. The clinical manifestations include chill, sweating, increased heart rate, etc. Anatomical sites include the back, meniscus, left colon artery, etc. Modifiers include mild, more severe, etc. The disease names include rheumatic heart disease, multiple cancers, etc. Physical examination includes cardiopulmonary examination, electrocardiography, and the like. Quantifier includes degree, group, only, etc. The medicines comprise cedilanid, cefuroxime axetil, aspirin and the like;
for example, the medical entity composition pattern "negation word + clinical manifestation" is constructed; the terms "no nausea" and "no fever" satisfy it, since "no nausea" consists of the negation word "no" and the clinical manifestation "nausea", and "no fever" consists of the negation word "no" and the clinical manifestation "fever";
step 2.2: generating a mode weight of characters in the Chinese sentence;
for the Chinese sentence CS, CS = (cc_1, cc_2, ..., cc_n), where cc_1, cc_2, ..., cc_n are characters, it is judged whether the Chinese sentence CS matches the medical entity composition patterns; the constructed pattern matching weight vector is (w_1, w_2, ..., w_n);

Case 1: if the character string cc_i, cc_i+1, ..., cc_j satisfies the pattern "anatomical part", "disease" or "medicine", each character cc_i, cc_i+1, ..., cc_j is given pattern weight 2;
Case 2: if the character string cc_i, cc_i+1, ..., cc_j satisfies another pattern, each character cc_i, cc_i+1, ..., cc_j is given pattern weight 1.5;
Case 3: if the character string cc_i, cc_i+1, ..., cc_j satisfies no pattern, each character cc_i, cc_i+1, ..., cc_j is given pattern weight 1;

for example, for the input text "the patient found yellow staining of the skin and sclera 4 months ago", the pattern weight vector generated from the medical entity composition patterns is: (1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 1, 2, 2, 1, 1, 1, 1, 1);
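The three weighting cases of step 2.2 can be sketched as below. The tiny pattern lexicons, the function name and the English stand-in strings are illustrative assumptions; the patented method matches full composition patterns over word categories.

```python
# Minimal sketch of step 2.2's per-character pattern weights (illustrative).
HIGH = {"anatomical part": ["skin", "sclera"], "disease": [], "medicine": []}
OTHER = {"negation+clinical": ["no nausea", "no fever"]}

def pattern_weights(sentence: str):
    w = [1.0] * len(sentence)                      # Case 3 default weight 1
    for cat, terms in list(HIGH.items()) + list(OTHER.items()):
        weight = 2.0 if cat in HIGH else 1.5       # Case 1 vs Case 2
        for t in terms:
            start = sentence.find(t)
            while start != -1:
                for k in range(start, start + len(t)):
                    w[k] = max(w[k], weight)       # keep the strongest match
                start = sentence.find(t, start + 1)
    return w

w = pattern_weights("yellow skin")
# positions covered by "skin" get weight 2.0, all others keep 1.0
```

The resulting vector plays the role of (w_1, w_2, ..., w_n) and is later multiplied into the node attention weights in step 3.2.2.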
Step 3: node embedding representation learning is performed with the graph attention network and the pattern-enhanced attention mechanism, comprising the following steps:
Step 3.1: the embedding dimension of the Chinese character nodes or English word nodes is transformed with a fully connected layer;
the multi-granularity text embedding of each character in the Chinese sentence CS is input into a fully connected layer, which converts the embedding dimension from 1568 to 768. The reason for the conversion is that the multi-granularity text embedding dimension must be consistent with the node vector input dimension of the graph attention network used in step 3.2, namely 768. Similarly, the multi-granularity text embedding of each word in the English sentence ES is input into a fully connected layer and converted from 1568 to 768 dimensions;
in the fully connected layer, the dimension is first converted by a linear layer; then overfitting is prevented with the dropout method; finally, the ReLU activation function is applied to mitigate vanishing gradients;
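The projection in step 3.1 can be sketched in numpy as below (illustrative, not the patented PyTorch code; the weight initialization and dropout rate are assumptions):

```python
# Minimal numpy sketch of step 3.1: linear layer 1568 -> 768, dropout, ReLU.
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(scale=0.02, size=(1568, 768))    # linear layer weights
b = np.zeros(768)

def fc_project(x, drop_p=0.1, train=False):
    """x: (n, 1568) multi-granularity embeddings -> (n, 768) node vectors."""
    h = x @ W + b                               # linear dimension change
    if train:                                   # inverted dropout at training
        mask = rng.random(h.shape) >= drop_p
        h = h * mask / (1.0 - drop_p)
    return np.maximum(h, 0.0)                   # ReLU activation

out = fc_project(np.ones((4, 1568)))
```

The 768-dimensional outputs become the node embeddings consumed by the graph attention network in step 3.2.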
step 3.2: for Chinese medical texts, the attention weights of the Chinese character nodes in the graph attention network are multiplied by the pattern weights of the characters in the Chinese sentence; for English medical texts, the pattern weights of the English word nodes in the graph attention network are set to 1;
for Chinese medical texts, the node embedding of the graph attention network is the character embedding, namely the 768-dimensional character embedding generated in step 3.1. For English medical texts, the node embedding of the graph attention network is the word embedding, namely the 768-dimensional word embedding generated in step 3.1;
step 3.2.1: calculating attention weights of nodes in the graph attention network;
first, the multi-granularity text embedding h of a sentence is input into the attention layer of the graph attention network, where h is the embedding of the Chinese character nodes or English word nodes of one sentence of medical text, M is the number of nodes, 512, and H is the dimension of the character embedding, with value 768, as shown in (5);

h = (h_1, h_2, ..., h_M), h ∈ R^(M×H)   (5)
the input node embeddings are linearly transformed, converting the 768-dimensional node embeddings into 16-dimensional ones, where 16 is the number of all class labels. The attention weight is then calculated with the LeakyReLU function, i.e. the importance degree e_uv of node v to node u is computed, as shown in (6);

e_uv = LeakyReLU(W_1 [h_u, h_v])   (6)

where W_1 denotes a shared weight matrix, h_u denotes the embedding of the Chinese character node or English word node u of one sentence of medical text, and h_v denotes the embedding of the Chinese character node or English word node v of one sentence of medical text;
then e_uv is normalized with the Softmax function to obtain α_uv, as shown in (7);

α_uv = exp(e_uv) / Σ_{k∈N_u} exp(e_uk)   (7)

where e_uk denotes the importance degree of node k to node u, α_uv denotes the normalized value of e_uv, and N_u denotes the neighbor nodes of node u;
finally, the attention weight α_u of node u is generated, as shown in (8);

α_u = Σ_{v∈N_u} α_uv W_2 h_v   (8)

where W_2 denotes a weight matrix;
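A simplified numpy sketch of the attention computation described by equations (6) to (8) follows. It is illustrative only: the exact shapes of W_1 and W_2, the LeakyReLU slope and the scoring form are assumptions consistent with the textual description, not the patent's exact formulas.

```python
# Minimal sketch of graph-attention coefficients over a node's neighbors.
import numpy as np

def leaky_relu(x, slope=0.01):
    return np.where(x > 0, x, slope * x)

def attention_weight(u, neighbors, h, W1, W2):
    """Scalar attention weight alpha_u of node u over its neighborhood."""
    # e_uv: importance of each neighbor v to node u, cf. eq. (6)
    e = np.array([leaky_relu(W1 @ np.concatenate([h[u], h[v]]))
                  for v in neighbors])
    # alpha_uv: softmax normalization over the neighborhood, cf. eq. (7)
    a = np.exp(e - e.max())
    a = a / a.sum()
    # alpha_u: attention-weighted aggregation through W2, cf. eq. (8)
    return float(sum(a_v * (W2 @ h[v]) for a_v, v in zip(a, neighbors)))

rng = np.random.default_rng(1)
h = rng.normal(size=(4, 16))     # 4 nodes, 16-d transformed node embeddings
W1 = rng.normal(size=32)         # scores the concatenated pair [h_u, h_v]
W2 = rng.normal(size=16)         # aggregation weights
alpha0 = attention_weight(0, [1, 2, 3], h, W1, W2)
```

In the patented method the node count is 512 and the transformed dimension is 16 (the number of class labels).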
step 3.2.2: updating the attention weights of the nodes in the graph attention network;
first, for node u, the attention weight α_u of node u is updated with the pattern weight w_u of the Chinese character or English word represented by node u, as shown in (9);

α_u = α_u × w_u   (9)

second, the attention weight attention_l (1 ≤ l ≤ k) of the sentence is constructed, as shown in (10);

attention_l = (α_1, α_2, ..., α_M)   (10)
a multi-head attention mechanism is then introduced into the graph attention network. Specifically, k attention weights are calculated; each attention weight is multiplied by the input h to generate the sentence feature h'_l, as shown in (11);

h'_l = attention_l × h   (11)
through the activation function elu, the single-head outputs elu(h'_1), elu(h'_2), ..., elu(h'_k) are obtained; finally, the outputs of the k heads are concatenated to generate h', as shown in (12);

h' = concat(elu(h'_1), elu(h'_2), ..., elu(h'_k))   (12)
finally, the final output h_final is generated through the log_softmax function, as shown in (13);

h_final = log_softmax(h')   (13)
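Equations (9) to (13) can be traced with a small numpy sketch. The toy sizes and the two hand-written attention vectors are illustrative assumptions; in the patent, M is 512 and the attention vectors come from the graph attention computation above.

```python
# Minimal sketch of eqs. (9)-(13): pattern-weighted attention vectors
# combined over k heads, then log-softmax.
import numpy as np

def elu(x):
    return np.where(x > 0, x, np.exp(x) - 1.0)

def log_softmax(x):
    z = x - x.max()
    return z - np.log(np.exp(z).sum())

M, H = 3, 4                              # toy node count and dimension
h = np.arange(M * H, dtype=float).reshape(M, H)
pattern_w = np.array([1.0, 2.0, 1.5])    # w_u from step 2
heads = [np.array([0.2, 0.5, 0.3]),      # k = 2 attention weight vectors
         np.array([0.4, 0.4, 0.2])]

outs = []
for alpha in heads:
    alpha = alpha * pattern_w            # eq. (9): update with pattern weights
    h_l = alpha @ h                      # eq. (11): attention_l x h
    outs.append(elu(h_l))                # single-head output
h_prime = np.concatenate(outs)           # eq. (12): concat the k heads
h_final = log_softmax(h_prime)           # eq. (13)
```

Exponentiating h_final recovers a probability distribution, which is what the conditional random field of step 4 consumes.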
Step 4: the entity category labels of the medical text are generated with a conditional random field and the medical entity recognition result is output, specifically: generating the entity category labels of the Chinese characters or English words;
the conditional probability distribution of each character is computed with the conditional random field, i.e. the probability that each character belongs to each entity category label; the label with the highest probability is assigned to the corresponding character as its entity category label, and the medical entity recognition result is output;
and (3) carrying out sequence labeling on sentences in the medical text by adopting a conditional random field, generating entity category labels of Chinese characters or English words, and outputting a medical text entity recognition result.
For example, for a data set, its entity class labels include: "PAD", "CLS", "SEP", "O", "B-disease and diagnosis", "I-disease and diagnosis", "B-surgery", "I-surgery", "B-anatomical site", "I-anatomical site", "B-drug", "I-drug", "B-imaging examination", "I-imaging examination", "B-laboratory examination", "I-laboratory examination";
for example, in the sentence "the patient has yellow staining of the skin and sclera, with decreased appetite 4 months ago, obvious after dinner, with paroxysmal abdominal pain, nausea, no diarrhea, vomiting, chest tightness, shortness of breath, dizziness, headache, no rotation of objects, no fever, cough, chest pain, asthma, and light-colored stool.", the predicted label index list is [3,3,3,3,3,3,3,3,3,3,8,9,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,8,3,3,3,3,3,3,8,3,3,3,3,3,3,3,8,3,3,3,3,3,8,3,3,3,8,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,8,3,3,3,3,3,3,3,3,3,3,3,3,3,3]. Each number in the list is the index of the predicted entity category label of the character at that position. The indices are converted into the corresponding entity labels by the idx2tag function; in the final entity recognition result most characters are labeled "O", while the characters of anatomical parts such as "skin" and "sclera" are labeled "B-anatomical site" and "I-anatomical site".
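The idx2tag conversion can be sketched as below, assuming the 16 entity category labels listed above in their given order (so that index 3 is "O", 8 is "B-anatomical site" and 9 is "I-anatomical site"); the function and variable names are illustrative.

```python
# Minimal sketch of the idx2tag decoding step (illustrative).
TAGS = ["PAD", "CLS", "SEP", "O",
        "B-disease and diagnosis", "I-disease and diagnosis",
        "B-surgery", "I-surgery",
        "B-anatomical site", "I-anatomical site",
        "B-drug", "I-drug",
        "B-imaging examination", "I-imaging examination",
        "B-laboratory examination", "I-laboratory examination"]

def idx2tag(indices):
    """Map predicted label indices to entity category label strings."""
    return [TAGS[i] for i in indices]

pred = [3, 3, 8, 9, 3]                 # toy CRF output for five characters
labels = idx2tag(pred)
# labels == ["O", "O", "B-anatomical site", "I-anatomical site", "O"]
```

The B-/I- prefixes follow the usual BIO scheme: "B-" marks the first character of an entity and "I-" the characters inside it.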
To illustrate the medical entity recognition effect of the invention, a comparative experiment was performed with two methods on the same training and test sets under the same conditions. The first method is a medical entity recognition method based on a bidirectional long short-term memory network, an attention mechanism and a conditional random field, with a medical dictionary and part-of-speech features introduced. The second method is the medical entity recognition method of the present invention.
The evaluation indexes adopted are accuracy, recall and F1 value. The medical entity recognition results are: the prior-art method based on the bidirectional long short-term memory network, attention mechanism and conditional random field achieves an accuracy of 76.42%, a recall of 73.80% and an F1 value of 75.08%; the method of the invention achieves an accuracy of 86.38%, a recall of 85.82% and an F1 value of 86.10%. The experiments show the effectiveness of the proposed medical entity recognition method based on multi-granularity text embedding.
While the foregoing is directed to preferred embodiments of the present invention, the invention is not limited to the embodiments and drawings disclosed herein. All equivalents and modifications that do not depart from the spirit of the invention are intended to fall within its scope.

Claims (9)

1. A medical entity recognition method based on multi-granularity text embedding, characterized by comprising the following steps:
step 1: the method for constructing the multi-granularity text embedding through the pre-training language model comprises the following steps:
step 1.1: constructing multi-granularity text embedding for an unstructured Chinese medical text;
step 1.1.1: learning of token embedding, segment embedding and mask embedding is performed with the pre-trained language model MC-BERT to generate the character embedding of the unstructured Chinese medical text;
for unstructured Chinese medical text, the input of the pre-trained language model MC-BERT consists of three kinds of embedding, namely token embedding, segment embedding and mask embedding;
step 1.1.2: generating word embedding, part-of-speech embedding and phrase embedding of the Chinese medical text;
firstly, for a Chinese medical text, the jieba word segmentation tool is used to obtain its words, the Stanford POS tagger is used to obtain the part-of-speech tags of the words, and the Stanford parser is used to obtain the phrase tags of the Chinese medical text;
then, word embedding, part of speech embedding and phrase embedding are generated by using a word2vec tool;
step 1.1.3: for an unstructured Chinese medical text, generating substring embedding by using a word2vec tool, specifically:
firstly, collecting a medical term dictionary, and constructing a medical term substring set, specifically comprising the following steps: for any two terms, extracting the longest common substring of the two terms, and adding the common substring to the medical term substring set; if two terms have a plurality of longest common substrings with the same length, the first longest common substring is taken and added into the medical term substring set;
secondly, generating embedded expressions of all substrings in the substring set of the medical terms by using a word2vec tool;
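The medical-term substring set of step 1.1.3 (pairwise longest common substrings, with the first one kept on ties) can be sketched in Python as follows; the function names and the toy English terms are illustrative assumptions, since the claim prescribes only the procedure:

```python
# Sketch of the medical-term substring set construction (step 1.1.3).
# Helper names and the toy terms are illustrative assumptions.

def longest_common_substring(a: str, b: str) -> str:
    """Return the first longest common substring of a and b;
    on ties, the earliest occurrence in a is kept, as the claim specifies."""
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)          # prev[j]: common-suffix length at (i-1, j)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1
                if cur[j] > best_len:  # strictly greater keeps the FIRST longest
                    best_len, best_end = cur[j], i
        prev = cur
    return a[best_end - best_len:best_end]

def build_substring_set(terms):
    """Collect the longest common substring of every pair of dictionary terms."""
    subs = set()
    for i in range(len(terms)):
        for j in range(i + 1, len(terms)):
            s = longest_common_substring(terms[i], terms[j])
            if s:
                subs.add(s)
    return subs
```

For example, `build_substring_set(["gastritis", "gastric ulcer"])` yields `{"gastri"}`; the claim's actual dictionary would contain Chinese medical terms.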
step 1.1.4: character embedding, word embedding, part of speech embedding, phrase embedding and substring embedding of the Chinese medical text are spliced to construct multi-granularity text embedding;
step 1.2: for unstructured English medical text, constructing multi-granularity text embedding, comprising the following steps:
step 1.2.1: learning of symbol embedding, segmentation embedding and covering embedding is carried out by utilizing a pre-training language model BioBert, and word embedding of an unstructured English medical text is generated;
BioBert is a pre-training model generated according to English medical data training;
step 1.2.2: generating character embedding, part of speech embedding and phrase embedding of English medical texts;
for English medical texts, character embedding, part-of-speech embedding and phrase embedding are generated by using a word2vec tool;
step 1.2.3: for unstructured English medical texts, generating substring embedding by using a word2vec tool;
firstly, collecting an English medical term dictionary and constructing a medical term substring set; for any two terms, extracting the longest common substring of the two terms, and adding the common substring to the medical term substring set; if the two terms have a plurality of longest common substrings with the same length, taking the first longest common substring and adding the first longest common substring to the medical term substring set;
secondly, generating embedded representations of all the substrings by using a word2vec tool;
then, for the English sentence ES = (ew_1, ew_2, ..., ew_n), where ew_1, ew_2, ..., ew_n are words, for each word ew_i (i = 1, 2, ..., n), judge whether the word ew_i contains a substring in the medical term substring set; suppose the word ew_i contains substrings esubs_1, esubs_2, ..., esubs_q of the medical term substring set, and the embeddings of the substrings esubs_1, esubs_2, ..., esubs_q are denoted e_es1, e_es2, ..., e_esq; then the substring embedding representation e_essi of the word ew_i is: the sum of e_es1, e_es2, ..., e_esq divided by the number q of substrings; if the word ew_i contains no substring of the medical term substring set, a custom value is output;
finally, for each word ew_i (i = 1, 2, ..., n) in the sentence ES, its corresponding substring embedding is generated as per the above steps; the substring embedding of the sentence ES is E_ess = (e_ess1, e_ess2, ..., e_essn);
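A minimal sketch of the word-level substring embedding just described: the embeddings of all medical-term substrings a word contains are summed and divided by their number q, with a custom value emitted when nothing matches. The toy 4-dimensional vectors and the zero-vector default are assumptions:

```python
# Word-level substring embedding (steps 1.1.3 / 1.2.3): average the vectors
# of the medical-term substrings the word contains. DIM and the zero-vector
# "custom value" are illustrative assumptions.

DIM = 4
DEFAULT = [0.0] * DIM  # assumed custom value for words with no matching substring

def substring_embedding(word, substring_vectors):
    """substring_vectors: dict mapping each medical-term substring to its vector."""
    hits = [vec for sub, vec in substring_vectors.items() if sub in word]
    if not hits:
        return DEFAULT
    # element-wise sum of the matched vectors divided by their number q
    return [sum(col) / len(hits) for col in zip(*hits)]

def sentence_substring_embeddings(words, substring_vectors):
    """E_ess = (e_ess1, ..., e_essn) for a tokenized sentence."""
    return [substring_embedding(w, substring_vectors) for w in words]
```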
Step 1.2.4: splicing character embedding, word embedding, part of speech embedding, phrase embedding and substring embedding of the English medical text to construct multi-granularity text embedding;
step 2: generating the mode weight of all characters in the Chinese sentence according to the medical entity composition mode, comprising the following steps:
step 2.1: constructing a Chinese medical entity composition mode;
the medical entity composition mode has the form: "Y_1 + Y_2 + Y_3 + ... + Y_k";
wherein Y_1, Y_2, Y_3, ..., Y_k represent word categories, and "+" represents the concatenation operation of character strings;
the category of the words comprises negative words, clinical manifestations, anatomical parts, modifiers, disease names, physical examination, numerical values, quantifiers and medicines;
step 2.2: generating a mode weight of characters in the Chinese sentence;
step 3: using the graph attention network and the mode enhanced attention mechanism to carry out node embedding representation learning, comprising the following steps:
step 3.1, the embedding dimensions of Chinese character symbol nodes or English word nodes are transformed by utilizing a full connection layer;
inputting the multi-granularity text embedding of each character in the Chinese sentence CS to a full connection layer, and converting the embedding dimension of the multi-granularity text embedding of the Chinese sentence CS; the reason for converting dimensions is: the multi-granularity text embedding dimension needs to be consistent with the node vector input dimension of the graph attention network used in the step 3.2;
similarly, for multi-granularity text embedding of each word in the English sentence ES, inputting the multi-granularity text embedding into the full connection layer, and converting the embedding dimension of the English multi-granularity text embedding;
in the fully-connected layer, firstly, the dimension is converted through a linear layer; then, the dropout method is used to prevent overfitting; finally, the activation function Relu is applied to mitigate vanishing gradients;
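The dimension-converting fully connected layer of step 3.1 (linear layer, dropout, then Relu) can be sketched with NumPy; the inverted-dropout scaling and all shapes are assumptions, not details fixed by the claim:

```python
import numpy as np

rng = np.random.default_rng(0)

def fc_transform(x, W, b, dropout_p=0.1, train=True):
    """x: (n, d_in) multi-granularity embeddings; W: (d_in, d_out); b: (d_out,).
    Returns (n, d_out) node vectors matching the GAT input dimension."""
    h = x @ W + b                        # linear layer: convert dimensions
    if train and dropout_p > 0.0:
        # inverted dropout (an assumed variant) to prevent overfitting
        mask = rng.random(h.shape) >= dropout_p
        h = h * mask / (1.0 - dropout_p)
    return np.maximum(h, 0.0)            # Relu activation
```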
step 3.2: for the Chinese medical text, the attention weight of each Chinese character node of the graph attention network is multiplied by the mode weight of the corresponding character in the Chinese sentence; for the English medical text, the mode weight of each English word node in the graph attention network is set to 1;
for the Chinese medical text, the node embedding of the graph attention network is the character embedding generated in step 3.1; for the English medical text, the node embedding of the graph attention network is the word embedding generated in step 3.1;
step 3.2.1: calculating attention weights of nodes in the graph attention network;
step 3.2.2: updating attention weights of nodes in the graph attention network;
step 4: generating entity category labels of the medical text by adopting a conditional random field and outputting the medical entity recognition result, specifically: generating entity category labels of Chinese characters or English words;
based on the conditional random field, the conditional probability of each character belonging to each entity category label is calculated, the label with the highest probability is assigned to the corresponding character as its entity category label, and the medical entity recognition result is output;
the conditional random field performs sequence labeling on the sentences of the medical text, generating entity category labels of Chinese characters or English words and outputting the medical text entity recognition result;
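The claim assigns each character the highest-probability entity label under a conditional random field; in a linear-chain CRF this joint assignment is commonly computed with Viterbi decoding. The sketch below assumes learned emission and transition score tables, which the claim does not specify:

```python
# Hedged sketch of CRF label decoding (step 4). Emission and transition
# scores are assumed to come from a trained model; the tables here are toys.

def viterbi(emissions, transitions):
    """emissions: per-position dict {label: score};
    transitions: dict {(prev_label, cur_label): score}.
    Returns the highest-scoring label sequence."""
    labels = list(emissions[0])
    best = dict(emissions[0])            # best path score ending in each label
    back = []                            # back-pointers for path recovery
    for em in emissions[1:]:
        new_best, ptr = {}, {}
        for cur in labels:
            prev = max(labels, key=lambda p: best[p] + transitions[(p, cur)])
            new_best[cur] = best[prev] + transitions[(prev, cur)] + em[cur]
            ptr[cur] = prev
        back.append(ptr)
        best = new_best
    last = max(best, key=best.get)       # best final label
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]
```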
2. The method for recognizing medical entities based on multi-granularity text embedding according to claim 1, wherein: the MC-Bert in step 1.1.1 is a pre-trained model generated by training on Chinese medical data; the symbol embedding in step 1.1.1 refers to the vector representation of each word; the segmentation embedding is used for distinguishing two natural language sentences, and since the medical entity recognition task recognizes entities in units of sentences, every word has the same segmentation embedding; in the covering embedding, if the current position is a character of the input sentence, the value is 1; if the current position is padding, i.e. not a character of the input sentence, the value is 0;
3. The method for recognizing medical entities based on multi-granularity text embedding according to claim 2, wherein: in step 1.1.1, for the Chinese sentence CS = (cc_1, cc_2, ..., cc_n), where cc_1, cc_2, ..., cc_n are characters, the pre-trained language model MC-Bert is adopted to generate the character embedding representation E_cc of the sentence CS, as shown in (1):
E_cc = (e_cc1, e_cc2, ..., e_ccn), E_cc ∈ R^(n×m)   (1)
wherein n is the sentence length, i.e. the number of characters in the sentence; if the number of characters is less than n, 0 is used for padding; m is the dimension set by the pre-trained model MC-Bert; e_cci (i = 1, 2, ..., n) is the embedding of character cc_i; the character embedding representation E_cc has n rows and m columns; R^(n×m) denotes a real matrix of n rows and m columns;
4. The method for recognizing medical entities based on multi-granularity text embedding according to claim 3, wherein in step 1.1.2, for the Chinese sentence CS = (cc_1, cc_2, ..., cc_n) with characters cc_1, cc_2, ..., cc_n, word embedding is obtained as follows: the words of the Chinese medical text are obtained with the jieba word segmentation tool, and the word embedding of the word each character belongs to is obtained using word2vec; part-of-speech embedding, specifically: based on the part-of-speech tags of the words of the Chinese medical text, the part-of-speech embedding of the word each character belongs to is obtained using word2vec; phrase embedding, specifically: based on the phrase tags of the Chinese medical text, the embedding of the phrase type each character belongs to is obtained using word2vec;
wherein the word embedding is E_cw = (e_cw1, e_cw2, ..., e_cwn), where the words that characters cc_1, cc_2, ..., cc_n belong to are cw_1, cw_2, ..., cw_n in sequence, and e_cwi (i = 1, 2, ..., n) is the embedding of word cw_i;
the part-of-speech embedding is E_cpos = (e_cpo1, e_cpo2, ..., e_cpon), where the parts of speech of words cw_1, cw_2, ..., cw_n are cpo_1, cpo_2, ..., cpo_n, and e_cpoi (i = 1, 2, ..., n) is the embedding of part of speech cpo_i;
the phrase embedding is E_cph = (e_cph1, e_cph2, ..., e_cphn), where the phrase types that characters cc_1, cc_2, ..., cc_n belong to are cph_1, cph_2, ..., cph_n in sequence, and e_cphi (i = 1, 2, ..., n) is the embedding of phrase type cph_i;
5. The method for recognizing medical entities based on multi-granularity text embedding according to claim 4, wherein: after the word2vec tool is used in step 1.1.3 to generate the embedding representations of all substrings, for each word cw_i (i = 1, 2, ..., n), judge whether the word cw_i contains a substring in the medical term substring set; for the Chinese sentence CS = (cc_1, cc_2, ..., cc_n) with characters cc_1, cc_2, ..., cc_n, the words the characters belong to are cw_1, cw_2, ..., cw_n in sequence; suppose the word cw_i contains substrings csubs_1, csubs_2, ..., csubs_p of the medical term substring set, and the embeddings of the substrings csubs_1, csubs_2, ..., csubs_p are denoted e_cs1, e_cs2, ..., e_csp respectively; then the substring embedding representation e_cssi of the word cw_i is: the sum of e_cs1, e_cs2, ..., e_csp divided by the number p of substrings; if the word cw_i contains no substring of the medical term substring set, a custom value is output;
finally, for each character cc_i (i = 1, 2, ..., n) in the sentence CS, its corresponding substring embedding is generated as per the above steps; the substring embedding of the sentence CS is E_css = (e_css1, e_css2, ..., e_cssn);
6. The method for recognizing medical entities based on multi-granularity text embedding according to claim 5, wherein step 1.1.4 specifically comprises:
for the Chinese sentence CS = (cc_1, cc_2, ..., cc_n) with characters cc_1, cc_2, ..., cc_n, constructing the multi-granularity text embedding of character cc_i (i = 1, 2, ..., n), i.e. splicing the character embedding E_cc, word embedding E_cw, part-of-speech embedding E_cpos, phrase embedding E_cph and substring embedding E_css of the sentence CS to construct the multi-granularity text embedding of the sentence CS, as shown in (2):
E_cme = Concate(E_cc, E_cw, E_cpos, E_cph, E_css)   (2)
wherein Concate denotes the splicing operation; the dimension of E_cme equals the sum of the dimensions of E_cc, E_cw, E_cpos, E_cph and E_css;
thus, from step 1.1.1 to step 1.1.4, the multi-granularity text embedding E_cme of the Chinese character cc_i (i = 1, 2, ..., n) is constructed;
7. The method of claim 6, wherein in step 1.2.1, for the English sentence ES = (ew_1, ew_2, ..., ew_n) with words ew_1, ew_2, ..., ew_n, the word embedding representation of the sentence ES is generated using the pre-trained language model BioBert, as shown in (3):
E_ew = (e_ew1, e_ew2, ..., e_ewn), E_ew ∈ R^(n×m)   (3)
wherein n is the sentence length, i.e. the number of words in the sentence; if the number of words is less than n, 0 is used for padding; m is the dimension set by the pre-trained model BioBert; e_ewi (i = 1, 2, ..., n) is the embedding of word ew_i;
in step 1.2.2, for the English sentence ES = (ew_1, ew_2, ..., ew_n) with words ew_1, ew_2, ..., ew_n, the character embedding of ew_i (i = 1, 2, ..., n) is formed by averaging the embeddings of all characters of ew_i; the part-of-speech embedding is the embedding of the part of speech of ew_i; the phrase embedding is the embedding of the phrase type ew_i belongs to;
wherein the character embedding is E_ec = (e_ec1, e_ec2, ..., e_ecn), where e_eci (i = 1, 2, ..., n) is the character embedding of word ew_i;
the part-of-speech embedding is E_epos = (e_epo1, e_epo2, ..., e_epon), where the parts of speech of words ew_1, ew_2, ..., ew_n are epo_1, epo_2, ..., epo_n in sequence, and e_epoi (i = 1, 2, ..., n) is the embedding of part of speech epo_i;
the phrase embedding is E_eph = (e_eph1, e_eph2, ..., e_ephn), where the phrase types that words ew_1, ew_2, ..., ew_n belong to are eph_1, eph_2, ..., eph_n in sequence, and e_ephi (i = 1, 2, ..., n) is the embedding of phrase type eph_i;
8. The method of claim 7, wherein step 1.2.4 specifically comprises:
for the English sentence ES = (ew_1, ew_2, ..., ew_n) with words ew_1, ew_2, ..., ew_n, constructing the multi-granularity text embedding of ew_i (i = 1, 2, ..., n), i.e. splicing the character embedding E_ec, word embedding E_ew, part-of-speech embedding E_epos, phrase embedding E_eph and substring embedding E_ess of the sentence ES to construct the multi-granularity text embedding of the sentence ES, as shown in (4):
E_eme = Concate(E_ec, E_ew, E_epos, E_eph, E_ess)   (4)
wherein Concate denotes the splicing operation; the dimension of E_eme equals the sum of the dimensions of E_ec, E_ew, E_epos, E_eph and E_ess;
so far, from step 1.2.1 to step 1.2.4, the multi-granularity text embedding E_eme of the English word ew_i (i = 1, 2, ..., n) is constructed;
9. The method for recognizing medical entities based on multi-granularity text embedding according to claim 8, wherein in step 2.2, for the Chinese sentence CS = (cc_1, cc_2, ..., cc_n) with characters cc_1, cc_2, ..., cc_n, it is judged whether the Chinese sentence CS matches a medical entity composition mode, and the pattern matching weight vector (w_1, w_2, ..., w_n) is constructed as follows:
case 1: if the character string cc_i, cc_i+1, ..., cc_j satisfies the mode "anatomical region", "disease" or "medicine", each character cc_i, cc_i+1, ..., cc_j is given a mode weight of 2;
case 2: if the character string cc_i, cc_i+1, ..., cc_j satisfies another mode, each character cc_i, cc_i+1, ..., cc_j is given a mode weight of 1.5;
case 3: if the character string cc_i, cc_i+1, ..., cc_j satisfies no mode, each character cc_i, cc_i+1, ..., cc_j is given a mode weight of 1;
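The three cases above can be sketched as a weight-vector constructor; the span representation and the choice to keep the strongest weight when matches overlap are assumptions, since the claim does not state how overlapping mode matches are resolved:

```python
# Mode-weight assignment of step 2.2 (cases 1-3). Match spans are assumed
# to come from a separate pattern matcher; overlap handling via max() is
# an assumption.

HIGH = {"anatomical region", "disease", "medicine"}

def mode_weights(n, matches):
    """n: sentence length; matches: list of (start, end, mode_name),
    end exclusive. Returns the weight vector (w_1, ..., w_n)."""
    w = [1.0] * n                            # case 3: character in no mode
    for start, end, name in matches:
        value = 2.0 if name in HIGH else 1.5 # case 1 vs case 2
        for k in range(start, end):
            w[k] = max(w[k], value)
    return w
```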
step 3.2.1, specifically:
first, the multi-granularity text embedding h of a sentence is input into the attention layer of the graph attention network, as shown in (5):
h = (h_1, h_2, ..., h_M), h ∈ R^(M×H)   (5)
wherein h denotes the embeddings of the Chinese character nodes or English word nodes of one sentence of the medical text, M is the number of nodes, and H is the dimension of the node embedding;
the input node embedding is linearly transformed and converted to the dimension of the number of all category labels; then the attention weight is calculated with the LeakyRelu function, i.e. the importance degree e_uv of node v to node u is calculated, as shown in (6):
e_uv = LeakyRelu(a^T [W_1 h_u ∥ W_1 h_v])   (6)
wherein W_1 represents a shared weight matrix, a represents a learnable attention vector, ∥ represents splicing, h_u represents the embedding of the Chinese character node or English word node u of one sentence of the medical text, and h_v represents the embedding of the Chinese character node or English word node v of one sentence of the medical text;
then, e_uv is normalized with the Softmax function to obtain α_uv, as shown in (7):
α_uv = exp(e_uv) / Σ_{k∈N_u} exp(e_uk)   (7)
wherein e_uk represents the importance degree of node k to node u, α_uv denotes the normalized value of e_uv, and N_u denotes the neighbor nodes of node u;
finally, the attention weight α_u of node u is generated, as shown in (8):
α_u = W_2 (α_u1, α_u2, ..., α_uM)   (8)
wherein W_2 represents a weight matrix;
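Since the claim's equations reference figures not reproduced in this text, the score-and-normalize computation of step 3.2.1 is sketched below in the standard graph-attention form: a shared linear transform, a LeakyRelu-scored pairwise importance e_uv, and a Softmax over each node's neighbours. The attention vector a and the explicit adjacency matrix are assumptions:

```python
import numpy as np

def leaky_relu(x, slope=0.01):
    return np.where(x > 0, x, slope * x)

def attention_matrix(h, W1, a, adj):
    """h: (M, H) node embeddings; W1: (H, F) shared weight matrix;
    a: (2F,) attention vector (assumed); adj: (M, M) boolean adjacency.
    Returns the row-normalized attention weights alpha (M, M)."""
    z = h @ W1                                   # shared linear transform
    M = z.shape[0]
    e = np.full((M, M), -np.inf)                 # e_uv: importance of v to u
    for u in range(M):
        for v in range(M):
            if adj[u, v]:
                e[u, v] = leaky_relu(a @ np.concatenate([z[u], z[v]]))
    # Softmax over each node's neighbourhood
    exp = np.exp(e - e.max(axis=1, keepdims=True))
    exp[~adj] = 0.0
    return exp / exp.sum(axis=1, keepdims=True)
```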
step 3.2.2, specifically:
first, for node u, the attention weight α_u of node u is updated with the mode weight w_u of the Chinese character or English word represented by node u, as shown in (9):
α_u = α_u × w_u   (9)
second, the attention weight attention_l (1 ≤ l ≤ k) of the sentence is constructed, as shown in (10):
attention_l = (α_1, α_2, ..., α_M)   (10)
then, a multi-head attention mechanism is introduced into the graph attention network, specifically: k attention weights are calculated, each attention weight is multiplied by the input h, and the feature h'_l of the sentence is generated, as shown in (11):
h'_l = attention_l × h   (11)
the single-head outputs elu(h'_1), elu(h'_2), ..., elu(h'_k) are generated through the activation function elu;
third, the outputs of the k heads are spliced to generate h', as shown in (12):
h' = Concat(elu(h'_1), elu(h'_2), ..., elu(h'_k))   (12)
finally, the final output h_final is generated through the log_softmax function, as shown in (13):
h_final = log_softmax(h')   (13)
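The multi-head update of step 3.2.2 (k heads, elu activation, splicing, log_softmax) can be sketched as follows; the head weights are assumed to be the mode-weight-updated attention matrices, and all shapes are illustrative:

```python
import numpy as np

def elu(x, alpha=1.0):
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

def log_softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def multi_head_output(attentions, h):
    """attentions: list of k (M, M) attention matrices, already scaled by
    the mode weights; h: (M, H) node embeddings. Returns h_final (M, k*H)."""
    heads = [elu(att @ h) for att in attentions]  # h'_l = attention_l x h, then elu
    h_cat = np.concatenate(heads, axis=-1)        # splice the k head outputs
    return log_softmax(h_cat)                     # final output h_final
```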
CN202110890112.9A 2021-06-09 2021-08-04 Medical entity identification method based on multi-granularity text embedding Active CN113779993B (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110641595 2021-06-09
CN2021106415959 2021-06-09

Publications (2)

Publication Number Publication Date
CN113779993A CN113779993A (en) 2021-12-10
CN113779993B true CN113779993B (en) 2023-02-28



Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111079377A (en) * 2019-12-03 2020-04-28 哈尔滨工程大学 Method for recognizing named entities oriented to Chinese medical texts
CN111444721A (en) * 2020-05-27 2020-07-24 南京大学 Chinese text key information extraction method based on pre-training language model
CN112101031A (en) * 2020-08-25 2020-12-18 厦门渊亭信息科技有限公司 Entity identification method, terminal equipment and storage medium
CN112541356A (en) * 2020-12-21 2021-03-23 山东师范大学 Method and system for recognizing biomedical named entities
CN112597774A (en) * 2020-12-14 2021-04-02 山东师范大学 Chinese medical named entity recognition method, system, storage medium and equipment



Non-Patent Citations (1)

Title
Ma Nianzu et al.; "Entity-Aware Dependency-based Deep Graph Attention Network for Comparative Preference Classification"; NSF Public Access; 2020-07-01; pp. 5782-5788 *



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant