CN113779993A - Medical entity identification method based on multi-granularity text embedding - Google Patents

Medical entity identification method based on multi-granularity text embedding

Info

Publication number
CN113779993A
CN113779993A CN202110890112.9A
Authority
CN
China
Prior art keywords
embedding
medical
word
character
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110890112.9A
Other languages
Chinese (zh)
Other versions
CN113779993B (en)
Inventor
道捷
张春霞
彭成
薛晓军
王瞳
徐天祥
郭贵锁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Institute of Technology BIT
Original Assignee
Beijing Institute of Technology BIT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Institute of Technology BIT
Publication of CN113779993A
Application granted
Publication of CN113779993B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/242 Dictionaries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Data Mining & Analysis (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Evolutionary Computation (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to a medical entity identification method based on multi-granularity text embedding, belonging to the technical field of information extraction and knowledge graph construction. The method comprises the following steps: constructing multi-granularity text embedding: multi-granularity text embedding, comprising character embedding, word embedding, part-of-speech embedding, substring embedding, and phrase embedding, is constructed through pre-trained language models; generating pattern weights: the pattern weights of all characters in a Chinese sentence are generated according to medical term composition patterns; node embedding representation learning: node embedding representation learning is carried out with a graph attention network and a pattern-enhanced attention mechanism; outputting the medical text entity recognition result: entity category labels of the medical text are generated by a conditional random field, and the medical entity recognition result is output. The method solves the problems of insufficient utilization of graph representation information and the single embedding granularity of distributed text representation in medical entity identification, and improves medical entity identification performance.

Description

Medical entity identification method based on multi-granularity text embedding
Technical Field
The invention relates to a medical entity identification method based on multi-granularity text embedding, and belongs to the technical field of information extraction and knowledge graph construction.
Background
Medical entity identification is an important research topic in the fields of information extraction and medical knowledge graph construction. Medical entity recognition refers to recognizing entities or terms of the medical field from unstructured medical texts. Medical entity recognition technology can provide technical and knowledge support for medical-domain question-answering systems, medical auxiliary diagnosis, accurate medical knowledge services, and other fields.
Medical entity identification methods mainly comprise rule-based methods, statistical machine learning-based methods, and deep learning-based methods. The basic idea of the rule-based medical entity identification method is to identify medical entities from unstructured texts according to constructed medical entity composition rules, whose constituent elements comprise keywords, word categories, and the like.
Medical entity recognition methods based on statistical machine learning mainly adopt models such as the maximum entropy model, the hidden Markov model, the conditional random field, and the support vector machine. These methods convert medical entity identification into a classification problem or a sequence labeling problem. For example, a method combining conditional random fields and rules has been used for identifying named entities in Chinese electronic medical records: first, a conditional random field performs recognition based on language symbol features, suffix features, keyword features, dictionary features, and length features; then, the recognition result is optimized with the rules.
Deep learning-based medical entity recognition methods comprise distributed representation (embedded encoding) of the unstructured input text, context semantic encoding, and tag decoding. The embedded encoding of the input text mainly comprises character embedding and word embedding. Context semantic encoding models comprise the convolutional neural network, the bidirectional long short-term memory network, the recurrent neural network, and the like. For example, one approach performs medical entity recognition of Chinese electronic medical records based on a bidirectional long short-term memory network and a conditional random field model: first, a low-dimensional vector representation of each word is generated; then, the bidirectional long short-term memory network with an attention mechanism and the conditional random field model perform the medical entity recognition.
The graph attention network is based on the graph convolutional neural network, into which an attention mechanism is introduced. Graph attention networks have been applied to answer extraction for question-answering systems, information recommendation, relation extraction, and the like.
The existing medical entity identification methods mainly have the following problems. First, existing methods mainly construct character embedding, word embedding, and part-of-speech embedding of texts, and rarely introduce phrase embedding and substring embedding. Second, current methods rarely model medical texts through a graph attention network for entity recognition. Third, current methods rarely fuse pattern- or rule-based methods with deep learning-based methods so as to fully and efficiently integrate the advantages of both: pattern- or rule-based methods achieve high performance, while deep learning-based methods need no time-consuming and labor-intensive feature engineering and realize end-to-end nonlinear learning.
Disclosure of Invention
The invention aims to solve the problems of insufficient utilization of graph representation information and the single embedding granularity of distributed text representation in medical entity recognition, and provides a medical entity recognition method based on multi-granularity text embedding. The method first constructs multi-granularity text embedding, comprising character embedding, word embedding, part-of-speech embedding, substring embedding, and phrase embedding, realizing multi-granularity embedded representation learning of the characters, words, parts of speech, phrases, and substrings of a medical text. It then performs medical text entity recognition with a graph attention network, a pattern-enhanced attention mechanism, and a conditional random field, specifically: first, the graph embedded representation of the medical text is constructed with a graph attention network model; second, a pattern-enhanced attention mechanism is introduced into the graph attention network to strengthen the attention weights of the nodes, thereby improving medical entity identification performance.
In order to achieve the purpose, the invention adopts the following technical scheme:
the medical entity identification method based on multi-granularity text embedding comprises the following steps:
step 1: construct the multi-granularity text embedding through pre-trained language models, comprising the following steps:
step 1.1: for unstructured Chinese medical texts, constructing multi-granularity text embedding;
step 1.1.1: learn token embedding, segment embedding, and mask embedding with the pre-trained language model MC-Bert, and generate the character embedding of the unstructured Chinese medical text;
MC-Bert is a pre-training model generated according to Chinese medical data training;
for an unstructured Chinese medical text, the input of the pre-trained language model MC-Bert consists of three kinds of embedding: token embedding, segment embedding, and mask embedding;
token embedding refers to the vector representation of each character; segment embedding distinguishes two natural language sentences, and since the medical entity recognition task recognizes entities in sentence units, every character has the same segment embedding; in mask embedding, a position holding a character of the input sentence is assigned 1, and a padding position that holds no character of the input sentence is assigned 0;
for a Chinese sentence CS = (cc_1, cc_2, ..., cc_n), where cc_1, cc_2, ..., cc_n are characters, the pre-trained language model MC-Bert generates the character embedding representation E_cc of the sentence CS, as shown in (1):
E_cc = (e_cc1, e_cc2, ..., e_ccn), E_cc ∈ R^(n×m)  (1)
where n is the sentence length, i.e. the number of characters in the sentence (sentences with fewer than n characters are padded with 0); m is the dimension set by the pre-trained model MC-Bert; e_cci (i = 1, 2, ..., n) is the embedding of character cc_i; the character embedding representation E_cc has n rows and m columns, and R^(n×m) denotes the set of real matrices with n rows and m columns;
step 1.1.2: generating word embedding, part of speech embedding and phrase embedding of the Chinese medical text;
firstly, for a Chinese medical text, the words are obtained with the jieba word segmentation tool, the part-of-speech tags of the words are obtained with the Stanford POS tagger, and the phrase labels are obtained with the Stanford parser;
then, word embedding, part-of-speech embedding, and phrase embedding are generated with the word2vec tool;
for the Chinese sentence CS = (cc_1, cc_2, ..., cc_n), where cc_1, cc_2, ..., cc_n are characters: word embedding is obtained by segmenting the Chinese medical text with jieba and generating the embedding of each character's parent word with word2vec; part-of-speech embedding is obtained by generating, with word2vec, the embedding of the part-of-speech tag of each character's parent word; phrase embedding is obtained by generating, with word2vec, the embedding of each character's parent phrase type from the phrase labels of the Chinese medical text;
the word embedding is E_cw = (e_cw1, e_cw2, ..., e_cwn), where the parent words of characters cc_1, cc_2, ..., cc_n are cw_1, cw_2, ..., cw_n in turn, and e_cwi (i = 1, 2, ..., n) is the embedding of word cw_i;
the part-of-speech embedding is E_cpos = (e_cpo1, e_cpo2, ..., e_cpon), where the parts of speech of words cw_1, cw_2, ..., cw_n are cpo_1, cpo_2, ..., cpo_n, and e_cpoi (i = 1, 2, ..., n) is the embedding of part of speech cpo_i;
the phrase embedding is E_cph = (e_cph1, e_cph2, ..., e_cphn), where the parent phrase types of characters cc_1, cc_2, ..., cc_n are cph_1, cph_2, ..., cph_n in turn, and e_cphi (i = 1, 2, ..., n) is the embedding of phrase type cph_i;
step 1.1.3: for an unstructured Chinese medical text, generating substring embedding by using a word2vec tool, specifically:
firstly, collect a medical term dictionary and construct the medical term substring set, specifically: for any two terms, extract their longest common substring and add it to the medical term substring set; if two terms have several longest common substrings of the same length, take the first one and add it to the medical term substring set;
secondly, generate the embedded representations of all substrings in the medical term substring set with the word2vec tool;
then, for each word cw_i (i = 1, 2, ..., n), judge whether cw_i contains substrings of the medical term substring set; for the Chinese sentence CS = (cc_1, cc_2, ..., cc_n), whose characters cc_1, cc_2, ..., cc_n have parent words cw_1, cw_2, ..., cw_n in turn, suppose word cw_i contains the substrings csubs_1, csubs_2, ..., csubs_p of the medical term substring set, whose embeddings are ecs_1, ecs_2, ..., ecs_p respectively; then the substring embedding e_cssi of word cw_i is the sum of ecs_1, ecs_2, ..., ecs_p divided by the number of substrings p; if word cw_i contains no substring of the medical term substring set, output a custom value;
finally, for each character cc_i (i = 1, 2, ..., n) in the sentence CS, generate the corresponding substring embedding as above; the substring embedding of sentence CS is E_css = (e_css1, e_css2, ..., e_cssn);
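The substring-set construction and substring-embedding averaging above can be sketched in plain Python. The toy terms and two-dimensional embedding vectors below are illustrative stand-ins, not the patent's medical dictionary or word2vec vectors:

```python
from itertools import combinations

def longest_common_substring(a, b):
    # Dynamic programming; returns the first longest common substring of
    # terms a and b (ties are broken by taking the first one found).
    best, best_len = "", 0
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
                if dp[i][j] > best_len:  # strict '>' keeps the first tie
                    best_len = dp[i][j]
                    best = a[i - best_len:i]
    return best

def build_substring_set(terms):
    # Medical term substring set: the longest common substring of every
    # pair of terms in the dictionary.
    subs = set()
    for a, b in combinations(terms, 2):
        s = longest_common_substring(a, b)
        if s:
            subs.add(s)
    return subs

def substring_embedding(word, subs, emb, default):
    # Average the embeddings of all set substrings contained in `word`;
    # output a custom default vector when the word contains none.
    hits = [emb[s] for s in subs if s in word and s in emb]
    if not hits:
        return default
    dim = len(hits[0])
    return [sum(v[d] for v in hits) / len(hits) for d in range(dim)]
```

In practice the dictionary would hold thousands of terms and the embeddings would come from word2vec; the averaging step is exactly the sum-divided-by-p rule of step 1.1.3.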
Step 1.1.4: character embedding, word embedding, part of speech embedding, phrase embedding and substring embedding of the Chinese medical text are spliced to construct multi-granularity text embedding, and the method specifically comprises the following steps:
for the Chinese sentence CS, CS ═ cc1,cc2,...,ccn),cc1,cc2,...,ccnFor a character, construct the character cciMulti-granularity text embedding of (i ═ 1, 2.. said., n), i.e. character embedding E of the spliced sentence CSccWord embedding EcwPart of speech embedding EcposPhrase embedding EcphAnd substring embedding EcssConstructing multi-granularity text embedding of the sentence CS, as shown in (2):
Ecme=Concate(Ecc,Ecw,Ecpos,Ecph,Ecss) (2)
wherein, Concate represents splicing operation; in addition, EcmeE isccDimension + E ofcwDimension + E ofcposDimension + E ofcphDimension + E ofcssDimension (d);
thus, from step 1.1.1 to step 1.1.4, the Chinese character cc is constructediA multi-granularity text embedding of (i ═ 1, 2.., n) as Ecme
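The Concate operation of (2) reduces to joining the five per-character vectors end to end; a minimal plain-Python sketch with toy dimensions (not the real MC-Bert/word2vec dimensions):

```python
def concat_multi_granularity(e_cc, e_cw, e_cpos, e_cph, e_css):
    # Concate in (2): per-character concatenation of the five embeddings;
    # the output dimension is the sum of the five input dimensions.
    return [c + w + pos + ph + ss
            for c, w, pos, ph, ss in zip(e_cc, e_cw, e_cpos, e_cph, e_css)]
```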
Step 1.2: for unstructured English medical text, constructing multi-granularity text embedding, comprising the following steps:
step 1.2.1: learn token embedding, segment embedding, and mask embedding with the pre-trained language model BioBert, and generate the word embedding of the unstructured English medical text;
BioBert is a pre-training model generated by training according to English medical data;
for an English sentence ES = (ew_1, ew_2, ..., ew_n), where ew_1, ew_2, ..., ew_n are words, generate the word embedding representation of the sentence ES with the pre-trained language model BioBert, as shown in (3):
E_ew = (e_ew1, e_ew2, ..., e_ewn), E_ew ∈ R^(n×m)  (3)
where n is the sentence length, i.e. the number of words in the sentence (sentences with fewer than n words are padded with 0); m is the dimension set by the pre-trained model BioBert; e_ewi (i = 1, 2, ..., n) is the embedding of word ew_i;
step 1.2.2: generating character embedding, part of speech embedding and phrase embedding of English medical texts;
for English medical texts, character embedding, part-of-speech embedding and phrase embedding are generated by using a word2vec tool;
for the English sentence ES = (ew_1, ew_2, ..., ew_n), where ew_1, ew_2, ..., ew_n are words: the character embedding of word ew_i (i = 1, 2, ..., n) is the average of the embeddings of all characters in ew_i; the part-of-speech embedding is the embedding of ew_i's part of speech; the phrase embedding is the embedding of the parent phrase type;
the character embedding is E_ec = (e_ec1, e_ec2, ..., e_ecn), where e_eci (i = 1, 2, ..., n) is the character embedding of word ew_i;
the part-of-speech embedding is E_epos = (e_epo1, e_epo2, ..., e_epon), where the parts of speech of words ew_1, ew_2, ..., ew_n are epo_1, epo_2, ..., epo_n in turn, and e_epoi (i = 1, 2, ..., n) is the embedding of part of speech epo_i;
the phrase embedding is E_eph = (e_eph1, e_eph2, ..., e_ephn), where the parent phrase types of words ew_1, ew_2, ..., ew_n are eph_1, eph_2, ..., eph_n in turn, and e_ephi (i = 1, 2, ..., n) is the embedding of phrase type eph_i;
step 1.2.3: for unstructured English medical texts, generating substring embedding by using a word2vec tool;
firstly, collect an English medical term dictionary and construct the medical term substring set: for any two terms, extract their longest common substring and add it to the medical term substring set; if two terms have several longest common substrings of the same length, take the first one and add it to the medical term substring set;
secondly, generate the embedded representations of all the substrings with the word2vec tool;
then, for the English sentence ES = (ew_1, ew_2, ..., ew_n), where ew_1, ew_2, ..., ew_n are words, judge for each word ew_i (i = 1, 2, ..., n) whether ew_i contains substrings of the medical term substring set; suppose word ew_i contains the substrings esubs_1, esubs_2, ..., esubs_q of the medical term substring set, whose embeddings are ees_1, ees_2, ..., ees_q respectively; then the substring embedding e_essi of word ew_i is the sum of ees_1, ees_2, ..., ees_q divided by the number of substrings q; if word ew_i contains no substring of the medical term substring set, output a custom value;
finally, for each word ew_i (i = 1, 2, ..., n) in the sentence ES, generate the corresponding substring embedding as above; the substring embedding of sentence ES is E_ess = (e_ess1, e_ess2, ..., e_essn);
Step 1.2.4: the method comprises the following steps of splicing character embedding, word embedding, part of speech embedding, phrase embedding and substring embedding of the English medical text to construct multi-granularity text embedding, and specifically comprises the following steps:
for English sentence ES, ES ═ ew1,ew2,...,ewn),ew1,ew2,...,ewnConstruct ew for wordsiMulti-granularity text embedding of (i ═ 1, 2.. multideck, n), i.e. character embedding E of the concatenated sentence ESecWord embedding EewPart of speech embedding EeposPhrase embedding EephAnd substring embedding EessConstructing multi-granularity text embedding of the sentence ES, as shown in (4);
Eeme=Concate(Eec,Eew,Eepos,Eeph,Eess) (4)
wherein, Concate represents splicing operation; in addition, EemeE isecDimension + E ofewDimension + E-eposDimension + E ofephDimension +2EessDimension (d);
so far, from step 1.2.1 to step 1.2.4, the English word ew is constructediA multi-granularity text embedding of (i ═ 1, 2.., n) as Eeme
Step 2: generate the pattern weights of all characters in the Chinese sentence according to the medical entity composition patterns, comprising the following steps:
step 2.1: construct the Chinese medical entity composition patterns;
a medical entity composition pattern has the form "Y_1 + Y_2 + Y_3 + ... + Y_k",
where Y_1, Y_2, Y_3, ..., Y_k denote word categories, and "+" denotes the concatenation of character strings;
the word categories comprise negative words, clinical manifestations, anatomical parts, modifiers, disease names, physical examinations, numerical values, quantifiers, and drugs;
step 2.2: generate the pattern weights of the characters in the Chinese sentence;
for the Chinese sentence CS = (cc_1, cc_2, ..., cc_n), where cc_1, cc_2, ..., cc_n are characters, judge whether the Chinese sentence CS matches the medical entity composition patterns; the constructed pattern matching weight vector is (w_1, w_2, ..., w_n):
case 1: if the character string cc_i, cc_i+1, ..., cc_j matches the pattern "anatomical part", "disease", or "drug", give each of the characters cc_i, cc_i+1, ..., cc_j a pattern weight of 2;
case 2: if the character string cc_i, cc_i+1, ..., cc_j matches another pattern, give each of the characters cc_i, cc_i+1, ..., cc_j a pattern weight of 1.5;
case 3: if the character string cc_i, cc_i+1, ..., cc_j matches no pattern, give each of the characters cc_i, cc_i+1, ..., cc_j a pattern weight of 1;
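The three weighting cases of step 2.2 can be sketched as follows. The pattern tables here are hypothetical single-string stand-ins: the patent's real patterns are category sequences such as "anatomical part + clinical manifestation", not literal strings:

```python
import re

# Hypothetical pattern tables (illustrative stand-ins for the real
# category-sequence patterns of step 2.1).
HIGH_PATTERNS = [re.compile(p) for p in ["胃", "肝癌"]]   # anatomical part / disease / drug
OTHER_PATTERNS = [re.compile(p) for p in ["疼痛"]]        # any other pattern

def pattern_weights(sentence):
    # Cases 1-3 of step 2.2: weight 2 for characters covered by an
    # anatomical-part/disease/drug pattern, 1.5 for other matched
    # patterns, and 1 for unmatched characters.
    w = [1.0] * len(sentence)
    for pats, weight in ((OTHER_PATTERNS, 1.5), (HIGH_PATTERNS, 2.0)):
        for pat in pats:
            for m in pat.finditer(sentence):
                for i in range(m.start(), m.end()):
                    w[i] = max(w[i], weight)
    return w
```

For example, in a sentence where "胃" (stomach) matches an anatomical-part pattern and "疼痛" (pain) matches another pattern, the weight vector mixes 2, 1.5, and 1 per character.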
step 3: perform node embedding representation learning with a graph attention network and a pattern-enhanced attention mechanism, comprising the following steps:
step 3.1: transform the embedding dimensions of the Chinese character nodes or English word nodes with a fully connected layer;
the multi-granularity text embedding of each character in the Chinese sentence CS is input into a fully connected layer, which converts its embedding dimension; the dimension is converted because the multi-granularity text embedding dimension must be consistent with the node vector input dimension of the graph attention network used in step 3.2;
similarly, the multi-granularity text embedding of each word in the English sentence ES is input into the fully connected layer, which converts the embedding dimension of the English multi-granularity text embedding.
In the fully connected layer, the dimension is first converted through a linear layer; then the dropout method is used to prevent overfitting; finally, the activation function ReLU is applied to prevent the gradient from vanishing;
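A NumPy sketch of the fully connected transform of step 3.1 (linear layer, dropout during training, ReLU); the shapes and the inverted-dropout formulation are illustrative assumptions, not the patent's exact hyperparameters:

```python
import numpy as np

rng = np.random.default_rng(0)

def fc_transform(x, w, b, drop_p=0.1, training=False):
    # Linear layer maps the multi-granularity embedding dimension to the
    # graph attention network's input dimension; dropout guards against
    # overfitting at training time; ReLU is the activation.
    y = x @ w + b                      # linear: (n, d_in) @ (d_in, d_out)
    if training:                       # inverted dropout (train time only)
        mask = rng.random(y.shape) >= drop_p
        y = y * mask / (1.0 - drop_p)
    return np.maximum(y, 0.0)          # ReLU
```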
step 3.2: for the Chinese medical text, multiply the attention weight of each Chinese character node in the graph attention network by the pattern weight of the corresponding character in the Chinese sentence; for the English medical text, set the pattern weight of the English word nodes in the graph attention network to 1;
for the Chinese medical text, the node embedding of the graph attention network is the character embedding generated in step 3.1; for the English medical text, the node embedding of the graph attention network is the word embedding generated in step 3.1;
step 3.2.1: calculate the attention weights of the nodes in the graph attention network;
first, the multi-granularity text embedding h of a sentence is input into the attention layer of the graph attention network, where h is the embedding of the Chinese character nodes or English word nodes of one sentence of the medical text, M is the number of nodes, and H is the node embedding dimension, as shown in (5):
h = (h_1, h_2, ..., h_M), h ∈ R^(M×H)  (5)
a linear transformation is applied to the input node embeddings, converting them to the dimension of the number of category labels, and the attention weight is calculated with the LeakyReLU function, i.e. the importance degree e_uv of node v to node u, as shown in (6):
e_uv = LeakyReLU(a^T[W_1·h_u ∥ W_1·h_v])  (6)
where W_1 denotes a shared weight matrix, a denotes the shared attention vector, ∥ denotes concatenation, h_u denotes the embedding of the Chinese character node or English word node u of one sentence of the medical text, and h_v denotes the embedding of the Chinese character node or English word node v of one sentence of the medical text;
then, e_uv is normalized with the Softmax function to obtain α_uv, as shown in (7):
α_uv = exp(e_uv) / Σ_{k∈N_u} exp(e_uk)  (7)
where e_uk denotes the importance degree of node k to node u, α_uv denotes the normalized value of e_uv, and N_u denotes the neighbor nodes of node u;
finally, the attention weight α_u of node u is generated, as shown in (8):
α_u = Σ_{v∈N_u} α_uv·W_2·h_v  (8)
where W_2 denotes a weight matrix;
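Equations (6) to (8) follow the standard graph attention formulation; the original figures for these equations are not legible in this copy, so the sketch below assumes that standard form (dense adjacency matrix, arbitrarily chosen LeakyReLU slope 0.2):

```python
import numpy as np

def gat_attention(h, adj, W1, a, W2):
    # Sketch of (6)-(8): pairwise scores e_uv via LeakyReLU over the
    # concatenated projected node pair, softmax over each node's
    # neighbors (7), then the weighted aggregation with W2 (8).
    # adj[u, v] = 1 marks v as a neighbor of u (self-loops included).
    M = h.shape[0]
    z = h @ W1                        # shared linear projection W1
    e = np.full((M, M), -np.inf)      # -inf drops non-neighbors from the softmax
    for u in range(M):
        for v in range(M):
            if adj[u, v]:
                s = a @ np.concatenate([z[u], z[v]])
                e[u, v] = s if s > 0 else 0.2 * s   # LeakyReLU, slope 0.2
    alpha = np.exp(e - e.max(axis=1, keepdims=True))
    alpha = alpha / alpha.sum(axis=1, keepdims=True)   # (7)
    return alpha @ (h @ W2)           # (8): alpha_u = sum_v alpha_uv W2 h_v
```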
step 3.2.2: update the attention weights of the nodes in the graph attention network;
first, for node u, the attention weight α_u is updated with the pattern weight w_u of the Chinese character or English word represented by node u, as shown in (9):
α_u = α_u × w_u  (9)
second, the attention weight attention_l (1 ≤ l ≤ k) of the sentence is constructed, as shown in (10):
attention_l = (α_1, α_2, ..., α_M)  (10)
then, a multi-head attention mechanism is introduced into the graph attention network, specifically: k attention weights are calculated, and each attention weight is multiplied by the input h to generate the sentence feature h'_l, as shown in (11):
h'_l = attention_l × h  (11)
applying the activation function elu gives the per-head outputs elu(h'_1), elu(h'_2), ..., elu(h'_k);
third, the outputs of the k heads are concatenated to generate h', as shown in (12):
h' = Concat(elu(h'_1), elu(h'_2), ..., elu(h'_k))  (12)
finally, the final output h_final is generated through the log_softmax function, as shown in (13):
h_final = log_softmax(h')  (13)
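A NumPy sketch of (9) through (13). The product in (11) is underspecified in this copy, so attention_l is read here as a per-node scalar weight vector that scales the rows of h; that reading is an assumption:

```python
import numpy as np

def log_softmax(x, axis=-1):
    # Numerically stable log_softmax for (13).
    m = x.max(axis=axis, keepdims=True)
    return x - m - np.log(np.exp(x - m).sum(axis=axis, keepdims=True))

def pattern_enhanced_heads(h, alphas, w_pattern):
    # (9): scale each node's attention weight by its pattern weight;
    # (11): form one weighted feature per head; apply elu;
    # (12): concatenate the k heads; (13): finish with log_softmax.
    heads = []
    for alpha in alphas:                  # one weight vector per head
        a = alpha * w_pattern             # (9): alpha_u <- alpha_u * w_u
        hl = a[:, None] * h               # (11), read as row scaling of h
        heads.append(np.where(hl > 0, hl, np.exp(hl) - 1))  # elu
    h_cat = np.concatenate(heads, axis=1) # (12): Concat over heads
    return log_softmax(h_cat)             # (13)
```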
step 4: generate the entity category labels of the medical text with a conditional random field, and output the medical entity recognition result, specifically: generate the entity category labels of the Chinese characters or English words;
based on the conditional random field, the conditional probability of each character is calculated, i.e. the probability that each character belongs to each entity category label; the label with the highest probability is assigned to the corresponding character as its entity category label, and the medical entity recognition result is then output;
the conditional random field performs sequence labeling on the sentences in the medical text, generates the entity category labels of the Chinese characters or English words, and outputs the medical text entity recognition result.
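At inference time, CRF label decoding amounts to Viterbi search over label sequences; a generic NumPy sketch, where the emission and transition score matrices are assumed inputs rather than the trained model's actual parameters:

```python
import numpy as np

def viterbi_decode(emissions, transitions):
    # emissions[t, y] scores label y for the t-th character/word;
    # transitions[y, y2] scores the move y -> y2.
    # Returns the highest-scoring label sequence (the entity tags).
    T, Y = emissions.shape
    score = emissions[0].copy()
    back = np.zeros((T, Y), dtype=int)
    for t in range(1, T):
        cand = score[:, None] + transitions + emissions[t][None, :]
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0)
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]
```

With a BIO-style tagset, each decoded index would map to a label such as B-disease or O; the tagset itself depends on the dataset.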
Advantageous effects
Compared with the traditional medical entity identification method, the medical entity identification method based on multi-granularity text embedding provided by the invention has the following beneficial effects:
1. The identification method is portable and robust, and is not limited by the source of the corpus; it performs graph representation modeling of the medical text based on a graph attention network, the corpus language is not restricted, and both Chinese texts and English texts can be processed;
2. The method constructs multi-granularity text embedding of the unstructured medical text, comprising character embedding, word embedding, part-of-speech embedding, substring embedding, and phrase embedding; by introducing multi-granularity text embedding, the features of the medical text at the character, word, part-of-speech, phrase, and substring levels are mined, distributed representation learning at the string, lexical, and syntactic levels is realized, the entity feature information of the medical text is enhanced, and the accuracy of medical entity recognition is improved;
3. The method adopts a graph attention network, a pattern-enhanced attention mechanism, and a conditional random field for medical entity identification: first, the graph attention network model realizes graph representation modeling of the medical text and captures the graph structure information between the Chinese characters or English words of the medical text; second, a pattern-enhanced attention mechanism introduces the medical entity composition pattern features into the attention weights of the nodes in the graph attention network, effectively integrating the pattern-based and the deep learning-based medical entity identification methods, making full use of the characteristics and advantages of both, and improving medical entity identification performance;
4. the method can identify the medical entities of the unstructured Chinese medical text and the English medical text, and has wide application prospects in the fields of information retrieval, text classification, question-answering systems and the like.
Drawings
Fig. 1 is a flowchart illustrating a medical entity recognition based on multi-granular text embedding according to an embodiment of the present invention.
Detailed Description
The medical entity recognition system based on the method takes PyCharm as the development tool, Python as the development language, and PyTorch as the development framework.
Preferred embodiments of the method of the present invention will be described in detail with reference to examples.
Examples
This embodiment describes a process of using the method for medical entity recognition based on multi-granularity text embedding according to the present invention, as shown in fig. 1.
Firstly, the pre-trained language models MC-Bert and BioBert generate the character embedding of the Chinese medical text and the word embedding of the English medical text; the word2vec tool generates the word embedding, part-of-speech embedding, phrase embedding, and substring embedding of the Chinese medical text and the character embedding, part-of-speech embedding, phrase embedding, and substring embedding of the English medical text; these embeddings are concatenated to construct the final multi-granularity text embedding of the Chinese and English medical texts. Secondly, the pattern weights of all characters in a Chinese sentence are generated according to the medical entity composition patterns, while the pattern weights of all words in an English sentence are set to 1. Then, node embedding representation learning is carried out with the graph attention network and the pattern-enhanced attention mechanism, and the attention weights of the nodes in the graph attention network are updated with the pattern weights. Finally, the conditional random field predicts the entity label of each character in the Chinese medical text or of each word in the English medical text, and the medical text entity recognition result is output.
Experiments were performed on the CCKS2019 dataset. First, the multi-granularity text embedding of each sentence of medical text in the CCKS2019 dataset is generated; second, the pattern weights of all characters in each sentence of medical text are generated according to the medical entity composition patterns; then, the multi-granularity text embedding and the pattern matching weights are passed into the graph attention network, the pattern matching weights are multiplied with the attention weights of the nodes, and the final embedded representation of the input text is calculated; finally, the conditional random field outputs the final predicted entity recognition labels according to the calculated probabilities. The experimental results prove the effectiveness of the invention.
The method can also be applied to the English medical text dataset NCBI Disease, the biochemical-field dataset BC5CDR, and the like. The flow applied to the NCBI Disease dataset is largely consistent with that for the CCKS2019 dataset, the difference being that when the attention coefficients are calculated in the graph attention network, the pattern weights of the English word nodes are all set to 1. The flow applied to the BC5CDR dataset differs as follows: when constructing the multi-granularity text embedding, a term dictionary for the biochemical field is used to generate the substring embedding, which is concatenated with the character embedding, word embedding, part-of-speech embedding, and phrase embedding, and the concatenated embedding is passed into the graph attention network; when the graph attention network calculates the attention weights, entity composition pattern matching weights for the biochemical field are added, and the result of the graph attention network is passed into the conditional random field; the conditional random field outputs the final predicted entity recognition result according to the probabilities.
As can be seen from fig. 1, the method specifically includes the following steps:
step 1: the method for constructing the multi-granularity text embedding through the pre-training language model comprises the following steps:
step 1.1: for unstructured Chinese medical texts, constructing multi-granularity text embedding;
step 1.1.1: learning of symbol embedding, segmentation embedding and covering embedding is carried out by utilizing a pre-training language model MC-Bert, and character embedding of an unstructured Chinese medical text is generated;
MC-Bert is a pre-trained language model trained on Chinese medical data;
For an unstructured Chinese medical text, the input of the pre-trained language model MC-Bert consists of three kinds of embedding: symbol embedding (Token Embedding), segmentation embedding (Segment Embedding), and covering embedding (Mask Embedding). Symbol embedding is the vector representation of each character. Segmentation embedding distinguishes two natural language sentences; since the medical entity recognition task recognizes entities sentence by sentence, each character has the same segmentation embedding. In the covering embedding, if the current position holds a character of the input sentence, the value is 1; if the current position is a padding position, i.e., does not hold a character of the input sentence, the value is 0;
For a Chinese sentence CS = (cc_1, cc_2, ..., cc_n), where cc_1, cc_2, ..., cc_n are characters, a character embedding representation E_cc of the sentence CS is generated using the pre-trained language model MC-Bert, as shown in (1);
E_cc = (e_cc1, e_cc2, ..., e_ccn), E_cc ∈ R^(n×m)    (1)
where n is the sentence length 512, i.e., the number of characters in the sentence, padded with 0 when there are fewer than 512 characters; m is the dimension 768 set by the pre-trained model MC-Bert; e_cci (i = 1, 2, ..., n) is the 768-dimensional embedding of character cc_i; the character embedding representation E_cc has dimension 512 × 768; R^(n×m) denotes a real matrix of n rows and m columns;
For example, consider the sentence "the patient has yellow skin and sclera, with decreased appetite before 4 months, after dinner, with paroxysmal abdominal pain, nausea, no diarrhea, vomiting, chest distress, suffocating, dizziness, headache, no object rotation, no fever, cough, chest pain, asthma, and light stool." The characters are separated by "\t", and the markers "[CLS]" and "[SEP]" are added at the beginning and end of the sentence. To make the dimensions of the character embedding representations of different sentences consistent, the sentence length is padded to 512 with 0. The character embedding representation of the sentence generated with the pre-trained model MC-Bert is:
E_cc = ( (x_11, x_12, ..., x_1m), (x_21, x_22, ..., x_2m), ..., (x_n1, x_n2, ..., x_nm) )
where n is the sentence length 512 and m is the character embedding dimension 768. The character embedding vector of the character "patient" is (x_11, x_12, ..., x_1m);
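The padding described above can be sketched as follows; the per-character vectors here are toy stand-ins for MC-Bert output, not the actual model:

```python
import numpy as np

MAX_LEN, DIM = 512, 768  # sentence length and MC-Bert embedding dimension

def pad_character_embeddings(char_vectors):
    """Pad (or truncate) a list of per-character vectors to a fixed
    MAX_LEN x DIM matrix, filling missing positions with 0 as in (1)."""
    E = np.zeros((MAX_LEN, DIM))
    for i, v in enumerate(char_vectors[:MAX_LEN]):
        E[i] = v
    return E

# toy stand-in for MC-Bert output: 7 characters, 768-dim each
sentence_vectors = [np.ones(DIM) * k for k in range(7)]
E_cc = pad_character_embeddings(sentence_vectors)
print(E_cc.shape)  # (512, 768)
```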
Step 1.1.2: generating word embedding, part of speech embedding and phrase embedding of the Chinese medical text;
Firstly, for a Chinese medical text, the jieba word segmentation tool is used to obtain its words, the part-of-speech tagger Stanford POS Tagger is used to obtain the part-of-speech tags of those words, and the syntactic parser Stanford Parser is used to obtain the phrase marks of the text;
then, word embedding, part of speech embedding and phrase embedding are generated by using a word2vec tool, and the three embedding dimensions are all 200;
For a Chinese sentence CS = (cc_1, cc_2, ..., cc_n), where cc_1, cc_2, ..., cc_n are characters: word embedding, specifically: the words of the Chinese medical text are obtained with the jieba word segmentation tool, and word2vec is used to obtain the embedding of the word to which each character belongs; part-of-speech embedding, specifically: based on the part-of-speech tags of the words of the Chinese medical text, word2vec is used to obtain the part-of-speech embedding of the word to which each character belongs; phrase embedding, specifically: based on the phrase marks of the Chinese medical text, word2vec is used to obtain the embedding of the type of the phrase to which each character belongs;
The word embedding is E_cw = (e_cw1, e_cw2, ..., e_cwn), where the words to which characters cc_1, cc_2, ..., cc_n belong are, in order, cw_1, cw_2, ..., cw_n; e_cwi (i = 1, 2, ..., n) is the embedding of word cw_i;
The part-of-speech embedding is E_cpos = (e_cpo1, e_cpo2, ..., e_cpon), where the parts of speech of words cw_1, cw_2, ..., cw_n are cpo_1, cpo_2, ..., cpo_n; e_cpoi (i = 1, 2, ..., n) is the embedding of part of speech cpo_i;
The phrase embedding is E_cph = (e_cph1, e_cph2, ..., e_cphn), where the phrase types to which characters cc_1, cc_2, ..., cc_n belong are, in order, cph_1, cph_2, ..., cph_n; e_cphi (i = 1, 2, ..., n) is the embedding of phrase type cph_i;
For example, for the Chinese sentence "the patient found yellowing of skin and sclera 4 months ago", the jieba word segmentation tool yields the segmentation "patient / 4 months / ago / found / skin / sclera / yellowing", which gives the word to which each character in the sentence belongs. For example, the character "患" belongs to the word "患者" (patient), so the word embedding of that character is the embedding of the word "patient";
The part-of-speech tags of the expanded Chinese sentence, obtained with the part-of-speech tagger Stanford POS Tagger, are "NN NN CD NN LC VV VV NN NN PU NN NN NR NR NR". The phrase marks of the expanded Chinese sentence, obtained with the syntactic parser Stanford Parser, are "NP NP NP NP LCP VP VP NP NP PU NP NP NP NP NP NP NP". For example, the character "患" belongs to the word "患者" (patient); the part-of-speech embedding of that character is the embedding of the part of speech "NN" of the word "patient", and its phrase embedding is the embedding of the type mark "NP" of the phrase to which it belongs;
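The affiliation of per-word tags to characters described above can be sketched as follows; the segmentation and tags are illustrative stand-ins, not actual jieba / Stanford POS Tagger output:

```python
# Illustrative stand-ins for the segmentation and per-word POS tags
words = ["患者", "4月", "前", "发现", "皮肤", "巩膜", "黄染"]
pos_tags = ["NN", "NT", "LC", "VV", "NN", "NN", "VV"]

def broadcast_to_characters(words, tags):
    """Give every character the tag of the word it belongs to, so each
    character's part-of-speech embedding is that of its word."""
    char_tags = []
    for w, t in zip(words, tags):
        char_tags.extend([t] * len(w))
    return char_tags

char_pos = broadcast_to_characters(words, pos_tags)
print(len(char_pos))  # 13 characters
```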
step 1.1.3: for an unstructured Chinese medical text, generating substring embedding by using a word2vec tool, wherein the embedding dimension is 200, and specifically comprises the following steps:
firstly, collecting a medical term dictionary and constructing a medical term substring set;
for any two terms, the longest common substring of the two terms is extracted and added to the medical term substring set. If two terms have a plurality of longest common substrings with the same length, the first longest common substring is taken and added into the medical term substring set;
secondly, generating embedded representations of all substrings in the substring set of the medical terms by using a word2vec tool;
Then, for each word cw_i (i = 1, 2, ..., n), it is judged whether the word cw_i contains substrings from the medical term substring set. For the Chinese sentence CS = (cc_1, cc_2, ..., cc_n), where cc_1, cc_2, ..., cc_n are characters whose words are, in order, cw_1, cw_2, ..., cw_n: suppose the word cw_i contains the substrings csubs_1, csubs_2, ..., csubs_p from the medical term substring set, whose embeddings are denoted e_cs1, e_cs2, ..., e_csp; then the substring embedding representation e_cssi of word cw_i is: the sum of e_cs1, e_cs2, ..., e_csp divided by the number of substrings p. If the word cw_i contains no substring from the medical term substring set, a custom value is output;
Finally, for each character cc_i (i = 1, 2, ..., n) in the sentence CS, the corresponding substring embedding is generated according to the above steps; the substring embedding of sentence CS is E_css = (e_css1, e_css2, ..., e_cssn);
For example, a medical dictionary is collected, the Systematized Nomenclature of Medicine - Clinical Terms (SNOMED CT), and the medical term substring set is constructed from it. For the word "消化道" (digestive tract) in a Chinese sentence, the word contains six substrings from the medical term substring set: "消", "化", "道", "消化", "化道", and "消化道". The substring embedding of "消化道" is then the sum of the embeddings of these six substrings divided by the number of substrings, 6;
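The substring embedding of step 1.1.3 can be sketched as follows, with a toy embedding table standing in for word2vec; the longest-common-substring extraction used to build the medical term substring set is also shown:

```python
import numpy as np

def longest_common_substring(a, b):
    """Dynamic-programming longest common substring; returns the first
    longest match, as specified for building the medical term substring set."""
    best, best_end = 0, 0
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
                if dp[i][j] > best:
                    best, best_end = dp[i][j], i
    return a[best_end - best:best_end]

def substring_embedding(word, substring_set, embed, dim=200, default=None):
    """Average the embeddings of all set substrings contained in `word`;
    fall back to a custom value (here a zero vector) when none is contained."""
    hits = [s for s in substring_set if s in word]
    if not hits:
        return np.zeros(dim) if default is None else default
    return np.mean([embed[s] for s in hits], axis=0)
```

For instance, `longest_common_substring("abcde", "xbcdy")` yields `"bcd"`, which would then be added to the substring set.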
step 1.1.4: character embedding, word embedding, part of speech embedding, phrase embedding and substring embedding of the Chinese medical text are spliced to construct multi-granularity text embedding:
For the Chinese sentence CS = (cc_1, cc_2, ..., cc_n), where cc_1, cc_2, ..., cc_n are characters, the multi-granularity text embedding of character cc_i (i = 1, 2, ..., n) is constructed, i.e., the character embedding E_cc, word embedding E_cw, part-of-speech embedding E_cpos, phrase embedding E_cph, and substring embedding E_css of the sentence CS are concatenated to construct the multi-granularity text embedding of the sentence CS, as shown in (2);
E_cme = Concate(E_cc, E_cw, E_cpos, E_cph, E_css)    (2)
where Concate denotes the concatenation operation. The dimension of E_cme is 1568, i.e., 1568 (dimension of E_cme) = 768 (dimension of E_cc) + 200 (dimension of E_cw) + 200 (dimension of E_cpos) + 200 (dimension of E_cph) + 200 (dimension of E_css);
Thus, from step 1.1.1 to step 1.1.4, the multi-granularity text embedding of Chinese character cc_i (i = 1, 2, ..., n) is constructed and denoted E_cme;
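The concatenation of (2) can be sketched with zero matrices standing in for the five embeddings:

```python
import numpy as np

n = 512  # sentence length
# toy stand-ins for the five per-character embedding matrices
E_cc   = np.zeros((n, 768))  # character embedding (MC-Bert)
E_cw   = np.zeros((n, 200))  # word embedding
E_cpos = np.zeros((n, 200))  # part-of-speech embedding
E_cph  = np.zeros((n, 200))  # phrase embedding
E_css  = np.zeros((n, 200))  # substring embedding

# multi-granularity text embedding, as in (2)
E_cme = np.concatenate([E_cc, E_cw, E_cpos, E_cph, E_css], axis=1)
print(E_cme.shape)  # (512, 1568)
```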
Step 1.2: for unstructured English medical text, generating multi-granularity text embedding, comprising the steps of:
step 1.2.1: learning of symbol embedding, segmentation embedding and covering embedding is carried out by utilizing a pre-training language model BioBert, and word embedding of an unstructured English medical text is generated;
BioBert is a pre-training model generated by training according to English medical data;
For an English sentence ES = (ew_1, ew_2, ..., ew_n), where ew_1, ew_2, ..., ew_n are words, a word embedding representation of the sentence ES is generated using the pre-trained language model BioBert, as shown in (3);
E_ew = (e_ew1, e_ew2, ..., e_ewn), E_ew ∈ R^(n×m)    (3)
where n is the sentence length 512, i.e., the number of words in the sentence, padded with 0 when there are fewer than 512 words; m is the dimension 768 set by the pre-trained model BioBert; e_ewi (i = 1, 2, ..., n) is the 768-dimensional embedding of word ew_i; the word embedding representation E_ew has dimension 512 × 768; R^(n×m) denotes a real matrix of n rows and m columns;
step 1.2.2: generating character embedding, part of speech embedding and phrase embedding of English medical texts;
for English medical texts, word2vec tools are used for generating character embedding, part of speech embedding and phrase embedding, and the embedding dimension is 200;
For an English sentence ES = (ew_1, ew_2, ..., ew_n), where ew_1, ew_2, ..., ew_n are words, the character embedding of word ew_i (i = 1, 2, ..., n) is formed by averaging the embeddings of all characters of ew_i, the part-of-speech embedding is the embedding of the part of speech of ew_i, and the phrase embedding is the embedding of the type of the phrase to which ew_i belongs;
The character embedding is E_ec = (e_ec1, e_ec2, ..., e_ecn), where e_eci (i = 1, 2, ..., n) is the character embedding of word ew_i;
The part-of-speech embedding is E_epos = (e_epo1, e_epo2, ..., e_epon), where the parts of speech of words ew_1, ew_2, ..., ew_n are, in order, epo_1, epo_2, ..., epo_n; e_epoi (i = 1, 2, ..., n) is the embedding of part of speech epo_i;
The phrase embedding is E_eph = (e_eph1, e_eph2, ..., e_ephn), where the phrase types to which words ew_1, ew_2, ..., ew_n belong are, in order, eph_1, eph_2, ..., eph_n; e_ephi (i = 1, 2, ..., n) is the embedding of phrase type eph_i;
step 1.2.3: for an unstructured English medical text, generating substring embedding by using a word2vec tool, wherein the embedding dimension is 200, and specifically comprises the following steps:
firstly, collecting an English medical term dictionary and constructing a medical term substring set; for any two terms, extracting the longest common substring of the two terms, and adding the substring to the medical term substring set; if two terms have a plurality of longest common substrings with the same length, the first longest common substring is taken and added into the medical term substring set;
secondly, generating embedded representations of all the substrings by using a word2vec tool;
Then, for the English sentence ES = (ew_1, ew_2, ..., ew_n), where ew_1, ew_2, ..., ew_n are words, for each word ew_i (i = 1, 2, ..., n) it is judged whether the word ew_i contains substrings from the medical term substring set; suppose the word ew_i contains the substrings esubs_1, esubs_2, ..., esubs_q from the medical term substring set, whose embeddings are denoted e_es1, e_es2, ..., e_esq; then the substring embedding representation e_essi of word ew_i is: the sum of e_es1, e_es2, ..., e_esq divided by the number of substrings q; if the word ew_i contains no substring from the medical term substring set, a custom value is output;
Finally, for each word ew_i (i = 1, 2, ..., n) in the sentence ES, the corresponding substring embedding is generated according to the above steps; the substring embedding of sentence ES is E_ess = (e_ess1, e_ess2, ..., e_essn);
Step 1.2.4: character embedding, word embedding, part of speech embedding, phrase embedding and substring embedding of the English medical text are spliced to construct multi-granularity text embedding:
For an English sentence ES = (ew_1, ew_2, ..., ew_n), where ew_1, ew_2, ..., ew_n are words, the multi-granularity text embedding of word ew_i (i = 1, 2, ..., n) is constructed, i.e., the character embedding E_ec, word embedding E_ew, part-of-speech embedding E_epos, phrase embedding E_eph, and substring embedding E_ess of the sentence ES are concatenated to construct the multi-granularity text embedding of the sentence ES, as shown in (4);
E_eme = Concate(E_ec, E_ew, E_epos, E_eph, E_ess)    (4)
where Concate denotes the concatenation operation. The dimension of E_eme is 1568, i.e., 1568 (dimension of E_eme) = 768 (dimension of E_ew) + 200 (dimension of E_ec) + 200 (dimension of E_epos) + 200 (dimension of E_eph) + 200 (dimension of E_ess);
Thus, from step 1.2.1 to step 1.2.4, the multi-granularity text embedding of English word ew_i (i = 1, 2, ..., n) is constructed and denoted E_eme;
Step 2: generating the mode weight of all characters in the Chinese sentence according to the medical entity constitution mode, comprising the following steps:
step 2.1: constructing a Chinese medical entity composition mode;
The medical entity composition pattern has the form: "Y_1 + Y_2 + Y_3 + ... + Y_k", where Y_1, Y_2, Y_3, ..., Y_k denote categories of words and "+" denotes the string concatenation operation. The categories of words include negation words, clinical manifestations, anatomical parts, modifiers, disease names, physical examinations, numerical values, quantifiers, and drugs;
for example, negation words include none, absent, etc. The clinical manifestations include chill, sweating, increased heart rate, etc. Anatomical sites include the back, meniscus, left colon artery, etc. Modifiers include mild, more severe, etc. The disease names include rheumatic heart disease, multiple cancers, etc. Physical examination includes cardiopulmonary examination, electrocardiography, and the like. Quantifier includes degree, group, only, and the like. The medicines comprise cedilanid, cefuroxime axetil, aspirin and the like;
for example, constructing a medical entity constitutes a pattern of "negation + clinical manifestation", which is satisfied by the terms "no nausea" and "no fever". Because the term "no nausea" consists of the negation of the word "no" and clinical manifestations of "nausea", the term "no fever" consists of the negation of the word "no" and clinical manifestations of "fever";
step 2.2: generating a mode weight of characters in the Chinese sentence;
For the Chinese sentence CS = (cc_1, cc_2, ..., cc_n), where cc_1, cc_2, ..., cc_n are characters, it is judged whether the Chinese sentence CS matches a medical entity composition pattern, and the constructed pattern matching weight vector is: (w_1, w_2, ..., w_n);
Case 1: if the character string cc_i, cc_i+1, ..., cc_j satisfies the pattern "anatomical part", "disease", or "drug", each character cc_i, cc_i+1, ..., cc_j is given a pattern weight of 2;
Case 2: if the character string cc_i, cc_i+1, ..., cc_j satisfies another pattern, each character cc_i, cc_i+1, ..., cc_j is given a pattern weight of 1.5;
Case 3: if the character string cc_i, cc_i+1, ..., cc_j satisfies no pattern, each character cc_i, cc_i+1, ..., cc_j is given a pattern weight of 1;
For example, the input text "patient found yellow skin, sclera, 4 months ago" is matched against the medical entity composition patterns, and the generated pattern weight vector is: (1,1,1,1,1,1,1,2,2,1,2,2,1,1);
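The case 1-3 weighting of step 2.2 can be sketched as follows; the matcher is a hypothetical stand-in that takes pre-matched spans instead of running the full pattern matching of step 2.1:

```python
# Categories whose matched characters receive the highest weight (case 1)
HIGH = {"anatomical part", "disease", "drug"}

def pattern_weights(sentence, matches):
    """`matches` is a list of (start, end, category) spans. Characters in
    anatomical part / disease / drug spans get weight 2, characters in
    other matched spans get 1.5, and all remaining characters get 1."""
    w = [1.0] * len(sentence)
    for start, end, category in matches:
        value = 2.0 if category in HIGH else 1.5
        for i in range(start, end):
            w[i] = value
    return w

print(pattern_weights("abcdefg",
                      [(1, 3, "disease"),
                       (4, 6, "negation + clinical manifestation")]))
# [1.0, 2.0, 2.0, 1.0, 1.5, 1.5, 1.0]
```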
and step 3: using a graph attention network and a pattern-reinforced attention mechanism to perform node-embedded representation learning, comprising the steps of:
3.1, transforming the embedding dimensions of the Chinese character nodes or English word nodes by using the full connection layer;
The multi-granularity text embedding of each character in the Chinese sentence CS is input into a fully connected layer, which converts its embedding dimension from 1568 to 768. The reason for this conversion is that the multi-granularity text embedding dimension must be consistent with the node vector input dimension of the graph attention network used in step 3.2, namely 768. Similarly, the multi-granularity text embedding of each word in the English sentence ES is input into a fully connected layer, which converts its embedding dimension from 1568 to 768;
In the fully connected layer, first, the dimension is converted through a linear layer; then, the dropout method is used to prevent overfitting; finally, the activation function ReLU is applied to prevent vanishing gradients;
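The fully connected layer of step 3.1 (linear layer, dropout, ReLU) can be sketched in NumPy with random stand-in weights:

```python
import numpy as np

rng = np.random.default_rng(0)

def fully_connected(x, W, b, drop_rate=0.1, training=True):
    """Linear layer 1568 -> 768, then dropout (training only), then ReLU,
    mirroring step 3.1. The weights here are random stand-ins."""
    h = x @ W + b                      # linear dimension conversion
    if training and drop_rate > 0:     # inverted dropout against overfitting
        mask = rng.random(h.shape) >= drop_rate
        h = h * mask / (1.0 - drop_rate)
    return np.maximum(h, 0.0)          # ReLU activation

x = rng.standard_normal((512, 1568))           # multi-granularity embeddings
W = rng.standard_normal((1568, 768)) * 0.01
b = np.zeros(768)
print(fully_connected(x, W, b, training=False).shape)  # (512, 768)
```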
step 3.2: for Chinese medical text, the attention weights of the Chinese character nodes in the graph attention network are multiplied by the pattern weights of the characters in the Chinese sentence; for English medical text, the pattern weights of the English word nodes in the graph attention network are all set to 1;
for Chinese medical text, the node embedding of the graph attention network is character embedding, and the character embedding is 768-dimensional character embedding generated in step 3.1. For English medical texts, node embedding of the graph attention network is word embedding, and the word embedding is 768-dimensional word embedding generated in the step 3.1;
step 3.2.1: calculating attention weights of nodes in the graph attention network;
First, the multi-granularity text embedding h of a sentence is input into the attention layer of the graph attention network, where h is the embedding of the Chinese character nodes or English word nodes of a sentence of medical text, M is the number of nodes 512, and H is the node embedding dimension 768, as shown in (5);
h = (h_1, h_2, ..., h_M), h ∈ R^(M×H)    (5)
A linear transformation is applied to the input node embeddings, converting the 768-dimensional node embedding into a 16-dimensional one, where 16 is the number of class labels. The attention weight is then calculated using the LeakyRelu function, i.e., the importance e_uv of node v to node u, as shown in (6);
e_uv = LeakyRelu(a^T [W_1 h_u ‖ W_1 h_v])    (6)
where W_1 denotes a shared weight matrix, a denotes the weight vector of the attention layer, ‖ denotes concatenation, h_u denotes the embedding of the Chinese character node or English word node u of a sentence of medical text, and h_v denotes the embedding of the Chinese character node or English word node v;
Then, the Softmax function is used to normalize e_uv, obtaining α_uv, as shown in (7);
α_uv = exp(e_uv) / Σ_{k∈N_u} exp(e_uk)    (7)
where e_uk denotes the importance of node k to node u, α_uv denotes the normalized value of e_uv, and N_u denotes the neighbor nodes of node u;
Finally, the attention weight α_u of node u is generated, as shown in (8);
α_u = Σ_{v∈N_u} α_uv W_2 h_v    (8)
where W_2 denotes a weight matrix;
step 3.2.2: updating attention weights of nodes in the graph attention network;
First, for node u, the attention weight α_u of node u is updated using the pattern weight w_u of the Chinese character or English word represented by node u, as shown in (9);
α_u = α_u × w_u    (9)
Second, the attention weight attention_l (1 ≤ l ≤ k) of the sentence is constructed, as shown in (10);
attention_l = (α_1, α_2, ..., α_M)    (10)
A multi-head attention mechanism is then introduced into the graph attention network. Specifically, k attention weights are calculated, each attention weight is multiplied by the input h, and a feature h′_l of the sentence is generated, as shown in (11);
h′_l = attention_l × h    (11)
Each head's output is obtained through the activation function elu, giving elu(h′_1), elu(h′_2), ..., elu(h′_k); finally, the outputs of the k heads are concatenated to generate h′, as shown in (12);
h′ = concat(elu(h′_1), elu(h′_2), ..., elu(h′_k))    (12)
Finally, the final output h_final is generated through the log_softmax function, as shown in (13);
h_final = log_softmax(h′)    (13)
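Steps 3.2.1-3.2.2 can be sketched as a single attention head in NumPy; the score function follows the standard graph attention form assumed in (6), and all sizes are toy values rather than the 512-node, 768-dimensional setting of the invention:

```python
import numpy as np

rng = np.random.default_rng(1)

def leaky_relu(x, slope=0.01):
    return np.where(x > 0, x, slope * x)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def pattern_attention(h, adj, W1, a, pattern_w):
    """One attention head with pattern-enhanced weights, a sketch of
    steps 3.2.1-3.2.2 under standard graph-attention assumptions:
    e_uv = LeakyRelu(a^T[W1 h_u || W1 h_v]), softmax over neighbors,
    then each node's weights are scaled by its pattern weight, as in (9)."""
    z = h @ W1                                   # (M, F) transformed nodes
    M = z.shape[0]
    scores = leaky_relu(
        np.concatenate([np.repeat(z, M, 0), np.tile(z, (M, 1))], 1) @ a
    ).reshape(M, M)
    scores = np.where(adj > 0, scores, -1e9)     # keep only neighbor edges
    alpha = softmax(scores, axis=1)
    alpha = alpha * pattern_w[:, None]           # pattern-enhanced update
    return alpha @ z                             # aggregated node features

M, H, F = 4, 8, 3
h = rng.standard_normal((M, H))
adj = np.ones((M, M))                            # fully connected toy graph
W1 = rng.standard_normal((H, F)) * 0.1
a = rng.standard_normal(2 * F)
w = np.array([1.0, 2.0, 1.5, 1.0])               # pattern weights of step 2
print(pattern_attention(h, adj, W1, a, w).shape)  # (4, 3)
```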
and 4, step 4: generating an entity category label of the medical text by adopting a conditional random field, and outputting a medical entity recognition result, wherein the method specifically comprises the following steps: generating entity category labels of Chinese characters or English words;
The conditional probability of each character is calculated based on the conditional random field, i.e., the probability that each character belongs to each entity class label; the label with the highest probability is assigned to the corresponding character as its entity class label, and the medical entity recognition result is then output;
and (3) carrying out sequence labeling on sentences in the medical text by adopting a conditional random field, generating entity category labels of Chinese characters or English words, and outputting a medical text entity recognition result.
For example, for a dataset, its entity class labels include: "PAD", "CLS", "SEP", "O", "B-disease and diagnosis", "I-disease and diagnosis", "B-surgery", "I-surgery", "B-anatomical site", "I-anatomical site", "B-drug", "I-drug", "B-imaging examination", "I-imaging examination", "B-laboratory test", "I-laboratory test";
For example, in the sentence "the patient has yellow skin and sclera, with decreased appetite before 4 months, after dinner, with paroxysmal abdominal pain, nausea, no diarrhea, vomiting, chest distress, suffocating, dizziness, headache, no object rotation, no fever, cough, chest pain, asthma, and light stool.", the result after conditional random field sequence labeling is [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 8, 9, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 8, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3]; each number in the list is the index of the predicted entity class label of the character at that position. The indexes are converted into the corresponding entity labels through the idx2tag function, and the final entity recognition result is O O O O O O O O B-anatomical part I-anatomical part O O O O O O O O O O O O O O O O O O O B-anatomical part O O O O O O O O O O O O B-anatomical part O O O O O O O O O O O O O O O O O O O O O O O O B.
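The idx2tag conversion mentioned above can be sketched as follows; the label inventory follows the example list given for the dataset (using its "anatomical site" naming):

```python
# Entity class labels in the order given for the example dataset
LABELS = ["PAD", "CLS", "SEP", "O",
          "B-disease and diagnosis", "I-disease and diagnosis",
          "B-surgery", "I-surgery",
          "B-anatomical site", "I-anatomical site",
          "B-drug", "I-drug",
          "B-imaging examination", "I-imaging examination",
          "B-laboratory test", "I-laboratory test"]

def idx2tag(indices):
    """Map predicted label indexes from the conditional random field back
    to entity class label strings."""
    return [LABELS[i] for i in indices]

print(idx2tag([3, 3, 8, 9, 3]))
# ['O', 'O', 'B-anatomical site', 'I-anatomical site', 'O']
```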
In order to illustrate the medical entity recognition effect of the invention, two methods were compared on the same training set and test set under identical conditions. The first method is a medical entity recognition method based on a bidirectional long short-term memory network, an attention mechanism, and a conditional random field, with a medical dictionary and part-of-speech features introduced; the second method is the medical entity recognition method of the present invention.
The evaluation indexes adopted are: precision, recall, and F1 value. The medical entity recognition results are: the prior-art method based on the bidirectional long short-term memory network, attention mechanism, and conditional random field achieves a precision of 76.42%, a recall of 73.80%, and an F1 value of 75.08%; the method of the present invention achieves a precision of 86.38%, a recall of 85.82%, and an F1 value of 86.10%. The experiments show the effectiveness of the proposed medical entity recognition method based on multi-granularity text embedding;
while the foregoing is directed to the preferred embodiment of the present invention, it is not intended that the invention be limited to the embodiment and the drawings disclosed herein. Equivalents and modifications may be made without departing from the spirit of the disclosure, which is to be considered as within the scope of the invention.

Claims (9)

1. A medical entity recognition method based on multi-granularity text embedding is characterized in that: the method comprises the following steps:
step 1: the method for constructing the multi-granularity text embedding through the pre-training language model comprises the following steps:
step 1.1: for unstructured Chinese medical texts, constructing multi-granularity text embedding;
step 1.1.1: learning of symbol embedding, segmentation embedding and covering embedding is carried out by utilizing a pre-training language model MC-Bert, and character embedding of an unstructured Chinese medical text is generated;
for an unstructured Chinese medical text, the input of a pre-training language model MC-Bert consists of three kinds of embedding, namely symbol embedding, segmentation embedding and covering embedding;
step 1.1.2: generating word embedding, part of speech embedding and phrase embedding of the Chinese medical text;
firstly, for a Chinese medical text, a jieba word segmentation tool is used for obtaining words of the Chinese medical text, a part of speech marker of the words of the Chinese medical text is obtained by using a part of speech marker Stanford posttagger, and a phrase marker of the Chinese medical text is obtained by using a syntax analyzer Stanford parser;
then, word embedding, part of speech embedding and phrase embedding are generated by using a word2vec tool;
step 1.1.3: for an unstructured Chinese medical text, generating substring embedding by using a word2vec tool, specifically:
firstly, collecting a medical term dictionary, and constructing a medical term substring set, specifically comprising the following steps: for any two terms, extracting the longest common substring of the two terms, and adding the substring to the medical term substring set; if two terms have a plurality of longest common substrings with the same length, the first longest common substring is taken and added into the medical term substring set;
secondly, generating embedded representations of all substrings in the substring set of the medical terms by using a word2vec tool;
step 1.1.4: character embedding, word embedding, part of speech embedding, phrase embedding and substring embedding of the Chinese medical text are spliced to construct multi-granularity text embedding;
step 1.2: for unstructured English medical text, constructing multi-granularity text embedding, comprising the following steps:
step 1.2.1: learning of symbol embedding, segmentation embedding and covering embedding is carried out by utilizing a pre-training language model BioBert, and word embedding of an unstructured English medical text is generated;
BioBert is a pre-training model generated by training according to English medical data;
step 1.2.2: generating character embedding, part of speech embedding and phrase embedding of English medical texts;
for English medical texts, character embedding, part-of-speech embedding and phrase embedding are generated by using a word2vec tool;
step 1.2.3: for unstructured English medical texts, generating substring embedding by using a word2vec tool;
firstly, collecting an English medical term dictionary and constructing a medical term substring set; for any two terms, extracting the longest common substring of the two terms, and adding the substring to the medical term substring set; if two terms have a plurality of longest common substrings with the same length, the first longest common substring is taken and added into the medical term substring set;
secondly, generating embedded representations of all the substrings by using a word2vec tool;
Then, for the English sentence ES = (ew_1, ew_2, ..., ew_n), where ew_1, ew_2, ..., ew_n are words, for each word ew_i (i = 1, 2, ..., n) it is judged whether the word ew_i contains substrings from the medical term substring set; suppose the word ew_i contains the substrings esubs_1, esubs_2, ..., esubs_q from the medical term substring set, whose embeddings are denoted e_es1, e_es2, ..., e_esq; then the substring embedding representation e_essi of word ew_i is: the sum of e_es1, e_es2, ..., e_esq divided by the number of substrings q; if the word ew_i contains no substring from the medical term substring set, a custom value is output;
Finally, for each word ew_i (i = 1, 2, ..., n) in the sentence ES, the corresponding substring embedding is generated according to the above steps; the substring embedding of sentence ES is E_ess = (e_ess1, e_ess2, ..., e_essn);
Step 1.2.4: splicing character embedding, word embedding, part of speech embedding, phrase embedding and substring embedding of the English medical text to construct multi-granularity text embedding;
step 2: generating the mode weight of all characters in the Chinese sentence according to the medical entity constitution mode, comprising the following steps:
step 2.1: constructing a Chinese medical entity composition mode;
The medical entity composition pattern has the form: "Y_1 + Y_2 + Y_3 + ... + Y_k";
wherein Y_1, Y_2, Y_3, ..., Y_k denote the categories of words, and "+" denotes the string concatenation operation;
the category of the words comprises negative words, clinical manifestations, anatomical parts, modifiers, disease names, physical examination, numerical values, quantifiers, and drugs;
step 2.2: generating a mode weight of characters in the Chinese sentence;
step 3: using a graph attention network and a pattern-reinforced attention mechanism to perform node embedding representation learning, comprising the steps of:
step 3.1: transforming the embedding dimensions of the Chinese character nodes or English word nodes by using a fully connected layer;
inputting the multi-granularity text embedding of each character in the Chinese sentence CS into a fully connected layer, and converting the embedding dimension of the multi-granularity text embedding of the Chinese sentence CS; the dimension is converted because the multi-granularity text embedding dimension needs to be consistent with the node vector input dimension of the graph attention network used in step 3.2;
similarly, the multi-granularity text embedding of each word in the English sentence ES is input into the fully connected layer, and the embedding dimension of the English multi-granularity text embedding is converted;
in the fully connected layer, the dimensionality is first converted through a linear layer; then the dropout method is used to prevent overfitting; finally, the activation function ReLU is applied to alleviate vanishing gradients;
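A minimal sketch of such a fully connected layer (linear transform, then dropout, then ReLU), written in NumPy rather than a specific deep-learning framework; the input dimension 10, output dimension 6, and dropout rate are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def fully_connected(x, W, b, p_drop=0.1, train=True):
    """Linear layer -> dropout -> ReLU, as in step 3.1."""
    y = x @ W + b                              # convert the embedding dimension
    if train:                                  # inverted dropout to prevent overfitting
        mask = rng.random(y.shape) >= p_drop
        y = y * mask / (1.0 - p_drop)
    return np.maximum(y, 0.0)                  # ReLU activation

# Map 10-dimensional multi-granularity embeddings of 5 characters
# to 6-dimensional node vectors for the graph attention network.
x = rng.normal(size=(5, 10))
W = rng.normal(size=(10, 6))
b = np.zeros(6)
out = fully_connected(x, W, b, train=False)
```

At inference time (`train=False`) dropout is disabled, which matches the usual convention.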
step 3.2: for the Chinese medical text, multiplying the attention weight of the Chinese character nodes in the graph attention network by the mode weight of the characters in the Chinese sentence; for the English medical text, setting the mode weight of the English word nodes in the graph attention network to 1;
for the Chinese medical text, the node embedding of the graph attention network is the character embedding generated in step 3.1; for the English medical text, the node embedding of the graph attention network is the word embedding generated in step 3.1;
step 3.2.1: calculating attention weights of nodes in the graph attention network;
step 3.2.2: updating attention weights of nodes in the graph attention network;
step 4: generating entity category labels of the medical text by adopting a conditional random field, and outputting the medical entity recognition result, specifically: generating entity category labels of Chinese characters or English words;
calculating the conditional probability distribution of each character based on the conditional random field, namely calculating the probability that each character belongs to each entity category label, assigning the label with the highest probability to the corresponding character as its entity category label, and outputting the medical entity recognition result;
that is, a conditional random field is adopted to perform sequence labeling on the sentences in the medical text, generate the entity category labels of Chinese characters or English words, and output the medical text entity recognition result.
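The label assignment in step 4 can be sketched with Viterbi decoding over emission and transition scores, which is how a conditional random field selects the highest-scoring label sequence; the toy scores and the three-label scheme (e.g. 0 = "O", 1 = "B-Disease", 2 = "I-Disease") are illustrative assumptions:

```python
def viterbi_decode(emissions, transitions):
    """Return the highest-scoring label sequence for one sentence.

    emissions[t][y]: score of label y for the t-th character;
    transitions[y1][y2]: score of moving from label y1 to label y2.
    """
    n_labels = len(emissions[0])
    score = list(emissions[0])                 # best score ending in each label
    back = []                                  # backpointers per position
    for t in range(1, len(emissions)):
        new_score, ptr = [], []
        for y in range(n_labels):
            best_prev = max(range(n_labels),
                            key=lambda y1: score[y1] + transitions[y1][y])
            new_score.append(score[best_prev] + transitions[best_prev][y]
                             + emissions[t][y])
            ptr.append(best_prev)
        score = new_score
        back.append(ptr)
    # Trace back the best path from the best final label.
    best = max(range(n_labels), key=lambda y: score[y])
    path = [best]
    for ptr in reversed(back):
        best = ptr[best]
        path.append(best)
    return path[::-1]

# Three characters; the scores favour B-Disease, I-Disease, then O.
labels = viterbi_decode([[0, 2, 0], [0, 0, 2], [2, 0, 0]], [[0, 0, 0]] * 3)
```

A trained CRF would additionally learn the transition scores; here they are set to zero for simplicity.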
2. The method for recognizing medical entities based on multi-granularity text embedding according to claim 1, wherein: the MC-Bert in step 1.1.1 is a pre-trained model generated by training on Chinese medical data; the symbol embedding in step 1.1.1 refers to the vector representation of each word; the segmentation embedding is used to distinguish two natural language sentences, and since the medical entity recognition task recognizes entities by taking sentences as units, every word has the same segmentation embedding; in the covering embedding, if the current position holds a character of the input sentence, the value is assigned 1; if the current position is padding, i.e. not a character of the input sentence, the value is assigned 0.
3. The method for recognizing medical entities based on multi-granularity text embedding according to claim 2, wherein: in step 1.1.1, for the Chinese sentence CS = (cc1, cc2, ..., ccn), where cc1, cc2, ..., ccn are characters, the pre-trained language model MC-Bert is adopted to generate the character embedding representation Ecc of sentence CS, as shown in (1):
Ecc = (ecc1, ecc2, ..., eccn), Ecc ∈ R^(n×m)   (1)
wherein n is the sentence length, namely the number of characters in the sentence, padded with 0 if there are fewer than n characters; m is the dimension set by the pre-trained model MC-Bert; ecci (i = 1, 2, ..., n) is the embedding of character cci; the character embedding representation Ecc has n rows and m columns; R^(n×m) denotes the set of real matrices with n rows and m columns.
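The zero-padding of sentences shorter than n can be sketched as follows; the sentence length n = 5 and the dimension m = 768 (a common BERT hidden size) are illustrative assumptions:

```python
import numpy as np

def pad_character_embeddings(embeddings, n):
    """Pad (or truncate) a stack of m-dimensional character embeddings to n rows."""
    m = embeddings.shape[1]
    padded = np.zeros((n, m))                  # fill with 0 when fewer than n characters
    rows = min(len(embeddings), n)
    padded[:rows] = embeddings[:rows]
    return padded

# A 3-character sentence padded to length n = 5 with m = 768.
ecc = np.ones((3, 768))
Ecc = pad_character_embeddings(ecc, 5)
```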
4. The method for recognizing medical entities based on multi-granularity text embedding according to claim 3, wherein step 1.1.2 specifically comprises: for the Chinese sentence CS = (cc1, cc2, ..., ccn), where cc1, cc2, ..., ccn are characters: word embedding, specifically: obtaining the words of the Chinese medical text based on the jieba word segmentation tool, and obtaining the word embedding of the word each character belongs to using word2vec; part-of-speech embedding, specifically: based on the part-of-speech tags of the words of the Chinese medical text, obtaining the part-of-speech embedding of the word each character belongs to using word2vec; phrase embedding, specifically: based on the phrase tags of the Chinese medical text, obtaining the embedding of the phrase type each character belongs to using word2vec;
wherein the word embedding is Ecw = (ecw1, ecw2, ..., ecwn), where the words to which characters cc1, cc2, ..., ccn belong are, in order, cw1, cw2, ..., cwn; ecwi (i = 1, 2, ..., n) is the embedding of word cwi;
the part-of-speech embedding is Ecpos = (ecpo1, ecpo2, ..., ecpon), where the parts of speech of words cw1, cw2, ..., cwn are cpo1, cpo2, ..., cpon; ecpoi (i = 1, 2, ..., n) is the embedding of part of speech cpoi;
the phrase embedding is Ecph = (ecph1, ecph2, ..., ecphn), where the phrase types to which characters cc1, cc2, ..., ccn belong are, in order, cph1, cph2, ..., cphn; ecphi (i = 1, 2, ..., n) is the embedding of phrase type cphi.
5. The method for recognizing medical entities based on multi-granularity text embedding according to claim 4, wherein: after the word2vec tool is used to generate the embedding representations of all the substrings in step 1.1.3, it is judged for each word cwi (i = 1, 2, ..., n) whether the word cwi contains substrings in the medical term substring set; for the Chinese sentence CS = (cc1, cc2, ..., ccn), where cc1, cc2, ..., ccn are characters, the words to which cc1, cc2, ..., ccn belong are, in order, cw1, cw2, ..., cwn; suppose word cwi contains the substrings csubs1, csubs2, ..., csubsp of the medical term substring set, and the embeddings of csubs1, csubs2, ..., csubsp are denoted ecs1, ecs2, ..., ecsp respectively; then the substring embedding representation ecssi of word cwi is the sum of ecs1, ecs2, ..., ecsp divided by the number of substrings p; if the word cwi contains no substring of the medical term substring set, a custom value is output;
finally, for each character cci (i = 1, 2, ..., n) in sentence CS, the corresponding substring embedding is generated according to the above steps; the substring embedding of sentence CS is Ecss = (ecss1, ecss2, ..., ecssn).
6. The method for recognizing medical entities based on multi-granularity text embedding according to claim 5, wherein step 1.1.4 specifically comprises:
for the Chinese sentence CS = (cc1, cc2, ..., ccn), where cc1, cc2, ..., ccn are characters, constructing the multi-granularity text embedding of character cci (i = 1, 2, ..., n), namely concatenating the character embedding Ecc, word embedding Ecw, part-of-speech embedding Ecpos, phrase embedding Ecph and substring embedding Ecss of sentence CS to construct the multi-granularity text embedding of sentence CS, as shown in (2):
Ecme = Concate(Ecc, Ecw, Ecpos, Ecph, Ecss)   (2)
wherein Concate represents the concatenation operation; in addition, the dimension of Ecme equals the dimension of Ecc + the dimension of Ecw + the dimension of Ecpos + the dimension of Ecph + the dimension of Ecss;
thus, through steps 1.1.1 to 1.1.4, the multi-granularity text embedding of Chinese character cci (i = 1, 2, ..., n) is constructed and denoted Ecme.
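The concatenation in (2) can be sketched as follows; the per-granularity dimensions (8, 6, 3, 3, 5) and sentence length are illustrative assumptions:

```python
import numpy as np

n = 4                                   # sentence length in characters
# Illustrative dimensions for the five granularities.
Ecc   = np.zeros((n, 8))                # character embedding
Ecw   = np.zeros((n, 6))                # word embedding
Ecpos = np.zeros((n, 3))                # part-of-speech embedding
Ecph  = np.zeros((n, 3))                # phrase embedding
Ecss  = np.zeros((n, 5))                # substring embedding

# Concate: join the five embeddings along the feature axis, so the dimension
# of Ecme is the sum of the five component dimensions (8+6+3+3+5 = 25).
Ecme = np.concatenate([Ecc, Ecw, Ecpos, Ecph, Ecss], axis=1)
```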
7. The method of claim 6, wherein: in step 1.2.1, for the English sentence ES = (ew1, ew2, ..., ewn), where ew1, ew2, ..., ewn are words, the pre-trained language model BioBert is used to generate the word embedding representation of sentence ES, as shown in (3):
Eew = (eew1, eew2, ..., eewn), Eew ∈ R^(n×m)   (3)
wherein n is the sentence length, namely the number of words in the sentence, padded with 0 if there are fewer than n words; m is the dimension set by the pre-trained model BioBert; eewi (i = 1, 2, ..., n) is the embedding of word ewi;
in step 1.2.2, for the English sentence ES = (ew1, ew2, ..., ewn), where ew1, ew2, ..., ewn are words, the character embedding of ewi (i = 1, 2, ..., n) is the average of the embeddings of all the characters composing ewi, the part-of-speech embedding is the embedding of the part of speech of ewi, and the phrase embedding is the embedding of the phrase type ewi belongs to;
wherein the character embedding is Eec = (eec1, eec2, ..., eecn), where eeci (i = 1, 2, ..., n) is the character embedding of word ewi;
the part-of-speech embedding is Eepos = (eepo1, eepo2, ..., eepon), where the parts of speech of words ew1, ew2, ..., ewn are, in order, epo1, epo2, ..., epon; eepoi (i = 1, 2, ..., n) is the embedding of part of speech epoi;
the phrase embedding is Eeph = (eeph1, eeph2, ..., eephn), where the phrase types to which words ew1, ew2, ..., ewn belong are, in order, eph1, eph2, ..., ephn; eephi (i = 1, 2, ..., n) is the embedding of phrase type ephi.
8. The method of claim 7, wherein step 1.2.4 specifically comprises:
for the English sentence ES = (ew1, ew2, ..., ewn), where ew1, ew2, ..., ewn are words, constructing the multi-granularity text embedding of ewi (i = 1, 2, ..., n), namely concatenating the character embedding Eec, word embedding Eew, part-of-speech embedding Eepos, phrase embedding Eeph and substring embedding Eess of sentence ES to construct the multi-granularity text embedding of sentence ES, as shown in (4):
Eeme = Concate(Eec, Eew, Eepos, Eeph, Eess)   (4)
wherein Concate represents the concatenation operation; in addition, the dimension of Eeme equals the dimension of Eec + the dimension of Eew + the dimension of Eepos + the dimension of Eeph + the dimension of Eess;
thus, through steps 1.2.1 to 1.2.4, the multi-granularity text embedding of English word ewi (i = 1, 2, ..., n) is constructed and denoted Eeme.
9. The method for recognizing medical entities based on multi-granularity text embedding according to claim 8, wherein: in step 2.2, for the Chinese sentence CS = (cc1, cc2, ..., ccn), where cc1, cc2, ..., ccn are characters, it is judged whether the Chinese sentence CS matches the medical entity constitution modes, and the constructed pattern matching weight vector is (w1, w2, ..., wn);
case 1: if the character string cci, cci+1, ..., ccj satisfies an "anatomical region", "disease" or "medicine" pattern, each of the characters cci, cci+1, ..., ccj is given a mode weight of 2;
case 2: if the character string cci, cci+1, ..., ccj satisfies another pattern, each of the characters cci, cci+1, ..., ccj is given a mode weight of 1.5;
case 3: if the character string cci, cci+1, ..., ccj satisfies no pattern, each of the characters cci, cci+1, ..., ccj is given a mode weight of 1;
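The three cases can be sketched as a small weight-assignment routine; how pattern matches are found is not specified here, so this sketch assumes the matched spans (with end index exclusive) are already available:

```python
def pattern_weights(sentence, matches):
    """Assign a mode weight to every character of `sentence`.

    matches: list of (start, end, category) spans matched by medical entity
    constitution modes, end exclusive. Unmatched characters keep weight 1.
    """
    weights = [1.0] * len(sentence)            # case 3: no pattern matched
    for start, end, category in matches:
        # case 1 categories get weight 2; any other matched pattern gets 1.5.
        w = 2.0 if category in ("anatomical region", "disease", "medicine") else 1.5
        for i in range(start, end):
            weights[i] = w
    return weights

# A 6-character sentence where characters 0-2 match a "disease" pattern
# and characters 3-4 match some other pattern (illustrative spans).
w = pattern_weights("abcdef", [(0, 3, "disease"), (3, 5, "clinical manifestation")])
```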
step 3.2.1, specifically:
first, the multi-granularity text embedding h of a sentence is input into the attention layer of the graph attention network, as shown in (5):
h = (h1, h2, ..., hM), h ∈ R^(M×H)   (5)
wherein hi (i = 1, 2, ..., M) is the embedding of a Chinese character node or English word node of one sentence of the medical text, M is the number of nodes, and H is the dimension of the node embedding;
the input node embeddings are linearly transformed, converting them to the dimension of the number of all category labels; the attention weight is then calculated with the LeakyRelu function, namely the importance degree euv of node v to node u, as shown in (6):
euv = LeakyRelu(aT[W1·hu ∥ W1·hv])   (6)
wherein W1 represents a shared weight matrix, a is the attention scoring vector, hu is the embedding of the Chinese character node or English word node u of one sentence of the medical text, and hv is the embedding of the Chinese character node or English word node v of one sentence of the medical text;
then, euv is normalized with the Softmax function to obtain αuv, as shown in (7):
αuv = exp(euv) / Σk∈Nu exp(euk)   (7)
wherein euk represents the importance degree of node k to node u, αuv denotes the normalized value of euv, and Nu denotes the neighbor nodes of node u;
finally, the attention weight αu of node u is generated, as shown in (8):
αu = Σv∈Nu αuv·W2·hv   (8)
wherein W2 represents a weight matrix;
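The attention-coefficient computation of (6) and (7) can be sketched in NumPy; the additive scoring vector `a` and the dimensions follow the standard graph attention formulation and are illustrative assumptions where the claim leaves details open:

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def attention_coefficients(h, W1, a, neighbors):
    """e_uv = LeakyReLU(a^T [W1 h_u ; W1 h_v]), softmax-normalised over u's neighbors."""
    g = h @ W1                                   # shared linear transform of node embeddings
    M = len(h)
    alpha = np.zeros((M, M))
    for u in range(M):
        e = np.array([leaky_relu(a @ np.concatenate([g[u], g[v]]))
                      for v in neighbors[u]])
        e = np.exp(e - e.max())                  # softmax normalisation as in (7)
        alpha[u, neighbors[u]] = e / e.sum()
    return alpha

rng = np.random.default_rng(1)
h = rng.normal(size=(3, 4))                      # 3 nodes, embedding dimension 4
W1 = rng.normal(size=(4, 2))
a = rng.normal(size=4)                           # scoring vector over [g_u ; g_v]
alpha = attention_coefficients(h, W1, a, {0: [1, 2], 1: [0, 2], 2: [0, 1]})
```

Each row of `alpha` holds the normalised weights of one node's neighbors and therefore sums to 1.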
step 3.2.2, specifically:
first, for node u, the attention weight αu of node u is updated using the mode weight wu of the Chinese character or English word represented by node u, as shown in (9):
αu = αu × wu   (9)
second, an attention weight attentionl (1 ≤ l ≤ k) is constructed for the sentence, as shown in (10):
attentionl = (α1, α2, ..., αM)   (10)
then, a multi-head attention mechanism is introduced into the graph attention network, specifically: k attention weights are calculated, each attention weight is multiplied by the input h, and a feature h′l of the sentence is generated, as shown in (11):
h′l = attentionl × h   (11)
through the activation function elu, the single-head outputs elu(h′1), elu(h′2), ..., elu(h′k) are obtained;
thirdly, the outputs of the k heads are concatenated to generate h′, as shown in (12):
h′ = Concat(elu(h′1), elu(h′2), ..., elu(h′k))   (12)
finally, the final output hfinal is generated through the log_softmax function, as shown in (13):
hfinal = log_softmax(h′)   (13).
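One possible reading of steps (9)-(13) can be sketched in NumPy as below; treating each head's attention as a vector of per-node weights, and the sizes M, H, k, are illustrative assumptions:

```python
import numpy as np

def log_softmax(x, axis=-1):
    z = x - x.max(axis=axis, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=axis, keepdims=True))

def pattern_reinforced_attention(h, attn_heads, mode_w):
    """Steps (9)-(13): scale each head's node weights by the mode weights,
    apply them to the input h, pass through elu, concatenate the k heads,
    and finish with log_softmax."""
    elu = lambda x: np.where(x > 0, x, np.exp(x) - 1.0)
    outs = []
    for attn in attn_heads:                      # attn: (M,) weights, one per node
        attn = attn * mode_w                     # (9): multiply by the mode weights
        outs.append(elu(attn @ h))               # (11): h'_l = attention_l × h
    return log_softmax(np.concatenate(outs))     # (12)-(13)

rng = np.random.default_rng(2)
M, H, k = 4, 3, 2                                # nodes, embedding dim, heads
h = rng.normal(size=(M, H))
heads = [rng.random(M) for _ in range(k)]
mode_w = np.array([2.0, 1.5, 1.0, 1.0])          # mode weights of the characters
out = pattern_reinforced_attention(h, heads, mode_w)
```

The output concatenates k heads of dimension H each, so `out` has k·H entries and exponentiates to a distribution summing to 1.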
CN202110890112.9A 2021-06-09 2021-08-04 Medical entity identification method based on multi-granularity text embedding Active CN113779993B (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110641595 2021-06-09
CN2021106415959 2021-06-09

Publications (2)

Publication Number Publication Date
CN113779993A true CN113779993A (en) 2021-12-10
CN113779993B CN113779993B (en) 2023-02-28

Family

ID=78836880

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110890112.9A Active CN113779993B (en) 2021-06-09 2021-08-04 Medical entity identification method based on multi-granularity text embedding

Country Status (1)

Country Link
CN (1) CN113779993B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114972929A (en) * 2022-07-29 2022-08-30 中国医学科学院医学信息研究所 Pre-training method and device for medical multi-modal model
CN115512859A (en) * 2022-11-21 2022-12-23 北京左医科技有限公司 Internet-based in-clinic quality management method, management device and storage medium
CN115618824A (en) * 2022-10-31 2023-01-17 上海苍阙信息科技有限公司 Data set labeling method and device, electronic equipment and medium
CN116629267A (en) * 2023-07-21 2023-08-22 云筑信息科技(成都)有限公司 Named entity identification method based on multiple granularities

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111079377A (en) * 2019-12-03 2020-04-28 哈尔滨工程大学 Method for recognizing named entities oriented to Chinese medical texts
US20200226472A1 (en) * 2019-01-10 2020-07-16 Arizona Board Of Regents On Behalf Of Arizona State University Systems and methods for a supra-fusion graph attention model for multi-layered embeddings and deep learning applications
CN111444721A (en) * 2020-05-27 2020-07-24 南京大学 Chinese text key information extraction method based on pre-training language model
CN112101031A (en) * 2020-08-25 2020-12-18 厦门渊亭信息科技有限公司 Entity identification method, terminal equipment and storage medium
CN112541356A (en) * 2020-12-21 2021-03-23 山东师范大学 Method and system for recognizing biomedical named entities
CN112597774A (en) * 2020-12-14 2021-04-02 山东师范大学 Chinese medical named entity recognition method, system, storage medium and equipment


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
MA NIANZU et al.: "Entity-Aware Dependency-based Deep Graph Attention Network for Comparative Preference Classification", NSF Public Access *


Also Published As

Publication number Publication date
CN113779993B (en) 2023-02-28

Similar Documents

Publication Publication Date Title
CN113779993B (en) Medical entity identification method based on multi-granularity text embedding
CN110032648B (en) Medical record structured analysis method based on medical field entity
Finkel et al. Nested named entity recognition
CN112597774B (en) Chinese medical named entity recognition method, system, storage medium and equipment
CN110162784B (en) Entity identification method, device and equipment for Chinese medical record and storage medium
Warjri et al. Identification of pos tag for khasi language based on hidden markov model pos tagger
Liang et al. Asynchronous deep interaction network for natural language inference
CN111523320A (en) Chinese medical record word segmentation method based on deep learning
CN114927177A (en) Medical entity identification method and system fusing Chinese medical field characteristics
Goswami et al. ULD@ NUIG at SemEval-2020 Task 9: Generative morphemes with an attention model for sentiment analysis in code-mixed text
Yohannes et al. A method of named entity recognition for tigrinya
Wang et al. A hybrid model based on deep convolutional network for medical named entity recognition
CN110543630A (en) Method and device for generating text structured representation and computer storage medium
Ahnaf et al. An improved extrinsic monolingual plagiarism detection approach of the Bengali text.
CN115759102A (en) Chinese poetry wine culture named entity recognition method
CN114444467A (en) Traditional Chinese medicine literature content analysis method and device
Baruah et al. Character coreference resolution in movie screenplays
Nunsanga et al. Part-of-speech tagging in Mizo language: A preliminary study
Bharti et al. Sarcasm as a contradiction between a tweet and its temporal facts: a pattern-based approach
Wen et al. Improving Extraction of Chinese Open Relations Using Pre-trained Language Model and Knowledge Enhancement
Asker et al. Classifying Amharic webnews
Patkar et al. A Neural Network Based Machine Translation model For English To Ahirani Language
Sil et al. The IBM Systems for Entity Discovery and Linking at TAC 2017.
Barathi Ganesh et al. MedNLU: natural language understander for medical texts
Nizami et al. Hindustani or hindi vs. urdu: A computational approach for the exploration of similarities under phonetic aspects

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant