CN113779993B - Medical entity identification method based on multi-granularity text embedding - Google Patents

Medical entity identification method based on multi-granularity text embedding

Info

Publication number
CN113779993B
Authority
CN
China
Prior art keywords
embedding, medical, word, character, text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110890112.9A
Other languages
Chinese (zh)
Other versions
CN113779993A (en)
Inventor
道捷
张春霞
彭成
薛晓军
王瞳
徐天祥
郭贵锁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Institute of Technology BIT
Original Assignee
Beijing Institute of Technology BIT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Institute of Technology BIT filed Critical Beijing Institute of Technology BIT
Publication of CN113779993A
Application granted
Publication of CN113779993B
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G06F40/295 Named entity recognition
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F16/35 Clustering; Classification
    • G06F40/242 Dictionaries
    • G06N3/045 Combinations of networks


Abstract

The invention relates to a medical entity identification method based on multi-granularity text embedding, belonging to the technical field of information extraction and knowledge graph construction. The method comprises the following steps. Constructing multi-granularity text embedding: multi-granularity text embeddings, comprising character embedding, word embedding, part-of-speech embedding, substring embedding, and phrase embedding, are constructed through a pre-trained language model. Generating pattern weights: pattern weights of all characters in a Chinese sentence are generated according to medical term composition patterns. Node embedding representation learning: node embedding representations are learned using a graph attention network and a pattern-enhanced attention mechanism. Outputting the medical entity recognition result: entity category labels of the medical text are generated with a conditional random field, and the medical entity recognition result is output. The method addresses the under-utilization of graph representation information and the single embedding granularity of distributed text representations in medical entity recognition, and improves recognition performance.

Description

Medical entity identification method based on multi-granularity text embedding
Technical Field
The invention relates to a medical entity identification method based on multi-granularity text embedding, and belongs to the technical field of information extraction and knowledge graph construction.
Background
Medical entity identification is an important research topic in information extraction and medical knowledge graph construction. It refers to identifying entities or terms of the medical domain in unstructured medical text. Medical entity recognition technology provides technical and knowledge support for medical question-answering systems, computer-aided diagnosis, precision medical knowledge services, and related fields.
Medical entity identification methods mainly comprise rule-based methods, statistical machine learning methods, and deep learning methods. The basic idea of rule-based methods is to identify medical entities in unstructured text according to hand-constructed entity composition rules, whose constituent elements include keywords, word categories, and the like.
Statistical machine learning methods recognize medical entities with models such as the maximum entropy model, hidden Markov model, conditional random field, and support vector machine, casting medical entity identification as a classification or sequence labeling problem. For example, one method combines conditional random fields and rules to identify named entities in Chinese electronic medical records: a conditional random field first performs recognition from token, suffix, keyword, dictionary, and length features; rules then refine the recognition result.
Deep learning based methods comprise distributed (embedded) encoding of the unstructured input text, contextual semantic encoding, and tag decoding. Embedded encoding of the input text mainly comprises character embedding and word embedding; contextual semantic encoding models include the convolutional neural network, bidirectional long short-term memory (BiLSTM) network, recurrent neural network, and the like. For example, one approach performs medical entity recognition on Chinese electronic medical records with a BiLSTM and a conditional random field model: it first generates a low-dimensional vector representation of each word, then recognizes medical entities with an attention-augmented BiLSTM and a conditional random field model.
The graph attention network builds on the graph convolutional neural network by introducing an attention mechanism, and has been applied to answer extraction in question-answering systems, information recommendation, relation extraction, and the like.
Existing medical entity identification methods mainly have the following problems. First, they mainly construct character, word, and part-of-speech embeddings of the text, and rarely introduce phrase and substring embeddings. Second, few methods model medical text with a graph attention network for entity recognition. Third, few methods fuse pattern- or rule-based approaches with deep learning approaches so as to fully and efficiently integrate the advantages of both: pattern- or rule-based methods achieve high precision, while deep learning methods need no time-consuming, labor-intensive feature engineering and support end-to-end nonlinear learning.
Disclosure of Invention
The invention aims to solve the under-utilization of graph representation information and the single embedding granularity of distributed text representations in medical entity recognition, and provides a medical entity recognition method based on multi-granularity text embedding. The method first constructs multi-granularity text embeddings, comprising character embedding, word embedding, part-of-speech embedding, substring embedding, and phrase embedding, realizing multi-granularity representation learning over the characters, words, parts of speech, phrases, and substrings of a medical text. It then performs medical entity recognition with a graph attention network, a pattern-enhanced attention mechanism, and a conditional random field, specifically: first, the graph attention network model builds a graph-embedded representation of the medical text; second, a pattern-enhanced attention mechanism is introduced into the graph attention network to strengthen the attention weights of the nodes, improving medical entity recognition performance.
In order to achieve the purpose, the invention adopts the following technical scheme:
the medical entity identification method based on multi-granularity text embedding comprises the following steps:
step 1: the method for constructing the multi-granularity text embedding through the pre-training language model comprises the following steps:
step 1.1: for unstructured Chinese medical texts, constructing multi-granularity text embedding;
step 1.1.1: learning of symbol embedding, segmentation embedding and covering embedding is carried out by utilizing a pre-training language model MC-Bert, and character embedding of an unstructured Chinese medical text is generated;
MC-Bert is a pre-training model generated according to Chinese medical data training;
for an unstructured Chinese medical text, the input of a pre-training language model MC-Bert consists of three kinds of embedding, namely symbol embedding, segmentation embedding and covering embedding;
wherein symbol embedding refers to the vector representation of each character; segmentation embedding distinguishes two natural language sentences, and since the medical entity recognition task recognizes entities sentence by sentence, every character has the same segmentation embedding; in the covering (mask) embedding, a position holding a character of the input sentence is assigned the value 1, and a position that holds no character of the input sentence is assigned the value 0;
for a Chinese sentence CS, CS = (cc_1, cc_2, ..., cc_n), where cc_1, cc_2, ..., cc_n are characters, the pre-trained language model MC-Bert generates the character embedding representation E_cc of the sentence CS, as shown in (1):

E_cc = (e_cc1, e_cc2, ..., e_ccn), E_cc ∈ R^{n×m} (1)

wherein n is the sentence length, i.e. the number of characters in the sentence, padded with 0 if the sentence has fewer than n characters; m is the dimension set by the pre-trained model MC-Bert; e_cci (i = 1, 2, ..., n) is the embedding of character cc_i; the character embedding representation E_cc has n rows and m columns; R^{n×m} denotes the set of real matrices with n rows and m columns;
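The padding and mask construction of step 1.1.1 can be sketched as follows (a minimal illustration in Python, the development language named in the embodiment; the function name and the `[PAD]` token are assumptions, not part of the patent):

```python
def build_mcbert_inputs(chars, n):
    """Pad a character sequence to length n and build the index sequences
    described in step 1.1.1: padded tokens, segment ids (all 0, since
    entities are recognized sentence by sentence, so every character has
    the same segmentation embedding), and the covering/mask values
    (1 for real characters, 0 for padding positions)."""
    padded = chars + ["[PAD]"] * (n - len(chars))
    segment_ids = [0] * n
    mask = [1] * len(chars) + [0] * (n - len(chars))
    return padded, segment_ids, mask

# Four real characters padded to a fixed sentence length of 6.
tokens, segments, mask = build_mcbert_inputs(list("头部疼痛"), 6)
# mask -> [1, 1, 1, 1, 0, 0]
```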
step 1.1.2: generating word embedding, part-of-speech embedding and phrase embedding of the Chinese medical text;
firstly, for a Chinese medical text, the jieba word segmentation tool obtains the words of the text, the Stanford POS tagger obtains the part-of-speech tags of those words, and the Stanford parser obtains the phrase tags of the text;
then, word embedding, part of speech embedding and phrase embedding are generated by using a word2vec tool;
for the Chinese sentence CS, CS = (cc_1, cc_2, ..., cc_n), where cc_1, cc_2, ..., cc_n are characters: word embedding, specifically: the words of the Chinese medical text are obtained with the jieba word segmentation tool, and word2vec generates the embedding of the word each character belongs to; part-of-speech embedding, specifically: based on the part-of-speech tags of the words of the Chinese medical text, word2vec generates the embedding of each word's part of speech; phrase embedding, specifically: based on the obtained phrase tags of the Chinese medical text, word2vec generates the embedding of the type of phrase each character belongs to;

wherein the word embedding is E_cw = (e_cw1, e_cw2, ..., e_cwn), where the words that characters cc_1, cc_2, ..., cc_n belong to are, in order, cw_1, cw_2, ..., cw_n, and e_cwi (i = 1, 2, ..., n) is the embedding of word cw_i;

the part-of-speech embedding is E_cpos = (e_cpo1, e_cpo2, ..., e_cpon), where the parts of speech of words cw_1, cw_2, ..., cw_n are, in order, cpo_1, cpo_2, ..., cpo_n, and e_cpoi (i = 1, 2, ..., n) is the embedding of part of speech cpo_i;

the phrase embedding is E_cph = (e_cph1, e_cph2, ..., e_cphn), where the phrase types that characters cc_1, cc_2, ..., cc_n belong to are, in order, cph_1, cph_2, ..., cph_n, and e_cphi (i = 1, 2, ..., n) is the embedding of phrase type cph_i;
step 1.1.3: for an unstructured Chinese medical text, generating substring embedding by using a word2vec tool, specifically:
firstly, collecting a medical term dictionary, and constructing a medical term substring set, specifically comprising the following steps: for any two terms, extracting the longest common substring of the two terms, and adding the substring to the medical term substring set; if two terms have a plurality of longest common substrings with the same length, the first longest common substring is taken and added into the medical term substring set;
secondly, generating embedded expressions of all substrings in the substring set of the medical terms by using a word2vec tool;
then, for each word cw_i (i = 1, 2, ..., n), judge whether cw_i contains substrings from the medical term substring set; for the Chinese sentence CS = (cc_1, cc_2, ..., cc_n), with characters cc_1, cc_2, ..., cc_n belonging, in order, to words cw_1, cw_2, ..., cw_n: suppose word cw_i contains the substrings csubs_1, csubs_2, ..., csubs_p of the medical term substring set, whose embeddings are e_cs1, e_cs2, ..., e_csp respectively; then the substring embedding representation e_cssi of word cw_i is the sum of e_cs1, e_cs2, ..., e_csp divided by the number of substrings p; if word cw_i contains no substring of the medical term substring set, a custom default value is output;

finally, for each character cc_i (i = 1, 2, ..., n) in sentence CS, its corresponding substring embedding is generated as above; the substring embedding of sentence CS is E_css = (e_css1, e_css2, ..., e_cssn);
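The longest-common-substring extraction of step 1.1.3 and the averaging rule above can be sketched as follows (a minimal illustration; the helper names and the fallback default vector are assumptions, not the patent's implementation):

```python
def longest_common_substring(a, b):
    # Dynamic programming over character pairs; strict ">" keeps the first
    # longest common substring found, matching the tie-breaking rule.
    best, best_end = 0, 0
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
                if dp[i][j] > best:
                    best, best_end = dp[i][j], i
    return a[best_end - best:best_end]

def substring_embedding(word, subs_vectors, default):
    # Average the vectors of all known substrings contained in the word;
    # fall back to a custom default vector if none match.
    hits = [v for s, v in subs_vectors.items() if s in word]
    if not hits:
        return default
    dim = len(hits[0])
    return [sum(v[d] for v in hits) / len(hits) for d in range(dim)]
```

For example, `longest_common_substring("abcdef", "zcdezz")` yields `"cde"`, and a word containing two known substrings receives the mean of their two vectors.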
Step 1.1.4: character embedding, word embedding, part of speech embedding, phrase embedding and substring embedding of the Chinese medical text are spliced to construct multi-granularity text embedding, and the method specifically comprises the following steps:
for the Chinese sentence CS = (cc_1, cc_2, ..., cc_n), with characters cc_1, cc_2, ..., cc_n, the multi-granularity text embedding of character cc_i (i = 1, 2, ..., n) is constructed by concatenating the character embedding E_cc, word embedding E_cw, part-of-speech embedding E_cpos, phrase embedding E_cph, and substring embedding E_css of sentence CS, yielding the multi-granularity text embedding of CS, as shown in (2):

E_cme = Concate(E_cc, E_cw, E_cpos, E_cph, E_css) (2)

wherein Concate denotes the concatenation operation; accordingly, the dimension of E_cme equals the sum of the dimensions of E_cc, E_cw, E_cpos, E_cph, and E_css;

thus, through steps 1.1.1 to 1.1.4, the multi-granularity text embedding E_cme of Chinese character cc_i (i = 1, 2, ..., n) is constructed;
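The concatenation of formula (2) amounts to joining the five per-character vectors end to end, so the output dimension is the sum of the five input dimensions; a minimal sketch (function name and example dimensions are illustrative only):

```python
def concat_multigranularity(e_cc, e_cw, e_cpos, e_cph, e_css):
    """Concatenate the five per-character embeddings of formula (2);
    with plain Python lists, concatenation is list addition, so the
    result dimension is the sum of the five input dimensions."""
    return e_cc + e_cw + e_cpos + e_cph + e_css

# Illustrative dimensions: 4 + 3 + 2 + 2 + 2 = 13.
e = concat_multigranularity([0.1] * 4, [0.2] * 3, [0.3] * 2, [0.4] * 2, [0.5] * 2)
assert len(e) == 4 + 3 + 2 + 2 + 2
```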
Step 1.2: for unstructured English medical text, multi-granularity text embedding is constructed, and the method comprises the following steps:
step 1.2.1: learning of symbol embedding, segmentation embedding and covering embedding is carried out by utilizing a pre-training language model BioBert, and word embedding of an unstructured English medical text is generated;
BioBert is a pre-training model generated by training according to English medical data;
for an English sentence ES, ES = (ew_1, ew_2, ..., ew_n), where ew_1, ew_2, ..., ew_n are words, the pre-trained language model BioBert generates the word embedding representation of the sentence ES, as shown in (3):

E_ew = (e_ew1, e_ew2, ..., e_ewn), E_ew ∈ R^{n×m} (3)

wherein n is the sentence length, i.e. the number of words in the sentence, padded with 0 if the sentence has fewer than n words; m is the dimension set by the pre-trained model BioBert; e_ewi (i = 1, 2, ..., n) is the embedding of word ew_i;
step 1.2.2: generating character embedding, part-of-speech embedding and phrase embedding of English medical texts;
for English medical texts, word2vec tools are used for generating character embedding, part of speech embedding and phrase embedding;
for an English sentence ES = (ew_1, ew_2, ..., ew_n) of words, the character embedding of word ew_i (i = 1, 2, ..., n) is the average of the embeddings of all characters of ew_i, the part-of-speech embedding is the embedding of the part of speech of ew_i, and the phrase embedding is the embedding of the type of phrase ew_i belongs to;

wherein the character embedding is E_ec = (e_ec1, e_ec2, ..., e_ecn), where e_eci (i = 1, 2, ..., n) is the character embedding of word ew_i;

the part-of-speech embedding is E_epos = (e_epo1, e_epo2, ..., e_epon), where the parts of speech of words ew_1, ew_2, ..., ew_n are, in order, epo_1, epo_2, ..., epo_n, and e_epoi (i = 1, 2, ..., n) is the embedding of part of speech epo_i;

the phrase embedding is E_eph = (e_eph1, e_eph2, ..., e_ephn), where the phrase types that words ew_1, ew_2, ..., ew_n belong to are, in order, eph_1, eph_2, ..., eph_n, and e_ephi (i = 1, 2, ..., n) is the embedding of phrase type eph_i;
step 1.2.3: for unstructured English medical texts, generating substring embedding by using a word2vec tool;
firstly, collecting an English medical term dictionary and constructing a medical term substring set; for any two terms, extracting the longest common substring of the two terms, and adding the substring to the medical term substring set; if two terms have a plurality of longest common substrings with the same length, the first longest common substring is taken and added into the medical term substring set;
secondly, generating embedded representations of all substrings by using a word2vec tool;
then, for an English sentence ES = (ew_1, ew_2, ..., ew_n) of words, for each word ew_i (i = 1, 2, ..., n), judge whether ew_i contains substrings from the medical term substring set; suppose word ew_i contains the substrings esubs_1, esubs_2, ..., esubs_q of the medical term substring set, whose embeddings are e_es1, e_es2, ..., e_esq respectively; then the substring embedding representation e_essi of word ew_i is the sum of e_es1, e_es2, ..., e_esq divided by the number of substrings q; if word ew_i contains no substring of the medical term substring set, a custom default value is output;

finally, for each word ew_i (i = 1, 2, ..., n) in sentence ES, its corresponding substring embedding is generated as above; the substring embedding of sentence ES is E_ess = (e_ess1, e_ess2, ..., e_essn);
Step 1.2.4: splicing character embedding, word embedding, part of speech embedding, phrase embedding and substring embedding of the English medical text to construct multi-granularity text embedding, specifically comprising the following steps of:
for an English sentence ES = (ew_1, ew_2, ..., ew_n) of words, the multi-granularity text embedding of word ew_i (i = 1, 2, ..., n) is constructed by concatenating the character embedding E_ec, word embedding E_ew, part-of-speech embedding E_epos, phrase embedding E_eph, and substring embedding E_ess of sentence ES, yielding the multi-granularity text embedding of ES, as shown in (4):

E_eme = Concate(E_ec, E_ew, E_epos, E_eph, E_ess) (4)

wherein Concate denotes the concatenation operation; accordingly, the dimension of E_eme equals the sum of the dimensions of E_ec, E_ew, E_epos, E_eph, and E_ess;

thus, through steps 1.2.1 to 1.2.4, the multi-granularity text embedding E_eme of English word ew_i (i = 1, 2, ..., n) is constructed;
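Step 1.2.2's rule that an English word's character embedding is the average of its characters' embeddings can be sketched as follows (character vectors and the zero fallback for unseen characters are illustrative assumptions):

```python
def word_char_embedding(word, char_vectors, dim=2):
    # Per step 1.2.2, a word's character embedding is the mean of the
    # embeddings of all of its characters; characters without a known
    # vector fall back to zeros here (an assumption for illustration).
    vecs = [char_vectors.get(c, [0.0] * dim) for c in word]
    return [sum(v[d] for v in vecs) / len(vecs) for d in range(dim)]

# Toy 2-dimensional character vectors.
emb = word_char_embedding("ab", {"a": [1.0, 3.0], "b": [3.0, 1.0]})
# -> [2.0, 2.0]
```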
step 2: generate the pattern weights of all characters in a Chinese sentence according to the medical entity composition patterns, comprising the following steps:
step 2.1: construct the Chinese medical entity composition patterns;
the medical entity constitution mode has the constitution form: "Y" is 1 +Y 2 +Y 3 +...+Y k ”;
Wherein, Y 1 ,Y 2 ,Y 3 ,...,Y k Represents the category of the word, "+" represents the link operation of the character string;
the category of the words comprises negative words, clinical manifestations, anatomical parts, modifiers, disease names, physical examination, numerical values, quantifiers, and drugs;
step 2.2: generate the pattern weights of the characters in the Chinese sentence;
for the Chinese sentence CS = (cc_1, cc_2, ..., cc_n) of characters, judge whether the Chinese sentence CS matches a medical entity composition pattern, and construct the pattern matching weight vector (w_1, w_2, ..., w_n):

case 1: if a character string cc_i, cc_{i+1}, ..., cc_j satisfies an "anatomical part", "disease", or "drug" pattern, each of the characters cc_i, cc_{i+1}, ..., cc_j is given a pattern weight of 2;

case 2: if a character string cc_i, cc_{i+1}, ..., cc_j satisfies any other pattern, each of the characters cc_i, cc_{i+1}, ..., cc_j is given a pattern weight of 1.5;

case 3: if a character string cc_i, cc_{i+1}, ..., cc_j satisfies no pattern, each of the characters cc_i, cc_{i+1}, ..., cc_j is given a pattern weight of 1;
step 3: perform node embedding representation learning using the graph attention network and the pattern-enhanced attention mechanism, comprising the following steps:
step 3.1: transform the embedding dimension of Chinese character nodes or English word nodes with a fully-connected layer;

the multi-granularity text embedding of each character in the Chinese sentence CS is input to a fully-connected layer, which converts its embedding dimension; the conversion is needed because the multi-granularity text embedding dimension must match the node vector input dimension of the graph attention network used in step 3.2;

similarly, the multi-granularity text embedding of each word in the English sentence ES is input to the fully-connected layer, which converts its embedding dimension;

in the fully-connected layer, a linear layer first converts the dimension; dropout is then applied to prevent overfitting; finally, the ReLU activation function mitigates vanishing gradients;
step 3.2: for Chinese medical text, the attention weights of the Chinese character nodes in the graph attention network are multiplied by the pattern weights of the characters in the Chinese sentence; for English medical text, the pattern weight of every English word node is set to 1;

for Chinese medical text, the node embeddings of the graph attention network are the character embeddings generated in step 3.1; for English medical text, the node embeddings of the graph attention network are the word embeddings generated in step 3.1;
step 3.2.1: calculating attention weights of nodes in the graph attention network;
first, a multi-granular text embedding h of a sentence is input into an attention layer in an attention network, wherein,
Figure BDA0003195630660000071
embedding Chinese character nodes or English word nodes of a sentence of a medical text, wherein M is the number of the nodes, and H is the dimension of character embedding, as shown in (5);
Figure BDA0003195630660000072
performing linear transformation on the input node embedding, and converting the node embedding into the dimension of the number of all category labels; and calculating attention weight by using LeakyRelu function, namely calculating importance degree e of node v to node u uv As shown in (6);
Figure BDA0003195630660000073
wherein, W 1 A shared weight matrix is represented that is,
Figure BDA0003195630660000074
the embedding of a chinese character node or an english word node u representing a sentence of medical text,
Figure BDA0003195630660000075
embedding Chinese character nodes or English word nodes v representing one sentence of the medical text;
then, using Softmax function to pair e uv Normalization is carried out to obtain alpha uv As shown in (7);
Figure BDA0003195630660000081
wherein e is uk Represents the degree of importance of node u to node k, α uv Denotes e uv Normalized value of (1), N u A neighbor node representing node u;
finally, the attention weight alpha of the node u is generated u As shown in (8);
Figure BDA0003195630660000082
wherein, W 2 Representing a weight matrix;
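The attention-weight computation of step 3.2.1 can be sketched as follows (a minimal, fully-connected-graph illustration in the standard graph-attention style; the pairwise score function `W1_dot` stands in for the learned shared transformation, which is an assumption here, not the patent's trained model):

```python
import math

def leaky_relu(x, slope=0.01):
    # LeakyRelu as used when scoring node importance.
    return x if x > 0 else slope * x

def attention_scores(h, W1_dot):
    """Compute importance scores e_uv for every node pair via LeakyRelu,
    then Softmax-normalize each row over the node's neighbours (here:
    all nodes) to obtain alpha_uv, as in formula (7)."""
    M = len(h)
    e = [[leaky_relu(W1_dot(h[u], h[v])) for v in range(M)] for u in range(M)]
    alpha = []
    for row in e:
        exps = [math.exp(x) for x in row]
        z = sum(exps)
        alpha.append([x / z for x in exps])
    return alpha

# Toy score function: plain dot product of 2-dimensional node embeddings.
dot = lambda a, b: sum(x * y for x, y in zip(a, b))
alpha = attention_scores([[1.0, 0.0], [0.0, 1.0]], dot)
```

Each row of `alpha` sums to 1, as the Softmax normalization in formula (7) requires.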
step 3.2.2: updating attention weights of nodes in the graph attention network;
first, for node u, the attention weight alpha_u of node u is updated with the pattern weight w_u of the Chinese character or English word that node u represents, as shown in (9):

alpha_u = alpha_u × w_u (9)
second, an attention weight vector attention_l (1 ≤ l ≤ k) is constructed for the sentence, as shown in (10):

attention_l = (alpha_1, alpha_2, ..., alpha_M) (10)
then, a multi-head attention mechanism is introduced into the graph attention network, specifically: k attention weight vectors are computed, and each is multiplied by the input h to generate a sentence feature h'_l, as shown in (11):

h'_l = attention_l × h (11)

the per-head outputs elu(h'_1), elu(h'_2), ..., elu(h'_k) are generated with the elu activation function;
thirdly, the outputs of the k heads are concatenated to generate h', as shown in (12):

h' = Concat(elu(h'_1), elu(h'_2), ..., elu(h'_k)) (12)

finally, the log_softmax function generates the final output h_final, as shown in (13):

h_final = log_softmax(h') (13)
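Formulas (9), (11), and (12) of step 3.2.2 can be sketched together as follows (a minimal illustration with toy numbers; the representation of heads as plain weight vectors is an assumption for clarity, not the patent's tensor implementation):

```python
import math

def elu(x):
    # elu activation applied to each head's output.
    return x if x > 0 else math.exp(x) - 1.0

def multi_head_output(head_weights, h, pattern_w):
    """For each head, rescale its attention weights alpha by the
    per-character pattern weights (formula (9)), apply them to the node
    embeddings h (formula (11)), pass the result through elu, and
    concatenate the k head outputs (formula (12))."""
    outputs = []
    for alpha in head_weights:                              # one alpha per head
        scaled = [a * w for a, w in zip(alpha, pattern_w)]  # (9)
        dim = len(h[0])
        feat = [sum(s * node[d] for s, node in zip(scaled, h))  # (11)
                for d in range(dim)]
        outputs.extend(elu(x) for x in feat)                # per-head elu
    return outputs                                          # (12): concatenation

# Two heads, two nodes with 2-dimensional embeddings, pattern weights (1, 2).
out = multi_head_output([[0.5, 0.5], [1.0, 0.0]],
                        [[1.0, 2.0], [3.0, 4.0]], [1.0, 2.0])
# -> [3.5, 5.0, 1.0, 2.0]
```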
step 4: generate the entity category labels of the medical text with a conditional random field and output the medical entity recognition result, specifically: generate the entity category label of each Chinese character or English word;

based on the conditional random field, the conditional probability of each character belonging to each entity category label is computed, the label with the highest probability is assigned to the corresponding character as its entity category label, and the medical entity recognition result is output;

the conditional random field performs sequence labeling on the sentences of the medical text, generates the entity category labels of the Chinese characters or English words, and outputs the medical text entity recognition result.
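The decoding a linear-chain conditional random field performs in step 4 can be sketched as Viterbi decoding over per-character label scores (a minimal illustration; the emission and transition values and the label set are toy assumptions, not trained parameters):

```python
def viterbi_decode(emissions, transitions, labels):
    """Viterbi decoding: for each position and label, keep the best-scoring
    previous label, then backtrack to recover the highest-scoring label
    sequence, which the CRF assigns to the characters of the sentence."""
    n_labels = len(labels)
    score = list(emissions[0])
    back = []
    for emit in emissions[1:]:
        new_score, pointers = [], []
        for j in range(n_labels):
            best_i = max(range(n_labels),
                         key=lambda i: score[i] + transitions[i][j])
            new_score.append(score[best_i] + transitions[best_i][j] + emit[j])
            pointers.append(best_i)
        score, back = new_score, back + [pointers]
    best = max(range(n_labels), key=lambda j: score[j])
    path = [best]
    for pointers in reversed(back):
        best = pointers[best]
        path.append(best)
    return [labels[i] for i in reversed(path)]

# Two characters, two labels, neutral transitions.
tags = viterbi_decode(
    emissions=[[2.0, 0.0], [0.0, 2.0]],
    transitions=[[0.0, 0.0], [0.0, 0.0]],
    labels=["O", "B-Disease"])
# -> ['O', 'B-Disease']
```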
Advantageous effects
Compared with the traditional medical entity identification method, the medical entity identification method based on multi-granularity text embedding provided by the invention has the following beneficial effects:
1. the identification method is portable and robust and is not restricted to a particular corpus source; it performs graph representation modeling of medical text with a graph attention network, is not restricted to a particular corpus language, and can process both Chinese and English text;
2. the method constructs multi-granularity text embeddings of unstructured medical text, comprising character embedding, word embedding, part-of-speech embedding, substring embedding, and phrase embedding; by introducing multi-granularity text embedding, it mines the character, word, part-of-speech, phrase, and substring features of the medical text, realizes distributed representation learning at the string, lexical, and syntactic levels, enriches the entity feature information of the medical text, and improves the accuracy of medical entity recognition;
3. the method performs medical entity recognition with a graph attention network, a pattern-enhanced attention mechanism, and a conditional random field: first, the graph attention network model realizes graph representation modeling of the medical text and captures the graph structure information among its Chinese characters or English words; second, the pattern-enhanced attention mechanism introduces medical entity composition pattern features into the node attention weights of the graph attention network, effectively integrating the pattern-based and deep learning-based medical entity identification methods, fully exploiting the characteristics and advantages of both, and improving medical entity recognition performance;
4. the method can identify the medical entities of the unstructured Chinese medical text and the English medical text, and has wide application prospects in the fields of information retrieval, text classification, question-answering systems and the like.
Drawings
Fig. 1 is a flowchart illustrating a medical entity recognition based on multi-granular text embedding according to an embodiment of the present invention.
Detailed Description
The medical entity recognition system based on the method uses PyCharm as the development tool, Python as the development language and PyTorch as the development framework.
Preferred embodiments of the method of the present invention will be described in detail with reference to examples.
Examples
This embodiment describes a process of using the method for medical entity recognition based on multi-granularity text embedding according to the present invention, as shown in fig. 1.
First, character embeddings of Chinese medical texts and word embeddings of English medical texts are generated with the pre-trained language models MC-BERT and BioBERT. Word embedding, part-of-speech embedding, phrase embedding and substring embedding of the Chinese medical text, and character embedding, part-of-speech embedding, phrase embedding and substring embedding of the English medical text, are generated with the word2vec tool; these embeddings are concatenated to construct the final multi-granularity text embedding for the Chinese and English medical texts. Second, pattern weights of all characters in a Chinese sentence are generated according to the medical entity composition patterns, while the pattern weights of all words in an English sentence are set to 1. Then, node embedding representation learning is performed with the graph attention network and the pattern-enhanced attention mechanism, and the attention weights of the nodes in the graph attention network are updated with the pattern weights. Finally, a conditional random field predicts the entity label of each character in the Chinese medical text, or of each word in the English medical text, and the medical entity recognition result is output.

Experiments were performed on the CCKS2019 dataset. First, the multi-granularity text embedding of each sentence of medical text in the CCKS2019 dataset is generated. Second, the pattern weights of all characters in each sentence are generated according to the medical entity composition patterns. Then, the multi-granularity text embedding and the pattern matching weights are fed into the graph attention network; the pattern matching weights are multiplied by the attention weights of the nodes, and the final embedded representation of the input text is computed. Finally, the conditional random field outputs the final predicted entity labels according to the computed probabilities. The experimental results demonstrate the effectiveness of the invention.

The method can also be applied to the English medical text dataset NCBI Disease, the biochemistry dataset BC5CDR, and others. The process applied to NCBI Disease is broadly the same as for CCKS2019; the difference is that when the attention coefficients are calculated in the graph attention network, the pattern weights of the English word nodes are all set to 1. The process applied to BC5CDR differs as follows: when the multi-granularity text embedding is constructed, a term dictionary for the biochemistry domain must be used to generate the substring embedding, which is concatenated with the character, word, part-of-speech and phrase embeddings, and the concatenated embedding is fed into the graph attention network; when the graph attention network computes the attention weights, entity composition pattern matching weights for the biochemistry domain are added; the output of the graph attention network is fed into the conditional random field, which outputs the final predicted entity recognition result according to the probabilities.
As can be seen from fig. 1, the method specifically includes the following steps:
step 1: the method for constructing the multi-granularity text embedding through the pre-training language model comprises the following steps:
step 1.1: for unstructured Chinese medical texts, constructing multi-granularity text embedding;
step 1.1.1: learning of token embedding, segment embedding and mask embedding is performed with the pre-trained language model MC-BERT to generate the character embedding of the unstructured Chinese medical text;
MC-BERT is a pre-trained model trained on Chinese medical data;
for unstructured Chinese medical text, the input of the pre-trained language model MC-BERT consists of three kinds of embedding: token embedding, segment embedding and mask embedding. Token embedding is the vector representation of each character. Segment embedding distinguishes two natural language sentences; since the medical entity recognition task recognizes entities in units of single sentences, every character has the same segment embedding. In the mask embedding, if the current position holds a character of the input sentence, the value is 1; if the current position is padding, i.e. not a character of the input sentence, the value is 0;
for a Chinese sentence CS, CS = (cc_1, cc_2, ..., cc_n), where cc_1, cc_2, ..., cc_n are characters, the character embedding representation E_cc of the sentence CS is generated with the pre-trained language model MC-BERT, as shown in (1);

E_cc = (e_cc1, e_cc2, ..., e_ccn), E_cc ∈ R^(n×m)   (1)

where n is the sentence length 512, i.e. the number of characters in the sentence, padded with 0 if the number of characters is less than 512; m is the dimensionality 768 set by the pre-trained model MC-BERT; e_cci (i = 1, 2, ..., n) is the 768-dimensional embedding of the character cc_i; the character embedding representation E_cc has dimension 512 × 768; R^(n×m) denotes a real matrix with n rows and m columns;
for example, in the sentence "the patient has yellow staining of the skin and sclera, with decreased appetite 4 months ago, obvious after dinner, with paroxysmal abdominal pain, nausea, no diarrhea, vomiting, chest tightness, shortness of breath, dizziness, headache, no rotation of objects, no fever, cough, chest pain, asthma, and light-colored stool.", the characters are separated by "\t", and the markers "[CLS]" and "[SEP]" are added at the beginning and end of the sentence. To make the dimensions of the character embedding representations of different sentences consistent, the sentence is padded with 0 to length 512. The character embedding representation of the sentence generated with the pre-trained model MC-BERT is:
E_cc = [[x_11, x_12, ..., x_1m], [x_21, x_22, ..., x_2m], ..., [x_n1, x_n2, ..., x_nm]]

where n is the sentence length 512 and m is the character embedding dimension 768. The character embedding vector of the character "sick" is (x_11, x_12, ..., x_1m);
Step 1.1.2: generating word embedding, part of speech embedding and phrase embedding of the Chinese medical text;
firstly, for a Chinese medical text, the jieba word segmentation tool is used to obtain its words, the Stanford POS tagger is used to obtain the part-of-speech tags of the words, and the Stanford parser is used to obtain the phrase tags of the Chinese medical text;
then, word embedding, part-of-speech embedding and phrase embedding are generated with the word2vec tool; all three embeddings have dimension 200;
for the Chinese sentence CS, CS = (cc_1, cc_2, ..., cc_n), where cc_1, cc_2, ..., cc_n are characters. Word embedding, specifically: the words of the Chinese medical text are obtained with the jieba word segmentation tool, and the word embedding of the word each character belongs to is obtained with word2vec. Part-of-speech embedding, specifically: based on the part-of-speech tags of the words of the Chinese medical text, the part-of-speech embedding of the word each character belongs to is obtained with word2vec. Phrase embedding, specifically: based on the obtained phrase tags of the Chinese medical text, the type embedding of the phrase each character belongs to is obtained with word2vec;

the word embedding is E_cw = (e_cw1, e_cw2, ..., e_cwn), where the characters cc_1, cc_2, ..., cc_n belong in turn to the words cw_1, cw_2, ..., cw_n, and e_cwi (i = 1, 2, ..., n) is the embedding of the word cw_i;

the part-of-speech embedding is E_cpos = (e_cpo1, e_cpo2, ..., e_cpon), where the parts of speech of the words cw_1, cw_2, ..., cw_n are in turn cpo_1, cpo_2, ..., cpo_n, and e_cpoi (i = 1, 2, ..., n) is the embedding of the part of speech cpo_i;

the phrase embedding is E_cph = (e_cph1, e_cph2, ..., e_cphn), where the phrase types the characters cc_1, cc_2, ..., cc_n belong to are in turn cph_1, cph_2, ..., cph_n, and e_cphi (i = 1, 2, ..., n) is the embedding of the phrase type cph_i;
for example, for the Chinese sentence "the patient found yellow staining of the skin and sclera 4 months ago", the segmentation result obtained with the jieba tool is "patient / 4 / months / ago / found / skin / sclera / yellow staining". The segmentation result is then expanded to character level, so that the word each character in the sentence belongs to is given. For example, the character "suffer" belongs to the word "patient", so the word embedding of the character "suffer" is the embedding of the word "patient";

the part-of-speech tags of the expanded Chinese sentence obtained with the Stanford POS tagger are "NN NN CD NN LC VV NN NN PU NN NN NR NR NR", and the phrase tags of the expanded Chinese sentence obtained with the Stanford parser are "NP NP NP NP NP NP NP NP NP LCP VP NP NP PU NP NP NP NP NP NP". For example, the character "suffer" belongs to the word "patient"; the part-of-speech embedding of the character "suffer" is the embedding of the part of speech "NN" of the word "patient", and its phrase embedding is the embedding of the type tag "NP" of the phrase it belongs to;
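The expansion of a word segmentation to character level, so that each character inherits the embedding of the word it belongs to, can be sketched as below. The function name and the toy 3-dimensional vectors (standing in for 200-dimensional word2vec vectors) are illustrative assumptions.

```python
# Minimal sketch of step 1.1.2's character-level expansion (illustrative).
def char_level_word_embeddings(words, word_vectors):
    """Return one embedding per character: each character of a word
    receives that word's embedding."""
    per_char = []
    for w in words:
        for _ in w:                       # repeat once per character
            per_char.append(word_vectors[w])
    return per_char

words = ["患者", "4", "月", "前"]           # jieba-style segmentation
vecs = {"患者": [0.1, 0.2, 0.3], "4": [0.4, 0.4, 0.4],
        "月": [0.5, 0.1, 0.0], "前": [0.0, 0.9, 0.2]}
emb = char_level_word_embeddings(words, vecs)
# the two characters of "患者" both receive the embedding of "患者"
```

The same expansion applies to part-of-speech and phrase embeddings, since those are also attached to the word each character belongs to.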
step 1.1.3: for an unstructured Chinese medical text, generating substring embedding by using a word2vec tool, wherein the embedding dimension is 200, and specifically comprises the following steps:
firstly, collecting a medical term dictionary and constructing a medical term substring set;
for any two terms, the longest common substring of the two terms is extracted and added to the medical term substring set. If the two terms have several longest common substrings of the same length, the first one is taken and added to the medical term substring set;
secondly, generating embedded expressions of all substrings in the substring set of the medical terms by using a word2vec tool;
then, for each word cw_i (i = 1, 2, ..., n), it is judged whether the word cw_i contains substrings in the medical term substring set. For the Chinese sentence CS, CS = (cc_1, cc_2, ..., cc_n), where cc_1, cc_2, ..., cc_n are characters belonging in turn to the words cw_1, cw_2, ..., cw_n. Suppose the word cw_i contains the substrings csubs_1, csubs_2, ..., csubs_p of the medical term substring set, whose embeddings are denoted e_cs1, e_cs2, ..., e_csp. Then the substring embedding representation e_cssi of the word cw_i is: the sum of e_cs1, e_cs2, ..., e_csp divided by the number of substrings p. If the word cw_i contains no substring of the medical term substring set, a custom default value is output;

finally, for each character cc_i (i = 1, 2, ..., n) in the sentence CS, its corresponding substring embedding is generated according to the above steps; the substring embedding of the sentence CS is E_css = (e_css1, e_css2, ..., e_cssn);
For example, a medical dictionary is collected: the Systematized Nomenclature of Medicine, Clinical Terms (SNOMED CT), and the medical term substring set is constructed from it. For the word "digestive tract" (消化道) in a Chinese sentence, the word contains six substrings of the medical term substring set: the three single characters "消", "化" and "道", and the substrings "消化" ("digestion"), "化道" and "消化道" ("digestive tract"). The substring embedding of the string "digestive tract" is therefore the result of summing the embeddings of these six substrings and dividing by the number of substrings, 6;
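Step 1.1.3 can be sketched as follows: a classic dynamic-programming longest-common-substring routine builds the substring set, and contained-substring embeddings are averaged. This is an illustrative sketch, not the patent's code; the function names, the English example terms and the toy vectors are assumptions.

```python
# Minimal sketch of step 1.1.3 (illustrative).
from itertools import combinations

def longest_common_substring(a: str, b: str) -> str:
    """Classic DP; on ties, the first (leftmost in a) longest match wins."""
    best, best_len = "", 0
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
                if dp[i][j] > best_len:          # strict >: keep first tie
                    best_len = dp[i][j]
                    best = a[i - best_len:i]
    return best

def build_substring_set(terms):
    """Collect the longest common substring of every pair of terms."""
    subs = set()
    for t1, t2 in combinations(terms, 2):
        s = longest_common_substring(t1, t2)
        if s:
            subs.add(s)
    return subs

def substring_embedding(word, subs, vectors, default=None):
    """Average the embeddings of the set substrings contained in `word`;
    return the custom default value when no substring matches."""
    hits = [vectors[s] for s in subs if s in word and s in vectors]
    if not hits:
        return default
    dim = len(hits[0])
    return [sum(v[k] for v in hits) / len(hits) for k in range(dim)]

subs = build_substring_set(["digestive tract", "digestive system"])
emb = substring_embedding("abc", {"ab", "bc"}, {"ab": [1.0, 2.0], "bc": [3.0, 4.0]})
# emb averages the two contained substrings' vectors: [2.0, 3.0]
```

In the patented method the averaged vectors are 200-dimensional word2vec embeddings of the substrings.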
step 1.1.4: character embedding, word embedding, part of speech embedding, phrase embedding and substring embedding of the Chinese medical text are spliced to construct multi-granularity text embedding:
for the Chinese sentence CS, CS = (cc_1, cc_2, ..., cc_n), where cc_1, cc_2, ..., cc_n are characters, the multi-granularity text embedding of each character cc_i (i = 1, 2, ..., n) is constructed, i.e. the character embedding E_cc, word embedding E_cw, part-of-speech embedding E_cpos, phrase embedding E_cph and substring embedding E_css of the sentence CS are concatenated to construct the multi-granularity text embedding of the sentence CS, as shown in (2);

E_cme = Concate(E_cc, E_cw, E_cpos, E_cph, E_css)   (2)

where Concate denotes the concatenation operation. The dimension of E_cme is 1568, i.e. 1568 (dimension of E_cme) = 768 (dimension of E_cc) + 200 (dimension of E_cw) + 200 (dimension of E_cpos) + 200 (dimension of E_cph) + 200 (dimension of E_css);
thus, from step 1.1.1 to step 1.1.4, the multi-granularity text embedding E_cme of the Chinese characters cc_i (i = 1, 2, ..., n) is constructed;
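The concatenation of equation (2) and its dimension arithmetic can be checked with a small numpy sketch (illustrative; the toy sentence length replaces the padded length 512):

```python
# Minimal sketch of equation (2): concatenating the five per-character
# embeddings into the 1568-d multi-granularity embedding.
import numpy as np

n = 4                                    # toy sentence length (512 in the patent)
E_cc   = np.zeros((n, 768))              # character embedding (MC-BERT)
E_cw   = np.zeros((n, 200))              # word embedding (word2vec)
E_cpos = np.zeros((n, 200))              # part-of-speech embedding
E_cph  = np.zeros((n, 200))              # phrase embedding
E_css  = np.zeros((n, 200))              # substring embedding

E_cme = np.concatenate([E_cc, E_cw, E_cpos, E_cph, E_css], axis=1)
# per-character dimension: 768 + 4 * 200 = 1568
```

The same concatenation applies to the English embeddings of equation (4).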
Step 1.2: for unstructured English medical text, generating multi-granularity text embedding, comprising the steps of:
step 1.2.1: learning of token embedding, segment embedding and mask embedding is performed with the pre-trained language model BioBERT to generate the word embedding of the unstructured English medical text;
BioBert is a pre-training model generated according to English medical data training;
for an English sentence ES, ES = (ew_1, ew_2, ..., ew_n), where ew_1, ew_2, ..., ew_n are words, the word embedding representation of the sentence ES is generated with the pre-trained language model BioBERT, as shown in (3);

E_ew = (e_ew1, e_ew2, ..., e_ewn), E_ew ∈ R^(n×m)   (3)

where n is the sentence length 512, i.e. the number of words in the sentence, padded with 0 if the number of words is less than 512; m is the dimensionality 768 set by the pre-trained model BioBERT; e_ewi (i = 1, 2, ..., n) is the 768-dimensional embedding of the word ew_i; the word embedding representation E_ew has dimension 512 × 768; R^(n×m) denotes a real matrix with n rows and m columns;
step 1.2.2: generating character embedding, part of speech embedding and phrase embedding of English medical texts;
for English medical texts, word2vec tools are used for generating character embedding, part of speech embedding and phrase embedding, and the embedding dimension is 200;
for the English sentence ES, ES = (ew_1, ew_2, ..., ew_n), where ew_1, ew_2, ..., ew_n are words: the character embedding of a word ew_i (i = 1, 2, ..., n) is the average of the embeddings of all its characters; its part-of-speech embedding is the embedding of the part of speech of ew_i; its phrase embedding is the embedding of the type of the phrase it belongs to;

the character embedding is E_ec = (e_ec1, e_ec2, ..., e_ecn), where e_eci (i = 1, 2, ..., n) is the character embedding of the word ew_i;

the part-of-speech embedding is E_epos = (e_epo1, e_epo2, ..., e_epon), where the parts of speech of the words ew_1, ew_2, ..., ew_n are in turn epo_1, epo_2, ..., epo_n, and e_epoi (i = 1, 2, ..., n) is the embedding of the part of speech epo_i;

the phrase embedding is E_eph = (e_eph1, e_eph2, ..., e_ephn), where the phrase types the words ew_1, ew_2, ..., ew_n belong to are in turn eph_1, eph_2, ..., eph_n, and e_ephi (i = 1, 2, ..., n) is the embedding of the phrase type eph_i;
step 1.2.3: for an unstructured English medical text, generating substring embedding by using a word2vec tool, wherein the embedding dimension is 200, and specifically comprises the following steps:
firstly, collecting an English medical term dictionary and constructing a medical term substring set; for any two terms, extracting the longest common substring of the two terms, and adding the substring to the medical term substring set; if two terms have a plurality of longest common substrings with the same length, the first longest common substring is taken and added into the medical term substring set;
secondly, generating embedded representations of all the substrings by using a word2vec tool;
then, for the English sentence ES, ES = (ew_1, ew_2, ..., ew_n), where ew_1, ew_2, ..., ew_n are words: for each word ew_i (i = 1, 2, ..., n), it is judged whether ew_i contains substrings in the medical term substring set. Suppose the word ew_i contains the substrings esubs_1, esubs_2, ..., esubs_q of the medical term substring set, whose embeddings are denoted e_es1, e_es2, ..., e_esq. Then the substring embedding representation e_essi of the word ew_i is: the sum of e_es1, e_es2, ..., e_esq divided by the number of substrings q. If the word ew_i contains no substring of the medical term substring set, a custom default value is output;

finally, for each word ew_i (i = 1, 2, ..., n) in the sentence ES, its corresponding substring embedding is generated according to the above steps; the substring embedding of the sentence ES is E_ess = (e_ess1, e_ess2, ..., e_essn);
Step 1.2.4: character embedding, word embedding, part of speech embedding, phrase embedding and substring embedding of the English medical text are spliced to construct multi-granularity text embedding:
for the English sentence ES, ES = (ew_1, ew_2, ..., ew_n), where ew_1, ew_2, ..., ew_n are words, the multi-granularity text embedding of each word ew_i (i = 1, 2, ..., n) is constructed, i.e. the character embedding E_ec, word embedding E_ew, part-of-speech embedding E_epos, phrase embedding E_eph and substring embedding E_ess of the sentence ES are concatenated to construct the multi-granularity text embedding of the sentence ES, as shown in (4);

E_eme = Concate(E_ec, E_ew, E_epos, E_eph, E_ess)   (4)

where Concate denotes the concatenation operation. The dimension of E_eme is 1568, i.e. 1568 (dimension of E_eme) = 768 (dimension of E_ew) + 200 (dimension of E_ec) + 200 (dimension of E_epos) + 200 (dimension of E_eph) + 200 (dimension of E_ess);

thus, from step 1.2.1 to step 1.2.4, the multi-granularity text embedding E_eme of the English words ew_i (i = 1, 2, ..., n) is constructed;
Step 2: generating the mode weight of all characters in the Chinese sentence according to the medical entity constitution mode, comprising the following steps:
step 2.1: constructing a Chinese medical entity composition mode;
the medical entity composition patterns have the form "Y_1 + Y_2 + Y_3 + ... + Y_k", where Y_1, Y_2, Y_3, ..., Y_k denote word categories and "+" denotes the string concatenation operation. The word categories comprise negation words, clinical manifestations, anatomical parts, modifiers, disease names, physical examinations, numerical values, quantifiers and medicines;
for example, negation words include none, absent, etc. The clinical manifestations include chill, sweating, increased heart rate, etc. Anatomical sites include the back, meniscus, left colon artery, etc. Modifiers include mild, more severe, etc. The disease names include rheumatic heart disease, multiple cancers, etc. Physical examination includes cardiopulmonary examination, electrocardiography, and the like. Quantifier includes degree, group, only, etc. The medicines comprise cedilanid, cefuroxime axetil, aspirin and the like;
for example, the medical entity composition pattern "negation word + clinical manifestation" is constructed; the terms "no nausea" and "no fever" satisfy it, since "no nausea" consists of the negation word "no" and the clinical manifestation "nausea", and "no fever" consists of the negation word "no" and the clinical manifestation "fever";
step 2.2: generating a mode weight of characters in the Chinese sentence;
for the Chinese sentence CS, CS = (cc_1, cc_2, ..., cc_n), where cc_1, cc_2, ..., cc_n are characters, it is judged whether the Chinese sentence CS matches the medical entity composition patterns; the constructed pattern matching weight vector is (w_1, w_2, ..., w_n);

Case 1: if the character string cc_i, cc_i+1, ..., cc_j satisfies the pattern "anatomical part", "disease" or "medicine", each character cc_i, cc_i+1, ..., cc_j is given pattern weight 2;
Case 2: if the character string cc_i, cc_i+1, ..., cc_j satisfies another pattern, each character cc_i, cc_i+1, ..., cc_j is given pattern weight 1.5;
Case 3: if the character string cc_i, cc_i+1, ..., cc_j satisfies no pattern, each character cc_i, cc_i+1, ..., cc_j is given pattern weight 1;

for example, for the input text "the patient found yellow staining of the skin and sclera 4 months ago", the pattern weight vector generated from the medical entity composition patterns is: (1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 1, 2, 2, 1, 1, 1, 1, 1);
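The three weighting cases of step 2.2 can be sketched as below. The tiny pattern lexicons, the function name and the English stand-in strings are illustrative assumptions; the patented method matches full composition patterns over word categories.

```python
# Minimal sketch of step 2.2's per-character pattern weights (illustrative).
HIGH = {"anatomical part": ["skin", "sclera"], "disease": [], "medicine": []}
OTHER = {"negation+clinical": ["no nausea", "no fever"]}

def pattern_weights(sentence: str):
    w = [1.0] * len(sentence)                      # Case 3 default weight 1
    for cat, terms in list(HIGH.items()) + list(OTHER.items()):
        weight = 2.0 if cat in HIGH else 1.5       # Case 1 vs Case 2
        for t in terms:
            start = sentence.find(t)
            while start != -1:
                for k in range(start, start + len(t)):
                    w[k] = max(w[k], weight)       # keep the strongest match
                start = sentence.find(t, start + 1)
    return w

w = pattern_weights("yellow skin")
# positions covered by "skin" get weight 2.0, all others keep 1.0
```

The resulting vector plays the role of (w_1, w_2, ..., w_n) and is later multiplied into the node attention weights in step 3.2.2.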
Step 3: node embedding representation learning is performed with the graph attention network and the pattern-enhanced attention mechanism, comprising the following steps:
Step 3.1: the embedding dimension of the Chinese character nodes or English word nodes is transformed with a fully connected layer;
the multi-granularity text embedding of each character in the Chinese sentence CS is input into a fully connected layer, which converts the embedding dimension from 1568 to 768. The reason for the conversion is that the multi-granularity text embedding dimension must be consistent with the node vector input dimension of the graph attention network used in step 3.2, namely 768. Similarly, the multi-granularity text embedding of each word in the English sentence ES is input into a fully connected layer and converted from 1568 to 768 dimensions;
in the fully connected layer, the dimension is first converted by a linear layer; then overfitting is prevented with the dropout method; finally, the ReLU activation function is applied to mitigate vanishing gradients;
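The projection in step 3.1 can be sketched in numpy as below (illustrative, not the patented PyTorch code; the weight initialization and dropout rate are assumptions):

```python
# Minimal numpy sketch of step 3.1: linear layer 1568 -> 768, dropout, ReLU.
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(scale=0.02, size=(1568, 768))    # linear layer weights
b = np.zeros(768)

def fc_project(x, drop_p=0.1, train=False):
    """x: (n, 1568) multi-granularity embeddings -> (n, 768) node vectors."""
    h = x @ W + b                               # linear dimension change
    if train:                                   # inverted dropout at training
        mask = rng.random(h.shape) >= drop_p
        h = h * mask / (1.0 - drop_p)
    return np.maximum(h, 0.0)                   # ReLU activation

out = fc_project(np.ones((4, 1568)))
```

The 768-dimensional outputs become the node embeddings consumed by the graph attention network in step 3.2.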
step 3.2: for Chinese medical texts, the attention weights of the Chinese character nodes in the graph attention network are multiplied by the pattern weights of the characters in the Chinese sentence; for English medical texts, the pattern weights of the English word nodes in the graph attention network are set to 1;
for Chinese medical texts, the node embedding of the graph attention network is the character embedding, namely the 768-dimensional character embedding generated in step 3.1. For English medical texts, the node embedding of the graph attention network is the word embedding, namely the 768-dimensional word embedding generated in step 3.1;
step 3.2.1: calculating attention weights of nodes in the graph attention network;
first, the multi-granularity text embedding h of a sentence is input into the attention layer of the graph attention network, where h is the embedding of the Chinese character nodes or English word nodes of one sentence of medical text, M is the number of nodes, 512, and H is the dimension of the character embedding, with value 768, as shown in (5);

h = (h_1, h_2, ..., h_M), h ∈ R^(M×H)   (5)
the input node embeddings are linearly transformed, converting the 768-dimensional node embeddings into 16-dimensional ones, where 16 is the number of all class labels. The attention weight is then calculated with the LeakyReLU function, i.e. the importance degree e_uv of node v to node u is computed, as shown in (6);

e_uv = LeakyReLU(W_1 [h_u, h_v])   (6)

where W_1 denotes a shared weight matrix, h_u denotes the embedding of the Chinese character node or English word node u of one sentence of medical text, and h_v denotes the embedding of the Chinese character node or English word node v of one sentence of medical text;
then e_uv is normalized with the Softmax function to obtain α_uv, as shown in (7);

α_uv = exp(e_uv) / Σ_{k∈N_u} exp(e_uk)   (7)

where e_uk denotes the importance degree of node k to node u, α_uv denotes the normalized value of e_uv, and N_u denotes the neighbor nodes of node u;
finally, the attention weight α_u of node u is generated, as shown in (8);

α_u = Σ_{v∈N_u} α_uv W_2 h_v   (8)

where W_2 denotes a weight matrix;
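A simplified numpy sketch of the attention computation described by equations (6) to (8) follows. It is illustrative only: the exact shapes of W_1 and W_2, the LeakyReLU slope and the scoring form are assumptions consistent with the textual description, not the patent's exact formulas.

```python
# Minimal sketch of graph-attention coefficients over a node's neighbors.
import numpy as np

def leaky_relu(x, slope=0.01):
    return np.where(x > 0, x, slope * x)

def attention_weight(u, neighbors, h, W1, W2):
    """Scalar attention weight alpha_u of node u over its neighborhood."""
    # e_uv: importance of each neighbor v to node u, cf. eq. (6)
    e = np.array([leaky_relu(W1 @ np.concatenate([h[u], h[v]]))
                  for v in neighbors])
    # alpha_uv: softmax normalization over the neighborhood, cf. eq. (7)
    a = np.exp(e - e.max())
    a = a / a.sum()
    # alpha_u: attention-weighted aggregation through W2, cf. eq. (8)
    return float(sum(a_v * (W2 @ h[v]) for a_v, v in zip(a, neighbors)))

rng = np.random.default_rng(1)
h = rng.normal(size=(4, 16))     # 4 nodes, 16-d transformed node embeddings
W1 = rng.normal(size=32)         # scores the concatenated pair [h_u, h_v]
W2 = rng.normal(size=16)         # aggregation weights
alpha0 = attention_weight(0, [1, 2, 3], h, W1, W2)
```

In the patented method the node count is 512 and the transformed dimension is 16 (the number of class labels).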
step 3.2.2: updating the attention weights of the nodes in the graph attention network;
first, for node u, the attention weight α_u of node u is updated with the pattern weight w_u of the Chinese character or English word represented by node u, as shown in (9);

α_u = α_u × w_u   (9)

second, the attention weight attention_l (1 ≤ l ≤ k) of the sentence is constructed, as shown in (10);

attention_l = (α_1, α_2, ..., α_M)   (10)
a multi-head attention mechanism is then introduced into the graph attention network. Specifically, k attention weights are calculated; each attention weight is multiplied by the input h to generate the sentence feature h'_l, as shown in (11);

h'_l = attention_l × h   (11)
through the activation function elu, the single-head outputs elu(h'_1), elu(h'_2), ..., elu(h'_k) are obtained; finally, the outputs of the k heads are concatenated to generate h', as shown in (12);

h' = concat(elu(h'_1), elu(h'_2), ..., elu(h'_k))   (12)
finally, the final output h_final is generated through the log_softmax function, as shown in (13);

h_final = log_softmax(h')   (13)
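Equations (9) to (13) can be traced with a small numpy sketch. The toy sizes and the two hand-written attention vectors are illustrative assumptions; in the patent, M is 512 and the attention vectors come from the graph attention computation above.

```python
# Minimal sketch of eqs. (9)-(13): pattern-weighted attention vectors
# combined over k heads, then log-softmax.
import numpy as np

def elu(x):
    return np.where(x > 0, x, np.exp(x) - 1.0)

def log_softmax(x):
    z = x - x.max()
    return z - np.log(np.exp(z).sum())

M, H = 3, 4                              # toy node count and dimension
h = np.arange(M * H, dtype=float).reshape(M, H)
pattern_w = np.array([1.0, 2.0, 1.5])    # w_u from step 2
heads = [np.array([0.2, 0.5, 0.3]),      # k = 2 attention weight vectors
         np.array([0.4, 0.4, 0.2])]

outs = []
for alpha in heads:
    alpha = alpha * pattern_w            # eq. (9): update with pattern weights
    h_l = alpha @ h                      # eq. (11): attention_l x h
    outs.append(elu(h_l))                # single-head output
h_prime = np.concatenate(outs)           # eq. (12): concat the k heads
h_final = log_softmax(h_prime)           # eq. (13)
```

Exponentiating h_final recovers a probability distribution, which is what the conditional random field of step 4 consumes.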
Step 4: the entity category labels of the medical text are generated with a conditional random field and the medical entity recognition result is output, specifically: generating the entity category labels of the Chinese characters or English words;
the conditional probability distribution of each character is computed with the conditional random field, i.e. the probability that each character belongs to each entity category label; the label with the highest probability is assigned to the corresponding character as its entity category label, and the medical entity recognition result is output;
and (3) carrying out sequence labeling on sentences in the medical text by adopting a conditional random field, generating entity category labels of Chinese characters or English words, and outputting a medical text entity recognition result.
For example, for a data set, its entity class labels include: "PAD", "CLS", "SEP", "O", "B-disease and diagnosis", "I-disease and diagnosis", "B-surgery", "I-surgery", "B-anatomical site", "I-anatomical site", "B-drug", "I-drug", "B-imaging examination", "I-imaging examination", "B-laboratory examination", "I-laboratory examination";
for example, in the sentence "the patient has yellow staining of the skin and sclera, with decreased appetite 4 months ago, obvious after dinner, with paroxysmal abdominal pain, nausea, no diarrhea, vomiting, chest tightness, shortness of breath, dizziness, headache, no rotation of objects, no fever, cough, chest pain, asthma, and light-colored stool.", the predicted label index list is [3,3,3,3,3,3,3,3,3,3,8,9,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,8,3,3,3,3,3,3,8,3,3,3,3,3,3,3,8,3,3,3,3,3,8,3,3,3,8,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,8,3,3,3,3,3,3,3,3,3,3,3,3,3,3]. Each number in the list is the index of the predicted entity category label of the character at that position. The indices are converted into the corresponding entity labels by the idx2tag function; in the final entity recognition result most characters are labeled "O", while the characters of anatomical parts such as "skin" and "sclera" are labeled "B-anatomical site" and "I-anatomical site".
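The idx2tag conversion can be sketched as below, assuming the 16 entity category labels listed above in their given order (so that index 3 is "O", 8 is "B-anatomical site" and 9 is "I-anatomical site"); the function and variable names are illustrative.

```python
# Minimal sketch of the idx2tag decoding step (illustrative).
TAGS = ["PAD", "CLS", "SEP", "O",
        "B-disease and diagnosis", "I-disease and diagnosis",
        "B-surgery", "I-surgery",
        "B-anatomical site", "I-anatomical site",
        "B-drug", "I-drug",
        "B-imaging examination", "I-imaging examination",
        "B-laboratory examination", "I-laboratory examination"]

def idx2tag(indices):
    """Map predicted label indices to entity category label strings."""
    return [TAGS[i] for i in indices]

pred = [3, 3, 8, 9, 3]                 # toy CRF output for five characters
labels = idx2tag(pred)
# labels == ["O", "O", "B-anatomical site", "I-anatomical site", "O"]
```

The B-/I- prefixes follow the usual BIO scheme: "B-" marks the first character of an entity and "I-" the characters inside it.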
To illustrate the medical entity recognition effect of the invention, a comparative experiment was performed with two methods on the same training and test sets under the same conditions. The first method is a medical entity recognition method based on a bidirectional long short-term memory network, an attention mechanism and a conditional random field, with a medical dictionary and part-of-speech features introduced. The second method is the medical entity recognition method of the present invention.
The evaluation indexes adopted are accuracy, recall and F1 value. The medical entity recognition results are: the prior-art method based on the bidirectional long short-term memory network, attention mechanism and conditional random field achieves an accuracy of 76.42%, a recall of 73.80% and an F1 value of 75.08%; the method of the invention achieves an accuracy of 86.38%, a recall of 85.82% and an F1 value of 86.10%. The experiments show the effectiveness of the proposed medical entity recognition method based on multi-granularity text embedding.
While the foregoing is directed to preferred embodiments of the present invention, the invention is not limited to the embodiments and drawings disclosed herein. All equivalents and modifications that do not depart from the spirit of the invention are intended to fall within its scope.

Claims (9)

1. A medical entity recognition method based on multi-granularity text embedding, characterized by comprising the following steps:
step 1: the method for constructing the multi-granularity text embedding through the pre-training language model comprises the following steps:
step 1.1: constructing multi-granularity text embedding for an unstructured Chinese medical text;
step 1.1.1: learning of token embedding, segment embedding and mask embedding is performed with the pre-trained language model MC-BERT to generate the character embedding of the unstructured Chinese medical text;
for unstructured Chinese medical text, the input of the pre-trained language model MC-BERT consists of three kinds of embedding, namely token embedding, segment embedding and mask embedding;
step 1.1.2: generating word embedding, part-of-speech embedding and phrase embedding of the Chinese medical text;
firstly, for a Chinese medical text, the jieba word segmentation tool is used to obtain its words, the Stanford POS tagger is used to obtain the part-of-speech tags of the words, and the Stanford parser is used to obtain the phrase tags of the Chinese medical text;
then, word embedding, part of speech embedding and phrase embedding are generated by using a word2vec tool;
step 1.1.3: for an unstructured Chinese medical text, generating substring embedding by using a word2vec tool, specifically:
firstly, collecting a medical term dictionary, and constructing a medical term substring set, specifically comprising the following steps: for any two terms, extracting the longest common substring of the two terms, and adding the common substring to the medical term substring set; if two terms have a plurality of longest common substrings with the same length, the first longest common substring is taken and added into the medical term substring set;
secondly, generating embedded expressions of all substrings in the substring set of the medical terms by using a word2vec tool;
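The medical-term substring set of step 1.1.3 (pairwise longest common substrings, with the first one kept on ties) can be sketched in Python as follows; the function names and the toy English terms are illustrative assumptions, since the claim prescribes only the procedure:

```python
# Sketch of the medical-term substring set construction (step 1.1.3).
# Helper names and the toy terms are illustrative assumptions.

def longest_common_substring(a: str, b: str) -> str:
    """Return the first longest common substring of a and b;
    on ties, the earliest occurrence in a is kept, as the claim specifies."""
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)          # prev[j]: common-suffix length at (i-1, j)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1
                if cur[j] > best_len:  # strictly greater keeps the FIRST longest
                    best_len, best_end = cur[j], i
        prev = cur
    return a[best_end - best_len:best_end]

def build_substring_set(terms):
    """Collect the longest common substring of every pair of dictionary terms."""
    subs = set()
    for i in range(len(terms)):
        for j in range(i + 1, len(terms)):
            s = longest_common_substring(terms[i], terms[j])
            if s:
                subs.add(s)
    return subs
```

For example, `build_substring_set(["gastritis", "gastric ulcer"])` yields `{"gastri"}`; the claim's actual dictionary would contain Chinese medical terms.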
step 1.1.4: character embedding, word embedding, part of speech embedding, phrase embedding and substring embedding of the Chinese medical text are spliced to construct multi-granularity text embedding;
step 1.2: for unstructured English medical text, constructing multi-granularity text embedding, comprising the following steps:
step 1.2.1: learning of symbol embedding, segmentation embedding and covering embedding is carried out by utilizing a pre-training language model BioBert, and word embedding of an unstructured English medical text is generated;
BioBert is a pre-training model generated according to English medical data training;
step 1.2.2: generating character embedding, part of speech embedding and phrase embedding of English medical texts;
for English medical texts, character embedding, part-of-speech embedding and phrase embedding are generated by using a word2vec tool;
step 1.2.3: for unstructured English medical texts, generating substring embedding by using a word2vec tool;
firstly, collecting an English medical term dictionary and constructing a medical term substring set; for any two terms, extracting the longest common substring of the two terms, and adding the common substring to the medical term substring set; if the two terms have a plurality of longest common substrings with the same length, taking the first longest common substring and adding the first longest common substring to the medical term substring set;
secondly, generating embedded representations of all the substrings by using a word2vec tool;
then, for the English sentence ES = (ew_1, ew_2, ..., ew_n), where ew_1, ew_2, ..., ew_n are words, for each word ew_i (i = 1, 2, ..., n), judge whether the word ew_i contains a substring in the medical term substring set; suppose the word ew_i contains substrings esubs_1, esubs_2, ..., esubs_q of the medical term substring set, and the embeddings of the substrings esubs_1, esubs_2, ..., esubs_q are denoted e_es1, e_es2, ..., e_esq; then the substring embedding representation e_essi of the word ew_i is: the sum of e_es1, e_es2, ..., e_esq divided by the number q of substrings; if the word ew_i contains no substring of the medical term substring set, a custom value is output;
finally, for each word ew_i (i = 1, 2, ..., n) in the sentence ES, its corresponding substring embedding is generated as per the above steps; the substring embedding of the sentence ES is E_ess = (e_ess1, e_ess2, ..., e_essn);
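A minimal sketch of the word-level substring embedding just described: the embeddings of all medical-term substrings a word contains are summed and divided by their number q, with a custom value emitted when nothing matches. The toy 4-dimensional vectors and the zero-vector default are assumptions:

```python
# Word-level substring embedding (steps 1.1.3 / 1.2.3): average the vectors
# of the medical-term substrings the word contains. DIM and the zero-vector
# "custom value" are illustrative assumptions.

DIM = 4
DEFAULT = [0.0] * DIM  # assumed custom value for words with no matching substring

def substring_embedding(word, substring_vectors):
    """substring_vectors: dict mapping each medical-term substring to its vector."""
    hits = [vec for sub, vec in substring_vectors.items() if sub in word]
    if not hits:
        return DEFAULT
    # element-wise sum of the matched vectors divided by their number q
    return [sum(col) / len(hits) for col in zip(*hits)]

def sentence_substring_embeddings(words, substring_vectors):
    """E_ess = (e_ess1, ..., e_essn) for a tokenized sentence."""
    return [substring_embedding(w, substring_vectors) for w in words]
```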
Step 1.2.4: splicing character embedding, word embedding, part of speech embedding, phrase embedding and substring embedding of the English medical text to construct multi-granularity text embedding;
step 2: generating the mode weight of all characters in the Chinese sentence according to the medical entity composition mode, comprising the following steps:
step 2.1: constructing a Chinese medical entity composition mode;
the medical entity composition mode has the form: "Y_1 + Y_2 + Y_3 + ... + Y_k";
wherein Y_1, Y_2, Y_3, ..., Y_k represent word categories, and "+" represents the concatenation operation of character strings;
the category of the words comprises negative words, clinical manifestations, anatomical parts, modifiers, disease names, physical examination, numerical values, quantifiers and medicines;
step 2.2: generating a mode weight of characters in the Chinese sentence;
step 3: using the graph attention network and the mode enhanced attention mechanism to carry out node embedding representation learning, comprising the following steps:
step 3.1, the embedding dimensions of Chinese character symbol nodes or English word nodes are transformed by utilizing a full connection layer;
inputting the multi-granularity text embedding of each character in the Chinese sentence CS to a full connection layer, and converting the embedding dimension of the multi-granularity text embedding of the Chinese sentence CS; the reason for converting dimensions is: the multi-granularity text embedding dimension needs to be consistent with the node vector input dimension of the graph attention network used in the step 3.2;
similarly, for multi-granularity text embedding of each word in the English sentence ES, inputting the multi-granularity text embedding into the full connection layer, and converting the embedding dimension of the English multi-granularity text embedding;
in the fully-connected layer, firstly, the dimension is converted through a linear layer; then, the dropout method is used to prevent overfitting; finally, the activation function Relu is applied to mitigate vanishing gradients;
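The dimension-converting fully connected layer of step 3.1 (linear layer, dropout, then Relu) can be sketched with NumPy; the inverted-dropout scaling and all shapes are assumptions, not details fixed by the claim:

```python
import numpy as np

rng = np.random.default_rng(0)

def fc_transform(x, W, b, dropout_p=0.1, train=True):
    """x: (n, d_in) multi-granularity embeddings; W: (d_in, d_out); b: (d_out,).
    Returns (n, d_out) node vectors matching the GAT input dimension."""
    h = x @ W + b                        # linear layer: convert dimensions
    if train and dropout_p > 0.0:
        # inverted dropout (an assumed variant) to prevent overfitting
        mask = rng.random(h.shape) >= dropout_p
        h = h * mask / (1.0 - dropout_p)
    return np.maximum(h, 0.0)            # Relu activation
```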
step 3.2: for the Chinese medical text, the attention weight of each Chinese character node of the graph attention network is multiplied by the mode weight of the corresponding character in the Chinese sentence; for the English medical text, the mode weight of each English word node in the graph attention network is set to 1;
for the Chinese medical text, the node embedding of the graph attention network is the character embedding generated in step 3.1; for the English medical text, the node embedding of the graph attention network is the word embedding generated in step 3.1;
step 3.2.1: calculating attention weights of nodes in the graph attention network;
step 3.2.2: updating attention weights of nodes in the graph attention network;
step 4: generating entity category labels of the medical text by adopting a conditional random field and outputting the medical entity recognition result, specifically: generating entity category labels of Chinese characters or English words;
based on the conditional random field, the conditional probability of each character belonging to each entity category label is calculated, the label with the highest probability is assigned to the corresponding character as its entity category label, and the medical entity recognition result is output;
the conditional random field performs sequence labeling on the sentences of the medical text, generating entity category labels of Chinese characters or English words and outputting the medical text entity recognition result;
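The claim assigns each character the highest-probability entity label under a conditional random field; in a linear-chain CRF this joint assignment is commonly computed with Viterbi decoding. The sketch below assumes learned emission and transition score tables, which the claim does not specify:

```python
# Hedged sketch of CRF label decoding (step 4). Emission and transition
# scores are assumed to come from a trained model; the tables here are toys.

def viterbi(emissions, transitions):
    """emissions: per-position dict {label: score};
    transitions: dict {(prev_label, cur_label): score}.
    Returns the highest-scoring label sequence."""
    labels = list(emissions[0])
    best = dict(emissions[0])            # best path score ending in each label
    back = []                            # back-pointers for path recovery
    for em in emissions[1:]:
        new_best, ptr = {}, {}
        for cur in labels:
            prev = max(labels, key=lambda p: best[p] + transitions[(p, cur)])
            new_best[cur] = best[prev] + transitions[(prev, cur)] + em[cur]
            ptr[cur] = prev
        back.append(ptr)
        best = new_best
    last = max(best, key=best.get)       # best final label
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]
```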
2. The method for recognizing medical entities based on multi-granularity text embedding according to claim 1, wherein: the MC-Bert in step 1.1.1 is a pre-trained model generated by training on Chinese medical data; the symbol embedding in step 1.1.1 refers to the vector representation of each word; the segmentation embedding is used for distinguishing two natural language sentences, and since the medical entity recognition task recognizes entities in units of sentences, every word has the same segmentation embedding; in the covering embedding, if the current position is a character of the input sentence, the value is 1; if the current position is padding, i.e. not a character of the input sentence, the value is 0;
3. The method for recognizing medical entities based on multi-granularity text embedding according to claim 2, wherein: in step 1.1.1, for the Chinese sentence CS = (cc_1, cc_2, ..., cc_n), where cc_1, cc_2, ..., cc_n are characters, the pre-trained language model MC-Bert is adopted to generate the character embedding representation E_cc of the sentence CS, as shown in (1):
E_cc = (e_cc1, e_cc2, ..., e_ccn), E_cc ∈ R^(n×m)   (1)
wherein n is the sentence length, i.e. the number of characters in the sentence; if the number of characters is less than n, 0 is used for padding; m is the dimension set by the pre-trained model MC-Bert; e_cci (i = 1, 2, ..., n) is the embedding of character cc_i; the character embedding representation E_cc has n rows and m columns; R^(n×m) denotes a real matrix of n rows and m columns;
4. The method for recognizing medical entities based on multi-granularity text embedding according to claim 3, wherein in step 1.1.2, for the Chinese sentence CS = (cc_1, cc_2, ..., cc_n) with characters cc_1, cc_2, ..., cc_n, word embedding is obtained as follows: the words of the Chinese medical text are obtained with the jieba word segmentation tool, and the word embedding of the word each character belongs to is obtained using word2vec; part-of-speech embedding, specifically: based on the part-of-speech tags of the words of the Chinese medical text, the part-of-speech embedding of the word each character belongs to is obtained using word2vec; phrase embedding, specifically: based on the phrase tags of the Chinese medical text, the embedding of the phrase type each character belongs to is obtained using word2vec;
wherein the word embedding is E_cw = (e_cw1, e_cw2, ..., e_cwn), where the words that characters cc_1, cc_2, ..., cc_n belong to are cw_1, cw_2, ..., cw_n in sequence, and e_cwi (i = 1, 2, ..., n) is the embedding of word cw_i;
the part-of-speech embedding is E_cpos = (e_cpo1, e_cpo2, ..., e_cpon), where the parts of speech of words cw_1, cw_2, ..., cw_n are cpo_1, cpo_2, ..., cpo_n, and e_cpoi (i = 1, 2, ..., n) is the embedding of part of speech cpo_i;
the phrase embedding is E_cph = (e_cph1, e_cph2, ..., e_cphn), where the phrase types that characters cc_1, cc_2, ..., cc_n belong to are cph_1, cph_2, ..., cph_n in sequence, and e_cphi (i = 1, 2, ..., n) is the embedding of phrase type cph_i;
5. The method for recognizing medical entities based on multi-granularity text embedding according to claim 4, wherein: after the word2vec tool is used in step 1.1.3 to generate the embedding representations of all substrings, for each word cw_i (i = 1, 2, ..., n), judge whether the word cw_i contains a substring in the medical term substring set; for the Chinese sentence CS = (cc_1, cc_2, ..., cc_n) with characters cc_1, cc_2, ..., cc_n, the words the characters belong to are cw_1, cw_2, ..., cw_n in sequence; suppose the word cw_i contains substrings csubs_1, csubs_2, ..., csubs_p of the medical term substring set, and the embeddings of the substrings csubs_1, csubs_2, ..., csubs_p are denoted e_cs1, e_cs2, ..., e_csp respectively; then the substring embedding representation e_cssi of the word cw_i is: the sum of e_cs1, e_cs2, ..., e_csp divided by the number p of substrings; if the word cw_i contains no substring of the medical term substring set, a custom value is output;
finally, for each character cc_i (i = 1, 2, ..., n) in the sentence CS, its corresponding substring embedding is generated as per the above steps; the substring embedding of the sentence CS is E_css = (e_css1, e_css2, ..., e_cssn);
6. The method for recognizing medical entities based on multi-granularity text embedding according to claim 5, wherein step 1.1.4 specifically comprises:
for the Chinese sentence CS = (cc_1, cc_2, ..., cc_n) with characters cc_1, cc_2, ..., cc_n, constructing the multi-granularity text embedding of character cc_i (i = 1, 2, ..., n), i.e. splicing the character embedding E_cc, word embedding E_cw, part-of-speech embedding E_cpos, phrase embedding E_cph and substring embedding E_css of the sentence CS to construct the multi-granularity text embedding of the sentence CS, as shown in (2):
E_cme = Concate(E_cc, E_cw, E_cpos, E_cph, E_css)   (2)
wherein Concate denotes the splicing operation; the dimension of E_cme equals the sum of the dimensions of E_cc, E_cw, E_cpos, E_cph and E_css;
thus, from step 1.1.1 to step 1.1.4, the multi-granularity text embedding E_cme of the Chinese character cc_i (i = 1, 2, ..., n) is constructed;
7. The method of claim 6, wherein in step 1.2.1, for the English sentence ES = (ew_1, ew_2, ..., ew_n) with words ew_1, ew_2, ..., ew_n, the word embedding representation of the sentence ES is generated using the pre-trained language model BioBert, as shown in (3):
E_ew = (e_ew1, e_ew2, ..., e_ewn), E_ew ∈ R^(n×m)   (3)
wherein n is the sentence length, i.e. the number of words in the sentence; if the number of words is less than n, 0 is used for padding; m is the dimension set by the pre-trained model BioBert; e_ewi (i = 1, 2, ..., n) is the embedding of word ew_i;
in step 1.2.2, for the English sentence ES = (ew_1, ew_2, ..., ew_n) with words ew_1, ew_2, ..., ew_n, the character embedding of ew_i (i = 1, 2, ..., n) is formed by averaging the embeddings of all characters of ew_i; the part-of-speech embedding is the embedding of the part of speech of ew_i; the phrase embedding is the embedding of the phrase type ew_i belongs to;
wherein the character embedding is E_ec = (e_ec1, e_ec2, ..., e_ecn), where e_eci (i = 1, 2, ..., n) is the character embedding of word ew_i;
the part-of-speech embedding is E_epos = (e_epo1, e_epo2, ..., e_epon), where the parts of speech of words ew_1, ew_2, ..., ew_n are epo_1, epo_2, ..., epo_n in sequence, and e_epoi (i = 1, 2, ..., n) is the embedding of part of speech epo_i;
the phrase embedding is E_eph = (e_eph1, e_eph2, ..., e_ephn), where the phrase types that words ew_1, ew_2, ..., ew_n belong to are eph_1, eph_2, ..., eph_n in sequence, and e_ephi (i = 1, 2, ..., n) is the embedding of phrase type eph_i;
8. The method of claim 7, wherein step 1.2.4 specifically comprises:
for the English sentence ES = (ew_1, ew_2, ..., ew_n) with words ew_1, ew_2, ..., ew_n, constructing the multi-granularity text embedding of ew_i (i = 1, 2, ..., n), i.e. splicing the character embedding E_ec, word embedding E_ew, part-of-speech embedding E_epos, phrase embedding E_eph and substring embedding E_ess of the sentence ES to construct the multi-granularity text embedding of the sentence ES, as shown in (4):
E_eme = Concate(E_ec, E_ew, E_epos, E_eph, E_ess)   (4)
wherein Concate denotes the splicing operation; the dimension of E_eme equals the sum of the dimensions of E_ec, E_ew, E_epos, E_eph and E_ess;
so far, from step 1.2.1 to step 1.2.4, the multi-granularity text embedding E_eme of the English word ew_i (i = 1, 2, ..., n) is constructed;
9. The method for recognizing medical entities based on multi-granularity text embedding according to claim 8, wherein in step 2.2, for the Chinese sentence CS = (cc_1, cc_2, ..., cc_n) with characters cc_1, cc_2, ..., cc_n, it is judged whether the Chinese sentence CS matches a medical entity composition mode, and the pattern matching weight vector (w_1, w_2, ..., w_n) is constructed as follows:
case 1: if the character string cc_i, cc_i+1, ..., cc_j satisfies the mode "anatomical region", "disease" or "medicine", each character cc_i, cc_i+1, ..., cc_j is given a mode weight of 2;
case 2: if the character string cc_i, cc_i+1, ..., cc_j satisfies another mode, each character cc_i, cc_i+1, ..., cc_j is given a mode weight of 1.5;
case 3: if the character string cc_i, cc_i+1, ..., cc_j satisfies no mode, each character cc_i, cc_i+1, ..., cc_j is given a mode weight of 1;
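The three cases above can be sketched as a weight-vector constructor; the span representation and the choice to keep the strongest weight when matches overlap are assumptions, since the claim does not state how overlapping mode matches are resolved:

```python
# Mode-weight assignment of step 2.2 (cases 1-3). Match spans are assumed
# to come from a separate pattern matcher; overlap handling via max() is
# an assumption.

HIGH = {"anatomical region", "disease", "medicine"}

def mode_weights(n, matches):
    """n: sentence length; matches: list of (start, end, mode_name),
    end exclusive. Returns the weight vector (w_1, ..., w_n)."""
    w = [1.0] * n                            # case 3: character in no mode
    for start, end, name in matches:
        value = 2.0 if name in HIGH else 1.5 # case 1 vs case 2
        for k in range(start, end):
            w[k] = max(w[k], value)
    return w
```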
step 3.2.1, specifically:
first, the multi-granularity text embedding h of a sentence is input into the attention layer of the graph attention network, as shown in (5):
h = (h_1, h_2, ..., h_M), h ∈ R^(M×H)   (5)
wherein h denotes the embeddings of the Chinese character nodes or English word nodes of one sentence of the medical text, M is the number of nodes, and H is the dimension of the node embedding;
the input node embedding is linearly transformed and converted to the dimension of the number of all category labels; then the attention weight is calculated with the LeakyRelu function, i.e. the importance degree e_uv of node v to node u is calculated, as shown in (6):
e_uv = LeakyRelu(a^T [W_1 h_u ∥ W_1 h_v])   (6)
wherein W_1 represents a shared weight matrix, a represents a learnable attention vector, ∥ represents splicing, h_u represents the embedding of the Chinese character node or English word node u of one sentence of the medical text, and h_v represents the embedding of the Chinese character node or English word node v of one sentence of the medical text;
then, e_uv is normalized with the Softmax function to obtain α_uv, as shown in (7):
α_uv = exp(e_uv) / Σ_{k∈N_u} exp(e_uk)   (7)
wherein e_uk represents the importance degree of node k to node u, α_uv denotes the normalized value of e_uv, and N_u denotes the neighbor nodes of node u;
finally, the attention weight α_u of node u is generated, as shown in (8):
α_u = W_2 (α_u1, α_u2, ..., α_uM)   (8)
wherein W_2 represents a weight matrix;
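Since the claim's equations reference figures not reproduced in this text, the score-and-normalize computation of step 3.2.1 is sketched below in the standard graph-attention form: a shared linear transform, a LeakyRelu-scored pairwise importance e_uv, and a Softmax over each node's neighbours. The attention vector a and the explicit adjacency matrix are assumptions:

```python
import numpy as np

def leaky_relu(x, slope=0.01):
    return np.where(x > 0, x, slope * x)

def attention_matrix(h, W1, a, adj):
    """h: (M, H) node embeddings; W1: (H, F) shared weight matrix;
    a: (2F,) attention vector (assumed); adj: (M, M) boolean adjacency.
    Returns the row-normalized attention weights alpha (M, M)."""
    z = h @ W1                                   # shared linear transform
    M = z.shape[0]
    e = np.full((M, M), -np.inf)                 # e_uv: importance of v to u
    for u in range(M):
        for v in range(M):
            if adj[u, v]:
                e[u, v] = leaky_relu(a @ np.concatenate([z[u], z[v]]))
    # Softmax over each node's neighbourhood
    exp = np.exp(e - e.max(axis=1, keepdims=True))
    exp[~adj] = 0.0
    return exp / exp.sum(axis=1, keepdims=True)
```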
step 3.2.2, specifically:
first, for node u, the attention weight α_u of node u is updated with the mode weight w_u of the Chinese character or English word represented by node u, as shown in (9):
α_u = α_u × w_u   (9)
second, the attention weight attention_l (1 ≤ l ≤ k) of the sentence is constructed, as shown in (10):
attention_l = (α_1, α_2, ..., α_M)   (10)
then, a multi-head attention mechanism is introduced into the graph attention network, specifically: k attention weights are calculated, each attention weight is multiplied by the input h, and the feature h'_l of the sentence is generated, as shown in (11):
h'_l = attention_l × h   (11)
the single-head outputs elu(h'_1), elu(h'_2), ..., elu(h'_k) are generated through the activation function elu;
third, the outputs of the k heads are spliced to generate h', as shown in (12):
h' = Concat(elu(h'_1), elu(h'_2), ..., elu(h'_k))   (12)
finally, the final output h_final is generated through the log_softmax function, as shown in (13):
h_final = log_softmax(h')   (13)
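The multi-head update of step 3.2.2 (k heads, elu activation, splicing, log_softmax) can be sketched as follows; the head weights are assumed to be the mode-weight-updated attention matrices, and all shapes are illustrative:

```python
import numpy as np

def elu(x, alpha=1.0):
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

def log_softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def multi_head_output(attentions, h):
    """attentions: list of k (M, M) attention matrices, already scaled by
    the mode weights; h: (M, H) node embeddings. Returns h_final (M, k*H)."""
    heads = [elu(att @ h) for att in attentions]  # h'_l = attention_l x h, then elu
    h_cat = np.concatenate(heads, axis=-1)        # splice the k head outputs
    return log_softmax(h_cat)                     # final output h_final
```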
CN202110890112.9A 2021-06-09 2021-08-04 Medical entity identification method based on multi-granularity text embedding Active CN113779993B (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110641595 2021-06-09
CN2021106415959 2021-06-09

Publications (2)

Publication Number Publication Date
CN113779993A CN113779993A (en) 2021-12-10
CN113779993B true CN113779993B (en) 2023-02-28



Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111079377A (en) * 2019-12-03 2020-04-28 哈尔滨工程大学 Method for recognizing named entities oriented to Chinese medical texts
CN111444721A (en) * 2020-05-27 2020-07-24 南京大学 Chinese text key information extraction method based on pre-training language model
CN112101031A (en) * 2020-08-25 2020-12-18 厦门渊亭信息科技有限公司 Entity identification method, terminal equipment and storage medium
CN112541356A (en) * 2020-12-21 2021-03-23 山东师范大学 Method and system for recognizing biomedical named entities
CN112597774A (en) * 2020-12-14 2021-04-02 山东师范大学 Chinese medical named entity recognition method, system, storage medium and equipment



Non-Patent Citations (1)

Title
Ma Nianzu et al.; "Entity-Aware Dependency-based Deep Graph Attention Network for Comparative Preference Classification"; NSF Public Access; 2020-07-01; pp. 5782-5788 *



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant