CN113779993A - Medical entity identification method based on multi-granularity text embedding - Google Patents

Medical entity identification method based on multi-granularity text embedding

Info

Publication number
CN113779993A
CN113779993A CN202110890112.9A
Authority
CN
China
Prior art keywords
embedding
medical
word
character
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110890112.9A
Other languages
Chinese (zh)
Other versions
CN113779993B (en)
Inventor
道捷
张春霞
彭成
薛晓军
王瞳
徐天祥
郭贵锁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Institute of Technology BIT
Original Assignee
Beijing Institute of Technology BIT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Institute of Technology BIT
Publication of CN113779993A
Application granted
Publication of CN113779993B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/242 Dictionaries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Data Mining & Analysis (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Evolutionary Computation (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to a medical entity identification method based on multi-granularity text embedding, belonging to the technical field of information extraction and knowledge graph construction. The method comprises the following steps: constructing multi-granularity text embedding: multi-granularity text embedding, comprising character embedding, word embedding, part-of-speech embedding, substring embedding, and phrase embedding, is constructed through pre-trained language models; generating pattern weights: the pattern weights of all characters in a Chinese sentence are generated according to medical term composition patterns; node embedding representation learning: node embedding representation learning is carried out with a graph attention network and a pattern-enhanced attention mechanism; outputting the medical text entity recognition result: entity category labels of the medical text are generated by a conditional random field, and the medical entity recognition result is output. The method solves the problems of insufficient utilization of graph representation information and the single embedding granularity of distributed text representation in medical entity identification, and improves medical entity identification performance.

Description

Medical entity identification method based on multi-granularity text embedding
Technical Field
The invention relates to a medical entity identification method based on multi-granularity text embedding, and belongs to the technical field of information extraction and knowledge graph construction.
Background
Medical entity identification is an important research topic in the fields of information extraction and medical knowledge graph construction. Medical entity recognition refers to recognizing entities or terms of the medical field from unstructured medical texts. Medical entity recognition technology can provide technical and knowledge support for medical-domain question-answering systems, medical auxiliary diagnosis, accurate medical knowledge services, and other fields.
Medical entity identification methods mainly comprise rule-based methods, statistical machine learning-based methods, and deep learning-based methods. The basic idea of the rule-based medical entity identification method is to identify medical entities from unstructured texts according to constructed medical entity composition rules, whose constituent elements comprise keywords, word categories, and the like.
Medical entity recognition methods based on statistical machine learning mainly adopt models such as the maximum entropy model, the hidden Markov model, the conditional random field, and the support vector machine. These methods convert medical entity identification into a classification problem or a sequence labeling problem. For example, a method combining conditional random fields and rules has been used for identifying named entities in Chinese electronic medical records: first, a conditional random field performs recognition based on language symbol features, suffix features, keyword features, dictionary features, and length features; then, the recognition result is optimized with the rules.
Deep learning-based medical entity recognition methods comprise distributed representation (embedded encoding) of the unstructured input text, context semantic encoding, and tag decoding. The embedded encoding of the input text mainly comprises character embedding and word embedding. Context semantic encoding models comprise the convolutional neural network, the bidirectional long short-term memory network, the recurrent neural network, and the like. For example, one approach performs medical entity recognition of Chinese electronic medical records based on a bidirectional long short-term memory network and a conditional random field model: first, a low-dimensional vector representation of each word is generated; then, the bidirectional long short-term memory network with an attention mechanism and the conditional random field model perform the medical entity recognition.
The graph attention network is based on the graph convolutional neural network, into which an attention mechanism is introduced. Graph attention networks have been applied to answer extraction for question-answering systems, information recommendation, relation extraction, and the like.
The existing medical entity identification methods mainly have the following problems. First, existing methods mainly construct character embedding, word embedding, and part-of-speech embedding of texts, and rarely introduce phrase embedding and substring embedding. Second, current methods rarely model medical texts through a graph attention network for entity recognition. Third, current methods rarely fuse pattern- or rule-based methods with deep learning-based methods so as to fully and efficiently integrate the advantages of both: pattern- or rule-based methods achieve high performance, while deep learning-based methods need no time-consuming and labor-intensive feature engineering and realize end-to-end nonlinear learning.
Disclosure of Invention
The invention aims to solve the problems of insufficient utilization of graph representation information and the single embedding granularity of distributed text representation in medical entity recognition, and provides a medical entity recognition method based on multi-granularity text embedding. The method first constructs multi-granularity text embedding, comprising character embedding, word embedding, part-of-speech embedding, substring embedding, and phrase embedding, realizing multi-granularity embedded representation learning of the characters, words, parts of speech, phrases, and substrings of a medical text. It then performs medical text entity recognition with a graph attention network, a pattern-enhanced attention mechanism, and a conditional random field, specifically: first, the graph embedded representation of the medical text is constructed with a graph attention network model; second, a pattern-enhanced attention mechanism is introduced into the graph attention network to strengthen the attention weights of the nodes, thereby improving medical entity identification performance.
In order to achieve the purpose, the invention adopts the following technical scheme:
the medical entity identification method based on multi-granularity text embedding comprises the following steps:
step 1: construct the multi-granularity text embedding through pre-trained language models, comprising the following steps:
step 1.1: for unstructured Chinese medical texts, constructing multi-granularity text embedding;
step 1.1.1: learn token embedding, segment embedding, and mask embedding with the pre-trained language model MC-Bert, and generate the character embedding of the unstructured Chinese medical text;
MC-Bert is a pre-training model generated according to Chinese medical data training;
for an unstructured Chinese medical text, the input of the pre-trained language model MC-Bert consists of three kinds of embedding: token embedding, segment embedding, and mask embedding;
token embedding refers to the vector representation of each character; segment embedding distinguishes two natural language sentences, and since the medical entity recognition task recognizes entities in sentence units, every character has the same segment embedding; in mask embedding, a position holding a character of the input sentence is assigned 1, and a padding position that holds no character of the input sentence is assigned 0;
for a Chinese sentence CS = (cc_1, cc_2, ..., cc_n), where cc_1, cc_2, ..., cc_n are characters, the pre-trained language model MC-Bert generates the character embedding representation E_cc of the sentence CS, as shown in (1):
E_cc = (e_cc1, e_cc2, ..., e_ccn), E_cc ∈ R^(n×m)  (1)
where n is the sentence length, i.e. the number of characters in the sentence (sentences with fewer than n characters are padded with 0); m is the dimension set by the pre-trained model MC-Bert; e_cci (i = 1, 2, ..., n) is the embedding of character cc_i; the character embedding representation E_cc has n rows and m columns, and R^(n×m) denotes the set of real matrices with n rows and m columns;
step 1.1.2: generating word embedding, part of speech embedding and phrase embedding of the Chinese medical text;
firstly, for a Chinese medical text, the words are obtained with the jieba word segmentation tool, the part-of-speech tags of the words are obtained with the Stanford POS tagger, and the phrase labels are obtained with the Stanford parser;
then, word embedding, part-of-speech embedding, and phrase embedding are generated with the word2vec tool;
for the Chinese sentence CS = (cc_1, cc_2, ..., cc_n), where cc_1, cc_2, ..., cc_n are characters: word embedding is obtained by segmenting the Chinese medical text with jieba and generating the embedding of each character's parent word with word2vec; part-of-speech embedding is obtained by generating, with word2vec, the embedding of the part-of-speech tag of each character's parent word; phrase embedding is obtained by generating, with word2vec, the embedding of each character's parent phrase type from the phrase labels of the Chinese medical text;
the word embedding is E_cw = (e_cw1, e_cw2, ..., e_cwn), where the parent words of characters cc_1, cc_2, ..., cc_n are cw_1, cw_2, ..., cw_n in turn, and e_cwi (i = 1, 2, ..., n) is the embedding of word cw_i;
the part-of-speech embedding is E_cpos = (e_cpo1, e_cpo2, ..., e_cpon), where the parts of speech of words cw_1, cw_2, ..., cw_n are cpo_1, cpo_2, ..., cpo_n, and e_cpoi (i = 1, 2, ..., n) is the embedding of part of speech cpo_i;
the phrase embedding is E_cph = (e_cph1, e_cph2, ..., e_cphn), where the parent phrase types of characters cc_1, cc_2, ..., cc_n are cph_1, cph_2, ..., cph_n in turn, and e_cphi (i = 1, 2, ..., n) is the embedding of phrase type cph_i;
step 1.1.3: for an unstructured Chinese medical text, generating substring embedding by using a word2vec tool, specifically:
firstly, collect a medical term dictionary and construct the medical term substring set, specifically: for any two terms, extract their longest common substring and add it to the medical term substring set; if two terms have several longest common substrings of the same length, take the first one and add it to the medical term substring set;
secondly, generate the embedded representations of all substrings in the medical term substring set with the word2vec tool;
then, for each word cw_i (i = 1, 2, ..., n), judge whether cw_i contains substrings of the medical term substring set; for the Chinese sentence CS = (cc_1, cc_2, ..., cc_n), whose characters cc_1, cc_2, ..., cc_n have parent words cw_1, cw_2, ..., cw_n in turn, suppose word cw_i contains the substrings csubs_1, csubs_2, ..., csubs_p of the medical term substring set, whose embeddings are ecs_1, ecs_2, ..., ecs_p respectively; then the substring embedding e_cssi of word cw_i is the sum of ecs_1, ecs_2, ..., ecs_p divided by the number of substrings p; if word cw_i contains no substring of the medical term substring set, output a custom value;
finally, for each character cc_i (i = 1, 2, ..., n) in the sentence CS, generate the corresponding substring embedding as above; the substring embedding of sentence CS is E_css = (e_css1, e_css2, ..., e_cssn);
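The substring-set construction and substring-embedding averaging above can be sketched in plain Python. The toy terms and two-dimensional embedding vectors below are illustrative stand-ins, not the patent's medical dictionary or word2vec vectors:

```python
from itertools import combinations

def longest_common_substring(a, b):
    # Dynamic programming; returns the first longest common substring of
    # terms a and b (ties are broken by taking the first one found).
    best, best_len = "", 0
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
                if dp[i][j] > best_len:  # strict '>' keeps the first tie
                    best_len = dp[i][j]
                    best = a[i - best_len:i]
    return best

def build_substring_set(terms):
    # Medical term substring set: the longest common substring of every
    # pair of terms in the dictionary.
    subs = set()
    for a, b in combinations(terms, 2):
        s = longest_common_substring(a, b)
        if s:
            subs.add(s)
    return subs

def substring_embedding(word, subs, emb, default):
    # Average the embeddings of all set substrings contained in `word`;
    # output a custom default vector when the word contains none.
    hits = [emb[s] for s in subs if s in word and s in emb]
    if not hits:
        return default
    dim = len(hits[0])
    return [sum(v[d] for v in hits) / len(hits) for d in range(dim)]
```

In practice the dictionary would hold thousands of terms and the embeddings would come from word2vec; the averaging step is exactly the sum-divided-by-p rule of step 1.1.3.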
Step 1.1.4: character embedding, word embedding, part of speech embedding, phrase embedding and substring embedding of the Chinese medical text are spliced to construct multi-granularity text embedding, and the method specifically comprises the following steps:
for the Chinese sentence CS, CS ═ cc1,cc2,...,ccn),cc1,cc2,...,ccnFor a character, construct the character cciMulti-granularity text embedding of (i ═ 1, 2.. said., n), i.e. character embedding E of the spliced sentence CSccWord embedding EcwPart of speech embedding EcposPhrase embedding EcphAnd substring embedding EcssConstructing multi-granularity text embedding of the sentence CS, as shown in (2):
Ecme=Concate(Ecc,Ecw,Ecpos,Ecph,Ecss) (2)
wherein, Concate represents splicing operation; in addition, EcmeE isccDimension + E ofcwDimension + E ofcposDimension + E ofcphDimension + E ofcssDimension (d);
thus, from step 1.1.1 to step 1.1.4, the Chinese character cc is constructediA multi-granularity text embedding of (i ═ 1, 2.., n) as Ecme
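The Concate operation of (2) reduces to joining the five per-character vectors end to end; a minimal plain-Python sketch with toy dimensions (not the real MC-Bert/word2vec dimensions):

```python
def concat_multi_granularity(e_cc, e_cw, e_cpos, e_cph, e_css):
    # Concate in (2): per-character concatenation of the five embeddings;
    # the output dimension is the sum of the five input dimensions.
    return [c + w + pos + ph + ss
            for c, w, pos, ph, ss in zip(e_cc, e_cw, e_cpos, e_cph, e_css)]
```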
Step 1.2: for unstructured English medical text, constructing multi-granularity text embedding, comprising the following steps:
step 1.2.1: learn token embedding, segment embedding, and mask embedding with the pre-trained language model BioBert, and generate the word embedding of the unstructured English medical text;
BioBert is a pre-training model generated by training according to English medical data;
for an English sentence ES = (ew_1, ew_2, ..., ew_n), where ew_1, ew_2, ..., ew_n are words, generate the word embedding representation of the sentence ES with the pre-trained language model BioBert, as shown in (3):
E_ew = (e_ew1, e_ew2, ..., e_ewn), E_ew ∈ R^(n×m)  (3)
where n is the sentence length, i.e. the number of words in the sentence (sentences with fewer than n words are padded with 0); m is the dimension set by the pre-trained model BioBert; e_ewi (i = 1, 2, ..., n) is the embedding of word ew_i;
step 1.2.2: generating character embedding, part of speech embedding and phrase embedding of English medical texts;
for English medical texts, character embedding, part-of-speech embedding and phrase embedding are generated by using a word2vec tool;
for the English sentence ES = (ew_1, ew_2, ..., ew_n), where ew_1, ew_2, ..., ew_n are words: the character embedding of word ew_i (i = 1, 2, ..., n) is the average of the embeddings of all characters in ew_i; the part-of-speech embedding is the embedding of ew_i's part of speech; the phrase embedding is the embedding of the parent phrase type;
the character embedding is E_ec = (e_ec1, e_ec2, ..., e_ecn), where e_eci (i = 1, 2, ..., n) is the character embedding of word ew_i;
the part-of-speech embedding is E_epos = (e_epo1, e_epo2, ..., e_epon), where the parts of speech of words ew_1, ew_2, ..., ew_n are epo_1, epo_2, ..., epo_n in turn, and e_epoi (i = 1, 2, ..., n) is the embedding of part of speech epo_i;
the phrase embedding is E_eph = (e_eph1, e_eph2, ..., e_ephn), where the parent phrase types of words ew_1, ew_2, ..., ew_n are eph_1, eph_2, ..., eph_n in turn, and e_ephi (i = 1, 2, ..., n) is the embedding of phrase type eph_i;
step 1.2.3: for unstructured English medical texts, generating substring embedding by using a word2vec tool;
firstly, collect an English medical term dictionary and construct the medical term substring set: for any two terms, extract their longest common substring and add it to the medical term substring set; if two terms have several longest common substrings of the same length, take the first one and add it to the medical term substring set;
secondly, generate the embedded representations of all the substrings with the word2vec tool;
then, for the English sentence ES = (ew_1, ew_2, ..., ew_n), where ew_1, ew_2, ..., ew_n are words, judge for each word ew_i (i = 1, 2, ..., n) whether ew_i contains substrings of the medical term substring set; suppose word ew_i contains the substrings esubs_1, esubs_2, ..., esubs_q of the medical term substring set, whose embeddings are ees_1, ees_2, ..., ees_q respectively; then the substring embedding e_essi of word ew_i is the sum of ees_1, ees_2, ..., ees_q divided by the number of substrings q; if word ew_i contains no substring of the medical term substring set, output a custom value;
finally, for each word ew_i (i = 1, 2, ..., n) in the sentence ES, generate the corresponding substring embedding as above; the substring embedding of sentence ES is E_ess = (e_ess1, e_ess2, ..., e_essn);
Step 1.2.4: the method comprises the following steps of splicing character embedding, word embedding, part of speech embedding, phrase embedding and substring embedding of the English medical text to construct multi-granularity text embedding, and specifically comprises the following steps:
for English sentence ES, ES ═ ew1,ew2,...,ewn),ew1,ew2,...,ewnConstruct ew for wordsiMulti-granularity text embedding of (i ═ 1, 2.. multideck, n), i.e. character embedding E of the concatenated sentence ESecWord embedding EewPart of speech embedding EeposPhrase embedding EephAnd substring embedding EessConstructing multi-granularity text embedding of the sentence ES, as shown in (4);
Eeme=Concate(Eec,Eew,Eepos,Eeph,Eess) (4)
wherein, Concate represents splicing operation; in addition, EemeE isecDimension + E ofewDimension + E-eposDimension + E ofephDimension +2EessDimension (d);
so far, from step 1.2.1 to step 1.2.4, the English word ew is constructediA multi-granularity text embedding of (i ═ 1, 2.., n) as Eeme
Step 2: generate the pattern weights of all characters in the Chinese sentence according to the medical entity composition patterns, comprising the following steps:
step 2.1: construct the Chinese medical entity composition patterns;
a medical entity composition pattern has the form "Y_1 + Y_2 + Y_3 + ... + Y_k",
where Y_1, Y_2, Y_3, ..., Y_k denote word categories, and "+" denotes the concatenation of character strings;
the word categories comprise negative words, clinical manifestations, anatomical parts, modifiers, disease names, physical examinations, numerical values, quantifiers, and drugs;
step 2.2: generate the pattern weights of the characters in the Chinese sentence;
for the Chinese sentence CS = (cc_1, cc_2, ..., cc_n), where cc_1, cc_2, ..., cc_n are characters, judge whether the Chinese sentence CS matches the medical entity composition patterns; the constructed pattern matching weight vector is (w_1, w_2, ..., w_n):
case 1: if the character string cc_i, cc_i+1, ..., cc_j matches the pattern "anatomical part", "disease", or "drug", give each of the characters cc_i, cc_i+1, ..., cc_j a pattern weight of 2;
case 2: if the character string cc_i, cc_i+1, ..., cc_j matches another pattern, give each of the characters cc_i, cc_i+1, ..., cc_j a pattern weight of 1.5;
case 3: if the character string cc_i, cc_i+1, ..., cc_j matches no pattern, give each of the characters cc_i, cc_i+1, ..., cc_j a pattern weight of 1;
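The three weighting cases of step 2.2 can be sketched as follows. The pattern tables here are hypothetical single-string stand-ins: the patent's real patterns are category sequences such as "anatomical part + clinical manifestation", not literal strings:

```python
import re

# Hypothetical pattern tables (illustrative stand-ins for the real
# category-sequence patterns of step 2.1).
HIGH_PATTERNS = [re.compile(p) for p in ["胃", "肝癌"]]   # anatomical part / disease / drug
OTHER_PATTERNS = [re.compile(p) for p in ["疼痛"]]        # any other pattern

def pattern_weights(sentence):
    # Cases 1-3 of step 2.2: weight 2 for characters covered by an
    # anatomical-part/disease/drug pattern, 1.5 for other matched
    # patterns, and 1 for unmatched characters.
    w = [1.0] * len(sentence)
    for pats, weight in ((OTHER_PATTERNS, 1.5), (HIGH_PATTERNS, 2.0)):
        for pat in pats:
            for m in pat.finditer(sentence):
                for i in range(m.start(), m.end()):
                    w[i] = max(w[i], weight)
    return w
```

For example, in a sentence where "胃" (stomach) matches an anatomical-part pattern and "疼痛" (pain) matches another pattern, the weight vector mixes 2, 1.5, and 1 per character.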
step 3: perform node embedding representation learning with a graph attention network and a pattern-enhanced attention mechanism, comprising the following steps:
step 3.1: transform the embedding dimensions of the Chinese character nodes or English word nodes with a fully connected layer;
the multi-granularity text embedding of each character in the Chinese sentence CS is input into a fully connected layer, which converts its embedding dimension; the dimension is converted because the multi-granularity text embedding dimension must be consistent with the node vector input dimension of the graph attention network used in step 3.2;
similarly, the multi-granularity text embedding of each word in the English sentence ES is input into the fully connected layer, which converts the embedding dimension of the English multi-granularity text embedding.
In the fully connected layer, the dimension is first converted through a linear layer; then the dropout method is used to prevent overfitting; finally, the activation function ReLU is applied to prevent the gradient from vanishing;
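A NumPy sketch of the fully connected transform of step 3.1 (linear layer, dropout during training, ReLU); the shapes and the inverted-dropout formulation are illustrative assumptions, not the patent's exact hyperparameters:

```python
import numpy as np

rng = np.random.default_rng(0)

def fc_transform(x, w, b, drop_p=0.1, training=False):
    # Linear layer maps the multi-granularity embedding dimension to the
    # graph attention network's input dimension; dropout guards against
    # overfitting at training time; ReLU is the activation.
    y = x @ w + b                      # linear: (n, d_in) @ (d_in, d_out)
    if training:                       # inverted dropout (train time only)
        mask = rng.random(y.shape) >= drop_p
        y = y * mask / (1.0 - drop_p)
    return np.maximum(y, 0.0)          # ReLU
```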
step 3.2: for the Chinese medical text, multiply the attention weight of each Chinese character node in the graph attention network by the pattern weight of the corresponding character in the Chinese sentence; for the English medical text, set the pattern weight of the English word nodes in the graph attention network to 1;
for the Chinese medical text, the node embedding of the graph attention network is the character embedding generated in step 3.1; for the English medical text, the node embedding of the graph attention network is the word embedding generated in step 3.1;
step 3.2.1: calculate the attention weights of the nodes in the graph attention network;
first, the multi-granularity text embedding h of a sentence is input into the attention layer of the graph attention network, where h is the embedding of the Chinese character nodes or English word nodes of one sentence of the medical text, M is the number of nodes, and H is the node embedding dimension, as shown in (5):
h = (h_1, h_2, ..., h_M), h ∈ R^(M×H)  (5)
a linear transformation is applied to the input node embeddings, converting them to the dimension of the number of category labels, and the attention weight is calculated with the LeakyReLU function, i.e. the importance degree e_uv of node v to node u, as shown in (6):
e_uv = LeakyReLU(a^T[W_1·h_u ∥ W_1·h_v])  (6)
where W_1 denotes a shared weight matrix, a denotes the shared attention vector, ∥ denotes concatenation, h_u denotes the embedding of the Chinese character node or English word node u of one sentence of the medical text, and h_v denotes the embedding of the Chinese character node or English word node v of one sentence of the medical text;
then, e_uv is normalized with the Softmax function to obtain α_uv, as shown in (7):
α_uv = exp(e_uv) / Σ_{k∈N_u} exp(e_uk)  (7)
where e_uk denotes the importance degree of node k to node u, α_uv denotes the normalized value of e_uv, and N_u denotes the neighbor nodes of node u;
finally, the attention weight α_u of node u is generated, as shown in (8):
α_u = Σ_{v∈N_u} α_uv·W_2·h_v  (8)
where W_2 denotes a weight matrix;
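Equations (6) to (8) follow the standard graph attention formulation; the original figures for these equations are not legible in this copy, so the sketch below assumes that standard form (dense adjacency matrix, arbitrarily chosen LeakyReLU slope 0.2):

```python
import numpy as np

def gat_attention(h, adj, W1, a, W2):
    # Sketch of (6)-(8): pairwise scores e_uv via LeakyReLU over the
    # concatenated projected node pair, softmax over each node's
    # neighbors (7), then the weighted aggregation with W2 (8).
    # adj[u, v] = 1 marks v as a neighbor of u (self-loops included).
    M = h.shape[0]
    z = h @ W1                        # shared linear projection W1
    e = np.full((M, M), -np.inf)      # -inf drops non-neighbors from the softmax
    for u in range(M):
        for v in range(M):
            if adj[u, v]:
                s = a @ np.concatenate([z[u], z[v]])
                e[u, v] = s if s > 0 else 0.2 * s   # LeakyReLU, slope 0.2
    alpha = np.exp(e - e.max(axis=1, keepdims=True))
    alpha = alpha / alpha.sum(axis=1, keepdims=True)   # (7)
    return alpha @ (h @ W2)           # (8): alpha_u = sum_v alpha_uv W2 h_v
```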
step 3.2.2: update the attention weights of the nodes in the graph attention network;
first, for node u, the attention weight α_u is updated with the pattern weight w_u of the Chinese character or English word represented by node u, as shown in (9):
α_u = α_u × w_u  (9)
second, the attention weight attention_l (1 ≤ l ≤ k) of the sentence is constructed, as shown in (10):
attention_l = (α_1, α_2, ..., α_M)  (10)
then, a multi-head attention mechanism is introduced into the graph attention network, specifically: k attention weights are calculated, and each attention weight is multiplied by the input h to generate the sentence feature h'_l, as shown in (11):
h'_l = attention_l × h  (11)
applying the activation function elu gives the per-head outputs elu(h'_1), elu(h'_2), ..., elu(h'_k);
third, the outputs of the k heads are concatenated to generate h', as shown in (12):
h' = Concat(elu(h'_1), elu(h'_2), ..., elu(h'_k))  (12)
finally, the final output h_final is generated through the log_softmax function, as shown in (13):
h_final = log_softmax(h')  (13)
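A NumPy sketch of (9) through (13). The product in (11) is underspecified in this copy, so attention_l is read here as a per-node scalar weight vector that scales the rows of h; that reading is an assumption:

```python
import numpy as np

def log_softmax(x, axis=-1):
    # Numerically stable log_softmax for (13).
    m = x.max(axis=axis, keepdims=True)
    return x - m - np.log(np.exp(x - m).sum(axis=axis, keepdims=True))

def pattern_enhanced_heads(h, alphas, w_pattern):
    # (9): scale each node's attention weight by its pattern weight;
    # (11): form one weighted feature per head; apply elu;
    # (12): concatenate the k heads; (13): finish with log_softmax.
    heads = []
    for alpha in alphas:                  # one weight vector per head
        a = alpha * w_pattern             # (9): alpha_u <- alpha_u * w_u
        hl = a[:, None] * h               # (11), read as row scaling of h
        heads.append(np.where(hl > 0, hl, np.exp(hl) - 1))  # elu
    h_cat = np.concatenate(heads, axis=1) # (12): Concat over heads
    return log_softmax(h_cat)             # (13)
```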
step 4: generate the entity category labels of the medical text with a conditional random field, and output the medical entity recognition result, specifically: generate the entity category labels of the Chinese characters or English words;
based on the conditional random field, the conditional probability of each character is calculated, i.e. the probability that each character belongs to each entity category label; the label with the highest probability is assigned to the corresponding character as its entity category label, and the medical entity recognition result is then output;
the conditional random field performs sequence labeling on the sentences in the medical text, generates the entity category labels of the Chinese characters or English words, and outputs the medical text entity recognition result.
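At inference time, CRF label decoding amounts to Viterbi search over label sequences; a generic NumPy sketch, where the emission and transition score matrices are assumed inputs rather than the trained model's actual parameters:

```python
import numpy as np

def viterbi_decode(emissions, transitions):
    # emissions[t, y] scores label y for the t-th character/word;
    # transitions[y, y2] scores the move y -> y2.
    # Returns the highest-scoring label sequence (the entity tags).
    T, Y = emissions.shape
    score = emissions[0].copy()
    back = np.zeros((T, Y), dtype=int)
    for t in range(1, T):
        cand = score[:, None] + transitions + emissions[t][None, :]
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0)
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]
```

With a BIO-style tagset, each decoded index would map to a label such as B-disease or O; the tagset itself depends on the dataset.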
Advantageous effects
Compared with the traditional medical entity identification method, the medical entity identification method based on multi-granularity text embedding provided by the invention has the following beneficial effects:
1. The identification method is portable and robust, and is not limited by the source of the corpus; it performs graph representation modeling of the medical text based on a graph attention network, the corpus language is not restricted, and both Chinese texts and English texts can be processed;
2. The method constructs multi-granularity text embedding of the unstructured medical text, comprising character embedding, word embedding, part-of-speech embedding, substring embedding, and phrase embedding; by introducing multi-granularity text embedding, the features of the medical text at the character, word, part-of-speech, phrase, and substring levels are mined, distributed representation learning at the string, lexical, and syntactic levels is realized, the entity feature information of the medical text is enhanced, and the accuracy of medical entity recognition is improved;
3. The method adopts a graph attention network, a pattern-enhanced attention mechanism, and a conditional random field for medical entity identification: first, the graph attention network model realizes graph representation modeling of the medical text and captures the graph structure information between the Chinese characters or English words of the medical text; second, a pattern-enhanced attention mechanism introduces the medical entity composition pattern features into the attention weights of the nodes in the graph attention network, effectively integrating the pattern-based and the deep learning-based medical entity identification methods, making full use of the characteristics and advantages of both, and improving medical entity identification performance;
4. the method can identify the medical entities of the unstructured Chinese medical text and the English medical text, and has wide application prospects in the fields of information retrieval, text classification, question-answering systems and the like.
Drawings
Fig. 1 is a flowchart illustrating a medical entity recognition based on multi-granular text embedding according to an embodiment of the present invention.
Detailed Description
The medical entity recognition system based on the method takes PyCharm as the development tool, Python as the development language, and PyTorch as the development framework.
Preferred embodiments of the method of the present invention will be described in detail with reference to examples.
Examples
This embodiment describes a process of using the method for medical entity recognition based on multi-granularity text embedding according to the present invention, as shown in fig. 1.
Firstly, the pre-trained language models MC-Bert and BioBert generate the character embedding of the Chinese medical text and the word embedding of the English medical text; the word2vec tool generates the word embedding, part-of-speech embedding, phrase embedding, and substring embedding of the Chinese medical text and the character embedding, part-of-speech embedding, phrase embedding, and substring embedding of the English medical text; these embeddings are concatenated to construct the final multi-granularity text embedding of the Chinese and English medical texts. Secondly, the pattern weights of all characters in a Chinese sentence are generated according to the medical entity composition patterns, while the pattern weights of all words in an English sentence are set to 1. Then, node embedding representation learning is carried out with the graph attention network and the pattern-enhanced attention mechanism, and the attention weights of the nodes in the graph attention network are updated with the pattern weights. Finally, the conditional random field predicts the entity label of each character in the Chinese medical text or of each word in the English medical text, and the medical text entity recognition result is output.
Experiments were performed on the CCKS2019 dataset. First, the multi-granularity text embedding of each sentence of medical text in the CCKS2019 dataset is generated; second, the pattern weights of all characters in each sentence of medical text are generated according to the medical entity composition patterns; then, the multi-granularity text embedding and the pattern matching weights are passed into the graph attention network, the pattern matching weights are multiplied with the attention weights of the nodes, and the final embedded representation of the input text is calculated; finally, the conditional random field outputs the final predicted entity recognition labels according to the calculated probabilities. The experimental results prove the effectiveness of the invention.
The method can also be applied to the English medical text dataset NCBI Disease, the biochemical-field dataset BC5CDR, and the like. The flow applied to the NCBI Disease dataset is largely consistent with that for the CCKS2019 dataset, the difference being that when the attention coefficients are calculated in the graph attention network, the pattern weights of the English word nodes are all set to 1. The flow applied to the BC5CDR dataset differs as follows: when constructing the multi-granularity text embedding, a term dictionary for the biochemical field is used to generate the substring embedding, which is concatenated with the character embedding, word embedding, part-of-speech embedding, and phrase embedding, and the concatenated embedding is passed into the graph attention network; when the graph attention network calculates the attention weights, entity composition pattern matching weights for the biochemical field are added, and the result of the graph attention network is passed into the conditional random field; the conditional random field outputs the final predicted entity recognition result according to the probabilities.
As can be seen from fig. 1, the method specifically includes the following steps:
step 1: the method for constructing the multi-granularity text embedding through the pre-training language model comprises the following steps:
step 1.1: for unstructured Chinese medical texts, constructing multi-granularity text embedding;
step 1.1.1: learning of symbol embedding, segmentation embedding and covering embedding is carried out by utilizing a pre-training language model MC-Bert, and character embedding of an unstructured Chinese medical text is generated;
MC-Bert is a pre-trained language model trained on Chinese medical data;
For an unstructured Chinese medical text, the input of the pre-trained language model MC-Bert consists of three kinds of embedding: symbol embedding (Token Embedding), segmentation embedding (Segment Embedding), and covering embedding (Mask Embedding). Symbol embedding is the vector representation of each character. Segmentation embedding distinguishes two natural language sentences; since the medical entity recognition task recognizes entities sentence by sentence, each character has the same segmentation embedding. In the covering embedding, if the current position holds a character of the input sentence, the value is 1; if the current position is a padding position, i.e., does not hold a character of the input sentence, the value is 0;
For a Chinese sentence CS = (cc_1, cc_2, ..., cc_n), where cc_1, cc_2, ..., cc_n are characters, a character embedding representation E_cc of the sentence CS is generated using the pre-trained language model MC-Bert, as shown in (1);
E_cc = (e_cc1, e_cc2, ..., e_ccn), E_cc ∈ R^(n×m)    (1)
where n is the sentence length 512, i.e., the number of characters in the sentence, padded with 0 when there are fewer than 512 characters; m is the dimension 768 set by the pre-trained model MC-Bert; e_cci (i = 1, 2, ..., n) is the 768-dimensional embedding of character cc_i; the character embedding representation E_cc has dimension 512 × 768; R^(n×m) denotes a real matrix of n rows and m columns;
For example, consider the sentence "the patient has yellow skin and sclera, with decreased appetite before 4 months, after dinner, with paroxysmal abdominal pain, nausea, no diarrhea, vomiting, chest distress, suffocating, dizziness, headache, no object rotation, no fever, cough, chest pain, asthma, and light stool." The characters are separated by "\t", and the markers "[CLS]" and "[SEP]" are added at the beginning and end of the sentence. To make the dimensions of the character embedding representations of different sentences consistent, the sentence length is padded to 512 with 0. The character embedding representation of the sentence generated with the pre-trained model MC-Bert is:
E_cc = ( (x_11, x_12, ..., x_1m), (x_21, x_22, ..., x_2m), ..., (x_n1, x_n2, ..., x_nm) )
where n is the sentence length 512 and m is the character embedding dimension 768. The character embedding vector of the character "patient" is (x_11, x_12, ..., x_1m);
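The padding described above can be sketched as follows; the per-character vectors here are toy stand-ins for MC-Bert output, not the actual model:

```python
import numpy as np

MAX_LEN, DIM = 512, 768  # sentence length and MC-Bert embedding dimension

def pad_character_embeddings(char_vectors):
    """Pad (or truncate) a list of per-character vectors to a fixed
    MAX_LEN x DIM matrix, filling missing positions with 0 as in (1)."""
    E = np.zeros((MAX_LEN, DIM))
    for i, v in enumerate(char_vectors[:MAX_LEN]):
        E[i] = v
    return E

# toy stand-in for MC-Bert output: 7 characters, 768-dim each
sentence_vectors = [np.ones(DIM) * k for k in range(7)]
E_cc = pad_character_embeddings(sentence_vectors)
print(E_cc.shape)  # (512, 768)
```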
Step 1.1.2: generating word embedding, part of speech embedding and phrase embedding of the Chinese medical text;
Firstly, for a Chinese medical text, the jieba word segmentation tool is used to obtain its words, the part-of-speech tagger Stanford POS Tagger is used to obtain the part-of-speech tags of those words, and the syntactic parser Stanford Parser is used to obtain the phrase marks of the text;
then, word embedding, part of speech embedding and phrase embedding are generated by using a word2vec tool, and the three embedding dimensions are all 200;
For a Chinese sentence CS = (cc_1, cc_2, ..., cc_n), where cc_1, cc_2, ..., cc_n are characters: word embedding, specifically: the words of the Chinese medical text are obtained with the jieba word segmentation tool, and word2vec is used to obtain the embedding of the word to which each character belongs; part-of-speech embedding, specifically: based on the part-of-speech tags of the words of the Chinese medical text, word2vec is used to obtain the part-of-speech embedding of the word to which each character belongs; phrase embedding, specifically: based on the phrase marks of the Chinese medical text, word2vec is used to obtain the embedding of the type of the phrase to which each character belongs;
The word embedding is E_cw = (e_cw1, e_cw2, ..., e_cwn), where the words to which characters cc_1, cc_2, ..., cc_n belong are, in order, cw_1, cw_2, ..., cw_n; e_cwi (i = 1, 2, ..., n) is the embedding of word cw_i;
The part-of-speech embedding is E_cpos = (e_cpo1, e_cpo2, ..., e_cpon), where the parts of speech of words cw_1, cw_2, ..., cw_n are cpo_1, cpo_2, ..., cpo_n; e_cpoi (i = 1, 2, ..., n) is the embedding of part of speech cpo_i;
The phrase embedding is E_cph = (e_cph1, e_cph2, ..., e_cphn), where the phrase types to which characters cc_1, cc_2, ..., cc_n belong are, in order, cph_1, cph_2, ..., cph_n; e_cphi (i = 1, 2, ..., n) is the embedding of phrase type cph_i;
For example, for the Chinese sentence "the patient found yellowing of skin and sclera 4 months ago", the jieba word segmentation tool yields the segmentation "patient / 4 months / ago / found / skin / sclera / yellowing", which gives the word to which each character in the sentence belongs. For example, the character "患" belongs to the word "患者" (patient), so the word embedding of that character is the embedding of the word "patient";
The part-of-speech tags of the expanded Chinese sentence, obtained with the part-of-speech tagger Stanford POS Tagger, are "NN NN CD NN LC VV VV NN NN PU NN NN NR NR NR". The phrase marks of the expanded Chinese sentence, obtained with the syntactic parser Stanford Parser, are "NP NP NP NP LCP VP VP NP NP PU NP NP NP NP NP NP NP". For example, the character "患" belongs to the word "患者" (patient); the part-of-speech embedding of that character is the embedding of the part of speech "NN" of the word "patient", and its phrase embedding is the embedding of the type mark "NP" of the phrase to which it belongs;
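The affiliation of per-word tags to characters described above can be sketched as follows; the segmentation and tags are illustrative stand-ins, not actual jieba / Stanford POS Tagger output:

```python
# Illustrative stand-ins for the segmentation and per-word POS tags
words = ["患者", "4月", "前", "发现", "皮肤", "巩膜", "黄染"]
pos_tags = ["NN", "NT", "LC", "VV", "NN", "NN", "VV"]

def broadcast_to_characters(words, tags):
    """Give every character the tag of the word it belongs to, so each
    character's part-of-speech embedding is that of its word."""
    char_tags = []
    for w, t in zip(words, tags):
        char_tags.extend([t] * len(w))
    return char_tags

char_pos = broadcast_to_characters(words, pos_tags)
print(len(char_pos))  # 13 characters
```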
step 1.1.3: for an unstructured Chinese medical text, generating substring embedding by using a word2vec tool, wherein the embedding dimension is 200, and specifically comprises the following steps:
firstly, collecting a medical term dictionary and constructing a medical term substring set;
for any two terms, the longest common substring of the two terms is extracted and added to the medical term substring set. If two terms have a plurality of longest common substrings with the same length, the first longest common substring is taken and added into the medical term substring set;
secondly, generating embedded representations of all substrings in the substring set of the medical terms by using a word2vec tool;
Then, for each word cw_i (i = 1, 2, ..., n), it is judged whether the word cw_i contains substrings from the medical term substring set. For the Chinese sentence CS = (cc_1, cc_2, ..., cc_n), where cc_1, cc_2, ..., cc_n are characters whose words are, in order, cw_1, cw_2, ..., cw_n: suppose the word cw_i contains the substrings csubs_1, csubs_2, ..., csubs_p from the medical term substring set, whose embeddings are denoted e_cs1, e_cs2, ..., e_csp; then the substring embedding representation e_cssi of word cw_i is: the sum of e_cs1, e_cs2, ..., e_csp divided by the number of substrings p. If the word cw_i contains no substring from the medical term substring set, a custom value is output;
Finally, for each character cc_i (i = 1, 2, ..., n) in the sentence CS, the corresponding substring embedding is generated according to the above steps; the substring embedding of sentence CS is E_css = (e_css1, e_css2, ..., e_cssn);
For example, a medical dictionary is collected, the Systematized Nomenclature of Medicine - Clinical Terms (SNOMED CT), and the medical term substring set is constructed from it. For the word "消化道" (digestive tract) in a Chinese sentence, the word contains six substrings from the medical term substring set: "消", "化", "道", "消化", "化道", and "消化道". The substring embedding of "消化道" is then the sum of the embeddings of these six substrings divided by the number of substrings, 6;
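The substring embedding of step 1.1.3 can be sketched as follows, with a toy embedding table standing in for word2vec; the longest-common-substring extraction used to build the medical term substring set is also shown:

```python
import numpy as np

def longest_common_substring(a, b):
    """Dynamic-programming longest common substring; returns the first
    longest match, as specified for building the medical term substring set."""
    best, best_end = 0, 0
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
                if dp[i][j] > best:
                    best, best_end = dp[i][j], i
    return a[best_end - best:best_end]

def substring_embedding(word, substring_set, embed, dim=200, default=None):
    """Average the embeddings of all set substrings contained in `word`;
    fall back to a custom value (here a zero vector) when none is contained."""
    hits = [s for s in substring_set if s in word]
    if not hits:
        return np.zeros(dim) if default is None else default
    return np.mean([embed[s] for s in hits], axis=0)
```

For instance, `longest_common_substring("abcde", "xbcdy")` yields `"bcd"`, which would then be added to the substring set.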
step 1.1.4: character embedding, word embedding, part of speech embedding, phrase embedding and substring embedding of the Chinese medical text are spliced to construct multi-granularity text embedding:
For the Chinese sentence CS = (cc_1, cc_2, ..., cc_n), where cc_1, cc_2, ..., cc_n are characters, the multi-granularity text embedding of character cc_i (i = 1, 2, ..., n) is constructed, i.e., the character embedding E_cc, word embedding E_cw, part-of-speech embedding E_cpos, phrase embedding E_cph, and substring embedding E_css of the sentence CS are concatenated to construct the multi-granularity text embedding of the sentence CS, as shown in (2);
E_cme = Concate(E_cc, E_cw, E_cpos, E_cph, E_css)    (2)
where Concate denotes the concatenation operation. The dimension of E_cme is 1568, i.e., 1568 (dimension of E_cme) = 768 (dimension of E_cc) + 200 (dimension of E_cw) + 200 (dimension of E_cpos) + 200 (dimension of E_cph) + 200 (dimension of E_css);
Thus, from step 1.1.1 to step 1.1.4, the multi-granularity text embedding of Chinese character cc_i (i = 1, 2, ..., n) is constructed and denoted E_cme;
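The concatenation of (2) can be sketched with zero matrices standing in for the five embeddings:

```python
import numpy as np

n = 512  # sentence length
# toy stand-ins for the five per-character embedding matrices
E_cc   = np.zeros((n, 768))  # character embedding (MC-Bert)
E_cw   = np.zeros((n, 200))  # word embedding
E_cpos = np.zeros((n, 200))  # part-of-speech embedding
E_cph  = np.zeros((n, 200))  # phrase embedding
E_css  = np.zeros((n, 200))  # substring embedding

# multi-granularity text embedding, as in (2)
E_cme = np.concatenate([E_cc, E_cw, E_cpos, E_cph, E_css], axis=1)
print(E_cme.shape)  # (512, 1568)
```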
Step 1.2: for unstructured English medical text, generating multi-granularity text embedding, comprising the steps of:
step 1.2.1: learning of symbol embedding, segmentation embedding and covering embedding is carried out by utilizing a pre-training language model BioBert, and word embedding of an unstructured English medical text is generated;
BioBert is a pre-training model generated by training according to English medical data;
For an English sentence ES = (ew_1, ew_2, ..., ew_n), where ew_1, ew_2, ..., ew_n are words, a word embedding representation of the sentence ES is generated using the pre-trained language model BioBert, as shown in (3);
E_ew = (e_ew1, e_ew2, ..., e_ewn), E_ew ∈ R^(n×m)    (3)
where n is the sentence length 512, i.e., the number of words in the sentence, padded with 0 when there are fewer than 512 words; m is the dimension 768 set by the pre-trained model BioBert; e_ewi (i = 1, 2, ..., n) is the 768-dimensional embedding of word ew_i; the word embedding representation E_ew has dimension 512 × 768; R^(n×m) denotes a real matrix of n rows and m columns;
step 1.2.2: generating character embedding, part of speech embedding and phrase embedding of English medical texts;
for English medical texts, word2vec tools are used for generating character embedding, part of speech embedding and phrase embedding, and the embedding dimension is 200;
For an English sentence ES = (ew_1, ew_2, ..., ew_n), where ew_1, ew_2, ..., ew_n are words, the character embedding of word ew_i (i = 1, 2, ..., n) is formed by averaging the embeddings of all characters of ew_i, the part-of-speech embedding is the embedding of the part of speech of ew_i, and the phrase embedding is the embedding of the type of the phrase to which ew_i belongs;
The character embedding is E_ec = (e_ec1, e_ec2, ..., e_ecn), where e_eci (i = 1, 2, ..., n) is the character embedding of word ew_i;
The part-of-speech embedding is E_epos = (e_epo1, e_epo2, ..., e_epon), where the parts of speech of words ew_1, ew_2, ..., ew_n are, in order, epo_1, epo_2, ..., epo_n; e_epoi (i = 1, 2, ..., n) is the embedding of part of speech epo_i;
The phrase embedding is E_eph = (e_eph1, e_eph2, ..., e_ephn), where the phrase types to which words ew_1, ew_2, ..., ew_n belong are, in order, eph_1, eph_2, ..., eph_n; e_ephi (i = 1, 2, ..., n) is the embedding of phrase type eph_i;
step 1.2.3: for an unstructured English medical text, generating substring embedding by using a word2vec tool, wherein the embedding dimension is 200, and specifically comprises the following steps:
firstly, collecting an English medical term dictionary and constructing a medical term substring set; for any two terms, extracting the longest common substring of the two terms, and adding the substring to the medical term substring set; if two terms have a plurality of longest common substrings with the same length, the first longest common substring is taken and added into the medical term substring set;
secondly, generating embedded representations of all the substrings by using a word2vec tool;
Then, for the English sentence ES = (ew_1, ew_2, ..., ew_n), where ew_1, ew_2, ..., ew_n are words, for each word ew_i (i = 1, 2, ..., n) it is judged whether the word ew_i contains substrings from the medical term substring set; suppose the word ew_i contains the substrings esubs_1, esubs_2, ..., esubs_q from the medical term substring set, whose embeddings are denoted e_es1, e_es2, ..., e_esq; then the substring embedding representation e_essi of word ew_i is: the sum of e_es1, e_es2, ..., e_esq divided by the number of substrings q; if the word ew_i contains no substring from the medical term substring set, a custom value is output;
Finally, for each word ew_i (i = 1, 2, ..., n) in the sentence ES, the corresponding substring embedding is generated according to the above steps; the substring embedding of sentence ES is E_ess = (e_ess1, e_ess2, ..., e_essn);
Step 1.2.4: character embedding, word embedding, part of speech embedding, phrase embedding and substring embedding of the English medical text are spliced to construct multi-granularity text embedding:
For an English sentence ES = (ew_1, ew_2, ..., ew_n), where ew_1, ew_2, ..., ew_n are words, the multi-granularity text embedding of word ew_i (i = 1, 2, ..., n) is constructed, i.e., the character embedding E_ec, word embedding E_ew, part-of-speech embedding E_epos, phrase embedding E_eph, and substring embedding E_ess of the sentence ES are concatenated to construct the multi-granularity text embedding of the sentence ES, as shown in (4);
E_eme = Concate(E_ec, E_ew, E_epos, E_eph, E_ess)    (4)
where Concate denotes the concatenation operation. The dimension of E_eme is 1568, i.e., 1568 (dimension of E_eme) = 768 (dimension of E_ew) + 200 (dimension of E_ec) + 200 (dimension of E_epos) + 200 (dimension of E_eph) + 200 (dimension of E_ess);
Thus, from step 1.2.1 to step 1.2.4, the multi-granularity text embedding of English word ew_i (i = 1, 2, ..., n) is constructed and denoted E_eme;
Step 2: generating the mode weight of all characters in the Chinese sentence according to the medical entity constitution mode, comprising the following steps:
step 2.1: constructing a Chinese medical entity composition mode;
The medical entity composition pattern has the form: "Y_1 + Y_2 + Y_3 + ... + Y_k", where Y_1, Y_2, Y_3, ..., Y_k denote categories of words and "+" denotes the string concatenation operation. The categories of words include negation words, clinical manifestations, anatomical parts, modifiers, disease names, physical examinations, numerical values, quantifiers, and drugs;
for example, negation words include none, absent, etc. The clinical manifestations include chill, sweating, increased heart rate, etc. Anatomical sites include the back, meniscus, left colon artery, etc. Modifiers include mild, more severe, etc. The disease names include rheumatic heart disease, multiple cancers, etc. Physical examination includes cardiopulmonary examination, electrocardiography, and the like. Quantifier includes degree, group, only, and the like. The medicines comprise cedilanid, cefuroxime axetil, aspirin and the like;
for example, constructing a medical entity constitutes a pattern of "negation + clinical manifestation", which is satisfied by the terms "no nausea" and "no fever". Because the term "no nausea" consists of the negation of the word "no" and clinical manifestations of "nausea", the term "no fever" consists of the negation of the word "no" and clinical manifestations of "fever";
step 2.2: generating a mode weight of characters in the Chinese sentence;
For the Chinese sentence CS = (cc_1, cc_2, ..., cc_n), where cc_1, cc_2, ..., cc_n are characters, it is judged whether the Chinese sentence CS matches a medical entity composition pattern, and the constructed pattern matching weight vector is: (w_1, w_2, ..., w_n);
Case 1: if the character string cc_i, cc_i+1, ..., cc_j satisfies the pattern "anatomical part", "disease", or "drug", each character cc_i, cc_i+1, ..., cc_j is given a pattern weight of 2;
Case 2: if the character string cc_i, cc_i+1, ..., cc_j satisfies another pattern, each character cc_i, cc_i+1, ..., cc_j is given a pattern weight of 1.5;
Case 3: if the character string cc_i, cc_i+1, ..., cc_j satisfies no pattern, each character cc_i, cc_i+1, ..., cc_j is given a pattern weight of 1;
For example, the input text "patient found yellow skin, sclera, 4 months ago" is matched against the medical entity composition patterns, and the generated pattern weight vector is: (1,1,1,1,1,1,1,2,2,1,2,2,1,1);
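The case 1-3 weighting of step 2.2 can be sketched as follows; the matcher is a hypothetical stand-in that takes pre-matched spans instead of running the full pattern matching of step 2.1:

```python
# Categories whose matched characters receive the highest weight (case 1)
HIGH = {"anatomical part", "disease", "drug"}

def pattern_weights(sentence, matches):
    """`matches` is a list of (start, end, category) spans. Characters in
    anatomical part / disease / drug spans get weight 2, characters in
    other matched spans get 1.5, and all remaining characters get 1."""
    w = [1.0] * len(sentence)
    for start, end, category in matches:
        value = 2.0 if category in HIGH else 1.5
        for i in range(start, end):
            w[i] = value
    return w

print(pattern_weights("abcdefg",
                      [(1, 3, "disease"),
                       (4, 6, "negation + clinical manifestation")]))
# [1.0, 2.0, 2.0, 1.0, 1.5, 1.5, 1.0]
```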
and step 3: using a graph attention network and a pattern-reinforced attention mechanism to perform node-embedded representation learning, comprising the steps of:
3.1, transforming the embedding dimensions of the Chinese character nodes or English word nodes by using the full connection layer;
The multi-granularity text embedding of each character in the Chinese sentence CS is input into a fully connected layer, which converts its embedding dimension from 1568 to 768. The reason for this conversion is that the multi-granularity text embedding dimension must be consistent with the node vector input dimension of the graph attention network used in step 3.2, namely 768. Similarly, the multi-granularity text embedding of each word in the English sentence ES is input into a fully connected layer, which converts its embedding dimension from 1568 to 768;
In the fully connected layer, first, the dimension is converted through a linear layer; then, the dropout method is used to prevent overfitting; finally, the activation function ReLU is applied to prevent vanishing gradients;
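The fully connected layer of step 3.1 (linear layer, dropout, ReLU) can be sketched in NumPy with random stand-in weights:

```python
import numpy as np

rng = np.random.default_rng(0)

def fully_connected(x, W, b, drop_rate=0.1, training=True):
    """Linear layer 1568 -> 768, then dropout (training only), then ReLU,
    mirroring step 3.1. The weights here are random stand-ins."""
    h = x @ W + b                      # linear dimension conversion
    if training and drop_rate > 0:     # inverted dropout against overfitting
        mask = rng.random(h.shape) >= drop_rate
        h = h * mask / (1.0 - drop_rate)
    return np.maximum(h, 0.0)          # ReLU activation

x = rng.standard_normal((512, 1568))           # multi-granularity embeddings
W = rng.standard_normal((1568, 768)) * 0.01
b = np.zeros(768)
print(fully_connected(x, W, b, training=False).shape)  # (512, 768)
```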
step 3.2: for Chinese medical text, the attention weights of the Chinese character nodes in the graph attention network are multiplied by the pattern weights of the characters in the Chinese sentence; for English medical text, the pattern weights of the English word nodes in the graph attention network are all set to 1;
for Chinese medical text, the node embedding of the graph attention network is character embedding, and the character embedding is 768-dimensional character embedding generated in step 3.1. For English medical texts, node embedding of the graph attention network is word embedding, and the word embedding is 768-dimensional word embedding generated in the step 3.1;
step 3.2.1: calculating attention weights of nodes in the graph attention network;
First, the multi-granularity text embedding h of a sentence is input into the attention layer of the graph attention network, where h is the embedding of the Chinese character nodes or English word nodes of a sentence of medical text, M is the number of nodes 512, and H is the node embedding dimension 768, as shown in (5);
h = (h_1, h_2, ..., h_M), h ∈ R^(M×H)    (5)
A linear transformation is applied to the input node embeddings, converting the 768-dimensional node embedding into a 16-dimensional one, where 16 is the number of class labels. The attention weight is then calculated using the LeakyRelu function, i.e., the importance e_uv of node v to node u, as shown in (6);
e_uv = LeakyRelu(a^T [W_1 h_u ‖ W_1 h_v])    (6)
where W_1 denotes a shared weight matrix, a denotes the weight vector of the attention layer, ‖ denotes concatenation, h_u denotes the embedding of the Chinese character node or English word node u of a sentence of medical text, and h_v denotes the embedding of the Chinese character node or English word node v;
Then, the Softmax function is used to normalize e_uv, obtaining α_uv, as shown in (7);
α_uv = exp(e_uv) / Σ_{k∈N_u} exp(e_uk)    (7)
where e_uk denotes the importance of node k to node u, α_uv denotes the normalized value of e_uv, and N_u denotes the neighbor nodes of node u;
Finally, the attention weight α_u of node u is generated, as shown in (8);
α_u = Σ_{v∈N_u} α_uv W_2 h_v    (8)
where W_2 denotes a weight matrix;
step 3.2.2: updating attention weights of nodes in the graph attention network;
First, for node u, the attention weight α_u of node u is updated using the pattern weight w_u of the Chinese character or English word represented by node u, as shown in (9);
α_u = α_u × w_u    (9)
Second, the attention weight attention_l (1 ≤ l ≤ k) of the sentence is constructed, as shown in (10);
attention_l = (α_1, α_2, ..., α_M)    (10)
A multi-head attention mechanism is then introduced into the graph attention network. Specifically, k attention weights are calculated, each attention weight is multiplied by the input h, and a feature h′_l of the sentence is generated, as shown in (11);
h′_l = attention_l × h    (11)
Each head's output is obtained through the activation function elu, giving elu(h′_1), elu(h′_2), ..., elu(h′_k); finally, the outputs of the k heads are concatenated to generate h′, as shown in (12);
h′ = concat(elu(h′_1), elu(h′_2), ..., elu(h′_k))    (12)
Finally, the final output h_final is generated through the log_softmax function, as shown in (13);
h_final = log_softmax(h′)    (13)
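Steps 3.2.1-3.2.2 can be sketched as a single attention head in NumPy; the score function follows the standard graph attention form assumed in (6), and all sizes are toy values rather than the 512-node, 768-dimensional setting of the invention:

```python
import numpy as np

rng = np.random.default_rng(1)

def leaky_relu(x, slope=0.01):
    return np.where(x > 0, x, slope * x)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def pattern_attention(h, adj, W1, a, pattern_w):
    """One attention head with pattern-enhanced weights, a sketch of
    steps 3.2.1-3.2.2 under standard graph-attention assumptions:
    e_uv = LeakyRelu(a^T[W1 h_u || W1 h_v]), softmax over neighbors,
    then each node's weights are scaled by its pattern weight, as in (9)."""
    z = h @ W1                                   # (M, F) transformed nodes
    M = z.shape[0]
    scores = leaky_relu(
        np.concatenate([np.repeat(z, M, 0), np.tile(z, (M, 1))], 1) @ a
    ).reshape(M, M)
    scores = np.where(adj > 0, scores, -1e9)     # keep only neighbor edges
    alpha = softmax(scores, axis=1)
    alpha = alpha * pattern_w[:, None]           # pattern-enhanced update
    return alpha @ z                             # aggregated node features

M, H, F = 4, 8, 3
h = rng.standard_normal((M, H))
adj = np.ones((M, M))                            # fully connected toy graph
W1 = rng.standard_normal((H, F)) * 0.1
a = rng.standard_normal(2 * F)
w = np.array([1.0, 2.0, 1.5, 1.0])               # pattern weights of step 2
print(pattern_attention(h, adj, W1, a, w).shape)  # (4, 3)
```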
and 4, step 4: generating an entity category label of the medical text by adopting a conditional random field, and outputting a medical entity recognition result, wherein the method specifically comprises the following steps: generating entity category labels of Chinese characters or English words;
The conditional probability of each character is calculated based on the conditional random field, i.e., the probability that each character belongs to each entity class label; the label with the highest probability is assigned to the corresponding character as its entity class label, and the medical entity recognition result is then output;
and (3) carrying out sequence labeling on sentences in the medical text by adopting a conditional random field, generating entity category labels of Chinese characters or English words, and outputting a medical text entity recognition result.
For example, for a dataset, its entity class labels include: "PAD", "CLS", "SEP", "O", "B-disease and diagnosis", "I-disease and diagnosis", "B-surgery", "I-surgery", "B-anatomical site", "I-anatomical site", "B-drug", "I-drug", "B-imaging examination", "I-imaging examination", "B-laboratory test", "I-laboratory test";
For example, in the sentence "the patient has yellow skin and sclera, with decreased appetite before 4 months, after dinner, with paroxysmal abdominal pain, nausea, no diarrhea, vomiting, chest distress, suffocating, dizziness, headache, no object rotation, no fever, cough, chest pain, asthma, and light stool.", the result after conditional random field sequence labeling is [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 8, 9, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 8, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3]; each number in the list is the index of the predicted entity class label of the character at that position. The indexes are converted into the corresponding entity labels through the idx2tag function, and the final entity recognition result is O O O O O O O O B-anatomical part I-anatomical part O O O O O O O O O O O O O O O O O O O B-anatomical part O O O O O O O O O O O O B-anatomical part O O O O O O O O O O O O O O O O O O O O O O O O B.
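The idx2tag conversion mentioned above can be sketched as follows; the label inventory follows the example list given for the dataset (using its "anatomical site" naming):

```python
# Entity class labels in the order given for the example dataset
LABELS = ["PAD", "CLS", "SEP", "O",
          "B-disease and diagnosis", "I-disease and diagnosis",
          "B-surgery", "I-surgery",
          "B-anatomical site", "I-anatomical site",
          "B-drug", "I-drug",
          "B-imaging examination", "I-imaging examination",
          "B-laboratory test", "I-laboratory test"]

def idx2tag(indices):
    """Map predicted label indexes from the conditional random field back
    to entity class label strings."""
    return [LABELS[i] for i in indices]

print(idx2tag([3, 3, 8, 9, 3]))
# ['O', 'O', 'B-anatomical site', 'I-anatomical site', 'O']
```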
In order to illustrate the medical entity recognition effect of the invention, two methods were compared on the same training set and test set under identical conditions. The first method is a medical entity recognition method based on a bidirectional long short-term memory network, an attention mechanism, and a conditional random field, with a medical dictionary and part-of-speech features introduced; the second method is the medical entity recognition method of the present invention.
The evaluation indexes adopted are: precision, recall, and F1 value. The medical entity recognition results are: the prior-art method based on the bidirectional long short-term memory network, attention mechanism, and conditional random field achieves a precision of 76.42%, a recall of 73.80%, and an F1 value of 75.08%; the method of the present invention achieves a precision of 86.38%, a recall of 85.82%, and an F1 value of 86.10%. The experiments show the effectiveness of the proposed medical entity recognition method based on multi-granularity text embedding;
while the foregoing is directed to the preferred embodiment of the present invention, it is not intended that the invention be limited to the embodiment and the drawings disclosed herein. Equivalents and modifications may be made without departing from the spirit of the disclosure, which is to be considered as within the scope of the invention.

Claims (9)

1. A medical entity recognition method based on multi-granularity text embedding is characterized in that: the method comprises the following steps:
step 1: the method for constructing the multi-granularity text embedding through the pre-training language model comprises the following steps:
step 1.1: for unstructured Chinese medical texts, constructing multi-granularity text embedding;
step 1.1.1: learning of symbol embedding, segmentation embedding and covering embedding is carried out by utilizing a pre-training language model MC-Bert, and character embedding of an unstructured Chinese medical text is generated;
for an unstructured Chinese medical text, the input of a pre-training language model MC-Bert consists of three kinds of embedding, namely symbol embedding, segmentation embedding and covering embedding;
step 1.1.2: generating word embedding, part of speech embedding and phrase embedding of the Chinese medical text;
firstly, for a Chinese medical text, a jieba word segmentation tool is used for obtaining words of the Chinese medical text, a part of speech marker of the words of the Chinese medical text is obtained by using a part of speech marker Stanford posttagger, and a phrase marker of the Chinese medical text is obtained by using a syntax analyzer Stanford parser;
then, word embedding, part of speech embedding and phrase embedding are generated by using a word2vec tool;
step 1.1.3: for an unstructured Chinese medical text, generating substring embedding by using a word2vec tool, specifically:
firstly, collecting a medical term dictionary, and constructing a medical term substring set, specifically comprising the following steps: for any two terms, extracting the longest common substring of the two terms, and adding the substring to the medical term substring set; if two terms have a plurality of longest common substrings with the same length, the first longest common substring is taken and added into the medical term substring set;
secondly, generating embedded representations of all substrings in the substring set of the medical terms by using a word2vec tool;
step 1.1.4: character embedding, word embedding, part of speech embedding, phrase embedding and substring embedding of the Chinese medical text are spliced to construct multi-granularity text embedding;
step 1.2: for unstructured English medical text, constructing multi-granularity text embedding, comprising the following steps:
step 1.2.1: learning of symbol embedding, segmentation embedding and covering embedding is carried out by utilizing a pre-training language model BioBert, and word embedding of an unstructured English medical text is generated;
BioBert is a pre-training model generated by training according to English medical data;
step 1.2.2: generating character embedding, part of speech embedding and phrase embedding of English medical texts;
for English medical texts, character embedding, part-of-speech embedding and phrase embedding are generated by using a word2vec tool;
step 1.2.3: for unstructured English medical texts, generating substring embedding by using a word2vec tool;
firstly, collecting an English medical term dictionary and constructing a medical term substring set; for any two terms, extracting the longest common substring of the two terms, and adding the substring to the medical term substring set; if two terms have a plurality of longest common substrings with the same length, the first longest common substring is taken and added into the medical term substring set;
secondly, generating embedded representations of all the substrings by using a word2vec tool;
Then, for the English sentence ES = (ew_1, ew_2, ..., ew_n), where ew_1, ew_2, ..., ew_n are words, for each word ew_i (i = 1, 2, ..., n) it is judged whether the word ew_i contains substrings from the medical term substring set; suppose the word ew_i contains the substrings esubs_1, esubs_2, ..., esubs_q from the medical term substring set, whose embeddings are denoted e_es1, e_es2, ..., e_esq; then the substring embedding representation e_essi of word ew_i is: the sum of e_es1, e_es2, ..., e_esq divided by the number of substrings q; if the word ew_i contains no substring from the medical term substring set, a custom value is output;
Finally, for each word ew_i (i = 1, 2, ..., n) in the sentence ES, the corresponding substring embedding is generated according to the above steps; the substring embedding of sentence ES is E_ess = (e_ess1, e_ess2, ..., e_essn);
Step 1.2.4: splicing character embedding, word embedding, part of speech embedding, phrase embedding and substring embedding of the English medical text to construct multi-granularity text embedding;
step 2: generating the mode weight of all characters in the Chinese sentence according to the medical entity constitution mode, comprising the following steps:
step 2.1: constructing a Chinese medical entity composition mode;
The medical entity composition pattern has the form: "Y_1 + Y_2 + Y_3 + ... + Y_k";
wherein Y_1, Y_2, Y_3, ..., Y_k denote the categories of words, and "+" denotes the string concatenation operation;
the category of the words comprises negative words, clinical manifestations, anatomical parts, modifiers, disease names, physical examination, numerical values, quantifiers, and drugs;
step 2.2: generating a mode weight of characters in the Chinese sentence;
step 3: using a graph attention network and a pattern-reinforced attention mechanism to perform node embedding representation learning, comprising the steps of:
step 3.1: transforming the embedding dimensions of the Chinese character nodes or English word nodes by using a fully connected layer;
inputting the multi-granularity text embedding of each character in the Chinese sentence CS into a fully connected layer, and converting the embedding dimension of the multi-granularity text embedding of the Chinese sentence CS; the dimension is converted because the multi-granularity text embedding dimension needs to be consistent with the node vector input dimension of the graph attention network used in step 3.2;
similarly, the multi-granularity text embedding of each word in the English sentence ES is input into the fully connected layer, and the embedding dimension of the English multi-granularity text embedding is converted;
in the fully connected layer, the dimensionality is first converted through a linear layer; then the dropout method is used to prevent overfitting; finally, the activation function ReLU is applied to alleviate vanishing gradients;
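A minimal sketch of such a fully connected layer (linear transform, then dropout, then ReLU), written in NumPy rather than a specific deep-learning framework; the input dimension 10, output dimension 6, and dropout rate are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def fully_connected(x, W, b, p_drop=0.1, train=True):
    """Linear layer -> dropout -> ReLU, as in step 3.1."""
    y = x @ W + b                              # convert the embedding dimension
    if train:                                  # inverted dropout to prevent overfitting
        mask = rng.random(y.shape) >= p_drop
        y = y * mask / (1.0 - p_drop)
    return np.maximum(y, 0.0)                  # ReLU activation

# Map 10-dimensional multi-granularity embeddings of 5 characters
# to 6-dimensional node vectors for the graph attention network.
x = rng.normal(size=(5, 10))
W = rng.normal(size=(10, 6))
b = np.zeros(6)
out = fully_connected(x, W, b, train=False)
```

At inference time (`train=False`) dropout is disabled, which matches the usual convention.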
step 3.2: for the Chinese medical text, multiplying the attention weight of the Chinese character nodes in the graph attention network by the mode weight of the characters in the Chinese sentence; for the English medical text, setting the mode weight of the English word nodes in the graph attention network to 1;
for the Chinese medical text, the node embedding of the graph attention network is the character embedding generated in step 3.1; for the English medical text, the node embedding of the graph attention network is the word embedding generated in step 3.1;
step 3.2.1: calculating attention weights of nodes in the graph attention network;
step 3.2.2: updating attention weights of nodes in the graph attention network;
step 4: generating entity category labels of the medical text by adopting a conditional random field, and outputting the medical entity recognition result, specifically: generating entity category labels of Chinese characters or English words;
calculating the conditional probability distribution of each character based on the conditional random field, namely calculating the probability that each character belongs to each entity category label, assigning the label with the highest probability to the corresponding character as its entity category label, and outputting the medical entity recognition result;
that is, a conditional random field is adopted to perform sequence labeling on the sentences in the medical text, generate the entity category labels of Chinese characters or English words, and output the medical text entity recognition result.
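The label assignment in step 4 can be sketched with Viterbi decoding over emission and transition scores, which is how a conditional random field selects the highest-scoring label sequence; the toy scores and the three-label scheme (e.g. 0 = "O", 1 = "B-Disease", 2 = "I-Disease") are illustrative assumptions:

```python
def viterbi_decode(emissions, transitions):
    """Return the highest-scoring label sequence for one sentence.

    emissions[t][y]: score of label y for the t-th character;
    transitions[y1][y2]: score of moving from label y1 to label y2.
    """
    n_labels = len(emissions[0])
    score = list(emissions[0])                 # best score ending in each label
    back = []                                  # backpointers per position
    for t in range(1, len(emissions)):
        new_score, ptr = [], []
        for y in range(n_labels):
            best_prev = max(range(n_labels),
                            key=lambda y1: score[y1] + transitions[y1][y])
            new_score.append(score[best_prev] + transitions[best_prev][y]
                             + emissions[t][y])
            ptr.append(best_prev)
        score = new_score
        back.append(ptr)
    # Trace back the best path from the best final label.
    best = max(range(n_labels), key=lambda y: score[y])
    path = [best]
    for ptr in reversed(back):
        best = ptr[best]
        path.append(best)
    return path[::-1]

# Three characters; the scores favour B-Disease, I-Disease, then O.
labels = viterbi_decode([[0, 2, 0], [0, 0, 2], [2, 0, 0]], [[0, 0, 0]] * 3)
```

A trained CRF would additionally learn the transition scores; here they are set to zero for simplicity.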
2. The method for recognizing medical entities based on multi-granularity text embedding according to claim 1, wherein: the MC-Bert in step 1.1.1 is a pre-trained model generated by training on Chinese medical data; the symbol embedding in step 1.1.1 refers to the vector representation of each word; the segmentation embedding is used to distinguish two natural language sentences, and since the medical entity recognition task recognizes entities by taking sentences as units, every word has the same segmentation embedding; in the covering embedding, if the current position holds a character of the input sentence, the value is assigned 1; if the current position is padding, i.e. not a character of the input sentence, the value is assigned 0.
3. The method for recognizing medical entities based on multi-granularity text embedding according to claim 2, wherein: in step 1.1.1, for the Chinese sentence CS = (cc1, cc2, ..., ccn), where cc1, cc2, ..., ccn are characters, the pre-trained language model MC-Bert is adopted to generate the character embedding representation Ecc of sentence CS, as shown in (1):
Ecc = (ecc1, ecc2, ..., eccn), Ecc ∈ R^(n×m)   (1)
wherein n is the sentence length, namely the number of characters in the sentence, padded with 0 if there are fewer than n characters; m is the dimension set by the pre-trained model MC-Bert; ecci (i = 1, 2, ..., n) is the embedding of character cci; the character embedding representation Ecc has n rows and m columns; R^(n×m) denotes the set of real matrices with n rows and m columns.
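The zero-padding of sentences shorter than n can be sketched as follows; the sentence length n = 5 and the dimension m = 768 (a common BERT hidden size) are illustrative assumptions:

```python
import numpy as np

def pad_character_embeddings(embeddings, n):
    """Pad (or truncate) a stack of m-dimensional character embeddings to n rows."""
    m = embeddings.shape[1]
    padded = np.zeros((n, m))                  # fill with 0 when fewer than n characters
    rows = min(len(embeddings), n)
    padded[:rows] = embeddings[:rows]
    return padded

# A 3-character sentence padded to length n = 5 with m = 768.
ecc = np.ones((3, 768))
Ecc = pad_character_embeddings(ecc, 5)
```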
4. The method for recognizing medical entities based on multi-granularity text embedding according to claim 3, wherein step 1.1.2 specifically comprises: for the Chinese sentence CS = (cc1, cc2, ..., ccn), where cc1, cc2, ..., ccn are characters: word embedding, specifically: obtaining the words of the Chinese medical text based on the jieba word segmentation tool, and obtaining the word embedding of the word each character belongs to using word2vec; part-of-speech embedding, specifically: based on the part-of-speech tags of the words of the Chinese medical text, obtaining the part-of-speech embedding of the word each character belongs to using word2vec; phrase embedding, specifically: based on the phrase tags of the Chinese medical text, obtaining the embedding of the phrase type each character belongs to using word2vec;
wherein the word embedding is Ecw = (ecw1, ecw2, ..., ecwn), where the words to which characters cc1, cc2, ..., ccn belong are, in order, cw1, cw2, ..., cwn; ecwi (i = 1, 2, ..., n) is the embedding of word cwi;
the part-of-speech embedding is Ecpos = (ecpo1, ecpo2, ..., ecpon), where the parts of speech of words cw1, cw2, ..., cwn are cpo1, cpo2, ..., cpon; ecpoi (i = 1, 2, ..., n) is the embedding of part of speech cpoi;
the phrase embedding is Ecph = (ecph1, ecph2, ..., ecphn), where the phrase types to which characters cc1, cc2, ..., ccn belong are, in order, cph1, cph2, ..., cphn; ecphi (i = 1, 2, ..., n) is the embedding of phrase type cphi.
5. The method for recognizing medical entities based on multi-granularity text embedding according to claim 4, wherein: after the word2vec tool is used to generate the embedding representations of all the substrings in step 1.1.3, it is judged for each word cwi (i = 1, 2, ..., n) whether the word cwi contains substrings in the medical term substring set; for the Chinese sentence CS = (cc1, cc2, ..., ccn), where cc1, cc2, ..., ccn are characters, the words to which cc1, cc2, ..., ccn belong are, in order, cw1, cw2, ..., cwn; suppose word cwi contains the substrings csubs1, csubs2, ..., csubsp of the medical term substring set, and the embeddings of csubs1, csubs2, ..., csubsp are denoted ecs1, ecs2, ..., ecsp respectively; then the substring embedding representation ecssi of word cwi is the sum of ecs1, ecs2, ..., ecsp divided by the number of substrings p; if the word cwi contains no substring of the medical term substring set, a custom value is output;
finally, for each character cci (i = 1, 2, ..., n) in sentence CS, the corresponding substring embedding is generated according to the above steps; the substring embedding of sentence CS is Ecss = (ecss1, ecss2, ..., ecssn).
6. The method for recognizing medical entities based on multi-granularity text embedding according to claim 5, wherein step 1.1.4 specifically comprises:
for the Chinese sentence CS = (cc1, cc2, ..., ccn), where cc1, cc2, ..., ccn are characters, constructing the multi-granularity text embedding of character cci (i = 1, 2, ..., n), namely concatenating the character embedding Ecc, word embedding Ecw, part-of-speech embedding Ecpos, phrase embedding Ecph and substring embedding Ecss of sentence CS to construct the multi-granularity text embedding of sentence CS, as shown in (2):
Ecme = Concate(Ecc, Ecw, Ecpos, Ecph, Ecss)   (2)
wherein Concate represents the concatenation operation; in addition, the dimension of Ecme equals the dimension of Ecc + the dimension of Ecw + the dimension of Ecpos + the dimension of Ecph + the dimension of Ecss;
thus, through steps 1.1.1 to 1.1.4, the multi-granularity text embedding of Chinese character cci (i = 1, 2, ..., n) is constructed and denoted Ecme.
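The concatenation in (2) can be sketched as follows; the per-granularity dimensions (8, 6, 3, 3, 5) and sentence length are illustrative assumptions:

```python
import numpy as np

n = 4                                   # sentence length in characters
# Illustrative dimensions for the five granularities.
Ecc   = np.zeros((n, 8))                # character embedding
Ecw   = np.zeros((n, 6))                # word embedding
Ecpos = np.zeros((n, 3))                # part-of-speech embedding
Ecph  = np.zeros((n, 3))                # phrase embedding
Ecss  = np.zeros((n, 5))                # substring embedding

# Concate: join the five embeddings along the feature axis, so the dimension
# of Ecme is the sum of the five component dimensions (8+6+3+3+5 = 25).
Ecme = np.concatenate([Ecc, Ecw, Ecpos, Ecph, Ecss], axis=1)
```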
7. The method of claim 6, wherein: in step 1.2.1, for the English sentence ES = (ew1, ew2, ..., ewn), where ew1, ew2, ..., ewn are words, the pre-trained language model BioBert is used to generate the word embedding representation of sentence ES, as shown in (3):
Eew = (eew1, eew2, ..., eewn), Eew ∈ R^(n×m)   (3)
wherein n is the sentence length, namely the number of words in the sentence, padded with 0 if there are fewer than n words; m is the dimension set by the pre-trained model BioBert; eewi (i = 1, 2, ..., n) is the embedding of word ewi;
in step 1.2.2, for the English sentence ES = (ew1, ew2, ..., ewn), where ew1, ew2, ..., ewn are words, the character embedding of ewi (i = 1, 2, ..., n) is the average of the embeddings of all the characters composing ewi, the part-of-speech embedding is the embedding of the part of speech of ewi, and the phrase embedding is the embedding of the phrase type ewi belongs to;
wherein the character embedding is Eec = (eec1, eec2, ..., eecn), where eeci (i = 1, 2, ..., n) is the character embedding of word ewi;
the part-of-speech embedding is Eepos = (eepo1, eepo2, ..., eepon), where the parts of speech of words ew1, ew2, ..., ewn are, in order, epo1, epo2, ..., epon; eepoi (i = 1, 2, ..., n) is the embedding of part of speech epoi;
the phrase embedding is Eeph = (eeph1, eeph2, ..., eephn), where the phrase types to which words ew1, ew2, ..., ewn belong are, in order, eph1, eph2, ..., ephn; eephi (i = 1, 2, ..., n) is the embedding of phrase type ephi.
8. The method of claim 7, wherein step 1.2.4 specifically comprises:
for the English sentence ES = (ew1, ew2, ..., ewn), where ew1, ew2, ..., ewn are words, constructing the multi-granularity text embedding of ewi (i = 1, 2, ..., n), namely concatenating the character embedding Eec, word embedding Eew, part-of-speech embedding Eepos, phrase embedding Eeph and substring embedding Eess of sentence ES to construct the multi-granularity text embedding of sentence ES, as shown in (4):
Eeme = Concate(Eec, Eew, Eepos, Eeph, Eess)   (4)
wherein Concate represents the concatenation operation; in addition, the dimension of Eeme equals the dimension of Eec + the dimension of Eew + the dimension of Eepos + the dimension of Eeph + the dimension of Eess;
thus, through steps 1.2.1 to 1.2.4, the multi-granularity text embedding of English word ewi (i = 1, 2, ..., n) is constructed and denoted Eeme.
9. The method for recognizing medical entities based on multi-granularity text embedding according to claim 8, wherein: in step 2.2, for the Chinese sentence CS = (cc1, cc2, ..., ccn), where cc1, cc2, ..., ccn are characters, it is judged whether the Chinese sentence CS matches the medical entity constitution modes, and the constructed pattern matching weight vector is (w1, w2, ..., wn);
case 1: if the character string cci, cci+1, ..., ccj satisfies an "anatomical region", "disease" or "medicine" pattern, each of the characters cci, cci+1, ..., ccj is given a mode weight of 2;
case 2: if the character string cci, cci+1, ..., ccj satisfies another pattern, each of the characters cci, cci+1, ..., ccj is given a mode weight of 1.5;
case 3: if the character string cci, cci+1, ..., ccj satisfies no pattern, each of the characters cci, cci+1, ..., ccj is given a mode weight of 1;
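The three cases can be sketched as a small weight-assignment routine; how pattern matches are found is not specified here, so this sketch assumes the matched spans (with end index exclusive) are already available:

```python
def pattern_weights(sentence, matches):
    """Assign a mode weight to every character of `sentence`.

    matches: list of (start, end, category) spans matched by medical entity
    constitution modes, end exclusive. Unmatched characters keep weight 1.
    """
    weights = [1.0] * len(sentence)            # case 3: no pattern matched
    for start, end, category in matches:
        # case 1 categories get weight 2; any other matched pattern gets 1.5.
        w = 2.0 if category in ("anatomical region", "disease", "medicine") else 1.5
        for i in range(start, end):
            weights[i] = w
    return weights

# A 6-character sentence where characters 0-2 match a "disease" pattern
# and characters 3-4 match some other pattern (illustrative spans).
w = pattern_weights("abcdef", [(0, 3, "disease"), (3, 5, "clinical manifestation")])
```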
step 3.2.1, specifically:
first, the multi-granularity text embedding h of a sentence is input into the attention layer of the graph attention network, as shown in (5):
h = (h1, h2, ..., hM), h ∈ R^(M×H)   (5)
wherein hi (i = 1, 2, ..., M) is the embedding of a Chinese character node or English word node of one sentence of the medical text, M is the number of nodes, and H is the dimension of the node embedding;
the input node embeddings are linearly transformed, converting them to the dimension of the number of all category labels; the attention weight is then calculated with the LeakyRelu function, namely the importance degree euv of node v to node u, as shown in (6):
euv = LeakyRelu(aT[W1·hu ∥ W1·hv])   (6)
wherein W1 represents a shared weight matrix, a is the attention scoring vector, hu is the embedding of the Chinese character node or English word node u of one sentence of the medical text, and hv is the embedding of the Chinese character node or English word node v of one sentence of the medical text;
then, euv is normalized with the Softmax function to obtain αuv, as shown in (7):
αuv = exp(euv) / Σk∈Nu exp(euk)   (7)
wherein euk represents the importance degree of node k to node u, αuv denotes the normalized value of euv, and Nu denotes the neighbor nodes of node u;
finally, the attention weight αu of node u is generated, as shown in (8):
αu = Σv∈Nu αuv·W2·hv   (8)
wherein W2 represents a weight matrix;
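The attention-coefficient computation of (6) and (7) can be sketched in NumPy; the additive scoring vector `a` and the dimensions follow the standard graph attention formulation and are illustrative assumptions where the claim leaves details open:

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def attention_coefficients(h, W1, a, neighbors):
    """e_uv = LeakyReLU(a^T [W1 h_u ; W1 h_v]), softmax-normalised over u's neighbors."""
    g = h @ W1                                   # shared linear transform of node embeddings
    M = len(h)
    alpha = np.zeros((M, M))
    for u in range(M):
        e = np.array([leaky_relu(a @ np.concatenate([g[u], g[v]]))
                      for v in neighbors[u]])
        e = np.exp(e - e.max())                  # softmax normalisation as in (7)
        alpha[u, neighbors[u]] = e / e.sum()
    return alpha

rng = np.random.default_rng(1)
h = rng.normal(size=(3, 4))                      # 3 nodes, embedding dimension 4
W1 = rng.normal(size=(4, 2))
a = rng.normal(size=4)                           # scoring vector over [g_u ; g_v]
alpha = attention_coefficients(h, W1, a, {0: [1, 2], 1: [0, 2], 2: [0, 1]})
```

Each row of `alpha` holds the normalised weights of one node's neighbors and therefore sums to 1.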
step 3.2.2, specifically:
first, for node u, the attention weight αu of node u is updated using the mode weight wu of the Chinese character or English word represented by node u, as shown in (9):
αu = αu × wu   (9)
second, an attention weight attentionl (1 ≤ l ≤ k) is constructed for the sentence, as shown in (10):
attentionl = (α1, α2, ..., αM)   (10)
then, a multi-head attention mechanism is introduced into the graph attention network, specifically: k attention weights are calculated, each attention weight is multiplied by the input h, and a feature h′l of the sentence is generated, as shown in (11):
h′l = attentionl × h   (11)
through the activation function elu, the single-head outputs elu(h′1), elu(h′2), ..., elu(h′k) are obtained;
thirdly, the outputs of the k heads are concatenated to generate h′, as shown in (12):
h′ = Concat(elu(h′1), elu(h′2), ..., elu(h′k))   (12)
finally, the final output hfinal is generated through the log_softmax function, as shown in (13):
hfinal = log_softmax(h′)   (13).
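One possible reading of steps (9)-(13) can be sketched in NumPy as below; treating each head's attention as a vector of per-node weights, and the sizes M, H, k, are illustrative assumptions:

```python
import numpy as np

def log_softmax(x, axis=-1):
    z = x - x.max(axis=axis, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=axis, keepdims=True))

def pattern_reinforced_attention(h, attn_heads, mode_w):
    """Steps (9)-(13): scale each head's node weights by the mode weights,
    apply them to the input h, pass through elu, concatenate the k heads,
    and finish with log_softmax."""
    elu = lambda x: np.where(x > 0, x, np.exp(x) - 1.0)
    outs = []
    for attn in attn_heads:                      # attn: (M,) weights, one per node
        attn = attn * mode_w                     # (9): multiply by the mode weights
        outs.append(elu(attn @ h))               # (11): h'_l = attention_l × h
    return log_softmax(np.concatenate(outs))     # (12)-(13)

rng = np.random.default_rng(2)
M, H, k = 4, 3, 2                                # nodes, embedding dim, heads
h = rng.normal(size=(M, H))
heads = [rng.random(M) for _ in range(k)]
mode_w = np.array([2.0, 1.5, 1.0, 1.0])          # mode weights of the characters
out = pattern_reinforced_attention(h, heads, mode_w)
```

The output concatenates k heads of dimension H each, so `out` has k·H entries and exponentiates to a distribution summing to 1.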
CN202110890112.9A 2021-06-09 2021-08-04 Medical entity identification method based on multi-granularity text embedding Active CN113779993B (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110641595 2021-06-09
CN2021106415959 2021-06-09

Publications (2)

Publication Number Publication Date
CN113779993A true CN113779993A (en) 2021-12-10
CN113779993B CN113779993B (en) 2023-02-28

Family

ID=78836880

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110890112.9A Active CN113779993B (en) 2021-06-09 2021-08-04 Medical entity identification method based on multi-granularity text embedding

Country Status (1)

Country Link
CN (1) CN113779993B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114972929A (en) * 2022-07-29 2022-08-30 中国医学科学院医学信息研究所 Pre-training method and device for medical multi-modal model
CN115512859A (en) * 2022-11-21 2022-12-23 北京左医科技有限公司 Internet-based in-clinic quality management method, management device and storage medium
CN115618824A (en) * 2022-10-31 2023-01-17 上海苍阙信息科技有限公司 Data set labeling method and device, electronic equipment and medium
CN116629267A (en) * 2023-07-21 2023-08-22 云筑信息科技(成都)有限公司 Named entity identification method based on multiple granularities

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111079377A (en) * 2019-12-03 2020-04-28 哈尔滨工程大学 Method for recognizing named entities oriented to Chinese medical texts
US20200226472A1 (en) * 2019-01-10 2020-07-16 Arizona Board Of Regents On Behalf Of Arizona State University Systems and methods for a supra-fusion graph attention model for multi-layered embeddings and deep learning applications
CN111444721A (en) * 2020-05-27 2020-07-24 南京大学 Chinese text key information extraction method based on pre-training language model
CN112101031A (en) * 2020-08-25 2020-12-18 厦门渊亭信息科技有限公司 Entity identification method, terminal equipment and storage medium
CN112541356A (en) * 2020-12-21 2021-03-23 山东师范大学 Method and system for recognizing biomedical named entities
CN112597774A (en) * 2020-12-14 2021-04-02 山东师范大学 Chinese medical named entity recognition method, system, storage medium and equipment


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
MA NIANZU et al.: "Entity-Aware Dependency-based Deep Graph Attention Network for Comparative Preference Classification", NSF Public Access *


Also Published As

Publication number Publication date
CN113779993B (en) 2023-02-28

Similar Documents

Publication Publication Date Title
CN113779993B (en) Medical entity identification method based on multi-granularity text embedding
CN110032648B (en) Medical record structured analysis method based on medical field entity
Finkel et al. Nested named entity recognition
CN112597774B (en) Chinese medical named entity recognition method, system, storage medium and equipment
CN110162784B (en) Entity identification method, device and equipment for Chinese medical record and storage medium
Warjri et al. Identification of pos tag for khasi language based on hidden markov model pos tagger
Liang et al. Asynchronous deep interaction network for natural language inference
CN111523320A (en) Chinese medical record word segmentation method based on deep learning
CN114927177A (en) Medical entity identification method and system fusing Chinese medical field characteristics
Goswami et al. ULD@ NUIG at SemEval-2020 Task 9: Generative morphemes with an attention model for sentiment analysis in code-mixed text
Yohannes et al. A method of named entity recognition for tigrinya
Wang et al. A hybrid model based on deep convolutional network for medical named entity recognition
CN110543630A (en) Method and device for generating text structured representation and computer storage medium
Ahnaf et al. An improved extrinsic monolingual plagiarism detection approach of the Bengali text.
CN115759102A (en) Chinese poetry wine culture named entity recognition method
CN114444467A (en) Traditional Chinese medicine literature content analysis method and device
Baruah et al. Character coreference resolution in movie screenplays
Nunsanga et al. Part-of-speech tagging in Mizo language: A preliminary study
Bharti et al. Sarcasm as a contradiction between a tweet and its temporal facts: a pattern-based approach
Wen et al. Improving Extraction of Chinese Open Relations Using Pre-trained Language Model and Knowledge Enhancement
Asker et al. Classifying Amharic webnews
Patkar et al. A Neural Network Based Machine Translation model For English To Ahirani Language
Sil et al. The IBM Systems for Entity Discovery and Linking at TAC 2017.
Barathi Ganesh et al. MedNLU: natural language understander for medical texts
Nizami et al. Hindustani or hindi vs. urdu: A computational approach for the exploration of similarities under phonetic aspects

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant