CN113177416B - Event element detection method combining sequence labeling and pattern matching - Google Patents

Event element detection method combining sequence labeling and pattern matching

Info

Publication number
CN113177416B
CN113177416B CN202110532819.2A CN202110532819A CN113177416B CN 113177416 B CN113177416 B CN 113177416B CN 202110532819 A CN202110532819 A CN 202110532819A CN 113177416 B CN113177416 B CN 113177416B
Authority
CN
China
Prior art keywords
event
elements
treatment
sentence
diagnosis
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110532819.2A
Other languages
Chinese (zh)
Other versions
CN113177416A (en
Inventor
翟鹏珺
王晨
方钰
徐蔚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tongji University
Original Assignee
Tongji University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tongji University
Priority to CN202110532819.2A priority Critical patent/CN113177416B/en
Publication of CN113177416A publication Critical patent/CN113177416A/en
Application granted granted Critical
Publication of CN113177416B publication Critical patent/CN113177416B/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/10Text processing
    • G06F40/166Editing, e.g. inserting or deleting
    • G06F40/186Templates
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/211Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Medical Treatment And Welfare Office Work (AREA)

Abstract

Current research on event element detection for Chinese medical affair knowledge graphs relies mainly on a single pattern matching model or a single deep learning model. It does not divide event elements into long and short granularity levels, cannot effectively extract long-sentence-level elements, and yields models with low flexibility and poor generalization. The invention therefore provides an event element detection method combining sequence labeling and pattern matching. Its innovation is that event elements are distinguished by granularity according to event type, and elements of different granularity are detected with different methods. First, short-word-level event elements are detected with a BERT-BiLSTM-CRF sequence labeling model that incorporates corpus features such as entity information and trigger word information, which makes the approach highly extensible. Second, long-sentence-level event elements are detected with a pattern matching method that uses dependency syntactic analysis, which improves the accuracy of event element detection.

Description

Event element detection method combining sequence labeling and pattern matching
Technical Field
The invention relates to the field of event element detection for event extraction in computer natural language processing.
Medical event element detection is an important subtask in the medical affairs knowledge graph construction task.
Background
Against the background of rapidly developing smart cities, intelligent information technology has been widely applied in fields such as social life, industrial production, and city construction, so that it can better serve people. In recent years, research on intelligent medical care has attracted attention, especially natural language processing tasks for Chinese electronic medical records, which include event element detection for medical affair knowledge graphs.
Event element detection is an important and challenging subtask of information extraction. According to the definition of events given by the ACE (Automatic Content Extraction) conference, an event element is descriptive information about one or more roles participating in an event, or about its time, place, and so on, and each event type defines its corresponding event element roles. For example, in the sentence "On 2014-02-03 the patient underwent total hysterectomy, bilateral salpingectomy, and pelvic adhesiolysis in our hospital for complex hyperplasia of the endometrium", the operation event includes the time element "2014-02-03", the disease element "complex hyperplasia of the endometrium", and the operation name elements "total hysterectomy", "bilateral salpingectomy", and "pelvic adhesiolysis". All elements in this event are expressions of entity granularity, namely short-word-level event elements.
Most existing research on event element detection concerns short-word-level event elements and uses either a single pattern matching model or a single sequence labeling deep learning model. The more popular approach is sequence labeling, where a Bi-LSTM (Bidirectional Long Short-Term Memory) model combined with a CRF is used most often and performs well. The Bi-LSTM captures useful context in both the forward and backward directions of a sentence, while the CRF uses sentence-level and neighboring label information when predicting the current label. However, most existing models do not divide event elements into long and short granularity levels, and a single sequence labeling model cannot effectively extract long-sentence-level elements.
Disclosure of Invention
In view of the prior art, the invention provides an event element detection method combining sequence labeling and pattern matching. Based on the characteristics of diagnosis and treatment events in electronic medical records, it designs a BERT-BiLSTM-CRF sequence labeling model that incorporates features such as trigger word information, entity information, and dependency syntactic information, so that the model suits event element detection for electronic medical record event sentences of various styles. In addition, for event sentences containing long-sentence-level event elements, a template matching method is designed according to the sentence pattern characteristics of the text, using syntactic structure analysis and dependency syntactic features, to detect long-sentence-level event elements.
Event element detection for medical affair knowledge graphs is one of the important subtasks in constructing affair knowledge graphs in the intelligent medical field, and extracting disease diagnosis and treatment event information from rich electronic medical record text is of great significance for assisting doctors in diagnosis and decision-making. At present, event element detection research for constructing Chinese medical affair knowledge graphs relies mainly on a single pattern matching method or a single sequence labeling deep learning method; it does not consider the granularity characteristics of event elements in the event corpus and ignores the syntactic structure information in diagnosis and treatment event sentences.
To address these problems, the invention aims to detect the event elements in diagnosis and treatment event sentences. It divides the event elements of Chinese electronic medical record text into long and short granularity levels and, using features such as the dependency syntax of the event sentences, designs an event element detection method combining sequence labeling and pattern matching, so that the model can detect short-word-level and long-sentence-level event elements at the same time and the generalization performance and accuracy of the model are improved.
In order to achieve the purpose, the technical scheme provided by the invention is as follows:
The invention provides an event element detection method combining sequence labeling and pattern matching, which comprises the following steps:
step 1, preprocessing a Chinese medical history text;
step 2, defining a Chinese diagnosis and treatment event element representation template;
step 3, constructing semantic feature vectors contained in the event sentences;
step 4, performing granularity division on the event elements obtained in the step 2;
step 5, acquiring the corresponding event elements by adopting a sequence labeling method and a pattern matching method respectively, according to the result of the division in step 4, wherein the sequence labeling method integrates the semantic feature vectors obtained in step 3.
Advantageous effects
Existing research on event element detection for Chinese diagnosis and treatment affair knowledge graphs does not distinguish the granularity of event elements, ignores the syntactic structure of event sentences, and extracts event elements with a single model only. To address these problems, the invention realizes an event element detection method combining sequence labeling and pattern matching. To detect event elements of different granularity in Chinese diagnosis and treatment events, the event elements are first divided into short-word-level and long-sentence-level granularity. Short-word-level event elements are then detected with a BERT-BiLSTM-CRF sequence labeling model that incorporates features such as event type, intra-sentence entity information, and dependency syntax. For long-sentence-level event elements, a template matching method is designed from the sentence pattern characteristics of electronic medical record text, using dependency syntactic features. The model can therefore detect short-word-level and long-sentence-level event elements at the same time, and the overall event element detection result is improved. The invention helps advance research on event element detection for Chinese diagnosis and treatment affair knowledge graphs.
The invention was evaluated in an event element detection experiment on a current medical history data set from Chinese electronic medical records; after event element granularity was distinguished and a different method was applied to each granularity, the detection result improved markedly. The event element detection method combining sequence labeling and pattern matching can further be applied to the construction of Chinese diagnosis and treatment affair knowledge graphs, and is significant for smart-city tasks such as computer-aided diagnosis and treatment and automatic medical question answering.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and, together with the description, serve to explain the invention without limiting the invention. In the drawings:
FIG. 1 is a flow diagram of event element detection for joint sequence tagging and pattern matching;
FIG. 2 is a flow chart of sequence labeling for short-word-level event elements in step five;
FIG. 3 is a flow chart of pattern matching for long-sentence-level event elements in step five;
FIG. 4 is an exemplary diagram of pathology test event dependency analysis.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention more apparent, a detailed description of the embodiments of the present invention will be given below with reference to the accompanying drawings. It is to be understood that the specific embodiments described herein are merely illustrative and explanatory of the invention and are not restrictive thereof.
The specific implementation process of the invention is shown in fig. 1 and comprises the following 5 steps:
step 1, preprocessing a Chinese medical history text;
step 2, defining a Chinese diagnosis and treatment event element representation template;
step 3, constructing semantic feature vectors contained in the event sentences;
step 4, performing granularity division on the event elements obtained in the step 2;
step 5, acquiring the corresponding event elements by adopting a sequence labeling method and a pattern matching method respectively, according to the result of the division in step 4, wherein the sequence labeling method integrates the semantic feature vectors obtained in step 3. Each step is described in detail below.
The first step: preprocess the Chinese current medical history text.
The current medical history section of a Chinese electronic medical record is the core part of the patient's record. It covers the whole process from onset of illness to diagnosis and treatment and contains abundant diagnosis and treatment events, so current medical history texts are collected to construct a Chinese diagnosis and treatment event data set. The specific data preprocessing process is as follows:
1.1: delete sentences irrelevant to medical events from the data set, and standardize irregular punctuation in the text;
1.2: if an event sentence is too long, or one sentence contains several diagnosis and treatment events, split it so that each sentence in the data set corresponds to one diagnosis and treatment event;
1.3: because different doctors have different writing habits, the words that refer to the same disease, operation, or medicine differ across current medical history texts, and abbreviations, shorthand, and variants occur; this complicates subsequent work, so medical vocabulary such as disease, operation, and medicine names needs to be unified;
1.4: label the event type and event trigger word of each event sentence. The event types comprise 8 types: admission, examination, pathological examination, immunohistochemical examination, treatment, chemotherapy, operation, and diagnosis; the corresponding trigger words include, respectively, diagnosis, gastroscopy, disease examination, staining examination, symptomatic treatment, chemical drug treatment, administration, and confirmed diagnosis. Once the event types are labeled, data preprocessing is complete, and the preprocessed data is provided to the second, third, fourth, and fifth steps.
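Sub-steps 1.1 to 1.3 can be pictured with a minimal sketch. Everything specific here is an assumption made for illustration rather than part of the described method: the punctuation table, the rule that events are split at "。" and "；", and the two-entry variant dictionary `TERM_MAP` are all hypothetical.

```python
import re

# Hypothetical variant-to-canonical term table (assumption, not from the source).
TERM_MAP = {"胃Ca": "胃癌", "希罗达片": "希罗达"}

# Assumption: half-width punctuation is normalised to full-width Chinese marks.
PUNCT_MAP = str.maketrans({";": "；", ",": "，", ":": "："})


def normalize_punctuation(text):
    """Step 1.1 (partially): standardise punctuation and collapse repeated full stops."""
    text = text.translate(PUNCT_MAP)
    return re.sub(r"。{2,}", "。", text)


def split_events(text):
    """Step 1.2: split at sentence-level marks, assuming one event per clause."""
    return [part.strip() for part in re.split(r"[。；]", text) if part.strip()]


def unify_terms(sentence, term_map=TERM_MAP):
    """Step 1.3: replace shorthand or variant terms with one canonical term."""
    for variant, canonical in term_map.items():
        sentence = sentence.replace(variant, canonical)
    return sentence


def preprocess(text):
    """Run the three sub-steps in order and return a list of event sentences."""
    return [unify_terms(s) for s in split_events(normalize_punctuation(text))]
```

Labeling event types and trigger words (1.4) is a manual annotation step and is not covered by the sketch.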
The second step: define the Chinese diagnosis and treatment event element representation template.
Following the ACE conference's definition of events and based on a statistical analysis of the data set from the first step, event elements are designed for each event type according to the elements that participate in the event:
admission event: admission time, symptoms, operation names, and diseases;
examination event: examination time, diseases, and examination results;
pathological examination event: examination time, diseases, and pathological examination results;
immunohistochemical event: detection time and immunohistochemical staining result;
general treatment event: treatment time and general treatment means;
chemotherapy event: chemotherapy time and chemotherapy drugs;
operation event: operation time, symptoms, operation names, and diseases;
diagnosis event: diagnosis time, symptoms, diseases, and operation names.
In addition, the admission event contains at least one of its four types of event elements, and there may be several symptom, operation name, and disease elements. The examination event must contain an examination result element, and the examination time may be absent. The pathological examination event must contain a pathological examination result element, and the examination time may be absent. The general treatment event must contain general treatment means elements, of which there may be several, and the treatment time element may be absent. The chemotherapy event must contain chemotherapy drug elements, and there may be several chemotherapy time and chemotherapy drug elements. The operation event must contain operation name elements; other types of elements may be absent, and there may be several symptom, disease, and operation name elements. The diagnosis event must contain disease elements; other event elements may be absent, and there may be several symptom elements. The defined event elements are provided to the third, fourth, and fifth steps.
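One way to hold such templates in code is a small record per event type that lists its roles and which of them are mandatory. The sketch below is illustrative only: the class name `EventTemplate`, the `validate` helper, and the choice to show just three event types are assumptions, and the role strings simply reuse the wording of the text above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EventTemplate:
    event_type: str
    roles: tuple      # every element role defined for this event type
    required: tuple   # roles that must be present in a valid event


# Three of the eight templates, transcribed from the description above.
TEMPLATES = {
    "general treatment": EventTemplate(
        "general treatment",
        ("treatment time", "general treatment means"),
        ("general treatment means",),
    ),
    "chemotherapy": EventTemplate(
        "chemotherapy",
        ("chemotherapy time", "chemotherapy drug"),
        ("chemotherapy drug",),
    ),
    "operation": EventTemplate(
        "operation",
        ("operation time", "symptom", "operation name", "disease"),
        ("operation name",),
    ),
}


def validate(event_type, elements):
    """Check a {role: [values]} dict against its template; return a list of problems."""
    template = TEMPLATES[event_type]
    problems = [f"unknown role: {r}" for r in elements if r not in template.roles]
    problems += [
        f"missing required role: {r}" for r in template.required if not elements.get(r)
    ]
    return problems
```

A role maps to a list because several elements of the same role may occur in one event sentence.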
The third step: construct the semantic features of the event sentences.
To help the model mine deeper semantic information more comprehensively and to shorten the time required for model training, the invention constructs four feature vectors as additional inputs of the model, according to the characteristics of the current medical history text in electronic medical records: (1) the distance feature between words and the trigger word; (2) the event type feature; (3) the entity category feature; (4) the dependency syntax information feature. These features are provided to the fifth step.
(1) Distance feature between words and the trigger word: the event elements defined in the second step are usually distributed around the event trigger word, and in the diagnosis and treatment event corpus provided in the first step the distance between each type of event element and the trigger word follows a certain distribution. Fusing the distance vector between words and the trigger word into the model can therefore provide deep syntactic information for event element extraction. The position encoding adopted by the invention is given by formulas (1) and (2):
PE_{(pos,\,2i)} = \sin\left( pos / 10000^{2i/d} \right)    (1)
PE_{(pos,\,2i+1)} = \cos\left( pos / 10000^{2i/d} \right)    (2)
Here pos is the position of the event trigger word in the input event sentence and takes an integer value from 0 to the length of the event sentence; d is the dimension of the input vector; 2i and 2i+1 index the even and odd columns of the encoding matrix, and i takes an integer value from 0 to d/2 - 1. PE_{(pos, 2i)} is the value in row pos and even column 2i of the matrix, calculated with the sine function, and PE_{(pos, 2i+1)} is the value in row pos and odd column 2i+1, calculated with the cosine function. Owing to the properties of the trigonometric functions, this position encoding can express both the absolute position and the relative position between words and the trigger word.
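Formulas (1) and (2) can be evaluated directly with NumPy. The sketch below assumes an even dimension d. The helper `distance_features`, which selects the row for each word by its absolute distance to the trigger word, is only one possible reading of how the encoding is turned into a distance feature and is an assumption, not something the text specifies.

```python
import numpy as np


def positional_encoding(length, d):
    """Return a (length, d) matrix: sine in even columns, cosine in odd columns."""
    pos = np.arange(length)[:, None]            # row index (position)
    i = np.arange(d // 2)[None, :]              # column-pair index, 0 .. d/2 - 1
    angle = pos / np.power(10000.0, 2 * i / d)  # pos / 10000^(2i/d)
    pe = np.zeros((length, d))
    pe[:, 0::2] = np.sin(angle)                 # formula (1)
    pe[:, 1::2] = np.cos(angle)                 # formula (2)
    return pe


def distance_features(n_words, trigger_index, d):
    """Assumed usage: encode each word by its absolute distance to the trigger word."""
    pe = positional_encoding(n_words, d)
    distance = np.abs(np.arange(n_words) - trigger_index)
    return pe[distance]
```

With this reading, words at equal distance on either side of the trigger word receive the same feature vector.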
(2) Event type feature: different types of events in the event corpus provided in the first step correspond to different types of event elements, so event elements and event categories are closely dependent on each other, and the event type can be used as additional input feature information for the model.
(3) Entity category feature: in the event corpus provided in the first step, the entity types of the event elements of events of the same type are generally similar. For example, "In January 2013 the patient visited our hospital due to acid reflux, nausea, and xiphoid pain, with no fever, abdominal distension, or obvious discomfort on eating" is an admission event, in which "acid reflux", "nausea", and "xiphoid pain" are the symptom elements of the admission event, and all of them are symptom-description entities. "In May 2014 the patient underwent laparoscopic extensive resection, bilateral adnexectomy, and para-aortic lymph node dissection for endometrial cancer" is an operation event, in which "extensive resection", "bilateral adnexectomy", and "para-aortic lymph node dissection" are the operation name elements of the operation event, and all of them are operation entities. "The patient started the 7th cycle of oral Xeloda chemotherapy on 2015-03-23" is a chemotherapy event, in which "Xeloda" is the chemotherapy drug element, and the term is a drug entity. Entity information can therefore be used as additional input feature information for the model.
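The text does not say how the event type and entity category are encoded as vectors. A one-hot encoding is one simple possibility, sketched here as an assumption. The entity tag set `ENTITY_TYPES` is hypothetical, and repeating the sentence-level event type on every token is likewise an illustrative choice.

```python
import numpy as np

# The eight event types named in the first step.
EVENT_TYPES = [
    "admission", "examination", "pathological examination", "immunohistochemistry",
    "treatment", "chemotherapy", "operation", "diagnosis",
]
# Hypothetical entity tag set (assumption); "O" marks tokens outside any entity.
ENTITY_TYPES = ["O", "symptom", "operation", "drug", "disease"]


def one_hot(index, size):
    vec = np.zeros(size)
    vec[index] = 1.0
    return vec


def event_type_feature(event_type, n_tokens):
    """Repeat the sentence-level event type vector for each of the n_tokens tokens."""
    vec = one_hot(EVENT_TYPES.index(event_type), len(EVENT_TYPES))
    return np.tile(vec, (n_tokens, 1))


def entity_feature(entity_tags):
    """Map one entity tag per token to a (n_tokens, n_entity_types) matrix."""
    return np.stack([one_hot(ENTITY_TYPES.index(t), len(ENTITY_TYPES)) for t in entity_tags])
```

Learned embeddings could replace the one-hot vectors without changing the overall idea.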
(4) Dependency syntax information feature: although different doctors have their own writing styles, medical texts follow certain writing conventions, so medical events of the same type in the corpus provided in the first step often follow similar grammatical structures. For example, "On March 14, 2015 the patient was given nursing treatment such as acid suppression, anti-inflammation, and antiemesis" is a general treatment event, in which "nursing treatment" is the trigger word of the treatment event and "acid suppression", "anti-inflammation", and "antiemesis" are its general treatment means elements; these three elements are in a coordinate relationship with one another and in an attributive (modifier-head) relationship with the trigger word "nursing treatment". As another example, "After the operation the patient was given symptomatic treatment such as liver protection, stomach protection, and immunity improvement, with no obvious chemotherapy side effects" is also a treatment event, in which "symptomatic treatment" is the trigger word and "liver protection", "stomach protection", and "immunity improvement" are general treatment means elements; although they are not directly connected to the trigger word grammatically, event elements of this category are still in a coordinate relationship. Grammatical components such as subject-predicate and verb-object structures in a sentence can therefore be analyzed through dependency syntactic analysis. A commonly used dependency parsing tool is the LTP natural language processing toolkit from Harbin Institute of Technology; the dependency relations and their corresponding labels are shown in Table 1.
Table 1. Dependency syntax relations (the table is an image in the source and is not reproduced here).
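To illustrate how coordinate and attributive arcs can lead from a trigger word to its event elements, the sketch below walks a hand-written dependency parse. The arcs are written by hand for the nursing treatment example and are not actual parser output; `ATT` (attribute) and `COO` (coordinate) are relation labels used by LTP, and the function name is hypothetical.

```python
def coordinated_modifiers(arcs, trigger):
    """Collect token indices attached to the trigger by ATT, plus their COO chain.

    arcs: list of (dependent_index, head_index, relation) triples.
    """
    found = {dep for dep, head, rel in arcs if head == trigger and rel == "ATT"}
    changed = True
    while changed:  # follow coordinate arcs until no new token is added
        changed = False
        for dep, head, rel in arcs:
            if rel == "COO" and head in found and dep not in found:
                found.add(dep)
                changed = True
    return sorted(found)


# Hand-written toy parse (assumption): token 3 is the trigger word.
tokens = ["acid suppression", "anti-inflammation", "antiemesis", "nursing treatment"]
arcs = [(0, 3, "ATT"), (1, 0, "COO"), (2, 0, "COO")]
elements = [tokens[i] for i in coordinated_modifiers(arcs, trigger=3)]
```

Here `elements` holds the three treatment means, recovered from one attributive arc and two coordinate arcs.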
The fourth step: divide the event elements by granularity.
According to the event element definitions in the second step, the event elements in the diagnosis and treatment event corpus provided in the first step are counted. The examination results of examination events, the pathological examination results of pathological examination events, and the immunohistochemical staining results of immunohistochemical events, which contain 20-50 words, are treated as long-sentence-level event elements, and all other event elements are treated as short-word-level event elements. The resulting granularity division is provided to the fifth step.
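The division reduces to a lookup on the pair of event type and element role. A minimal sketch follows, with the role strings taken from the text and the function name chosen for illustration.

```python
# The three (event type, role) pairs treated as long-sentence-level elements.
LONG_SENTENCE_ROLES = {
    ("examination", "examination result"),
    ("pathological examination", "pathological examination result"),
    ("immunohistochemistry", "immunohistochemical staining result"),
}


def granularity(event_type, role):
    """Return 'long' for long-sentence-level roles and 'short' for all others."""
    return "long" if (event_type, role) in LONG_SENTENCE_ROLES else "short"
```

In the fifth step, roles marked "long" would go to pattern matching and roles marked "short" to sequence labeling.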
The fifth step: detect the event elements.
5.1: short-word-level event elements are detected with a sequence labeling method that fuses the semantic dependency features of the corpus, as shown in fig. 2:
5.1.1: for the diagnosis and treatment event corpus R provided in the first step, word vectors are trained with BERT to obtain a vector representation of every word in each event sentence (the vector notation is an image in the source and is not reproduced here). These vectors are provided to step 5.1.2.
5.1.2: the four additional semantic feature vectors f_m (m = 1, 2, 3, 4) of the text R provided in the third step are fused into the word vectors. The fusion process is expressed as formula (3), in which || denotes the concatenation operation (formula (3) is an image in the source and is not reproduced here).
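Since || denotes concatenation, the fusion can be pictured as joining the word vectors and the four feature matrices along the feature axis. The sketch assumes every input has one row per token; the dimensions used in the example are arbitrary.

```python
import numpy as np


def fuse(word_vectors, features):
    """Concatenate word vectors with the feature matrices f_1..f_4 for each token."""
    return np.concatenate([word_vectors, *features], axis=-1)


# Example with arbitrary sizes: 5 tokens, 768-dimensional word vectors.
word_vectors = np.zeros((5, 768))
features = [np.zeros((5, 16)), np.zeros((5, 8)), np.zeros((5, 5)), np.zeros((5, 4))]
fused = fuse(word_vectors, features)   # one fused row per token
```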
5.1.3: the fused vectors provided in step 5.1.2 are encoded by a bidirectional LSTM network, so that the model learns which parts of the fused vectors are important and obtains short-word-level semantic information. At time step t, the fused vector is the input of the BiLSTM, and the BiLSTM outputs a hidden state; this process is expressed as formula (4). (The hidden layer size, the input and output symbols, and formula (4) are images in the source and are not reproduced here.)
5.1.4: to avoid overfitting, a Dropout layer randomly deactivates part of the bidirectional long short-term memory units of the BiLSTM in step 5.1.3; at the same time, the semantic-feature-fused vectors provided in step 5.1.2 are concatenated to further strengthen the short-word-level semantic information. In the corresponding formula there is a weight matrix and a bias b_L, the Bernoulli function randomly generates a vector of 0s and 1s, and f(*) is an activation function. (The formula and its remaining symbols are images in the source and are not reproduced here.)
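Because the formula itself is not available, the following is only a sketch of one plausible arrangement of the pieces named above: a Bernoulli 0/1 mask applied to the BiLSTM output, concatenation with the fused input vectors, then a weight matrix, a bias, and an activation. The order of operations, the use of tanh as f, and all names are assumptions.

```python
import numpy as np


def dropout_and_project(h, x, W, b, keep_prob, rng):
    """Mask BiLSTM outputs h, concatenate fused inputs x, then apply f(zW + b).

    h: (n_tokens, hidden) BiLSTM outputs; x: (n_tokens, d_in) fused vectors;
    W: (hidden + d_in, d_out) weight matrix; b: (d_out,) bias.
    """
    mask = rng.binomial(1, keep_prob, size=h.shape)  # Bernoulli vector of 0s and 1s
    z = np.concatenate([mask * h, x], axis=-1)       # dropped units are zeroed
    return np.tanh(z @ W + b)                        # tanh stands in for f (assumption)
```

Setting `keep_prob=1.0` keeps every unit, which corresponds to switching dropout off at inference time.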
5.1.5: the output of the Dropout layer in step 5.1.4 is input into the CRF layer, which produces the corresponding label sequence. For a given current medical history text R in the medical event corpus provided in the first step, all parameters of the CRF layer can be estimated by maximizing the conditional likelihood of the label sequence. In the corresponding formula, one term is the normalization factor; a second term represents the probability that an input position takes its corresponding label; a third term represents the probability that a position takes its corresponding label on the premise that another position takes its corresponding label; and λ_g and μ_v are hyperparameters. The CRF can therefore be trained by maximizing the log-likelihood function over the corpus, which yields the most accurate short-word-level event element detection result. (The formulas and symbols of this step are images in the source and are not reproduced here.)
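The general shape of such a likelihood can be shown with a standard linear-chain CRF, in which the log-likelihood of a label sequence is its path score minus the log of the normalization factor. This is the textbook formulation, not a reconstruction of the source formulas: the emission and transition score matrices and all function names are assumptions.

```python
import numpy as np


def logsumexp(a, axis):
    """Numerically stable log(sum(exp(a))) along one axis."""
    m = np.max(a, axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.sum(np.exp(a - m), axis=axis))


def path_score(emissions, transitions, tags):
    """Sum of per-position label scores and label-to-label transition scores."""
    score = emissions[0, tags[0]]
    for t in range(1, len(tags)):
        score += transitions[tags[t - 1], tags[t]] + emissions[t, tags[t]]
    return score


def log_partition(emissions, transitions):
    """Log of the normalization factor, computed with the forward algorithm."""
    alpha = emissions[0]
    for t in range(1, emissions.shape[0]):
        # alpha[i] + transitions[i, j], reduced over the previous label i
        alpha = logsumexp(alpha[:, None] + transitions, axis=0) + emissions[t]
    return logsumexp(alpha, axis=0)


def log_likelihood(emissions, transitions, tags):
    """Log-probability of one label sequence; training maximizes its sum over the corpus."""
    return path_score(emissions, transitions, tags) - log_partition(emissions, transitions)
```

Decoding the best label sequence at prediction time would typically use the Viterbi algorithm, which is omitted here.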
5.2: long-sentence-level event elements are detected with a pattern matching method, as shown in fig. 3:
5.2.1: for the diagnosis and treatment event corpus provided in the first step, grammatical components such as subject-predicate and verb-object structures in each sentence are obtained through dependency syntactic analysis; a commonly used dependency parsing tool is the LTP natural language processing toolkit from Harbin Institute of Technology. The dependency analysis of the pathological examination event "On June 3, 2015, postoperative pathology showed: colon cancer, invading the serosa, with adenocarcinoma metastasis in the peri-intestinal lymph nodes …" is shown in fig. 4. The dependency analysis of the other event sentences is obtained in the same way, and the results are provided to step 5.2.2.
5.2.2: based on the fixed sentence patterns of the event sentences and the dependency syntactic analysis results provided in step 5.2.1, the invention summarizes and designs pattern rules for long-sentence-level event element detection; their logical representation is shown in Table 2.
Table 2. Pattern rules for long-sentence-level event element extraction (the table is an image in the source and is not reproduced here).
5.2.3: according to the event type of the event sentence, the sentence is matched in turn against the pattern rule templates provided in step 5.2.2; the rule template that fits the sentence is found among the 12 templates, and the corresponding event elements are detected from the event sentence.
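The sequential matching in 5.2.3 can be sketched as trying an ordered list of rules per event type and returning the first hit. Since Table 2 is not available, the two rules below are invented for illustration, and plain regular expressions over English text stand in for the dependency-based conditions of the real rules; none of this reproduces the 12 templates.

```python
import re

# Hypothetical rule templates (assumption): ordered from most to least specific.
RULES = {
    "pathological examination": [
        re.compile(
            r"(?P<time>\d{4}-\d{2}-\d{2}).*?pathology (?:showed|shows)[:：]\s*(?P<result>.+)"
        ),
        re.compile(r"pathology (?:showed|shows)[:：]\s*(?P<result>.+)"),
    ],
}


def match_event(event_type, sentence):
    """Try the rules of the given event type in order; return the first match's groups."""
    for rule in RULES.get(event_type, []):
        m = rule.search(sentence)
        if m:
            return {name: value for name, value in m.groupdict().items() if value}
    return {}
```

For example, `match_event("pathological examination", "2015-06-03 postoperative pathology showed: colon cancer, invading the serosa")` yields both a time and a long result span, while a sentence with no matching rule yields an empty dictionary.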
Innovation point
To address the shortcomings of event element detection research in the intelligent medical field, an event element detection method combining sequence labeling and pattern matching is provided. It differs from conventional event element detection methods for the Chinese medical field in that it divides the event elements of diagnosis and treatment events into short-word-level and long-sentence-level granularity, makes full use of features of the event sentences such as syntactic structure, entities, and event types, and combines a BERT-BiLSTM-CRF sequence labeling method with a pattern matching method based on dependency syntactic analysis to detect short-word-level and long-sentence-level event elements respectively. Both kinds of event elements are thereby detected at the same time, and the accuracy of the event element detection results is improved.
The method provided by the invention performs well in Chinese medical event element detection and provides basic support for the construction of Chinese diagnosis and treatment affair knowledge graphs.

Claims (3)

1. An event element detection method combining sequence labeling and pattern matching is characterized by comprising the following steps:
firstly, preprocessing a Chinese medical history text;
secondly, defining a Chinese diagnosis and treatment event element representation template;
thirdly, constructing semantic feature vectors contained in the event sentences;
fourthly, performing granularity division on the event elements obtained in the second step;
fifthly, according to the result divided in the fourth step, respectively adopting a sequence labeling method and a pattern matching method to obtain corresponding event elements, wherein the sequence labeling method integrates the semantic feature vector obtained in the third step;
wherein, the third step: constructing semantic feature vectors contained in the event sentences,
according to the characteristics of the current medical history texts in electronic medical records, four feature vectors are constructed as additional inputs to the model: (1) the distance feature between words and the trigger word; (2) the event type feature; (3) the entity category feature; (4) the dependency syntax information feature; these features are provided to the fifth step;
(1) distance feature between words and the trigger word: because the event elements defined in the second step are distributed around the event trigger word, and the distances between the trigger word and the different types of event elements in the diagnosis and treatment event corpus provided in the first step follow a distribution pattern, fusing the word-to-trigger distance vector into the model provides deep syntactic information for event element extraction; the position encoding used is given by formulas (1) and (2):
$PE_{(pos,2i)}=\sin\left(pos/10000^{2i/d}\right)$ (1)
$PE_{(pos,2i+1)}=\cos\left(pos/10000^{2i/d}\right)$ (2)
where pos is the position of the event trigger word in the input event sentence, an integer between 0 and the length of the event sentence; d is the dimension of the input vector; 2i and 2i+1 refer to the other words of the input event sentence, with i an integer between 0 and d/2-1; $PE_{(pos,2i)}$ is the value in row pos and even column 2i of the matrix, computed with the sine function, and $PE_{(pos,2i+1)}$ is the value in row pos and odd column 2i+1 of the matrix, computed with the cosine function; owing to the properties of trigonometric functions, this position encoding expresses both the absolute position and the relative position between the words and the trigger word;
(2) event type feature: events of different types in the event corpus provided in the first step correspond to different types of event elements, so the event type is used as additional input feature information for the model;
(3) entity category feature: the event elements of events of the same type in the event corpus provided in the first step correspond to the same entity types, so the entity information is used as additional input feature information for the model;
(4) dependency syntax information feature: although doctors have individual writing styles, medical texts follow writing conventions, so diagnosis and treatment events of the same type in the diagnosis and treatment event corpus provided in the first step often follow the same grammatical structure;
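By way of a non-limiting illustration, the position encoding of formulas (1) and (2) can be sketched as follows. The function names `position_encoding` and `encoding_matrix` are hypothetical, and the sketch assumes that d is even and that positions are counted from 0.

```python
import math
from typing import List


def position_encoding(pos: int, d: int) -> List[float]:
    """Return one row of the encoding matrix for position pos, per formulas (1) and (2)."""
    if d % 2 != 0:
        raise ValueError("d is assumed to be even in this sketch")
    row = [0.0] * d
    for i in range(d // 2):
        angle = pos / (10000 ** (2 * i / d))
        row[2 * i] = math.sin(angle)      # even column 2i, formula (1)
        row[2 * i + 1] = math.cos(angle)  # odd column 2i+1, formula (2)
    return row


def encoding_matrix(sentence_length: int, d: int) -> List[List[float]]:
    """Stack the rows for positions 0 .. sentence_length - 1."""
    return [position_encoding(pos, d) for pos in range(sentence_length)]
```

Each row depends only on the position, so the same row can be looked up for the trigger word or for any other word of the event sentence.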
the fourth step: performing granularity division on the event elements obtained in the second step,
the event elements in the diagnosis and treatment event corpus provided in the first step are counted according to the event element representation of the second step; the examination result of an examination event containing 20-50 words, the pathological examination result of a pathological examination event and the immunohistochemical staining result of an immunohistochemical event are then taken as long sentence-level event elements, and all other event elements are taken as short word-level event elements; the defined event element granularity is provided to the fifth step;
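As a minimal sketch of the granularity division, the assignment can be expressed as a lookup on the pair of event type and element role. The role strings and the function name `granularity` are hypothetical labels chosen for illustration, and the sketch assumes that the 20-50 word figure describes the typical length of the long elements rather than acting as a runtime threshold.

```python
# (event type, element role) pairs treated as long sentence-level elements.
# The strings are illustrative labels, not identifiers taken from the corpus.
LONG_SENTENCE_LEVEL = {
    ("examination", "examination result"),
    ("pathological examination", "pathological examination result"),
    ("immunohistochemistry", "immunohistochemical staining result"),
}


def granularity(event_type: str, role: str) -> str:
    """Return the granularity used to pick the detection method in the fifth step."""
    if (event_type, role) in LONG_SENTENCE_LEVEL:
        return "long sentence-level"   # detected by pattern matching
    return "short word-level"          # detected by sequence labeling
```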
the fifth step: according to the result divided in the fourth step, respectively adopting sequence marking method and pattern matching method to obtain its correspondent event element,
5.1: for short word-level event elements, detection uses a sequence labeling method that fuses the semantic dependency features of the corpus:
5.1.1: for the diagnosis and treatment event corpus R provided in the first step, word vectors are trained with BERT to obtain the vector representation [formula image] of every word in each event sentence; these vectors are provided to step 5.1.2;
5.1.2: the four additional input semantic feature vectors $f_m$ (m = 1, 2, 3, 4) provided in the third step for the text R are fused into the word vectors; the fusion process is expressed as formula (3), where || denotes the concatenation operation:
[formula image] (3)
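The fusion of step 5.1.2 can be illustrated, under assumptions, as a plain concatenation of the word vector with the four feature vectors. The function name `fuse` and the toy dimensions are hypothetical; the exact form of formula (3) is not reproduced here.

```python
from typing import List, Sequence


def fuse(word_vector: Sequence[float],
         feature_vectors: Sequence[Sequence[float]]) -> List[float]:
    """Concatenate a word vector with the four additional feature vectors f_1 .. f_4."""
    if len(feature_vectors) != 4:
        raise ValueError("this sketch expects exactly four feature vectors")
    fused = list(word_vector)
    for f in feature_vectors:
        fused.extend(f)  # the || operation: append the next feature vector
    return fused
```

The fused dimension is the sum of the word vector dimension and the four feature dimensions, which is the input size the following BiLSTM layer would need to accept.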
5.1.3: a bidirectional LSTM network with hidden layer size [formula image] encodes the fused vector [formula image] provided in step 5.1.2, thereby learning how to judge the key fused vectors and obtaining short word-level semantic information; at time step t, [formula image] is the input of the BiLSTM and the output hidden state of the BiLSTM is [formula image]; the process is expressed as formula (4):
[formula image] (4)
5.1.4: to avoid overfitting during training, a Dropout layer randomly deactivates some of the bidirectional long short-term memory cells of the BiLSTM in step 5.1.3, and the semantic-feature-fused vector [formula image] provided in step 5.1.2 is concatenated to further strengthen the short word-level semantic information:
[formula image]
where [formula image] is the weight matrix, $b_L$ is the bias, and [formula image]; here the Bernoulli function randomly generates a vector of 0s and 1s, and f(·) is the activation function;
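The masking and concatenation of step 5.1.4 can be sketched as follows. This is an illustration under assumptions: the names `bernoulli_mask` and `dropout_then_concat` are hypothetical, the weight matrix, bias and activation function of the step are omitted, and no rescaling of the kept values is applied.

```python
import random
from typing import List, Sequence


def bernoulli_mask(size: int, keep_prob: float, rng: random.Random) -> List[int]:
    """Draw a vector of 0s and 1s; each entry is 1 with probability keep_prob."""
    return [1 if rng.random() < keep_prob else 0 for _ in range(size)]


def dropout_then_concat(hidden: Sequence[float], fused: Sequence[float],
                        keep_prob: float, rng: random.Random) -> List[float]:
    """Zero out hidden units with a Bernoulli mask, then append the fused vector."""
    mask = bernoulli_mask(len(hidden), keep_prob, rng)
    dropped = [h * m for h, m in zip(hidden, mask)]
    return dropped + list(fused)
```

Passing an explicit `random.Random` instance keeps the mask reproducible when a fixed seed is used.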
5.1.5: the output [formula image] of the Dropout layer in step 5.1.4 is input to a CRF layer, with the corresponding tag sequence denoted [formula image]; then, for a given current medical history text R in the diagnosis and treatment event corpus provided in the first step, all parameters of the CRF layer are estimated by maximizing [formula image]:
[formula image]
where [formula image] is the normalization factor; [formula image] denotes the probability that [formula image] corresponds to the label [formula image]; [formula image] denotes the probability that [formula image] corresponds to the label [formula image], given that [formula image] corresponds to the label [formula image]; and $\lambda_g$ and $\mu_v$ are hyper-parameters; the CRF is therefore trained by maximizing the log-likelihood function over the corpus, so as to obtain the most accurate short word-level event element detection result:
[formula image]
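As a hedged illustration of the log-likelihood that CRF training maximizes, the following sketch scores a tag sequence of a linear-chain CRF and normalizes it with the forward algorithm. The emission and transition score tables stand in for the state and transition terms of the step; their layout and all function names are assumptions of this sketch, not the notation of the formulas above.

```python
import math
from typing import List, Sequence


def log_sum_exp(values: Sequence[float]) -> float:
    """Numerically stable log(sum(exp(v)))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def sequence_score(emissions: List[List[float]], transitions: List[List[float]],
                   tags: Sequence[int]) -> float:
    """Unnormalized score: emission scores plus transition scores along the path."""
    score = emissions[0][tags[0]]
    for t in range(1, len(tags)):
        score += transitions[tags[t - 1]][tags[t]] + emissions[t][tags[t]]
    return score


def log_partition(emissions: List[List[float]], transitions: List[List[float]]) -> float:
    """Log of the normalization factor, computed with the forward algorithm."""
    n_tags = len(emissions[0])
    alpha = list(emissions[0])
    for t in range(1, len(emissions)):
        alpha = [
            log_sum_exp([alpha[p] + transitions[p][c] for p in range(n_tags)])
            + emissions[t][c]
            for c in range(n_tags)
        ]
    return log_sum_exp(alpha)


def log_likelihood(emissions: List[List[float]], transitions: List[List[float]],
                   tags: Sequence[int]) -> float:
    """Log-probability of a tag sequence; training would maximize its sum over the corpus."""
    return sequence_score(emissions, transitions, tags) - log_partition(emissions, transitions)
```

In a full model the emission scores would come from the Dropout output of step 5.1.4, and the transition table would be a learned parameter of the CRF layer.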
5.2: for long sentence-level event elements, detection uses a pattern matching method:
5.2.1: based on dependency syntactic analysis of the diagnosis and treatment event corpus provided in the first step, the subject-verb and verb-object relations in each sentence are analyzed; the dependency syntactic analysis of the event corpus is performed with the LTP natural language processing tool of Harbin Institute of Technology, and the results are provided to step 5.2.2;
5.2.2: based on the fixed expression patterns of event sentences, combined with the dependency syntactic analysis results provided in step 5.2.1, pattern rules for long sentence-level event element detection are summarized and designed;
5.2.3: according to its event type, the event sentence is matched in turn against the pattern rule templates provided in step 5.2.2, and the rule template corresponding to the event sentence is found among the 12 templates, so that the corresponding event elements are detected from the event sentence.
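The sequential template matching of step 5.2.3 can be sketched as below. The sketch does not call LTP; it assumes the dependency parse is already available as a list of tokens with a head index and a relation label. The two rules, the `Token` type and the English toy tokens are hypothetical and do not reproduce any of the 12 templates.

```python
from typing import Callable, Dict, List, NamedTuple, Optional


class Token(NamedTuple):
    text: str
    head: int       # index of the head token, -1 for the sentence root
    relation: str   # dependency label such as "SBV", "VOB", "ATT", "HED"


Rule = Callable[[List[Token], int], Optional[str]]


def subtree(tokens: List[Token], root: int) -> List[int]:
    """Indices of root and all of its descendants, in sentence order."""
    indices = {root}
    changed = True
    while changed:
        changed = False
        for i, tok in enumerate(tokens):
            if tok.head in indices and i not in indices:
                indices.add(i)
                changed = True
    return sorted(indices)


def object_of_governing_verb(tokens: List[Token], trigger: int) -> Optional[str]:
    """Rule A: the trigger is the subject of a verb; take that verb's object subtree."""
    if tokens[trigger].relation != "SBV":
        return None
    verb = tokens[trigger].head
    for i, tok in enumerate(tokens):
        if tok.head == verb and tok.relation == "VOB":
            return " ".join(tokens[j].text for j in subtree(tokens, i))
    return None


def text_after_trigger(tokens: List[Token], trigger: int) -> Optional[str]:
    """Rule B (fallback): take everything that follows the trigger word."""
    rest = " ".join(tok.text for tok in tokens[trigger + 1:])
    return rest or None


# Rule templates per event type, tried in order until one matches.
RULES: Dict[str, List[Rule]] = {
    "examination": [object_of_governing_verb, text_after_trigger],
}


def detect_long_element(event_type: str, tokens: List[Token], trigger: int) -> Optional[str]:
    for rule in RULES.get(event_type, []):
        element = rule(tokens, trigger)
        if element is not None:
            return element
    return None
```

Ordering the rules from most to least specific keeps the fallback from shadowing a more precise template.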
2. The event element detection method combining sequence labeling and pattern matching of claim 1, wherein the first step, preprocessing the Chinese medical history text, comprises:
1.1: deleting sentences unrelated to medical events from the data set, and normalizing irregular punctuation in the text;
1.2: if an event sentence is too long, or one sentence contains several diagnosis and treatment events, splitting it so that each sentence in the data set corresponds to one diagnosis and treatment event;
1.3: because the writing habits of doctors differ, the words that refer to the same disease, operation or medicine name differ across current medical history texts, and abbreviations, shorthand and variants occur, so the medical vocabulary for diseases, operations and medicines needs to be unified;
1.4: labeling the event type and event trigger word of each event sentence, wherein the eight event types are admission, examination, pathological examination, immunohistochemical examination, treatment, chemotherapy, operation and diagnosis, and the corresponding trigger words include diagnosis, gastroscopy, disease examination, staining examination, symptomatic treatment, chemical drug treatment, administration and definite diagnosis, respectively; once the event types are labeled, the preprocessing of the first step is complete, and the preprocessed data are provided to the second, third, fourth and fifth steps.
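Steps 1.1 to 1.3 can be illustrated by the following minimal sketch, which normalizes punctuation, unifies medical terms with a synonym table and splits the text into sentences. The punctuation mapping, the sentence delimiters and the synonym entry in the usage example are assumptions made for illustration; a real synonym table would be curated from the corpus.

```python
import re
from typing import Dict, List

# Half-width marks mapped to their full-width Chinese forms (illustrative subset).
PUNCT_TABLE = str.maketrans({",": "，", ";": "；", ":": "："})


def normalize_punctuation(text: str) -> str:
    """Map half-width marks to full-width ones and collapse repeated marks."""
    text = text.translate(PUNCT_TABLE)
    return re.sub(r"([，。；：])\1+", r"\1", text)


def unify_terms(text: str, synonyms: Dict[str, str]) -> str:
    """Replace each variant spelling with its standard term."""
    for variant, standard in synonyms.items():
        text = text.replace(variant, standard)
    return text


def preprocess(text: str, synonyms: Dict[str, str]) -> List[str]:
    """Return one cleaned sentence per list entry, split after 。 or ；."""
    text = unify_terms(normalize_punctuation(text), synonyms)
    sentences = re.findall(r"[^。；]+[。；]?", text)
    return [s.strip() for s in sentences if s.strip()]
```

For example, `preprocess("患者入院,,行胃镜检查;病检示胃Ca。", {"胃Ca": "胃癌"})` yields two sentences, one per diagnosis and treatment event, with "胃Ca" (a shorthand for gastric cancer) replaced by the standard term.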
3. The event element detection method combining sequence labeling and pattern matching of claim 1, wherein the second step, defining a Chinese diagnosis and treatment event element representation template, comprises:
according to the definition of events given by the ACE conference, combined with the data set content of the first step and after statistical analysis, corresponding event elements are designed for the different event types in view of the elements participating in each event; wherein the event elements of the admission event are admission time, symptom, operation name and disease; the event elements of the examination event are examination time, disease and examination result; the event elements of the pathological examination event are examination time, disease and pathological examination result; the event elements of the immunohistochemical event are detection time and immunohistochemical staining result; the event elements of the treatment event are treatment time and treatment means; the event elements of the chemotherapy event are chemotherapy time and chemotherapy drug; the event elements of the operation event are operation time, symptom, operation name and disease; the event elements of the diagnosis event are diagnosis time, symptom, disease and operation name;
in addition, the admission event contains at least one of the four types of event elements, and there may be one or more symptom, operation name and disease event elements; the examination event must contain an examination result event element, and the examination time may be absent; the checking event must contain a checking result event element, and the checking time may be absent; the treatment event must contain a treatment means event element, of which there may be one or more, and the treatment time event element may be absent; the chemotherapy event must contain a chemotherapy drug event element, and there may be one or more chemotherapy time and chemotherapy drug elements; the operation event must contain an operation name event element, the other types of elements may be absent, and there may be one or more symptom, disease and operation name event elements; the diagnosis event must contain a disease event element, the other event elements may be absent, and there may be one or more symptom event elements; the defined event elements are provided to the third, fourth and fifth steps.
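The element templates and their mandatory elements can be sketched as a small lookup table with a validity check. This is an illustration under assumptions: only five of the eight event types are shown, the role strings and the function name `is_valid` are hypothetical, and each role maps to a list so that one or more values can be recorded.

```python
from typing import Dict, List

# Allowed roles and mandatory roles for a subset of the event types.
TEMPLATES: Dict[str, Dict[str, set]] = {
    "admission": {
        "roles": {"admission time", "symptom", "operation name", "disease"},
        "required": set(),  # at least one of the four roles must still be present
    },
    "examination": {
        "roles": {"examination time", "disease", "examination result"},
        "required": {"examination result"},
    },
    "treatment": {
        "roles": {"treatment time", "treatment means"},
        "required": {"treatment means"},
    },
    "chemotherapy": {
        "roles": {"chemotherapy time", "chemotherapy drug"},
        "required": {"chemotherapy drug"},
    },
    "operation": {
        "roles": {"operation time", "symptom", "operation name", "disease"},
        "required": {"operation name"},
    },
}


def is_valid(event_type: str, elements: Dict[str, List[str]]) -> bool:
    """Check that the filled roles are allowed and that every mandatory role is filled."""
    template = TEMPLATES.get(event_type)
    if template is None:
        return False
    filled = {role for role, values in elements.items() if values}
    if not filled or not filled <= template["roles"]:
        return False
    return template["required"] <= filled
```

A check of this kind could be run on the combined output of the sequence labeling and pattern matching branches to discard incomplete events.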
CN202110532819.2A 2021-05-17 2021-05-17 Event element detection method combining sequence labeling and pattern matching Active CN113177416B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110532819.2A CN113177416B (en) 2021-05-17 2021-05-17 Event element detection method combining sequence labeling and pattern matching

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110532819.2A CN113177416B (en) 2021-05-17 2021-05-17 Event element detection method combining sequence labeling and pattern matching

Publications (2)

Publication Number Publication Date
CN113177416A CN113177416A (en) 2021-07-27
CN113177416B true CN113177416B (en) 2022-06-07

Family

ID=76929058

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110532819.2A Active CN113177416B (en) 2021-05-17 2021-05-17 Event element detection method combining sequence labeling and pattern matching

Country Status (1)

Country Link
CN (1) CN113177416B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111382575A (en) * 2020-03-19 2020-07-07 电子科技大学 Event extraction method based on joint labeling and entity semantic information
CN112084381A (en) * 2020-09-11 2020-12-15 广东电网有限责任公司 Event extraction method, system, storage medium and equipment
CN112241457A (en) * 2020-09-22 2021-01-19 同济大学 Event detection method for event of affair knowledge graph fused with extension features

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109871538A (en) * 2019-02-18 2019-06-11 华南理工大学 A kind of Chinese electronic health record name entity recognition method

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Chinese medical event detection based on feature extension and document consistency; Chen Wang et al.; IEEE; 20201022; full text *
Relation Extraction Based on Fusion Dependency Parsing from Chinese EMRS; Pengjun Zhai et al.; Hindawi; 20200608; full text *
Sample imbalance disease classification model based on association rule feature selection; Chenxi Huang et al.; Elsevier; 20200311; full text *

Also Published As

Publication number Publication date
CN113177416A (en) 2021-07-27

Similar Documents

Publication Publication Date Title
CN111708874B (en) Man-machine interaction question-answering method and system based on intelligent complex intention recognition
Savova et al. Use of natural language processing to extract clinical cancer phenotypes from electronic medical records
Seo et al. Bidirectional attention flow for machine comprehension
Xu et al. An end-to-end system to identify temporal relation in discharge summaries: 2012 i2b2 challenge
Tutubalina et al. Combination of deep recurrent neural networks and conditional random fields for extracting adverse drug reactions from user reviews
Holzinger et al. Combining HCI, natural language processing, and knowledge discovery-potential of IBM content analytics as an assistive technology in the biomedical field
Farmer et al. Reading span task performance, linguistic experience, and the processing of unexpected syntactic events
CN112148851A (en) Construction method of medicine knowledge question-answering system based on knowledge graph
Santander-Cruz et al. Semantic feature extraction using SBERT for dementia detection
CN112635071B (en) Diabetes knowledge graph construction method integrating Chinese and Western medicine knowledge
Kim et al. From descriptions to depictions: A dynamic sketch map drawing strategy
CN116805013A (en) Traditional Chinese medicine video retrieval model based on knowledge graph
Wang et al. EHR2Vec: representation learning of medical concepts from temporal patterns of clinical notes based on self-attention mechanism
Zhou et al. Chemical-induced disease relation extraction with dependency information and prior knowledge
Ke et al. Medical entity recognition and knowledge map relationship analysis of Chinese EMRs based on improved BiLSTM-CRF
Gaur et al. “Who can help me?”: Knowledge Infused Matching of Support Seekers and Support Providers during COVID-19 on Reddit
Tao et al. Geographic named entity recognition by employing natural language processing and an improved bert model
Sboev et al. Analysis of the full-size russian corpus of internet drug reviews with complex ner labeling using deep learning neural networks and language models
An et al. Toward better understanding older adults: a biography brief timeline extraction approach
CN113177416B (en) Event element detection method combining sequence labeling and pattern matching
Qiu et al. DocFlow: A visual analytics system for question-based document retrieval and categorization
Chen et al. Entity relation extraction from electronic medical records based on improved annotation rules and BiLSTM-CRF
Han et al. Chinese Q&A community medical entity recognition with character-level features and self-attention mechanism
Zhang et al. Syntax-informed self-attention network for span-based joint entity and relation extraction
Laleye et al. A French medical conversations corpus annotated for a virtual patient dialogue system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant