CN117875319A

CN117875319A - Medical field labeling data acquisition method and device and electronic equipment

Info

Publication number: CN117875319A
Application number: CN202311864741.XA
Authority: CN
Inventors: 张隆基; 任梦星; 刘迎建; 彭菲
Original assignee: Hanwang Technology Co Ltd
Current assignee: Hanwang Technology Co Ltd
Priority date: 2023-12-29
Filing date: 2023-12-29
Publication date: 2024-04-12

Abstract

The application discloses a method, a device and electronic equipment for acquiring medical field labeling data, and belongs to the technical field of computers. The method comprises the following steps: acquiring a first data set composed of marked data and a second data set composed of data to be marked; training a preset entity identification model based on marked data in the first data set; marking and predicting each datum to be marked in the second data set by adopting the entity identification model obtained through training; calibrating the labeling prediction result based on a preset medical term database to obtain a calibration result; and converting the data to be marked in the second data set into marked data in the first data set based on the calibration result, and repeating the steps until the marking condition is met, and outputting the marked data in the first data set. The method fully utilizes professional medical knowledge to conduct supervision, achieves the expansion of a high-quality data set under the condition of limited labeling data, improves the efficiency of labeling data, and reduces the labeling cost.

Description

Medical field labeling data acquisition method and device and electronic equipment

Technical Field

The present invention relates to the field of computer technology, and in particular, to a method and apparatus for acquiring labeling data in the medical field, an electronic device, and a computer readable storage medium.

Background

In training a medical knowledge question-answering model, medical entity recognition and medical field text classification are two important technical tasks. The medical entity recognition task aims to identify the important medical entities involved in a question, such as lesion sites, diseases and medicines, and provides key information for subsequent processing of the question. Medical field text classification assigns a text to a medical sub-field, such as a designated medical subject category or a finer subject knowledge category, so that the medical knowledge question-answering model can provide accurate and specific answers. Training the entity recognition model that performs the medical entity recognition task, and the text classification model that performs the medical text classification task, requires labeled medical field text. The quality of the labeled data directly affects the accuracy of entity recognition and text classification.

However, because of the specialized nature of the medical field, data labeling requires a high degree of expertise, and samples usually have to be labeled by medical professionals, so labeled data is difficult to obtain. In the prior art, labeled data is scarce, which directly reduces the accuracy of the entity recognition model and the text classification model trained on it.

Therefore, the prior-art methods for acquiring labeling data in the medical field need to be improved.

Disclosure of Invention

The embodiments of the present application provide a method, a device and a storage medium for acquiring medical field labeling data, which can generate high-quality medical field labeling data efficiently and provide training data support for medical field entity identification models, text classification models and the like.

In a first aspect, an embodiment of the present application provides a method for acquiring labeling data in a medical field, including:

acquiring a first data set composed of marked data and a second data set composed of data to be marked, wherein the marked data are as follows: medical field text labeled with entity class labels;

training a preset entity identification model based on the marked data in the first data set;

performing entity recognition on each piece of data to be marked in the second data set by adopting the entity recognition model obtained through training to obtain a marking result predicted value corresponding to each piece of data to be marked;

calibrating the predicted value of the labeling result based on a preset medical term database to obtain a calibration result corresponding to each piece of data to be labeled;

based on the calibration result, performing an updating operation on the first data set and the second data set to convert the data to be marked in the second data set into marked data in the first data set;

and returning to the step of training a preset entity identification model based on the marked data in the first data set, and executing the next round of the steps from that training step through the step of performing the updating operation on the first data set and the second data set based on the calibration result, until the ending marking condition is met, and then outputting the marked data in the first data set.
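The iterative procedure above can be sketched as a loop. The following is a minimal Python sketch under stated assumptions, not the implementation of the application: `train`, `predict` and `calibrate` are hypothetical callables standing in for the training, entity recognition and calibration steps, and the ending marking condition is assumed here to be "the second data set is empty, no item was converted in a round, or a maximum number of rounds is reached".

```python
from typing import Callable, List, Optional, Tuple

Labeled = Tuple[str, List[str]]  # (text, one label per element of the text)


def acquire_labels(
    first_set: List[Labeled],
    second_set: List[str],
    train: Callable[[List[Labeled]], object],        # hypothetical training step
    predict: Callable[[object, str], List[str]],     # hypothetical recognition step
    calibrate: Callable[[str, List[str]], Optional[Labeled]],  # hypothetical calibration step
    max_rounds: int = 10,                            # assumed ending condition
) -> List[Labeled]:
    """Iteratively move successfully calibrated items from second_set into first_set."""
    first_set, second_set = list(first_set), list(second_set)
    for _ in range(max_rounds):
        if not second_set:                 # nothing left to label
            break
        model = train(first_set)           # train on the current marked data
        remaining: List[str] = []
        for text in second_set:
            predicted = predict(model, text)          # labeling result predicted value
            calibrated = calibrate(text, predicted)   # None means calibration failed
            if calibrated is None:
                remaining.append(text)     # assumed: failed items stay in the second set
            else:
                first_set.append(calibrated)
        if len(remaining) == len(second_set):  # no progress in this round
            break
        second_set = remaining
    return first_set
```

Passing the three steps in as callables keeps the loop independent of any particular model or term database.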

In a second aspect, an embodiment of the present application provides an apparatus for acquiring labeling data in a medical field, including:

a marked data and data-to-be-marked acquisition module, used for acquiring a first data set composed of marked data and a second data set composed of data to be marked, wherein the marked data is medical field text labeled with entity class labels;

the entity recognition model training module is used for training a preset entity recognition model based on the marked data in the first data set;

the pre-labeling module is used for carrying out entity recognition on each piece of data to be labeled in the second data set by adopting the entity recognition model obtained through training to obtain a labeling result predicted value corresponding to each piece of data to be labeled;

The marking result calibration module is used for carrying out calibration processing on the marking result predicted value based on a preset medical term database to obtain a calibration result corresponding to each piece of data to be marked;

the marked data generating module is used for executing updating operation on the first data set and the second data set based on the calibration result so as to convert the data to be marked in the second data set into marked data in the first data set;

and the data set updating module is used for returning to execute the next round of invoking the modules from the entity recognition model training module through the marked data generating module, until the ending marking condition is met, and then outputting the marked data in the first data set.

In a third aspect, the embodiment of the application further discloses an electronic device, which includes a memory, a processor, and a computer program stored in the memory and capable of running on the processor, where the processor implements the method for acquiring the labeling data in the medical field according to the embodiment of the application when executing the computer program.

In a fourth aspect, embodiments of the present application provide a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the method for acquiring medical field labeling data disclosed in the embodiments of the present application.

In the method for acquiring medical field labeling data disclosed in the embodiments of the present application, a first data set composed of marked data and a second data set composed of data to be marked are acquired, wherein the marked data is medical field text labeled with entity class labels; a preset entity identification model is trained based on the marked data in the first data set; entity recognition is performed on each piece of data to be marked in the second data set using the trained entity recognition model, to obtain a marking result predicted value corresponding to each piece of data to be marked; the marking result predicted value is calibrated based on a preset medical term database, to obtain a calibration result corresponding to each piece of data to be marked; based on the calibration result, an updating operation is performed on the first data set and the second data set to convert data to be marked in the second data set into marked data in the first data set; and the steps from training the preset entity identification model through performing the updating operation are repeated, round by round, until the ending marking condition is met, after which the marked data in the first data set is output. The method makes full use of professional medical knowledge for supervision, expands a high-quality data set when labeled data and resources are limited, provides sufficient and accurate training data for entity recognition and text classification in the medical field, requires no manual labeling, improves labeling efficiency, and reduces labeling cost.

The foregoing description is only an overview of the technical solutions of the present application, and may be implemented according to the content of the specification in order to make the technical means of the present application more clearly understood, and in order to make the above-mentioned and other objects, features and advantages of the present application more clearly understood, the following detailed description of the present application will be given.

Drawings

For the purposes of making the objects, technical solutions and advantages of the embodiments of the present application more clear, the technical solutions of the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is apparent that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by one of ordinary skill in the art based on the embodiments herein without making any inventive effort, are intended to be within the scope of the present application.

FIG. 1 is a flowchart of a method for acquiring medical field labeling data disclosed in an embodiment of the present application;

FIG. 2 is a schematic diagram of an entity recognition model used in an embodiment of the present application;

FIG. 3 is a schematic diagram of entity labeling results obtained by using an entity recognition model in the embodiment of the present application;

FIG. 4 is a schematic diagram of labeled data obtained after the entity labeling result in FIG. 3 is calibrated;

FIG. 5 is a schematic structural diagram of an acquiring device for medical field labeling data disclosed in an embodiment of the present application;

FIG. 6 schematically shows a block diagram of an electronic device for performing a method according to the present application; and

FIG. 7 schematically shows a memory unit for holding or carrying program code implementing the method according to the present application.

Detailed Description

The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all, of the embodiments of the present application. All other embodiments, which can be made by one of ordinary skill in the art based on the embodiments herein without making any inventive effort, are intended to be within the scope of the present application.

Referring to FIG. 1, a method for acquiring labeling data in a medical field disclosed in an embodiment of the present application includes steps 110 to 160.

Step 110, a first data set consisting of marked data and a second data set consisting of data to be marked are obtained.

The marked data is medical field text labeled with entity class labels.

The method for acquiring medical field labeling data disclosed in the embodiments of the present application is used to label medical field text data, for example, the data to be marked in the second data set.

When the present application is implemented, marked data in the medical field also needs to be acquired. The marked data may originate from a public data set of the medical field. The embodiments of the present application do not limit the manner in which the marked data was labeled.

The marked data comprises a text sequence and labels. The text sequence may be a question, such as a question description in a medical field question-answering application, and the labels describe the positions and categories of the entities in the text sequence. In particular, a label may describe which characters in the text sequence form a medical entity, and which category of medical subject entity or subject knowledge entity that entity belongs to. The categories of medical subject entities include, but are not limited to: internal medicine, surgery, pediatrics, gynaecology and obstetrics, otorhinolaryngology, ophthalmology, stomatology, dentistry, and the like. The categories of subject knowledge entities include, but are not limited to, one or more of the following: disease name, disease symptoms, diagnostic methods, therapeutic methods, medical devices, drugs, organ tissues, biological nouns, and the like.

Optionally, a BIO tagging mechanism may be employed to label named entities in the text sequence. BIO tagging is a sequence labeling method mainly applied to natural language processing tasks such as named entity recognition. It marks each element in the sequence as B, I or O, denoting respectively the beginning of an entity, the inside of an entity, and a non-entity element, and attaches the corresponding entity type to each B and I label. In the embodiments of the present application, the method for obtaining marked data is illustrated by taking data labeled with the BIO tagging mechanism as an example. Those skilled in the art should understand that other labeling mechanisms may be used in a specific implementation; the embodiments of the present application do not limit the labeling mechanism.
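As an illustration of the BIO tagging mechanism, the following minimal Python sketch converts entity spans into one BIO label per element. The function name `bio_tags`, the word-level tokens and the type names `SYMPTOM` and `DIAGNOSTIC` are assumptions made for the example; the application does not prescribe a label vocabulary.

```python
from typing import List, Tuple


def bio_tags(tokens: List[str], spans: List[Tuple[int, int, str]]) -> List[str]:
    """Build BIO labels from (start, end_exclusive, entity_type) spans over token indices."""
    tags = ["O"] * len(tokens)            # every element starts as non-entity
    for start, end, etype in spans:
        tags[start] = f"B-{etype}"        # first element of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{etype}"        # remaining elements of the entity
    return tags


tokens = ["chest", "pain", "need", "electrocardiogram", "?"]
spans = [(0, 2, "SYMPTOM"), (3, 4, "DIAGNOSTIC")]  # illustrative type names
for token, tag in zip(tokens, bio_tags(tokens, spans)):
    print(token, tag)
```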

The data to be marked is obtained by preprocessing the collected data. For example, rule-based cleaning may be performed on data obtained from data sources such as Wikipedia, Hanwang Shucheng, and open source knowledge databases. The data cleaning process includes, but is not limited to, one or more of the following: removing HTML tags, deleting special characters, converting text to lower case to ensure consistency, removing stop words to reduce noise, performing stemming or lemmatization to simplify word forms, removing numbers, processing abbreviations, removing redundant whitespace, processing repeated characters, handling missing values, and the like, thereby obtaining a plurality of pieces of data. Obtaining the data to be marked by cleaning the collected data improves the quality of the marked data generated from it.
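A few of the cleaning rules listed above can be sketched with the Python standard library. This is a hedged illustration covering only a subset of the rules (HTML tags, special characters, lower-casing, repeated punctuation, redundant whitespace, missing values and duplicates); the function names and the exact character whitelist are assumptions, and stop-word removal, stemming and abbreviation handling are omitted.

```python
import html
import re
from typing import Iterable, List, Optional


def clean_text(raw: str) -> str:
    """Apply a small subset of rule-based cleaning steps to one record."""
    text = re.sub(r"<[^>]+>", " ", raw)        # remove HTML tags
    text = html.unescape(text)                 # decode entities such as &amp;
    text = text.lower()                        # lower case for consistency
    text = re.sub(r"[^\w\s?.,']", " ", text)   # delete special characters
    text = re.sub(r"([?.,])\1+", r"\1", text)  # collapse repeated punctuation
    return re.sub(r"\s+", " ", text).strip()   # remove redundant whitespace


def clean_corpus(records: Iterable[Optional[str]]) -> List[str]:
    """Clean all records, dropping missing values, empty results and duplicates."""
    seen, cleaned = set(), []
    for record in records:
        if not record:                         # missing value
            continue
        text = clean_text(record)
        if text and text not in seen:
            seen.add(text)
            cleaned.append(text)
    return cleaned
```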

Each piece of processed data to be marked may comprise one sentence or a plurality of sentences. For example, a piece of data to be marked may be: "Do I need to do an electrocardiographic examination for chest pain?".

In general, the marked data resources in the medical field are scarce, and the data to be marked are massive.

Step 120, training a preset entity recognition model based on the marked data in the first data set.

A preset entity identification model is trained based on the marked data in the first data set. In an embodiment of the present application, the entity recognition model may be an entity recognition model in the prior art. Preferably, a model decision tree model including a plurality of different branch entity recognition models may be used as the entity recognition model. For example, a model decision tree model as shown in FIG. 2 may be employed that includes three different branch entity recognition models: a BERT model cascaded with a CRF (conditional random field) layer, a BiLSTM (Bidirectional LSTM) model cascaded with a CRF layer, and an ERNIE (Enhanced Representation through kNowledge IntEgration) model. The BERT model cascaded with a CRF layer is strong at extracting semantic information from the input text; the BiLSTM model is strong at extracting context information; and the ERNIE model is a pre-trained language model that models several different types of knowledge (such as grammar knowledge, semantic knowledge and entity knowledge), thereby improving performance on natural language understanding tasks.

The branch entity recognition models are in parallel relation, and the branch entity recognition models jointly make decisions to form a model tree containing a plurality of models so as to fully utilize the advantages of the branch entity recognition models and improve the accuracy and generalization of entity recognition.

During the training phase of the entity recognition model, each branch entity recognition model provides its own entity labeling result, and the labeling result predicted value of the entity labeling is obtained through weight learning. For example, for each piece of marked data, the vector representation of the text sequence in the marked data is input to each branch entity recognition model, and each branch entity recognition model performs feature extraction and mapping on the input vector representation to obtain the corresponding output vectors O1, O2 and O3. Finally, the output vectors of the branch entity recognition models included in the entity recognition model are fused by weighting according to the formula $F = W_1 \cdot O1 + W_2 \cdot O2 + W_3 \cdot O3$, to obtain the entity recognition result of the entity recognition model for the text sequence in the input marked data.

Here $W_1$, $W_2$ and $W_3$ respectively represent the weights of the corresponding branch entity recognition models. Then a model loss value is calculated from the entity recognition result and the label of each piece of marked data, and the parameters of each branch entity recognition model and the weights of the branch entity recognition models are optimized based on the model loss value, thereby training the entity recognition model. In some alternative embodiments, if each branch entity recognition model is a pre-trained model, only the weights of the branch entity recognition models may be optimized.
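The weighted fusion F = W1*O1 + W2*O2 + W3*O3 can be sketched in plain Python. The sketch assumes that each branch output is a matrix with one row of label scores per element of the text sequence, that all branch outputs have the same shape, and that each weight is a scalar; the application itself only speaks of output vectors and weights, so these shapes, the function name `fuse` and the numeric values are illustrative.

```python
from typing import List, Sequence

Matrix = List[List[float]]  # assumed shape: one row of label scores per sequence element


def fuse(outputs: Sequence[Matrix], weights: Sequence[float]) -> Matrix:
    """Compute the element-wise weighted sum of the branch outputs."""
    if not outputs or len(outputs) != len(weights):
        raise ValueError("need exactly one weight per branch output")
    rows, cols = len(outputs[0]), len(outputs[0][0])
    fused = [[0.0] * cols for _ in range(rows)]
    for output, weight in zip(outputs, weights):
        for i in range(rows):
            for j in range(cols):
                fused[i][j] += weight * output[i][j]
    return fused


# Toy branch outputs for a one-element sequence with two candidate labels.
o1, o2, o3 = [[1.0, 0.0]], [[0.0, 1.0]], [[1.0, 1.0]]
print(fuse([o1, o2, o3], [0.5, 0.25, 0.25]))
```

In training, the weights would be updated from the model loss together with, or instead of, the branch parameters; that optimization step is not shown here.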

For the training process of the entity identification model, reference may be made to the prior art; it is not described in detail in the embodiments of the present application.

In the embodiments of the present application, voting-based training is performed using branch entity recognition models with small parameter scales. By exploiting the advantage of voting, the small branch entity recognition models can learn the feature extraction capability of a large model, so the entity recognition model is inexpensive to train. Because the entity recognition model is obtained through voting learning over a plurality of branch entity recognition models, and the entity labeling result in the inference stage is likewise obtained through voting, this helps improve the accuracy of entity recognition and the robustness to input text sequences.

Step 130, performing entity recognition on each piece of data to be marked in the second data set by using the trained entity recognition model, to obtain a marking result predicted value corresponding to each piece of data to be marked.

After training to obtain the entity recognition model, in the data labeling stage, for each piece of data to be labeled in the second data set, text vectors of text sequences in the data to be labeled can be obtained respectively, then the text vectors of the text sequences in each piece of data to be labeled are input into the entity recognition model respectively, and entity recognition is carried out on the input text vectors of each text sequence through the entity recognition model to obtain entity labeling results corresponding to each piece of data to be labeled.

Optionally, the entity recognition model is a model decision tree model including a plurality of different branch entity recognition models, and performing entity recognition on each piece of data to be marked in the second data set by using the trained entity recognition model, to obtain a marking result predicted value corresponding to each piece of data to be marked, includes: for each piece of data to be marked in the second data set, performing entity recognition on the data to be marked with each branch entity recognition model respectively, to obtain a coding vector output by each branch entity recognition model; and performing weighted fusion on the coding vectors output by the branch entity recognition models, to obtain the marking result predicted value corresponding to that piece of data to be marked. The plurality of different branch entity recognition models employ different entity recognition algorithms.

For example, a piece of data S to be marked is first converted into a text vector X that meets the input form requirements of the entity recognition model. Then, the text vector X is input into the branch entity recognition model of the BERT model cascaded with a CRF layer to obtain a first coding vector O1, into the branch entity recognition model of the BiLSTM model cascaded with a CRF layer to obtain a second coding vector O2, and into the ERNIE branch entity recognition model to obtain a third coding vector O3. The first coding vector O1, the second coding vector O2 and the third coding vector O3 obtained by the three branch entity recognition models are then weighted and summed using the decision weights obtained by training, to obtain the marking result predicted value corresponding to the data S to be marked. The marking result predicted value may be a label, under the BIO tagging mechanism, for each element of the data S to be marked (i.e., the text sequence), and indicates the location and type of the entities identified in the data S to be marked. For example, it may identify whether each character in the data S to be marked belongs to an entity, and which medical subject category or subject knowledge category that entity belongs to.
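To make the last step concrete, the following sketch turns fused per-element scores into BIO labels and then into predicted entities. It is a simplification under stated assumptions: the highest-scoring label is taken independently for each element (no CRF decoding), the label vocabulary `tag_names` and the scores are made up for the example, and tokens are joined with spaces.

```python
from typing import List, Tuple


def decode_tags(fused: List[List[float]], tag_names: List[str]) -> List[str]:
    """Pick the highest-scoring label for each element (simple argmax, no CRF)."""
    return [tag_names[max(range(len(row)), key=row.__getitem__)] for row in fused]


def extract_entities(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group B-/I- labels into (entity text, entity type) pairs."""
    entities, current, ctype = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:                               # close the previous entity
                entities.append((" ".join(current), ctype))
            current, ctype = [token], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == ctype:
            current.append(token)                     # continue the open entity
        else:
            if current:
                entities.append((" ".join(current), ctype))
            current, ctype = [], None
    if current:
        entities.append((" ".join(current), ctype))
    return entities


tag_names = ["O", "B-SYMPTOM", "I-SYMPTOM", "B-DIAGNOSTIC"]  # illustrative vocabulary
fused = [[0.1, 0.8, 0.05, 0.05], [0.1, 0.1, 0.7, 0.1],
         [0.9, 0.0, 0.05, 0.05], [0.1, 0.1, 0.1, 0.7]]
tokens = ["chest", "pain", "need", "electrocardiogram"]
print(extract_entities(tokens, decode_tags(fused, tag_names)))
```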

And respectively carrying out entity identification on each piece of data to be marked in the second data set to obtain a marking result predicted value corresponding to each piece of data to be marked.

In the embodiments of the present application, entity recognition models of different classification levels may be trained separately based on the labeled data. For example, an entity recognition model for medical subject entity recognition is trained based on the label information of the medical subject entities in the labeled data, and an entity recognition model for subject knowledge entity recognition is trained based on the label information of the subject knowledge entities in the labeled data. Correspondingly, when the data to be labeled is labeled, the two entity recognition models may be used respectively to recognize the medical subject entities and the subject knowledge entities in a piece of data to be labeled, thereby obtaining labeling result predicted values of the medical subject entities and/or the subject knowledge entities.

Step 140, performing calibration processing on the predicted value of the labeling result based on a preset medical term database to obtain a calibration result corresponding to each piece of data to be labeled.

Because marked data samples are scarce, an entity recognition model trained on a small amount of marked data may produce erroneous inference results for the data to be marked. Therefore, the marking result predicted value of each piece of data to be marked is calibrated based on a preset medical term database to improve the accuracy of the marking result. The preset medical term database may be a public medical term database.

Optionally, performing calibration processing on the predicted value of the labeling result based on a preset medical term database, to obtain a calibration result corresponding to each piece of data to be labeled, includes: obtaining the predicted entities included in the data to be labeled according to the labeling result predicted value; performing similarity matching between each predicted entity and each keyword in the preset medical term database to obtain a matching result; in response to the matching result indicating that a keyword meeting a preset similarity condition with the predicted entity has been matched, performing calibration processing on the labeling result predicted value and/or the data to be labeled according to the matched keyword, to obtain a calibration result that corresponds to the data to be labeled and indicates successful calibration, wherein the calibration result includes: labeled data generated according to the keyword and/or the labeling result predicted value, and the data to be labeled; and in response to the matching result indicating that no keyword meeting the preset similarity condition with the predicted entity has been matched, obtaining a calibration result that corresponds to the data to be labeled and indicates calibration failure.

Take the data to be labeled "Do I need to do an electrocardiographic examination for chest pain?" as an example. The labeling result predicted value obtained by performing entity recognition through the foregoing steps is shown in FIG. 3. The identified medical entities include "chest pain" and "electrocardiogram"; of the predicted subject knowledge categories corresponding to these two medical entities, X1 represents disease symptoms and X2 represents a diagnostic method. First, according to the labeling result predicted value, the predicted entities "chest pain" and "electrocardiogram" included in the data to be labeled are obtained.

Then, similarity matching is performed between each of the two predicted entities and each keyword in the preset medical term database to obtain a matching result. For example, if the similarity between the predicted entity "chest pain" and the medical term "chest pain" in the preset medical term database is greater than a preset similarity threshold, and the similarity between the predicted entity "electrocardiogram" and the medical term "electrocardiogram" in the preset medical term database is also greater than the preset similarity threshold, a matching result is obtained indicating that keywords meeting the preset similarity condition with the predicted entities have been matched in the preset medical term database. If no keyword whose similarity exceeds the preset similarity threshold is matched in the preset medical term database for the predicted entity "chest pain" or "electrocardiogram", a matching result is obtained indicating that no keyword meeting the preset similarity condition with the predicted entity has been matched.
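The threshold test described above can be sketched as follows. The application does not specify the similarity measure at this point, so `difflib.SequenceMatcher` from the Python standard library is used purely as a stand-in for string similarity; the function name `best_match` and the threshold value 0.8 are likewise assumptions.

```python
from difflib import SequenceMatcher
from typing import Iterable, Optional, Tuple


def best_match(entity: str, keywords: Iterable[str],
               threshold: float = 0.8) -> Optional[Tuple[str, float]]:
    """Return the most similar keyword if its similarity exceeds the threshold, else None."""
    best, best_score = None, 0.0
    for keyword in keywords:
        score = SequenceMatcher(None, entity.lower(), keyword.lower()).ratio()
        if score > best_score:
            best, best_score = keyword, score
    if best is not None and best_score > threshold:
        return best, best_score
    return None            # no keyword meets the preset similarity condition


terms = ["chest pain", "electrocardiogram"]   # toy medical term database
print(best_match("chest pains", terms))       # close to "chest pain"
print(best_match("football", terms))          # nothing similar enough
```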

In some alternative embodiments, each keyword in the preset medical term database is pre-associated with a medical subject category and a subject knowledge category. For example, the keyword "chest pain" is associated with the medical subject category "internal medicine" and the subject knowledge category "disease symptoms". Thus, after the keywords in the preset medical term database that the predicted entities match are determined, the medical subject category and subject knowledge category corresponding to each keyword may be further determined.

In some alternative embodiments, the preset medical term database is made up of a plurality of sub-databases or data tables. The preset medical term database includes: subject name keywords for different medical subject categories, and subject knowledge keywords for each subject knowledge category under each of the medical subject categories. The subject name keywords of the medical subjects are stored in one database or data table, and the subject knowledge keywords of each medical subject category are stored in separate databases or data tables. Therefore, step-by-step matching may be adopted in the keyword matching process, which improves keyword matching efficiency.

Optionally, performing similarity matching between the predicted entity and each keyword in the preset medical term database to obtain a matching result includes: performing similarity matching between the predicted entity and the subject name keywords in the preset medical term database; in response to a subject name keyword being matched, taking the subject knowledge keywords of the medical subject category corresponding to the matched subject name keyword as a verification data source, splicing each predicted entity that matched a subject name keyword with each predicted entity that matched no subject name keyword to obtain spliced entities, and then performing similarity matching between the spliced entities and the corresponding verification data source to obtain a matching result; and in response to no subject name keyword being matched, obtaining a matching result indicating that no keyword is matched.

Wherein the subject name keywords include, but are not limited to, one or more of the following: "medical", "surgical", "pediatric", "gynecological", "otorhinolaryngological", "ophthalmic", "stomatology", "dentistry" and the like. Accordingly, the medical subject category corresponding to the subject name keyword includes, but is not limited to, one or more of the following: internal medicine, surgery, pediatrics, gynaecology and obstetrics, otorhinolaryngology, ophthalmology, stomatology, dentistry, and the like. The subject knowledge keywords include, but are not limited to, one or more of the following: disease names (e.g., "heart disease", "conjunctivitis", etc.), disease symptoms (e.g., "chest pain", "dry eye itching", etc.), treatment methods (e.g., "oral azithromycin", etc.), medical devices (e.g., "blood pressure apparatus", etc.), drugs (e.g., "azithromycin", etc.), organ tissues (e.g., "skin", etc.), biological nouns (e.g., "bacterial culture", etc.), etc.

Take the data to be labeled "Eyes are dry and itchy, what kind of eye drops are better to use?" as an example. Entity identification is performed through the foregoing steps, and the labeling result obtained may be: the medical subject category of the predicted entity "eye" is "ophthalmic", and the subject knowledge category of the predicted entity "eye drop" is "therapeutic drug". First, each predicted entity is similarity-matched against the preset medical term database storing the subject name keywords of the medical subjects, and the predicted entity "eye" is found to match the medical subject name keyword "ophthalmology" in the preset medical term database. Then, the vector representation of the predicted entity matched to "ophthalmology" and the vector representation of the predicted entity "eye drop" are spliced into an entity vector, which is matched against the subject knowledge keywords in the preset medical term database of subject knowledge of the medical subject "ophthalmology", to obtain a matching result. If a keyword is matched in the preset medical term database of subject knowledge of the medical subject category "ophthalmic", the predicted entity is considered to have been successfully matched with a keyword in the preset medical term database; the successfully matched keyword is recorded, together with the medical subject and subject knowledge category to which it belongs, for subsequent use. If no keyword is matched in the preset medical term database of subject knowledge of the medical subject category "ophthalmic", the predicted entity is considered to have failed to match any keyword in the preset medical term database.
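The step-by-step narrowing of the search space can be sketched as a two-stage lookup. This sketch deliberately simplifies the scheme: the vector splicing is omitted, the similarity test is a hypothetical callable `similar` supplied by the caller, and the table layout (`subject_names`, `subject_knowledge`) is an assumed in-memory stand-in for the sub-databases or data tables.

```python
from typing import Callable, Dict, List, Optional, Tuple


def staged_match(
    entities: List[str],
    subject_names: Dict[str, str],                 # subject name keyword -> medical subject category
    subject_knowledge: Dict[str, Dict[str, str]],  # category -> {knowledge keyword -> knowledge category}
    similar: Callable[[str, str], bool],           # hypothetical similarity test
) -> Optional[List[Tuple[str, str, str]]]:
    """Return (keyword, subject category, knowledge category) triples, or None on failure."""
    subject, anchor = None, None
    for entity in entities:                        # stage 1: subject name keywords
        for keyword, category in subject_names.items():
            if similar(entity, keyword):
                subject, anchor = category, entity
                break
        if subject is not None:
            break
    if subject is None:
        return None                                # e.g. non-medical text
    table = subject_knowledge.get(subject, {})     # stage 2: only the matched subject's table
    matches = []
    for entity in entities:
        if entity == anchor:
            continue
        for keyword, knowledge_category in table.items():
            if similar(entity, keyword):
                matches.append((keyword, subject, knowledge_category))
                break
    return matches or None


def toy_similar(a: str, b: str) -> bool:
    """Toy stand-in for vector similarity: exact match plus one hand-written pair."""
    return a == b or (a, b) in {("eye", "ophthalmic")}


names = {"ophthalmic": "ophthalmology", "pediatric": "pediatrics"}
knowledge = {"ophthalmology": {"eye drop": "therapeutic drug", "conjunctivitis": "disease name"}}
print(staged_match(["eye", "eye drop"], names, knowledge, toy_similar))
```

Restricting the second stage to one subject's table is what makes the matching cheaper than scanning every keyword in the database.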

If the data to be marked is a text outside the medical field, then when the predicted entities obtained through the foregoing entity identification steps are similarity-matched against the preset medical term database storing the medical subject name keywords, the matching result is that the data to be marked does not match any medical field keyword.

Specifically, for example, vector matching may be based on faiss (Facebook AI Similarity Search, a similarity vector search method): medical subject name keywords are matched from the preset medical subject term database, and if a medical subject name keyword is matched, the medical subject category corresponding to the matched keyword is used as the matching subject category. Then, the vector of the successfully matched predicted entity is spliced with the vectors of the other predicted entities to be judged, and the spliced vector is matched again against the subject knowledge keywords of the matching subject category in the preset medical term database. If a keyword is successfully matched, the predicted entities in the data to be marked are considered to be labeled correctly. Otherwise, the predicted entity is considered to be labeled incorrectly.
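The two-stage matching described above can be sketched as follows. This is a minimal illustration only: it uses plain NumPy cosine similarity in place of faiss, and the function names `best_match` and `two_stage_match`, the similarity threshold of 0.8, and the assumption that subject knowledge keywords are stored as vectors of twice the entity dimension (so that they can be compared with a spliced vector) are all hypothetical and are not specified by the present application.

```python
import numpy as np

def best_match(query, keywords, vectors, threshold):
    """Return the keyword whose vector is most cosine-similar to `query`,
    or None if the best similarity is below `threshold`."""
    q = np.asarray(query, dtype=float)
    m = np.asarray(vectors, dtype=float)
    scores = (m @ q) / (np.linalg.norm(m, axis=1) * np.linalg.norm(q))
    best = int(np.argmax(scores))
    return keywords[best] if scores[best] >= threshold else None

def two_stage_match(entities, subject_index, knowledge_index, threshold=0.8):
    """entities: dict of predicted entity text -> vector (dimension d).
    subject_index: (subject name keywords, their vectors of dimension d).
    knowledge_index: dict of subject -> (knowledge keywords, vectors of dimension 2d).
    Returns (subject, {entity: matched knowledge keyword}) or None."""
    subject_keywords, subject_vectors = subject_index
    # Stage 1: look for a predicted entity that matches a subject name keyword.
    subject, anchor = None, None
    for name, vec in entities.items():
        hit = best_match(vec, subject_keywords, subject_vectors, threshold)
        if hit is not None:
            subject, anchor = hit, name
            break
    if subject is None:
        return None  # no medical subject matched, e.g. non-medical text
    # Stage 2: splice the anchor vector with each remaining entity vector and
    # match against the knowledge keywords of the matched subject only.
    keywords, vectors = knowledge_index[subject]
    matches = {}
    for name, vec in entities.items():
        if name == anchor:
            continue
        spliced = np.concatenate([np.asarray(entities[anchor], dtype=float),
                                  np.asarray(vec, dtype=float)])
        hit = best_match(spliced, keywords, vectors, threshold)
        if hit is not None:
            matches[name] = hit
    return subject, matches
```

In practice the exhaustive similarity computation in `best_match` would be replaced by a vector index such as faiss when the term database is large; the control flow of the two stages stays the same.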

The above are two possible embodiments for matching the predicted entity with the keywords in the preset medical term database. In a specific implementation, other methods may also be used to match the predicted entity with the keywords in the preset medical term database according to labeling requirements; these are not enumerated one by one in the embodiments of the present application.

And after the keyword matching result is obtained, further calibrating the labeling result predicted value according to the matching result to obtain a calibration result.

As described above, the preset medical term database includes: subject name keywords for different medical subject categories, and subject knowledge keywords for each subject knowledge category under each of the medical subject categories. Optionally, the calibrating the predicted value of the labeling result and/or the data to be labeled according to the keyword obtained by matching to obtain a calibration result corresponding to the data to be labeled and indicating successful calibration includes: substep S1 to substep S5.

And S1, determining the prediction entity to be calibrated as a target prediction entity according to the consistency of the medical subject category corresponding to the matched keywords and the subject knowledge category.

Firstly, for a piece of data to be marked, the successfully matched prediction entities may all correspond to the same medical subject, or may correspond to different medical subjects. If the successfully matched prediction entities in one piece of data to be marked correspond to different medical subjects, the prediction entities belonging to one medical subject are selected according to a preset strategy for marking, and marked data is generated.

Optionally, the determining the prediction entity to be calibrated according to the consistency of the medical subject category corresponding to the matched keyword and the subject knowledge category, as the target prediction entity, includes: under the condition that the matched keywords correspond to different medical subject categories and/or correspond to different subject knowledge categories, selecting the prediction entity matched with the target keywords according to a preset strategy as a target prediction entity, wherein the target keywords are: subject name keywords and subject knowledge keywords corresponding to a medical subject category. For example, one piece of data to be marked includes 3 prediction entities, where the prediction entity 1 is matched with a name keyword of the medical subject category a, the prediction entity 2 is matched with a name keyword of the medical subject category B, and if the prediction entity 3 is matched with a subject knowledge keyword of a subject knowledge category of the medical subject category a, the prediction entity 1 and the prediction entity 3 corresponding to the medical subject category a in the data to be marked can be considered as target prediction entities for performing subsequent marking, so that the obtained marking result is suitable for training a text classification model in the medical field.
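The selection of target prediction entities in the example above can be sketched as follows. The present application only refers to "a preset strategy" without fixing it; the sketch assumes, purely for illustration, that the medical subject category with the most matched prediction entities is chosen. The function name `select_target_entities` and the tuple layout are hypothetical.

```python
from collections import Counter

def select_target_entities(matches):
    """matches: list of (entity, subject_category, keyword_type) tuples, where
    keyword_type is "name" or "knowledge".
    Returns (chosen subject category, entities kept as target prediction entities)."""
    counts = Counter(subject for _, subject, _ in matches)
    if not counts:
        return None, []
    # Assumed strategy: keep the subject category covered by the most entities.
    # Ties are resolved in favour of the category encountered first.
    target_subject = counts.most_common(1)[0][0]
    targets = [entity for entity, subject, _ in matches if subject == target_subject]
    return target_subject, targets
```

Applied to the three prediction entities of the example, this keeps prediction entity 1 and prediction entity 3 (both under medical subject category A) and discards prediction entity 2.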

And S2, carrying out entity calibration on the target prediction entity in the data to be marked by adopting the keyword obtained by matching according to the difference between the target prediction entity and the keyword obtained by matching, so as to obtain calibrated data to be marked.

Next, in order to improve the quality of the acquired marked data, the text in the data to be marked is calibrated further based on the technical terms in the preset medical term database.

Optionally, the entity calibration is performed on the target prediction entity in the data to be marked by using the keyword obtained by matching according to the dissimilarity between the target prediction entity and the keyword obtained by matching, so as to obtain calibrated data to be marked, including: under the condition that the target prediction entity is different from the keyword obtained by matching, replacing the corresponding target prediction entity by the keyword obtained by matching to obtain the data to be marked by calibration; and under the condition that the target prediction entity is the same as the keyword obtained by matching, reserving the target prediction entity to obtain the data to be marked by calibration.

Take as an example the case where the prediction entity "chest pain" is matched to the keyword "chest pain" in the preset medical term database and the prediction entity "electrocardiogram" is matched to the keyword "electrocardiogram" in the preset medical term database. The target prediction entities determined by the foregoing steps are "chest pain" and "electrocardiogram". Because the target prediction entity "chest pain", as written in the data to be marked, differs from the keyword "chest pain" matched in the preset medical term database, the matched keyword is used to replace the target prediction entity in the data to be marked. Because the target prediction entity "electrocardiogram" is identical to the keyword "electrocardiogram" matched in the preset medical term database, no replacement is needed. The calibrated data to be marked obtained after keyword replacement is: "I have a bit of chest pain; do I need an electrocardiogram examination?"
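The replacement step can be sketched as below. The sketch assumes that each target prediction entity is available as a character span in the data to be marked together with its matched keyword, and that spans do not overlap; the function name `calibrate_entities` is hypothetical.

```python
def calibrate_entities(text, entities):
    """entities: non-overlapping (start, end, keyword) character spans in `text`.
    Returns the calibrated text and the spans of the entities in that new text."""
    pieces, spans = [], []
    cursor = 0   # position in the original text
    length = 0   # length of the calibrated text built so far
    for start, end, keyword in sorted(entities):
        pieces.append(text[cursor:start])
        length += start - cursor
        surface = text[start:end]
        if surface != keyword:
            surface = keyword  # replace the mention with the database term
        spans.append((length, length + len(surface), keyword))
        pieces.append(surface)
        length += len(surface)
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces), spans
```

Returning the updated spans alongside the text matters because a replacement keyword may be longer or shorter than the original mention, which shifts the positions of all later entities used in substep S3.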

And S3, calibrating the marking result predicted value according to the position information of the target predicted entity in the data to be marked, and obtaining a marking result calibrated value.

Next, the calibrated data to be marked is labeled. For example, the predicted value of the labeling result of the data to be marked can be adjusted so that only the starting position and the middle positions of each target prediction entity (after keyword replacement) in the calibrated data to be marked carry entity labeling information, while the labeling information of text outside the target prediction entities is modified into non-entity labeling information. Taking the chest pain question above as the data to be marked, the calibrated data to be marked obtained is: "I have a bit of chest pain; do I need an electrocardiogram examination?" Correspondingly, the labeling result calibration values of the calibrated data to be marked are shown in fig. 4.
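A minimal sketch of this label calibration is given below. It assumes a BIO-style scheme in which the starting position of an entity is tagged `B-<category>`, middle positions are tagged `I-<category>`, and everything else is tagged `O`; the present application does not name the tagging scheme, and the function name `calibrate_labels` and the category names used with it are hypothetical.

```python
def calibrate_labels(num_tokens, target_spans):
    """target_spans: (start, end, category) token spans of the target prediction
    entities, with `end` exclusive. Returns one label per token."""
    labels = ["O"] * num_tokens  # non-entity label everywhere by default
    for start, end, category in target_spans:
        labels[start] = f"B-{category}"      # starting position of the entity
        for i in range(start + 1, end):
            labels[i] = f"I-{category}"      # middle positions of the entity
    return labels
```

Because the label sequence is rebuilt from the target spans alone, any entity label predicted by the model for a non-target prediction entity is overwritten with the non-entity label, as the paragraph above requires.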

And S4, generating marked data according to the data to be marked and the marking result calibration value.

And then, taking the data to be marked as a text sequence, taking the marking result calibration value of the data to be marked as marking, and generating a piece of marked data.

And S5, obtaining a calibration result which corresponds to the data to be marked and indicates successful calibration according to the generated marked data.

In some alternative embodiments, entity text calibration may not be performed on the predicted entity in the data to be marked, the predicted value of the marking result may be calibrated only according to the position of the target predicted entity, then the original data to be marked is used as a text sequence, and the calibrated predicted value of the marking result is used as a marking, so as to generate a piece of marked data.

Thus, the labeling processing of one piece of data to be labeled is completed. The generated marked data is data in which the entity positions and categories are labeled.

In some optional embodiments, after generating the marked data according to the data to be marked and the marking result calibration value, the method further includes: setting a medical subject category label for marked data generated according to the data to be marked according to the medical subject category corresponding to the keyword matched by the target prediction entity; or setting a subject knowledge category label for the marked data generated according to the data to be marked according to the subject knowledge category corresponding to the keyword matched by the target prediction entity. After the marked data of the marked entity information is generated according to the data to be marked, further, in order to generate training data of the text classification model in the medical field, a medical subject category label and a subject knowledge category label can be further set on the generated marked data, so that the generated marked data can be provided with the entity information label and the classification label.
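One possible representation of marked data that carries both the entity information labels and the classification labels is sketched below. The record layout, the field names, and the rule of taking the first available category are assumptions made for illustration; the present application does not prescribe a storage format.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class AnnotatedRecord:
    tokens: List[str]
    entity_labels: List[str]                   # one entity label per token
    subject_category: Optional[str] = None     # medical subject category label
    knowledge_category: Optional[str] = None   # subject knowledge category label

def attach_category_labels(
    record: AnnotatedRecord,
    matched: List[Tuple[str, str, Optional[str]]],
) -> AnnotatedRecord:
    """matched: (keyword, subject category, knowledge category or None) for each
    keyword matched by a target prediction entity."""
    for _, subject, knowledge in matched:
        if record.subject_category is None:
            record.subject_category = subject
        if knowledge is not None and record.knowledge_category is None:
            record.knowledge_category = knowledge
    return record
```

A record labeled this way can serve as a training sample for the entity recognition model (through `entity_labels`) and for the medical field text classification model (through the two category fields).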

And step 150, based on the calibration result, performing an updating operation on the first data set and the second data set to convert the data to be marked in the second data set into marked data in the first data set.

Optionally, the performing an update operation on the first data set and the second data set based on the calibration result includes: supplementing marked data in the calibration result to the first data set aiming at the calibration result indicating successful calibration, and removing the data to be marked corresponding to the calibration result from the second data set so as to execute updating operation on the first data set and the second data set.

After the processing of the steps, the obtained calibration result comprises two cases of successful calibration and failure calibration. For the case of successful calibration, the annotated data is further generated. And adding the generated marked data into the first data set to supplement the first data set, and simultaneously removing the data to be marked which generates the marked data from the second data set to finish the conversion of the data to be marked.

According to the steps, entity identification and calibration processing are respectively carried out on each piece of data to be marked in the second data set, so that one round of conversion from the data to be marked to marked data is realized. And for the data to be marked which fails in calibration, the data to be marked is kept in the second data set, and the data to be marked is converted for the next round.

Step 160, jumping to execute the next round of steps from the step of training a preset entity recognition model based on the marked data in the first data set to the step of executing updating operation on the first data set and the second data set based on the calibration result until the ending marking condition is met, and outputting the marked data in the first data set.

Wherein the end labeling conditions include, but are not limited to, one or more of the following: the second data set is empty; the second data set has not been updated for at least two consecutive rounds of executing the updating operation on the first data set and the second data set according to the calibration result; and the number of marked data in the first data set reaches a preset number.

After a round of conversion from data to be annotated to annotated data, if the number of annotated data in the first dataset does not meet the requirement of training a classification model or training an entity recognition model, the next round of conversion from data to be annotated to annotated data can be performed iteratively. Taking the marked sample in the updated first data set as training data, executing step 120, retraining the entity identification model, and executing steps 130 to 150 on the updated second data set to perform the next round of conversion from the data to be marked to the marked data.
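The iteration over steps 120 to 150 together with the end labeling conditions can be sketched as follows. The callables `train`, `predict` and `calibrate` stand in for the entity recognition model training, the labeling prediction and the database-based calibration respectively; their signatures, the parameter names `target_size` and `patience`, and the convention that `calibrate` returns `None` on calibration failure are assumptions made for this sketch.

```python
def iterative_labeling(first, second, train, predict, calibrate,
                       target_size=None, patience=2):
    """first: marked data; second: data to be marked.
    Returns the final (first, second) data sets."""
    first, second = list(first), list(second)
    stale_rounds = 0  # consecutive rounds in which nothing was converted
    while second:  # end condition: the second data set is empty
        if target_size is not None and len(first) >= target_size:
            break  # end condition: enough marked data has been collected
        model = train(first)                       # step 120
        converted, remaining = [], []
        for item in second:
            prediction = predict(model, item)      # step 130
            record = calibrate(item, prediction)   # step 140
            if record is None:
                remaining.append(item)   # calibration failed: keep for next round
            else:
                converted.append(record)
        first.extend(converted)                    # step 150: update both data sets
        second = remaining
        stale_rounds = 0 if converted else stale_rounds + 1
        if stale_rounds >= patience:
            break  # end condition: no update for `patience` consecutive rounds
    return first, second
```

Data that fails calibration stays in the second data set and gets another chance once the model has been retrained on the enlarged first data set, which is why a later round can convert items that an earlier round could not.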

Furthermore, the predicted value of the labeling result is calibrated by adopting a preset medical term database, and the quality of the labeling data is improved by converting the entity in the sample data into the medical entity with the standard of the medical field, so that the understanding and processing level of a text classification model obtained based on the generated labeled data to the medical text can be improved, and the accuracy and reliability of text classification are improved. On the other hand, by introducing a voting mechanism, the training data is continuously updated, and the robustness and generalization capability of the entity recognition model are improved, so that the accuracy of the predicted value of the labeling result is continuously improved, and the quality of the generated entity labeling data is improved.

Referring to fig. 5, the embodiment of the application further discloses a device for acquiring labeling data in a medical field, where the device includes:

the marked data and to-be-marked data obtaining module 510 is configured to obtain a first data set composed of marked data and a second data set composed of to-be-marked data, where the marked data is: medical field text labeled with entity class labels;

the entity recognition model training module 520 is configured to train a preset entity recognition model based on the labeled data in the first dataset;

The pre-labeling module 530 is configured to perform entity recognition on each piece of data to be labeled in the second data set by using the trained entity recognition model, so as to obtain a predicted value of a labeling result corresponding to each piece of data to be labeled;

the labeling result calibration module 540 is configured to calibrate the predicted value of the labeling result based on a preset medical term database, so as to obtain a calibration result corresponding to each piece of data to be labeled;

a marked data generating module 550, configured to perform an update operation on the first data set and the second data set based on the calibration result, so as to convert the data to be marked in the second data set into marked data in the first data set;

the data set updating module 560 is configured to jump to execute the next round of calling the entity recognition model training module 520 to the marked data generating module 550 until the end marking condition is satisfied, and output marked data in the first data set.

Optionally, the entity recognition model is a model decision tree model including a plurality of different branch entity recognition models, and the pre-labeling module 530 is further configured to:

aiming at each piece of data to be marked in the second data set, respectively adopting each branch entity identification model to carry out entity identification on the data to be marked to obtain a coding vector output by each branch entity identification model;

And carrying out weighted fusion on the coding vectors output by each branch entity identification model to obtain a marking result predicted value corresponding to each piece of data to be marked.
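The weighted fusion performed by the pre-labeling module 530 can be sketched as follows. The sketch assumes that each branch entity recognition model outputs, for every token, a vector of scores over the same label set, and that the fusion weights are given in advance; how the weights are obtained is not shown here, and the function name `fuse_predictions` is hypothetical.

```python
import numpy as np

def fuse_predictions(branch_outputs, weights):
    """branch_outputs: one array of shape (tokens, labels) per branch model.
    weights: one non-negative weight per branch model.
    Returns (fused scores of shape (tokens, labels), predicted label index per token)."""
    stacked = np.stack([np.asarray(o, dtype=float) for o in branch_outputs])
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                           # normalise the weights to sum to 1
    fused = np.tensordot(w, stacked, axes=1)  # weighted sum over the branch axis
    return fused, fused.argmax(axis=1)
```

The label index returned for each token would then be mapped back to an entity label to form the labeling result predicted value.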

Optionally, the labeling result calibration module 540 is further configured to:

obtaining a predicted entity included in the data to be marked according to the marking result predicted value;

performing similarity matching on the prediction entity and each keyword in a preset medical term database to obtain a matching result;

responding to the matching result to indicate that the keyword meeting the preset similarity condition with the prediction entity is matched, and performing calibration processing on the predicted value of the marking result and/or the data to be marked according to the keyword obtained by matching to obtain a calibration result which corresponds to the data to be marked and indicates successful calibration, wherein the calibration result comprises the following components: marked data generated according to the keyword and/or the marking result predicted value and the data to be marked;

and responding to the matching result to indicate that the keyword which meets the preset similarity condition with the prediction entity is not matched, and obtaining a calibration result which corresponds to the data to be marked and indicates calibration failure.

Optionally, the preset medical term database includes: subject name keywords for different medical subject categories, and subject knowledge keywords for each subject knowledge category under each of the medical subject categories. The performing calibration processing on the predicted value of the labeling result and/or the data to be labeled according to the keywords obtained by matching, to obtain a calibration result which corresponds to the data to be labeled and indicates successful calibration, includes:

determining the prediction entity to be calibrated according to the consistency of the medical subject category corresponding to the matched keyword and the subject knowledge category, and taking the prediction entity as a target prediction entity;

according to the difference between the target prediction entity and the keyword obtained by matching, performing entity calibration on the target prediction entity in the data to be marked by adopting the keyword obtained by matching to obtain calibrated data to be marked;

calibrating the marking result predicted value according to the position information of the target predicted entity in the data to be marked, and obtaining a marking result calibrated value;

generating marked data according to the data to be marked and the marking result calibration value;

and obtaining a calibration result which corresponds to the data to be marked and indicates successful calibration according to the generated marked data.

Optionally, the determining the prediction entity to be calibrated according to the consistency of the medical subject category corresponding to the matched keyword and the subject knowledge category, as the target prediction entity, includes:

under the condition that the matched keywords correspond to different medical subject categories and/or correspond to different subject knowledge categories, selecting the prediction entity matched with the target keywords according to a preset strategy as a target prediction entity, wherein the target keywords are: subject name keywords and subject knowledge keywords corresponding to a medical subject category.

Optionally, the entity calibration is performed on the target prediction entity in the data to be marked by using the keyword obtained by matching according to the dissimilarity between the target prediction entity and the keyword obtained by matching, so as to obtain calibrated data to be marked, including:

under the condition that the target prediction entity is different from the keyword obtained by matching, replacing the corresponding target prediction entity by the keyword obtained by matching to obtain the data to be marked by calibration;

and under the condition that the target prediction entity is the same as the keyword obtained by matching, reserving the target prediction entity to obtain the data to be marked by calibration.

Optionally, after generating the marked data according to the data to be marked and the marking result calibration value, the method further includes:

setting a medical subject category label for marked data generated according to the data to be marked according to the medical subject category corresponding to the keyword matched by the target prediction entity; or,

and setting a subject knowledge category label for the marked data generated according to the data to be marked according to the subject knowledge category corresponding to the keyword matched by the target prediction entity.

Optionally, the marked data generating module 550 is further configured to:

supplementing marked data in the calibration result to the first data set aiming at the calibration result indicating successful calibration, and removing the data to be marked corresponding to the calibration result from the second data set so as to execute updating operation on the first data set and the second data set.

The device for acquiring the medical field labeling data disclosed in the embodiment of the present application is used for implementing the method for acquiring the medical field labeling data described in the embodiment of the present application, and specific implementation manners of each module of the device are not repeated, and reference may be made to specific implementation manners of corresponding steps in the embodiment of the method.

The device for acquiring the labeling data in the medical field disclosed by the embodiment of the application acquires a first data set composed of labeled data and a second data set composed of data to be labeled, wherein the labeled data are as follows: medical field text labeled with entity class labels; training a preset entity identification model based on the marked data in the first data set; performing entity recognition on each piece of data to be marked in the second data set by adopting the entity recognition model obtained through training to obtain a marking result predicted value corresponding to each piece of data to be marked; calibrating the predicted value of the labeling result based on a preset medical term database to obtain a calibration result corresponding to each piece of data to be labeled; based on the calibration result, performing an updating operation on the first data set and the second data set to convert the data to be marked in the second data set into marked data in the first data set; and jumping to the step of executing the next round of the step of training a preset entity identification model based on the marked data in the first data set to the step of executing updating operation on the first data set and the second data set based on the calibration result until the ending marking condition is met, and outputting the marked data in the first data set. The device fully utilizes professional medical knowledge to conduct supervision, achieves the expansion of a high-quality data set under the conditions of limited labeling data and limited resources, provides sufficient and accurate training data for entity identification and text classification in the medical field, does not need manual labeling, improves the efficiency of labeling data, and reduces the labeling cost.

In this specification, the embodiments are described in a progressive manner: each embodiment focuses on its differences from the other embodiments, and identical or similar parts of the embodiments may be referred to one another. Since the device embodiments are substantially similar to the method embodiments, their description is relatively brief, and reference is made to the description of the method embodiments for the relevant points.

The method and the device for acquiring labeling data in the medical field provided by the present application have been described in detail above. Specific examples are used herein to illustrate the principles and embodiments of the present application, and the description of the above examples is only intended to help understand the method and the core idea of the present application. Meanwhile, those of ordinary skill in the art may, in accordance with the ideas of the present application, make changes to the specific embodiments and the scope of application. In view of the above, the content of this description should not be construed as limiting the present application.

The apparatus embodiments described above are merely illustrative. The units illustrated as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement it without creative effort.

Various component embodiments of the present application may be implemented in hardware, or in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will appreciate that some or all of the functions of some or all of the components in an electronic device according to embodiments of the present application may be implemented in practice using a microprocessor or Digital Signal Processor (DSP). The present application may also be embodied as an apparatus or device program (e.g., computer program and computer program product) for performing a portion or all of the methods described herein. Such a program embodying the present application may be stored on a computer readable medium, or may have the form of one or more signals. Such signals may be downloaded from an internet website, provided on a carrier signal, or provided in any other form.

For example, fig. 6 shows an electronic device in which a method according to the present application may be implemented. The electronic device may be a PC, a mobile terminal, a personal digital assistant, a tablet computer, etc. The electronic device conventionally comprises a processor 610 and a memory 620 and a program code 630 stored on said memory 620 and executable on the processor 610, said processor 610 implementing the method described in the above embodiments when said program code 630 is executed. The memory 620 may be a computer program product or a computer readable medium. The memory 620 may be an electronic memory such as a flash memory, an EEPROM (electrically erasable programmable read only memory), an EPROM, a hard disk, or a ROM. The memory 620 has a storage space 6201 for program code 630 of a computer program for performing any of the method steps described above. For example, the memory space 6201 for the program code 630 may include individual computer programs for implementing the various steps in the above methods, respectively. The program code 630 is computer readable code. These computer programs may be read from or written to one or more computer program products. These computer program products comprise a program code carrier such as a hard disk, a Compact Disc (CD), a memory card or a floppy disk. The computer program comprises computer readable code which, when run on an electronic device, causes the electronic device to perform a method according to the above-described embodiments.

The embodiment of the application also discloses a computer readable storage medium, on which a computer program is stored, which when being executed by a processor, realizes the steps of the method for acquiring the labeling data in the medical field according to the embodiment of the application.

Such a computer program product may be a computer readable storage medium, which may have memory segments, memory spaces, etc. arranged similarly to the memory 620 in the electronic device shown in fig. 6. The program code may be stored in the computer readable storage medium, for example, in a suitable form. The computer readable storage medium is typically a portable or fixed storage unit as described with reference to fig. 7. In general, the memory unit comprises computer readable code 630', which computer readable code 630' is code that is read by a processor, which code, when executed by the processor, implements the steps of the method described above.

Reference herein to "one embodiment," "an embodiment," or "one or more embodiments" means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the present application. Furthermore, it is noted that instances of the phrase "in one embodiment" herein do not necessarily all refer to the same embodiment.

In the description provided herein, numerous specific details are set forth. However, it is understood that embodiments of the present application may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.

In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The application may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the unit claims enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, third, etc. do not denote any order. These words may be interpreted as names.

Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present application, and are not limiting thereof; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the corresponding technical solutions.

Claims

1. The method for acquiring the labeling data in the medical field is characterized by comprising the following steps:

2. The method according to claim 1, wherein the entity recognition model is a model decision tree model including a plurality of different branch entity recognition models, the training-obtained entity recognition model is used for performing entity recognition on each piece of data to be marked in the second data set to obtain a predicted value of a marking result corresponding to each piece of data to be marked, and the method includes:

3. The method according to claim 1, wherein the calibrating the predicted value of the labeling result based on the preset medical term database to obtain the calibration result corresponding to each piece of data to be labeled comprises:

4. A method according to claim 3, wherein the predetermined medical term database comprises: performing calibration processing on the predicted value of the labeling result and/or the data to be labeled according to the keywords obtained by matching to obtain a calibration result which corresponds to the data to be labeled and indicates successful calibration, wherein the calibration result comprises:

5. The method according to claim 4, wherein the determining the prediction entity to be calibrated as the target prediction entity according to the consistency of the medical subject category and the subject knowledge category corresponding to the matched keyword includes:

6. The method of claim 4, wherein the performing entity calibration on the target prediction entity in the data to be annotated by using the keyword obtained by matching according to the dissimilarity between the target prediction entity and the keyword obtained by matching, to obtain calibrated data to be annotated, includes:

7. The method of claim 4, wherein generating labeled data from the calibration data to be labeled and the labeling result calibration value further comprises:

8. The method of claim 1, wherein the performing an update operation on the first data set and the second data set based on the calibration result comprises:

9. An apparatus for acquiring labeling data in a medical field, the apparatus comprising:

10. An electronic device comprising a memory, a processor and program code stored on the memory and executable on the processor, wherein the processor implements the method of acquiring medical field labeling data according to any one of claims 1 to 8 when the program code is executed by the processor.

11. A computer-readable storage medium having stored thereon program code, which when executed by a processor, implements the steps of the method for acquiring medical field labeling data according to any one of claims 1 to 8.