CN117637175A - Large model medical training data generation method and system based on multistage semantics - Google Patents


Info

Publication number
CN117637175A
CN117637175A CN202311589976.2A
Authority
CN
China
Prior art keywords
training
medical
model
alignment
level
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311589976.2A
Other languages
Chinese (zh)
Inventor
黄飞跃
徐宇辰
朱立峰
陈胤儒
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ruijin Hospital Affiliated to Shanghai Jiaotong University School of Medicine Co Ltd
Original Assignee
Ruijin Hospital Affiliated to Shanghai Jiaotong University School of Medicine Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ruijin Hospital Affiliated to Shanghai Jiaotong University School of Medicine Co Ltd
Priority to CN202311589976.2A
Publication of CN117637175A

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/70 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/242 Dictionaries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/09 Supervised learning
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Public Health (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computing Systems (AREA)
  • Medical Informatics (AREA)
  • Pathology (AREA)
  • Databases & Information Systems (AREA)
  • Epidemiology (AREA)
  • Primary Health Care (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a large model medical training data generation method and system based on multi-level semantics. An NLP model is used to generate corpora as training data for a large model, which addresses the scarcity of corpora and professional dictionaries in the medical field. The definition and application of multi-level semantics addresses the insufficient study of the basic attributes, characteristics and rules of medical data. One-to-many reinforcement learning alignment training and loss-function fine-tuning address the weak generalization, inflexible constraints and low adaptability of other text generation technologies. Together, these techniques make large model training in the medical field more effective, overcome the problems of the prior art, and improve medical text processing and analysis.

Description

Large model medical training data generation method and system based on multistage semantics
Technical Field
The invention relates to the field of computers, in particular to a large model medical training data generation method and system based on multistage semantics.
Background
The electronic entry and informatization of medical data have produced a large amount of medical text, including clinical information such as patients' medical records, medical orders, nursing documents, examination findings and test conclusions. These data are important for clinical diagnosis, treatment and outcome analysis. However, because medical text is complex and unstructured, how to process, analyze and mine such data has become an important issue in medical informatization.
Medical text structuring is the process of converting unstructured natural language information into a data structure that a computer can "understand" and handle conveniently. Through text information extraction and conversion (or encoding), the medical text data can be converted into structured data for applications such as information retrieval, discovery of similar medical records, patient information management, and depth analysis of the medical data.
With the explosive development of deep learning, artificial intelligence and machine learning techniques have spread through many industries, and the medical field is no exception. A machine learning model requires that the target data and the data to be analyzed be structured data a computer can recognize, yet most of the information produced in daily medical diagnosis, such as free-text diagnosis reports, is unstructured and cannot be fed directly into a machine learning algorithm or model.
Effective machine learning and deep learning models require large amounts of structured data. Every successful deep learning application to date has come from a field that either possesses massive training data or can generate massive data automatically through simulation by modeling engineers.
The medical field generates massive information every day, but most of it is unstructured data such as pathology and radiology reports, images and diagnostic texts. Image data can at least be fed into a model directly as pixels, a structured form, but a text diagnosis report written by a doctor cannot be entered directly into a deep learning or other machine learning model.
Prior art solutions typically process medical text manually or semi-manually. Most doctors and practitioners in the relevant industries handle unstructured historical medical data (retrospective data) by reading the medical text and making standardized entries by hand. In the usual approach, the relevant personnel design, or commission a third-party technology provider to program, an electronic structured form (eCRF), then scan the text data piece by piece and manually enter the relevant information into the form once it is found. A few techniques assist manual reading through keyword matching and rule-based semi-automated information extraction, i.e., matching related words or expressions in the text to provide an auxiliary reading tool. These solutions, however, rely mainly on manual work by experts, which is time-consuming and costly. The whole process lacks an intelligent tool for generating training data; even with manual entry, the labor-intensive, repetitive and tedious work keeps efficiency low, so enough training samples cannot be obtained to support large model training. As a result, the accuracy and recall of the overall information extraction fall short of expectations, massive diagnosis and treatment data cannot be fully used to guide and optimize models, and a major bottleneck is created for the development of medical artificial intelligence and big data.
In addition, the traditional medical field lacks large-scale real corpora and rich medical dictionaries; in particular, for specialized medical terms, diseases, treatment methods and the like, the small amount of manually annotated data cannot meet practical application needs. The lack of in-depth research on the basic attributes, features and rules of medical text leads to poor information extraction and analysis for the various kinds of medical text, including course records, medical reports, prescription information and the like.
In summary, the problems in the medical field are: insufficient corpora and professional dictionaries, insufficient study of the basic attributes, characteristics and rules of medical data, limited semantic understanding of medical data, low generation efficiency, weak generalization, inflexible constraints and low adaptability.
Disclosure of Invention
In order to solve the problems, the invention aims to provide a large model medical training data generation method and system based on multi-level semantics.
To this end, one aspect of the present invention provides a large model medical training data generating method based on multi-level semantics, comprising the steps of:
step 1, generating multi-level semantics, comprising the following steps:
step 1.1, capturing keywords, and identifying and labeling entities;
step 1.2, sorting and counting the original description, and extracting and modeling the relationship;
step 1.3, weighting to generate one-to-many multi-level semantics, and carrying out semantic representation and generation;
step 2, SFT fine-tuning training, comprising the following steps:
step 2.1, pre-training a basic model;
step 2.2, clearly defining specific medical tasks and preparing corresponding labeling data;
and step 2.3, performing the SFT fine-tuning training process.
Step 3, alignment training of one-to-many reinforcement learning, comprising the following steps:
step 3.1, defining input and output, and ensuring that the corresponding relation between the input and the output is correctly established;
step 3.2, constructing a Seq2Seq model suitable for processing a one-to-many sequence generating task, wherein the model is composed of an encoder and a decoder, the encoder is used for converting an input medical phrase into a vector with fixed length, the decoder generates a next word according to the vector and a word generated last, and a plurality of medical phrases or sentences are generated step by step;
and 3.3, enhancing the learning alignment training in one-to-many mode.
Further, step 1.1 automatically identifies the entities in the medical text using named entity recognition techniques and labels them with their corresponding categories.
Further, step 1.2 identifies and extracts relationships between different entities by analyzing context information, dependencies and grammatical structures.
Further, step 2.3 is divided into feature extraction and fine-tuning training, wherein:
in the feature extraction stage, features are extracted from the labeled data using the pre-trained base model, and the extracted features are used for fine-tuning training in the next stage;
in the fine-tuning training stage, the extracted features are input into a classifier, sequence labeler or other model for the specific medical task for training.
Further, the fine tuning training stage uses the labeling data to perform supervised training, and optimizes model parameters through back propagation and gradient updating.
Further, the network model in step 3.2 is a Seq2Seq model.
Further, the alignment training in step 3.3 includes reverse alignment and forward alignment, wherein:
reverse alignment calculates a loss function using the known output sequence and the model's predictions, and back-propagates to update the model parameters;
forward alignment uses the model trained by reverse alignment to make forward predictions on the input medical phrase.
Further, forward alignment produces an output probability distribution at each generated position; the forward alignment loss function is calculated from this distribution and the known output sequence, and the model parameters are updated.
Another aspect of the present invention provides a large model medical training data generating system based on multi-level semantics, comprising a multi-level semantics module, an SFT fine tuning module, an alignment training module, wherein:
the multi-level semantic module is used for generating multi-level semantics and comprises the following components:
the text identification component is used for capturing keywords and carrying out entity identification and labeling;
the text ordering and counting component is used for ordering and counting the original text description and extracting and modeling the relation;
the multi-level semantic weighting component is used for weighting to generate one-to-many multi-level semantics and carrying out semantic representation and generation;
the SFT fine tuning module is used for SFT fine-tuning training and comprises the following components:
the pre-training component is used for pre-training the basic model;
the labeling component is used for definitely defining specific medical tasks and preparing corresponding labeling data;
the fine tuning component is used for SFT fine tuning training;
an alignment training module for performing one-to-many reinforcement learning alignment training, comprising the following components:
the input/output assembly is used for defining input and output and ensuring that the corresponding relation between the input and the output is correctly established;
a model construction component for constructing a network model suitable for processing a one-to-many sequence generation task, the model consisting of an encoder for converting an input medical phrase into a fixed length vector and a decoder for generating a next word from the vector and a last generated word and gradually generating a plurality of medical phrases or sentences;
an alignment component for one-to-many reinforcement learning alignment training.
Due to the adoption of the technical scheme, compared with the prior art, the invention has the beneficial effects that:
1) The corpus generated by the NLP model is used as training data of a large model, so that the defects of a corpus in the medical field and a professional dictionary can be overcome.
2) Due to the adoption of definition and application of multi-level semantics, the method can solve the problem of insufficient research on basic attributes, characteristics and rules of medical data.
3) Because one-to-many reinforcement learning alignment training and loss-function fine-tuning are adopted, the weak generalization, inflexible constraints and low adaptability of other text generation technologies can be overcome.
In conclusion, by combining techniques 1) to 3), the invention can solve the problems of insufficient corpora and professional dictionaries in the medical field, insufficient study of the basic attributes, characteristics and rules of medical data, limited semantic understanding of medical data, low generation efficiency, weak generalization, inflexible constraints and low adaptability. These techniques make large model training in the medical field more effective, overcome the problems of the prior art, and improve medical text processing and analysis.
Drawings
FIG. 1 is a flow chart of a method for generating large model medical training data based on multi-level semantics.
FIG. 2 is a block diagram of a large model medical training data generation system based on multi-level semantics.
FIG. 3 is a diagram of three-level semantic examples of medical terms.
FIG. 4 is a reinforcement learning alignment training flow chart.
Fig. 5 is a flow chart for fine-tuning a loss function.
Detailed Description
Advantages of the invention are further illustrated in the following description, taken in conjunction with the accompanying drawings and detailed description. It is to be understood by persons skilled in the art that the following detailed description is illustrative and not restrictive, and that this invention is not limited to the details given herein.
The method can automatically generate massive structured medical text data, improving processing efficiency and accuracy. By establishing a multi-level semantic generation mechanism, the semantic information of medical text, including content related to basic patient information, clinical diagnosis, treatment procedures, outcomes and the like, can be better understood and generated. At the same time, generating a large amount of training data provides the model with abundant samples to improve its generalization capability and adaptability.
The structured big model training data generation mechanism based on the multi-level semantics has wide application prospect in the field of medical informatization. The method can improve the efficiency and accuracy of medical data processing, promote the accuracy and individuation of clinical decisions, and provide more reliable support for medical scientific research and clinical practice. Meanwhile, the method also provides an important technical foundation for the development of the medical artificial intelligence system.
As shown in fig. 1, the method comprises the following steps:
step 1: generating multi-level semantics;
step 2: SFT fine-tuning training;
step 3: alignment training for one-to-many reinforcement learning;
wherein, step 1 comprises the following steps:
step 1.1: keyword grabbing: first, the steps of entity identification and labeling are performed. This is to identify and label entities with a specific meaning from the medical text. The entities may include diseases, drugs, methods of treatment, and the like. By using Named Entity Recognition (NER) techniques, entities in the medical text can be automatically identified and labeled as corresponding categories. For example, identifying "hypertension" as a disease entity;
step 1.2: ordering and counting the original text description: next, relation extraction and modeling is performed. This is to determine the relationships between entities in the medical text and build them into a relationship model. By analyzing the context information, dependencies, grammar structures, etc., relationships between different entities can be identified and extracted. For example, the "treatment" relationship and related entities are extracted from the sentence "drug A may be used to treat disease B".
Step 1.3: weighting generates one-to-many multi-level semantics: semantic representation and generation are performed. This is to translate the multi-level semantic information of the medical text into a richer and more complex form that simulates real training data. The semantic representation may take the form of graph structures, logical forms, or other forms to represent entities, relationships, semantic roles, and the like. The generation stage can generate corresponding semantic expressions according to the keywords. For example, entities, relationships, and actions in medical text are translated into stronger, more complex, and true semantic expressions.
Wherein, step 2 includes the following steps:
step 2.1: an applicable base model is selected and pre-trained. The basic model can be a general language model such as BERT, GPT and the like, and can also be a pre-training model specially aiming at the medical field. By pre-training on the open source medical text dataset, the underlying model can learn semantics and knowledge of the medical field;
step 2.2: specific medical tasks are explicitly defined and corresponding labeling data is prepared. The annotation data may be a manually annotated medical text sample, which is required to contain the input text and corresponding labels or answers. For example, for a medication advice task, the annotation data may be a medication and its corresponding detailed medication advice.
Step 2.3: and performing SFT fine tuning training process. The process is generally divided into two phases: feature extraction and fine tuning training. In the feature extraction stage, feature extraction is carried out on the annotation data by using a pre-trained basic model, and the extracted features are used for fine tuning training in the next stage. In the fine tuning training phase, the extracted features are input into a classifier, a sequence labeler or other models for specific medical tasks for training. This stage uses the annotation data for supervised training to optimize model parameters through back propagation and gradient updates.
Wherein, step 3 includes the following steps:
step 3.1: defining input and output: a set of medical phrases or sentences as input and a corresponding plurality of reference answers as output. Ensuring that the correspondence between input and output is properly established.
Step 3.2: model construction: next a Seq2Seq model is constructed, consisting of an encoder and a decoder, suitable for handling one-to-many sequence generation tasks. The encoder converts the input medical phrase into a fixed length vector representation and the decoder generates the next word from the vector and the last generated word to thereby step-wise generate a plurality of medical phrases or sentences.
Step 3.2: one-to-many reinforcement learning alignment training, including reverse alignment and forward alignment, wherein:
reverse alignment: the known output sequence and model predictions are used to calculate the loss function and back-propagate updated model parameters. This step is similar to conventional supervised learning training.
Forward alignment: forward predictions are made on the input medical phrase using the reverse alignment trained model. Each position is generated with an output probability distribution, a forward alignment loss function is calculated from the distribution and a known output sequence, and model parameters are updated. This step helps the model learn the correspondence between the inputs and outputs.
Medical record data, each item consisting of an input medical record and a plurality of output labels, are collected, then cleaned and preprocessed.
Feature extraction and representation: features of the medical record are extracted and represented using natural language processing techniques. The text may be converted to a vector representation using a bag-of-words model, word embedding models (e.g., Word2Vec, BERT), and the like.
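A minimal bag-of-words vectorization of the kind mentioned above (Word2Vec or BERT embeddings would replace it in practice); the vocabulary is invented for illustration.

```python
import re
from collections import Counter

def bag_of_words(text, vocab):
    """Count occurrences of each vocabulary word in the text."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    return [counts[w] for w in vocab]

vocab = ["headache", "nausea", "fever"]
print(bag_of_words("Headache with nausea, headache worsening", vocab))
```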
Forward alignment label generation: for each output task, the label sequence is converted into a corresponding prediction result manually or semi-automatically. For example, for the disease diagnosis task, a disease tag sequence may be converted into a disease diagnosis result.
Forward alignment model training: using the data generated by the forward alignment labels, a model is trained to learn the predictive relationships from the input medical record to the plurality of output tasks. Training may be performed using a sequence labeling model (e.g., a conditional random field) or a sequence generation model (e.g., a recurrent neural network).
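The label-to-prediction conversion for the disease diagnosis case can be sketched as collapsing a per-token tag sequence into diagnosis strings; the BIO tagging scheme and the example tokens are assumed here for illustration.

```python
# Toy version of forward-alignment label generation: a per-token
# disease tag sequence (BIO scheme) is collapsed into diagnosis spans.
def labels_to_diagnosis(tokens, tags):
    """Collect tokens tagged as part of a DISEASE span."""
    spans, current = [], []
    for tok, tag in zip(tokens, tags):
        if tag == "B-DISEASE":
            if current:
                spans.append(" ".join(current))
            current = [tok]
        elif tag == "I-DISEASE" and current:
            current.append(tok)
        else:
            if current:
                spans.append(" ".join(current))
            current = []
    if current:
        spans.append(" ".join(current))
    return spans

tokens = ["patient", "diagnosed", "with", "chronic", "migraine", "today"]
tags   = ["O", "O", "O", "B-DISEASE", "I-DISEASE", "O"]
print(labels_to_diagnosis(tokens, tags))
```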
FIG. 2 is a schematic diagram of a large model medical training data generation system based on multi-level semantics, comprising a multi-level semantics module, an SFT fine tuning module, and an alignment training module, wherein:
the multi-level semantic module is used for generating multi-level semantics and comprises the following components:
the text identification component is used for capturing keywords and carrying out entity identification and labeling;
the text ordering and counting component is used for ordering and counting the original text description and extracting and modeling the relation;
the multi-level semantic weighting component is used for weighting to generate one-to-many multi-level semantics and carrying out semantic representation and generation;
the SFT fine tuning module is used for SFT fine-tuning training and comprises the following components:
the pre-training component is used for pre-training the basic model;
the labeling component is used for definitely defining specific medical tasks and preparing corresponding labeling data;
the fine tuning component is used for SFT fine tuning training;
an alignment training module for performing one-to-many reinforcement learning alignment training, comprising the following components:
the input/output assembly is used for defining input and output and ensuring that the corresponding relation between the input and the output is correctly established;
a model construction component for constructing a Seq2Seq model adapted to handle a one-to-many sequence generation task, the model consisting of an encoder for converting an input medical phrase into a fixed length vector and a decoder for generating a next word from the vector and a last generated word and for generating a plurality of medical phrases or sentences step by step;
an alignment component for one-to-many reinforcement learning alignment training.
The present invention will be described in further detail with reference to the accompanying drawings and specific examples.
First, keyword capture is carried out, and entities with specific meanings are identified and labeled. The entities may include diseases, drugs, treatment methods and the like. Using named entity recognition (NER) techniques, entities in the medical text can be automatically identified and labeled with their corresponding categories; for example, "hypertension" is identified as a disease entity. The textual descriptions are then ordered and counted, and the multi-level semantics are constructed. As shown in fig. 3, for example, the next level of the keyword "no abnormality found" may be "no abnormality in chest", "no abnormality in abdomen", and so on, and the level after that is the more complex and detailed "no abnormalities shown in the thoracic bones and chest-wall soft tissues". The generation stage can thus produce corresponding semantic expressions from the keywords; for example, entities, relationships and actions in medical text are converted into richer, more complex and more realistic semantic expressions.
As shown in fig. 4, the reinforcement learning model is denoted the RL model. A first-level semantic keyword is randomly extracted from the structured multi-level semantic dataset, and the RL model generates a first-level expansion statement; the medical keyword is expanded and the corresponding three-level sentences are scored; the parameters of the RL model are updated from the scores using a reinforcement learning algorithm; and this process is repeated many times. The final RL model obtained is the trained medical training data generation model based on structured text.
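The score-and-update loop can be sketched as a bandit-style REINFORCE update over candidate expansions of one keyword; the candidates, expert scores, baseline and learning rate are all invented, and a real RL model (such as the LSTM mentioned below) would generate text rather than choose from a fixed list.

```python
import math
import random

# Invented candidate expansions for one keyword, with expert ratings
# in [0, 1] standing in for the doctor scores described in the text.
CANDIDATES = [
    "headache may be caused by migraine",
    "headache is unrelated to any condition",
]
SCORES = {CANDIDATES[0]: 0.8, CANDIDATES[1]: 0.1}

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

def train(steps=2000, lr=0.5, seed=0):
    rng = random.Random(seed)
    logits = [0.0, 0.0]
    baseline = 0.5  # fixed reward baseline (illustrative choice)
    for _ in range(steps):
        probs = softmax(logits)
        i = rng.choices(range(len(CANDIDATES)), weights=probs)[0]
        reward = SCORES[CANDIDATES[i]]
        # REINFORCE update: shift logits by advantage * score gradient
        for j in range(len(logits)):
            grad = (1.0 if j == i else 0.0) - probs[j]
            logits[j] += lr * (reward - baseline) * grad
    return softmax(logits)

print(train())
```

After training, nearly all probability mass sits on the highly scored expansion, which is the "update parameters from scores" behavior the loop describes.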
For example, 1000 patient record samples are collected, each sample containing a description of the patient's symptoms and the physician's diagnostic results, medication advice, and prognosis. The data are cleaned and preprocessed, and the quality and consistency of the data are ensured.
An RL model based on deep reinforcement learning is designed, for example using a long short-term memory network (LSTM) as its core. This model receives medical records as input and generates three-level expansion statements. For example, consider the following medical record:
the primary symptoms at the time of patient diagnosis are headache, with nausea and blurred vision. The doctor performs an outpatient examination to diagnose migraine, and recommends taking ibuprofen to relieve pain, and simultaneously observes the change of the illness state.
From this record we can do three-level expansion statement generation:
first-order expansion statement: the primary symptoms at the time of patient diagnosis are headache, with nausea and blurred vision.
Second-level expansion statement: the doctor suspects that migraine is likely after examination.
Three-stage unfolding sentence: it is recommended to take ibuprofen to relieve pain while observing the change of the condition.
By generating these three-level expansion statements, the patient's symptoms, the doctor's suspected diagnosis and the treatment advice can be described in more detail. Such expanded statements can help doctors better understand the condition and make accurate judgments and decisions.
Keyword extraction and expansion are then performed. A first-level semantic keyword, such as "headache", is randomly extracted from the structured multi-level semantic dataset. A third-level expansion statement is generated using the RL model, for example "headache may be caused by migraine; taking ibuprofen is recommended to relieve the pain, and prognosis evaluation requires further observation of changes in the condition".
Score generation and updating are then carried out. For each generated three-level expansion statement, a physician expert is asked to score each output sentence. For example, for the disease diagnosis statement "headache may be caused by migraine", the physician expert gives a score of 8/10. These scores are used to update the parameters of the RL model to improve its generation ability.
In model iterative training, the above steps are repeated multiple times, gradually optimizing the performance of the RL model by continuously generating three-level expansion statements, scoring them, and updating the model parameters. After each iteration, the model's outputs are evaluated and compared with the physician expert's scores to ensure the model's accuracy and reliability.
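The score-and-update iteration can be sketched as follows. The real system updates an LSTM-based RL model; this stub only shows how a normalized expert score (0-10) acts as a reward that scales a toy parameter update, and all numbers are illustrative:

```python
# Hedged sketch of the score-driven iteration loop. The "model" here is
# just a list of parameters; the reward flow, not the learning rule, is
# the point of the example.
def update_parameters(params, score, lr=0.01):
    """Scale a toy parameter update by the normalized expert score."""
    reward = score / 10.0                      # e.g. an 8/10 score -> reward 0.8
    return [p + lr * reward for p in params]   # illustrative update step

params = [0.5, -0.2]
for expert_score in [6, 7, 8]:                 # scores improving over iterations
    params = update_parameters(params, expert_score)
```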
Multitask training and prediction are then performed on the trained RL model. For a new medical record, the RL model generates the corresponding three-level expansion statements and makes predictions according to the scores. For example, for a new patient's medical record, the generated three-level expansion statement may be "headache may be caused by migraine; taking ibuprofen is recommended to relieve pain, and prognosis requires further observation of changes in the condition." From these statements, the outcomes of disease diagnosis, medication advice, and prognosis can be predicted.
The known output sequence and the model's predictions are used to calculate a loss function, which is back-propagated to update the model parameters. As shown in FIG. 5, descriptive statements 1 to x are generated one-to-many from a single key phrase, and a loss function is calculated from the results to measure the difference between the model's predictions and the target results. Depending on the requirement, such as "use more accurate medical terminology" or "organize the language more fluently", different loss functions may be chosen, such as cross-entropy loss or mean squared error loss. The model's predictions are compared with the target results using the loss function, and a loss value is calculated; this value represents the degree of difference between the predictions and the targets. The loss value is then propagated back into the model's parameters using a back-propagation algorithm, updating the parameters to reduce the loss. Through multiple iterations of this process, the model gradually adjusts itself to better satisfy the constraints and improve performance. By fine-tuning with one-to-n loss functions, more flexible constraint conditions can be satisfied, and the generated data generalize better.
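The one-to-many cross-entropy loss described above can be illustrated with a small numeric sketch: one key phrase yields n candidate statements, a cross-entropy term is computed for each, and the terms are averaged. The token probabilities below are toy values, not output of the patent's model:

```python
import math

def cross_entropy(token_probs):
    """Mean negative log-likelihood of the target tokens of one statement."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

def one_to_many_loss(candidates):
    """Average per-statement loss over all n generated statements (1-to-n)."""
    return sum(cross_entropy(c) for c in candidates) / len(candidates)

# Probabilities the model assigned to each target token of two statements:
candidates = [[0.9, 0.8, 0.7], [0.6, 0.95, 0.85]]
loss = one_to_many_loss(candidates)  # lower is better; drives backpropagation
```

In a real system each term could also be weighted by the requirement it serves (terminology accuracy, fluency), which is one way to realize the flexible constraints mentioned above.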
It should be noted that the above embodiments are preferred embodiments of the present invention and are not limiting in any way. Any person skilled in the art may use the technical content disclosed above to change or modify them into equivalent effective embodiments without departing from the technical scope of the present invention, and any modification or equivalent change of the above embodiments according to the technical substance of the present invention still falls within the scope of the present invention.

Claims (9)

1. The large model medical training data generation method based on the multi-level semantics is characterized by comprising the following steps of:
step 1, generating multi-level semantics, comprising the following steps:
step 1.1, capturing keywords, and identifying and labeling entities;
step 1.2, sorting and counting the original description, and extracting and modeling the relationship;
step 1.3, weighting to generate one-to-many multi-level semantics, and carrying out semantic representation and generation;
step 2, SFT fine-tuning training, comprising the following steps:
step 2.1, pre-training a basic model;
step 2.2, clearly defining the specific medical task and preparing corresponding annotation data;
step 2.3, performing the SFT fine-tuning training process;
step 3, alignment training of one-to-many reinforcement learning, comprising the following steps:
step 3.1, defining input and output, and ensuring that the corresponding relation between the input and the output is correctly established;
step 3.2, constructing a network model suitable for processing one-to-many sequence generating tasks, wherein the model is composed of an encoder and a decoder, the encoder is used for converting an input medical phrase into a vector with fixed length, the decoder generates a next word according to the vector and a word generated last, and a plurality of medical phrases or sentences are generated step by step;
step 3.3, performing one-to-many reinforcement learning alignment training.
2. The method for generating large model medical training data based on multi-level semantics according to claim 1, wherein step S1.1 automatically identifies entities in medical text and labels them as corresponding categories by using named entity recognition techniques.
3. The multi-level semantic based large model medical training data generation method according to claim 1, wherein step S1.2 identifies and extracts relationships between different entities by analyzing context information, dependencies and grammar structures.
4. The method for generating large model medical training data based on multi-level semantics according to claim 1, wherein step S2.3 is divided into feature extraction and fine tuning training, wherein:
in the feature extraction stage, feature extraction is carried out on the annotation data by using a pre-trained basic model, and the extracted features are used for fine adjustment training in the next stage;
and in the fine tuning training stage, the extracted features are input into a classifier, a sequence marker or other models aiming at specific medical tasks for training.
5. The multi-level semantic based large model medical training data generation method of claim 4, wherein the fine tuning training phase uses annotation data for supervised training to update optimization model parameters via back propagation and gradients.
6. The multi-level semantic based large model medical training data generation method according to claim 1, wherein the network model in step 3.2 is a Seq2Seq model.
7. The multi-level semantic based large model medical training data generation method according to claim 1, wherein the alignment training in step S3.3 comprises a reverse alignment and a forward alignment, wherein:
the back alignment is to calculate a loss function using the known output sequence and the model's prediction results and back propagate updated model parameters;
forward alignment is the use of a model trained in reverse alignment to make forward predictions on the input medical phrase.
8. The multi-level semantic based large model medical training data generation method according to claim 7, wherein the forward alignment has an output probability distribution at each generated position, and the model parameters are updated by calculating a forward alignment loss function based on the distribution and a known output sequence.
9. The large model medical training data generation system based on the multi-level semantics is characterized by comprising a multi-level semantics module, an SFT fine tuning module and an alignment training module, wherein:
the multi-level semantic module is used for generating multi-level semantics and comprises the following components:
the text identification component is used for capturing keywords and carrying out entity identification and labeling;
the text ordering and counting component is used for ordering and counting the original text description and extracting and modeling the relation;
the multi-level semantic weighting component is used for weighting to generate one-to-many multi-level semantics and carrying out semantic representation and generation;
the SFT fine-tuning module is used for SFT fine-tuning training and comprises the following components:
the pre-training component is used for pre-training the basic model;
the labeling component is used for definitely defining specific medical tasks and preparing corresponding labeling data;
the fine tuning component is used for SFT fine tuning training;
an alignment training module for performing one-to-many reinforcement learning alignment training, comprising the following components:
the input/output assembly is used for defining input and output and ensuring that the corresponding relation between the input and the output is correctly established;
a model construction component for constructing a network model suitable for processing a one-to-many sequence generation task, the model consisting of an encoder for converting an input medical phrase into a fixed length vector and a decoder for generating a next word from the vector and a last generated word and gradually generating a plurality of medical phrases or sentences;
an alignment component for one-to-many reinforcement learning alignment training.
CN202311589976.2A 2023-11-27 2023-11-27 Large model medical training data generation method and system based on multistage semantics Pending CN117637175A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311589976.2A CN117637175A (en) 2023-11-27 2023-11-27 Large model medical training data generation method and system based on multistage semantics

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311589976.2A CN117637175A (en) 2023-11-27 2023-11-27 Large model medical training data generation method and system based on multistage semantics

Publications (1)

Publication Number Publication Date
CN117637175A true CN117637175A (en) 2024-03-01

Family

ID=90034986

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311589976.2A Pending CN117637175A (en) 2023-11-27 2023-11-27 Large model medical training data generation method and system based on multistage semantics

Country Status (1)

Country Link
CN (1) CN117637175A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118092908A (en) * 2024-04-17 2024-05-28 浪潮软件股份有限公司 Application program generation method and device based on large language model


Similar Documents

Publication Publication Date Title
CN109599185B (en) Disease data processing method and device, electronic equipment and computer readable medium
CN111222340B (en) Breast electronic medical record entity recognition system based on multi-standard active learning
WO2016192612A1 (en) Method for analysing medical treatment data based on deep learning, and intelligent analyser thereof
Datta et al. Understanding spatial language in radiology: Representation framework, annotation, and spatial relation extraction from chest X-ray reports using deep learning
Alobaidi et al. Automated ontology generation framework powered by linked biomedical ontologies for disease-drug domain
CN109378066A (en) A kind of control method and control device for realizing disease forecasting based on feature vector
CN109785927A (en) Clinical document structuring processing method based on internet integration medical platform
CN113688248B (en) Medical event identification method and system under condition of small sample weak labeling
CN113707339B (en) Method and system for concept alignment and content inter-translation among multi-source heterogeneous databases
CN117316466B (en) Clinical decision method, system and equipment based on knowledge graph and natural language processing technology
CN117577254A (en) Method and system for constructing language model in medical field and structuring text of electronic medical record
Zhou et al. Chemical-induced disease relation extraction with dependency information and prior knowledge
CN117238437A (en) Knowledge graph-based disease diagnosis auxiliary method and system
CN117457192A (en) Intelligent remote diagnosis method and system
WO2019132686A1 (en) Method for generating mathematical models of a patient using artificial intelligence technologies
CN118098585A (en) Medical AI assistant implementation method and system based on data driving and large model
CN117854748A (en) Knowledge graph and generation-based large model inquiry method and system
Melnyk et al. Generative artificial intelligence terminology: a primer for clinicians and medical researchers
Zaghir et al. Real-world patient trajectory prediction from clinical notes using artificial neural networks and UMLS-based extraction of concepts
CN116168828A (en) Disease prediction method and device based on knowledge graph and deep learning and computer equipment
CN117637175A (en) Large model medical training data generation method and system based on multistage semantics
Cui et al. Intelligent recommendation for departments based on medical knowledge graph
Hussain et al. Recommendation statements identification in clinical practice guidelines using heuristic patterns
Zhang et al. Medical Q&A statement NER based on ECA attention mechanism and lexical enhancement
Dao et al. Patient Similarity using Electronic Health Records and Self-supervised Learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination