CN116341546A

CN116341546A - Medical natural language processing method based on pre-training model

Info

Publication number: CN116341546A
Application number: CN202310123797.3A
Authority: CN
Inventors: 张志强; 唐山荣
Original assignee: Daoyi Technology Medical Health Hainan Co ltd
Current assignee: Daoyi Technology Medical Health Hainan Co ltd
Priority date: 2023-02-15
Filing date: 2023-02-15
Publication date: 2023-06-27

Abstract

The application belongs to the technical field of computers and provides a medical natural language processing method, a medical natural language processing device and a computer-readable storage medium. The method comprises the following steps: acquiring a medical text whose semantic information is to be mined; preprocessing the text with text data preprocessing techniques such as word segmentation and vocabulary construction; loading the weights of a pre-trained model and fine-tuning them for a specified classification or retrieval task; adjusting the model hyperparameters according to the fine-tuning results; and performing semantic mining and extraction on the input text with the fine-tuned model in order to carry out downstream tasks such as classification and recommendation. The scheme uses the rich knowledge in the pre-trained model to fine-tune parameters for downstream tasks such as medical text information extraction, medical term normalization, medical text classification and medical knowledge question answering, thereby improving the accuracy of downstream medical natural language processing tasks such as classification and retrieval-based recommendation.

Description

Medical natural language processing method based on pre-training model

Technical Field

The invention relates to medical natural language processing and deep learning technology, and in particular to a medical natural language processing method based on a pre-trained model. The method can be used in intelligent medical scenarios covering four general medical natural language processing tasks: medical text information extraction (entity recognition and relation extraction), medical term normalization, medical text classification and medical question answering.

Background

With recent advances in biomedical language understanding, artificial intelligence is gradually changing medical practice. As biomedical language understanding benchmarks have developed, artificial intelligence applications have become widely used in the medical field. Biomedical natural language processing has enabled applications such as biomedical text mining, which make use of the text data in electronic health records. For example, biomedical natural language processing methods can use the text and information in electronic medical records to provide specialized medical advice for high-risk groups. In addition, natural language processing technology is widely applied in medical speech recognition, clinical documentation, clinical trial matching, computer-aided coding and similar areas.

The medical field has a large number of natural language documents, such as medical textbooks, medical encyclopedias, clinical pathways and examination reports, which contain a large amount of expertise and abundant medical information. Named entity recognition in the medical field refers to extracting important medical entities, such as diseases and symptoms, from medical texts, and this step is also the basis of other tasks such as medical relation extraction. However, the limited size of the currently shared medical corpora greatly hinders progress on the various medical text processing tasks. How to distinguish different medical entity categories, how to define the boundaries and overlaps between different entities, and how to classify the intent of different medical sentences all pose great challenges to researchers.

Natural language processing (NLP) is one of the research hot spots in the field of artificial intelligence, and enabling a computer to understand human language is a central goal of NLP technology. With increasing research and development effort, NLP technology has made breakthrough progress and can be found in numerous subfields such as intelligent question answering, machine translation and spam filtering. NLP technology generally depends on an NLP model. BERT (Bidirectional Encoder Representations from Transformers), developed by a Google research team, has been one of the most widely used NLP models in recent years and performs well. How to make reasonable use of the BERT model to understand medical natural language datasets with limited resources is a challenging problem.

Medical natural language processing technology based on a pre-trained model can be applied to the four general medical natural language processing tasks, namely medical text information extraction (entity recognition and relation extraction), medical term normalization, medical text classification and medical question answering, and achieves good results on them.

Disclosure of Invention

The invention aims to solve the technical problem of effectively modeling and representing medical natural language texts so as to finish knowledge mining tasks such as medical text information extraction, medical term normalization, medical text classification, medical question answering and the like with high quality. The invention provides a medical natural language processing method based on a pre-training model, which can solve the problem of few medical text training samples and efficiently realize medical natural language information extraction and knowledge mining.

The technical scheme of the invention is as follows. First, a data acquisition module, a data preprocessing module, a pre-training word segmentation module, a pre-training model architecture module, a downstream task model fine-tuning module, a hyperparameter setting module, a downstream task performance evaluation module and a log recording module are initialized. The data acquisition module exchanges medical natural language data with the environment, for example medical texts from which entities are to be extracted. The data preprocessing module performs feature preprocessing on the medical text in order to obtain a better hidden vector representation. The pre-training word segmentation module interacts with the data acquisition module to segment the input medical natural language text and build a vocabulary for subsequent vectorization. The pre-training model architecture module defines the network architecture of the pre-trained model, so that parameters can be read from a pre-trained weight file and fine-tuned on a designated downstream task. The downstream task model fine-tuning module interacts with the pre-training model architecture module, initializes the downstream task network parameters with the trained weights of the pre-trained model, and receives the data produced by the pre-training word segmentation module in order to fine-tune the weights of the downstream task model for tasks such as medical text information extraction and medical term normalization. The hyperparameter setting module sets hyperparameters, such as the learning rate and the loss function, for the downstream task model fine-tuning module. The downstream task performance evaluation module interacts with the downstream task model fine-tuning module and evaluates the performance of the corresponding downstream task. The log recording module records, among other things, how the loss function and the accuracy change over the training epochs during fine-tuning.

The invention comprises the following steps:

First, a pre-trained model and a downstream task fine-tuning environment are built and initialized. The environment runs the Ubuntu 18.04 operating system and the PyTorch deep learning framework, and consists of a data acquisition module, a data preprocessing module, a pre-training word segmentation module, a pre-training model architecture module, a downstream task model fine-tuning module, a hyperparameter setting module, a downstream task performance evaluation module and a log recording module. The data acquisition module is connected to the data samples through a database; the pre-training word segmentation module loads the jieba library for segmenting Chinese text; the data preprocessing module loads the word frequency statistics, stop word and vocabulary construction algorithms; and the hyperparameter setting module receives the user-defined hyperparameter configuration entered by the user.

Second, the data acquisition module acquires medical natural language samples for training from the database and sends them to the pre-training word segmentation module.

Third, the pre-training word segmentation module receives the medical natural language samples from the database and segments them with the jieba library. For example, if the currently received medical natural language text is "My stomach is uncomfortable today", segmentation yields "I/today/stomach/uncomfortable". The pre-training word segmentation module then passes the segmented data to the data preprocessing module.

Fourth, the data preprocessing module receives the segmented data from the pre-training word segmentation module, counts the frequency of every word in the data, and builds a vocabulary from the N most frequent words. For each word in the segmented list, the module looks up the word's position in the vocabulary and encodes that position as 1 and all other positions as 0. For example, the word list is [ "me", "yes", "today", "belly", "heart", "uncomfortable" ], the vector of "me" in the medical natural language text is denoted as [1,0,0,0,0,0,0,0], and the vector of "comfort" is denoted as [0,0,0,0,0,0,1,0]. In this way a word vector is obtained for each word in the medical natural language text. In this embodiment the medical natural language text is processed with one-hot encoding, which expands the feature space to some extent and suits pre-trained models with many parameters, such as BERT. After all the data sent by the pre-training word segmentation module has been preprocessed, the data preprocessing module sends the preprocessed data to the downstream task model fine-tuning module.
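The vocabulary construction and one-hot encoding described in the fourth step can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names `build_vocab` and `one_hot`, the sample sentences and the vocabulary size are all assumptions introduced for the example, and stop-word removal is omitted for brevity.

```python
from collections import Counter


def build_vocab(segmented_docs, max_size):
    """Keep the max_size most frequent tokens, most frequent first."""
    counts = Counter(tok for doc in segmented_docs for tok in doc)
    return [tok for tok, _ in counts.most_common(max_size)]


def one_hot(token, vocab):
    """Return a vector with 1 at the token's vocabulary index, 0 elsewhere.

    Tokens outside the vocabulary map to an all-zero vector (an assumption;
    the source does not say how out-of-vocabulary words are handled).
    """
    vec = [0] * len(vocab)
    if token in vocab:
        vec[vocab.index(token)] = 1
    return vec


# Hypothetical, already segmented samples (English glosses of Chinese words).
docs = [
    ["I", "today", "stomach", "uncomfortable"],
    ["I", "today", "headache"],
    ["I", "stomach", "pain"],
]

vocab = build_vocab(docs, max_size=4)
vectors = [one_hot(tok, vocab) for tok in docs[0]]
```

Each vector has the same length as the vocabulary, so the representation grows with N; that is the feature expansion the step refers to.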

Fifth, after receiving the preprocessed data from the data preprocessing module, the downstream task model fine-tuning module interacts with the pre-training model architecture module to obtain the network framework and initialization weights of the pre-trained model, selects the fine-tuning model that corresponds to the downstream task, and sets the corresponding loss function. Specifically, for the medical entity recognition task, the fine-tuning model adds a word-level classification head on top of the pre-trained model architecture. For the medical relation extraction task, the fine-tuning model adds a word-level classification layer on top of the pre-trained model architecture to classify subjects and objects, and the relation of each subject-object pair is determined in the same way. The medical clinical term normalization task is solved in two stages, candidate recall followed by ranking: the fine-tuning model first recalls a preset number of candidate standardized terms on top of the pre-trained model architecture, and then adds a classification layer on top of the pre-trained model architecture for similarity prediction. For the clinical trial screening criteria classification task and the user query intent recognition task, the fine-tuning model adds a sequence classification layer on top of the pre-trained model architecture. After selecting a specific fine-tuning model, the downstream task model fine-tuning module sends a request for hyperparameters and a loss function to the hyperparameter setting module.

Sixth, after receiving the hyperparameter and loss function request from the downstream task model fine-tuning module, the hyperparameter setting module sends the loss function and hyperparameters for the task in question to the downstream task model fine-tuning module. Specifically, a cross-entropy loss function is used for classification tasks, and a Siamese (twin) network loss function is used for ranking tasks. Once the loss function is determined, the hyperparameter setting module sends the hyperparameter settings for the downstream task, such as the number of fine-tuning epochs, the initial learning rate and the optimizer type, to the downstream task model fine-tuning module together with the loss function type.
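The task-dependent choices made in the fifth and sixth steps can be summarized in a small lookup table. The sketch below is illustrative only: the task keys, the `FineTuneSpec` class and the `request_settings` function are hypothetical names, and the head and loss labels are paraphrases of the text rather than identifiers from any library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FineTuneSpec:
    head: str  # layer added on top of the pre-trained encoder
    loss: str  # loss function used during fine-tuning


# Mapping paraphrased from the fifth and sixth steps (key names are assumed).
TASK_SPECS = {
    "entity_recognition": FineTuneSpec("word_classification", "cross_entropy"),
    "relation_extraction": FineTuneSpec("word_classification", "cross_entropy"),
    "term_normalization": FineTuneSpec("recall_then_rank", "siamese"),
    "trial_criteria_classification": FineTuneSpec(
        "sequence_classification", "cross_entropy"
    ),
    "query_intent_recognition": FineTuneSpec(
        "sequence_classification", "cross_entropy"
    ),
}


def request_settings(task, epochs, learning_rate, optimizer):
    """Return the head, loss and user-supplied hyperparameters for a task."""
    spec = TASK_SPECS[task]
    return {
        "head": spec.head,
        "loss": spec.loss,
        "epochs": epochs,
        "learning_rate": learning_rate,
        "optimizer": optimizer,
    }


settings = request_settings(
    "entity_recognition", epochs=10, learning_rate=0.01, optimizer="Adam"
)
```

Keeping the mapping in one table mirrors the division of labor in the text: the fine-tuning module chooses the head, and the hyperparameter setting module answers with the loss and the training settings.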

Seventh, after receiving the loss function type and hyperparameter settings from the hyperparameter setting module, the downstream task model fine-tuning module starts the fine-tuning stage of the downstream task. Specifically, the downstream task model fine-tuning module obtains program code that computes the attention vectors in a single parallel pass. The code instructs the system to fetch the query matrix, the key matrix and the value matrix from memory and load them into the GPU, so that the GPU obtains the first attention feature of the medical natural language text by parallel processing based on the query matrix, the key matrix and the value matrix. It should be appreciated that the GPU calls the Transformer encoder to calculate the self-attention feature based on the query matrix Q, key matrix K and value matrix V previously derived in the BERT model, typically using the following formulas:

e = Score(Q, K);

a = softmax(e);

Attention Values = aV;

where Score(Q, K) denotes the attention score (in BERT, the scaled dot product of Q and the transpose of K, divided by the square root of d), d denotes the dimension of the key vectors, softmax normalizes the attention scores over all words, and Attention Values denotes the calculated attention feature. Loading the query matrix, the key matrix and the value matrix into the graphics processor so that it obtains the first attention feature of the natural language text by parallel processing comprises the following steps: loading the query matrix, the key matrix and the value matrix into the graphics processor, so that the graphics processor calculates the attention weights in parallel based on these matrices; and multiplying the attention weights by the value matrix to obtain the first attention feature of the natural language text. The query matrix, the key matrix and the value matrix are the inputs of each attention head in the multi-head attention mechanism. The first attention features output by the individual heads are spliced (concatenated) to obtain the attention splicing feature, and the second attention feature of the natural language text is obtained from the attention splicing feature. To obtain the second attention feature, either a linear mapping is applied directly to the attention splicing feature, or the spliced junctions of the attention splicing feature are first smoothed and the linear mapping is then applied to the smoothed attention splicing feature. The second attention feature is the final output feature of the self-attention layer in the Transformer encoder and is usually used as the input of the feed-forward neural network; the second attention feature and the first attention feature are both matrices. Assuming that the multi-head attention mechanism employs 8 attention heads, the attention results computed by the heads are the first attention feature, the second attention feature, ..., and the eighth attention feature, respectively. Splicing all of these features yields the attention splicing feature. The linear mapping can be performed by multiplying by a preset additional weight matrix, which is obtained by joint training within the model. Finally, the attention splicing feature is fed into the network layer newly added for the downstream task, and the weights are fine-tuned through the configured loss function. During fine-tuning, the downstream task model fine-tuning module sends the loss function values, accuracy information and similar data to the log recording module, which records the state information of the run.
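The attention computation of the seventh step can be sketched in plain Python as follows. This is a minimal single-process illustration under stated assumptions: Score(Q, K) is taken to be the scaled dot product used in BERT, the helper names (`matmul`, `attention`, `multi_head`) and the toy matrices are invented for the example, and the optional smoothing of the spliced feature is not modeled. A real system would run the same operations as batched tensor operations on the GPU.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v):
    """e = Score(Q, K); a = softmax(e); Attention Values = aV."""
    d = len(k[0])  # dimension of the key vectors
    scores = [[s / math.sqrt(d) for s in row] for row in matmul(q, transpose(k))]
    weights = [softmax(row) for row in scores]
    return matmul(weights, v), weights


def multi_head(heads, w_o):
    """Splice the per-head outputs, then apply the linear mapping w_o."""
    outputs = [attention(q, k, v)[0] for q, k, v in heads]
    n_tokens = len(outputs[0])
    spliced = [sum((out[i] for out in outputs), []) for i in range(n_tokens)]
    return matmul(spliced, w_o)


# Toy example: 2 tokens, 2 heads of width 2, identity output mapping.
eye2 = [[1.0, 0.0], [0.0, 1.0]]
heads = [
    (eye2, eye2, [[1.0, 2.0], [3.0, 4.0]]),
    (eye2, eye2, [[5.0, 6.0], [7.0, 8.0]]),
]
w_o = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
second = multi_head(heads, w_o)  # one row of width 4 per token
```

In the terminology of the text, each call to `attention` yields one head's attention feature, `spliced` is the attention splicing feature, and the product with `w_o` is the second attention feature passed on to the feed-forward network.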

Eighth, the user selects from the log recording module the point in training at which the model accuracy is highest, and the model at that point is used as the final model for prediction in knowledge mining tasks such as medical text information extraction, medical term normalization, medical text classification and medical question answering.
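Selecting the final model from the log, as described in the eighth step, amounts to picking the record with the highest accuracy. The sketch below assumes a hypothetical log format (one dictionary per epoch with `epoch`, `loss` and `accuracy` keys) and made-up values; the source does not specify how the log is stored.

```python
def best_checkpoint(log_records):
    """Return the log record with the highest accuracy.

    On ties, the earliest such record is returned.
    """
    return max(log_records, key=lambda record: record["accuracy"])


# Hypothetical log entries written by the log recording module.
log = [
    {"epoch": 1, "loss": 0.92, "accuracy": 0.71},
    {"epoch": 2, "loss": 0.55, "accuracy": 0.83},
    {"epoch": 3, "loss": 0.41, "accuracy": 0.81},
]

best = best_checkpoint(log)
```

Here the second epoch would be kept even though the third has a lower loss, because the selection criterion in the text is accuracy rather than loss.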

The invention has the following beneficial effects:

1. The invention fine-tunes a pre-trained model for different medical natural language processing tasks, and can deliver good robustness and generalization in downstream medical natural language processing tasks with few training samples;

2. The invention is applicable to any pre-trained model in the current natural language processing field and to any downstream medical text knowledge mining task, and therefore has strong transferability and a wide range of application;

3. The invention maintains inference speed at run time and provides a good user experience.

Drawings

In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below.

FIG. 1 is a diagram of a first step of the pre-training model-based medical natural language processing software deployment of the present invention.

Fig. 2 is a flow chart of business logic of the present invention.

FIG. 3 is an example of attention head stitching for a pre-trained model query.

FIG. 4 is a schematic diagram of a Transformer encoder.

Fig. 5 is a diagram of a module decomposition structure according to the present invention.

Detailed Description

In order that those skilled in the art may better understand the solution of the present application, the embodiments of the present application are described in detail below with reference to the accompanying drawings. The described embodiments are evidently only some, not all, of the embodiments of the present application.

First, a pre-trained model and a downstream task fine-tuning environment are built and initialized. The environment runs the Ubuntu 18.04 operating system, the PyTorch 1.7.0 deep learning framework and the transformers 4.5.1 pre-trained model library. The environment further comprises a data acquisition module, a data preprocessing module, a pre-training word segmentation module, a pre-training model architecture module, a downstream task model fine-tuning module, a hyperparameter setting module, a downstream task performance evaluation module and a log recording module. The data acquisition module is connected to the data samples through a database; the pre-training word segmentation module loads the jieba Chinese word segmentation library and segments Chinese text so that it can be fed to the model to obtain a vectorized representation; the data preprocessing module loads the word frequency statistics, stop word and vocabulary construction algorithms; and the hyperparameter setting module receives the user-defined hyperparameter configuration entered by the user.

Second, the data acquisition module acquires from the database a medical natural language sample for testing, namely "Patients with urinary retention are prone to secondary urinary system infection", together with its medical entity recognition task label, and sends the sample and the label to the pre-training word segmentation module.

Third, the pre-training word segmentation module receives the medical natural language sample from the database and segments it with the jieba library. Specifically, the segmentation result is "urinary retention/patient/prone to/secondary/urinary system/infection". The pre-training word segmentation module then passes the segmented data to the data preprocessing module.

Fourth, the data preprocessing module receives the segmented data from the pre-training word segmentation module, counts the frequency of every word in the data, and builds a vocabulary from the N most frequent words. For each word in the segmented list, the module looks up the word's position in the vocabulary and encodes that position as 1 and all other positions as 0. In this way a word vector is obtained for each word in the medical natural language text. In this embodiment the medical natural language text is processed with one-hot encoding, which expands the feature space to some extent and suits pre-trained models with many parameters, such as BERT. After all the data sent by the pre-training word segmentation module has been preprocessed, the data preprocessing module sends the preprocessed data to the downstream task model fine-tuning module.

Fifth, after receiving the preprocessed data from the data preprocessing module, the downstream task model fine-tuning module interacts with the pre-training model architecture module to obtain the network framework and initialization weights of the pre-trained model, selects the fine-tuning model for the medical entity recognition downstream task, and sets the corresponding loss function. Specifically, since this example is a medical entity recognition task, the fine-tuning model adds a word-level classification head on top of the pre-trained model architecture. After selecting the specific fine-tuning model, the downstream task model fine-tuning module sends a request for hyperparameters and a loss function to the hyperparameter setting module.

Sixth, after receiving the hyperparameter and loss function request from the downstream task model fine-tuning module, the hyperparameter setting module sends the loss function and hyperparameters for the task in question to the downstream task model fine-tuning module. Specifically, since this example is a classification task, a cross-entropy loss function is used. Once the loss function is determined, the hyperparameter setting module sends the hyperparameter settings for the downstream task, namely 10 fine-tuning epochs, an initial learning rate of 0.01 and the Adam optimizer, to the downstream task model fine-tuning module together with the loss function type (cross-entropy loss).
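The cross-entropy loss chosen in the sixth step can be illustrated with a small worked computation. The sketch below is a plain-Python illustration, not the embodiment's training code: the logits and labels are made-up values, and the function names and the `hyperparameters` dictionary layout are assumptions; only the three setting values (10 epochs, learning rate 0.01, Adam) come from the text.

```python
import math

# Settings quoted from the sixth step; the dictionary layout is assumed.
hyperparameters = {"epochs": 10, "learning_rate": 0.01, "optimizer": "Adam"}


def cross_entropy(logits, target):
    """Negative log-probability of the target class under softmax(logits)."""
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum_exp - logits[target]


def mean_token_loss(token_logits, labels):
    """Average the per-word cross-entropy over a sentence."""
    losses = [cross_entropy(lg, y) for lg, y in zip(token_logits, labels)]
    return sum(losses) / len(losses)


# Hypothetical scores for two words over three entity classes.
token_logits = [[2.0, 0.5, 0.1], [0.2, 0.1, 3.0]]
labels = [0, 2]
loss = mean_token_loss(token_logits, labels)
```

For entity recognition the loss is computed per word because the classification head emits one label distribution for each position in the sentence.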

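Seventh, after receiving the loss function type and hyperparameter settings from the hyperparameter setting module, the downstream task model fine-tuning module starts the fine-tuning stage of the downstream task. The self-attention feature is calculated from the query matrix Q, key matrix K and value matrix V, typically using the following formulas:

e = Score(Q, K);

a = softmax(e);

Attention Values = aV;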

where Score(Q, K) denotes the attention score (in BERT, the scaled dot product of Q and the transpose of K, divided by the square root of d), d denotes the dimension of the key vectors, softmax normalizes the attention scores over all words, a denotes the weight of the relationship between Q and K, i.e. how much weight Q should receive when the current K is modeled, and Attention Values denotes the calculated attention feature. Loading the query matrix, the key matrix and the value matrix into the graphics processor so that it obtains the first attention feature of the natural language text by parallel processing comprises the following steps: loading the query matrix, the key matrix and the value matrix into the graphics processor, so that the graphics processor calculates the attention weights in parallel based on these matrices; and multiplying the attention weights by the value matrix to obtain the first attention feature of the natural language text. The query matrix, the key matrix and the value matrix are the inputs of each attention head in the multi-head attention mechanism. The first attention features output by the individual heads are spliced (concatenated) to obtain the attention splicing feature, and the second attention feature of the natural language text is obtained from the attention splicing feature. To obtain the second attention feature, either a linear mapping is applied directly to the attention splicing feature, or the spliced junctions of the attention splicing feature are first smoothed and the linear mapping is then applied to the smoothed attention splicing feature. The second attention feature is the final output feature of the self-attention layer in the Transformer encoder and is usually used as the input of the feed-forward neural network; the second attention feature and the first attention feature are both matrices. Assuming that the multi-head attention mechanism employs 8 attention heads, the attention results computed by the heads are the first attention feature, the second attention feature, ..., and the eighth attention feature, respectively. Splicing all of these features yields the attention splicing feature. The linear mapping can be performed by multiplying by a preset additional weight matrix, which is obtained by joint training within the model. Finally, the attention splicing feature is fed into the network layer newly added for the downstream task, and the weights are fine-tuned through the configured loss function. During fine-tuning, the downstream task model fine-tuning module sends the loss function values, accuracy information and similar data to the log recording module, which records the state information of the run.

Eighth, the user selects from the log recording module the point in training at which the model accuracy is highest, and the model at that point is used as the final model for prediction in knowledge mining tasks such as medical text information extraction, medical term normalization, medical text classification and medical question answering. The entity recognition result obtained is {"entity": "urinary retention", "start_idx": 0, "end_idx": 2, "entity_type": "dis"}, {"entity": "urinary system infection", "start_idx": 7, "end_idx": 11, "entity_type": "dis"}. That is, the input medical natural language text contains 2 entities: the first entity is "urinary retention", located at positions 0-2 of the sentence, and its type is disease; the second entity is "urinary system infection", located at positions 7-11 of the sentence, and its type is also disease.
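The entity records in the result above can be produced from per-character labels. The sketch below makes several assumptions that are not stated in the source: the word classification head is taken to emit BIO tags (B- for the first character of an entity, I- for the following characters, O for everything else), the function name `decode_entities` is hypothetical, and the Chinese sentence is a reconstruction of the example that is consistent with the reported offsets.

```python
def decode_entities(chars, tags):
    """Convert per-character BIO tags into entity records.

    start_idx and end_idx are inclusive character offsets.
    """
    entities = []
    start, entity_type = None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a final entity
        if start is not None and not tag.startswith("I-"):
            entities.append(
                {
                    "entity": "".join(chars[start:i]),
                    "start_idx": start,
                    "end_idx": i - 1,
                    "entity_type": entity_type,
                }
            )
            start = None
        if tag.startswith("B-"):
            start, entity_type = i, tag[2:]
    return entities


# Assumed original sentence and hypothetical model output for it.
chars = list("尿潴留者易继发泌尿系感染")
tags = ["B-dis", "I-dis", "I-dis", "O", "O", "O", "O",
        "B-dis", "I-dis", "I-dis", "I-dis", "I-dis"]

entities = decode_entities(chars, tags)
```

Under these assumptions the offsets 0-2 and 7-11 count characters of the Chinese sentence, with both ends inclusive.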

Claims

1. A method of medical natural language processing, comprising:

acquiring a medical natural language text of a semantic to be mined, and acquiring a medical natural language processing downstream task to be executed, wherein the medical natural language processing downstream task comprises medical text information extraction, medical term normalization, medical text classification and medical question-answering;

determining the architecture of the downstream task fine tuning model according to the medical natural language processing downstream task category;

fine-tuning the parameters of the downstream task model architecture, using the network architecture and parameters of the pre-trained model as the initialization weights of the downstream task;

performing preprocessing operations such as word segmentation, word list establishment and the like on the acquired medical natural language text to obtain an initialization vector representation corresponding to the medical natural language text;

and obtaining semantic extraction vector representation by using a trained fine tuning network for the initialized vector representation of the medical natural language text, and finally applying the semantic extraction vector representation to a designated downstream task.

2. The medical natural language processing method of claim 1, wherein the text is a natural language containing a medical entity;

the method for preprocessing the medical text comprises the following steps:

segmenting the medical text into words, removing stop words, building a vocabulary according to word frequency, and encoding the text with one-hot encoding to obtain an initial vectorized representation.

3. The method of claim 2, wherein the network architecture required by the downstream task is specified by appending it to the output layer of the pre-trained model, for example by applying a cross-entropy loss function to a fully connected layer to meet the requirement of classifying the segmented words in the vocabulary.

4. The medical natural language processing method of claim 1, wherein existing weights of the pre-trained model are utilized as an initialization for fine-tuning downstream task weights to ensure model convergence to an optimal solution.

5. The medical natural language processing method according to claim 1, wherein the parameters of the whole network are tuned according to the loss function value during training, and the model with the highest classification accuracy on the validation set is selected as the final released model.

6. The method for processing the medical natural language according to claim 5, wherein the medical natural language text of the semantic to be mined is input into a fine-tuned model, the fine-tuned model obtains probability distribution of classification labels corresponding to the current text, and the label with the highest probability is taken as a final classification result.

7. A computer readable storage medium storing a computer program, characterized in that the computer program when executed by a processor implements the method according to any one of claims 1 to 6.