CN115565198A

CN115565198A - Medical text entity extraction method, system and equipment based on integrated column type convolution

Info

Publication number: CN115565198A
Application number: CN202211311244.2A
Authority: CN
Inventors: 区汝轩; 项颖; 张文慧; 王小林
Original assignee: Guangdong University of Technology
Current assignee: Guangdong University of Technology
Priority date: 2022-10-25
Filing date: 2022-10-25
Publication date: 2023-01-03

Abstract

The invention discloses a medical text entity extraction method based on integrated column type convolution, which relates to the technical field of text information processing, and comprises the following steps: performing word segmentation processing on a pre-acquired medical text to obtain a word segmentation text; determining a word segmentation index corresponding to the word segmentation text according to a pre-established index word bank; taking the word segmentation index as the input of a pre-training model to obtain a high-dimensional text vector; extracting the features of the high-dimensional text vector to obtain a high-dimensional text feature vector; taking the high-dimensional text feature vector as an input of a preset first prediction model to obtain a first prediction result; correcting the first prediction result to obtain a second prediction result; and converting the second prediction result into a named entity based on a preset conversion function, so that the technical problem of poor recognition/extraction effect when the medical text named entity is recognized in the prior art is solved.

Description

Medical text entity extraction method, system and equipment based on integrated column type convolution

Technical Field

The invention relates to the technical field of text information processing, in particular to a medical text entity extraction method based on integrated column type convolution.

Background

The natural language literature in the Chinese medical field, such as medical textbooks, medical encyclopedias, clinical cases, medical journals, admission records, and examination reports, contains a large amount of medical expertise and medical terminology. Combining entity recognition technology with the medical domain to find medical entities such as diseases and symptoms in unstructured medical text can significantly improve the efficiency and quality of clinical research and can serve downstream subtasks.

In the prior art, named entity recognition usually adopts a classification (recognition) model built on a deep learning framework. Network models that use a dilated (atrous) convolution network as the feature extraction layer are widely used, because such a model can capture local text information and can be optimized in parallel.

Disclosure of Invention

The invention provides a medical text entity extraction method based on integrated column type convolution, which is used for solving the technical problem of poor recognition/extraction effect when medical text named entity recognition is carried out in the prior art.

A medical text entity extraction method based on integrated column type convolution comprises the following steps:

S1, performing word segmentation processing on a pre-acquired medical text to obtain a word segmentation text; determining a word segmentation index corresponding to the word segmentation text according to a pre-established index word bank; taking the word segmentation index as the input of a pre-training model to obtain a high-dimensional text vector;

S2, extracting the features of the high-dimensional text vector to obtain a high-dimensional text feature vector;

S3, taking the high-dimensional text feature vector as an input of a preset first prediction model to obtain a first prediction result;

S4, correcting the first prediction result to obtain a second prediction result;

S5, converting the second prediction result into a named entity based on a preset conversion function;

wherein, step S2 specifically includes:

S21, inputting the high-dimensional text vectors into a preset integrated convolution network in parallel to obtain a preset number of first text features; the preset integrated convolution network is composed of a preset number of one-dimensional column-type convolution layers;

and S22, performing feature splicing on the preset number of first text features to obtain a high-dimensional text feature vector.

Preferably, in step S21, the convolution kernels of the preset number of one-dimensional column-type convolution layers have different expansion coefficients.

Preferably, in step S3, the first prediction model is a feedforward neural network output model; the first prediction result is a set of score values of a high-dimensional text feature vector.

Preferably, step S4 specifically includes:

correcting the score value set by adopting a conditional random field model to obtain a second prediction result; and the second prediction result is a label index corresponding to the score value with the maximum numerical value in the corrected score value set.

Preferably, the preset conversion function is a mapping relationship between the tag index and the named entity.

Preferably, the step S21 specifically includes:

performing normalization processing on the high-dimensional text vectors by adopting layer normalization, and inputting the high-dimensional text vectors subjected to the normalization processing in parallel into a preset integrated convolution network to obtain a preset number of first text features; the preset integrated convolution network is composed of a preset number of column-type convolution layers.

Preferably, the step S22 specifically includes:

a residual connection is made between the normalized high-dimensional text vectors and the preset number of first text features to obtain the preset number of processed first text features;

and performing feature splicing on the preset number of processed first text features to obtain a high-dimensional text feature vector.

Preferably, the pre-training model is an ERNIE model.

An integrated columnar convolution-based medical text entity extraction system, comprising:

the preprocessing module is used for performing word segmentation processing on a pre-acquired medical text to obtain a word segmentation text; determining a word segmentation index corresponding to the word segmentation text according to a pre-established index word bank; taking the word segmentation index as the input of a pre-training model to obtain a high-dimensional text vector;

the feature extraction module is used for extracting features of the high-dimensional text vector to obtain a high-dimensional text feature vector;

the first prediction module is used for taking the high-dimensional text feature vector as the input of a preset first prediction model to obtain a first prediction result;

the second prediction module is used for correcting the first prediction result to obtain a second prediction result;

the entity conversion module is used for converting the second prediction result into a named entity based on a preset conversion function;

wherein the feature extraction module comprises:

the first feature extraction module is used for inputting the high-dimensional text vectors into a preset integrated convolution network in parallel to obtain a preset number of first text features; the preset integrated convolutional network is composed of a preset number of column-type convolutional layers;

and the second feature extraction module is used for performing feature splicing on the preset number of first text features to obtain high-dimensional text feature vectors.

An integrated column convolution-based medical text entity extraction device comprises a processor and a memory;

the memory is used for storing a computer program which, when executed by the processor, implements the aforementioned identification method.

According to the technical scheme, the invention has the following advantages: the invention provides a medical text entity extraction method based on integrated column type convolution, which is used for carrying out word segmentation processing on a pre-acquired medical text to obtain a word segmentation text; determining a word segmentation index corresponding to the word segmentation text according to a pre-established index word bank; taking the word segmentation index as the input of a pre-training model to obtain a high-dimensional text vector, then inputting the high-dimensional text vector into a preset integrated convolution network in parallel to obtain a corresponding number of first text features, and performing feature splicing on the corresponding number of first text features to obtain a high-dimensional text feature vector carrying text labels; further, the high-dimensional text feature vector is used as input of a preset first prediction model to obtain a first prediction result, and the first prediction result is corrected to obtain a second prediction result; and finally, converting the second prediction result into a named entity based on a preset conversion function. 
In this method, the integrated convolution network is composed of a preset number of one-dimensional column-type convolution layers. Extracting text features with one-dimensional column-type convolution layers preserves the relative relations between high-dimensional text vectors. At the same time, setting a different expansion coefficient for each convolution layer adjusts the range of the text sequence that each layer attends to, which increases the attention paid to entity features that make up only a small share of the text sequence. This solves the technical problem in the prior art that, during medical text named entity recognition, entity features with a small share are easily ignored and the classification (recognition/extraction) effect is therefore poor.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.

FIG. 1 is a flowchart of a method for extracting a medical text entity according to an embodiment of the present invention;

FIG. 2 is a flowchart of extracting feature vectors of high-dimensional text according to an embodiment of the present invention;

FIG. 3 is a diagram of an initial entity recognition model architecture provided by an embodiment of the present invention;

FIG. 4 is a structural diagram of a medical text entity extraction system according to an embodiment of the present invention.

Detailed Description

The embodiment of the invention provides a medical text entity extraction method based on integrated column-type convolution, which solves the technical problem in the prior art that, when medical text named entity recognition is performed with a dilated convolution network, entity features that occur in small numbers are easily ignored and the recognition effect is poor.

In order to make the objects, features and advantages of the present invention more obvious and understandable, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is obvious that the embodiments described below are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without making any creative effort based on the embodiments in the present invention, belong to the protection scope of the present invention.

An embodiment of the present application provides a method for extracting a medical text entity. Referring to FIG. 1, in embodiment 1 the method includes:

S1, performing word segmentation processing on a pre-acquired medical text to obtain a word segmentation text; determining a word segmentation index corresponding to the word segmentation text according to a pre-established index word bank; taking the word segmentation index as the input of a pre-training model to obtain a high-dimensional text vector; wherein the pre-training model is an ERNIE model.

During pre-training, the ERNIE model uses phrase-level masking and entity-level masking in addition to word-level mask prediction. When the pre-acquired text is processed with the ERNIE model, the output text vector therefore contains the semantic information of word combinations in addition to the context information of the text, which allows entity information to be extracted better.
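The difference between word-level and entity-level masking can be illustrated with a toy sketch. This is not ERNIE's training code; the function name `mask_spans`, the `[MASK]` placeholder, and the hand-picked span positions are assumptions made only for illustration.

```python
def mask_spans(tokens, spans, mask_token="[MASK]"):
    """Replace every token inside the given [start, end) spans with the mask token."""
    masked = list(tokens)
    for start, end in spans:
        for i in range(start, end):
            masked[i] = mask_token
    return masked


tokens = ["lung", "lesion", "is", "severe"]

# Word-level masking hides a single word; the rest of the entity stays visible.
word_level = mask_spans(tokens, [(1, 2)])

# Entity-level masking hides the whole entity "lung lesion", so the model has
# to predict the entity as a unit from the remaining context.
entity_level = mask_spans(tokens, [(0, 2)])

print(word_level)
print(entity_level)
```

Masking whole spans is what pushes the learned vectors to encode word combinations rather than isolated words.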

And S2, extracting the features of the high-dimensional text vector to obtain the high-dimensional text feature vector.

Wherein, step S2 specifically includes:

S21, inputting the high-dimensional text vectors into a preset integrated convolution network in parallel to obtain a preset number of first text features; the preset integrated convolution network is composed of a preset number of column-type convolution layers.

In step S21, before the high-dimensional text vectors are input in parallel into the preset integrated convolution network, they are normalized by layer normalization, and the normalized high-dimensional text vectors are then used as the input of the network. It can be understood that, in the training stage of the classification (recognition) model, normalizing the input of the preset integrated convolution network effectively improves the training speed of the model. The normalized data keeps the same type of distribution as the original data and differs only in mean and variance, so the accuracy of the model is not affected. Parallel input lets every convolution layer operate on the same normalized vectors and produce its own output, and the column-type convolution preserves the relative relations between the high-dimensional text vectors.
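As a minimal sketch of the normalization step, the function below applies layer normalization to a single token vector. It omits the learnable gain and bias that layer normalization usually carries, and the epsilon value is an assumed default.

```python
import math


def layer_norm(vector, eps=1e-5):
    """Normalize one token vector to zero mean and (almost) unit variance over its features."""
    mean = sum(vector) / len(vector)
    var = sum((v - mean) ** 2 for v in vector) / len(vector)
    return [(v - mean) / math.sqrt(var + eps) for v in vector]


normed = layer_norm([1.0, 2.0, 3.0, 4.0])
print(normed)
```

Only the mean and variance change; the ordering and relative spacing of the feature values are kept.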

In step S22, in order to improve the recognition rate of the classification (recognition) model for structurally simple entities, the output of each column-type convolution layer is processed with a residual connection: the normalized high-dimensional text vector is residual-connected to the preset number of first text features to obtain the preset number of processed first text features. This reduces the negative impact on the recognition effect caused by the increase in scale of the classification model.

And further, performing feature splicing on the preset number of the processed first text features to obtain a high-dimensional text feature vector. It can be understood that the preset integrated network is formed by a preset number of column-type convolution layers, and in the step S21, the normalized high-dimensional text vectors are input in parallel to the preset integrated convolution network, that is, the normalized high-dimensional text vectors are input into the preset number of column-type convolution layers, and each convolution layer corresponds to one output. Further, performing feature splicing on the preset number of processed first text features to obtain a high-dimensional text feature vector, namely performing feature splicing on the output corresponding to each convolution layer to obtain a final high-dimensional text feature vector. The above process can be seen in detail in fig. 2.
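The parallel-input, residual-connection and feature-splicing flow described above can be sketched as follows. The helper names `integrate` and `scale_by` are hypothetical, and the branches are simple stand-in functions; in the method each branch would be a column-type convolution layer.

```python
def integrate(normed_seq, branches):
    """Run every branch on the same input, add the residual, then concatenate per token."""
    branch_outputs = []
    for branch in branches:
        features = branch(normed_seq)  # same shape as the input: [tokens][dims]
        with_residual = [
            [f + x for f, x in zip(feat_row, in_row)]
            for feat_row, in_row in zip(features, normed_seq)
        ]
        branch_outputs.append(with_residual)
    # Feature splicing: join the branch outputs of each token along the feature axis.
    return [
        [value for output in branch_outputs for value in output[t]]
        for t in range(len(normed_seq))
    ]


def scale_by(factor):
    """Stand-in branch (hypothetical) that only rescales its input."""
    return lambda seq: [[factor * v for v in row] for row in seq]


seq = [[1.0, 2.0], [3.0, 4.0]]  # 2 tokens, 2 feature dimensions
out = integrate(seq, [scale_by(0.0), scale_by(1.0), scale_by(2.0)])
print(out)  # each token now has 3 x 2 = 6 features
```

With three branches the spliced vector of each token is three times as wide as the input vector, which is why the output is called a high-dimensional text feature vector.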

And S3, taking the high-dimensional text feature vector as an input of a preset first prediction model to obtain a first prediction result.

The output of the preset integrated convolution network is the high-dimensional text feature vector, which is used as the input of a feedforward neural network output model to obtain a score value set for the high-dimensional text feature vector. It can be understood that the feedforward neural network converts the high-dimensional text feature vector into a score value set (essentially a matrix multiplication). The output dimension of the feedforward neural network output model equals the number of text label types. Here, the first prediction model is the feedforward neural network output model, and the first prediction result is the score value set of the high-dimensional text feature vector.

In general, after the score value set of the high-dimensional text feature vectors is obtained, the label represented by the label index with the largest score value in the score value set can be selected as the type of the prediction entity, but the prediction entity obtained at this time only utilizes text context information, and does not consider the correlation between the high-dimensional text feature vectors. Therefore, embodiment 1 further corrects the score value of the acquired high-dimensional text feature vector after step S3, see step S4.
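A minimal sketch of the scoring step, assuming a single linear output layer: the weights, bias, and feature values below are made-up numbers, and `ffn_scores` and `argmax` are illustrative names.

```python
def ffn_scores(feature, weights, bias):
    """Linear output layer: one score per label type (matrix-vector product plus bias)."""
    return [
        sum(w * x for w, x in zip(row, feature)) + b
        for row, b in zip(weights, bias)
    ]


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)


feature = [1.0, 2.0]                             # feature vector of one token
weights = [[0.5, 0.0], [0.0, 1.0], [1.0, -1.0]]  # 3 label types x 2 features (hypothetical)
bias = [0.0, 0.5, 0.0]

scores = ffn_scores(feature, weights, bias)
print(scores, argmax(scores))
```

Taking the argmax token by token is the uncorrected prediction; it looks at each token in isolation, which is the limitation that step S4 addresses.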

And S4, correcting the first prediction result to obtain a second prediction result.

Specifically, the score value set is corrected by adopting a conditional random field model to obtain a second prediction result; and the second prediction result is a label index corresponding to the score value with the maximum numerical value in the corrected score value set.
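The text does not spell out how the conditional random field performs the correction. A common choice for a linear-chain CRF is Viterbi decoding over the per-token scores plus label-to-label transition scores, sketched below. The transition values here are set by hand for illustration, whereas a trained CRF would learn them.

```python
def viterbi(emissions, transitions):
    """Return the label sequence that maximizes emission plus transition scores."""
    n_labels = len(emissions[0])
    best = list(emissions[0])  # best path score ending in each label so far
    backpointers = []
    for scores in emissions[1:]:
        new_best, pointers = [], []
        for cur in range(n_labels):
            prev = max(range(n_labels), key=lambda p: best[p] + transitions[p][cur])
            new_best.append(best[prev] + transitions[prev][cur] + scores[cur])
            pointers.append(prev)
        best = new_best
        backpointers.append(pointers)
    label = max(range(n_labels), key=best.__getitem__)
    path = [label]
    for pointers in reversed(backpointers):
        label = pointers[label]
        path.append(label)
    return path[::-1]


# Labels (assumed for this example): 0 = O, 1 = B, 2 = I
emissions = [[1.0, 0.2, 0.0], [0.1, 0.3, 0.5]]
transitions = [[0.0, 0.0, -10.0],  # O -> I is heavily penalized
               [0.0, 0.0, 0.0],
               [0.0, 0.0, 0.0]]

greedy = [max(range(3), key=row.__getitem__) for row in emissions]
corrected = viterbi(emissions, transitions)
print(greedy, corrected)
```

In this example the per-token argmax yields O followed by I, which is not a valid BIO sequence, while the decoded path replaces it with O followed by B.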

And S5, converting the second prediction result into a named entity based on a preset conversion function.

The preset conversion function is the mapping relation between the label index and the named entity.

In a specific embodiment, the steps S1 to S5 are as follows:

The medical text data is 'cytoreduction is closely related to the degree of lesion in lung'. Before input, the text is first segmented to obtain 'cytoreduction \ depletion \ and \ lung \ internal \ lesion \ degree \ dense \ incisional \ related'.

The segmented text is converted into word segmentation indexes according to the index word bank. In the word bank, each word has an index: [1: cell, 2: minus, 3: I, …, 105: lung, …, 300: correlation, …, 1500: Yangtze River].

The segmented text is converted into a word segmentation index sequence 1, 2, …, 105, …, 300, which is then input into the pre-training model to obtain a high-dimensional text vector.
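The index lookup can be sketched as a dictionary mapping. The word bank below holds only a few of the entries listed above, and the fallback index 0 for unknown tokens is an assumption; the text does not say how out-of-vocabulary tokens are handled.

```python
# Hypothetical excerpt of the index word bank; the real one is built in advance.
INDEX_WORD_BANK = {"cell": 1, "minus": 2, "lung": 105, "correlation": 300}
UNK_INDEX = 0  # assumed fallback for tokens missing from the word bank


def to_indices(tokens, word_bank, unk_index=UNK_INDEX):
    """Map each token to its index in the word bank."""
    return [word_bank.get(token, unk_index) for token in tokens]


tokens = ["cell", "minus", "lung", "correlation", "unseen"]
indices = to_indices(tokens, INDEX_WORD_BANK)
print(indices)  # [1, 2, 105, 300, 0]
```

The resulting integer sequence is what the pre-training model consumes in place of raw text.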

Taking the word segmentation index 1 as an example (1 represents: cell), a high-dimensional vector is obtained through the pre-training model.

After feature extraction, the resulting high-dimensional text feature vector is input into the first prediction model to obtain a score value set, for example a set of length n [0.1, 0.01, 0.2, …, 0.5, …, 0.03] (where n is the number of tag types; assume that the 1st number represents the degree of belonging to the tissue class and the i-th number represents the degree of belonging to the organ class). At this point the element with tag index i has the largest score value, 0.5, which indicates that the entity with word segmentation index 1 most probably belongs to the i-th class, i.e., the organ class;

Further, inputting the score value set into the second prediction model yields a corrected score value set, for example [0.6, 0.02, 0.1, …, 0.01, …, 0.1]. At this point the element with tag index 1 has the largest score value, 0.6, which indicates that the entity with word segmentation index 1 most likely belongs to class 1, i.e., the tissue class.

The corrected score value sets are collected, the label index of the element with the maximum score value is recorded, and the type of the entity is then obtained from the mapping relation between the label index and the named entity.
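The final lookup can be sketched with shortened versions of the score sets from the example. The three-entry mapping is hypothetical, `to_entity_type` is an illustrative name, and label indexes are counted from 1 to match the example.

```python
# Hypothetical mapping between label index (1-based) and entity type.
INDEX_TO_ENTITY = {1: "tissue", 2: "symptom", 3: "organ"}


def to_entity_type(scores, index_to_entity):
    """Pick the label index with the largest score and map it to an entity type."""
    label_index = max(range(len(scores)), key=scores.__getitem__) + 1  # 1-based
    return label_index, index_to_entity[label_index]


first_pass = [0.1, 0.01, 0.5]   # shortened first prediction result
corrected = [0.6, 0.02, 0.1]    # shortened corrected score value set

print(to_entity_type(first_pass, INDEX_TO_ENTITY))
print(to_entity_type(corrected, INDEX_TO_ENTITY))
```

The same token moves from the organ class to the tissue class once the corrected scores are used, mirroring the example above.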

In this embodiment, a pre-acquired medical text is preprocessed to obtain a high-dimensional text vector carrying context information; the high-dimensional text vector is then input in parallel into a preset integrated convolution network to obtain a corresponding number of first text features, and feature splicing is performed on these first text features to obtain a high-dimensional text feature vector carrying text labels. Further, the high-dimensional text feature vector is used as the input of a preset first prediction model to obtain a first prediction result, and the first prediction result is corrected to obtain a second prediction result. Finally, the second prediction result is converted into a named entity based on a preset conversion function. In this method, the integrated convolution network is composed of a preset number of one-dimensional column-type convolution layers; extracting text features with these layers preserves the relative relations among the high-dimensional text vectors and hence the dependencies between words. At the same time, an integrated convolution network formed of convolution layers with different expansion coefficients increases the attention paid to entity features that make up only a small share of the text, retains those features, and thereby improves the classification (recognition) effect of the model.

On the basis of the foregoing embodiment, the present application provides another preferred embodiment 2, further explaining and optimizing the technical solution, which is specifically as follows:

In step S21, the convolution kernels of the preset number of one-dimensional column-type convolution layers have different expansion coefficients.

The one-dimensional column-type convolution performs mutually independent convolution operations on the different feature dimensions, so that the model can focus on specific features in the word vector. Summarizing step S21, the mathematical form of the one-dimensional column-type convolution is as follows:

Col(in_x) = Concat(Conv(in_{x,1}), ..., Conv(in_{x,n}))

wherein x represents the x-th word of the text sequence; y represents the y-th feature dimension of the word vector of the current word; j represents the distance of the other word vectors participating in the operation relative to the current word vector; k refers to the convolution kernel size; w is the weight in the convolution kernel; n represents the number of feature dimensions of the word vector of the current word (i.e., the x-th word).

The classical one-dimensional convolution fuses the information of all dimensions of a word vector from a macroscopic perspective. When the numbers of training entities of different types are unbalanced, the classical one-dimensional convolution tends to be biased toward the entity types with many samples and to ignore those with few. Therefore, in order to extract the implicit features of different entities more accurately when entity counts are unbalanced, embodiment 2 processes the input high-dimensional text vector with a one-dimensional column-type convolution algorithm. The output of the column-type convolution is two-dimensional, and each word vector still retains its high-dimensional information. A simple way to implement column-type convolution is to split the input data into several data blocks along the feature dimension, so that the vector in each data block contains only one feature dimension; a one-dimensional convolution is applied to each data block separately, and finally all output feature vectors are spliced into a high-dimensional feature vector.
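The split, convolve and splice implementation described in this paragraph can be sketched in plain Python. Zero padding at the sequence ends, an odd kernel size centered on the current word, and the names `conv1d` and `column_conv` are assumptions of this sketch.

```python
def conv1d(column, kernel, dilation=1):
    """Same-length 1D convolution of one feature column, zero-padded at both ends."""
    half = len(kernel) // 2
    out = []
    for x in range(len(column)):
        total = 0.0
        for j, w in enumerate(kernel):
            pos = x + (j - half) * dilation  # neighbour at distance (j - half) * dilation
            if 0 <= pos < len(column):
                total += w * column[pos]
        out.append(total)
    return out


def column_conv(seq, kernels, dilation=1):
    """Convolve each feature dimension with its own kernel, then splice the results back."""
    n_dims = len(seq[0])
    # Split the input into one data block (column) per feature dimension.
    columns = [[row[y] for row in seq] for y in range(n_dims)]
    convolved = [conv1d(col, kernels[y], dilation) for y, col in enumerate(columns)]
    # Splice the per-dimension outputs back into one vector per word.
    return [[convolved[y][x] for y in range(n_dims)] for x in range(len(seq))]


seq = [[1, 10], [2, 20], [3, 30], [4, 40]]  # 4 words, 2 feature dimensions
kernels = [[1, 1, 1], [0, 1, 0]]            # one kernel per feature dimension
out = column_conv(seq, kernels)
print(out)
```

Because each dimension has its own kernel, the first column is smoothed over neighbouring words while the second passes through unchanged; no information is mixed across feature dimensions.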

It should be noted that Chinese medical text often contains complex entities, which usually overlap with or nest other entities. For example, the entity "lung lesion" may be classified as a "symptom" type, while the "lung" within it is classified as a "body part" type. When a single convolution layer is used for entity extraction, the text range that the layer attends to is fixed, the weights can only be adjusted within that fixed range, and generalization is poor.

In order to extract different local information from a text sequence, the invention varies the expansion coefficient (dilation rate) from one convolution layer to another. Convolution layers with different expansion coefficients have different receptive fields; integrating such layers into one convolution network adjusts the range of the sequence that each layer attends to, so that rich context information is obtained, entities of different ranges can be recognized, and the accuracy of entity recognition is improved. The influence of the expansion coefficient on the convolution layer is shown by the following formula:

wherein x represents the xth word of the text sequence; y represents the y-th feature dimension of the word vector of the current word; j represents the distance of other word vectors participating in the operation relative to the current word vector; k refers to the convolution kernel size; w is the weight in the convolution kernel; d represents the convolution kernel expansion coefficient.
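The formula referred to above is not reproduced in this text. A standard form of dilated one-dimensional convolution that is consistent with the listed symbols can be written as the following sketch. The centered summation limits (which assume an odd kernel size k), the per-dimension weight index in w_{y,j}, and the subscript d on Conv are assumptions rather than the patent's own notation.

```latex
\mathrm{Conv}_d\left(in_{x,y}\right)
  = \sum_{j=-\lfloor k/2 \rfloor}^{\lfloor k/2 \rfloor} w_{y,j}\; in_{x + d \cdot j,\; y}
```

With d = 1 the sampled positions are adjacent words and the expression reduces to an ordinary one-dimensional convolution over feature dimension y; a larger d spaces the sampled words further apart without adding weights.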

In the medical text named entity recognition method based on integrated column-type convolution, the ERNIE model is used as the preprocessing model. Because it performs phrase-level and entity-level mask prediction, it yields high-dimensional text vectors that contain implicit phrase semantic relations, which improves entity recognition accuracy. Feature extraction with one-dimensional column-type convolution performs mutually independent operations on each feature dimension of an entity, so the feature information of the entity in different dimensions is retained without increasing the number of convolution kernels. Convolution kernels with different expansion coefficients have different receptive fields; by integrating such convolution layers into one convolution network and setting the coefficients reasonably, the attention paid to entity features with a small share is increased, those features are retained, and the classification (recognition) effect of the model is improved.

Referring to fig. 3, in embodiment 3, a training process of the model is as follows:

100. performing word segmentation processing on a pre-acquired medical text to obtain a word segmentation text; determining a word segmentation index corresponding to the word segmentation text according to a pre-established index word bank; taking the word segmentation index as the input of a pre-training model to obtain a high-dimensional text vector; wherein, the pre-training model is an ERNIE model.

It should be noted that the required input format differs with the purpose for which the model is used. In this embodiment, the input of the pre-training model is a labeled medical text sequence during training and testing, and an original (unlabeled) medical text sequence once the trained model is put into use.

Sequence annotation refers to assigning a correct and unique label to each element (word) of a medical text sequence. Training the initial entity recognition model in a supervised manner on sequence annotations allows the resulting recognition model to learn the abstract representation of the text more accurately. Preferably, in this embodiment, the elements of the medical text sequence are labeled in the BIO scheme when the initial entity recognition model is trained. Those skilled in the art should understand that other labeling schemes may be used in practice, provided that the labeling is accurate and each label is unique.
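To make the BIO scheme concrete, the sketch below groups a tagged token sequence into entities. The tag names (`B-symptom`, `I-symptom`, `O`) and the English tokens are illustrative assumptions, and `bio_to_entities` is a hypothetical helper.

```python
def bio_to_entities(tokens, tags):
    """Group tokens tagged B-X / I-X into (entity_text, X) pairs."""
    entities, current, current_type = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new entity begins; close any entity that was still open.
            if current:
                entities.append((" ".join(current), current_type))
            current, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == current_type:
            current.append(token)  # continue the open entity
        else:
            # "O" or an inconsistent I- tag ends the open entity.
            if current:
                entities.append((" ".join(current), current_type))
            current, current_type = [], None
    if current:
        entities.append((" ".join(current), current_type))
    return entities


tokens = ["lung", "lesion", "is", "severe"]
tags = ["B-symptom", "I-symptom", "O", "O"]
print(bio_to_entities(tokens, tags))
```

Each token carries exactly one tag, which is the uniqueness requirement stated above.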

It can be understood that, in the model training process, the high-dimensional text vector obtained through the pre-training model carries the labeling label information. The annotation tag corresponds to each element (word) of the medical text sequence.

200. Extracting the features of the high-dimensional text vectors from step 100 by using a preset integrated convolution network to obtain high-dimensional text feature vectors, wherein the preset integrated convolution network is formed by a preset number of one-dimensional column-type convolution layers, and the obtained high-dimensional text feature vectors carry the annotation labels.

For ease of illustration, in embodiment 3 the high-dimensional text vector is input in parallel into three column-type convolution layers that are based on a pre-normalization architecture and have different convolution kernel expansion coefficients, and the resulting text features are then spliced into a high-dimensional text feature vector. It is understood that the number of convolution layers in FIG. 3 is only illustrative; those skilled in the art can set the number according to actual needs, and this embodiment does not limit it.
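How three different expansion coefficients change what each layer sees can be estimated with the usual span formula for a single dilated convolution layer, (k - 1) * d + 1. The kernel size 3 and the coefficients 1, 2 and 4 below are hypothetical values, since the text does not state which ones are used.

```python
def receptive_field(kernel_size, dilation):
    """Number of sequence positions spanned by one dilated convolution kernel."""
    return (kernel_size - 1) * dilation + 1


KERNEL_SIZE = 3          # assumed
DILATIONS = (1, 2, 4)    # assumed expansion coefficients of the three layers

fields = [receptive_field(KERNEL_SIZE, d) for d in DILATIONS]
print(fields)  # [3, 5, 9]
```

Under these assumed values the three parallel layers cover 3, 5 and 9 words around the current word, so short and long entities can each fall within the span of at least one layer.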

300. Inputting the high-dimensional text feature vector carrying the label into a feedforward neural network output model (FFN); obtaining the score value of each label;

400. Correcting the score values of the labels by adopting a Conditional Random Field (CRF) model to obtain the label whose corrected score value is the largest.

500. And converting the label tag obtained in the step 400 into a named entity based on a preset conversion function.

The preset conversion function is the mapping relation between the label and the named entity.

An embodiment of the present application further provides a system for extracting a medical text entity, please refer to fig. 4, in embodiment 4, the system includes:

the preprocessing module 01 is used for performing word segmentation processing on a pre-acquired medical text to obtain a word segmentation text; determining a word segmentation index corresponding to the word segmentation text according to a pre-established index word bank; taking the word segmentation index as the input of a pre-training model to obtain a high-dimensional text vector;

the feature extraction module 02 is used for performing feature extraction on the high-dimensional text vector to obtain a high-dimensional text feature vector;

the first prediction module 03 is configured to use the high-dimensional text feature vector as an input of a preset first prediction model to obtain a first prediction result;

the second prediction module 04 is used for correcting the first prediction result to obtain a second prediction result;

the entity conversion module 05 is configured to convert the second prediction result into a named entity based on a preset conversion function;

wherein the feature extraction module comprises:

the first feature extraction module 021 is used for inputting the high-dimensional text vectors into a preset integrated convolutional network in parallel to obtain a preset number of first text features; the preset integrated convolution network is composed of a preset number of column-type convolution layers;

and a second feature extraction module 022, configured to perform feature splicing on the preset number of first text features to obtain a high-dimensional text feature vector.

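The embodiment of the application also provides medical text entity extraction equipment based on integrated column-type convolution, which comprises a processor and a memory; the memory is used for storing a computer program which, when executed by the processor, implements the aforementioned method.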

For the implementation of the above system and device, reference may be made to the foregoing method embodiments; the details are not repeated here.

The above-mentioned embodiments are only used for illustrating the technical solutions of the present invention, and not for limiting the same; although the present invention has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims

1. A medical text entity extraction method based on integrated columnar convolution, characterized by comprising the following steps:

S3, taking the high-dimensional text feature vector as input of a preset first prediction model to obtain a first prediction result;

wherein, step S2 specifically includes:

S21, inputting the high-dimensional text vector into a preset integrated convolution network in parallel to obtain a preset number of first text features; the preset integrated convolution network is composed of a preset number of one-dimensional column-type convolution layers;

and S22, performing feature splicing on the preset number of first text features to obtain the high-dimensional text feature vector.

2. The integrated columnar convolution-based medical text entity extraction method according to claim 1, wherein in step S21, the convolution kernels of the preset number of one-dimensional column-type convolution layers have different expansion coefficients (dilation rates).

3. The integrated columnar convolution-based medical text entity extraction method according to claim 1, wherein in step S3, the first prediction model is a feedforward neural network output model; the first prediction result is a set of score values of the high-dimensional text feature vector.

4. The integrated columnar convolution-based medical text entity extraction method according to claim 3, wherein the step S4 specifically comprises:

5. The integrated columnar convolution-based medical text entity extraction method according to claim 4, wherein the preset conversion function is a mapping relation between the label index and the named entity.

6. The integrated columnar convolution-based medical text entity extraction method according to claim 1, wherein the step S21 specifically includes:

normalizing the high-dimensional text vector by layer normalization, and inputting the normalized high-dimensional text vector into a preset integrated convolution network in parallel to obtain a preset number of first text features; the preset integrated convolution network is composed of a preset number of column-type convolution layers.

7. The integrated columnar convolution-based medical text entity extraction method according to claim 6, wherein the step S22 specifically comprises:

8. The integrated columnar convolution-based medical text entity extraction method of claim 1, wherein the pre-training model is an ERNIE model.

9. A medical text entity extraction system based on integrated columnar convolution, comprising:

wherein the feature extraction module comprises:

the first feature extraction module is used for inputting the high-dimensional text vectors into a preset integrated convolution network in parallel to obtain a preset number of first text features; the preset integrated convolution network is composed of a preset number of column-type convolution layers;

10. An integrated columnar convolution-based medical text entity extraction device is characterized by comprising a processor and a memory;

the memory is adapted to store a computer program which, when executed by the processor, implements the medical text entity extraction method of any one of claims 1-8.