CN115690072A - Chest radiography feature extraction and disease classification method based on multi-modal deep learning - Google Patents

Chest radiography feature extraction and disease classification method based on multi-modal deep learning

Info

Publication number
CN115690072A
Authority
CN
China
Prior art keywords
image
text
data
chest
fusion
Prior art date
Legal status
Pending
Application number
CN202211414106.7A
Other languages
Chinese (zh)
Inventor
寸天睿
徐爱迪
韩健
杨段生
沙政
赵治红
Current Assignee
Chuxiong Normal University
Original Assignee
Chuxiong Normal University
Priority date
Filing date
Publication date
Application filed by Chuxiong Normal University filed Critical Chuxiong Normal University
Priority to CN202211414106.7A priority Critical patent/CN115690072A/en
Publication of CN115690072A publication Critical patent/CN115690072A/en
Pending legal-status Critical Current

Landscapes

  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention provides a chest radiography feature extraction and disease classification method based on multi-modal deep learning, which mainly comprises the following steps: S1, data source acquisition; S2, data preprocessing; S3, image-text feature fusion and matching; S4, model construction; S5, model training and optimization. By adopting a self-supervised training method that combines images and text, the network model can be trained and used for inference stably and quickly even when training data are limited or samples are small. The Transformer network structure is optimized, improved and redesigned so that the global features of chest X-ray films can be captured, enabling the method to be applied to chest X-ray analysis scenarios characterized by small lesions, irregularly shaped lesions, and the like.

Description

Chest radiography feature extraction and disease classification method based on multi-modal deep learning
Technical Field
The invention belongs to the technical field of intelligent medical care, and particularly relates to a chest radiography feature extraction and disease classification method based on multi-modal deep learning.
Background
The interpretation of medical images requires extensive medical expertise and is prone to human error. In populous countries such as China, specialists must interpret large numbers of medical images in a short time, a tedious and time-consuming process. If the disease type in an image could be judged automatically and accurately within a short time, large volumes of medical images could be screened rapidly and the workload of clinical staff greatly reduced. In recent years, with the rapid development of deep learning in computer vision, natural language processing, and related fields, computer-aided diagnosis based on artificial intelligence has attracted growing attention in the industry, raising hopes of more efficient and economical healthcare for patients. Among imaging examinations, X-ray is far more widely available in China than CT or MRI; X-ray examination can be performed even in township-level health centers. Automatically and accurately judging disease types from X-ray films therefore has broad application prospects, and this research can strongly promote the development of intelligent medical care in China.
At present, automatic chest X-ray diagnosis based on deep learning mainly relies on supervised models built on convolutional neural networks (CNNs). These are either general-purpose CNN architectures such as AlexNet, ResNet, VGG, DenseNet, Faster R-CNN, Inception V3, GoogLeNet, MobileNet V2, SR, U-Net and their variants, or CNN architectures designed specifically for X-ray films, such as CheXNet and TieNet, trained and classified in a supervised fashion on public datasets including Open-I, ChestX-ray8, CheXpert, PadChest, and MIMIC-CXR. However, the accuracy of such supervised models is difficult to improve further, and their generalization ability is limited, for the following reasons: (1) annotating medical images requires professional expertise, so annotation is difficult and costly, fully annotated data are hard to obtain, and the field lacks a reference dataset comparable in size to ImageNet in the natural-image domain; consequently, supervised models that excel on natural images easily overfit when trained and used on medical images; (2) existing datasets are extremely imbalanced across disease categories, and the lack of confidence intervals is another important factor limiting accuracy; (3) the locality of CNN convolution operations limits the modeling of long-range dependencies; although the receptive field can be enlarged by stacking more convolutional layers or using improved convolutional structures, this greatly increases computational complexity and is ill-suited to the diagnostic speed required in real medical settings.
Disclosure of Invention
In order to solve the above technical problems, the invention provides a chest radiography feature extraction and disease classification method based on multi-modal deep learning. The invention adopts a self-supervised model training method that combines images and text, so that the network model can be trained and used for inference stably and rapidly even when training data are limited or samples are small. In addition, the Transformer network structure is optimized, improved and redesigned so that the global features of chest X-ray films can be captured, enabling application to chest X-ray analysis scenarios characterized by small lesions, irregularly shaped lesions, and the like.
In order to achieve the technical purpose, the invention is realized by the following technical scheme:
the chest radiography feature extraction and disease classification method based on multi-modal deep learning comprises the following steps:
s1: data source acquisition: collecting an open-source chest X-ray film data set and an open-source medical image question and answer data set;
s2: data preprocessing: carrying out data cleaning and format unification on the acquired data, and dividing the data set into image-text pairs and an image-only data set; constructing a training set and a testing set of the project;
s3: fusing and matching image-text characteristics: performing image-text feature matching and fusion by adopting contrast learning in an AutoEncoder mode; performing image-text feature fusion in a cross attention mode by adopting a Transformer-based mode;
s4: constructing a model: constructing by using the image-text characteristics extracted in the S3 and adopting a Pythroch deep learning frame;
s5: model training and optimization: and repeatedly training the deep learning model on the constructed data training set, iteratively optimizing the structure and parameters of the model, and creating a project model which can be used for clinic.
Preferably, the data preprocessing specifically comprises the following steps:
1) Performing data cleaning and format unification on the acquired data: the original chest films come from multiple datasets, appear in various formats such as DICOM, JPG and PNG, and differ greatly in resolution, so all data are uniformly converted into 255x255 grayscale JPEG images, and images with ambiguous pathological diagnoses are removed (a preprocessing sketch is given after this list);
2) Dividing the data into an image-text pair dataset (40% of the total data) and an image-only dataset (40% of the total data);
3) Splitting the training set and test set of the project in an 80%:20% ratio.
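The patent does not publish source code, so the following is only a minimal sketch of step 1), assuming pydicom and Pillow for input; the function name and the min-max normalization scheme are illustrative choices, not taken from the patent.

```python
import numpy as np
import pydicom                  # assumed dependency for reading DICOM input
from PIL import Image

def to_gray_jpg(src_path: str, dst_path: str, size: int = 255) -> None:
    """Convert a chest film in DICOM/JPG/PNG format to a size x size
    grayscale JPEG (min-max normalized for DICOM pixel data)."""
    if src_path.lower().endswith(".dcm"):
        arr = pydicom.dcmread(src_path).pixel_array.astype(np.float32)
        arr = (arr - arr.min()) / max(float(arr.max() - arr.min()), 1e-8) * 255.0
        img = Image.fromarray(arr.astype(np.uint8))
    else:
        img = Image.open(src_path)
    img.convert("L").resize((size, size), Image.BILINEAR).save(dst_path, "JPEG")
```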
preferably, the specific method for image-text feature matching and fusion in the AutoEncoder mode is as follows: contrastive learning is adopted for image-text feature matching and fusion; the chest film is fed into an image encoder based on a ResNet deep convolutional neural network or a Vision Transformer for feature extraction, yielding h_v, which an MLP then maps to the feature v; the text part uses a pre-trained ClinicalBERT to vectorize the medical report and extract character features, yielding h_u, which is likewise non-linearly mapped by an MLP to u; finally, by maximizing the agreement between true image-text representation pairs under a bidirectional loss, fusion-aligned image-text features are obtained; these vectors carry rich clinical semantic information and serve downstream classification tasks;
preferably, for the image encoder, the convolutional neural network uses the ResNet50 architecture and the Transformer uses the original ViT model; for the text encoder, a BERT encoder is used, max-pooling all output vectors of its last layer into a single representation; the text encoder adopts ClinicalBERT weights trained on the MIMIC dataset;
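A minimal sketch of this contrastive alignment, in the spirit of ConVIRT/CLIP-style bidirectional losses, is given below. The projection width, temperature, and the HuggingFace checkpoint name emilyalsentzer/Bio_ClinicalBERT are assumptions; the patent itself specifies only ResNet50/ViT, a max-pooled ClinicalBERT, MLP mappings, and a bidirectional loss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet50
from transformers import AutoModel

class ImageTextAligner(nn.Module):
    """Contrastive image-text alignment: ResNet50 -> h_v -> MLP -> v,
    ClinicalBERT -> max-pooled h_u -> MLP -> u, bidirectional InfoNCE loss."""

    def __init__(self, dim: int = 512, tau: float = 0.07):
        super().__init__()
        backbone = resnet50(weights="IMAGENET1K_V2")
        backbone.fc = nn.Identity()                    # h_v: 2048-d pooled feature
        self.image_encoder = backbone
        self.text_encoder = AutoModel.from_pretrained(
            "emilyalsentzer/Bio_ClinicalBERT")         # ClinicalBERT weights (MIMIC)
        self.img_proj = nn.Sequential(nn.Linear(2048, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.txt_proj = nn.Sequential(nn.Linear(768, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.tau = tau

    def forward(self, images, input_ids, attention_mask):
        # images: (B, 3, H, W); grayscale films replicated to three channels
        h_v = self.image_encoder(images)               # (B, 2048)
        hidden = self.text_encoder(
            input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        # max-pool all last-layer output vectors, ignoring padding positions
        hidden = hidden.masked_fill(attention_mask.unsqueeze(-1) == 0, float("-inf"))
        h_u = hidden.max(dim=1).values                 # (B, 768)
        v = F.normalize(self.img_proj(h_v), dim=-1)
        u = F.normalize(self.txt_proj(h_u), dim=-1)
        logits = v @ u.t() / self.tau                  # pairwise similarities (B, B)
        targets = torch.arange(v.size(0), device=v.device)
        # bidirectional loss: maximize agreement of true image-text pairs both ways
        return (F.cross_entropy(logits, targets)
                + F.cross_entropy(logits.t(), targets)) / 2
```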
preferably, the specific method for image-text feature fusion in the Transformer mode is as follows: a Transformer-based approach performs image-text feature fusion through cross attention, realizing feature fusion with the Transformer self-attention and cross-attention mechanisms; following the Vision Transformer processing scheme, the chest film image is partitioned into 16x16 patches, the patches are linearly mapped into image embeddings, the embeddings are fed into a standard Transformer, and features are extracted through self-attention; the text part obtains high-dimensional word-vector embeddings from a pre-trained ClinicalBERT and text features through self-attention; the text features are then fused and matched with the image features through cross attention, yielding features usable for downstream tasks;
preferably, the Transformer adopts a standard 6-layer self-attention encoder to extract the respective features of the image and the text, and then performs feature fusion and alignment through an improved cross-attention layer, in which the Query is the image feature and the Key and Value are the text features;
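The following sketch illustrates this fusion path. The 6-layer self-attention encoders and the Query-from-image, Key/Value-from-text cross attention follow the description above; the embedding width, head count, and residual-plus-LayerNorm wiring are illustrative assumptions.

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Six self-attention layers per modality, then one cross-attention layer
    with Query = image tokens and Key/Value = text tokens."""

    def __init__(self, dim: int = 768, heads: int = 8, depth: int = 6):
        super().__init__()

        def encoder():
            layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                               batch_first=True)
            return nn.TransformerEncoder(layer, num_layers=depth)

        self.img_encoder = encoder()   # standard 6-layer self-attention encoder
        self.txt_encoder = encoder()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, img_tokens: torch.Tensor, txt_tokens: torch.Tensor):
        # img_tokens: (B, N_patches, dim) linear embeddings of 16x16 patches
        # txt_tokens: (B, N_words, dim) ClinicalBERT word-vector embeddings
        img = self.img_encoder(img_tokens)    # per-modality self-attention
        txt = self.txt_encoder(txt_tokens)
        fused, _ = self.cross_attn(query=img, key=txt, value=txt)
        return self.norm(img + fused)         # fused features for downstream tasks
```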
preferably, in S4 the input images are augmented with the image augmentation methods built into torchvision: random cropping, horizontal flipping, affine transformation, color jitter, and Gaussian smoothing; given the particularity of chest films, color jitter is restricted to brightness and contrast adjustment; for text data, sentences are sampled uniformly from the pathology report rather than sampling individual words, since sentence-level sampling preserves semantic information.
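A sketch of such an augmentation pipeline is shown below; the crop size and the jitter, affine and blur parameters are assumptions, since the patent names only the transform families.

```python
import random
import torchvision.transforms as T

# Image branch: the five torchvision transforms named above; ColorJitter is
# limited to brightness/contrast in keeping with the grayscale chest films.
image_aug = T.Compose([
    T.RandomResizedCrop(224, scale=(0.6, 1.0)),          # random cropping
    T.RandomHorizontalFlip(p=0.5),                       # horizontal flipping
    T.RandomAffine(degrees=10, translate=(0.05, 0.05)),  # affine transformation
    T.ColorJitter(brightness=0.2, contrast=0.2),         # brightness/contrast only
    T.GaussianBlur(kernel_size=3, sigma=(0.1, 2.0)),     # Gaussian smoothing
    T.ToTensor(),
])

def sample_sentence(report: str) -> str:
    """Text branch: draw one sentence uniformly at random from the pathology
    report, since sentence-level sampling preserves semantic information."""
    sentences = [s.strip() for s in report.split(".") if s.strip()]
    return random.choice(sentences) if sentences else report
```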
The beneficial effects of the invention are:
By adopting the self-supervised model training method that combines images and text, the network model can be trained and used for inference stably and quickly even when training data are limited or samples are small; the Transformer network structure is optimized, improved and redesigned so that the global features of chest X-ray films can be captured, enabling application to chest X-ray analysis scenarios characterized by small lesions, irregularly shaped lesions, and the like.
Drawings
FIG. 1 is a schematic diagram of the technical scheme of the invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be described below clearly and completely. The described embodiments are evidently only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art from the given embodiments without creative effort shall fall within the protection scope of the present invention.
Example 1
The chest radiography feature extraction and disease classification method based on multi-modal deep learning comprises the following steps:
s1: data source acquisition: an open source chest X-ray data set was collected as in table 1; and an open-source medical image question-answer dataset, as in table 2;
s2: data preprocessing: carrying out data cleaning and format unification on the acquired data, and dividing the data set into image-text pairs and an image-only data set; constructing a training set and a testing set of the project;
s3: image-text feature fusion and matching: performing image-text feature matching and fusion by adopting comparison learning in an AutoEncoder mode; performing image-text feature fusion in a cross attention mode by adopting a Transformer-based mode;
s4: constructing a model: constructing by using the graphics and text characteristics extracted in the step S3 and adopting a Pythrch deep learning framework;
s5: model training and optimization: and repeatedly training the deep learning model on the constructed data training set, iteratively optimizing the structure and parameters of the model, and creating a project model which can be used for clinic.
Preferably, the data preprocessing specifically comprises the following steps:
1) Performing data cleaning and format unification on the acquired data: the original chest films come from multiple datasets, appear in various formats such as DICOM, JPG and PNG, and differ greatly in resolution, so all data are uniformly converted into 255x255 grayscale JPEG images, and images with ambiguous pathological diagnoses are removed;
2) Dividing the data into an image-text pair dataset (40% of the total data) and an image-only dataset (40% of the total data);
3) Splitting the training set and test set of the project in an 80%:20% ratio.
preferably, the specific method for image-text feature matching and fusion in the AutoEncoder mode is as follows: contrastive learning is adopted for image-text feature matching and fusion; the chest film is fed into an image encoder based on a ResNet deep convolutional neural network or a Vision Transformer for feature extraction, yielding h_v, which an MLP then maps to the feature v; the text part uses a pre-trained ClinicalBERT to vectorize the medical report and extract character features, yielding h_u, which is likewise non-linearly mapped by an MLP to u; finally, by maximizing the agreement between true image-text representation pairs under a bidirectional loss, fusion-aligned image-text features are obtained; these vectors carry rich clinical semantic information and serve downstream classification tasks;
preferably, for the image encoder, the convolutional neural network uses the ResNet50 architecture and the Transformer uses the original ViT model; for the text encoder, a BERT encoder is used, max-pooling all output vectors of its last layer into a single representation; the text encoder adopts ClinicalBERT weights trained on the MIMIC dataset;
preferably, the specific method for image-text feature fusion in the Transformer mode is as follows: a Transformer-based approach performs image-text feature fusion through cross attention, realizing feature fusion with the Transformer self-attention and cross-attention mechanisms; following the Vision Transformer processing scheme, the chest film image is partitioned into 16x16 patches, the patches are linearly mapped into image embeddings, the embeddings are fed into a standard Transformer, and features are extracted through self-attention; the text part obtains high-dimensional word-vector embeddings from a pre-trained ClinicalBERT and text features through self-attention; the text features are then fused and matched with the image features through cross attention, yielding features usable for downstream tasks;
preferably, the Transformer adopts a standard 6-layer self-attention encoder to extract the respective features of the image and the text, and then performs feature fusion and alignment through an improved cross-attention layer, in which the Query is the image feature and the Key and Value are the text features;
preferably, in S4 the input images are augmented with the image augmentation methods built into torchvision: random cropping, horizontal flipping, affine transformation, color jitter, and Gaussian smoothing; given the particularity of chest films, color jitter is restricted to brightness and contrast adjustment; for text data, sentences are sampled uniformly from the pathology report rather than sampling individual words, since sentence-level sampling preserves semantic information.
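Beyond "repeated training and iterative optimization" (S5), the patent gives no concrete training procedure, so the loop below is only a minimal sketch of contrastive pre-training, assuming the ImageTextAligner module sketched earlier; the optimizer, schedule, and hyper-parameters are illustrative.

```python
import torch

def pretrain(model, dataloader, epochs: int = 50, lr: float = 1e-4,
             device: str = "cuda"):
    """Repeatedly train the alignment model on the constructed training set;
    model structure/parameters are then iterated on by re-running with
    revised settings."""
    model = model.to(device)
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=1e-4)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)
    for epoch in range(epochs):
        model.train()
        total = 0.0
        for images, input_ids, attention_mask in dataloader:
            loss = model(images.to(device), input_ids.to(device),
                         attention_mask.to(device))  # bidirectional contrastive loss
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            total += loss.item()
        scheduler.step()
        print(f"epoch {epoch}: mean loss {total / len(dataloader):.4f}")
```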
TABLE 1 Open-source chest X-ray datasets collected

Data set          X-ray films    Reports    Patients
Open-I            8,121          3,996      3,996
NIH ChestX-ray8   108,948        0          32,717
CheXpert          224,316        0          65,240
PadChest          160,868        109,931    67,625
MIMIC-CXR         473,057        206,563    63,478
TABLE 2 Open-source medical VQA datasets collected

Data set     Images    QA pairs
VQA-RAD      315       3,515
RadVisDial   91,060    455,300
SLAKE        642       14K
TABLE 3 Disease classification accuracy
(Table 3 is reproduced only as an embedded image in the original publication; its values are not recoverable from the text.)

Claims (7)

1. The chest radiography feature extraction and disease classification method based on multi-modal deep learning, characterized by comprising the following steps:
s1: data source acquisition: collecting an open-source chest X-ray film data set and an open-source medical image question and answer data set;
s2: data preprocessing: carrying out data cleaning and format unification on the acquired data, and dividing the data set into image-text pairs and an image-only data set; constructing a training set and a testing set of the project;
s3: image-text feature fusion and matching: performing image-text feature matching and fusion by adopting contrast learning in an AutoEncoder mode; performing image-text feature fusion in a cross attention mode by adopting a Transformer-based mode;
s4: model construction: constructing by using the image-text characteristics extracted in the S3 and adopting a Pythroch deep learning frame;
s5: model training and optimization: and repeatedly training the deep learning model on the constructed data training set, iteratively optimizing the structure and parameters of the model, and creating a project model which can be used for clinic.
2. The chest radiography feature extraction and disease classification method based on multi-modal deep learning according to claim 1, wherein the data preprocessing comprises the following specific steps:
1) Performing data cleaning and format unification on the acquired data: the original chest films come from multiple datasets, appear in various formats such as DICOM, JPG and PNG, and differ greatly in resolution, so all data are uniformly converted into 255x255 grayscale JPEG images, and images with ambiguous pathological diagnoses are removed;
2) Dividing the data into an image-text pair dataset and an image-only dataset, wherein the image-text pair dataset accounts for 40% of the total data and the image-only dataset accounts for 40% of the total data;
3) The training set and test set are split in an 80%:20% ratio.
3. The chest radiography feature extraction and disease classification method based on multi-modal deep learning according to claim 1, wherein the specific method for image-text feature matching and fusion in the AutoEncoder mode is as follows: contrastive learning is adopted for image-text feature matching and fusion; the chest film is fed into an image encoder based on a ResNet deep convolutional neural network or a Vision Transformer for feature extraction, yielding h_v, which an MLP then maps to the feature v; the text part uses a pre-trained ClinicalBERT to vectorize the medical report and extract character features, yielding h_u, which is likewise non-linearly mapped by an MLP to u; finally, by maximizing the agreement between true image-text representation pairs under a bidirectional loss, fusion-aligned image-text features are obtained; these vectors carry rich clinical semantic information and serve downstream classification tasks.
4. The method of claim 3, wherein for the image encoder the convolutional neural network uses the ResNet50 architecture and the Transformer uses the original ViT model; for the text encoder, a BERT encoder is used, max-pooling all output vectors of its last layer into a single representation; the text encoder uses ClinicalBERT weights trained on the MIMIC dataset.
5. The chest radiography feature extraction and disease classification method based on multi-modal deep learning according to claim 1, wherein the specific method for image-text feature fusion in the Transformer mode is as follows: a Transformer-based approach performs image-text feature fusion through cross attention, realizing feature fusion with the Transformer self-attention and cross-attention mechanisms; following the Vision Transformer processing scheme, the chest film image is partitioned into 16x16 patches, the patches are linearly mapped into image embeddings, the embeddings are fed into a standard Transformer, and features are extracted through self-attention; the text part obtains high-dimensional word-vector embeddings from a pre-trained ClinicalBERT and text features through self-attention; the text features are then fused and matched with the image features through cross attention, yielding features usable for downstream tasks.
6. The chest radiography feature extraction and disease classification method based on multi-modal deep learning according to claim 5, wherein the Transformer adopts a standard 6-layer self-attention encoder to extract the respective features of the image and the text, and then performs feature fusion and alignment through an improved cross-attention layer, in which the Query is the image feature and the Key and Value are the text features.
7. The chest radiography feature extraction and disease classification method based on multi-modal deep learning according to claim 1, wherein in S4 the input images are augmented with the image augmentation methods built into torchvision: random cropping, horizontal flipping, affine transformation, color jitter, and Gaussian smoothing; given the particularity of chest films, color jitter is restricted to brightness and contrast adjustment; for text data, sentences are sampled uniformly from the pathology report rather than sampling individual words, since sentence-level sampling preserves semantic information.
CN202211414106.7A 2022-11-11 2022-11-11 Chest radiography feature extraction and disease classification method based on multi-mode deep learning Pending CN115690072A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211414106.7A CN115690072A (en) 2022-11-11 2022-11-11 Chest radiography feature extraction and disease classification method based on multi-mode deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211414106.7A CN115690072A (en) 2022-11-11 2022-11-11 Chest radiography feature extraction and disease classification method based on multi-mode deep learning

Publications (1)

Publication Number Publication Date
CN115690072A 2023-02-03

Family

ID=85052277

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211414106.7A Pending CN115690072A (en) 2022-11-11 2022-11-11 Chest radiography feature extraction and disease classification method based on multi-mode deep learning

Country Status (1)

Country Link
CN (1) CN115690072A (en)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116052847A (en) * 2023-02-08 2023-05-02 中国人民解放军陆军军医大学第二附属医院 Chest radiography multi-abnormality recognition system, device and method based on deep learning
CN116052847B (en) * 2023-02-08 2024-01-23 中国人民解放军陆军军医大学第二附属医院 Chest radiography multi-abnormality recognition system, device and method based on deep learning
CN116403180A (en) * 2023-06-02 2023-07-07 上海几何伙伴智能驾驶有限公司 4D millimeter wave radar target detection, tracking and speed measurement method based on deep learning
CN116403180B (en) * 2023-06-02 2023-08-15 上海几何伙伴智能驾驶有限公司 4D millimeter wave radar target detection, tracking and speed measurement method based on deep learning
CN116452600A (en) * 2023-06-15 2023-07-18 上海蜜度信息技术有限公司 Instance segmentation method, system, model training method, medium and electronic equipment
CN116452600B (en) * 2023-06-15 2023-10-03 上海蜜度信息技术有限公司 Instance segmentation method, system, model training method, medium and electronic equipment
CN116502092A (en) * 2023-06-26 2023-07-28 国网智能电网研究院有限公司 Semantic alignment method, device, equipment and storage medium for multi-source heterogeneous data
CN117522877A (en) * 2024-01-08 2024-02-06 吉林大学 Method for constructing chest multi-disease diagnosis model based on visual self-attention
CN117522877B (en) * 2024-01-08 2024-04-05 吉林大学 Method for constructing chest multi-disease diagnosis model based on visual self-attention

Similar Documents

Publication Publication Date Title
CN115690072A (en) Chest radiography feature extraction and disease classification method based on multi-mode deep learning
CN110503654A Medical image segmentation method, system and electronic device based on generative adversarial network
WO2016192612A1 (en) Method for analysing medical treatment data based on deep learning, and intelligent analyser thereof
CN109583440A Medical image aided diagnosis method and system combining image recognition and report editing
CN111863237A (en) Intelligent auxiliary diagnosis system for mobile terminal diseases based on deep learning
CN110490242B (en) Training method of image classification network, fundus image classification method and related equipment
CN106372390A (en) Deep convolutional neural network-based lung cancer preventing self-service health cloud service system
CN107767935A (en) Medical image specification processing system and method based on artificial intelligence
CN109935336A Intelligent auxiliary diagnosis method and diagnostic system for pediatric respiratory diseases
CN110503635B (en) Hand bone X-ray film bone age assessment method based on heterogeneous data fusion network
CN109920538A Zero-shot learning method based on data augmentation
CN111430025B (en) Disease diagnosis model training method based on medical image data augmentation
Gao et al. Joint disc and cup segmentation based on recurrent fully convolutional network
Ye et al. Medical image diagnosis of prostate tumor based on PSP-Net+ VGG16 deep learning network
CN110443105A Immunofluorescence image pattern recognition method for autoimmune antibodies
Feng et al. Deep learning for chest radiology: a review
CN116364227A (en) Automatic medical image report generation method based on memory learning
Wang et al. Cataract detection based on ocular B-ultrasound images by collaborative monitoring deep learning
CN116797609A (en) Global-local feature association fusion lung CT image segmentation method
CN116883768A (en) Lung nodule intelligent grading method and system based on multi-modal feature fusion
CN115147640A (en) Brain tumor image classification method based on improved capsule network
CN114093507A (en) Skin disease intelligent classification method based on contrast learning in edge computing network
Dong et al. Supervised learning-based retinal vascular segmentation by m-unet full convolutional neural network
CN110136113A (en) A kind of vagina pathology image classification method based on convolutional neural networks
CN115862837A (en) Medical visual question-answering method based on type reasoning and semantic constraint

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination