CN115294151A - Lung CT region-of-interest automatic detection method based on multitask convolution model - Google Patents
- Publication number
- CN115294151A CN115294151A CN202210773167.6A CN202210773167A CN115294151A CN 115294151 A CN115294151 A CN 115294151A CN 202210773167 A CN202210773167 A CN 202210773167A CN 115294151 A CN115294151 A CN 115294151A
- Authority
- CN
- China
- Prior art keywords
- roi
- dimensional
- decoder
- automatic detection
- layer
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/10—Segmentation; Edge detection
- G06T7/11—Region-based segmentation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/90—Determination of colour characteristics
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/25—Determination of region of interest [ROI] or a volume of interest [VOI]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/26—Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
- G06V10/267—Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/7715—Feature extraction, e.g. by transforming the feature space, e.g. multi-dimensional scaling [MDS]; Mappings, e.g. subspace methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/70—Labelling scene content, e.g. deriving syntactic or semantic representations
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/10—Image acquisition modality
- G06T2207/10072—Tomographic images
- G06T2207/10081—Computed x-ray tomography [CT]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20081—Training; Learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20084—Artificial neural networks [ANN]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20112—Image segmentation details
- G06T2207/20116—Active contour; Active surface; Snakes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/30—Subject of image; Context of image processing
- G06T2207/30004—Biomedical image processing
- G06T2207/30061—Lung
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Multimedia (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Computation (AREA)
- General Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Computing Systems (AREA)
- Health & Medical Sciences (AREA)
- Software Systems (AREA)
- Databases & Information Systems (AREA)
- Computational Linguistics (AREA)
- Medical Informatics (AREA)
- Biomedical Technology (AREA)
- Life Sciences & Earth Sciences (AREA)
- Biophysics (AREA)
- Data Mining & Analysis (AREA)
- Molecular Biology (AREA)
- General Engineering & Computer Science (AREA)
- Mathematical Physics (AREA)
- Apparatus For Radiation Diagnosis (AREA)
- Image Processing (AREA)
Abstract
A method for automatic detection of lung CT regions of interest based on a multitask convolution model obtains an image in tensor form by parsing a CT scan, then applies rough lung segmentation, resampling and pixel-value normalization to obtain a three-dimensional CT voxel matrix containing only the lung region. After slicing the three-dimensional image, the resulting two-dimensional cross-sectional images are input into a two-dimensional convolution model to obtain several rough ROI contours. Three-dimensional image blocks are then cropped centered on each rough ROI contour, and a three-dimensional classifier assigns each ROI a probability of being a real ROI; after screening, the three-dimensional image blocks containing real ROIs are input into a three-dimensional convolution model obtained by multi-task training to obtain precise ROI contours, which are rendered in different colors according to each ROI's probability of being real. The invention can rapidly process large batches of CT images and improves ROI detection efficiency.
Description
Technical Field
The invention relates to a technique in the field of image processing, in particular to a method for automatic detection of regions of interest in lung CT (Computed Tomography) images based on a multitask convolution model.
Background
With the development of medical technology, more and more lung CT images need to be processed in a timely manner, and detecting the region of interest (ROI) is the first step of lung CT image processing. In digital image processing, an ROI is a specific region of an image that a technician is interested in. The traditional method of manually inspecting CT images layer by layer to judge whether an ROI exists places a heavy burden on doctors, who must cross-reference multiple images and compare them repeatedly to locate an ROI; this is time-consuming and labor-intensive. Automatically locating ROIs and delineating their contours with a convolutional network reduces CT image processing time while maintaining accuracy, makes batch processing of CT images feasible, frees doctors from redundant work, and can improve the overall operating efficiency of a hospital.
At present, algorithms for automatic ROI detection in lung CT images lack generality: an algorithm can typically only process CT images of one slice thickness, and generalizes poorly to CT images of different slice thicknesses or acquired by different scanners. In addition, most algorithms train the convolutional network using only the ROI mask or anchor box as the semantic label, so the model attends only to visual features based on voxel values in the CT image and ignores clinical biological features, which is not conducive to accurately delineating the ROI contour.
Disclosure of Invention
To address the defects of the prior art, the invention provides a method for automatic detection of lung CT regions of interest based on a multitask convolution model. Different models are trained for different slice thicknesses, which effectively handles the fact that CT images of different slice thicknesses carry different amounts of information. In addition, the ROI mask and the benign/malignant semantic label of the ROI are used simultaneously when training the three-dimensional semantic segmentation model; combining the imaging and biological characteristics of the ROI in this way allows the ROI contour to be delineated accurately. The ROI contour is obtained through a convolution model based on an encoder-decoder structure, so large batches of CT images can be processed quickly and ROI detection efficiency is improved.
The invention is realized by the following technical scheme:
The invention relates to a method for automatic detection of lung CT regions of interest based on a multitask convolution model: an image in tensor form is obtained by parsing a CT scan, and rough lung segmentation, resampling and pixel-value normalization are applied to obtain a three-dimensional CT voxel matrix containing only the lung region; then, after slicing the three-dimensional image, the cross-sectional slices are input into a two-dimensional semantic segmentation network based on an encoder-decoder structure to obtain several rough ROI contours; and three-dimensional image blocks are cropped centered on the rough ROI contours, a three-dimensional classifier assigns each ROI a probability of being a real ROI, and after screening, the three-dimensional image blocks containing real ROIs are input into a three-dimensional convolution model obtained by multi-task training to obtain precise ROI contours, which are rendered in different colors according to each ROI's probability of being real.
The three-dimensional image is obtained as follows: for an original CT scan, i.e. a CT scan of DICOM (Digital Imaging and Communications in Medicine) type, the two-dimensional slice matrices in the DICOM files are assembled into a three-dimensional image according to the position coordinates of the slices, and the three-dimensional image is then preprocessed, specifically by: finding a rough anchor box of the lung region using digital image processing techniques and cropping away the non-lung part of the image according to the anchor box; resampling the pixel spacing of the image along the (x, y, z) axes to (0.7 mm, 0.7 mm, 1.5 mm) respectively; and normalizing the pixel values according to the lung window [-1024, 400] and the mediastinal window [-160, 240] to obtain two channels corresponding to the lung window and the mediastinal window respectively.
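The resampling and dual-window normalization described above can be sketched as follows. This is a minimal illustration rather than the patent's implementation; the function name, the use of `scipy.ndimage.zoom`, and the linear interpolation order are assumptions.

```python
import numpy as np
from scipy.ndimage import zoom

def preprocess_ct(volume, spacing, target_spacing=(1.5, 0.7, 0.7)):
    """Resample a CT volume (z, y, x) to target_spacing (mm) and window-normalize.

    Returns a 2-channel array: channel 0 = lung window, channel 1 = mediastinal window.
    """
    # Resample so the voxel spacing along (z, y, x) matches target_spacing
    factors = [s / t for s, t in zip(spacing, target_spacing)]
    resampled = zoom(volume.astype(np.float32), factors, order=1)

    def window(img, lo, hi):
        # Subtract the window lower bound, divide by the window width, clip to [0, 1]
        return np.clip((img - lo) / (hi - lo), 0.0, 1.0)

    lung = window(resampled, -1024.0, 400.0)        # lung window [-1024, 400]
    mediastinal = window(resampled, -160.0, 240.0)  # mediastinal window [-160, 240]
    return np.stack([lung, mediastinal], axis=0)
```

The two windows become two input channels, matching the description's lung-window and mediastinal-window normalization.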
The two-dimensional semantic segmentation network is a neural network based on U2-Net: the network as a whole is an encoder-decoder structure, and each unit of the encoder and decoder is itself an encoding-decoding structure. The encoder gradually enlarges the receptive field of the convolutional layers through several down-sampling steps, extracting high-dimensional semantic features from the feature map and obtaining the rough position of the nodule in the current input; the decoder gradually up-samples the feature map back to the input resolution, extracting low-dimensional visual features and obtaining the nodule mask for the input image.
The outputs {e_i | i ∈ [1, N]} and {d_j | j ∈ [1, N]} of the encoder and decoder layers have different resolutions; that is, through multi-layer down-sampling and up-sampling, the encoder-decoder structure lets the model extract ROI features at different resolutions, which effectively handles the inconsistent ROI sizes across cases. N is the number of encoder and decoder layers.
Short-cut (skip) connections link corresponding encoder and decoder layers; the input of the j-th decoder layer is Concat(e_j, Up(d_{j-1})), where Concat splices feature maps along the channel dimension and Up up-samples a feature map to twice its original resolution by interpolation. Each decoder layer thus simultaneously processes the features containing high-dimensional semantic information output by the previous decoder layer and the features containing low-dimensional visual information output by the encoder at the same level, which improves the precision of ROI contour delineation while preserving ROI position accuracy.
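The skip connection Concat(e_j, Up(d_{j-1})) can be illustrated with a small numpy sketch (names hypothetical; the real network would use learned transposed convolutions or interpolation inside a deep-learning framework, and nearest-neighbour upsampling is an assumption here):

```python
import numpy as np

def up2x(x):
    """Nearest-neighbour upsample of a (C, H, W) feature map to twice the resolution."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def decoder_input(e_j, d_prev):
    """Input of the j-th decoder layer: Concat(e_j, Up(d_{j-1})) along the channel axis."""
    return np.concatenate([e_j, up2x(d_prev)], axis=0)
```

The concatenated tensor carries both the encoder's low-level visual features and the upsampled high-level semantic features, which is the point of the short-cut connection.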
The probability that an ROI is a real ROI is obtained as follows: three-dimensional image blocks of size 96 × 96 × 48 are cropped centered on the rough ROI contours output by the two-dimensional semantic segmentation model; each block is then input into a three-dimensional classifier based on a residual network to obtain the probability that the ROI is real.
The three-dimensional classifier is a three-dimensional variety of the residual network, whose residual structure is y = F(x, {W_i}) + x, where the subscript i refers to the i-th layer of the network, x and y are the input and output of the current layer, {W_i} are the parameters of the i-th layer, and F is the residual function, e.g. F = W_2·σ(W_1·x) with σ a nonlinearity such as ReLU. Residual connections partially alleviate the degradation problem that arises when training deeper networks. The bottleneck layer maps the feature map to a high-dimensional space for processing and then compresses it back to a low-dimensional space, retaining the ability of high-dimensional features to express semantic information while reducing the number of trainable parameters and hence the difficulty of training.
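The identity mapping y = F(x, {W_i}) + x can be demonstrated with a toy fully-connected residual unit (all names hypothetical; the patent's classifier is a three-dimensional convolutional residual network, not this toy):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, W1, W2):
    """Toy residual unit: y = F(x, {W_i}) + x with F = W2 @ relu(W1 @ x).

    When F outputs zero, the block reduces to the identity, which is what
    makes deep residual networks easier to optimize.
    """
    return W2 @ relu(W1 @ x) + x
```

With zero weights the block passes its input through unchanged, illustrating why residual connections mitigate training degradation in deep networks.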
The screening is as follows: an ROI whose probability is below a certain threshold is judged to be a false positive, and its contour is not written to the final mask.
The precise ROI contour is obtained by inputting a three-dimensional image block of size 128 × 128 × 64, centered on each screened ROI, into a three-dimensional semantic segmentation network based on an encoder-decoder structure; the contour of each ROI is then output to an RT STRUCT file, which renders each contour in a different color depending on the probability output by the classifier.
The three-dimensional semantic segmentation network is a three-dimensional variant of the two-dimensional semantic segmentation network, obtained by replacing each two-dimensional operation with the corresponding three-dimensional operation, i.e. the two-dimensional convolution, pooling and normalization layers become three-dimensional convolution, pooling and normalization layers respectively.
The multi-task training is as follows: when training the three-dimensional semantic segmentation network, an auxiliary task of judging the benignity or malignancy of the ROI is added alongside the main task of outputting the ROI mask. Specifically, in addition to the encoder that extracts features from the input CT image and the decoder that outputs the ROI mask, a classifier that judges whether the current ROI is benign or malignant is added to the network. The classifier takes the output of each encoder layer as input and, through appropriate down-sampling and convolution operations, produces a probability value that the current ROI is benign. During training, the parameters of the encoder are updated simultaneously by back-propagated gradients from both the decoder and the classifier.
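The shared-encoder/two-head arrangement can be sketched as follows. This is a schematic with hypothetical names, not the patent's network: the heads are passed in as callables, and global average pooling of each encoder level stands in for the "appropriate down-sampling and convolution operations".

```python
import numpy as np

def multitask_forward(encoder_feats, decode, classify):
    """Shared encoder feeds two heads.

    encoder_feats: list of (C, H, W) feature maps, one per encoder level.
    decode: mask head applied to the deepest features.
    classify: benign-probability head applied to pooled per-level features.
    During training, both heads' losses back-propagate into the encoder.
    """
    mask = decode(encoder_feats[-1])
    # Pool every encoder level and concatenate, so the classifier sees all depths
    pooled = np.concatenate([f.mean(axis=(1, 2)) for f in encoder_feats])
    prob = classify(pooled)
    return mask, prob
```

Because both losses flow into the same encoder parameters, the encoder is pushed to represent both the visual features needed for the mask and the biological features needed for the benign/malignant judgment.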
the Loss function of the three-dimensional classifier and the classification part in the multi-task training is Focal local, and specifically comprises the following steps:wherein: y is the sum of the average power of the power supply,respectively indicating a real label and a predicted value of the ROI benign and malignant, wherein the hyper-parameter alpha is used for controlling the weight of the positive and negative samples, and when alpha is more than 0.5, the positive samples have greater contribution to the model parameters; the weight of the difficult and easy samples is controlled by the loss function through the hyper-parameter gamma, when gamma is larger than 1, the weight of the difficult samples is larger when gradient back propagation is carried out, namely, the model focuses more on the difficult samples.
The loss function of the two-dimensional and three-dimensional semantic segmentation networks is Focal Loss + Dice Loss: Loss = FocalLoss(p, q) + DiceLoss(p, q), with DiceLoss = 1 − 2·Σ_i p_i·q_i / (Σ_i p_i + Σ_i q_i), where p and q are the true and predicted ROI masks and the subscript i denotes the i-th pixel value in the mask. Focal Loss is a pixel-wise loss, computed and summed over every pixel; a loss in this form has well-behaved gradients, making stable model training easy. The Dice Loss depends only on the foreground region of the mask and is consistent with the final evaluation metric, and it also mitigates the class-imbalance problem; however, when the foreground region is too small, the model optimization process becomes unstable. Combining the two loss functions yields a better Dice coefficient for the ROI mask output by the model while keeping training stable.
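The Dice term can be sketched as follows (a minimal numpy illustration; the epsilon guard against an empty denominator is an assumption):

```python
import numpy as np

def dice_loss(p, q, eps=1e-7):
    """Dice loss: 1 - 2*sum(p*q) / (sum(p) + sum(q)).

    Depends only on the foreground overlap between the true mask p and the
    predicted mask q, so background pixels do not dominate the gradient.
    """
    inter = np.sum(p * q)
    return float(1.0 - 2.0 * inter / (np.sum(p) + np.sum(q) + eps))
```

Perfect overlap gives a loss near 0 and disjoint masks give a loss near 1, matching the text's observation that the Dice term tracks the final evaluation index.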
The training data sets are two in-hospital data sets. The first contains 989 samples with a CT slice thickness of 5 mm; the second contains 172 samples with a CT slice thickness below 1.5 mm. Two sets of models are obtained by training on the two data sets respectively. In practical use, when the slice thickness of the input CT image is greater than 3 mm, the models trained on the first data set are used; otherwise, the models trained on the second data set are used.
Technical effects
The method uses both the ROI mask and the benign/malignant label as ground truth when training the three-dimensional semantic segmentation model, so the model learns the visual features of the CT image through the ROI mask and the biological features through the benign/malignant label of the ROI. In addition, the three-step pipeline of rough contour delineation, false-positive suppression and precise contour delineation, with a dedicated model for each step, balances the efficiency of locating ROIs in three-dimensional CT images against the accuracy of contour delineation, while reducing false positives in ROI localization. Finally, the invention trains different models for CT images of different slice thicknesses, improving the accuracy of ROI contour delineation across slice thicknesses.
Drawings
FIG. 1 is a flow chart of an embodiment;
FIG. 2 is a schematic view of an ROI in a CT scan two-dimensional slice;
FIG. 3 is a schematic diagram of a process for roughly segmenting a lung region;
FIG. 4 is a three-dimensional semantic segmentation network architecture used in the present invention;
FIG. 5 is a schematic diagram of a two-dimensional comparison of an original image (left), ROI real mask (middle) and an ROI mask (right) output based on the present invention;
FIG. 6 is a schematic diagram showing a three-dimensional comparison of the real mask of ROI (left) with the ROI mask outputted based on the present invention (right).
Detailed Description
As shown in fig. 1, this embodiment relates to an automatic lung CT ROI detection process based on a multi-task convolution model. Given the lung CT scan in DICOM format shown in fig. 2, a semantic segmentation model with a two-dimensional encoder-decoder structure scans the whole lung region and finds rough ROI contours; a classifier with a three-dimensional residual structure judges each ROI and assigns it a probability of being a real ROI; finally, each ROI whose probability exceeds a certain threshold is precisely delineated by a semantic segmentation model with a three-dimensional encoder-decoder structure.
The detection method comprises the following specific steps:
the first step, the CT image in DICOM format is analyzed, and a plurality of two-dimensional gray level images are combined into a three-dimensional matrix according to coordinates.
In the second step, the lung region is found using digital image processing techniques, specifically: threshold-binarize the gray-scale image; find the connected components and keep the two largest, which are the lung regions; and crop out the lung region to remove the extrapulmonary parts, as shown in fig. 3.
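The threshold-and-largest-components step can be sketched with scipy (a simplified 2D illustration; the HU threshold value and function name are assumptions, and the patent's actual procedure may differ in detail):

```python
import numpy as np
from scipy import ndimage

def rough_lung_mask(slice_hu, threshold=-320):
    """Threshold-binarize a HU slice and keep the two largest connected
    components, assumed to be the two lungs."""
    binary = slice_hu < threshold          # air-filled lung voxels are very negative
    labels, n = ndimage.label(binary)      # label connected components
    if n == 0:
        return np.zeros_like(binary)
    # Size of each component, then the labels of the two largest
    sizes = ndimage.sum(binary, labels, index=range(1, n + 1))
    keep = np.argsort(sizes)[-2:] + 1
    return np.isin(labels, keep)
```

A bounding box of the returned mask then gives the rough anchor box used to crop away the extrapulmonary region.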
In the third step, the pixel spacing of the image is resampled to (0.7 mm, 0.7 mm, 1.5 mm) along the (x, y, z) axes respectively, and the pixel values of the three-dimensional image are normalized with the lung window [-1024, 400] and the mediastinal window [-160, 240] respectively; concretely, the window lower bound is subtracted from the pixel value, which is then divided by the window width.
In the fourth step, the three-dimensional image is sliced and resampled into several images of size 7 × 512 × 256, where the channel count 7 denotes seven consecutive two-dimensional slices.
In the fifth step, the sliced images are input into the two-dimensional semantic segmentation network with an encoder-decoder structure to obtain rough ROI contours.
In the sixth step, a three-dimensional image block of size 96 × 96 × 48 is cropped centered on each ROI obtained in the fifth step and input into the three-dimensional classifier with a residual structure, which assigns each ROI a probability P of being a real ROI.
In the seventh step, a three-dimensional image block of size 128 × 128 × 64 is cropped centered on each ROI from the sixth step whose probability exceeds a certain threshold, and input into the three-dimensional semantic segmentation network with an encoder-decoder structure, which outputs the precise contour of each ROI. The network structure is shown in fig. 4.
In the eighth step, the precise contours from the seventh step are written into an RT STRUCT output file in DICOM format, and colors are assigned according to each ROI's probability P from the sixth step: contours with 0.7 < P ≤ 0.8 are green, contours with 0.8 < P ≤ 0.9 are yellow, and contours with P > 0.9 are red. Two-dimensional and three-dimensional comparisons of the model-output ROI mask and the physician-annotated ROI mask are shown in figs. 5 and 6 respectively.
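The color assignment rule in this step is a simple threshold ladder, which can be written as (RGB tuples are an illustrative encoding; the patent only specifies the color names):

```python
def contour_color(p):
    """Map classifier probability P to a display color per the eighth step:
    green for 0.7 < P <= 0.8, yellow for 0.8 < P <= 0.9, red for P > 0.9.
    ROIs at or below 0.7 were screened out earlier, so they get no color."""
    if p > 0.9:
        return (255, 0, 0)    # red
    if p > 0.8:
        return (255, 255, 0)  # yellow
    if p > 0.7:
        return (0, 255, 0)    # green
    return None               # below threshold: not rendered
```

Checking in descending order makes each band exclusive, so a probability of exactly 0.8 falls into the green band as specified.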
Compared with the prior art, the method achieves an average Dice of 0.7026 on a test set of 119 thick-slice CT images (slice thickness > 3 mm) and 0.7046 on a test set of 21 thin-slice CT images (slice thickness ≤ 3 mm); on a single NVIDIA GeForce RTX 3090 Ti GPU, the average processing time is under one minute per thick-slice CT image and under two minutes per thin-slice CT image.
The foregoing embodiments may be modified in many different ways by those skilled in the art without departing from the spirit and scope of the invention, which is defined by the appended claims rather than limited by the preceding embodiments; all implementations within the scope of the claims are bound by the present invention.
Claims (9)
1. A lung CT region-of-interest automatic detection method based on a multitask convolution model, characterized in that an image in tensor form is obtained by parsing a CT scan, and rough lung segmentation, resampling and pixel-value normalization are applied to obtain a three-dimensional CT voxel matrix containing only the lung region; then, after slicing the three-dimensional image, the cross-sectional slices are input into a two-dimensional semantic segmentation network based on an encoder-decoder structure to obtain several rough ROI contours; and three-dimensional image blocks are cropped centered on the rough ROI contours, a three-dimensional classifier assigns each ROI a probability of being a real ROI, and after screening, the three-dimensional image blocks containing real ROIs are input into a three-dimensional convolution model obtained by multi-task training to obtain precise ROI contours, which are rendered in different colors according to the probability that each ROI is real.
2. The lung CT region-of-interest automatic detection method based on a multitask convolution model according to claim 1, characterized in that the two-dimensional semantic segmentation network is a U2-Net-based neural network that is an encoder-decoder structure as a whole, each unit of the encoder and decoder itself being an encoding-decoding structure, wherein the encoder gradually enlarges the receptive field of the convolutional layers through several down-sampling steps, extracting high-dimensional semantic features from the feature map and obtaining the rough position of the nodule in the current input; and the decoder gradually up-samples the feature map back to the input resolution, extracting low-dimensional visual features and obtaining the nodule mask for the input image.
3. The lung CT region-of-interest automatic detection method based on a multitask convolution model according to claim 2, characterized in that the outputs {e_i | i ∈ [1, N]} and {d_j | j ∈ [1, N]} of the encoder and decoder layers have different resolutions, i.e. the encoder-decoder structure uses multi-layer down-sampling and up-sampling, N being the number of encoder and decoder layers;
the corresponding layers of the encoder and the decoder are provided with short circuit connection, namely the input of the decoder of the j-th layer is Concat (e) j ,Up(d j-1 ) Whereinsaid: concat means that the feature map is spliced according to channel dimensions, and Up means that the feature map is Up-sampled to twice the original resolution by using an interpolation method.
4. The lung CT region-of-interest automatic detection method based on a multitask convolution model according to claim 1, characterized in that the three-dimensional classifier is a three-dimensional variety of a residual network whose residual structure is y = F(x, {W_i}) + x, wherein the subscript i refers to the i-th layer of the network, x and y are the input and output of the current layer, {W_i} are the parameters of the i-th layer, and F is the residual function, e.g. F = W_2·σ(W_1·x); the bottleneck layer maps the feature map to a higher-dimensional space for processing and then compresses it back to a lower-dimensional space.
5. The lung CT region-of-interest automatic detection method based on a multitask convolution model according to claim 1, characterized in that the precise ROI contour is obtained by inputting a three-dimensional image block of size 128 × 128 × 64, centered on the screened ROI, into a three-dimensional semantic segmentation network based on an encoder-decoder structure; the contour of each ROI is then output to an RT STRUCT file, which renders each contour in a different color depending on the probability output by the classifier.
6. The lung CT region-of-interest automatic detection method based on a multitask convolution model according to claim 5, characterized in that the three-dimensional semantic segmentation network is a three-dimensional variant of the two-dimensional semantic segmentation network, obtained by replacing each two-dimensional operation with the corresponding three-dimensional operation, i.e. replacing the two-dimensional convolution, pooling and normalization layers with three-dimensional convolution, pooling and normalization layers respectively.
7. The lung CT region-of-interest automatic detection method based on a multitask convolution model according to claim 6, characterized in that the multitask training means: when training the three-dimensional semantic segmentation network, an auxiliary task of judging the benignity or malignancy of the ROI is added besides the main task of outputting the ROI mask; specifically, in addition to the encoder that extracts features from the input CT image and the decoder that outputs the ROI mask, a classifier that judges whether the current ROI is benign or malignant is added to the three-dimensional semantic segmentation network; the classifier takes the output of each encoder layer as input and, through appropriate down-sampling and convolution operations, produces a probability value that the current ROI is benign; and during training, the parameters of the encoder are updated simultaneously by back-propagated gradients from the decoder and the classifier.
8. The lung CT region-of-interest automatic detection method based on a multitask convolution model according to claim 7, characterized in that the loss function of the three-dimensional classifier and of the classification branch in multi-task training is the Focal Loss: FocalLoss = −α·y·(1 − ŷ)^γ·log(ŷ) − (1 − α)·(1 − y)·ŷ^γ·log(1 − ŷ), wherein y and ŷ denote the true label and predicted value of ROI benignity/malignancy; the hyper-parameter α controls the weighting of positive and negative samples, positive samples contributing more to the model parameters when α > 0.5; and the hyper-parameter γ controls the weighting of easy and hard samples, hard samples receiving larger weight during gradient back-propagation when γ > 1, i.e. the model focuses more on hard samples.
9. The pulmonary CT ROI automatic detection method based on a multitask convolution model according to claim 5 or 6, wherein the Loss function of the two-dimensional and three-dimensional semantic segmentation networks is Focal Loss + Dice Loss, specifically:

Loss = Σᵢ FL(pᵢ, p̂ᵢ) + Dice Loss,  Dice Loss = 1 − 2·Σᵢ(pᵢ · p̂ᵢ) / (Σᵢ pᵢ + Σᵢ p̂ᵢ)

wherein p and p̂ respectively denote the true label and the predicted value of the ROI mask, and the subscript i denotes the i-th pixel value in the mask; Focal Loss is a pixel-wise loss function, computed for each pixel and then summed; the Dice Loss function depends only on the foreground region of the mask, is consistent with the final evaluation metric, and alleviates the class-imbalance problem, but when the foreground region is too small the model optimization process becomes unstable.
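The combined segmentation loss of claim 9 can be sketched as follows; the `smooth` term is our addition (a common stabiliser for the tiny-foreground instability the claim notes), and the per-pixel Focal Loss reuses the same form given for claim 8:

```python
import numpy as np

def focal_loss(p_true, p_pred, alpha=0.75, gamma=2.0, eps=1e-7):
    # per-pixel Focal Loss (same form as claim 8), averaged over the mask
    p = np.clip(p_pred, eps, 1.0 - eps)
    return float(np.mean(-alpha * p_true * (1.0 - p) ** gamma * np.log(p)
                         - (1.0 - alpha) * (1.0 - p_true) * p ** gamma * np.log(1.0 - p)))

def dice_loss(p_true, p_pred, smooth=1.0):
    """Soft Dice Loss over a predicted probability mask (illustrative sketch).

    Depends only on the foreground overlap, matching the final Dice
    evaluation metric and easing class imbalance; `smooth` guards against
    division-by-near-zero when the foreground region is tiny.
    """
    inter = np.sum(p_true * p_pred)
    return float(1.0 - (2.0 * inter + smooth)
                 / (np.sum(p_true) + np.sum(p_pred) + smooth))

# combined segmentation loss on a toy 4-pixel mask
p_true = np.array([0.0, 0.0, 1.0, 1.0])    # ground-truth ROI mask pixels
p_pred = np.array([0.1, 0.2, 0.8, 0.9])    # predicted probabilities
total = focal_loss(p_true, p_pred) + dice_loss(p_true, p_pred)
```

A perfect prediction drives the Dice term to zero, while the Focal term still distinguishes confident from hesitant correct predictions, which is why the two are summed rather than used alone.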
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210773167.6A CN115294151A (en) | 2022-07-01 | 2022-07-01 | Lung CT interested region automatic detection method based on multitask convolution model |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115294151A true CN115294151A (en) | 2022-11-04 |
Family
ID=83822823
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210773167.6A Pending CN115294151A (en) | 2022-07-01 | 2022-07-01 | Lung CT interested region automatic detection method based on multitask convolution model |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115294151A (en) |
- 2022-07-01 CN CN202210773167.6A patent/CN115294151A/en active Pending
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117351215A (en) * | 2023-12-06 | 2024-01-05 | 上海交通大学宁波人工智能研究院 | Artificial shoulder joint prosthesis design system and method |
CN117351215B (en) * | 2023-12-06 | 2024-02-23 | 上海交通大学宁波人工智能研究院 | Artificial shoulder joint prosthesis design system and method |
Similar Documents
Publication | Title |
---|---|
CN109580630B | Visual inspection method for defects of mechanical parts |
CN111027547A | Automatic detection method for multi-scale polymorphic target in two-dimensional image |
CN108288271A | Image detecting system and method based on three-dimensional residual error network |
CN114092439A | Multi-organ instance segmentation method and system |
CN109840483B | Landslide crack detection and identification method and device |
CN111462120A | Defect detection method, device, medium and equipment based on semantic segmentation model |
CN112734755A | Lung lobe segmentation method based on 3D full convolution neural network and multitask learning |
CN112132166A | Intelligent analysis method, system and device for digital cytopathology image |
CN111402254A | CT image pulmonary nodule high-performance automatic detection method and device |
CN112700461B | System for pulmonary nodule detection and characterization class identification |
CN113223005B | Thyroid nodule automatic segmentation and grading intelligent system |
CN111709929A | Lung canceration region segmentation and classification detection system |
CN113139977B | Mouth cavity curve image wisdom tooth segmentation method based on YOLO and U-Net |
CN112420170B | Method for improving image classification accuracy of computer aided diagnosis system |
CN115909006A | Mammary tissue image classification method and system based on convolution Transformer |
CN115546605A | Training method and device based on image labeling and segmentation model |
CN115170518A | Cell detection method and system based on deep learning and machine vision |
CN115294151A | Lung CT interested region automatic detection method based on multitask convolution model |
CN114581474A | Automatic clinical target area delineation method based on cervical cancer CT image |
CN116664590B | Automatic segmentation method and device based on dynamic contrast enhancement magnetic resonance image |
CN116883341A | Liver tumor CT image automatic segmentation method based on deep learning |
CN115423806B | Breast mass detection method based on multi-scale cross-path feature fusion |
CN116563691A | Road disease detection method based on TransUnet model |
CN115018780B | Thyroid nodule segmentation method integrating global reasoning and MLP architecture |
CN113763343B | Deep learning-based Alzheimer's disease detection method and computer-readable medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||