CN117011246A - Segmented vertebra CT image segmentation method and system based on Transformer - Google Patents

Segmented vertebra CT image segmentation method and system based on Transformer

Info

Publication number
CN117011246A
Authority
CN
China
Prior art keywords
image
stage
module
input
convolution unit
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310850038.7A
Other languages
Chinese (zh)
Inventor
邵明昊
郭延恩
唐文彬
江宗康
宓海
蔡宁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Jirui Medical Technology Co ltd
Original Assignee
Shanghai Jirui Medical Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Jirui Medical Technology Co ltd filed Critical Shanghai Jirui Medical Technology Co ltd
Priority to CN202310850038.7A priority Critical patent/CN117011246A/en
Publication of CN117011246A publication Critical patent/CN117011246A/en
Pending legal-status Critical Current


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/0002Inspection of images, e.g. flaw detection
    • G06T7/0012Biomedical image inspection
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • G06N3/0455Auto-encoder networks; Encoder-decoder networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10072Tomographic images
    • G06T2207/10081Computed x-ray tomography [CT]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/30Subject of image; Context of image processing
    • G06T2207/30004Biomedical image processing
    • G06T2207/30008Bone
    • G06T2207/30012Spine; Backbone

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Medical Informatics (AREA)
  • Nuclear Medicine, Radiotherapy & Molecular Imaging (AREA)
  • Radiology & Medical Imaging (AREA)
  • Quality & Reliability (AREA)
  • Image Analysis (AREA)

Abstract

A Transformer-based segmented vertebra CT image segmentation method adopts a deep learning neural network model that follows a classical encoder-decoder network architecture. The encoder comprises a preprocessing module, a first convolution unit layer module, and a self-attention module, and consists of four stages mainly used for capturing multi-scale context features and improving feature representation capability; the decoder comprises a concat layer module, a second convolution unit layer module, and a self-attention module, and consists of three stages focused on restoring long-range localization details and gradually reconstructing the feature map to the size of the original input image. The invention thereby provides better feature learning capability in medical image segmentation tasks, better handles features of different scales and the spatial relationships between pixels, and reduces the risk of overfitting, thereby improving the accuracy of segmentation results.

Description

Segmented vertebra CT image segmentation method and system based on Transformer
Technical Field
The invention relates to the technical field of surgical robots based on medical image processing, and in particular to a Transformer-based segmented vertebra CT image segmentation method and system applied to a spinal surgery robot.
Background
In recent years, image processing and computer technology have developed rapidly, and images provide a great deal of important information in daily life. Image processing extracts information from images; combining clinical images with computer technology to analyze a patient's condition enables active prevention of disease and reduces the treatment risks faced by patients.
In the diagnosis and detection of vertebral diseases, computed tomography (CT) displays tissue structures well thanks to its simple operation, fast imaging, and high resolution, and is therefore widely used in spinal disease diagnosis.
In clinical work on spinal disease, doctors apply computer-aided navigation techniques to acquire spinal images before surgery, combine image processing with three-dimensional reconstruction and visualization, and display two-dimensional or three-dimensional spinal images in a virtual world coordinate space, so as to fully evaluate the spinal condition based on digitized information, simulate the surgical path, and control surgical tools that reach the target site according to locators. The processing of spinal images therefore directly reflects the medical facts of the spine and is an important basis for clinical diagnosis and surgical treatment.
Efficient and accurate image segmentation can restore the actual physiological form of the patient's vertebrae to the greatest extent and helps the operating doctor better grasp the distribution of lesions and important tissue structures. Clinically, image segmentation is mainly completed through manual annotation, which depends on the clinical experience and knowledge of doctors; because of factors such as reading fatigue and heavy workload, the process is tedious, time-consuming, and irreproducible, and subjective judgment can accumulate large errors. By means of deep learning technology, automatic segmentation of vertebra CT images yields vertebra segmentation results with higher accuracy, realizes automatic and effective segmentation of the vertebrae, and provides theoretical reference and technical support for image processing and the clinical development of spinal medicine.
Some automatic image segmentation methods already exist, such as Chinese patent publication No. CN113313717A, a segmented vertebra CT image segmentation method based on deep learning; Chinese patent publication No. CN114049955A, a computed tomography spine fracture auxiliary diagnostic system; and Chinese patent publication No. CN114240930A, a lumbar vertebra identification and positioning device and method based on deep learning, and electronic equipment.
However, the techniques in the above patents for medical image segmentation tasks generally use models such as convolutional neural networks (CNNs) to extract and classify features. These methods can be affected by factors such as poor image quality and noise, which may lead to inaccurate segmentation results, embodied in the following aspects:
(1) Models such as convolutional neural networks in the prior art are typically local and can only capture local features.
(2) Structures in medical images may be of different sizes, requiring features of different scales to be processed simultaneously. Prior art methods typically use multi-scale convolution or similar techniques to address this problem, but the effect is not stable.
(3) There are often complex spatial relationships between pixels in medical images, which are important for correct segmentation. Conventional methods typically use convolution operations to process spatial relationships, but such methods may lose some information.
(4) Medical images are typically limited in number, while segmentation tasks require highly accurate results, which easily leads to overfitting.
Disclosure of Invention
In order to solve the above technical problems, the invention provides a Transformer-based segmented vertebra CT image segmentation method. By using Transformer technology, it provides better feature learning capability in medical image segmentation tasks, better handles features of different scales and the spatial relationships between pixels, and reduces the risk of overfitting, thereby improving the accuracy of segmentation results.
In order to achieve the above purpose, the technical scheme of the invention is as follows:
a transducer-based segmented vertebrae CT image segmentation method based on a deep learning neural network model following a classical encoder-decoder network architecture, the encoder comprising a preprocessing module, a first convolution unit layer module, a self-attention module and an output finishing module; the decoder comprises a contact layer module, a second convolution unit layer module and a self-attention module; the method comprises a deep learning neural network model training step S1 and a cone segmentation step S2:
the deep learning neural network model training step S1 includes:
Step S11: acquiring the original 3D-CT images of N previous patients' spinal surgeries and labeling the M vertebral body edge contours in the corresponding original 3D-CT images to form vertebral body image masks; the labeled image features are category feature information, and the M vertebral body edge contours of each patient form M vertebral body masks (MASK); the preprocessing module processes the original 3D-CT images into standard-unified 3D-CT images;
Step S12: sequentially inputting the original 3D-CT images to the encoder and executing the four stages of the encoder; in the first stage, the first convolution module receives the preprocessed 3D-CT image to obtain a first image; the first convolution module comprises a first convolution unit and a second convolution unit serving as downsampling to capture multi-scale context features; the first convolution unit and the second convolution unit each comprise a convolution layer, a BN layer, and a ReLU layer; the number of input channels of the first convolution unit is half the number of output channels, and the number of input channels of the second convolution unit equals the number of output channels; the first image is an image having external contour features;
Step S13: executing the second stage of the encoder, namely inputting the first image to the self-attention module, which comprises an input feature map unit layer, a third convolution unit layer, a lightweight attention unit layer, and a concat layer for receiving the first image; the number of input channels of the third convolution unit layer is half the number of output channels; the numbers of input and output channels of the lightweight attention unit layer are the same, and the lightweight attention unit layer extracts the global feature information of the first image to obtain a second image, learns long-range dependency information of the second image, and fuses the local semantic features of the first image so that the second image has stronger semantic information than the first image; the concat layer makes the number of output channels of the encoder the same as the number of channels of the first convolution unit;
Step S14: executing the third stage and the fourth stage of the encoder, namely, in the third stage, the third convolution unit layer of the self-attention module takes as input the output feature map of the second stage, and step S13 is executed again; in the fourth stage, the third convolution unit layer of the self-attention module takes as input the output feature map of the third stage, and step S13 is executed again;
Step S15: sequentially inputting the encoder output feature map to the decoder and executing the three stages of the decoder; in the first stage, the upsampled output feature map of the fourth stage of the encoder is merged with the feature map of the third stage of the encoder through the concat layer module, which connects input tensors and output tensors in the deep neural network to obtain merged feature maps of different resolutions; the number of input channels of the concat layer module is half the number of output channels; the merged input features are then processed by the second convolution unit layer module, whose number of input channels is twice the number of output channels; finally, the result passes through the self-attention module, whose numbers of input and output channels are the same;
Step S16: executing the second stage and the third stage of the decoder, namely, in the second stage, the input of the second convolution unit layer module is the output feature map of the first stage, and step S15 is executed again; in the third stage, the input of the second convolution unit layer module is the output feature map of the second stage, and step S15 is executed again;
Step S17: training the deep learning neural network model by a gradient descent method according to the intermediate image features and the corresponding vertebral body masks, thereby obtaining the final deep learning neural network model;
The vertebral body segmentation step S2 includes:
Step S21: when surgery needs to be performed on a patient, preprocessing the patient's original 3D-CT image in the same manner as in step S12 to obtain a preprocessed 3D-CT image;
Step S22: acquiring the trained deep learning neural network model, and segmenting the preprocessed 3D-CT image as input to obtain the vertebral body masks formed by the M segmented vertebral body edge contours.
Further, the preprocessing in step S12 and step S21 consists of unified resolution, resampling, and standardization of the original 3D-CT image, so as to obtain a preprocessed 3D-CT image.
Further, in step S13, the self-attention module further includes a bottleneck unit having residual characteristics, and the bottleneck module includes an input feature map, two convolution layers, a nonlinear activation function, a residual connection, and two batch normalization layers.
Further, in the encoder, the first convolution unit and the second convolution unit each include a 3 × 3 convolution layer; the first convolution unit has 32 input channels and 64 output channels; the second convolution unit has 64 input channels and 64 output channels; and the input size of the 3D-CT image is 1 × 128 × 128.
Further, in the second stage of the encoder, the numbers of input and output channels of the third convolution unit layer are half of those in the third stage and the fourth stage of the encoder, respectively, and the numbers of input and output channels of the lightweight attention unit layer are half of those in the third stage and the fourth stage of the encoder.
Further, in the second stage of the encoder, the third convolution unit layer has 64 input channels and 128 output channels, and the lightweight attention unit layer has 128 input channels and 128 output channels; in the third stage, the third convolution unit layer has 128 input channels and 256 output channels; in the fourth stage, the third convolution unit layer has 256 input channels and 256 output channels; in both the third and fourth stages, the lightweight attention unit layer has 256 input channels and 256 output channels.
Further, relative to the first stage of the decoder, the numbers of input and output channels of the concat layer module are successively halved in the second stage and the third stage, as are the numbers of input and output channels of the self-attention unit layer.
Further, in the first stage of the decoder, the concat layer module has 256 input channels and 512 output channels, the second convolution unit layer module has 512 input channels and 256 output channels, and the self-attention module has 256 input channels and 256 output channels.
Further, in the first stage of the decoder, the third convolution unit layer module consists of a 3 × 3 convolution layer, a BN layer, and a ReLU layer; the third convolution unit layer module has 512 input channels and 256 output channels.
Further, the standard-unified 3D-CT image is obtained through unified resolution, resampling, and standardization that stabilizes the dimensions of the scanned image data.
According to the above technical scheme, by means of deep learning technology the invention automatically segments vertebra CT images, obtains vertebra segmentation results with higher accuracy, realizes automatic and effective segmentation of the vertebrae, creates conditions for surgical planning, and provides powerful technical support. Compared with the prior art, the beneficial technical effects are:
(1) Long-range dependencies are learned based on Transformer technology; global features can be learned using the self-attention mechanism, so long-range dependency information can be processed better.
(2) Using Transformer technology, the method can adapt to features of different scales, i.e., it can better process the features of a 3D-CT image at different scales and is more stable.
(3) Using Transformer technology, spatial relationships, i.e., the spatial relationships between pixels, can be better captured, and thus structural information in the image can be better captured.
(4) Using a Transformer can reduce the number of parameters of the model, thereby reducing the risk of overfitting. Furthermore, the Transformer's attention mechanism can help the model handle noise and uncertainty better.
Drawings
FIG. 1 is a flow chart of a Transformer-based segmented vertebra CT image segmentation method in an embodiment of the invention.
FIG. 2 is a schematic diagram of a deep learning neural network model according to an embodiment of the invention.
Detailed Description
The following describes embodiments of the present invention in further detail with reference to FIGS. 1-2.
It should be noted that the segmentation method adopted by the present invention is based on a deep learning neural network model that follows a classical encoder-decoder network architecture. In the embodiment of the present invention, the encoder includes a preprocessing module, a first convolution unit layer module, a self-attention module, and an output finishing module, and consists of four stages mainly used for capturing multi-scale context features and improving feature representation capability; the decoder includes a concat layer module, a second convolution unit layer module, and a self-attention module, and consists of three stages focused on restoring long-range localization details and gradually reconstructing the feature map to the size of the original input image.
Referring to FIG. 1, FIG. 1 is a flowchart illustrating a Transformer-based segmented vertebra CT image segmentation method according to an embodiment of the present invention. As shown in FIG. 1, the method includes a deep learning neural network model training step S1 and a vertebral body segmentation step S2.
The deep learning neural network model training step S1 includes:
Step S11: acquiring the original 3D-CT images of N previous patients' spinal surgeries and labeling the M vertebral body edge contours in the corresponding original 3D-CT images to form vertebral body masks; the labeled image features are category feature information, and the M vertebral body edge contours of each patient form M vertebral body masks.
It is clear to those skilled in the art that the key element of spine image processing is image segmentation, which decomposes the image, extracts the region of interest, and finds objects or object boundaries in the image. An image is in essence composed of individual pixels, each with a gray value, and the edges of the target region can usually be found by exploiting the characteristics of the pixels.
The essence of image segmentation is a pixel classification problem: each pixel is assigned a label to determine whether it belongs to the target region. Once all pixels have been classified, the resulting set of pixels is the desired segmentation result.
After the plurality of 3D-CT images and the labeled M vertebral body edge contours in the corresponding original 3D-CT images are available, the deep learning neural network can be constructed and trained, and can serve as a segmentation neural network for segmenting the plurality of 3D-CT images.
That is, the segmentation neural network is a neural network trained with a plurality of 3D-CT images and the data of the M vertebral body edge contours in the corresponding original 3D-CT images, becoming a CT-image-based vertebral body segmentation model.
Before training, the above empirical data needs to undergo standard, unified preprocessing. In the embodiment of the present invention, unified preprocessing by means of unified resolution, resampling, standardization, and the like may be adopted, specifically as follows (a code sketch follows the list):
(1) Unified resolution
Because the 3D-CT images in the dataset come from different scanning devices, different scanning parameters lead to different spatial resolutions, and the resolutions of a 3D-CT image may even differ across its dimensions. It is clear to those skilled in the art that resolution differences between images make the network difficult to train.
Therefore, in the embodiment of the present invention, the resolution of all images can be unified to 1 mm × 1 mm.
(2) Resampling
In an embodiment of the present invention, the image resampling may employ a linear interpolation method, and the label may employ a nearest neighbor interpolation method.
(3) Standardization
In an embodiment of the present invention, the mean μ and standard deviation σ of the images in the entire dataset may be computed first; the mean μ is then subtracted from each 3D-CT image and the result divided by the standard deviation σ, so that the data of each scanned image is stabilized within a certain range.
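For illustration only, the following is a minimal Python sketch of these three preprocessing steps; the helper names, the use of scipy.ndimage.zoom, and the (1.0, 1.0, 1.0) mm target spacing are assumptions of this sketch rather than requirements of the invention.

```python
# Minimal preprocessing sketch (names and SciPy usage are assumptions):
# unify spacing by resampling, then normalize with dataset-wide statistics.
import numpy as np
from scipy.ndimage import zoom

def preprocess_ct(volume, spacing, target_spacing=(1.0, 1.0, 1.0),
                  dataset_mean=0.0, dataset_std=1.0):
    # (1)+(2) unified resolution via linear (order=1) resampling
    factors = [s / t for s, t in zip(spacing, target_spacing)]
    resampled = zoom(volume.astype(np.float32), factors, order=1)
    # (3) standardization: subtract the dataset mean μ, divide by the std σ
    return (resampled - dataset_mean) / dataset_std

def preprocess_label(mask, spacing, target_spacing=(1.0, 1.0, 1.0)):
    # labels are resampled with nearest-neighbor (order=0) interpolation
    factors = [s / t for s, t in zip(spacing, target_spacing)]
    return zoom(mask, factors, order=0)
```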
Step S12: sequentially inputting the original 3D-CT images to an encoder, and executing four stages of the encoder; the first stage, the first convolution module receives the preprocessed 3D-CT image to obtain a first image, the first convolution module comprises a first convolution unit and a second convolution unit which are used as downsampling to capture and obtain a multi-scale context feature, the first convolution unit and the second convolution unit comprise a convolution layer, a BN layer and a ReLU layer, the number of input channels of the first convolution unit is half of the number of output channels, and the number of input channels of the second convolution unit is the same as the number of output channels; wherein the first image is an image having external contour features.
In the embodiment of the invention, the N original 3D-CT images are sequentially fed into the deep learning neural network model in the form of arrays, i.e., first passing in order through two units each consisting of a 3 × 3 convolution layer, a BN layer, and a ReLU layer. The number of input channels of the first convolution unit can be 32 and the number of output channels 64; the second convolution unit can have 64 input channels and 64 output channels, and the size of the input image can be 1 × 1 × 128 × 128.
The second convolution module used for downsampling comprises a 3 × 3 convolution layer, a BN layer, and a ReLU layer; downsampling is performed by the second convolution module with a stride of 2, which doubles the number of channels of the 3D-CT image feature map and halves its size. Preferably, the number of input channels of the second convolution module can be 64 and the number of output channels 128. A sketch of these convolution units follows.
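The PyTorch sketch below is a hedged illustration: the 3D (Conv3d) reading of the 3 × 3 convolutions and the exact module layout are assumptions, while the channel counts follow the values stated above.

```python
# Conv -> BN -> ReLU unit as described above (PyTorch sketch, 3D assumed).
import torch.nn as nn

def conv_unit(in_ch, out_ch, stride=1):
    return nn.Sequential(
        nn.Conv3d(in_ch, out_ch, kernel_size=3, stride=stride,
                  padding=1, bias=False),
        nn.BatchNorm3d(out_ch),
        nn.ReLU(inplace=True),
    )

# First stage: input channels half of output channels, then equal channels.
stage1 = nn.Sequential(conv_unit(32, 64), conv_unit(64, 64))
# Downsampling module: stride 2 halves the feature-map size and
# doubles the number of channels (64 -> 128).
downsample = conv_unit(64, 128, stride=2)
```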
Next, step S13: executing the second stage of the encoder, inputting the first image to the self-attention module, which includes an input feature map unit layer, a third convolution unit layer, a lightweight attention unit layer, and a transposed convolution unit layer for receiving the first image; the number of input channels of the third convolution unit layer is half the number of output channels; the numbers of input and output channels of the lightweight attention unit layer are the same, and the lightweight attention unit layer extracts the global feature information of the first image to obtain a second image, learns long-range dependency information of the second image, and fuses the local semantic features of the first image so that the second image has stronger semantic information than the first image; the concat layer makes the number of output channels of the encoder the same as the number of channels of the first convolution unit.
The input of the self-attention module is the output feature map of the second convolution module; the number of input channels can be 128 and the number of output channels 128, so that global feature information is extracted and local semantic features are fused to obtain stronger semantic information.
In some embodiments of the present invention, the self-attention (Transformer) module may include an input feature layer, a third convolution unit, a lightweight attention unit (MHSA), and a concat layer, wherein the third convolution unit consists of a convolution, a ReLU activation function, and BN regularization.
In particular, the third convolution unit may be a 1 × 1 convolution layer used to split the number of channels of the input feature map, yielding a 3D-CT image feature map with half the number of channels; this feature map is then passed through the attention unit to extract global features and learn long-range dependency information, with the output size unchanged.
The lightweight attention unit can be regarded as an improved multi-head self-attention unit, a neural network module based on the attention mechanism and an important component of the Transformer model. It is mainly used to process sequence-type input data and can compute a weight vector for each input position to represent the importance relationships among different positions. The specific implementation steps are as follows:
First, the input (an input sequence) passes through three linear transformation layers to obtain the q (query), k (key), and v (value) vectors; similarity is computed between the q and k vectors to obtain an attention score, i.e., the dot product between q and k is divided by a scaling factor sqrt(d_k) (d_k being the dimension of the k vector), and a softmax is applied to obtain the weight vector; the weight vector is multiplied by the v vector to obtain the output; finally, the outputs are concatenated and a linear transformation layer performs a dimension transformation to obtain the final output.
That is, after the input sequence undergoes the linear transformations, similarity is computed between the q and k vectors to obtain an attention score indicating the correlation between a position and the other positions; a weight distribution is obtained through softmax normalization and then weighted and summed with the v vector to give the output of that position, i.e., Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) · V, where Q represents the query vector, K the key vector, V the value vector, d_k the dimension of the K vector, and T the transpose.
In the embodiment of the invention, the lightweight attention module improves the multi-head self-attention module by replacing the three linear transformation layers with three convolution unit layers, downsampling the input 3D-CT image feature maps and reducing the amount of computation by reducing the feature map size; a sketch of such a unit follows.
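The following PyTorch sketch illustrates such a lightweight attention unit under stated assumptions: the number of heads, the key/value downsampling stride, and the reshaping details are illustrative choices, not values fixed by the patent.

```python
# Lightweight multi-head self-attention sketch: the q/k/v linear layers are
# replaced by convolutions, and k/v are downsampled to cut computation.
# Assumes channels % heads == 0 and spatial sizes divisible by kv_stride.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LightweightAttention3D(nn.Module):
    def __init__(self, channels, heads=4, kv_stride=2):
        super().__init__()
        self.heads, self.d_k = heads, channels // heads
        self.to_q = nn.Conv3d(channels, channels, 1)
        self.to_k = nn.Conv3d(channels, channels, kv_stride, stride=kv_stride)
        self.to_v = nn.Conv3d(channels, channels, kv_stride, stride=kv_stride)
        self.proj = nn.Conv3d(channels, channels, 1)

    def forward(self, x):
        b, c, d, h, w = x.shape
        q = self.to_q(x).reshape(b, self.heads, self.d_k, -1)  # full resolution
        k = self.to_k(x).reshape(b, self.heads, self.d_k, -1)  # downsampled
        v = self.to_v(x).reshape(b, self.heads, self.d_k, -1)  # downsampled
        # scaled dot-product attention: softmax(q^T k / sqrt(d_k)) applied to v
        attn = F.softmax(q.transpose(-2, -1) @ k / self.d_k ** 0.5, dim=-1)
        out = (attn @ v.transpose(-2, -1)).transpose(-2, -1)
        return self.proj(out.reshape(b, c, d, h, w))  # output size unchanged
```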
The transposed convolution unit layer consists of a transposed convolution, a ReLU activation function, and BN regularization, and is mainly used to upsample images; the residual unit mainly adds the input 3D-CT image feature map to the output 3D-CT image feature map of the transposed convolution unit layer to obtain deeper semantic features.
The input of the third convolution unit is the output feature map of the second stage; the number of input channels can be 128 and the number of output channels 256. The input of the self-attention module can be the output feature map of the convolution unit module, with 256 input channels and 256 output channels.
Step S14: executing a third stage and a fourth stage of the encoder, namely, in the third stage, the third convolution unit layer input of the self-attention module is an output characteristic diagram of the second stage, and executing the step S13 again; in the fourth stage, the third convolution unit layer input of the self-attention module is the output feature map of the third stage, and step S13 is executed again.
In other preferred embodiments of the present invention, the self-attention module may further include a bottleneck (BottleNeck) module with residual characteristics, which mainly adds the input feature map to the output feature map of the transposed convolution unit layer to obtain deeper semantic features.
The bottleneck module with residual characteristics comprises an input feature map, two convolution layers (a fourth convolution unit and a fifth convolution unit), a nonlinear activation function, a residual connection, and two batch normalization layers. The output of the self-attention module produces two branches: the first branch is unchanged, while the second branch captures local 3D-CT image features through the bottleneck module with residual characteristics, with its output size unchanged; finally, the 3D-CT image feature maps of the two branches pass through a concat layer so that the number of output channels returns to the original number of channels.
In summary, the input 3D-CT image feature map is fused through the fourth convolution unit with a small-channel-count bottleneck convolution kernel, reducing computation and model complexity. The fifth convolution unit then deepens the fused feature map and extracts more high-level features. A nonlinear activation function next applies a nonlinear transformation to the convolution output, enhancing the nonlinear capacity of the network. Furthermore, to realize a residual connection in the network, the convolution output is added back to the input feature map. Finally, the performance of the network is further optimized by the second batch normalization layer. A bottleneck module with residual characteristics offers high effectiveness and stability in training and testing and has broad application prospects in fields such as computer vision and natural language processing; a sketch follows.
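A hedged PyTorch sketch of such a bottleneck unit is given below; the 1 × 1 / 3 × 3 kernel split and the channel reduction ratio are assumptions of the sketch, while the counts of layers (two convolutions, two batch normalizations, one nonlinearity, one residual connection) follow the description.

```python
# Residual bottleneck sketch: shrink channels, deepen features, restore
# channels, add the input back (residual connection keeps the output size).
import torch.nn as nn

class ResidualBottleneck3D(nn.Module):
    def __init__(self, channels, reduction=2):
        super().__init__()
        mid = channels // reduction
        # fourth convolution unit: small channel count cuts computation
        self.conv1 = nn.Conv3d(channels, mid, kernel_size=1, bias=False)
        self.bn1 = nn.BatchNorm3d(mid)
        # fifth convolution unit: deepens the fused feature map
        self.conv2 = nn.Conv3d(mid, channels, kernel_size=3,
                               padding=1, bias=False)
        self.bn2 = nn.BatchNorm3d(channels)
        self.relu = nn.ReLU(inplace=True)   # nonlinear activation function

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))     # second batch normalization
        return out + x                      # residual connection
```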
Referring again to FIG. 2, FIG. 2 is a schematic diagram of the deep learning neural network model in an embodiment of the present invention. As shown in FIG. 2, the decoder may include, in order, a concat layer module, a second convolution unit layer module, and a self-attention module. The decoder part is split into three stages, which focus on restoring long-range localization details and gradually reconstructing the feature map to the size of the original 3D-CT image.
Step S15: sequentially inputting the encoder output characteristic diagram to a decoder, and executing three stages of the decoder; the method comprises the steps of a first stage, combining the feature images of an output feature image of a fourth stage of an encoder after upsampling with the feature images of the third stage of the encoder through a contact layer module, wherein the contact layer module is used for connecting input tensors and output tensors in a deep neural network to obtain feature images with different resolutions after combining the feature images; the number of input channels of the contact layer module is half of the number of output channels; then carrying out combined input characteristic processing through a second convolution unit layer module, wherein the number of input channels of the second convolution unit layer module is twice the number of output channels; and finally, through the self-attention module, the number of input channels of the self-attention module is the same as the number of output channels.
Specifically, before each self-attention module the 3D-CT image feature map is upsampled using trilinear interpolation, after which the high-level features of the encoder are connected with the upsampled 3D-CT image features for feature fusion, similar to U-Net. For each upsampling step, the number of channels of the 3D-CT image feature map is halved and the 3D-CT image size is doubled.
In the embodiment of the invention, the inputs of the first stage of the decoder are the upsampled output feature map of the fourth stage of the encoder and the output feature map of the third stage of the encoder: the two feature maps are first merged through the concat layer, changing the number of channels from 256 to 512; a 3 × 3 convolution layer, a BN layer, and a ReLU layer then process the merged input features, changing the number of channels from 512 to 256; finally, the result passes through a self-attention module. The input of the second convolution unit layer module is the output feature map of the concat layer, with 512 input channels and 256 output channels; the input of the self-attention module is the output feature map of the convolution unit module, with 256 input channels and 256 output channels. A sketch of one decoder stage follows.
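The PyTorch sketch below shows one decoder stage under assumptions: the attention module is passed in from outside, and align_corners=False is an illustrative choice. The channel flow mirrors the first stage described above (256 + 256 → 512 → 256).

```python
# One decoder stage: trilinear upsampling, concatenation with the encoder
# skip feature (the "concat layer"), a conv unit halving the channels,
# then self-attention.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecoderStage(nn.Module):
    def __init__(self, channels, attention_module):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv3d(2 * channels, channels, kernel_size=3,
                      padding=1, bias=False),
            nn.BatchNorm3d(channels),
            nn.ReLU(inplace=True),
        )
        self.attn = attention_module

    def forward(self, x, skip):
        # upsample to the skip feature's resolution (size doubles)
        x = F.interpolate(x, size=skip.shape[2:], mode='trilinear',
                          align_corners=False)
        x = torch.cat([x, skip], dim=1)   # concat layer: channels add up
        return self.attn(self.fuse(x))    # conv halves the channels back
```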
Step S16: executing a second stage and a third stage of the decoder, namely, in the second stage, the input of the second convolution unit layer module is an output characteristic diagram of the first stage, and executing the step S14 again; in the third stage, the second convolution unit layer module inputs the output feature map of the second stage, and step S15 is executed again.
That is, the second and third stages are similar to the first stage: the second convolution unit layer module takes as input the output feature map of the concat layer, with 256 input channels and 128 output channels; the input of the self-attention module is the output feature map of the convolution unit module, with 128 input channels and 128 output channels. Finally, a Softmax activation function combined with a 1 × 1 convolution layer produces the final multi-class segmentation result.
Step S17: and training the deep learning neural network model by adopting a gradient descent method according to the medium image characteristics and the corresponding cone MASK, so as to obtain the final deep learning neural network model.
After the segmentation network is established, the model is stored directly for later use. The PyTorch framework is adopted when training the neural network: after the network structure design and initialization are completed, the training set is input into the network for training, and a gradient descent method is used to make the network converge to an optimal value according to the loss changes of the training set and the test set during training.
To optimize the neural network parameters, the gradient descent algorithm computes the loss function of the deep learning neural network model on each mini-batch of training data, along with the gradient of the loss function with respect to the network parameters. Moving the parameters against the gradient direction continuously reduces the loss function, thereby optimizing the parameters of the deep learning neural network model. In addition, by observing the loss changes of the training set and the test set, one can judge whether the current deep learning neural network model is overfitting: if the training loss decreases while the test loss increases, the model is overfitting the training set, and training can be improved by adding regularization terms, reducing model complexity, and similar methods, so that the model converges to an optimal value. A condensed training-loop sketch follows.
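The sketch below condenses the scheme just described; PyTorch and mini-batch gradient descent follow the text, while the SGD optimizer, the cross-entropy loss, and the hyperparameters are assumptions of the sketch.

```python
# Training sketch: mini-batch gradient descent with train/test loss
# monitoring to detect overfitting (loss and optimizer choices assumed).
import torch

def train(model, train_loader, test_loader, epochs=100, lr=1e-3):
    opt = torch.optim.SGD(model.parameters(), lr=lr)  # gradient descent
    loss_fn = torch.nn.CrossEntropyLoss()             # multi-class loss
    for epoch in range(epochs):
        model.train()
        for image, mask in train_loader:              # mask: LongTensor labels
            opt.zero_grad()
            loss = loss_fn(model(image), mask)
            loss.backward()   # gradient of the loss w.r.t. the parameters
            opt.step()        # move parameters against the gradient
        model.eval()
        with torch.no_grad():  # rising test loss with falling train loss
            test_loss = sum(loss_fn(model(x), y).item()  # signals overfitting
                            for x, y in test_loader) / len(test_loader)
        print(f"epoch {epoch}: mean test batch loss {test_loss:.4f}")
```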
After the final deep learning neural network model is trained, test data can be input and preprocessed. The vertebral body segmentation step S2 includes:
Step S21: when surgery needs to be performed on a patient, the patient's original 3D-CT image undergoes unified resolution, resampling, and standardization preprocessing to obtain a preprocessed 3D-CT image.
The unified resolution, resampling, and standardization here are performed in exactly the same manner as described above for the training data in step S1.
Step S22: and acquiring the trained deep learning neural network model, dividing the preprocessed 3D-CT image of the patient as input, and finally obtaining a division result (for example, obtaining a cone MASK formed by the edge contours of M divided cones and storing the cone MASK in a nii.gz format).
In summary, by means of deep learning technology, the invention automatically segments vertebra CT images to obtain vertebra segmentation results with higher accuracy, realizes automatic and effective segmentation of vertebrae, creates conditions for surgical planning, and provides powerful technical support.
The foregoing description covers only the preferred embodiments of the present invention; the embodiments are not intended to limit the scope of the invention, and all equivalent structural changes made according to the specification and drawings of the present invention are included within the scope of the invention.

Claims (10)

1. A Transformer-based segmented vertebra CT image segmentation method, the segmentation method being based on a deep learning neural network model, characterized in that the deep learning neural network model follows a classical encoder-decoder network architecture, the encoder comprising a preprocessing module, a first convolution unit layer module, and a self-attention module; the decoder comprising a concat layer module, a second convolution unit layer module, and a self-attention module; the method comprising a deep learning neural network model training step S1 and a vertebral body segmentation step S2:
the deep learning neural network model training step S1 includes:
Step S11: acquiring the original 3D-CT images of N previous patients' spinal surgeries and labeling the M vertebral body edge contours in the corresponding original 3D-CT images to form vertebral body masks; the labeled image features are category feature information, and the M vertebral body edge contours of each patient form M vertebral body masks; the preprocessing module processes the original 3D-CT images into standard-unified 3D-CT images;
Step S12: sequentially inputting the original 3D-CT images to the encoder and executing the four stages of the encoder; in the first stage, the first convolution module receives the preprocessed 3D-CT image to obtain a first image; the first convolution module comprises a first convolution unit and a second convolution unit serving as downsampling to capture multi-scale context features; the first convolution unit and the second convolution unit each comprise a convolution layer, a BN layer, and a ReLU layer; the number of input channels of the first convolution unit is half the number of output channels, and the number of input channels of the second convolution unit equals the number of output channels; the first image is an image having external contour features;
Step S13: executing the second stage of the encoder, namely inputting the first image to the self-attention module, which comprises an input feature map unit layer, a third convolution unit layer, a lightweight attention unit layer, and a concat layer for receiving the first image; the number of input channels of the third convolution unit layer is half the number of output channels; the numbers of input and output channels of the lightweight attention unit layer are the same, and the lightweight attention unit layer extracts the global feature information of the first image to obtain a second image, learns long-range dependency information of the second image, and fuses the local semantic features of the first image so that the second image has stronger semantic information than the first image; the concat layer makes the number of output channels of the encoder the same as the number of channels of the first convolution unit;
Step S14: executing the third stage and the fourth stage of the encoder, namely, in the third stage, the third convolution unit layer of the self-attention module takes as input the output feature map of the second stage, and step S13 is executed again; in the fourth stage, the third convolution unit layer of the self-attention module takes as input the output feature map of the third stage, and step S13 is executed again;
Step S15: sequentially inputting the encoder output feature map to the decoder and executing the three stages of the decoder; in the first stage, the upsampled output feature map of the fourth stage of the encoder is merged with the feature map of the third stage of the encoder through the concat layer module, which connects input tensors and output tensors in the deep neural network to obtain merged feature maps of different resolutions; the number of input channels of the concat layer module is half the number of output channels; the merged input features are then processed by the second convolution unit layer module, whose number of input channels is twice the number of output channels; finally, the result passes through the self-attention module, whose numbers of input and output channels are the same;
Step S16: executing the second stage and the third stage of the decoder, namely, in the second stage, the input of the second convolution unit layer module is the output feature map of the first stage, and step S15 is executed again; in the third stage, the input of the second convolution unit layer module is the output feature map of the second stage, and step S15 is executed again;
Step S17: training the deep learning neural network model by a gradient descent method according to the intermediate image features and the corresponding vertebral body masks, thereby obtaining the final deep learning neural network model;
The vertebral body segmentation step S2 includes:
Step S21: when surgery needs to be performed on a patient, preprocessing the patient's original 3D-CT image in the same manner as in step S12 to obtain a preprocessed 3D-CT image;
Step S22: acquiring the trained deep learning neural network model, and segmenting the preprocessed 3D-CT image as input to obtain the vertebral body masks formed by the M segmented vertebral body edge contours.
2. The segmented vertebra CT image segmentation method according to claim 1, wherein the preprocessing in step S12 and step S21 consists of unified resolution, resampling, and standardization of the original 3D-CT image, so as to obtain a preprocessed 3D-CT image.
3. The Transformer-based segmented vertebra CT image segmentation method according to claim 1, wherein in step S13, the self-attention module further comprises a bottleneck unit having residual characteristics, the bottleneck module comprising one input feature map, two convolution layers, one nonlinear activation function, one residual connection, and two batch normalization layers.
4. The Transformer-based segmented vertebra CT image segmentation method according to claim 1 or 2, wherein in the encoder, the first convolution unit and the second convolution unit each comprise a 3 × 3 convolution layer; the first convolution unit has 32 input channels and 64 output channels; the second convolution unit has 64 input channels and 64 output channels; and the input size of the 3D-CT image is 1 × 128 × 128.
5. The method according to claim 1 or 2, wherein in the second stage of the encoder, the numbers of input and output channels of the third convolution unit layer are half of those in the third stage and the fourth stage of the encoder, respectively, and the numbers of input and output channels of the lightweight attention unit layer are half of those in the third stage and the fourth stage of the encoder.
6. The method according to claim 5, wherein in the second stage of the encoder, the third convolution unit layer has 64 input channels and 128 output channels, and the lightweight attention unit layer has 128 input channels and 128 output channels; in the third stage, the third convolution unit layer has 128 input channels and 256 output channels; in the fourth stage, the third convolution unit layer has 256 input channels and 256 output channels; and in both the third and fourth stages, the lightweight attention unit layer has 256 input channels and 256 output channels.
7. The Transformer-based segmented vertebra CT image segmentation method according to claim 1 or 2, wherein, relative to the first stage of the decoder, the numbers of input and output channels of the concat layer module are successively halved in the second stage and the third stage of the decoder, as are the numbers of input and output channels of the self-attention unit layer.
8. The Transformer-based segmented vertebra CT image segmentation method according to claim 7, wherein in the first stage of the decoder, the concat layer module has 256 input channels and 512 output channels, the second convolution unit layer module has 512 input channels and 256 output channels, and the self-attention module has 256 input channels and 256 output channels.
9. The Transformer-based segmented vertebra CT image segmentation method according to claim 1 or 2, wherein, in the first stage of the decoder, the third convolution unit layer module consists of a 3 × 3 convolution layer, a BN layer, and a ReLU layer; the third convolution unit layer module has 512 input channels and 256 output channels.
10. The Transformer-based segmented vertebra CT image segmentation method according to claim 1 or 2, wherein the standard-unified 3D-CT image is an image normalized through unified resolution, resampling, and dimension-stabilizing processing of the scanned image data.
CN202310850038.7A 2023-07-12 2023-07-12 Segmented vertebra CT image segmentation method and system based on Transformer Pending CN117011246A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310850038.7A CN117011246A (en) 2023-07-12 2023-07-12 Segmented vertebra CT image segmentation method and system based on Transformer

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310850038.7A CN117011246A (en) 2023-07-12 2023-07-12 Segmented vertebra CT image segmentation method and system based on Transformer

Publications (1)

Publication Number Publication Date
CN117011246A true CN117011246A (en) 2023-11-07

Family

ID=88573696

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310850038.7A Pending CN117011246A (en) 2023-07-12 2023-07-12 Segmented vertebra CT image segmentation method and system based on transducer

Country Status (1)

Country Link
CN (1) CN117011246A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117838176A (en) * 2024-01-10 2024-04-09 北京长木谷医疗科技股份有限公司 Bone density measurement method and device based on deep learning


Similar Documents

Publication Publication Date Title
Vania et al. Automatic spine segmentation from CT images using convolutional neural network via redundant generation of class labels
CN113506334B (en) Multi-mode medical image fusion method and system based on deep learning
Hill et al. Model-based interpretation of 3d medical images.
CN109859233A (en) The training method and system of image procossing, image processing model
CN113506308B (en) Deep learning-based vertebra positioning and spine segmentation method in medical image
CN111292324B (en) Multi-target identification method and system for brachial plexus ultrasonic image
CN111260705A (en) Prostate MR image multi-task registration method based on deep convolutional neural network
Bano et al. AutoFB: automating fetal biometry estimation from standard ultrasound planes
Gamage et al. Instance-based segmentation for boundary detection of neuropathic ulcers through Mask-RCNN
CN117011246A (en) Segmented vertebra CT image segmentation method and system based on transducer
CN117274599A (en) Brain magnetic resonance segmentation method and system based on combined double-task self-encoder
CN113706514B (en) Focus positioning method, device, equipment and storage medium based on template image
CN111383222A (en) Intervertebral disc MRI image intelligent diagnosis system based on deep learning
Bukas et al. Patient-specific virtual spine straightening and vertebra inpainting: an automatic framework for osteoplasty planning
Wang et al. Vertebra segmentation for clinical CT images using mask R-CNN
CN111640127A (en) Accurate clinical diagnosis navigation method for orthopedics department
Chen et al. The research and practice of medical image enhancement and 3D reconstruction system
CN116485853A (en) Medical image registration method and device based on deep learning neural network
Sun et al. Gaze estimation with semi-supervised eye landmark detection as an auxiliary task
CN116128942A (en) Registration method and system of three-dimensional multi-module medical image based on deep learning
CN115049709A (en) Deep learning point cloud lumbar registration method for spinal minimally invasive surgery navigation
CN115252233A (en) Deep learning-based automatic planning method for acetabular cup in total hip replacement
CN111640126B (en) Artificial intelligent diagnosis auxiliary method based on medical image
CN113796850A (en) Parathyroid MIBI image analysis system, computer device, and storage medium
Amara et al. Augmented reality for medical practice: a comparative study of deep learning models for ct-scan segmentation

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination