CN115239740A - GT-UNet-based whole-heart segmentation algorithm

Info

Publication number: CN115239740A
Application number: CN202210645929.4A
Authority: CN (China)
Legal status: Pending
Prior art keywords: convolution, dimensional, encoder, segmentation, multiplied
Other languages: Chinese (zh)
Inventors: 田沄, 刘彬, 李岩松, 赵世凤
Current and original assignee: Beijing Normal University
Application filed by: Beijing Normal University
Filing date: 2022-06-08
Publication date: 2022-10-25


Classifications

    • G06T 7/11 Region-based segmentation (image analysis; segmentation; edge detection)
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06T 2207/10004 Still image; photographic image
    • G06T 2207/10012 Stereo images
    • G06T 2207/10081 Computed x-ray tomography [CT]
    • G06T 2207/10088 Magnetic resonance imaging [MRI]
    • G06T 2207/20081 Training; learning
    • G06T 2207/20084 Artificial neural networks [ANN]
    • G06T 2207/20104 Interactive definition of region of interest [ROI]
    • G06T 2207/20132 Image cropping
    • G06T 2207/30048 Heart; cardiac (biomedical image processing)


Abstract

The invention provides a GT-UNet-based whole-heart segmentation algorithm comprising the following steps: preprocessing the input three-dimensional multi-modality cardiac images (including CT and MRI); converting the preprocessed data into a set of mutually independent slices, feeding the slices to a two-dimensional segmentation network for training, and outputting a class probability map a; cropping the preprocessed data into a set of independent data volumes, feeding them to a three-dimensional segmentation network for training, and outputting a class probability map b; and sending the class probability maps a and b to a fusion module, comparing them pixel by pixel, and performing whole-heart segmentation using the class with the maximum probability. The whole-heart segmentation algorithm adaptively adjusts the size of its receptive field according to the input and effectively exploits global information for long-range modeling, which effectively improves the segmentation accuracy of the algorithm.

Description

Whole-heart segmentation algorithm based on GT-UNet
Technical Field
The invention relates to the technical field of medical image processing, and in particular to a GT-UNet-based whole-heart segmentation algorithm.
Background
Automatic whole-heart segmentation is an important step in the quantitative assessment of cardiac structure and the quantitative diagnosis of heart disease: accurately extracting the complete region and boundary of the heart and then building a three-dimensional cardiac model to assist physicians in subsequent clinical diagnosis and treatment has important application value and clinical significance for cardiac surgical navigation, guidance of interventional therapy, computer-aided diagnosis and so on.
Computed Tomography (CT) and Magnetic Resonance Imaging (MRI) are common imaging modalities for diagnosing heart disease. Although physicians can obtain anatomical information about the internal structure of the heart from a patient's imaging examination, which helps subsequent non-invasive quantitative assessment of cardiac function, this also greatly increases their workload: the traditional segmentation workflow is manual reading, after which a radiologist manually delineates the boundaries with professional software. To reduce the heavy workload of imaging physicians and to improve the segmentation accuracy of cardiac structures, research on computer-aided automatic image segmentation and diagnosis is urgently needed.
To address the technical problems of existing whole-heart segmentation techniques, the invention provides a whole-heart segmentation algorithm based on GT-UNet.
Disclosure of Invention
The invention provides a GT-UNet based whole-heart segmentation algorithm.
The invention adopts the following technical scheme:
a Graph-Reasoning and transform-module based (GT-UNet) full-heart segmentation algorithm, comprising:
step 1, preprocessing input three-dimensional multi-modal cardiac images including CT and MRI;
step 2, converting the preprocessed data into a plurality of mutually independent slices, conveying the slices to a two-dimensional segmentation network for training, and outputting a class probability map a;
step 3, cutting the preprocessed data into a plurality of independent data volumes, transmitting the data volumes to a three-dimensional segmentation network for training, and outputting a class probability chart b;
and 4, sending the class probability map a and the class probability map b into a fusion module, comparing the class probability map a and the class probability map b pixel by pixel, performing full-center segmentation by using the maximum probability class, and outputting a segmentation result.
Further, in step 1, the preprocessing comprises:
step 1.1, cropping a region of interest (ROI);
step 1.2, resampling and normalizing the ROI.
Further, converting the preprocessed data into a plurality of mutually independent slices and feeding the slices to a two-dimensional segmentation network for training comprises the following steps:
step 2.1, obtaining a mapping function f(·) representing a linear combination of features, mapping the input feature map X ∈ R^{L×C} of the original coordinate space Ω into an interaction space H through the mapping function f(·), and obtaining the new features V = f(X) ∈ R^{N×C}, where N is the number of nodes in the space H and C is the desired feature dimension; the features are computed as
v_i = b_i X = Σ_j b_ij x_j ……(1),
where b_i ∈ R^{1×L} is a learnable mapping weight, x_j ∈ R^{1×C}, v_i ∈ R^{1×C}, and b_ij is a binary combination coefficient generated by a convolution operation, taking the value 0 or 1;
step 2.2, performing reasoning with a graph convolution network (GCN) to obtain a fully connected graph storing the new features, reasoning by learning the interaction edge weights corresponding to each node; a single-layer graph convolution network is defined as
Z = GVW_g = ((I − A_g)V)W_g ∈ R^{N×C} ……(2),
where A_g denotes the N×N node adjacency matrix (so that G = I − A_g), A_g is randomly initialized and learned during training, I is the identity matrix, Z ∈ R^{N×C} are the nodes after global reasoning, and W_g is the state-update function that updates the state of each node;
step 2.3, projecting the nodes Z of the interaction space H back to the original coordinate space Ω; the reverse mapping Y = g(Z) ∈ R^{L×C} is obtained from equation (3):
y_i = d_i Z = Σ_j d_ij z_j ……(3),
where y_i ∈ R^{1×C}, d_ij is a learnable reverse-mapping weighting scalar, and z_j denotes the j-th reasoning node.
Further, in step 2.1, the input dimension is reduced by instantiating the mapping function as f(X) = θ(X; W_θ)·φ(X; W_φ), i.e. V = B·φ(X; W_φ) with B = θ(X; W_θ), where B = [b_1, …, b_N] ∈ R^{N×L} are the mapping weights, φ(X; W_φ) and θ(X; W_θ) are two convolution layers, and W_φ and W_θ are the learnable convolution kernels of the respective layers.
Further, in step 3, the three-dimensional segmentation network includes: a CNN encoder F_CNN(·) that extracts multi-scale feature maps from the input image, a DeTrans encoder that processes, in an end-to-end manner, the multi-scale feature maps embedded with position encodings using attention, and a CNN decoder that produces the segmentation from the features generated by the DeTrans encoder.
Further, the CNN encoder F_CNN(·) contains a Conv-In-Relu module and three Resnet stages. The Conv-In-Relu module first performs a convolution with a 7 × 7 × 7 kernel, 64 channels and stride (1, 2, 2), followed by instance normalization and ReLU, and the result is sent to the first-stage Resnet module. This stage contains three residual units: a residual operation with stride (2, 2, 2) and a 3 × 3 × 3 kernel with 192 channels, followed by two residual operations with stride (1, 1, 1) and a 3 × 3 × 3 kernel with 192 channels, yielding 192 feature maps of size 48 × 40 × 40 that are sent to the second-stage Resnet module. Except that the number of convolution kernels is updated from 192 to 384, the second stage uses the same parameters as the first and finally outputs 384 feature maps of size 24 × 20 × 20, which are sent to the third-stage Resnet module. The third stage has two residual units: a residual operation with stride (2, 2, 2) and a 3 × 3 × 3 kernel with 384 channels, followed by a residual operation with stride (1, 1, 1) and a 3 × 3 × 3 kernel with 384 channels. The feature maps generated by F_CNN(·) are defined as
{f_l}_{l=1}^{L} = F_CNN(x; Θ),
where L denotes the number of feature levels, l is a specific level, x ∈ R^{C×D×H×W} is the input feature map, Θ denotes the parameters required by the encoder, C denotes the number of channels, H the height of the input image, W its width, and D the depth of the input data, i.e. the number of slices.
Further, the DeTrans encoder comprises a sequence layer that converts the input and a plurality of stacked deformable DeTrans layers. The DeTrans encoder flattens the feature maps f_l generated by the CNN encoder into a one-dimensional sequence of image patches, and a three-dimensional fixed position-encoding sequence p_l is embedded into the flattened one-dimensional sequence to capture the relative or absolute positions between the various substructures of the heart.
Furthermore, the CNN decoder includes four upsampling modules. Each of the first three upsampling modules contains a transposed convolution layer with stride 2 × 2 × 2 and a 2 × 2 × 2 kernel, with 384, 192 and 64 kernels respectively, followed by a three-dimensional residual block that refines the feature map; the feature map output by the encoder and the feature map obtained after the transposed convolution are then summed pixel by pixel through a skip connection so as to retain more low-level information. The final upsampling module consists of one upsampling layer and one 1 × 1 convolution layer, mapping the 64-channel feature maps to the desired number of classes.
Compared with the prior art, the invention has the following advantages:
the GT-UNet based full-center segmentation algorithm can effectively capture the global relationship and is suitable for different heart data sets, wherein the graph reasoning unit captures the global relationship by projecting the characteristics into an interactive space to carry out relationship reasoning, the Transformer module can overcome induction deviation of convolution operation and inherent limitation of local sensitivity, the size is adjusted in a self-adaptive mode according to input, and the global information is effectively utilized for remote modeling, so that the segmentation precision of the algorithm is effectively improved.
Drawings
FIG. 1 is a flowchart of the GT-UNet based whole-heart segmentation algorithm in an embodiment of the present invention.
Detailed Description
In order that the above objects, features and advantages of the present invention may be more clearly understood, the present invention is described in further detail below with reference to the accompanying drawings and specific embodiments. It should be understood that the embodiments of the present application and the features in the embodiments can be combined with each other without conflict.
As shown in FIG. 1, the GT-UNet based whole-heart segmentation algorithm includes:
step 1, preprocessing the input three-dimensional multi-modal cardiac images, including CT and MRI;
step 2, converting the preprocessed data into a plurality of mutually independent slices, feeding the slices to a two-dimensional segmentation network for training, and outputting a class probability map a;
step 3, cropping the preprocessed data into a plurality of independent data volumes, feeding the data volumes to a three-dimensional segmentation network for training, and outputting a class probability map b;
and step 4, sending the class probability map a and the class probability map b to a fusion module, comparing them pixel by pixel, performing whole-heart segmentation using the class with the maximum probability, and outputting the segmentation result.
Specifically, in step 1, a non-zero template (mask) is generated from the input image and the volume is cropped according to the size and position of its bounding box; resampling then uses third-order spline interpolation in the xy plane and nearest-neighbor interpolation along the z axis; finally, normalization is performed with the z-score method.
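A minimal NumPy/SciPy sketch of this preprocessing is given below. The function names (crop_to_nonzero, resample, z_score), the assumed (x, y, z) axis order and the example voxel spacings are illustrative assumptions, not details fixed by the description:

    import numpy as np
    from scipy import ndimage

    def crop_to_nonzero(volume, margin=0):
        """Crop a 3D volume to the bounding box of its non-zero voxels."""
        coords = np.argwhere(volume != 0)
        lo = np.maximum(coords.min(axis=0) - margin, 0)
        hi = np.minimum(coords.max(axis=0) + 1 + margin, volume.shape)
        return volume[lo[0]:hi[0], lo[1]:hi[1], lo[2]:hi[2]]

    def resample(volume, src_spacing, dst_spacing):
        """Resample: third-order spline in-plane (x, y), nearest neighbor along z."""
        fx, fy, fz = np.asarray(src_spacing, float) / np.asarray(dst_spacing, float)
        inplane = ndimage.zoom(volume, (fx, fy, 1.0), order=3)   # xy plane: cubic spline
        return ndimage.zoom(inplane, (1.0, 1.0, fz), order=0)    # z axis: nearest neighbor

    def z_score(volume):
        """z-score normalization of the cropped ROI."""
        return (volume - volume.mean()) / (volume.std() + 1e-8)

    # Hypothetical usage with an example spacing:
    # roi = z_score(resample(crop_to_nonzero(ct_volume), (1.0, 1.0, 2.5), (1.0, 1.0, 1.0)))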
In step 2, convolutional neural networks are good at extracting local relationships but very weak at capturing global ones, and many stacked layers are usually needed to achieve the desired effect, which sharply increases the difficulty and cost of global reasoning with a CNN; since global modeling and reasoning generally benefit a global segmentation task, a global semantic reasoning unit based on graph convolution is added to the two-dimensional segmentation network in this embodiment.
The two-dimensional convolutional segmentation network specifically comprises:
step 2.1, obtaining a mapping function f(·) representing a linear combination of features, mapping the input feature map X ∈ R^{L×C} of the original coordinate space Ω into an interaction space H through the mapping function f(·), and obtaining the new features V = f(X) ∈ R^{N×C}, where N is the number of nodes in the space H and C is the desired feature dimension; the features are computed as
v_i = b_i X = Σ_j b_ij x_j ……(1),
where b_i ∈ R^{1×L} is a learnable mapping weight, x_j ∈ R^{1×C}, v_i ∈ R^{1×C}, and b_ij is a binary combination coefficient generated by a convolution operation, taking the value 0 or 1;
step 2.2, performing reasoning with a graph convolution network (GCN) to obtain a fully connected graph storing the new features, reasoning by learning the interaction edge weights corresponding to each node; a single-layer graph convolution network is defined as
Z = GVW_g = ((I − A_g)V)W_g ∈ R^{N×C} ……(2),
where A_g denotes the N×N node adjacency matrix (so that G = I − A_g), A_g is randomly initialized and learned during training, I is the identity matrix, Z ∈ R^{N×C} are the nodes after global reasoning, and W_g is the state-update function that updates the state of each node;
step 2.3, projecting the nodes Z of the interaction space H back to the original coordinate space Ω; the reverse mapping Y = g(Z) ∈ R^{L×C} is obtained from equation (3):
y_i = d_i Z = Σ_j d_ij z_j ……(3),
where y_i ∈ R^{1×C}, d_ij is a learnable reverse-mapping weighting scalar, and z_j denotes the j-th reasoning node.
As an improvement of this embodiment, in order to further reduce the input dimension of the algorithm, the mapping function is instantiated as f(X) = θ(X; W_θ)·φ(X; W_φ), i.e. V = B·φ(X; W_φ) with B = θ(X; W_θ), where B = [b_1, …, b_N] ∈ R^{N×L} are the mapping weights, φ(X; W_φ) and θ(X; W_θ) are two convolution layers, and W_φ and W_θ are the learnable convolution kernels of the respective layers.
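A minimal PyTorch sketch of such a graph-reasoning unit, following equations (1)-(3), is shown below: θ and φ are 1 × 1 convolutions that produce the mapping weights B and the reduced node features, one graph-convolution layer applies (I − A_g)V followed by W_g, and the result is projected back to the coordinate space. The class name, the residual fusion at the end, and the reuse of the transposed B as the reverse-mapping weights are illustrative assumptions rather than details fixed by the description:

    import torch
    import torch.nn as nn

    class GraphReasoningUnit(nn.Module):
        """Sketch of a graph-reasoning unit: project features into an interaction
        space (Eq. 1), reason with one graph-convolution layer (Eq. 2), and
        project back to the original coordinate space (Eq. 3)."""

        def __init__(self, in_channels, num_nodes=16, node_channels=32):
            super().__init__()
            self.theta = nn.Conv2d(in_channels, num_nodes, kernel_size=1)      # B = theta(X; W_theta)
            self.phi = nn.Conv2d(in_channels, node_channels, kernel_size=1)    # phi(X; W_phi): dimension reduction
            self.adj = nn.Parameter(torch.randn(num_nodes, num_nodes) * 0.01)  # A_g, randomly initialized and learned
            self.w_g = nn.Linear(node_channels, node_channels, bias=False)     # W_g: node state update
            self.back = nn.Conv2d(node_channels, in_channels, kernel_size=1)   # return to the input channel count

        def forward(self, x):
            n, c, h, w = x.shape
            b = self.theta(x).flatten(2)                                  # (n, N, L) with L = H*W
            v = torch.bmm(b, self.phi(x).flatten(2).transpose(1, 2))      # Eq. (1): V = B * phi(X), shape (n, N, C')
            eye = torch.eye(self.adj.size(0), device=x.device)
            z = self.w_g(torch.matmul(eye - self.adj, v))                 # Eq. (2): Z = ((I - A_g) V) W_g
            y = torch.bmm(b.transpose(1, 2), z)                           # Eq. (3): reverse projection (D = B^T assumed)
            y = y.transpose(1, 2).reshape(n, -1, h, w)
            return x + self.back(y)                                       # residual fusion into the input feature map

    # Example: unit = GraphReasoningUnit(64); out = unit(torch.randn(2, 64, 96, 96))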
In step 3, the network takes a CNN encoder-decoder as its basic framework, and a Transformer-based deformable encoder (DeTrans encoder) is inserted to model and analyze long-range dependencies. The network mainly comprises a CNN encoder, the DeTrans encoder and a CNN decoder: the CNN encoder extracts multi-scale feature maps from the input image, the DeTrans encoder processes, in an end-to-end manner, the multi-scale feature maps embedded with position encodings using attention, and the CNN decoder reconstructs the feature maps.
The CNN encoder F_CNN(·) contains a Conv-In-Relu module and three Resnet stages. The Conv-In-Relu module first performs a convolution with a 7 × 7 × 7 kernel, 64 channels and stride (1, 2, 2), followed by instance normalization and ReLU, and the result is sent to the first-stage Resnet module. This stage contains three residual units: a residual operation with stride (2, 2, 2) and a 3 × 3 × 3 kernel with 192 channels, followed by two residual operations with stride (1, 1, 1) and a 3 × 3 × 3 kernel with 192 channels, yielding 192 feature maps of size 48 × 40 × 40 that are sent to the second-stage Resnet module. Except that the number of convolution kernels is updated from 192 to 384, the second stage uses the same parameters as the first and finally outputs 384 feature maps of size 24 × 20 × 20, which are sent to the third-stage Resnet module. The third stage has two residual units: a residual operation with stride (2, 2, 2) and a 3 × 3 × 3 kernel with 384 channels, followed by a residual operation with stride (1, 1, 1) and a 3 × 3 × 3 kernel with 384 channels. The feature maps generated by F_CNN(·) are defined as
{f_l}_{l=1}^{L} = F_CNN(x; Θ),
where L denotes the number of feature levels, l is a specific level, x ∈ R^{C×D×H×W} is the input feature map, Θ denotes the parameters required by the encoder, C denotes the number of channels, H the height of the input image, W its width, and D the depth of the input data, i.e. the number of slices.
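The following PyTorch sketch illustrates one possible reading of this encoder: a Conv-In-ReLU stem (7 × 7 × 7 kernel, 64 channels, stride (1, 2, 2)) followed by three residual stages with 192, 384 and 384 channels. Only the kernel sizes, strides and channel counts come from the description; the exact residual-unit layout (normalization placement, projection shortcuts) is an assumption:

    import torch
    import torch.nn as nn

    class Residual3D(nn.Module):
        """3D residual unit: two 3x3x3 convolutions with instance norm; the first
        convolution carries the stride, the shortcut is projected when needed."""
        def __init__(self, in_ch, out_ch, stride):
            super().__init__()
            self.body = nn.Sequential(
                nn.Conv3d(in_ch, out_ch, 3, stride=stride, padding=1),
                nn.InstanceNorm3d(out_ch), nn.ReLU(inplace=True),
                nn.Conv3d(out_ch, out_ch, 3, padding=1),
                nn.InstanceNorm3d(out_ch))
            self.skip = nn.Conv3d(in_ch, out_ch, 1, stride=stride) \
                if (in_ch != out_ch or stride != 1) else nn.Identity()
            self.act = nn.ReLU(inplace=True)

        def forward(self, x):
            return self.act(self.body(x) + self.skip(x))

    class CNNEncoder(nn.Module):
        """Sketch of F_CNN: Conv-In-ReLU stem plus three ResNet stages, returning
        the multi-scale feature maps that feed the DeTrans encoder."""
        def __init__(self, in_ch=1):
            super().__init__()
            self.stem = nn.Sequential(
                nn.Conv3d(in_ch, 64, kernel_size=7, stride=(1, 2, 2), padding=3),
                nn.InstanceNorm3d(64), nn.ReLU(inplace=True))
            self.stage1 = nn.Sequential(Residual3D(64, 192, 2),
                                        Residual3D(192, 192, 1), Residual3D(192, 192, 1))
            self.stage2 = nn.Sequential(Residual3D(192, 384, 2),
                                        Residual3D(384, 384, 1), Residual3D(384, 384, 1))
            self.stage3 = nn.Sequential(Residual3D(384, 384, 2), Residual3D(384, 384, 1))

        def forward(self, x):
            s = self.stem(x)
            f1 = self.stage1(s)
            f2 = self.stage2(f1)
            f3 = self.stage3(f2)
            return s, f1, f2, f3        # stem output kept as the shallowest skip feature

    # Example: for a 96 x 160 x 160 patch, f1 has size 48 x 40 x 40 and f2 has size 24 x 20 x 20,
    # matching the sizes stated above:
    # s, f1, f2, f3 = CNNEncoder()(torch.randn(1, 1, 96, 160, 160))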
To overcome the inductive bias of convolution operations and the inherent limitation of local sensitivity, this embodiment introduces a DeTrans encoder whose core is a multi-scale deformable self-attention (MS-DMSA) mechanism for capturing long-range pixel dependencies. The DeTrans encoder consists of a sequence layer that converts the input and a plurality of stacked deformable DeTrans layers. Since a Transformer can only process data in a sequence-to-sequence manner and contains no recurrence or convolution operations, the feature maps f_l generated by the CNN encoder must be flattened into a one-dimensional sequence of image patches; however, doing so directly inevitably loses some key spatial position relationships, so a three-dimensional fixed position-encoding sequence p_l is embedded into the flattened one-dimensional sequence to capture the relative or absolute positions between the various substructures of the heart.
In this embodiment, trigonometric functions whose wavelengths form a geometric progression from 2π to 10000·2π are used to encode each position pos along each dimension, in the standard sinusoidal form
PE_ψ(pos, 2i) = sin(pos / 10000^{2i/C_ψ}), PE_ψ(pos, 2i+1) = cos(pos / 10000^{2i/C_ψ}), ψ ∈ {D, H, W},
where pos denotes the position, i the dimension index, C_ψ the number of encoding channels of axis ψ, and ψ ∈ {D, H, W} indexes the depth, height and width of the input image, respectively. For each feature level l, PE_D, PE_H and PE_W are concatenated into a three-dimensional position code p_l, which is then added element by element to the flattened f_l to obtain the input sequence of the DeTrans encoder.
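A small sketch of such a three-dimensional fixed position encoding, assuming the standard sinusoidal form with an even number of channels per axis and simple per-axis concatenation (function names are illustrative), might look as follows:

    import torch

    def sinusoidal_1d(length, channels):
        """Sinusoidal encoding for one axis: wavelengths form a geometric
        progression from 2*pi to 10000*2*pi; channels is assumed even."""
        pos = torch.arange(length, dtype=torch.float32).unsqueeze(1)   # (length, 1)
        i = torch.arange(0, channels, 2, dtype=torch.float32)          # even dimension indices
        div = torch.pow(10000.0, i / channels)
        pe = torch.zeros(length, channels)
        pe[:, 0::2] = torch.sin(pos / div)
        pe[:, 1::2] = torch.cos(pos / div)
        return pe

    def position_encoding_3d(d, h, w, channels):
        """Concatenate PE_D, PE_H and PE_W into one code of shape (3*channels, d, h, w),
        to be added element-wise to the flattened feature sequence."""
        pe_d = sinusoidal_1d(d, channels)[:, None, None, :].expand(d, h, w, channels)
        pe_h = sinusoidal_1d(h, channels)[None, :, None, :].expand(d, h, w, channels)
        pe_w = sinusoidal_1d(w, channels)[None, None, :, :].expand(d, h, w, channels)
        return torch.cat([pe_d, pe_h, pe_w], dim=-1).permute(3, 0, 1, 2)

    # Example: p = position_encoding_3d(12, 10, 10, 128)   # (384, 12, 10, 10)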
The self-attention layer in the original Transformer attends to all possible spatial positions according to the size of the feature map. The invention instead introduces an attention module that focuses only on key sampling points around a reference point, named MS-DMSA, which reduces the number of parameters and the computational cost. Let z_q ∈ R^C be the feature representation of query q and p̂_q ∈ [0, 1]^3 the normalized three-dimensional coordinates of its reference point. Given the multi-scale feature maps {f_l}_{l=1}^{L} extracted from the last L stages of the CNN encoder, the i-th attention head is computed as
head_i = Σ_{l=1}^{L} Σ_{k=1}^{K} A(z_q)_ilqk · f_l(σ_l(p̂_q) + Δp_ilqk),
where K is the number of sampled key points, A(z_q)_ilqk ∈ [0, 1] is the attention weight, Δp_ilqk ∈ R^3 is the sampling offset of the k-th sampling point in the l-th feature level, and σ_l(·) rescales p̂_q to the l-th feature level; both A(z_q)_ilqk and Δp_ilqk are obtained by linear projection of the query feature z_q. The MS-DMSA layer is then defined as
MS-DMSA(z_q, p̂_q, {f_l}_{l=1}^{L}) = Φ(head_1, …, head_H),
where H is the number of attention heads and Φ(·) denotes a linear projection layer that weights and aggregates the features of all attention heads. A DeTrans layer consists of an MS-DMSA layer and a feed-forward network, each with a skip connection and layer normalization; the DeTrans encoder is created by repeatedly stacking DeTrans layers, and the output sequence is then reshaped back into feature maps according to the three-dimensional scale sizes.
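The sketch below illustrates the MS-DMSA idea for a single feature level in PyTorch: sampling offsets and attention weights are linear projections of the query feature, and only K points around each normalized reference point are sampled with trilinear interpolation. Restricting it to one level and giving every head the full channel width are simplifications made for readability, not properties of the multi-scale module described above:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class DeformableSelfAttention3D(nn.Module):
        """Single-level sketch of deformable self-attention: each query attends
        only to K sampled key points around its normalized reference point."""

        def __init__(self, channels, heads=4, points=4):
            super().__init__()
            self.h, self.k, self.c = heads, points, channels
            self.offsets = nn.Linear(channels, heads * points * 3)   # sampling offsets
            self.weights = nn.Linear(channels, heads * points)       # attention weights A(z_q)
            self.value = nn.Conv3d(channels, channels, kernel_size=1)
            self.out = nn.Linear(channels, channels)                 # Phi: aggregate the heads

        def forward(self, queries, ref_points, feat):
            # queries: (B, Lq, C); ref_points: (B, Lq, 3) in [0, 1], (d, h, w) order
            # feat:    (B, C, D, H, W), one feature level from the CNN encoder
            b, lq, _ = queries.shape
            val = self.value(feat)
            offs = self.offsets(queries).view(b, lq, self.h, self.k, 3)
            attn = self.weights(queries).view(b, lq, self.h, self.k).softmax(dim=-1)
            loc = (ref_points[:, :, None, None, :] + offs).clamp(0, 1) * 2 - 1
            grid = loc.flip(-1).reshape(b, lq, self.h * self.k, 1, 3)   # grid_sample wants (x, y, z)
            samp = F.grid_sample(val, grid, align_corners=True)         # (B, C, Lq, heads*K, 1)
            samp = samp.view(b, self.c, lq, self.h, self.k)
            head = (samp * attn.unsqueeze(1)).sum(dim=-1)               # weight the K sampled keys
            return self.out(head.mean(dim=-1).transpose(1, 2))          # (B, Lq, C)

    # Example: attn = DeformableSelfAttention3D(384)
    # y = attn(torch.randn(1, 100, 384), torch.rand(1, 100, 3), torch.randn(1, 384, 12, 10, 10))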
In this embodiment, the CNN decoder includes four upsampling modules. Each of the first three modules contains a transposed convolution layer with stride 2 × 2 × 2 and a 2 × 2 × 2 kernel, with 384, 192 and 64 kernels respectively, followed by a three-dimensional residual block that refines the feature map; the feature map output by the encoder and the feature map obtained after the transposed convolution are summed pixel by pixel through a skip connection, retaining more low-level information. The last module consists of one upsampling layer and one 1 × 1 convolution layer that maps the 64-channel feature maps to the desired number of classes.
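A possible PyTorch reading of this decoder is sketched below: three upsampling blocks (transposed 2 × 2 × 2 convolutions with 384, 192 and 64 kernels, each followed by a residual-style refinement and a pixel-wise skip summation) and a final upsampling plus 1 × 1 convolution head. The skip channel counts and the trilinear final upsampling are assumptions chosen to match the encoder sketch above:

    import torch
    import torch.nn as nn

    class UpBlock(nn.Module):
        """Decoder stage: 2x2x2 transposed convolution (stride 2), a 3D refinement
        block, then pixel-wise summation with the encoder skip feature."""
        def __init__(self, in_ch, out_ch):
            super().__init__()
            self.up = nn.ConvTranspose3d(in_ch, out_ch, kernel_size=2, stride=2)
            self.refine = nn.Sequential(
                nn.Conv3d(out_ch, out_ch, 3, padding=1), nn.InstanceNorm3d(out_ch), nn.ReLU(inplace=True),
                nn.Conv3d(out_ch, out_ch, 3, padding=1), nn.InstanceNorm3d(out_ch), nn.ReLU(inplace=True))

        def forward(self, x, skip):
            return self.refine(self.up(x)) + skip      # keep more low-level information

    class CNNDecoder(nn.Module):
        """Three UpBlocks (384, 192, 64 kernels) plus a final upsampling layer and
        a 1x1 convolution mapping the 64-channel maps to the number of classes."""
        def __init__(self, num_classes):
            super().__init__()
            self.up1, self.up2, self.up3 = UpBlock(384, 384), UpBlock(384, 192), UpBlock(192, 64)
            self.head = nn.Sequential(
                nn.Upsample(scale_factor=(1, 2, 2), mode='trilinear', align_corners=False),
                nn.Conv3d(64, num_classes, kernel_size=1))

        def forward(self, stem, f1, f2, f3):
            # stem/f1/f2/f3: encoder features with 64/192/384/384 channels, shallow to deep
            x = self.up1(f3, f2)
            x = self.up2(x, f1)
            x = self.up3(x, stem)
            return self.head(x)

    # Example (with the hypothetical encoder sketch above):
    # logits = CNNDecoder(num_classes=8)(*CNNEncoder()(torch.randn(1, 1, 96, 160, 160)))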
In step 4, the fusion module fuses the outputs of the two sub-networks. First, the preprocessed data are converted into a plurality of mutually independent slices and fed to the two-dimensional segmentation network for training, which outputs a class probability map a; the preprocessed data are also cropped into a plurality of data volumes and fed to the three-dimensional segmentation network for training, which outputs a label prediction probability map b. The two probability maps are then sent to the fusion module for pixel-by-pixel comparison, and the final whole-heart segmentation is obtained from the class with the maximum probability.
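One straightforward reading of this maximum-probability fusion rule is sketched below; the function name and the assumption that both networks produce softmax probability maps on the same voxel grid are illustrative:

    import torch

    def fuse_probability_maps(prob_2d, prob_3d):
        """Pixel-wise maximum-probability fusion of the 2D and 3D network outputs."""
        # prob_2d, prob_3d: (num_classes, D, H, W) softmax maps on the same voxel grid
        per_class_max = torch.maximum(prob_2d, prob_3d)   # element-wise max of the two maps
        return per_class_max.argmax(dim=0)                # (D, H, W) label map: class with largest probability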
The present invention is not limited to the above-described embodiments, which are described in the specification and illustrated only to explain the principle of the invention; various changes and modifications may be made without departing from the spirit and scope of the invention as claimed. The scope of the invention is defined by the appended claims.

Claims (8)

1. A GT-UNet based whole-heart segmentation algorithm, comprising:
step 1, preprocessing an input three-dimensional multi-modal cardiac image;
step 2, converting the preprocessed data into a plurality of mutually independent slices, feeding the slices to a two-dimensional segmentation network for training, and outputting a class probability map a;
step 3, cropping the preprocessed data into a plurality of independent data volumes, feeding the data volumes to a three-dimensional segmentation network for training, and outputting a class probability map b;
and step 4, sending the class probability map a and the class probability map b to a fusion module, comparing them pixel by pixel, and performing whole-heart segmentation using the class with the maximum probability.
2. The GT-UNet based whole-heart segmentation algorithm according to claim 1, wherein, in step 1, the preprocessing comprises:
step 1.1, cropping an ROI;
step 1.2, resampling and normalizing the ROI.
3. The GT-UNet based whole-heart segmentation algorithm according to claim 1, wherein converting the preprocessed data into a plurality of mutually independent slices and feeding the slices to the two-dimensional segmentation network for training in step 2 comprises:
step 2.1, obtaining a mapping function f(·) representing a linear combination of features, mapping the input feature map X ∈ R^{L×C} of the original coordinate space Ω into an interaction space H through the mapping function f(·), and obtaining the new features V = f(X) ∈ R^{N×C}, where N is the number of nodes in the space H and C is the desired feature dimension; the features are computed as
v_i = b_i X = Σ_j b_ij x_j ……(1),
where b_i ∈ R^{1×L} is a learnable mapping weight, x_j ∈ R^{1×C}, v_i ∈ R^{1×C}, and b_ij is a binary combination coefficient generated by a convolution operation, taking the value 0 or 1;
step 2.2, performing reasoning with a graph convolution network GCN to obtain a fully connected graph storing the new features, reasoning by learning the interaction edge weights corresponding to each node, a single-layer graph convolution network being defined as
Z = GVW_g = ((I − A_g)V)W_g ∈ R^{N×C} ……(2),
where A_g denotes the N×N node adjacency matrix (so that G = I − A_g), A_g is randomly initialized and learned during training, I is the identity matrix, Z ∈ R^{N×C} are the nodes after global reasoning, and W_g is the state-update function that updates the state of each node;
step 2.3, projecting the nodes Z of the interaction space H back to the original coordinate space Ω, the reverse mapping Y = g(Z) ∈ R^{L×C} being obtained from equation (3):
y_i = d_i Z = Σ_j d_ij z_j ……(3),
where y_i ∈ R^{1×C}, d_ij is a learnable reverse-mapping weighting scalar, and z_j denotes the j-th reasoning node.
4. The GT-UNet based whole-heart segmentation algorithm according to claim 3, wherein, in step 2.1, the input dimension is reduced by instantiating the mapping function as f(X) = θ(X; W_θ)·φ(X; W_φ), i.e. V = B·φ(X; W_φ) with B = θ(X; W_θ), where B = [b_1, …, b_N] ∈ R^{N×L} are the mapping weights, φ(X; W_φ) and θ(X; W_θ) are two convolution layers, and W_φ and W_θ are the learnable convolution kernels of the respective layers.
5. The GT-UNet based whole-heart segmentation algorithm according to claim 1, wherein, in step 3, the three-dimensional segmentation network comprises: a CNN encoder that extracts multi-scale feature maps from the input image, a DeTrans encoder that processes, in an end-to-end manner, the multi-scale feature maps embedded with position encodings using attention, and a CNN decoder that produces the segmentation from the features generated by the DeTrans encoder.
6. The GT-UNet based whole-heart segmentation algorithm according to claim 5, wherein the CNN encoder F_CNN(·) contains a Conv-In-Relu module and three Resnet stages, wherein the Conv-In-Relu module first performs a convolution with a 7 × 7 × 7 kernel, 64 channels and stride (1, 2, 2), followed by instance normalization and ReLU; the result is sent to the first-stage Resnet module, which contains three residual units: a residual operation with stride (2, 2, 2) and a 3 × 3 × 3 kernel with 192 channels, followed by two residual operations with stride (1, 1, 1) and a 3 × 3 × 3 kernel with 192 channels, yielding 192 feature maps of size 48 × 40 × 40 that are sent to the second-stage Resnet module; except that the number of convolution kernels is updated from 192 to 384, the second stage uses the same parameters as the first and finally outputs 384 feature maps of size 24 × 20 × 20, which are sent to the third-stage Resnet module, which has two residual units: a residual operation with stride (2, 2, 2) and a 3 × 3 × 3 kernel with 384 channels, followed by a residual operation with stride (1, 1, 1) and a 3 × 3 × 3 kernel with 384 channels; the feature maps generated by F_CNN(·) are defined as
{f_l}_{l=1}^{L} = F_CNN(x; Θ),
where L denotes the number of feature levels, l is a specific level, x ∈ R^{C×D×H×W} is the input feature map, Θ denotes the parameters required by the encoder, C denotes the number of channels, H the height of the input image, W its width, and D the depth of the input data, i.e. the number of slices.
7. The GT-UNet based whole-heart segmentation algorithm according to claim 5, wherein the DeTrans encoder comprises a sequence layer that converts the input and a plurality of stacked deformable DeTrans layers; the DeTrans encoder flattens the feature maps f_l generated by the CNN encoder into a one-dimensional sequence of image patches, and a three-dimensional fixed position-encoding sequence p_l is embedded into the flattened one-dimensional sequence to capture the relative or absolute positions between the various substructures of the heart.
8. The GT-UNet based whole-heart segmentation algorithm according to claim 5, wherein the CNN decoder comprises four upsampling modules; each of the first three upsampling modules comprises a transposed convolution layer with stride 2 × 2 × 2 and a 2 × 2 × 2 kernel, the numbers of the corresponding convolution kernels being 384, 192 and 64, respectively, and the feature map is refined by a three-dimensional residual block, after which the feature map output by the encoder and the feature map obtained after the transposed convolution are summed pixel by pixel through a skip connection so as to retain more low-level information; the final upsampling module consists of one upsampling layer and one 1 × 1 convolution layer, mapping the 64-channel feature maps to the desired number of classes.

Priority Applications (1)

Application Number: CN202210645929.4A
Priority Date: 2022-06-08
Filing Date: 2022-06-08
Title: GT-UNet-based whole-heart segmentation algorithm

Publications (1)

Publication Number: CN115239740A
Publication Date: 2022-10-25

Family

ID: 83670140

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN117333777A * | 2023-12-01 | 2024-01-02 | 山东元明晴技术有限公司 | Dam anomaly identification method, device and storage medium
CN117333777B * | 2023-12-01 | 2024-02-13 | 山东元明晴技术有限公司 | Dam anomaly identification method, device and storage medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination