CN116681894A - Adjacent layer feature fusion Unet multi-organ segmentation method, system, equipment and medium combining large-kernel convolution - Google Patents

Adjacent layer feature fusion Unet multi-organ segmentation method, system, equipment and medium combining large-kernel convolution

Info

Publication number
CN116681894A
Authority
CN
China
Prior art keywords
convolution
layer
features
feature
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310707529.6A
Other languages
Chinese (zh)
Inventor
王蓉芳
牟钊汕
郝红侠
缑水平
焦昶哲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xidian University
Original Assignee
Xidian University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xidian University filed Critical Xidian University
Priority to CN202310707529.6A priority Critical patent/CN116681894A/en
Publication of CN116681894A publication Critical patent/CN116681894A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/26Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • G06N3/0455Auto-encoder networks; Encoder-decoder networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/03Recognition of patterns in medical or anatomical images
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

A method, a system, equipment and a medium for adjacent layer feature fusion UNet multi-organ segmentation combining large-kernel convolution are provided. The method constructs an adjacent layer feature fusion UNet multi-organ segmentation model combining large-kernel depth convolution: large-kernel depth convolution is added to the encoder network to combine local and global information; an adjacent layer feature fusion module is constructed so that the model fully utilizes the information among features of different layers; and a large-kernel GRN channel response module is constructed to model long-range dependencies on the features fused from adjacent layers. As the number of feature channels increases, the large-kernel GRN channel response module applies global response normalization to the fused feature channels, so that the channels can be contrasted and selected and the fused features improve the segmentation performance of the whole model. The system, the equipment and the medium can segment organ images based on the segmentation method. The method uses few parameters, reduces the complexity of organ segmentation, and offers good practicality and high efficiency.

Description

Adjacent layer feature fusion Unet multi-organ segmentation method, system, equipment and medium combining large-kernel convolution
Technical Field
The invention relates to the technical field of multi-organ segmentation of medical images, in particular to a method, a system, equipment and a medium for multi-organ segmentation of adjacent layer feature fusion Unet combined with large-kernel convolution.
Background
Organ segmentation in medical images is a fundamental prerequisite for many clinical applications, such as computer-aided diagnosis (CAD) and computer-aided surgery (CAS). Automated and accurate segmentation of multiple organs is an important but challenging task for computer-aided diagnosis and image-guided surgery systems. Accurate segmentation is an important component of many clinical applications in which the contours of regions of interest are manually delineated by physicians; manual delineation is cumbersome, time-consuming and laborious, and because organ structures and backgrounds are complex and organ boundaries are blurred, manually delineating organ contours on medical images is challenging. In recent years, deep learning has developed rapidly in the field of medical imaging. Performing multi-organ medical image segmentation with deep learning can accurately delineate organ contours and locate lesion areas, quickly assist physicians in locating targets, reduce their workload and improve diagnostic efficiency, which is of positive and important value for modern clinical applications.
The related art schemes include a conventional multi-organ segmentation method, a conventional Machine Learning (ML) -based multi-organ segmentation method, and a Deep Learning (DL) -based multi-organ segmentation method.
Traditional segmentation methods usually rely on classical image segmentation techniques such as threshold segmentation, edge detection segmentation and region growing segmentation, which involve manual work, target boundaries and mathematical models and give poor results when tissues and organs are complex and boundaries are blurred or overlapping. Multi-organ segmentation methods based on traditional machine learning are driven by machine learning algorithms; for example, atlas-based segmentation uses prior knowledge by registering a manual segmentation or a predefined structure outline to the target image through label propagation.
In recent years, deep learning technology has developed rapidly, and convolutional neural networks (CNNs) have been successfully applied to medical image segmentation by virtue of their strong feature extraction capability. Among the various CNN variants, the U-Net network model and its derivatives have been state-of-the-art medical segmentation models for many years due to their simple architecture and excellent performance. However, CNN-based models are often limited in capturing long-range relationships because of the inherent locality of convolution operations. Recently, with the advent of the Vision Transformer (ViT), and in particular the Swin Transformer (Swin-T), ViT-based models have become medical segmentation backbones thanks to their superior performance. The shifted-window scheme in Swin-T can overcome the limitations of high-resolution input while preserving the global self-attention advantage of the Transformer. Although Swin-T reduces model complexity, ViT-based models typically have large parameter counts and require more labeled samples and computational resources. Furthermore, in the field of semantic segmentation, fusing features of different layers improves segmentation performance, as demonstrated by recent work such as PSPNet and HRNet for natural image segmentation and UNet++ and UNet3+ for medical image segmentation. These studies indicate that fusing low-level features, which carry more detail, with high-level features, which carry more semantics, can improve segmentation performance in each field. However, these methods fuse all features, which can lead to high computational complexity, and they apply only a simple linear mapping to the fused features, which is not deep enough to fully exploit them.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention aims to provide a method, a system, equipment and a medium for adjacent layer feature fusion UNet multi-organ segmentation combining large-kernel convolution, in which medical images can be segmented effectively and efficiently by using large-kernel residual connections, adjacent layer feature fusion and large-kernel GRN channel response, good segmentation performance is achieved with low complexity and parameter count, few parameters are used, the complexity of organ segmentation is reduced, and good practicality and high efficiency are obtained.
In order to achieve the above purpose, the invention adopts the following technical scheme:
the adjacent layer feature fusion Unet multi-organ segmentation method combining large-kernel convolution is used for carrying out data set division, preprocessing data, data sampling and data enhancement on a sample containing a tag, then constructing an adjacent layer feature fusion UNet multi-organ segmentation model combining large-kernel depth convolution, and constructing an adjacent layer feature fusion module, so that the model fully utilizes information among different layer features to obtain lower layer features with more details and higher layer features with more semantics; and constructing a large-core GRN channel response module, modeling long-distance dependency on the characteristics fused by the characteristics of adjacent layers, and under the condition that the number of characteristic channels is increased, carrying out global response normalization on the channels fused with the characteristics by the large-core GRN channel response module, so that the channels are compared and selected, and the segmentation performance of the whole model is improved by utilizing the fused characteristics.
The constructing of the adjacent layer feature fusion UNet multi-organ segmentation model combined with large-kernel depth convolution comprises the following steps: constructing an encoder network, a base network of a decoder, an adjacent layer feature fusion module and a large core GRN channel response module; adding large-core depth convolution in the encoder network, combining the large-core depth convolution with 3×3 convolution, and combining local information and global information.
A method for segmenting multiple organs by combining large-kernel convolution and adjacent layer feature fusion Unet comprises the following specific steps:
s1, data set division
Randomly dividing a sample containing a label in the image data set into a training set and a testing set;
s2, data preprocessing
The image data are resampled after the data set is divided, to eliminate differences between images from different sources and to facilitate the calculation and comparison of features in the images: the 3D CT data are resampled to the same resolution, where bilinear interpolation is used for the image samples and nearest-neighbor interpolation is used for the label samples;
The data are normalized to eliminate the adverse effect of singular sample data and to accelerate the convergence of network training, according to the formula:

R = (I − min(I)) / (max(I) − min(I))

where R represents the CT data after normalization, W_r and H_r represent the width and height of the resolution of the normalized CT data, Z_r represents the number of slices, I represents the CT values before normalization, max(I) is the maximum CT value, and min(I) is the minimum CT value;
Centering on the target slice and stacking the slices above and below it as the network input: first, a target slice is selected along the z-axis of the normalized R, the adjacent slices are stacked with the target slice as the center, and a volume of size W_r × H_r × s is taken as the input of the network, where s represents the number of stacked adjacent slices; if the number of slices to stack is insufficient, corresponding mirror filling is performed; the whole process is as follows:
assuming that i represents a slice at the z-axis position in R after normalization, the network input can be expressed as:
X=[i-1,i,i+1]
where X represents the input to the network and W_r and H_r represent the width and height of the original resolution of the CT data; when i = 1, i.e. i is the first slice along the z-axis of R, X = [1, 1, i+1]; when i = i_max, i.e. i is the last slice along the z-axis of R, X = [i−1, i_max, i_max];
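As an illustration of the preprocessing in step S2 (min-max normalization and adjacent-slice stacking with mirror filling), a minimal NumPy sketch may look as follows; the function names, the array layout (Zr, Hr, Wr) and the default CT window are assumptions for illustration, not the patent's reference code.

import numpy as np

def normalize_ct(volume, ct_min=-125.0, ct_max=275.0):
    # Min-max normalization R = (I - min(I)) / (max(I) - min(I)),
    # after clipping to the CT value window used for the target organs.
    v = np.clip(volume.astype(np.float32), ct_min, ct_max)
    return (v - ct_min) / (ct_max - ct_min)

def stack_adjacent_slices(volume, i, s=3):
    # Build the network input X = [i-1, i, i+1] centered on slice i along the z-axis;
    # out-of-range indices are mirror-filled by repeating the border slice.
    z = volume.shape[0]                       # assumed layout: (Zr, Hr, Wr)
    half = s // 2
    idx = np.clip(np.arange(i - half, i + half + 1), 0, z - 1)
    return volume[idx]                        # shape (s, Hr, Wr)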
S3, data sampling
The data X obtained by the processing in step S2 are sampled, i.e. X is traversed and sampled sequentially; in sequential sampling, a slice window of size H×W is moved over X with step size S, sampling X from left to right and from top to bottom, where H and W denote the height and width of the slice window; if the slice window falls outside the range of X during sampling, it is shifted back so that it stays within X; if H and W are greater than W_r and H_r of X, X is padded around its borders;
s4, data enhancement
Data augmentation is performed to expand the training data and avoid the overfitting caused by training with few samples: the data processed in step S3 are, with a given probability, flipped horizontally, flipped vertically, rotated by −90° and 90°, and translated horizontally left and right;
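A minimal sketch of the probabilistic augmentation of step S4 (horizontal/vertical flips, ±90° rotations, horizontal translation); the probability p and the shift range are assumed values for illustration.

import random
import numpy as np

def augment(image, label, p=0.5, max_shift=20):
    # image, label: 2D arrays (H, W); the same transform is applied to both.
    if random.random() < p:
        image, label = np.fliplr(image).copy(), np.fliplr(label).copy()
    if random.random() < p:
        image, label = np.flipud(image).copy(), np.flipud(label).copy()
    if random.random() < p:
        k = random.choice([1, 3])             # 90 degree or -90 degree rotation
        image, label = np.rot90(image, k).copy(), np.rot90(label, k).copy()
    if random.random() < p:
        shift = random.randint(-max_shift, max_shift)
        image, label = np.roll(image, shift, axis=1), np.roll(label, shift, axis=1)
    return image, label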
s5, constructing an adjacent layer feature fusion Unet multi-organ segmentation network combined with large-kernel depth convolution, and naming the network as ASF-LKUNet;
s6, training an ASF-LKUNet segmentation network constructed in the step S5;
S7, the optimal model obtained by training in step S6 is tested using the test set from step S1 and the data processed as in step S3, and the segmentation performance of the model is quantitatively evaluated using the DSC coefficient and the 95th-percentile Hausdorff distance (HD95); HD95 measures the distance between two sets, and the smaller the value, the smaller the distance between the two sets; it is calculated as follows:

HD95(Y, Ŷ) = max( d(Y, Ŷ), d(Ŷ, Y) )

where d(Y, Ŷ) is the one-way Hausdorff distance from the label feature map Y to the segmentation feature map Ŷ, d(Ŷ, Y) is the one-way Hausdorff distance from the segmentation feature map Ŷ to the label feature map Y, and max(·) is computed by sorting the distances between the boundary points of Y and Ŷ from small to large and taking the distance at the 95th percentile;

The higher the DSC coefficient and the lower the HD95, the better the segmentation performance of the model; together they comprehensively evaluate the segmentation performance of the model.
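For illustration, the two evaluation metrics can be computed for a single binary organ mask roughly as follows; the use of scipy.ndimage and scipy.spatial to extract boundary points and pairwise distances is an implementation choice, not prescribed by the method.

import numpy as np
from scipy.ndimage import binary_erosion
from scipy.spatial.distance import cdist

def dsc(pred, label):
    # DSC = 2 |pred ∩ label| / (|pred| + |label|)
    inter = np.logical_and(pred, label).sum()
    return 2.0 * inter / (pred.sum() + label.sum() + 1e-8)

def hd95(pred, label):
    # 95th percentile of the symmetric boundary-to-boundary distances.
    def boundary(mask):
        return np.argwhere(mask & ~binary_erosion(mask))
    bp, bl = boundary(pred.astype(bool)), boundary(label.astype(bool))
    d = cdist(bp, bl)                          # pairwise distances between boundary points
    d_pl = d.min(axis=1)                       # prediction boundary -> label boundary
    d_lp = d.min(axis=0)                       # label boundary -> prediction boundary
    return max(np.percentile(d_pl, 95), np.percentile(d_lp, 95))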
The specific process of the data sampling in the step S3 is as follows:
S301, first, the fill lengths of X along the x and y axes are calculated as follows:

P_x = (S_x − ((H_r − H) mod S_x)) mod S_x

P_y = (S_y − ((W_r − W) mod S_y)) mod S_y

where P_x is the fill length of X along the x-axis, P_y is the fill length of X along the y-axis, S_x and S_y denote the sampling step sizes in the x and y directions (S_x and S_y take the same step size), mod denotes the remainder operation, H_r and W_r represent the height and width of the original resolution of the CT data, and H and W denote the height and width of the slice window; if H_r and W_r of X are greater than H and W, X is padded along its x and y axes by P_x and P_y respectively; if H is greater than H_r, X is padded along its x-axis by P_x; if W is greater than W_r, X is padded along its y-axis by P_y;
S302, the number of slices along the x-axis and y-axis directions is calculated according to step S301;

Let the number of slices in the x direction be N_x; it can be expressed as:

N_x = (H_r + P_x − H) | S_x + 1

where | denotes integer division;

The number of slices in the y direction, N_y, can be expressed as:

N_y = (W_r + P_y − W) | S_y + 1
S303, the coordinates are calculated from the slice window size and the step S; the coordinates x′ in the x direction can be expressed as:

x′ = [x′_1, x′_2, ..., x′_i, ..., x′_n]

where x′_i = (i−1)·S_x, i = 1, ..., n, n denotes the number of slices N_x in the x direction, and when i = n, x′_n = H_r − H;

The coordinates y′ in the y direction can be expressed as:

y′ = [y′_1, y′_2, ..., y′_i, ..., y′_n]

where y′_i = (i−1)·S_y, i = 1, ..., n, n denotes the number of slices N_y in the y direction, and when i = n, y′_n = W_r − W;
S304, the slice window of size H×W is moved over X with step S, sampling sequentially from left to right and from top to bottom in the xy plane of X to obtain the slices B; the specific process is as follows:

From the x′ and y′ coordinates of step S303, the upper-left-corner coordinates v of the patches to be sampled from X are obtained:

v = [v_1, v_2, ..., v_n]

where n = N_x·N_y and each v_i is a pair of coordinates (x′_j, y′_k);

The window is positioned on X according to the coordinates v and a patch of the slice window size is cropped to obtain the sampling slices B, according to the formula:

B = [B_1, B_2, ..., B_n]

where n = N_x·N_y and B_i = X(v_i); B_i denotes the i-th sampled slice, and X(v_i) denotes locating the coordinate v_i in X and then cropping a patch of the slice window size from X to obtain the sampling slice.
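A sketch of the sliding-window sampling of steps S301–S304 (fill lengths, slice counts, clamped coordinates and cropping); the (H, W) array layout, the reflect padding mode and the helper name are assumptions for illustration.

import numpy as np

def sample_patches(x, H=256, W=256, Sx=128, Sy=128):
    # x: array of shape (Hr, Wr) or (Hr, Wr, C); assumes the padded image is at least H x W.
    Hr, Wr = x.shape[0], x.shape[1]
    Px = (Sx - ((Hr - H) % Sx)) % Sx           # fill length along x
    Py = (Sy - ((Wr - W) % Sy)) % Sy           # fill length along y
    pad = [(0, Px), (0, Py)] + [(0, 0)] * (x.ndim - 2)
    xp = np.pad(x, pad, mode="reflect")
    Nx = (Hr + Px - H) // Sx + 1               # number of slices along x
    Ny = (Wr + Py - W) // Sy + 1               # number of slices along y
    xs = [min(i * Sx, xp.shape[0] - H) for i in range(Nx)]   # clamp the last coordinate
    ys = [min(j * Sy, xp.shape[1] - W) for j in range(Ny)]
    return [xp[u:u + H, v:v + W] for u in xs for v in ys]    # B = [B_1, ..., B_n]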
The specific method of step S5 is as follows:

S501, a large-kernel residual connection convolution module is constructed, comprising batch normalization layers, ReLU nonlinear activation layers, 3x3 convolution layers and a 7x7 depth convolution layer; the first 3x3 convolution layer is responsible for extracting local features and doubles the number of feature channels, the second 3x3 convolution layer is responsible for strengthening local feature extraction, and the 7x7 depth convolution layer uses large-kernel convolution to capture global features and increase the number of feature channels; the large-kernel residual connection convolution module is applied in the down-sampling process of the model: under the two different convolution operations, 3x3 and 7x7, the extracted global and local information is fused by addition, so that the module can capture local and global features at the same time, effectively alleviating the limitation of CNNs in modeling long-range dependencies;
S502, constructing downsampling according to the step S501, wherein the downsampling comprises a 3x3 convolution layer, a large kernel residual error connection convolution module and a maximum pooling layer, wherein the 3x3 convolution layer carries out channel weft lifting on initial input data of a network, the large kernel residual error connection convolution module carries out feature extraction, and the maximum pooling layer reduces feature resolution; the resolution of the features is reduced to half of the original resolution per 1 maximum pooling layer;
S503, a residual connection convolution module is constructed, consisting of batch normalization layers, ReLU nonlinear activation layers and 3x3 convolution layers; the residual connection convolution module first fuses the features extracted by down-sampling and the features extracted by up-sampling in an additive manner; the 3x3 convolution layer on the residual connection is responsible for extracting features and reducing the number of feature channels, and of the other two 3x3 convolution layers, the first is responsible for extracting features and halving the number of feature channels while the second strengthens feature extraction; the residual connection convolution module is applied in the up-sampling process of the model, where 3x3 convolution layers are used to extract features and the information extracted by the residual connection convolution and by the conventional convolutions is fused by addition, improving the segmentation performance of the model;
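A corresponding sketch of the decoder-side residual connection convolution module of S503; the additive skip fusion and the channel halving follow the text, while the remaining details are assumed.

import torch.nn as nn

class ResidualBlock(nn.Module):
    # Decoder block: fuses the skip (encoder) and up-sampled features by addition,
    # then halves the channel count: in_ch -> in_ch // 2.
    def __init__(self, in_ch):
        super().__init__()
        out_ch = in_ch // 2
        self.conv1 = nn.Sequential(nn.BatchNorm2d(in_ch), nn.ReLU(inplace=True),
                                   nn.Conv2d(in_ch, out_ch, 3, padding=1))
        self.conv2 = nn.Sequential(nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
                                   nn.Conv2d(out_ch, out_ch, 3, padding=1))
        self.res = nn.Conv2d(in_ch, out_ch, 3, padding=1)   # 3x3 conv on the residual connection

    def forward(self, skip, up):
        x = skip + up                                        # additive fusion of encoder/decoder features
        return self.conv2(self.conv1(x)) + self.res(x)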
S504, constructing an adjacent layer feature fusion method, and carrying out adjacent layer feature fusion on the extracted features in the downsampling process of the step S502; when there are 3 adjacent layer features fused, it can be expressed as:
when there are 2 adjacent layer features fused, it can be expressed as:
wherein , and />Representing the fused features->Representing a feature map, wherein a subscript s represents a current scale, and a superscript (h, w, c) represents resolution and channel number at a corresponding scale; conv 2×2 (. Cndot.) represents a 2x2 convolutional layer with a step size of 2, the number of output channels being twice the number of input channels; upConv 2×2 (. Cndot.) represents a 2x2 transposed convolutional layer with a step size of 2, the number of output channels being half the number of input channels, conv 2×2 (. Cndot.) and upConv 2×2 (. Cndot.) downsampling and upsampling adjacent layer features, respectivelyThe same size as the current scale; />
Is an operation of connecting different features;
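A sketch of the adjacent-layer feature fusion of S504 for an interior scale fusing 3 layers: a stride-2 2x2 convolution down-samples the shallower feature, a stride-2 2x2 transposed convolution up-samples the deeper feature, and the three tensors are concatenated; the channel bookkeeping is an assumption for illustration.

import torch
import torch.nn as nn

class AdjacentLayerFusion(nn.Module):
    def __init__(self, c_prev, c_cur, c_next):
        super().__init__()
        # Down-sample the shallower (higher-resolution) feature, doubling its channels.
        self.down = nn.Conv2d(c_prev, 2 * c_prev, kernel_size=2, stride=2)
        # Up-sample the deeper (lower-resolution) feature, halving its channels.
        self.up = nn.ConvTranspose2d(c_next, c_next // 2, kernel_size=2, stride=2)

    def forward(self, x_prev, x_cur, x_next):
        # All three tensors are brought to the spatial size of the current scale
        # and concatenated along the channel dimension.
        return torch.cat([self.down(x_prev), x_cur, self.up(x_next)], dim=1)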
s505, constructing a GRN module, wherein GRN comprises three steps: global feature aggregation, feature normalization and feature calibration;
For an input feature X of size (H, W, C), written X ∈ R^{H×W×C}, where C is the number of feature channels, then:
1) Global feature aggregation
In the global feature aggregation process, the spatial features are aggregated into a vector through a function g, which can be expressed as:

g(X) := gx = { ||X_1||, ||X_2||, ..., ||X_C|| } ∈ R^C

By using the L2 norm, this yields one value for each channel feature, giving a set of aggregated values, where g(X)_i = ||X_i|| is a scalar that aggregates the statistics of the i-th channel; letting X_i be an n-dimensional feature, i.e. X_i = (x_1, x_2, x_3, ..., x_n), its L2 norm can be expressed as ||X_i|| = sqrt(x_1^2 + x_2^2 + ... + x_n^2);
2) Feature normalization
In the feature normalization process, the scalar normalization of the statistics of the i-th channel can be expressed as:

N(||X_i||) := ||X_i|| / Σ_{j=1..C} ||X_j||

where ||X_i|| is the L2 norm of the i-th channel and C denotes the current number of channels;
3) Feature calibration
In the feature calibration process, the feature normalization score computed in step 2) of step S505 is used to calibrate the original input response, which can be expressed as:

X_i = X_i · N(g(X)_i) ∈ R^{H×W}

where X_i denotes the i-th feature map, g(·) and N(·) denote global feature aggregation and feature normalization respectively, and H×W denotes the resolution of the current X_i;
Two additional learnable parameters γ and β are added and initialized to zero, and a residual connection is additionally added between the input and output of the GRN layer; the final GRN can be expressed as:

X_i = γ · (X_i · N(g(X)_i)) + β + X_i
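A sketch of the GRN layer of S505 for (N, C, H, W) tensors: per-channel L2-norm aggregation, divisive normalization across channels, calibration with the learnable γ and β, and a residual connection; the epsilon term is an implementation assumption.

import torch
import torch.nn as nn

class GRN(nn.Module):
    # Global Response Normalization for channels-first (N, C, H, W) tensors.
    def __init__(self, channels, eps=1e-6):
        super().__init__()
        self.gamma = nn.Parameter(torch.zeros(1, channels, 1, 1))
        self.beta = nn.Parameter(torch.zeros(1, channels, 1, 1))
        self.eps = eps

    def forward(self, x):
        gx = torch.norm(x, p=2, dim=(2, 3), keepdim=True)       # g(X): per-channel L2 norm
        nx = gx / (gx.sum(dim=1, keepdim=True) + self.eps)      # N(||X_i||) = ||X_i|| / sum_j ||X_j||
        return self.gamma * (x * nx) + self.beta + x             # calibration + residual connection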
s506, constructing a large-core GRN channel response module according to the step S505; the module consists of a layer normalization layer, a 7x7 depth convolution layer, a GRN layer and a 3x3 convolution layer; the 7x7 depth convolution layer is responsible for feature extraction of features obtained by feature fusion of adjacent layers in the step S504, the GRN is responsible for global response normalization of extracted feature channels, and the 3x3 convolution layer is responsible for feature extraction and feature channel number reduction;
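A sketch of the large-kernel GRN channel response module of S506 (layer normalization, then a 7x7 depthwise convolution, then GRN, then a 3x3 convolution), reusing the GRN class sketched above; using GroupNorm(1, C) as the layer normalization and the output channel count are assumptions.

import torch.nn as nn

class LKGRNBlock(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.norm = nn.GroupNorm(1, in_ch)                                  # per-sample layer normalization (assumed form)
        self.dwconv = nn.Conv2d(in_ch, in_ch, 7, padding=3, groups=in_ch)   # large-kernel depthwise convolution
        self.grn = GRN(in_ch)                                               # global response normalization of the fused channels
        self.reduce = nn.Conv2d(in_ch, out_ch, 3, padding=1)                # feature extraction + channel reduction

    def forward(self, x):
        return self.reduce(self.grn(self.dwconv(self.norm(x))))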
S507, up-sampling is constructed based on step S506 and step S503, consisting of residual connection convolution modules, 2x2 transposed convolution layers and a 1x1 convolution layer; the residual connection convolution module first fuses the features extracted by down-sampling and the features extracted by up-sampling in an additive manner and then extracts features, the 2x2 transposed convolution layers are responsible for increasing the feature resolution, with each 2x2 transposed convolution layer doubling the resolution, and the 1x1 convolution layer is responsible for mapping the features to the final segmentation result;
s508, according to the steps S502, S504, S506 and S507, finally forming the adjacent layer feature fusion Unet multi-organ segmentation network combined with the large-kernel depth convolution: ASF-LKUNet.
The specific method of the step S6 is as follows:
s601, constructing a loss function of the method by using a cross entropy loss function and a Dice loss function;
During the training of the ASF-LKUNet segmentation network, the loss function uses the cross-entropy loss L_CE and the Dice loss L_Dice, defined as follows:

L_CE = −(1/N) Σ_i Σ_c y_{i,c} · log(ŷ_{i,c})

L_Dice = 1 − (2 · Σ_i Σ_c y_{i,c} · ŷ_{i,c}) / (Σ_i Σ_c y_{i,c} + Σ_i Σ_c ŷ_{i,c})

where y denotes the label, ŷ denotes the predicted value of each category, i denotes a pixel in the feature map, c denotes a category, and N is the number of pixels;

During training, an Adam (adaptive moment estimation) optimizer is adopted; the partial derivative of the loss function J(θ) with respect to θ is computed and the parameter θ is updated in the negative gradient direction, θ′_j = θ_j − σ·∂J(θ)/∂θ_j, where θ′ is the updated network parameter, θ_j is the network parameter before the update, σ is the learning rate, x_i is the training data input to the network, h_θ(x_i) is the network output for training sample x_i, y_i is the label corresponding to the training sample, and m is the number of samples input in each training step; a group of samples is randomly drawn from the training set, and the parameters are updated according to the gradient descent rule after each training step;
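A sketch of the combined cross-entropy + Dice loss of S601; the equal weighting of the two terms and the softmax-based soft Dice form are assumptions for illustration.

import torch
import torch.nn as nn
import torch.nn.functional as F

class CEDiceLoss(nn.Module):
    def __init__(self, num_classes, ce_weight=0.5, dice_weight=0.5, eps=1e-5):
        super().__init__()
        self.num_classes, self.eps = num_classes, eps
        self.ce_weight, self.dice_weight = ce_weight, dice_weight
        self.ce = nn.CrossEntropyLoss()

    def forward(self, logits, target):
        # logits: (N, C, H, W); target: (N, H, W) with integer class labels.
        loss_ce = self.ce(logits, target)
        probs = F.softmax(logits, dim=1)
        onehot = F.one_hot(target, self.num_classes).permute(0, 3, 1, 2).float()
        inter = (probs * onehot).sum(dim=(0, 2, 3))
        denom = probs.sum(dim=(0, 2, 3)) + onehot.sum(dim=(0, 2, 3))
        loss_dice = 1.0 - ((2 * inter + self.eps) / (denom + self.eps)).mean()
        return self.ce_weight * loss_ce + self.dice_weight * loss_dice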
s602, training a model by using the training set in the data set constructed in the step S1, the data obtained in the step S3 and the data obtained by enhancing the data in the step S4, and selecting the model with the highest evaluation index in the training process; the evaluation index uses a Dice coefficient (DSC), the DSC generally measures the similarity of two samples, the value range is [0,1], and the higher the DSC value is, the higher the similarity of the two samples is, which is defined as:
DSC(Ŷ, Y) = 2·|Ŷ ∩ Y| / (|Ŷ| + |Y|)

where Ŷ denotes the network segmentation feature map and Y denotes the label feature map.
The segmentation system based on the adjacent layer feature fusion Unet multi-organ segmentation method combined with large-kernel convolution comprises the following components:
a large-kernel residual connection convolution encoder, used for taking the slices B as network input and extracting global and local feature information;

a residual connection convolution decoder, used for outputting the segmentation result map and extracting multi-resolution deep features;
The adjacent layer feature fusion module is used for fusing the features of the adjacent layers and can obtain low-layer features with more details and high-layer features with more semantics;
the large-core GRN channel response module is used for carrying out global response normalization on the fused characteristic channels to enhance channel selection, and enhancing global and local information extraction through large-core depth convolution, so that the model can fully utilize different characteristics and improve the capability of capturing global and local information of the model.
The segmentation equipment based on the adjacent layer feature fusion Unet multi-organ segmentation method combined with large-kernel convolution comprises the following components:
a memory for storing a computer program, data and a model;
and a processor, used for implementing, when executing the computer program, the adjacent layer feature fusion Unet multi-organ segmentation method combining large-kernel convolution of any of steps S1 to S7.
A computer readable storage medium, responsible for reading and storing programs and data, wherein the computer readable storage medium stores a computer program which, when executed by a processor, can segment organ images based on the adjacent layer feature fusion UNet multi-organ segmentation method combining large-kernel convolution of steps S1 to S7.
Compared with the prior art, the invention has the beneficial effects that:
1. The invention provides a large-kernel residual connection convolution method (LK Residual Block). The residual connection facilitates training, alleviates degradation and mitigates overfitting, which is especially valuable for medical images with limited labeled samples; using large-kernel depth convolution on the residual link allows local and global information to be captured at the same time, effectively alleviating the limitation of CNNs in modeling long-range dependencies, and gives the model a ViT-like ability to capture global information while using fewer parameters and requiring less labeled data and computational resources than ViT architectures. The LK Residual Block method achieves better segmentation results than conventional residual connection methods such as ResUNet.
2. To address the problems that fully-connected feature fusion leads to high model computational complexity and that the fused features are neither effectively exploited nor explored deeply enough, an adjacent-layer feature fusion and large-kernel Global Response Normalization (GRN) channel response method (LKGRN) is proposed. Unlike fully-connected feature fusion, the adjacent-layer feature fusion method fuses adjacent features in series, which effectively reduces computational complexity while still fusing low-level features with more detail and high-level features with more semantics, thereby improving segmentation performance. The LKGRN method adaptively selects more meaningful channel information for the fused features and strengthens inter-channel feature extraction through a large-kernel depth convolution channel response improved on the basis of GRN. The GRN increases channel contrast and selectivity, explores the relationships among channels, and effectively utilizes and attends to the fused features without introducing additional parameters, further reducing complexity; the large-kernel depth convolution used in LKGRN effectively alleviates the problem of purely local attention, and the superior performance of LKGRN has been demonstrated on multi-organ datasets.
3. The method realizes good segmentation performance with lower complexity and parameter quantity, and can effectively and efficiently segment medical images by utilizing the large-core residual error connection, the adjacent layer feature fusion and the large-core GRN channel response method.
In conclusion, the invention uses few parameters, reduces the complexity of organ segmentation, and offers good practicality and high efficiency.
Drawings
FIG. 1 is a flow chart of the present invention.
Fig. 2 is an overall structure diagram of ASF-LKUNet of the present invention.
FIG. 3 is a block diagram of a large kernel residual join convolution of the present invention.
Fig. 4 is a block diagram of a residual join convolution of the present invention.
FIG. 5 is a block diagram of a large core GRN channel response module of the invention.
Detailed Description
The invention will be described in further detail with reference to the drawings and examples.
The adjacent layer feature fusion Unet multi-organ segmentation method combining large-core convolution is characterized in that after data set division, data preprocessing, data sampling and data enhancement are carried out on a sample containing a tag, an adjacent layer feature fusion UNet multi-organ segmentation model combining large-core depth convolution is constructed, and an encoder network, a base network of a decoder, an adjacent layer feature fusion module and a large-core GRN channel response module are constructed; adding large-core depth convolution in an encoder network, combining the large-core depth convolution with 3×3 convolution, and combining local information and global information; constructing adjacent layer feature fusion modules, so that the model fully utilizes information among different layer features to obtain lower layer features with more details and higher layer features with more semantics; and constructing a large-core GRN channel response module, modeling long-distance dependency on the characteristics fused by the characteristics of adjacent layers, and under the condition that the number of characteristic channels is increased, carrying out global response normalization on the channels fused with the characteristics by the large-core GRN channel response module, so that the channels are compared and selected, and the segmentation performance of the whole model is improved by utilizing the fused characteristics.
Referring to fig. 1, a method for segmenting multiple organs by combining large-kernel convolution and feature fusion of adjacent layers specifically comprises the following steps:
s1, data set division
Randomly dividing a sample containing a label in the image data set into a training set and a testing set;
s2, data preprocessing
The image data are resampled after the data set is divided, to eliminate differences between images from different sources and to facilitate the calculation and comparison of features in the images: the 3D CT data are resampled to the same spacing of 1×1×3 mm³, where bilinear interpolation is used for the image samples and nearest-neighbor interpolation is used for the label samples;
The invention is mainly aimed at the segmentation of human organs such as the aorta, gallbladder, left kidney, right kidney, liver, pancreas, spleen and stomach, so the CT value range is chosen as [−125, 275]; in addition, the data are normalized to eliminate the adverse effect of singular sample data and to accelerate the convergence of network training, according to the formula:

R = (I − min(I)) / (max(I) − min(I))

where R represents the CT data after normalization, W_r and H_r represent the width and height of the resolution of the normalized CT data, Z_r represents the number of slices, I represents the CT values before normalization, max(I) is the maximum CT value, and min(I) is the minimum CT value; max(I) is taken as 275 and min(I) as −125.
Centering on the target slice and stacking the slices above and below it as the network input: first, a target slice is selected along the z-axis of the normalized R, the adjacent slices are stacked with the target slice as the center, and a volume of size W_r × H_r × s is taken as the input of the network, where s represents the number of stacked adjacent slices; if the number of slices to stack is insufficient, corresponding mirror filling is performed; the whole process is as follows:
assuming i represents a slice at the mid-z-axis position of the 3D sample, the network input can be expressed as:
X=[i-1,i,i+1]
where X represents the input to the network and W_r and H_r represent the width and height of the original resolution of the CT data; when i = 1, i.e. i is the first slice along the z-axis of the 3D sample, X = [1, 1, i+1]; when i = i_max, i.e. i is the last slice along the z-axis of the 3D sample, X = [i−1, i_max, i_max];
S3, data sampling
Because the sample resolutions of different CT data are inconsistent, the network requires a fixed input resolution, and the input size of the network is proportional to the training time. Therefore, in order to reduce the network training time while ensuring that all content of every data sample participates in training, the invention takes a size of 256×256×3 as the network input and samples the data X obtained by the processing in step S2, i.e. X is traversed and slices of size 256×256×3 are sampled sequentially; in sequential sampling, a slice window is moved over X with step size S, sampling X from left to right and from top to bottom, where H and W are both 256; if the slice window falls outside the range of X during sampling, it is shifted back so that it stays within X; if H and W are greater than W_r and H_r of X, X is padded around its center; the specific process is as follows:
S301, first, the fill lengths of X along the x and y axes are calculated as follows:

P_x = (S_x − ((H_r − H) mod S_x)) mod S_x

P_y = (S_y − ((W_r − W) mod S_y)) mod S_y

where P_x is the fill length of X along the x-axis, P_y is the fill length of X along the y-axis, S_x and S_y denote the sampling step sizes in the x and y directions (in the invention, S_x and S_y are both 128), mod denotes the remainder operation, H_r and W_r represent the height and width of the original resolution of the CT data, and H and W denote the height and width of the slice window (in the invention, H and W are both 256); if H_r and W_r of X are greater than H and W, X is padded along its x and y axes by P_x and P_y respectively; if H is greater than H_r, X is padded along its x-axis by P_x; if W is greater than W_r, X is padded along its y-axis by P_y;
S302, the number of slices along the x-axis and y-axis directions is calculated according to step S301;

Let the number of slices in the x direction be N_x; it can be expressed as:

N_x = (H_r + P_x − H) | S_x + 1

where | denotes integer division;

The number of slices in the y direction, N_y, can be expressed as:

N_y = (W_r + P_y − W) | S_y + 1
S303, the coordinates are calculated from the slice window size and the step S; the coordinates x′ in the x direction can be expressed as:

x′ = [x′_1, x′_2, ..., x′_i, ..., x′_n]

where x′_i = (i−1)·S_x, i = 1, ..., n, n denotes the number of slices N_x in the x direction, and when i = n, x′_n = H_r − H;

The coordinates y′ in the y direction can be expressed as:

y′ = [y′_1, y′_2, ..., y′_i, ..., y′_n]

where y′_i = (i−1)·S_y, i = 1, ..., n, n denotes the number of slices N_y in the y direction, and when i = n, y′_n = W_r − W;
S304, the slice window of size H×W is moved over X with step S, sampling sequentially from left to right and from top to bottom in the xy plane of X to obtain the slices B; the specific process is as follows:

From the x′ and y′ coordinates of step S303, the upper-left-corner coordinates v of the patches to be sampled from X are obtained:

v = [v_1, v_2, ..., v_n]

where n = N_x·N_y and each v_i is a pair of coordinates (x′_j, y′_k);

The window is positioned on X according to the coordinates v and a patch of the slice window size is cropped to obtain the sampling slices B, according to the formula:

B = [B_1, B_2, ..., B_n]

where n = N_x·N_y and B_i = X(v_i); B_i denotes the i-th sampled slice, and X(v_i) denotes locating the coordinate v_i in X and then cropping a patch of the slice window size from X to obtain the sampling slice;
s4, data enhancement
Because data samples are scarce, the training data must be expanded by data augmentation to avoid the overfitting caused by training with few samples; the data augmentation is realized by, with a given probability, horizontally flipping, vertically flipping, rotating by −90° and 90°, and horizontally translating left and right the data processed in step S3;
S5, constructing a multi-organ segmentation network
And constructing an adjacent layer feature fusion Unet multi-organ segmentation network combined with large-kernel depth convolution, and naming the network as ASF-LKUNet. Please refer to fig. 2;
S501, a large-kernel residual connection convolution module is constructed, consisting of 2 batch normalization layers, 2 ReLU nonlinear activation layers, 2 3x3 convolution layers and 1 7x7 depth convolution layer; please refer to fig. 3. The first 3x3 convolution layer is responsible for extracting local features and doubles the number of feature channels, the second convolution layer is responsible for strengthening local feature extraction, and the 7x7 depth convolution layer uses large-kernel convolution to capture global features and increase the number of feature channels. The module is applied in the down-sampling process of the model: under the two different convolution operations, 3x3 and 7x7, the extracted global and local information is fused by addition, so that the module can capture local and global features at the same time, effectively alleviating the limitation of CNNs in modeling long-range dependencies; moreover, compared with segmentation models of the ViT architecture, the 7x7 depth convolution layer has fewer parameters and requires less labeled data and computational resources;
S502, down-sampling is constructed according to S501, comprising 1 3x3 convolution layer, 4 large-kernel residual connection convolution modules and 4 2x2 max pooling layers; the 3x3 convolution layer is responsible for raising the channel dimension of the initial input data of the network, the large-kernel residual connection convolution modules are responsible for feature extraction, and the max pooling layers are responsible for reducing the feature resolution; each max pooling layer halves the feature resolution;
S503, constructing a residual connection convolution module, wherein the residual connection convolution module consists of 2 batch normalization layers, 2 ReLU nonlinear activation layers and 3x3 convolution layers; please refer to fig. 4; the residual connection convolution module firstly fuses the features extracted by downsampling and the features extracted by upsampling in an additive mode, a 3x3 convolution layer connected with the residual is responsible for extracting the features and reducing the number of feature channels, in the other two 3x3 convolution layers, the first 3x3 convolution layer is responsible for extracting the features and reducing the number of feature channels to half of the original number, and the second convolution layer is responsible for enhancing the feature extraction; the residual connection convolution module is used for extracting features by using 3x3 convolution layers in the up-sampling process of the model, and the residual connection convolution and information extracted by conventional convolution are fused in an addition mode, so that the segmentation performance of the model is improved;
S504, an adjacent layer feature fusion method is constructed, and adjacent layer feature fusion is performed on the features extracted in the down-sampling process of step S502. When the features of 3 adjacent layers are fused, the fused feature at the current scale is obtained by concatenating the down-sampled feature of the shallower adjacent layer, the feature of the current layer and the up-sampled feature of the deeper adjacent layer; when the features of 2 adjacent layers are fused (at the shallowest and deepest scales), the fused feature is obtained by concatenating the feature of the current layer with the resampled feature of its single adjacent layer;

where the fused feature and the feature maps are written with a subscript s denoting the current scale and a superscript (h, w, c) denoting the resolution and number of channels at the corresponding scale; Conv_{2×2}(·) denotes a 2x2 convolution layer with step size 2 whose number of output channels is twice the number of input channels; upConv_{2×2}(·) denotes a 2x2 transposed convolution layer with step size 2 whose number of output channels is half the number of input channels; Conv_{2×2}(·) and upConv_{2×2}(·) respectively down-sample and up-sample the adjacent-layer features to the same size as the current scale; concatenation is the operation of connecting the different features;
s505, constructing a GRN module, wherein GRN comprises three steps: global feature aggregation, feature normalization and feature calibration.
For an input feature X of size (H, W, C), written X ∈ R^{H×W×C}, the following steps are performed:
1) Global feature aggregation
In the global feature aggregation process, the spatial features are aggregated into a vector through a function g, which can be expressed as:

g(X) := gx = { ||X_1||, ||X_2||, ..., ||X_C|| } ∈ R^C

By using the L2 norm, this yields one value for each channel feature, giving a set of aggregated values, where g(X)_i = ||X_i|| is a scalar that aggregates the statistics of the i-th channel; letting X_i be an n-dimensional feature, i.e. X_i = (x_1, x_2, x_3, ..., x_n), its L2 norm can be expressed as ||X_i|| = sqrt(x_1^2 + x_2^2 + ... + x_n^2);
2) Feature normalization
In the feature normalization process, the scalar normalization of the statistics of the i-th channel can be expressed as:

N(||X_i||) := ||X_i|| / Σ_{j=1..C} ||X_j||

where ||X_i|| is the L2 norm of the i-th channel and C denotes the current number of channels;
3) Feature calibration
In the feature calibration process, the feature normalization score computed in step 2) of step S505 is used to calibrate the original input response, which can be expressed as:

X_i = X_i · N(g(X)_i) ∈ R^{H×W}

where X_i denotes the i-th feature map, g(·) and N(·) denote global feature aggregation and feature normalization respectively, and H×W denotes the resolution of the current X_i;
To simplify the optimization, two additional learnable parameters γ and β need to be added and initialized to zero, and a residual connection is additionally added between the input and output of the GRN layer; the final GRN can be expressed as:

X_i = γ · (X_i · N(g(X)_i)) + β + X_i
S506, a large-kernel GRN channel response module is constructed according to step S505; the module consists of a layer normalization layer, a 7x7 depth convolution layer, a GRN layer and a 3x3 convolution layer; please refer to fig. 5. The 7x7 depth convolution layer is responsible for feature extraction on the features obtained by the adjacent layer feature fusion of step S504, the GRN is responsible for global response normalization of the extracted feature channels, and the 3x3 convolution layer is responsible for feature extraction and for reducing the number of feature channels; thus, in the large-kernel GRN channel response module, the features of adjacent layers are fused, low-level details are combined with high-level semantics, and global and local information extraction and channel selection are enhanced through the GRN-based improved large-kernel channel response, so that the model can fully utilize the different features and its ability to capture global and local information is improved.
S507, up-sampling is constructed based on step S506 and step S503, consisting of 4 residual connection convolution modules, 4 2x2 transposed convolution layers and 1 1x1 convolution layer; the residual connection convolution module first fuses the features extracted by down-sampling and the features extracted by up-sampling in an additive manner and then extracts features, the 2x2 transposed convolution layers are responsible for increasing the feature resolution, with each 2x2 transposed convolution layer doubling the resolution, and the 1x1 convolution layer is responsible for mapping the features to the final segmentation result;
according to step S502, step S504, step S506 and step S507, finally forming a neighboring layer feature fusion Unet multi-organ segmentation network combined with large-kernel depth convolution: ASF-LKUNet; please refer to fig. 2.
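A high-level sketch of how the pieces of ASF-LKUNet fit together (stem 3x3 convolution, large-kernel residual encoder with max pooling, adjacent-layer fusion feeding large-kernel GRN modules on the skip connections, residual decoder with transposed convolutions, 1x1 output head). It reuses the module sketches above; the number of scales, channel widths and wiring details are simplified assumptions for illustration, not the patent's reference implementation.

import torch
import torch.nn as nn

class ASFLKUNetSketch(nn.Module):
    # Simplified 3-scale illustration; the described network uses 4 encoder stages.
    def __init__(self, in_ch=3, num_classes=9, c=32):
        super().__init__()
        self.stem = nn.Conv2d(in_ch, c, 3, padding=1)            # raises the channel dimension of the input
        self.enc1, self.enc2, self.enc3 = LKResidualBlock(c), LKResidualBlock(2 * c), LKResidualBlock(4 * c)
        self.pool = nn.MaxPool2d(2)
        # Resampling layers that bring adjacent-layer features to the current scale.
        self.up21 = nn.ConvTranspose2d(4 * c, 2 * c, 2, stride=2)
        self.down12 = nn.Conv2d(2 * c, 4 * c, 2, stride=2)
        self.up32 = nn.ConvTranspose2d(8 * c, 4 * c, 2, stride=2)
        self.down23 = nn.Conv2d(4 * c, 8 * c, 2, stride=2)
        # Large-kernel GRN channel response on each fused skip connection.
        self.lkgrn1 = LKGRNBlock(4 * c, 2 * c)      # scale 1: 2c + 2c  -> 2c
        self.lkgrn2 = LKGRNBlock(12 * c, 4 * c)     # scale 2: 4c+4c+4c -> 4c
        self.lkgrn3 = LKGRNBlock(16 * c, 8 * c)     # scale 3: 8c + 8c  -> 8c
        # Decoder: transposed convolutions + residual connection convolution modules.
        self.up_d3 = nn.ConvTranspose2d(8 * c, 4 * c, 2, stride=2)
        self.dec2 = ResidualBlock(4 * c)            # -> 2c
        self.up_d2 = nn.ConvTranspose2d(2 * c, 2 * c, 2, stride=2)
        self.dec1 = ResidualBlock(2 * c)            # -> c
        self.head = nn.Conv2d(c, num_classes, 1)    # 1x1 conv maps to the segmentation result

    def forward(self, x):
        e1 = self.enc1(self.stem(x))                # 2c, full resolution
        e2 = self.enc2(self.pool(e1))               # 4c, 1/2 resolution
        e3 = self.enc3(self.pool(e2))               # 8c, 1/4 resolution
        s1 = self.lkgrn1(torch.cat([e1, self.up21(e2)], dim=1))
        s2 = self.lkgrn2(torch.cat([self.down12(e1), e2, self.up32(e3)], dim=1))
        s3 = self.lkgrn3(torch.cat([self.down23(e2), e3], dim=1))
        d2 = self.dec2(s2, self.up_d3(s3))          # 2c, 1/2 resolution
        d1 = self.dec1(s1, self.up_d2(d2))          # c, full resolution
        return self.head(d1)

In this sketch the deepest fused feature directly starts the decoding path; the actual network may use a dedicated bottleneck stage between the encoder and decoder.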
S6, training an ASF-LKUNet segmentation network constructed in the step S5;
s601, constructing a loss function of the method by using a cross entropy loss function and a Dice loss function;
During the training of the segmentation network, the loss function uses the cross-entropy loss L_CE and the Dice loss L_Dice, defined as follows:

L_CE = −(1/N) Σ_i Σ_c y_{i,c} · log(ŷ_{i,c})

L_Dice = 1 − (2 · Σ_i Σ_c y_{i,c} · ŷ_{i,c}) / (Σ_i Σ_c y_{i,c} + Σ_i Σ_c ŷ_{i,c})

where y denotes the label, ŷ denotes the predicted value of each category, i denotes a pixel in the feature map, c denotes a category, and N is the number of pixels;

During training, an Adam (adaptive moment estimation) optimizer is adopted; first, the partial derivative of the loss function J(θ) with respect to θ is computed and the parameter θ is updated in the negative gradient direction, θ′_j = θ_j − σ·∂J(θ)/∂θ_j, where θ′ is the updated network parameter, θ_j is the network parameter before the update, σ is the learning rate, x_i is the training data input to the network, h_θ(x_i) is the network output for training sample x_i, y_i is the label corresponding to the training sample, and m is the number of samples input in each training step; a group of samples is randomly drawn from the training set, and the parameters are updated according to the gradient descent rule after each training step;
s602, training a model by using the training set in the data set constructed in the step S1, the data obtained in the step S3 and the data obtained by enhancing the data in the step S4, and selecting the model with the highest evaluation index in the training process; the evaluation index uses a Dice coefficient (DSC), the DSC generally measures the similarity of two samples, the value range is [0,1], and the higher the DSC value is, the higher the similarity of the two samples is, which is defined as:
DSC(Ŷ, Y) = 2·|Ŷ ∩ Y| / (|Ŷ| + |Y|)

where Ŷ denotes the network segmentation feature map and Y denotes the label feature map;
S7, the optimal model obtained by training in step S6 is tested using the test set from step S1 and the data processed as in step S3, and the segmentation performance of the model is quantitatively evaluated using the DSC coefficient and the 95th-percentile Hausdorff distance (HD95); HD95 calculates the distance between two sets, and the smaller the value, the higher the similarity between the two sets; it is calculated as follows:

HD95(Y, Ŷ) = max( d(Y, Ŷ), d(Ŷ, Y) )

where d(Y, Ŷ) is the one-way Hausdorff distance from the label feature map Y to the segmentation feature map Ŷ, d(Ŷ, Y) is the one-way Hausdorff distance from the segmentation feature map Ŷ to the label feature map Y, and max(·) is computed by sorting the distances between the boundary points of Y and Ŷ from small to large and taking the distance at the 95th percentile;
the higher the DSC coefficient and the lower the HD95, the better the representative model segmentation performance, and thus the segmentation performance of the model is comprehensively evaluated.
The invention provides a large-kernel residual connection convolution method (LK Residual Block). The residual connection facilitates training, alleviates degradation and mitigates overfitting, which is especially valuable for medical images with limited labeled samples; using large-kernel depth convolution on the residual link allows local and global information to be captured at the same time, effectively alleviating the limitation of CNNs in modeling long-range dependencies, and gives the model a ViT-like ability to capture global information while using fewer parameters and requiring less labeled data and computational resources than ViT architectures. The LK Residual Block method achieves better segmentation results than conventional residual connection methods such as ResUNet.
The invention also addresses the problems that fully-connected feature fusion leads to high model computational complexity and that the fused features are neither effectively exploited nor explored deeply enough, by proposing an adjacent-layer feature fusion and large-kernel Global Response Normalization (GRN) channel response method (LKGRN). Unlike fully-connected feature fusion, the adjacent-layer feature fusion method fuses adjacent features in series, which effectively reduces computational complexity while still fusing low-level features with more detail and high-level features with more semantics, thereby improving segmentation performance. The LKGRN method adaptively selects more meaningful channel information for the fused features and strengthens inter-channel feature extraction through a large-kernel depth convolution channel response improved on the basis of GRN. The GRN increases channel contrast and selectivity, explores the relationships among channels, and effectively utilizes and attends to the fused features without introducing additional parameters, further reducing complexity; the large-kernel depth convolution used in LKGRN effectively alleviates the problem of purely local attention, and the superior performance of LKGRN has been demonstrated on multi-organ datasets.

Claims (9)

1. The adjacent layer feature fusion Unet multi-organ segmentation method combining large-kernel convolution is characterized in that, after data set division, data preprocessing, data sampling and data enhancement are performed on the labeled samples, an adjacent layer feature fusion UNet multi-organ segmentation model combining large-kernel depthwise convolution is constructed; an adjacent layer feature fusion module is constructed so that the model fully utilizes the information among features of different layers to obtain lower-layer features with more detail and higher-layer features with more semantics; and a large-kernel GRN channel response module is constructed to model long-range dependencies on the features fused from adjacent layers, and, as the number of feature channels increases, to perform global response normalization on the fused feature channels so that the channels are contrasted and selected, whereby the fused features are utilized to improve the segmentation performance of the whole model.
2. The adjacent layer feature fusion Unet multi-organ segmentation method combining large-kernel convolution according to claim 1, wherein constructing the adjacent layer feature fusion Unet multi-organ segmentation model combining large-kernel convolution comprises: constructing the base networks of the encoder and the decoder, the adjacent layer feature fusion module and the large-kernel GRN channel response module; and adding large-kernel depthwise convolution in the encoder network and combining it with 3×3 convolution to combine local and global information.
3. The method for segmenting the multiple organs by combining large-kernel convolution and fusion of adjacent layer features according to claim 1 or 2 is characterized by comprising the following steps:
S1, data set division
Randomly dividing a sample containing a label in the image data set into a training set and a testing set;
S2, data preprocessing
Resampling the image data after the data set is divided, to eliminate differences between images from different sources and to facilitate the calculation and comparison of features in the images: the 3D CT data are resampled to the same resolution, where the image samples are resampled with bilinear interpolation and the label samples with nearest-neighbor interpolation;
the data are normalized to eliminate the adverse effect of singular sample values and to accelerate the convergence of network training, according to the formula:
R = (I − min(I)) / (max(I) − min(I)), R ∈ R^{Wr×Hr×Zr}
where R represents the CT data after normalization, Wr and Hr represent the width and height of the resolution after normalization, Zr represents the number of slices, I represents the CT values before normalization, max(I) is the maximum CT value and min(I) is the minimum CT value;
centering on the target slice and stacking the slices above and below it as the network input: first, a target slice is selected along the z axis of the normalized R, the adjacent slices are stacked around the target slice as center, and a volume of size Wr×Hr×s is taken as the input of the network, where s is the number of stacked adjacent slices; if the number of available slices is insufficient, corresponding mirror filling is performed; the whole process is as follows:
Assuming that i represents a slice at the z-axis position in R after normalization, the network input can be expressed as:
X=[i-1,i,i+1]
where X represents the input of the network and Wr and Hr represent the width and height of the original resolution of the CT data; when i=1, i.e. i is the first slice along the z axis of R, X=[1,1,i+1]; when i=i_max, i.e. i is the last slice along the z axis of R, X=[i-1,i_max,i_max] (an illustrative sketch of this preprocessing follows this claim);
S3, data sampling
Sampling the data X obtained by the processing of step S2, i.e. traversing X and sampling sequentially; the sequential sampling takes slices of size H×W with step length S, moving over X from left to right and from top to bottom, where H and W denote the height and width of the slice; if a slice falls outside the extent of X during sampling, the sampling position is pulled back so that the slice lies within X, and if H and W are greater than Hr and Wr of X, X is padded around its borders;
S4, data enhancement
Data augmentation is performed to expand the training data and to avoid the overfitting caused by training with few samples; the data processed in step S3 are, with a certain probability, horizontally flipped, vertically flipped, rotated by -90 and 90 degrees, and translated horizontally left and right to realize the data augmentation;
S5, constructing an adjacent layer feature fusion Unet multi-organ segmentation network combined with large-kernel depthwise convolution, and naming the network ASF-LKUNet;
S6, training an ASF-LKUNet segmentation network constructed in the step S5;
S7, testing the optimal model obtained by the training of step S6, using the test set from step S1 and the data obtained by the processing of step S3, and quantitatively evaluating the segmentation performance of the model with the DSC coefficient and the 95th-percentile Hausdorff distance (HD95); HD95 measures the distance between two point sets, and the smaller the value, the smaller the distance between the two sets; it is calculated as:
HD95(Ŷ, Y) = max(d_95(Y, Ŷ), d_95(Ŷ, Y))
where d_95(Y, Ŷ) is the one-directional Hausdorff distance from the label feature map Y to the segmentation feature map Ŷ, d_95(Ŷ, Y) is the one-directional Hausdorff distance from Ŷ to Y, and each one-directional distance is obtained by sorting the distances between the boundary points of Y and Ŷ from small to large and taking the distance at the 95% position;
the higher the DSC coefficient and the lower the HD95, the better the segmentation performance of the model, so the two metrics together give a comprehensive evaluation of the model's segmentation performance.
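As an editorial illustration of step S2 of this claim (not part of the claimed method), the min-max normalization and the adjacent-slice stacking with edge duplication can be sketched in Python as follows; the function names, the 0-based slice index and the fixed stack size s = 3 are assumptions of the sketch, and the resampling to a common resolution is omitted.

import numpy as np

def normalize_ct(volume):
    # R = (I - min(I)) / (max(I) - min(I)): map the CT values into [0, 1].
    v_min, v_max = volume.min(), volume.max()
    return (volume - v_min) / (v_max - v_min + 1e-8)

def stack_adjacent_slices(volume, i):
    # Build the network input X = [i-1, i, i+1] for target slice index i.
    # volume has shape (Z, H, W); a missing neighbour at either end of the
    # volume is replaced by the target slice itself (the "mirror filling").
    z = volume.shape[0]
    lo = i - 1 if i - 1 >= 0 else i
    hi = i + 1 if i + 1 < z else i
    return np.stack([volume[lo], volume[i], volume[hi]], axis=0)  # (3, H, W)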
4. The method for segmenting multiple organs by combining large-kernel convolution and fusion of adjacent layer features as set forth in claim 3, wherein the specific process of data sampling in the step S3 is as follows:
S301, first calculate the filling lengths of X along the x and y axes:
P_x = (S_x − ((Hr − H) mod S_x)) mod S_x
P_y = (S_y − ((Wr − W) mod S_y)) mod S_y
where P_x is the filling length of X along the x axis, P_y is the filling length of X along the y axis, S_x and S_y are the step sizes in the x and y directions during sampling (S_x and S_y take the same step length), mod denotes the remainder operation, Hr and Wr are the height and width of the original resolution of the CT data, and H and W are the height and width of the slice; if H and W are greater than Hr and Wr of X, X is padded by P_x and P_y along its x and y axes respectively; if only H is greater than Hr, P_x is filled along the x axis of X, and if only W is greater than Wr, P_y is filled along the y axis of X;
S302, calculating the number of slices in the directions of the x axis and the y axis according to the step S301;
let the number of slices in the x-direction be N x The calculation method can be expressed as:
N x =(H r +P x -H)|S x +1
wherein, i represents integer division;
number of slices in y-direction N y The calculation method can be expressed as:
N y =(H r +P y -H)|S y +1
S303, calculate the coordinates from N and S; the coordinates x′ in the x direction can be expressed as:
x′ = [x′_1, x′_2, ..., x′_i, ..., x′_n]
where x′_i = (i−1)*S_x, i = 1, ..., n, and n is the number of slices N_x in the x direction; when i = n, x′_n = Hr − H;
the coordinates y′ in the y direction can be expressed as:
y′ = [y′_1, y′_2, ..., y′_i, ..., y′_n]
where y′_i = (i−1)*S_y, i = 1, ..., n, and n is the number of slices N_y in the y direction; when i = n, y′_n = Wr − W;
S304, with slice size H×W and step S, sample sequentially from left to right and from top to bottom over the x and y directions of X to obtain the slices B; the specific process is as follows:
from the coordinates x′ and y′ of step S303, the top-left-corner coordinates v of the samples taken from X are obtained:
v = [v_1, v_2, ..., v_n] = [(x′_i, y′_j)], i = 1, ..., N_x, j = 1, ..., N_y
where n = N_x * N_y;
X is positioned at the coordinate v and cropped to the slice size H×W to obtain the sampling slices B:
B = [B_1, B_2, ..., B_n]
where n = N_x * N_y and B_i = X(v_i); B_i denotes the i-th sampled slice, and X(v_i) denotes positioning at the coordinate v_i in X and then cropping X to the slice size H×W to obtain the sampled slice (an illustrative sketch of this sampling follows this claim).
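As an editorial illustration of steps S301 to S304 of this claim (not part of the claimed method), the sequential patch sampling can be sketched in Python as follows; the patch size, the stride, and the choice to pad only the bottom and right borders with edge values are assumptions of the sketch.

import numpy as np

def sample_patches(x, patch_hw=(256, 256), stride=(128, 128)):
    # x has shape (C, Hr, Wr); returns a list of (C, H, W) patches.
    _, hr, wr = x.shape
    h, w = patch_hw
    sx, sy = stride
    # S301: padding so that the stride tiles the (possibly enlarged) image.
    px = (sx - ((hr - h) % sx)) % sx
    py = (sy - ((wr - w) % sy)) % sy
    x = np.pad(x, ((0, 0), (0, px), (0, py)), mode="edge")
    # S302: number of patches along each axis (integer division).
    nx = (hr + px - h) // sx + 1
    ny = (wr + py - w) // sy + 1
    # S303/S304: top-left coordinates, then crop left-to-right, top-to-bottom.
    patches = []
    for i in range(nx):
        for j in range(ny):
            r, c = i * sx, j * sy
            patches.append(x[:, r:r + h, c:c + w])
    return patches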
5. The method for segmenting multiple organs by combining large-kernel convolution and fusion of adjacent layer features according to claim 3, wherein the specific method of S5 is as follows:
S501, construct a large-kernel residual connection convolution module comprising: a batch normalization layer, a ReLU nonlinear activation layer, 3x3 convolution layers and a 7x7 depthwise convolution layer; the first 3x3 convolution layer extracts local features and doubles the number of feature channels, the second 3x3 convolution layer strengthens the extraction of local features, and the 7x7 depthwise convolution layer uses the large kernel to capture global features and to increase the number of feature channels; the large-kernel residual connection convolution module is applied in the downsampling path of the model: the global information and the local information extracted by the two different 3x3 and 7x7 convolutions are fused by addition, so that the module captures local and global features at the same time and effectively relieves the limited long-range dependency modeling of CNNs;
S502, construct the downsampling path from step S501, comprising a 3x3 convolution layer, the large-kernel residual connection convolution module and a max pooling layer, where the 3x3 convolution layer raises the channel dimension of the initial input data of the network, the large-kernel residual connection convolution module performs feature extraction, and the max pooling layer reduces the feature resolution; each max pooling layer reduces the feature resolution to half of the original;
S503, construct a residual connection convolution module consisting of a batch normalization layer, a ReLU nonlinear activation layer and 3x3 convolution layers; the residual connection convolution module first fuses the features extracted by downsampling and the features extracted by upsampling through addition; the 3x3 convolution layer on the residual connection extracts features and reduces the number of feature channels, while of the other two 3x3 convolution layers the first extracts features and reduces the number of feature channels to half of the original and the second strengthens the feature extraction; the residual connection convolution module is applied in the upsampling path of the model, where 3x3 convolution layers extract features and the information extracted by the residual connection convolution and by the conventional convolution is fused by addition, improving the segmentation performance of the model;
S504, construct the adjacent layer feature fusion method and perform adjacent layer feature fusion on the features extracted in the downsampling of step S502; when 3 adjacent layer features are fused, it can be expressed as:
F̃_s = (conv_{2×2}(f_{s−1}^{(2h,2w,c/2)}), f_s^{(h,w,c)}, upConv_{2×2}(f_{s+1}^{(h/2,w/2,2c)}))
when 2 adjacent layer features are fused, it can be expressed as:
F̂_s = (conv_{2×2}(f_{s−1}^{(2h,2w,c/2)}), f_s^{(h,w,c)}) or F̂_s = (f_s^{(h,w,c)}, upConv_{2×2}(f_{s+1}^{(h/2,w/2,2c)}))
where F̃_s and F̂_s denote the fused features, f_s^{(h,w,c)} denotes a feature map, the subscript s denotes the current scale and the superscript (h, w, c) denotes the resolution and channel number at the corresponding scale; conv_{2×2}(·) denotes a 2x2 convolution layer with step size 2 whose number of output channels is twice the number of input channels; upConv_{2×2}(·) denotes a 2x2 transposed convolution layer with step size 2 whose number of output channels is half the number of input channels; conv_{2×2}(·) and upConv_{2×2}(·) downsample and upsample the adjacent layer features, respectively, to the same size as the current scale; (·,·) is the operation of concatenating different features;
S505, construct a GRN module; GRN comprises three steps: global feature aggregation, feature normalization and feature calibration;
for an input feature X of size (H, W, C), i.e. X ∈ R^{H×W×C} with C the number of feature channels:
1) Global feature aggregation
In the global feature aggregation step, the spatial features of each channel are aggregated into a vector by a function g:
G(X) = gx = {||X_1||, ||X_2||, ..., ||X_C||} ∈ R^C
where the L2 norm gives one value per channel feature, so that the i-th aggregated value gx_i = ||X_i|| is a scalar aggregating the statistics of the i-th channel; for an n-dimensional feature x = (x_1, x_2, x_3, ..., x_n), the L2 norm is ||x||_2 = (Σ_{k=1}^{n} x_k^2)^{1/2};
2) Feature normalization
In the feature normalization step, the scalar normalization of the statistic of the i-th channel can be expressed as:
N(||X_i||) = ||X_i|| / Σ_{j=1}^{C} ||X_j||
where ||X_i|| is the L2 norm of the i-th channel and C is the current number of channels;
3) Feature calibration
In the feature calibration step, the normalized score computed in step 2) of step S505 is used to calibrate the original input response:
X_i = X_i * N(G(X)_i) ∈ R^{H×W}
where X_i denotes the i-th feature map, G(·) and N(·) denote global feature aggregation and feature normalization respectively, and H×W is the resolution of the current X_i;
Two additional learnable parameters γ and β, initialized to zero, are added, and a residual connection is further added between the input and output of the GRN layer, so that the final GRN can be expressed as:
X_i = γ * X_i * N(G(X)_i) + β + X_i;
S506, construct the large-kernel GRN channel response module from step S505; the module consists of a layer normalization layer, a 7x7 depthwise convolution layer, a GRN layer and a 3x3 convolution layer; the 7x7 depthwise convolution layer extracts features from the features obtained by the adjacent layer feature fusion of step S504, the GRN performs global response normalization over the extracted feature channels, and the 3x3 convolution layer extracts features and reduces the number of feature channels;
S507, construct the upsampling path from step S506 and step S503; it consists of the residual connection convolution module, a 2x2 transposed convolution layer and a 1x1 convolution layer; the residual connection convolution module first fuses the features extracted by downsampling and the features extracted by upsampling through addition and then extracts features, the 2x2 transposed convolution layer increases the feature resolution, each 2x2 transposed convolution layer doubling the resolution, and the 1x1 convolution layer maps the features to the final segmentation result;
S508, from steps S502, S504, S506 and S507, finally form the adjacent layer feature fusion Unet multi-organ segmentation network combined with large-kernel depthwise convolution, ASF-LKUNet (an illustrative sketch of the key modules follows this claim).
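As an editorial illustration of steps S501, S504, S505 and S506 of this claim (not part of the claimed method), the key modules can be sketched in PyTorch as follows; the class names, the channel handling (a 1x1 convolution after the 7x7 depthwise convolution to change the channel number), the GroupNorm stand-in for layer normalization and the division by the channel sum in GRN are assumptions of the sketch.

import torch
import torch.nn as nn

class LKResidualBlock(nn.Module):
    # S501: 3x3 convolutions extract local features; a 7x7 depthwise convolution
    # on the residual path captures longer-range context; the two paths are summed.
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.local = nn.Sequential(
            nn.BatchNorm2d(in_ch), nn.ReLU(inplace=True),
            nn.Conv2d(in_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1),
        )
        self.global_path = nn.Sequential(
            nn.Conv2d(in_ch, in_ch, 7, padding=3, groups=in_ch),  # depthwise 7x7
            nn.Conv2d(in_ch, out_ch, 1),                          # match channels
        )
    def forward(self, x):
        return self.local(x) + self.global_path(x)

class GRN(nn.Module):
    # S505: global response normalization over channels for NCHW features.
    def __init__(self, channels):
        super().__init__()
        self.gamma = nn.Parameter(torch.zeros(1, channels, 1, 1))
        self.beta = nn.Parameter(torch.zeros(1, channels, 1, 1))
    def forward(self, x):
        gx = torch.norm(x, p=2, dim=(2, 3), keepdim=True)   # global feature aggregation
        nx = gx / (gx.sum(dim=1, keepdim=True) + 1e-6)       # feature normalization
        return self.gamma * (x * nx) + self.beta + x          # calibration + residual

class LKGRNBlock(nn.Module):
    # S506: layer norm -> 7x7 depthwise conv -> GRN -> 3x3 conv on the fused features.
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.norm = nn.GroupNorm(1, in_ch)   # layer-norm-like normalization
        self.dwconv = nn.Conv2d(in_ch, in_ch, 7, padding=3, groups=in_ch)
        self.grn = GRN(in_ch)
        self.proj = nn.Conv2d(in_ch, out_ch, 3, padding=1)
    def forward(self, x):
        return self.proj(self.grn(self.dwconv(self.norm(x))))

def fuse_adjacent(f_low, f_cur, f_high, down, up):
    # S504: concatenate the lower-scale feature (downsampled by a stride-2 2x2
    # convolution), the current-scale feature and the higher-scale feature
    # (upsampled by a stride-2 2x2 transposed convolution) along the channels.
    return torch.cat([down(f_low), f_cur, up(f_high)], dim=1)

For one scale with c channels, a possible wiring is down = nn.Conv2d(c // 2, c, 2, stride=2), up = nn.ConvTranspose2d(2 * c, c, 2, stride=2), fused = fuse_adjacent(f_prev, f_cur, f_next, down, up) giving (N, 3c, h, w), followed by LKGRNBlock(3 * c, c).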
6. The method for segmenting multiple organs by combining large-kernel convolution and feature fusion of adjacent layers according to claim 3, wherein the specific method in the step S6 is as follows:
S601, construct the loss function of the method from a cross entropy loss function and a Dice loss function;
during the training of the ASF-LKUNet segmentation network, the loss function uses the cross entropy loss L_CE and the Dice loss L_Dice, defined as:
L_CE = − Σ_i Σ_c y_{i,c} log(ŷ_{i,c})
L_Dice = 1 − 2 Σ_i Σ_c y_{i,c} ŷ_{i,c} / (Σ_i Σ_c y_{i,c} + Σ_i Σ_c ŷ_{i,c})
where y represents the label, ŷ represents the predicted value of each category, i indexes a pixel in the feature map and c indexes a category;
In the training process an Adam (adaptive moment estimation) optimizer is adopted; the loss function J(θ) is differentiated with respect to θ and the parameters are updated along the negative gradient direction:
θ′_j = θ_j − σ * (1/m) * Σ_{i=1}^{m} ∂ℓ(h_θ(x_i), y_i)/∂θ_j
where θ′ is the updated network parameter, θ_j is the network parameter before the update, σ is the learning rate, x_i is the training data input to the network, h_θ(x_i) is the network prediction for the training sample, y_i is the corresponding label, ℓ(·,·) is the per-sample loss, and m is the number of samples input in each training step; a group of samples is randomly drawn from the training set and the parameters are updated according to the gradient descent rule after each training step (an illustrative sketch of the loss and update step follows this claim);
S602, train the model with the training set of the data set constructed in step S1, the data obtained in step S3 and the data obtained by the data enhancement of step S4, and select the model with the highest evaluation index during training; the evaluation index is the Dice coefficient (DSC), which measures the similarity of two samples, takes values in [0,1], and the higher the DSC value, the higher the similarity of the two samples; it is defined as:
DSC(Ŷ, Y) = 2|Ŷ ∩ Y| / (|Ŷ| + |Y|)
where Ŷ represents the network segmentation feature map and Y represents the label feature map.
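As an editorial illustration of steps S601 and S602 of this claim (not part of the claimed method), the combined loss and one Adam update step can be sketched in PyTorch as follows; the equal weighting of the two losses, the learning rate and the placeholder names model, x and y are assumptions of the sketch.

import torch
import torch.nn.functional as F

def dice_loss(logits, target, eps=1e-6):
    # Soft multi-class Dice loss; logits (N, C, H, W), target (N, H, W) of class ids.
    num_classes = logits.shape[1]
    probs = torch.softmax(logits, dim=1)
    onehot = F.one_hot(target, num_classes).permute(0, 3, 1, 2).float()
    inter = (probs * onehot).sum(dim=(0, 2, 3))
    union = probs.sum(dim=(0, 2, 3)) + onehot.sum(dim=(0, 2, 3))
    return 1.0 - ((2 * inter + eps) / (union + eps)).mean()

def total_loss(logits, target):
    # Cross entropy plus Dice, as in step S601.
    return F.cross_entropy(logits, target) + dice_loss(logits, target)

# One training step with Adam on a randomly drawn batch (x, y):
#   optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
#   optimizer.zero_grad()
#   loss = total_loss(model(x), y)   # x: (N, 3, H, W) stacked slices, y: (N, H, W) labels
#   loss.backward()
#   optimizer.step()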
7. A segmentation system based on the adjacent layer feature fusion Unet multi-organ segmentation method combined with large-kernel convolution as claimed in any one of claims 1 to 6, comprising:
a large-kernel residual connection convolution encoder, used for taking the slices B as the network input and extracting global and local feature information;
a residual connection convolution decoder, used for extracting multi-resolution deep features and outputting the segmentation result map;
an adjacent layer feature fusion module, used for fusing the features of adjacent layers to obtain lower-layer features with more detail and higher-layer features with more semantics;
and a large-kernel GRN channel response module, used for performing global response normalization on the fused feature channels to strengthen channel selection and for enhancing global and local information extraction through large-kernel depthwise convolution, so that the model fully utilizes the different features and its ability to capture global and local information is improved.
8. A segmentation apparatus based on the adjacent layer feature fusion Unet multi-organ segmentation method in combination with large-kernel convolution as claimed in any one of claims 1 to 6, comprising:
a memory for storing a computer program, data and a model;
and a processor, configured to implement, when executing the computer program, the operations of the adjacent layer feature fusion Unet multi-organ segmentation method combining large-kernel convolution described in steps S1 to S7.
9. A computer readable storage medium for reading and storing programs and data, wherein the computer readable storage medium stores a computer program which, when executed by a processor, can segment organ images based on the adjacent layer feature fusion Unet multi-organ segmentation method combining large-kernel convolution as described in steps S1 to S7.
CN202310707529.6A 2023-06-15 2023-06-15 Adjacent layer feature fusion Unet multi-organ segmentation method, system, equipment and medium combining large-kernel convolution Pending CN116681894A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310707529.6A CN116681894A (en) 2023-06-15 2023-06-15 Adjacent layer feature fusion Unet multi-organ segmentation method, system, equipment and medium combining large-kernel convolution

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310707529.6A CN116681894A (en) 2023-06-15 2023-06-15 Adjacent layer feature fusion Unet multi-organ segmentation method, system, equipment and medium combining large-kernel convolution

Publications (1)

Publication Number Publication Date
CN116681894A true CN116681894A (en) 2023-09-01

Family

ID=87778926

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310707529.6A Pending CN116681894A (en) 2023-06-15 2023-06-15 Adjacent layer feature fusion Unet multi-organ segmentation method, system, equipment and medium combining large-kernel convolution

Country Status (1)

Country Link
CN (1) CN116681894A (en)


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116894802A (en) * 2023-09-11 2023-10-17 苏州思谋智能科技有限公司 Image enhancement method, device, computer equipment and storage medium
CN116894802B (en) * 2023-09-11 2023-12-15 苏州思谋智能科技有限公司 Image enhancement method, device, computer equipment and storage medium
CN117496516A (en) * 2023-12-25 2024-02-02 北京航空航天大学杭州创新研究院 Brain tumor MRI image segmentation method and system
CN117496516B (en) * 2023-12-25 2024-03-29 北京航空航天大学杭州创新研究院 Brain tumor MRI image segmentation method and system

Similar Documents

Publication Publication Date Title
TWI762860B (en) Method, device, and apparatus for target detection and training target detection network, storage medium
US11887311B2 (en) Method and apparatus for segmenting a medical image, and storage medium
CN109740665B (en) Method and system for detecting ship target with occluded image based on expert knowledge constraint
WO2020078269A1 (en) Method and device for three-dimensional image semantic segmentation, terminal and storage medium
CN110570353A (en) Dense connection generation countermeasure network single image super-resolution reconstruction method
CN116681894A (en) Adjacent layer feature fusion Unet multi-organ segmentation method, system, equipment and medium combining large-kernel convolution
CN109544681B (en) Fruit three-dimensional digitization method based on point cloud
US20230206603A1 (en) High-precision point cloud completion method based on deep learning and device thereof
CN109584156A (en) Micro- sequence image splicing method and device
CN107688783B (en) 3D image detection method and device, electronic equipment and computer readable medium
CN116563265B (en) Cardiac MRI (magnetic resonance imaging) segmentation method based on multi-scale attention and self-adaptive feature fusion
CN113591795A (en) Lightweight face detection method and system based on mixed attention feature pyramid structure
CN113313047B (en) Lane line detection method and system based on lane structure prior
CN110322403A (en) A kind of more supervision Image Super-resolution Reconstruction methods based on generation confrontation network
CN113177592B (en) Image segmentation method and device, computer equipment and storage medium
CN112669348A (en) Fish body posture estimation and fish body phenotype data measurement method and device
CN113343822B (en) Light field saliency target detection method based on 3D convolution
CN110648331A (en) Detection method for medical image segmentation, medical image segmentation method and device
CN114627290A (en) Mechanical part image segmentation algorithm based on improved DeepLabV3+ network
CN110211193A (en) Three dimensional CT interlayer image interpolation reparation and super-resolution processing method and device
CN114663880A (en) Three-dimensional target detection method based on multi-level cross-modal self-attention mechanism
CN113610178A (en) Inland ship target detection method and device based on video monitoring image
CN111696167A (en) Single image super-resolution reconstruction method guided by self-example learning
CN117437423A (en) Weak supervision medical image segmentation method and device based on SAM collaborative learning and cross-layer feature aggregation enhancement
CN113111740A (en) Characteristic weaving method for remote sensing image target detection

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination