CN109191476B - Novel biomedical image automatic segmentation method based on U-net network structure - Google Patents


Info

Publication number
CN109191476B
Authority
CN
China
Prior art keywords
layer
deformable
convolution
feature map
convolution layer
Prior art date
Legal status
Active
Application number
CN201811048857.5A
Other languages
Chinese (zh)
Other versions
CN109191476A (en)
Inventor
胡学刚
杨洪光
郑攀
王良晨
Current Assignee
Chongqing University of Posts and Telecommunications
Original Assignee
Chongqing University of Posts and Telecommunications
Priority date
Filing date
Publication date
Application filed by Chongqing University of Posts and Telecommunications
Priority to CN201811048857.5A
Publication of CN109191476A
Application granted
Publication of CN109191476B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/10 Segmentation; Edge detection
    • G06T7/11 Region-based segmentation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20081 Training; Learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20084 Artificial neural networks [ANN]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/30 Subject of image; Context of image processing
    • G06T2207/30004 Biomedical image processing

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

The invention belongs to the technical field of image processing and computer vision, and relates to a novel automatic biomedical image segmentation method based on a U-net network structure. The method comprises: dividing a biomedical data set into a training set and a test set, carrying out data amplification on the training set, and carrying out normalization preprocessing on the test set and on the amplified training set; inputting the images of the training set into an improved U-net network model, whose output passes through a softmax layer to generate a classification probability map; calculating the error between the classification probability map and the gold standard through a centralized loss function, and obtaining the weight parameters of the network model through gradient back-propagation; inputting the images of the test set into the improved U-net network model, whose output passes through the softmax layer to generate a classification probability map; and obtaining the segmentation result map of the image from the class probabilities in the classification probability map. The method solves the problem that, in the image segmentation process, simple samples contribute too much to the loss function so that difficult samples cannot be learned well.

Description

Novel biomedical image automatic segmentation method based on U-net network structure
Technical Field
The invention belongs to the technical field of image processing and computer vision, and particularly relates to a novel biomedical image automatic segmentation method based on a U-net network structure.
Background
Medical image segmentation is of great importance for three-dimensional positioning, three-dimensional visualization, surgical planning, computer-aided diagnosis and the like, and is one of the hot research fields of image processing and analysis. Segmentation methods comprise three types: manual, semi-automatic and automatic. Manual segmentation is time-consuming, depends on subjective factors such as the knowledge and experience of clinical experts, has poor repeatability, and cannot fully meet real-time clinical requirements. Semi-automatic segmentation adopts human-computer interaction and improves the segmentation speed to a certain extent, but it still depends on the observer, which limits its application in clinical practice. Automatic segmentation extracts the edges of the region of interest entirely by computer, completely avoids the influence of the observer's subjective factors, improves the data processing speed and has good repeatability. However, owing to the complex variation of target structures between individuals, and to the low contrast, noise and other effects of the various medical imaging modalities and techniques used in biomedicine, the variability of medical images is high. Therefore, the automatic segmentation of biomedical images has become one of the research hotspots of current image processing.
In recent years, pixel-based and structure-based methods have made substantial progress in biomedical image segmentation. These methods use manual features and a priori knowledge and achieve the desired results in some simple segmentation tasks, but they tend to perform poorly when applied to objects with complex, varying characteristics. Recently, Deep Neural Networks (DNNs), particularly fully convolutional neural networks (FCNs), have proved very effective for medical image segmentation and are considered a basic structure for image segmentation with deep learning. An FCN achieves segmentation by pixel classification, and its structure comprises a down-sampling part and an up-sampling part: the down-sampling part consists of convolution layers and max-pooling layers, and the up-sampling part consists of convolution layers and deconvolution (transposed convolution) layers. U-net is an FCN-based medical image segmentation method comprising an encoder and a decoder. The encoder and decoder correspond to the down-sampling and up-sampling parts of the FCN, respectively, and the decoder is connected to the encoder through skip connections to fuse detail features, thereby improving the segmentation effect; U-net won the ISBI 2015 cell segmentation challenge. Subsequently, a series of medical image segmentation methods based on the U-net structure have been proposed and successfully applied to clinical diagnosis.
In a segmentation network based on the U-net structure, the resolution of the input image is reduced after passing through the encoder, and the decoder generally recovers the resolution gradually and finally outputs a segmentation result map using one of two methods: deconvolution, or bilinear interpolation followed by a 2 × 2 convolution. However, the zero padding that deconvolution performs before convolving and the non-learnable nature of bilinear interpolation both limit the performance of the decoder. The complexity of the shape and size of the target is one of the major difficulties in biomedical image segmentation, and there are generally two approaches to it. The first is to use a manual feature-transformation algorithm with spatial invariance, such as the Scale-Invariant Feature Transform (SIFT); however, this approach tends to fail when the target variation is too complex. The second is data enhancement combined with a neural network capable of learning geometric transformations. Data enhancement increases the number of images in a data set by geometric transformations such as rotation, flipping and scaling, but it is very time-consuming and not suitable for targets with complex geometric variation. Spatial Transformer Networks (STNs), a convolutional neural network architecture proposed by Jaderberg et al., have good robustness and spatial invariance to translation, scaling, rotation, perturbation, bending and the like, and achieve good results on some small image classification tasks. STNs warp the feature map by learning global transformation parameters, such as an affine transformation; however, learning global transformation parameters is difficult and very time-consuming. Deformable convolution also has the ability to learn geometric transformations: it samples the feature map locally and densely to generate a new feature map that accommodates geometric changes in the image. Compared with STNs, deformable convolution is less computationally intensive and easier to train.
In addition, biomedical images often suffer from an unbalanced distribution of positive and negative samples, and some samples are difficult to segment; for example, samples in the edge area of a target are harder to segment than samples in its central area. These two problems can cause the loss function to converge to a poor optimum, reducing the generalization ability of the model. The centralized loss function (focal loss) was first applied to dense object detection to address the severe sample imbalance produced by the anchor mechanism, where simple samples contribute too much to the loss function and difficult samples cannot be learned well.
Disclosure of Invention
In view of the above, an object of the present invention is to provide a new method for automatically segmenting a biomedical image based on a U-net network structure, which utilizes deformable convolution to improve the capability of an encoder for extracting features, and provides a new upsampling method to enhance the capability of a decoder for recovering resolution and fusing features, and uses a centralized loss function to improve the capability of a model for learning a difficult sample, thereby finally improving the segmentation effect.
In order to achieve the aim, the invention provides a novel biomedical image automatic segmentation method based on a U-net network structure, which comprises the following steps:
S1: dividing a biomedical data set into a training set and a test set, carrying out data amplification on the training set, and carrying out normalization preprocessing on the test set and on the amplified training set;
S2: inputting the images of the training set into an improved U-net network model, the output of which passes through a softmax layer to generate a classification probability map with 2 channels and the same resolution as the input image;
S3: calculating the error between the classification probability map and the gold standard through a centralized loss function, and obtaining the weight parameters of the improved U-net network model through gradient back-propagation;
S4: inputting the images of the test set into the improved U-net network model trained in S3, the output of which passes through the softmax layer to generate a classification probability map;
S5: taking, for each pixel position, the class with the highest probability in the classification probability map as the class of that position, to obtain the segmentation result map of the image.
Further, step S1 specifically includes:
S11: rotating the image data in the training set by an angle in (−20°, 20°), and cropping the maximal inscribed rectangle of the rotated image;
S12: flipping the rotated image data up-down and left-right with a probability of 80%, then proceeding to step S13;
S13: elastically distorting the image data with a probability of 80%, then proceeding to step S14;
S14: scaling the image data by a factor in the range (50%, 80%) to complete the data amplification;
S15: calculating the mean and standard deviation of the image data in the test set and in the amplified training set, and processing the contrast of the images according to the contrast normalization formula, which is expressed as:
I=(I-Mean)/Std;
where I represents the contrast of the image, Mean represents the Mean of the image data, and Std represents the standard deviation of the image data.
Preferably, the improved U-net network model comprises a deformable encoder and a decoder network with a reconstructed upsampling structure, wherein the deformable encoder comprises an input layer, a first deformable convolution layer, a second deformable convolution layer, a first maximum pooling layer, a third deformable convolution layer, a fourth deformable convolution layer, a second maximum pooling layer, a fifth deformable convolution layer, a sixth deformable convolution layer, a third maximum pooling layer, a seventh deformable convolution layer, an eighth deformable convolution layer, a fourth maximum pooling layer and a ninth deformable convolution layer in sequence; the decoder network with the reconstruction upsampling structure comprises a first conventional convolutional layer, a first reconstruction upsampling layer, a second conventional convolutional layer, a third conventional convolutional layer, a second reconstruction upsampling layer, a fourth conventional convolutional layer, a fifth conventional convolutional layer, a third reconstruction upsampling layer, a sixth conventional convolutional layer, a seventh conventional convolutional layer, a fourth reconstruction upsampling layer, an eighth conventional convolutional layer, a ninth conventional convolutional layer and a tenth conventional convolutional layer, namely an output layer; the first conventional convolution layer is connected with the ninth deformable convolution layer, the first reconstructed up-sampling layer is spliced with the eighth deformable convolution layer, the second reconstructed up-sampling layer is spliced with the sixth deformable convolution layer, the third reconstructed up-sampling layer is spliced with the fourth deformable convolution layer, and the fourth reconstructed up-sampling layer is spliced with the second deformable convolution layer; group normalization is added before the activation function of each deformable convolution layer and conventional convolution layer.
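By way of illustration only, the following minimal TensorFlow/Keras sketch assembles this topology; standard Conv2D layers stand in for the deformable convolution layers (whose internal structure is detailed below), the reconstruction upsampling is realized with a depth-to-space rearrangement, and a recent TensorFlow release providing tf.keras.layers.GroupNormalization is assumed:

```python
import tensorflow as tf
from tensorflow.keras import layers

def conv_block(x, filters):
    # two 3x3 convolutions; group normalization precedes each activation,
    # mirroring the deformable/conventional layer pairs of the model
    for _ in range(2):
        x = layers.Conv2D(filters, 3, padding="same")(x)
        x = layers.GroupNormalization(groups=8)(x)
        x = layers.ReLU()(x)
    return x

def reconstruct_upsample(x):
    # 1x1 convolutions double the channels; depth-to-space then doubles the
    # resolution and quarters the channels, giving 2h x 2w x c/2 overall
    c = x.shape[-1]
    x = layers.Conv2D(2 * c, 1)(x)
    x = layers.GroupNormalization(groups=8)(x)
    x = layers.ReLU()(x)
    return tf.nn.depth_to_space(x, 2)

def build_model(input_shape=(512, 512, 1), n_classes=2):
    inp = layers.Input(input_shape)
    x, skips, f = inp, [], 16
    for _ in range(4):                         # C1..C8 with P1..P4
        x = conv_block(x, f)
        skips.append(x)
        x = layers.MaxPool2D(2)(x)
        f *= 2
    x = conv_block(x, f)                       # bottleneck (256 kernels)
    for skip in reversed(skips):               # R1..R4 spliced with the skips
        f //= 2
        x = reconstruct_upsample(x)
        x = layers.Concatenate()([x, skip])
        x = conv_block(x, f)
    out = layers.Conv2D(n_classes, 3, padding="same", activation="softmax")(x)
    return tf.keras.Model(inp, out)            # output layer plus softmax
```

In this sketch the encoder halves the resolution four times while doubling the kernel count from 16 to 256, matching the layer counts described above.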
Preferably, the operation of the deformable convolution layer comprises:
S21: inputting the feature map of size h × w × c into the deformable convolution layer, and convolving it with a convolution layer whose activation function is ELU;
S22: inputting the convolution result of S21 into a convolution layer whose activation function is tanh for a further convolution operation;
S23: reconstructing the convolution result of S22 to generate a 3h × 3w × 2 offset field;
S24: performing bilinear interpolation on the feature map using the offset field to generate a 3h × 3w × c feature map;
S25: inputting the 3h × 3w × c feature map into a 3 × 3 convolution layer with d convolution kernels and a stride of 3 to obtain an h × w × d feature map, namely the output of the deformable convolution.
Preferably, the operation of the reconstruction upsampling layer comprises:
S31: for a feature map with resolution h × w and c channels, first doubling the number of channels by means of 2c convolutions of size 1 × 1;
S32: passing the feature map output by S31 through a group normalization operation and a ReLU activation function to obtain an h × w × 2c feature map;
S33: dividing the feature map obtained in S32 into c/2 parts, each of size h × w × 4, and performing reconstruction upsampling on each part, finally generating a 2h × 2w × c/2 feature map, thereby completing an upsampling process in which the resolution is doubled and the number of channels is halved.
Preferably, the centralized loss function L_focal is expressed as:
L_focal = −(1/|Ω|) Σ_{x∈Ω} [ α·y(x)·(1−p(x))^γ·log p(x) + (1−α)·(1−y(x))·p(x)^γ·log(1−p(x)) ]
wherein α is a constant, a factor for coping with class imbalance; γ is a parameter greater than 0 that controls the gap between the contributions of difficult and easy samples to the loss function; y(x) denotes the gold-standard label at pixel x of the input feature map; p(x) denotes the predicted probability at pixel x; and Ω is the set of pixel positions of the input feature map. For positive samples, a larger p(x) indicates a simpler sample and the corresponding (1−p(x))^γ is smaller, reducing its contribution to the loss function; a smaller p(x) indicates a more difficult sample and the corresponding (1−p(x))^γ is larger, increasing its share of the loss function.
The invention has the beneficial effects that:
1) since biological targets are often non-rigid, the invention adopts an elastic distortion data amplification method in addition to rotation and flipping, which can greatly increase the number of pictures in the training set;
2) the invention adopts the U-net fully convolutional neural network structure, which can learn from data, performs image-to-image segmentation, and can fuse detail information with global information to further improve the segmentation effect;
3) since the fixed geometric structure of the conventional convolution used by the encoder limits its ability to learn geometric deformation, the invention adopts deformable convolution, which adaptively changes the convolution structure according to the data;
4) aiming at the shortcomings of the upsampling methods used by the decoder (deconvolution needs zero padding before convolution, and bilinear interpolation is not learnable), the invention proposes reconstruction upsampling, which requires no zero padding and is learnable;
5) aiming at the problems that biomedical images often have an unbalanced distribution of positive and negative samples and that some samples are difficult to segment (for example, samples in the edge area of a target are harder to segment than samples in the central area), the invention adopts a centralized loss function, which prevents simple samples from contributing too much to the loss function and allows difficult samples to be learned well.
Drawings
In order to make the object, technical scheme and beneficial effect of the invention more clear, the invention provides the following drawings for explanation:
FIG. 1 is a flow chart of the biomedical image automatic segmentation method based on the U-net network structure according to the present invention;
FIG. 2 is a schematic diagram of an improved U-net network architecture of the present invention;
FIG. 3 is a schematic diagram of a deformable convolution according to the present invention;
FIG. 4 is a schematic diagram of a reconstructed upsampling structure of the present invention;
FIG. 5 is one of the 30 cell image slices of the training set used in the present invention;
FIG. 6 is a schematic representation of a gold standard corresponding to the cell map of FIG. 5 according to the present invention;
FIG. 7 is a pictorial representation of slice 18 of the test set of the present invention;
FIG. 8 is a graph showing the effect of cell division obtained by the method of the present invention.
Detailed Description
The following further describes a new biomedical image automatic segmentation method based on a U-net network structure in the present invention with reference to the drawings of the specification.
The invention provides a biomedical image automatic segmentation method based on a U-net network structure, as shown in figure 1, comprising the following steps:
S1: dividing a biomedical data set into a training set and a test set, carrying out data amplification on the training set, and carrying out normalization preprocessing on the test set and on the amplified training set;
S2: inputting the images of the training set into an improved U-net network model, the output of which passes through a softmax layer to generate a classification probability map with 2 channels and the same resolution as the input image;
S3: calculating the error between the classification probability map and the gold standard through a centralized loss function, and obtaining the weight parameters of the improved U-net network model through gradient back-propagation;
S4: inputting the images of the test set into the improved U-net network model trained in S3, the output of which passes through the softmax layer to generate a classification probability map;
S5: taking, for each pixel position, the class with the highest probability in the classification probability map as the class of that position, to obtain the segmentation result map of the image.
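As a simple illustration of step S5, assuming prob_map is the 2-channel softmax output, the per-pixel argmax yields the segmentation result map (a minimal NumPy sketch):

```python
import numpy as np

def probability_to_segmentation(prob_map):
    # prob_map: (h, w, 2) classification probability map from the softmax layer;
    # the class with the highest probability becomes the label of each pixel
    return np.argmax(prob_map, axis=-1).astype(np.uint8)
```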
Because the deep neural network needs a large amount of data for training and the morphological change of the biological target is complex, the data amplification needs to be performed on the training set, and the data amplification on the training set comprises the following steps:
S11: rotating the image data in the training set by an angle in (−20°, 20°), and cropping the maximal inscribed rectangle of the rotated image;
S12: flipping the rotated image data up-down and left-right with a probability of 80%, then proceeding to step S13;
S13: elastically distorting the image data with a probability of 80%, then proceeding to step S14;
S14: scaling the image data by a factor in the range (50%, 80%), completing the data amplification.
Owing to factors such as equipment and imaging conditions, the brightness of the pictures in a data set is not uniform, or the contrast of the biological target is low because the picture is too bright or too dark, so the images are normalized. Normalization improves the contrast with the background while largely preserving the characteristics of the biological target, so that the target can be segmented by the subsequent deep neural network. The normalization processing is as follows:
S15: calculating the mean and standard deviation of the image data in the test set and in the amplified training set, and processing the contrast of the images according to the contrast normalization formula, which is expressed as:
I=(I-Mean)/Std;
wherein I represents the contrast of the image, Mean represents the Mean of the image data, and Std represents the standard deviation of the image data; the image data in this embodiment refers to the contrast of the image in the data set.
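For illustration, the following is a minimal sketch of the amplification and normalization steps S11-S15 using NumPy and SciPy; the rotation, flip, distortion and scaling ranges follow the description above, while the elastic-distortion parameters (alpha, sigma) and the helper names are illustrative assumptions, and the maximal-rectangle crop after rotation is simplified to reflective padding:

```python
import numpy as np
from scipy.ndimage import rotate, zoom, gaussian_filter, map_coordinates

rng = np.random.default_rng()

def elastic_distort(img, alpha=34.0, sigma=4.0):
    # Simard-style elastic distortion: a random displacement field is
    # smoothed with a Gaussian and scaled, then the image is resampled
    dx = gaussian_filter(rng.uniform(-1, 1, img.shape), sigma) * alpha
    dy = gaussian_filter(rng.uniform(-1, 1, img.shape), sigma) * alpha
    yy, xx = np.meshgrid(np.arange(img.shape[0]), np.arange(img.shape[1]),
                         indexing="ij")
    return map_coordinates(img, [yy + dy, xx + dx], order=1, mode="reflect")

def augment(img):
    img = rotate(img, rng.uniform(-20, 20), reshape=False, mode="reflect")  # S11 (crop simplified)
    if rng.random() < 0.8:                                                  # S12: flips
        img = np.flipud(img) if rng.random() < 0.5 else np.fliplr(img)
    if rng.random() < 0.8:                                                  # S13: elastic distortion
        img = elastic_distort(img)
    return zoom(img, rng.uniform(0.5, 0.8), order=1)                        # S14: (50%, 80%) scaling

def normalize(img):
    # S15: contrast normalization I = (I - Mean) / Std
    return (img - img.mean()) / (img.std() + 1e-8)
```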
The centralized loss function L_focal is expressed as:
L_focal = −(1/|Ω|) Σ_{x∈Ω} [ α·y(x)·(1−p(x))^γ·log p(x) + (1−α)·(1−y(x))·p(x)^γ·log(1−p(x)) ]
wherein α is a constant, a factor for coping with class imbalance; γ is a parameter greater than 0 that controls the gap between the contributions of difficult and easy samples to the loss function; y(x) denotes the gold-standard label at pixel x of the input feature map; p(x) denotes the predicted probability at pixel x; and Ω is the set of pixel positions of the input feature map. For positive samples, a larger p(x) indicates a simpler sample and the corresponding (1−p(x))^γ is smaller, reducing its contribution to the loss function; a smaller p(x) indicates a more difficult sample and the corresponding (1−p(x))^γ is larger, increasing its share of the loss function.
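A minimal NumPy sketch of this loss for binary segmentation follows (assumptions: y is the 0/1 gold standard and p the predicted foreground probability; α = 0.25 and γ = 2 are typical values, not fixed by the description):

```python
import numpy as np

def focal_loss(p, y, alpha=0.25, gamma=2.0, eps=1e-7):
    # p, y: arrays of per-pixel probabilities and gold-standard labels
    p = np.clip(p, eps, 1.0 - eps)
    pos = alpha * y * (1.0 - p) ** gamma * np.log(p)                # hard positives weigh more
    neg = (1.0 - alpha) * (1.0 - y) * p ** gamma * np.log(1.0 - p)  # and likewise hard negatives
    return -np.mean(pos + neg)
```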
An improved U-net network structure is shown in fig. 2, where C1-C9 correspond to the first to ninth deformable convolution layers, P1-P4 to the first to fourth max-pooling layers, R1-R4 to the first to fourth reconstruction upsampling layers, and C10-C19 to the first to tenth conventional convolution layers; the number of convolution kernels of the tenth conventional convolution layer equals the number of categories. The left part of fig. 2 (from the input layer to C9) is the encoder, and the white boxes represent the feature maps of the encoder. Taking the normalized image as input, the first two deformable convolution layers C1 and C2 have 16 convolution kernels each; the resolution of the feature map is halved after every max-pooling layer, and the two deformable convolution layers following a pooling layer have twice as many convolution kernels as the two preceding it. The deformable convolution can be expressed by the following equation:
y(p_0) = Σ_{p_n∈R} w(p_n) · x(p_0 + p_n + Δp_n)
where y(p_0) denotes the value at pixel p_0 of the output feature map y, x(p_0 + p_n + Δp_n) denotes the value at position p_0 + p_n + Δp_n of the input feature map x, w(p_n) is the convolution weight at p_n, p_n enumerates the positions of the regular sampling grid R = {(−1, −1), (−1, 0), (−1, 1), (0, −1), (0, 0), (0, 1), (1, −1), (1, 0), (1, 1)}, and Δp_n is the offset, usually fractional, so that x(p_0 + p_n + Δp_n) is computed by bilinear interpolation.
The structure of the deformable convolution is shown in fig. 3. The deformable convolution operates on the input feature map through additional convolutions: an input feature map of size h × w × c first passes through two convolution layers with 18 convolution kernels each. The activation function of the first convolution layer is the Exponential Linear Unit (ELU); unlike the Rectified Linear Unit (ReLU), it still produces output for negative inputs, so more information is kept and passed to the second convolution layer. The activation function of the second 3 × 3 convolution layer is replaced by the hyperbolic tangent (tanh), so that its output is mapped into (−1, 1) and the corresponding offsets lie in (−1, 1).
new feature map obtained from the second convolutional layerAnd is reconstructed into a 3h multiplied by 3w multiplied by 2 offset domain, two channels of which correspond to the x coordinate offset and the y coordinate offset respectively, and each 3 multiplied by 3 area represents the offset of one convolution (corresponding to delta p in the deformable convolution)n). The offset field and the mesh field are added to obtain p0+pn+ΔpnCarrying out bilinear interpolation operation on the input characteristic diagram to expand the resolution of the input characteristic diagram by three times, and carrying out d convolution with the step length of 3 to generate an output characteristic diagram, wherein d is the number of deformable convolution kernels in the figure 2; the deformable convolution can adapt to the geometric change of the image according to the self-adaptive change of the data, and the calculation amount is increased slightly; in the training process, extra convolution kernels of an offset domain and convolution kernels of an output characteristic graph are obtained through learning at the same time, and in order to learn offset, gradients can be propagated backwards through formulas of bidirectional interpolation operation.
The right part of fig. 2 (from C10 to C19) is the decoder, and the grey boxes represent the feature maps generated by the decoder. With the output of C9 as input, the number of convolution kernels of C10 is the same as that of C9, namely 256, and the resolution of the feature map doubles after each reconstruction upsampling layer. As shown in fig. 4, for a feature map with resolution h × w and c channels, the number of channels is first doubled by 2c 1 × 1 convolutions, and a new h × w × 2c feature map is obtained after a group normalization operation and a ReLU activation function. The new feature map is then divided into c/2 parts (represented by different grey levels in fig. 4), and a reconstruction operation is performed on each part so that each h × w × 4 part becomes 2h × 2w × 1; finally a 2h × 2w × c/2 feature map, i.e., the upsampled feature map, is generated, completing an upsampling process in which the resolution is doubled and the number of channels is halved. The reconstruction upsampling method predicts the value of each pixel of the upsampled feature map by convolution directly on the input feature map; it is learnable and needs no zero filling, and is therefore more efficient than the deconvolution and bilinear-interpolation upsampling methods. In addition, the 1 × 1 convolution is lightweight, so this method also has fewer parameters and less computation than the other two.
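The reconstruction of one h × w × 4 part into a 2h × 2w × 1 map is a pixel-shuffle-style rearrangement; a minimal NumPy sketch follows (the row-major ordering of the four channels within each 2 × 2 block is an assumption, one of several equivalent choices):

```python
import numpy as np

def reconstruct_part(part):
    # part: (h, w, 4); each group of four channels fills one 2x2 output block
    h, w, _ = part.shape
    out = part.reshape(h, w, 2, 2)   # channel index -> 2x2 sub-block position
    out = out.transpose(0, 2, 1, 3)  # interleave block rows and columns
    return out.reshape(2 * h, 2 * w)
```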
The output of C19 is converted to a class probability map via the softmax function.
Unless otherwise specified, the convolutions used in the present invention are all 3 × 3 with a stride of 1 and ReLU activation; the convolution layers keep the feature-map resolution unchanged by zero padding.
During training, the error between the classification probability map and the gold standard is calculated through the centralized loss function, the weight parameters of the model are updated by gradient back-propagation using the Adam optimization algorithm, and finally a converged neural network model is obtained.
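For illustration, one training step under these settings might look as follows; this is a sketch assuming the model above, with the Adam learning rate and L2 decay factor taken from the embodiment below, and with a TensorFlow rendering of the centralized loss (its α and γ defaults are illustrative):

```python
import tensorflow as tf

def focal_loss_tf(probs, labels, alpha=0.25, gamma=2.0):
    # TensorFlow version of the centralized loss (binary case, see above)
    p = tf.clip_by_value(probs[..., 1], 1e-7, 1.0 - 1e-7)   # foreground channel
    y = tf.cast(labels, tf.float32)
    pos = alpha * y * (1.0 - p) ** gamma * tf.math.log(p)
    neg = (1.0 - alpha) * (1.0 - y) * p ** gamma * tf.math.log(1.0 - p)
    return -tf.reduce_mean(pos + neg)

optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3)    # initial rate per the embodiment

@tf.function
def train_step(model, images, labels):
    with tf.GradientTape() as tape:
        probs = model(images, training=True)                # softmax probability map
        loss = focal_loss_tf(probs, labels)
        loss += 5e-4 * tf.add_n(                            # L2 regularization (decay 0.0005)
            [tf.nn.l2_loss(w) for w in model.trainable_weights])
    grads = tape.gradient(loss, model.trainable_weights)    # gradient back-propagation
    optimizer.apply_gradients(zip(grads, model.trainable_weights))
    return loss
```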
During testing, the trained network model is loaded, a test image with any size and subjected to normalization processing is input, and the output of the improved U-net network structure passes through the softmax layer to generate a classification probability map.
Example 1
In this embodiment, the TensorFlow open source deep learning library is adopted, an NVIDIA Tesla M40 GPU is used for acceleration, and the model is trained with the Adam optimization algorithm, with an initial learning rate of 0.001, a poly learning rate decay strategy, and L2 regularization (decay factor 0.0005) to reduce overfitting; the experiments were performed on the Drosophila EM dataset provided by the ISBI 2012 electron microscopy cell segmentation challenge.
The training data set of this example consists of 30 serial electron microscope sections of the Drosophila first instar larva ventral nerve cord; each section is a 512 × 512 pixel image with a corresponding segmentation gold standard. As shown in figs. 5-6, in the gold standard images white represents cells and black represents cell membranes; the test set consists of thirty further images. Since deep learning requires a large amount of data for training, data enhancement methods such as random flipping, rotation and elastic distortion are adopted to increase the number of images in the training set.
In this example, the ISBI competition organizer provides two evaluation indices, V_Rand and V_Info; the closer these indices are to 1, the more accurate the segmentation. As can be seen from Table 1, compared with U-net, the model used in the method of the present invention has 30M fewer parameters while both V_Rand and V_Info are improved; the V_Rand of the method is improved by 0.57% over U-net, reaching 97.84%. Moreover, unlike the U-net loss function, the centralized loss function needs no complex weight map, so the training time is greatly reduced.
Further, as shown by the arrows in fig. 7, the boundary of the cell nucleus is very close to the cell boundary, which makes the cell boundary difficult to detect. As shown in fig. 8, by using the centralized loss function the method of the present invention classifies such cells accurately and better preserves the continuity of cell boundaries, which is very important in cell segmentation.
TABLE 1 comparison of the method herein with the U-net method in EM dataset experiments
Table 2 compares the method herein with some of the best results on the ranking board of the ISBI cell segmentation competition. Among them, the M2FCN method adopts a multi-stage network structure whose training process is very complicated; other methods employ post-processing or averaging of multiple trained models to improve the segmentation. On the index V_Rand the method of the present invention performs best, while on the index V_Info the FusionNet method is best. In fact, the post-processing used in IALIC and CUMedVision could also be applied in the method of the present invention to enhance segmentation performance. Meanwhile, the invention could adopt a ResNet structure like FusionNet, further increasing the depth of the model and improving its robustness.
TABLE 2 experimental comparison of the method of the invention with other methods on EM data sets
Example 2
In this example, the experiments were performed using the Warwick-QU dataset provided by the GlaS gland segmentation challenge, a dataset different from that of Example 1. The dataset contains 165 original (stained) images, each corresponding to an expert-labeled gold standard image; the 85 images of the training set are used for training, and verification is performed on test set A and test set B.
During training, the original images are randomly cropped into 512 × 512 image blocks (when the length or width of an original image is smaller than 512, it is padded with zeros), so that the images in each batch have a consistent size; the random cropping also serves as a data enhancement method and reduces overfitting. In addition, the same data enhancement methods as in the Drosophila EM experiment are adopted.
The F1 score and the object-level Hausdorff distance are used as evaluation indicators. The F1 score evaluates gland detection: a segmented individual is considered a true positive if it overlaps its gold-standard individual by at least 50%, and a false positive otherwise; a gold-standard individual with no overlapping segmented individual, or whose overlap area is less than 50%, counts as a false negative.
The Hausdorff distance measures the shape similarity between a segmented individual and the corresponding gold-standard individual; the smaller this index, the greater the shape similarity between the segmentation result and the gold standard, and the better the segmentation effect.
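By way of illustration, a minimal sketch of these two indicators follows (assumptions: pred and gold are integer label maps of segmented individuals with 0 as background; the 50% rule is interpreted as the overlap covering at least half of the predicted object; scipy supplies directed_hausdorff):

```python
import numpy as np
from scipy.spatial.distance import directed_hausdorff

def detection_f1(pred, gold):
    # pred, gold: (h, w) integer label maps; 0 = background
    pred_ids = [i for i in np.unique(pred) if i != 0]
    gold_ids = [j for j in np.unique(gold) if j != 0]
    tp = 0
    for i in pred_ids:
        mask = pred == i
        # best-overlapping gold object for this predicted object
        overlap = max((np.sum(mask & (gold == j)) for j in gold_ids), default=0)
        if overlap >= 0.5 * mask.sum():     # 50% overlap rule (one reading of it)
            tp += 1
    fp, fn = len(pred_ids) - tp, len(gold_ids) - tp
    return 2 * tp / (2 * tp + fp + fn) if (tp + fp + fn) else 1.0

def object_hausdorff(points_a, points_b):
    # symmetric Hausdorff distance between two objects' boundary point sets,
    # each given as an (n, 2) array of pixel coordinates
    return max(directed_hausdorff(points_a, points_b)[0],
               directed_hausdorff(points_b, points_a)[0])
```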
As shown in table 3, the F1 scores of the segmentation results of the present invention on the two test sets are improved by 0.023 and 0.027, respectively, compared with those of the Freiburg method, indicating that the method of the present invention is better at gland detection. Note that all methods perform far worse on test set B than on test set A, mainly because malignant cases account for 80% of test set B and their irregular, complex structures make gland detection more difficult. Table 3 also shows that the method of the present invention achieves the largest improvement in object Hausdorff distance on test set A, indicating that its segmentation results have a higher shape similarity to the gold standard and demonstrating that the deformable convolution has a stronger ability to learn target deformation.
TABLE 3 comparison of the results of experiments on the Warwick-QU dataset by the method of the invention with other methods
The above experiments show that the method is not only effective but also has obvious advantages over prior similar methods.
Those skilled in the art will appreciate that all or part of the steps in the methods of the above embodiments may be implemented by associated hardware instructed by a program, which may be stored in a computer-readable storage medium, and the storage medium may include: ROM, RAM, magnetic or optical disks, and the like.
The above-mentioned embodiments, which further illustrate the objects, technical solutions and advantages of the present invention, should be understood that the above-mentioned embodiments are only preferred embodiments of the present invention, and should not be construed as limiting the present invention, and any modifications, equivalents, improvements, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (4)

1. A novel biomedical image automatic segmentation method based on a U-net network structure, characterized by comprising the following steps:
S1: dividing a biomedical data set into a training set and a test set, carrying out data amplification on the training set, and carrying out normalization preprocessing on the test set and on the training set after the amplification;
S2: inputting the images of the training set into an improved U-net network model, the output of which passes through a softmax layer to generate a classification probability map with 2 channels and the same resolution as the input image;
S3: calculating the error between the classification probability map and the gold standard through focal loss, and obtaining the weight parameters of the improved U-net network model through gradient back-propagation;
S4: inputting the images of the test set into the improved U-net network model trained in S3, the output of which passes through the softmax layer to generate a classification probability map;
S5: taking, for each pixel position, the class with the highest probability in the classification probability map as the class of that position, to obtain the segmentation result map of the image;
the improved U-net network model comprises a deformable encoder and a decoder network with a reconstructed up-sampling structure, wherein the deformable encoder sequentially comprises an input layer, a first deformable convolution layer, a second deformable convolution layer, a first maximum pooling layer, a third deformable convolution layer, a fourth deformable convolution layer, a second maximum pooling layer, a fifth deformable convolution layer, a sixth deformable convolution layer, a third maximum pooling layer, a seventh deformable convolution layer, an eighth deformable convolution layer, a fourth maximum pooling layer and a ninth deformable convolution layer; the decoder network with the reconstruction upsampling structure comprises a first conventional convolutional layer, a first reconstruction upsampling layer, a second conventional convolutional layer, a third conventional convolutional layer, a second reconstruction upsampling layer, a fourth conventional convolutional layer, a fifth conventional convolutional layer, a third reconstruction upsampling layer, a sixth conventional convolutional layer, a seventh conventional convolutional layer, a fourth reconstruction upsampling layer, an eighth conventional convolutional layer, a ninth conventional convolutional layer and a tenth conventional convolutional layer, namely an output layer; the first conventional convolution layer is connected with the ninth deformable convolution layer, the first reconstructed up-sampling layer is spliced with the eighth deformable convolution layer, the second reconstructed up-sampling layer is spliced with the sixth deformable convolution layer, the third reconstructed up-sampling layer is spliced with the fourth deformable convolution layer, and the fourth reconstructed up-sampling layer is spliced with the second deformable convolution layer; before the activation function of each deformable convolution layer and the conventional convolution layer, adding group standardization;
the operation of the reconstruction upsampling layer comprises:
S31: for a feature map with resolution h × w and c channels, first doubling the number of channels by means of 2c convolutions of size 1 × 1;
S32: passing the feature map output by S31 through a group normalization operation and a ReLU activation function to obtain an h × w × 2c feature map;
S33: dividing the feature map obtained in S32 into c/2 parts, each of size h × w × 4, and performing reconstruction upsampling on each part, finally generating a 2h × 2w × c/2 feature map, thereby completing an upsampling process in which the resolution is doubled and the number of channels is halved.
2. The novel biomedical image automatic segmentation method based on a U-net network structure according to claim 1, wherein step S1 specifically comprises:
S11: rotating the image data in the training set by an angle in (−20°, 20°), and cropping the maximal inscribed rectangle of the rotated image;
S12: flipping the rotated image data up-down and left-right with a probability of 80%, then proceeding to step S13;
S13: elastically distorting the image data with a probability of 80%, then proceeding to step S14;
S14: scaling the image data by a factor in the range (50%, 80%) to complete the data amplification;
S15: calculating the mean and standard deviation of the image data in the test set and in the amplified training set, and processing the contrast of the images according to the contrast normalization formula, which is expressed as:
I=(I-Mean)/Std;
where I represents the contrast of the image, Mean represents the Mean of the image data, and Std represents the standard deviation of the image data.
3. The novel biomedical image automatic segmentation method based on a U-net network structure according to claim 1, characterized in that the operation of the deformable convolution layer comprises:
S21: inputting the feature map of size h × w × c into the deformable convolution layer, and convolving it with a convolution layer whose activation function is ELU;
S22: inputting the convolution result of S21 into a convolution layer whose activation function is tanh for a further convolution operation;
S23: reconstructing the convolution result of S22 to generate a 3h × 3w × 2 offset field;
S24: performing bilinear interpolation on the feature map using the offset field to generate a 3h × 3w × c feature map;
S25: inputting the 3h × 3w × c feature map into a 3 × 3 convolution layer with d convolution kernels and a stride of 3 to obtain an h × w × d feature map, namely the output of the deformable convolution.
4. The novel biomedical image automatic segmentation method based on a U-net network structure according to claim 1, characterized in that the centralized loss function L_focal is expressed as:
L_focal = −(1/|Ω|) Σ_{x∈Ω} [ α·y(x)·(1−p(x))^γ·log p(x) + (1−α)·(1−y(x))·p(x)^γ·log(1−p(x)) ]
wherein α is a constant, a factor for coping with class imbalance; γ is a parameter controlling the gap between the contributions of difficult and easy samples to the loss function, with γ > 0; y(x) denotes the gold-standard label at pixel x of the input feature map; p(x) denotes the predicted probability at pixel x; and Ω is the set of pixel positions of the input feature map.
CN201811048857.5A 2018-09-10 2018-09-10 Novel biomedical image automatic segmentation method based on U-net network structure Active CN109191476B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811048857.5A CN109191476B (en) 2018-09-10 2018-09-10 Novel biomedical image automatic segmentation method based on U-net network structure

Publications (2)

Publication Number Publication Date
CN109191476A CN109191476A (en) 2019-01-11
CN109191476B true CN109191476B (en) 2022-03-11

Family

ID=64915609

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811048857.5A Active CN109191476B (en) 2018-09-10 2018-09-10 Novel biomedical image automatic segmentation method based on U-net network structure

Country Status (1)

Country Link
CN (1) CN109191476B (en)



Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106920227A (en) * 2016-12-27 2017-07-04 北京工业大学 Based on the Segmentation Method of Retinal Blood Vessels that deep learning is combined with conventional method
CN107016665A (en) * 2017-02-16 2017-08-04 浙江大学 A kind of CT pulmonary nodule detection methods based on depth convolutional neural networks
CN108154196A (en) * 2018-01-19 2018-06-12 百度在线网络技术(北京)有限公司 For exporting the method and apparatus of image
CN108346145A (en) * 2018-01-31 2018-07-31 浙江大学 The recognition methods of unconventional cell in a kind of pathological section

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
"Deformable Convolutional Networks";Jifeng Dai 等;《2017 IEEE International Conference on Computer Vision (ICCV)》;20171225;第764-773页 *
"Focal Loss for Dense Object Detection";Tsung-Yi Lin 等;《2017 IEEE International Conference on Computer Vision (ICCV)》;20171225;第2980-2988页 *
"IMAGE SEGMENTATION AND CLASSIFICATION FOR SICKLE CELL DISEASE USING DEFORMABLE U-NET";Mo Zhang等;《arXiv》;20171029;第1-10页 *
"基于 U-net 网络的肺部肿瘤图像分割算法研究";周鲁科 等;《信息与电脑》;20180331(第05期);第41-44页 *

Also Published As

Publication number Publication date
CN109191476A (en) 2019-01-11


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant