CN111598095B - Urban road scene semantic segmentation method based on deep learning - Google Patents

Urban road scene semantic segmentation method based on deep learning

Info

Publication number
CN111598095B
CN111598095B (application CN202010156966.XA)
Authority
CN
China
Prior art keywords
image
layer
residual error
network
images
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010156966.XA
Other languages
Chinese (zh)
Other versions
CN111598095A (en)
Inventor
宋秀兰
魏定杰
孙云坤
何德峰
余世明
卢为党
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University of Technology ZJUT
Original Assignee
Zhejiang University of Technology ZJUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University of Technology ZJUT filed Critical Zhejiang University of Technology ZJUT
Priority to CN202010156966.XA priority Critical patent/CN111598095B/en
Publication of CN111598095A publication Critical patent/CN111598095A/en
Application granted granted Critical
Publication of CN111598095B publication Critical patent/CN111598095B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/26Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V10/267Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00Geometric image transformations in the plane of the image
    • G06T3/02Affine transformations
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00Geometric image transformations in the plane of the image
    • G06T3/40Scaling of whole images or parts thereof, e.g. expanding or contracting
    • G06T3/4038Image mosaicing, e.g. composing plane images from plane sub-images
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00Geometric image transformations in the plane of the image
    • G06T3/40Scaling of whole images or parts thereof, e.g. expanding or contracting
    • G06T3/4084Scaling of whole images or parts thereof, e.g. expanding or contracting in the transform domain, e.g. fast Fourier transform [FFT] domain scaling
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/90Determination of colour characteristics
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2200/00Indexing scheme for image data processing or generation, in general
    • G06T2200/32Indexing scheme for image data processing or generation, in general involving image mosaicing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

A deep-learning-based urban road scene semantic segmentation method comprises the following steps: 1) acquiring images ahead of the vehicle; 2) expanding the annotated images and the original input images: randomly cropping, stitching, or adding different types of noise to the images, transforming them with an image affine matrix, and finally restoring the original resolution through padding and cropping to obtain a data set; 3) training the network with the expanded images and the annotated images, where the residual U-net comprises a down-sampling part, a bridge part, an up-sampling part, and a classification part; 4) adjusting the acquisition interval T, feeding subsequently captured images to the trained deep learning model, outputting the predicted semantic segmentation images, and returning the gray levels in those images to the processor. The invention trains on a smaller data set, mitigates the problem of the gradient decreasing too quickly, and ensures that overfitting does not occur during training.

Description

Urban road scene semantic segmentation method based on deep learning
Technical Field
The invention belongs to the field of intelligent vehicles, and discloses an urban road scene semantic segmentation method based on deep learning.
Background
In recent years, with continuing urbanization, urban road conditions have become increasingly complex: pedestrians, traffic lights, zebra crossings, and different kinds of vehicles all affect the speed and obstacle-avoidance behaviour of an intelligent vehicle. Deep-learning-based semantic segmentation allows the vehicle to recognize its surroundings well and to react accordingly. Semantic segmentation assigns a predefined category to every pixel of an image, so that an intelligent vehicle can understand its surroundings in real time while driving, which reduces traffic accidents. Research on deep learning for urban road environments has therefore long been a focus of vehicle-intelligence research. Existing deep-learning semantic segmentation work studies neural networks such as SegNet, FCN, and ResNet. Although these networks need no conventional hand-designed recognition pipeline (they learn features automatically rather than relying on features designed by engineers, and a suitable model that outputs semantic segmentation results can be obtained by training on large numbers of images), the following problems arise during training: 1. too many weights cause overfitting; 2. the large number of network layers can make the gradient decrease rapidly; 3. the large data set required makes training time long. These problems make it difficult for a deep-learning network to output accurate semantic segmentation results, so an intelligent vehicle struggles to obtain real-time feedback about its surroundings under complex road conditions, which creates a safety hazard. It is therefore valuable to design a network that uses a smaller data set while preventing the gradient from decreasing too quickly and ensuring that overfitting does not occur during training.
Disclosure of Invention
To overcome the shortcomings of the prior art and to let an intelligent vehicle better recognize its surroundings in complex environments such as urban roads, the invention provides a deep-learning-based method for semantic segmentation of urban road scenes.
The technical scheme adopted by the invention for solving the technical problem is as follows:
a deep learning-based urban road scene semantic segmentation method comprises the following steps:
1) Image acquisition at the front of the vehicle: urban road images are collected at a regular time interval T, and each image with resolution h × w is passed to an image detection module to obtain valid images; the valid images are then fed to an annotation module, which uses the publicly available annotation software Labelme 3.11.2. Its scene-segmentation annotation function is used to outline vehicles, pedestrians, bicycles, traffic lights, and neon lights in the image and label them as different categories. The generated annotation image encodes the different object categories as different gray levels, from which a gray-level list and the number of object categories K are obtained;
2) Expanding the annotated images and the original input images: the images are randomly cropped, stitched, or corrupted with different types of noise, and then transformed with an image affine matrix; the affine transformation is given by equation (1):
$$\begin{pmatrix} a' \\ b' \\ 1 \end{pmatrix} = \begin{pmatrix} c_1 & c_2 & s_x \\ c_3 & c_4 & s_y \\ 0 & 0 & 1 \end{pmatrix}\begin{pmatrix} a \\ b \\ 1 \end{pmatrix} \qquad (1)$$

In the affine matrix, s_x is the horizontal translation and s_y the vertical translation, c_1 scales the image abscissa, c_4 scales the ordinate, and c_2 and c_3 control the shear transformation; (a, b) is the original pixel position and (a', b') the transformed position. Finally, padding and cropping restore the original image resolution, yielding the data set;
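To make the augmentation step concrete, here is a minimal sketch of one random affine transform applied jointly to an image and its annotation map; NumPy, OpenCV's warpAffine, the function name random_affine, and the parameter ranges are all illustrative assumptions, not part of the patent.

```python
import numpy as np
import cv2  # assumed here for the warp; any affine-warp routine would do

def random_affine(image, label):
    """Apply one random affine transform of equation (1) to an image and its label map."""
    h, w = image.shape[:2]
    c1, c4 = np.random.uniform(0.9, 1.1, size=2)   # scaling of abscissa / ordinate (assumed range)
    c2, c3 = np.random.uniform(-0.1, 0.1, size=2)  # shear terms (assumed range)
    sx, sy = np.random.uniform(-0.05, 0.05, size=2) * (w, h)  # translations in pixels
    M = np.array([[c1, c2, sx],
                  [c3, c4, sy]], dtype=np.float32)  # top two rows of the affine matrix in (1)
    warped_img = cv2.warpAffine(image, M, (w, h), flags=cv2.INTER_LINEAR)
    # the label map must use nearest-neighbour interpolation so gray levels stay valid class codes
    warped_lbl = cv2.warpAffine(label, M, (w, h), flags=cv2.INTER_NEAREST)
    return warped_img, warped_lbl
```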
3) Network training with the expanded images and the annotated images: the residual U-net network consists of four parts, namely a down-sampling part, a bridge part, an up-sampling part, and a classification part;
The training parameters are the image length h, image width w, loss function value L, number of network iterations epochs, batch size batch_size, and validation-set proportion rate. The data set is split into a training set and a validation set according to rate. During training, batches of size batch_size are fed into the residual U-net network; L is computed from the predicted images output by the network and the actual annotation images, and back-propagation adjusts the parameters in the network so that L tends towards a minimum. The network is trained repeatedly for the given number of iterations, with the network parameters tuned on the validation set during the iterations, finally yielding an optimal network model.
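A minimal sketch of this training procedure follows, written with PyTorch purely for illustration (the patent does not name a framework); the function name, optimizer choice, learning rate, and default hyperparameter values are assumptions.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, random_split

def train(model, dataset, epochs=30, batch_size=4, rate=0.1, lr=1e-3,
          device="cuda" if torch.cuda.is_available() else "cpu"):
    n_val = int(len(dataset) * rate)                      # validation-set proportion `rate`
    train_set, val_set = random_split(dataset, [len(dataset) - n_val, n_val])
    train_loader = DataLoader(train_set, batch_size=batch_size, shuffle=True)
    val_loader = DataLoader(val_set, batch_size=batch_size)
    criterion = nn.CrossEntropyLoss()                     # the per-pixel loss L of equation (3)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    model.to(device)
    for epoch in range(epochs):
        model.train()
        for images, labels in train_loader:               # batches of size batch_size
            images, labels = images.to(device), labels.to(device)
            loss = criterion(model(images), labels)       # L from prediction vs. annotation image
            optimizer.zero_grad()
            loss.backward()                               # back-propagate and adjust parameters
            optimizer.step()
        model.eval()
        with torch.no_grad():                             # validation pass used to tune the network
            val_loss = sum(criterion(model(x.to(device)), y.to(device)).item()
                           for x, y in val_loader) / max(len(val_loader), 1)
        print(f"epoch {epoch + 1}/{epochs}  val_loss={val_loss:.4f}")
    return model
```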
4) Road condition classification: the acquisition interval T of the acquisition module is adjusted, subsequently captured images are fed into the trained deep learning model, the predicted semantic segmentation images are output, and the gray levels in those images are returned to the processor, so the vehicle can identify which categories of objects lie ahead and react accordingly.
Further, in step 3), the down-sampling part is divided into four stages, each consisting of a residual network (the first- to fourth-stage residual networks). The layers of the first-stage residual network are connected in the order: convolution layer, batch normalization layer, softmax function layer, convolution layer, fusion layer; the input image and the processed feature image are fused in the fusion layer through an identity connection. The second- to fourth-stage residual networks share the same form, with the connection order: batch normalization layer, softmax function layer, convolution layer, batch normalization layer, softmax function layer, convolution layer, fusion layer; the input feature image and the processed feature image are likewise fused in the fusion layer through an identity connection. The convolution layers use 3 × 3 kernels, and the two convolution layers of each stage have 64, 128, 256, and 512 channels respectively. The stages are connected by 2 × 2 pooling layers with stride 2, and the channel dimension of each pooling layer matches that of the convolution layers of its stage.
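Read literally, one encoder residual stage could be sketched roughly as follows (PyTorch, for illustration only). The patent lists a "softmax function layer" inside the block; it is treated here as an elementwise activation and replaced by Softplus, matching the bridge description, and a 1 × 1 convolution is used on the shortcut whenever the channel count changes, since a pure identity cannot be added across differing channel counts. Both readings, and the 3-channel RGB input, are assumptions.

```python
import torch
from torch import nn

class EncoderResidualBlock(nn.Module):
    """Second- to fourth-stage form: BN -> activation -> conv -> BN -> activation -> conv, fused with the input."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.BatchNorm2d(in_ch), nn.Softplus(),
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch), nn.Softplus(),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
        )
        # identity connection; a 1x1 conv is assumed whenever the channel count changes
        self.shortcut = (nn.Identity() if in_ch == out_ch
                         else nn.Conv2d(in_ch, out_ch, kernel_size=1))

    def forward(self, x):
        return self.body(x) + self.shortcut(x)  # fusion layer: add input and processed features

# four encoder stages with 64, 128, 256 and 512 channels (input assumed to be a 3-channel RGB image),
# each stage followed by 2x2 max pooling with stride 2
encoder = nn.ModuleList([EncoderResidualBlock(c_in, c_out)
                         for c_in, c_out in [(3, 64), (64, 128), (128, 256), (256, 512)]])
pool = nn.MaxPool2d(kernel_size=2, stride=2)
```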
Further, in step 3), the bridge part prepares the network's high- and low-level feature information for splicing. It consists of two batch normalization layers, two softplus function layers, and two 3 × 3 convolution layers with 1024 channels; it has no fusion layer, so no identity connection is needed, and the connection order of its layers is otherwise the same as in the second-stage residual network. Finally, an up-sampling layer resizes the feature image to the size required for splicing.
Furthermore, in step 3), the up-sampling part also consists of four stages of residual networks (the fifth- to eighth-stage residual networks). Their form and the connection order of the layers are essentially the same as in the down-sampling stages, except that the identity connections of the fifth- to seventh-stage residual networks are replaced by 1 × 1 convolution layers, while the eighth-stage residual network is unchanged. The convolution layers of the up-sampling residual networks have 512, 256, 128, and 64 channels respectively. The stages are connected by up-sampling layers and splicing layers; each splicing layer concatenates high- and low-level features of matching size, as follows (see the sketch after this list):
(3.1) the feature image output by the fourth-stage residual network, after its pooling layer, is spliced with the feature image output by the bridge part;
(3.2) the feature image output by the third-stage residual network, after its pooling layer, is spliced with the feature image output by the fifth-stage residual network after its up-sampling layer;
(3.3) the feature image output by the second-stage residual network, after its pooling layer, is spliced with the feature image output by the sixth-stage residual network after its up-sampling layer;
(3.4) the feature image output by the first-stage residual network, after its pooling layer, is spliced with the feature image output by the seventh-stage residual network after its up-sampling layer;
the dimension of the spliced feature images changes, the dimension of the feature images is adjusted by using the 1 × 1 convolutional layers instead of the identity connection, the dimension of the four 1 × 1 convolutional layers is respectively 512, 256, 128 and 64, and finally the feature images are fused in the fusion layers.
In step 3), the classification part consists of a 1 × 1 convolution layer and a softmax layer. Since urban road image segmentation involves six classes (vehicle, pedestrian, bicycle, traffic light, neon light, and background), the 1 × 1 convolution layer produces a 6-channel feature image; because its raw pixel values are not probabilities, the softmax layer converts the output into a probability distribution. The softmax function is given by equation (2):
$$g_k(x) = \frac{\exp\left(d_k(x)\right)}{\sum_{k'=1}^{K}\exp\left(d_{k'}(x)\right)} \qquad (2)$$

where d_k(x) is the value of pixel x on channel k, K is the number of object classes, and g_k(x) ∈ [0, 1] is the probability that pixel x belongs to class k; the channel with the highest probability gives the predicted class;
the deviation of the prediction from the actual is then evaluated using a cross-entropy loss function, see equation (3):
$$L = -\sum_{x}\log g_{t(x)}(x) \qquad (3)$$

where t(x) is the true class of pixel x, so g_{t(x)}(x) is the predicted probability that pixel x belongs to the class recorded in the annotation image; the smaller the loss, the closer the predicted image is to the annotation image. Back-propagating the loss continuously optimizes the internal parameters of the neural network, so the loss keeps decreasing towards an ideal value;
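Equations (2) and (3) amount to the following per-pixel computation; this is a small NumPy sketch with illustrative function names and shapes, not the patented implementation.

```python
import numpy as np

def pixel_softmax(d):
    """Equation (2): d has shape (K, H, W); returns per-pixel class probabilities g_k(x)."""
    e = np.exp(d - d.max(axis=0, keepdims=True))   # subtract the max for numerical stability
    return e / e.sum(axis=0, keepdims=True)

def cross_entropy(d, t):
    """Equation (3): t has shape (H, W) with entries in {0,...,K-1}, the true class t(x) of each pixel."""
    g = pixel_softmax(d)
    h_idx, w_idx = np.indices(t.shape)
    return -np.log(g[t, h_idx, w_idx] + 1e-12).sum()   # smaller L means prediction closer to labels
```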
Finally, the number of iterations epochs, the batch size batch_size, and the validation-set proportion rate are fixed before the model is trained. The acquired image set is divided into a training set and a validation set according to the validation-set proportion; the training images are fed into the network in batches of batch_size until all of them have been used, which completes one iteration; repeating this for the chosen number of iterations finally yields the optimal neural network model.
The main execution parts of the invention are the acquisition and processing of images, the training of neural networks and the recognition of the images by using a recognition model. The implementation process of the method can be divided into the following three stages:
First, image data acquisition: the time interval T of the acquisition module is set, images are collected on different urban road sections, and the images are passed through the detection module to obtain a set of valid images. The images are then annotated with the annotation software Labelme 3.11.2: the instance scene-segmentation annotation function is used to outline the target objects in each image and label their categories, and the software generates an annotation image in which different objects are marked with different gray levels. From those gray levels a gray-level list and the number of object categories K are obtained. Finally, the images and annotation images are expanded by the data-expansion module to obtain the data set.
Second, network parameters and training: the parameters are the image length h, image width w, loss function value L, number of iterations epochs, batch size batch_size, and validation-set proportion rate. The data set is split into a training set and a validation set according to rate; during training, batches of size batch_size are fed into the residual U-net network, L is computed from the predicted images output by the network and the actual annotation images, back-propagation adjusts the parameters in the network so that L tends towards a minimum, training is repeated for the chosen number of iterations, and the network parameters are tuned on the validation set during the iterations, finally yielding an optimal network model.
Third, road condition classification: the acquisition interval T is adjusted, subsequently captured images are fed into the trained deep learning model, the predicted semantic segmentation images are output, and the gray levels in those images are returned to the processor, so the vehicle can identify which categories of objects lie ahead and react accordingly.
The invention has the following beneficial effects: 1. the network design jointly considers the problems a deep-learning network may face during training, namely the gradient decreasing too quickly, an overly large required data set, and overfitting; batch normalization, residual networks, and splicing of high- and low-level information are therefore added to the network, which effectively reduces gradient decay and the loss of image information and improves the accuracy of semantic segmentation; 2. the deep-learning road-condition detection system is simple in design, easy to understand, needs only a small data set, runs in real time, and offers strong practicality and adaptability.
Drawings
Fig. 1 is a flow of implementation of an urban road scene semantic segmentation system for deep learning.
FIG. 2 is an overall model design of a residual U-net network used in a deep learning urban road scene semantic segmentation system.
FIG. 3 is a network form of second-level to fifth-level residual error networks in a residual error U-net network used by the deep learning urban road scene semantic segmentation system.
FIG. 4 is a diagram showing the semantic segmentation effect of deep learning urban road scenes.
Detailed Description
The method of the present invention is described in further detail below with reference to the accompanying drawings.
Referring to fig. 1 to 4, a deep learning-based urban road scene semantic segmentation method includes the following steps:
1) Image acquisition at the front of the vehicle: urban road images are collected at a regular time interval T, and each image with resolution h × w is passed to an image detection module to obtain valid images; the valid images are then fed to an annotation module, which uses the publicly available annotation software Labelme 3.11.2. Its scene-segmentation annotation function is used to outline vehicles, pedestrians, bicycles, traffic lights, neon lights, and similar objects in the image and label them as different categories. The generated annotation image encodes the different object categories as different gray levels, from which a gray-level list and the number of object categories K are obtained;
2) Expanding the annotated images and the original input images: the images are randomly cropped, stitched, or corrupted with different types of noise, and then transformed with an image affine matrix; the affine transformation is given by equation (1):
$$\begin{pmatrix} a' \\ b' \\ 1 \end{pmatrix} = \begin{pmatrix} c_1 & c_2 & s_x \\ c_3 & c_4 & s_y \\ 0 & 0 & 1 \end{pmatrix}\begin{pmatrix} a \\ b \\ 1 \end{pmatrix} \qquad (1)$$

In the affine matrix, s_x is the horizontal translation and s_y the vertical translation, c_1 scales the image abscissa, c_4 scales the ordinate, and c_2 and c_3 control the shear transformation; (a, b) is the original pixel position and (a', b') the transformed position. Finally, padding and cropping restore the original image resolution, yielding the data set;
3) Network training with the expanded images and the annotated images: the residual U-net network consists of four parts, namely a down-sampling part, a bridge part, an up-sampling part, and a classification part;
The training parameters are the image length h, image width w, loss function value L, number of network iterations epochs, batch size batch_size, and validation-set proportion rate. The data set is split into a training set and a validation set according to rate. During training, batches of size batch_size are fed into the residual U-net network; L is computed from the predicted images output by the network and the actual annotation images, and back-propagation adjusts the parameters in the network so that L tends towards a minimum. The network is trained repeatedly for the given number of iterations, with the network parameters tuned on the validation set during the iterations, finally yielding an optimal network model.
4) Road condition classification: the acquisition interval T of the acquisition module is adjusted, subsequently captured images are fed into the trained deep learning model, the predicted semantic segmentation images are output, and the gray levels in those images are returned to the processor, so the vehicle can identify which categories of objects lie ahead and react accordingly.
Further, in step 3), the down-sampling part is divided into four stages, each consisting of a residual network (the first- to fourth-stage residual networks). The layers of the first-stage residual network are connected in the order: convolution layer, batch normalization layer, softmax function layer, convolution layer, fusion layer; the input image and the processed feature image are fused in the fusion layer through an identity connection. The second- to fourth-stage residual networks share the same form, with the connection order: batch normalization layer, softmax function layer, convolution layer, batch normalization layer, softmax function layer, convolution layer, fusion layer; the input feature image and the processed feature image are likewise fused in the fusion layer through an identity connection. The convolution layers use 3 × 3 kernels, and the two convolution layers of each stage have 64, 128, 256, and 512 channels respectively. The stages are connected by 2 × 2 pooling layers with stride 2, and the channel dimension of each pooling layer matches that of the convolution layers of its stage.
The bridge part prepares the network's high- and low-level feature information for splicing. It consists of two batch normalization layers, two softplus function layers, and two 3 × 3 convolution layers with 1024 channels; it has no fusion layer, so no identity connection is needed, and the connection order of its layers is otherwise the same as in the second-stage residual network. Finally, an up-sampling layer resizes the feature image to the size required for splicing.
The up-sampling part also consists of four stages of residual networks (the fifth- to eighth-stage residual networks). Their form and the connection order of the layers are essentially the same as in the down-sampling stages, except that the identity connections of the fifth- to seventh-stage residual networks are replaced by 1 × 1 convolution layers, while the eighth-stage residual network is unchanged. The convolution layers of the up-sampling residual networks have 512, 256, 128, and 64 channels respectively. The stages are connected by up-sampling layers and splicing layers; each splicing layer concatenates high- and low-level features of matching size, as follows:
(3.1) the feature image output by the fourth-stage residual network, after its pooling layer, is spliced with the feature image output by the bridge part;
(3.2) the feature image output by the third-stage residual network, after its pooling layer, is spliced with the feature image output by the fifth-stage residual network after its up-sampling layer;
(3.3) the feature image output by the second-stage residual network, after its pooling layer, is spliced with the feature image output by the sixth-stage residual network after its up-sampling layer;
(3.4) the feature image output by the first-stage residual network, after its pooling layer, is spliced with the feature image output by the seventh-stage residual network after its up-sampling layer.
Because splicing changes the channel dimension of the feature images, the 1 × 1 convolution layers that replace the identity connections adjust it; the four 1 × 1 convolution layers have 512, 256, 128, and 64 channels respectively, and the feature images are finally fused in the fusion layers.
The classification part consists of a 1 × 1 convolution layer and a softmax layer. Since urban road image segmentation involves six classes (vehicle, pedestrian, bicycle, traffic light, neon light, and background), the 1 × 1 convolution layer produces a 6-channel feature image; because its raw pixel values are not probabilities, the softmax layer converts the output into a probability distribution. The softmax function is given by equation (2):
$$g_k(x) = \frac{\exp\left(d_k(x)\right)}{\sum_{k'=1}^{K}\exp\left(d_{k'}(x)\right)} \qquad (2)$$

where d_k(x) is the value of pixel x on channel k, K is the number of object classes, and g_k(x) ∈ [0, 1] is the probability that pixel x belongs to class k; the channel with the highest probability gives the predicted class.
The deviation of the predicted result from the actual is then evaluated using a cross-entropy loss function, see equation (3):
$$L = -\sum_{x}\log g_{t(x)}(x) \qquad (3)$$

where t(x) is the true class of pixel x, so g_{t(x)}(x) is the predicted probability that pixel x belongs to the class recorded in the annotation image; the smaller the loss, the closer the predicted image is to the annotation image. Back-propagating the loss continuously optimizes the internal parameters of the neural network, so the loss keeps decreasing towards an ideal value.
Finally, the number of iterations epochs, the batch size batch_size, and the validation-set proportion rate are fixed before the model is trained. The acquired image set is divided into a training set and a validation set according to the validation-set proportion; the training images are fed into the network in batches of batch_size until all of them have been used, which completes one iteration; repeating this for the chosen number of iterations finally yields the optimal neural network model.
The main execution parts of the embodiment are image acquisition and processing, neural network training and image recognition by using a recognition model. The implementation process of the method can be divided into the following three stages:
First, image data acquisition: the acquisition interval is set to T = 4 s, images are collected on different urban road sections, and the images are passed through the detection module to obtain 1000 valid images. The images are then annotated with the annotation software Labelme 3.11.2: the instance scene-segmentation annotation function outlines the various targets in each image and labels their categories, and the software generates annotation images in which different target categories are marked with different gray levels. The gray-level list list = [0, 20, 80, 140, 180, 230] gives the pixel values of the different objects, namely background, neon light, traffic light, vehicle, pedestrian, and bicycle, so the total number of categories is K = 6. Finally, the images and annotation images are expanded by the data-expansion module to obtain the data set.
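The gray-level list produced by the annotation step maps directly to class indices for training; a minimal sketch of that conversion follows, using the gray values of this embodiment (the function name is an illustrative assumption).

```python
import numpy as np

GRAY_LIST = [0, 20, 80, 140, 180, 230]  # background, neon light, traffic light, vehicle, pedestrian, bicycle

def gray_to_class(label_img, gray_list=GRAY_LIST):
    """Convert a Labelme gray-level annotation image (H, W) into class indices 0..K-1."""
    class_map = np.zeros(label_img.shape, dtype=np.int64)
    for k, gray in enumerate(gray_list):
        class_map[label_img == gray] = k
    return class_map
```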
Second, the network parameters are entered on the parameter-setting interface: image length h = 224, image width w = 224, loss function L, number of iterations epochs = 30, batch size batch_size = 4, and validation-set proportion rate = 0.1. The 3000-image data set is divided into 2700 training images and 300 validation images. During training, 4 images at a time (batch_size) are fed into the residual U-net network until the whole training set has been used; the loss L is computed from the predicted images output by the network and the actual annotation images, back-propagation adjusts the parameters in the network so that L tends towards a minimum, and this completes one iteration. The network is trained for 30 iterations, with the network parameters tuned on the validation set during the iterations, finally yielding a suitable network model.
Third, the acquisition interval is changed to T = 0.2 s, subsequently captured images are fed into the trained deep learning model, the real-time semantic segmentation result is output, and the gray levels in the images are returned to the processor, so the vehicle can identify which categories of objects lie ahead and react accordingly.
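The deployment stage described above could be sketched as follows; capture_fn and send_fn stand in for the camera-acquisition and processor-feedback interfaces, which the patent does not specify, and PyTorch is again an assumption.

```python
import time
import torch

def run_inference_loop(model, capture_fn, send_fn, gray_list, interval_s=0.2,
                       device="cuda" if torch.cuda.is_available() else "cpu"):
    """Feed newly captured images to the trained model and return gray-level maps to the processor."""
    model.eval().to(device)
    gray = torch.tensor(gray_list, dtype=torch.uint8, device=device)
    while True:
        image = capture_fn()                               # (3, H, W) float tensor from the camera
        with torch.no_grad():
            logits = model(image.unsqueeze(0).to(device))  # (1, K, H, W) class scores
            classes = logits.argmax(dim=1)[0]              # predicted class index per pixel
        send_fn(gray[classes].cpu().numpy())               # map classes back to the annotation gray levels
        time.sleep(interval_s)                             # acquisition interval T = 0.2 s
```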
The actual system design form, the network establishment process and the results are shown in fig. 1, fig. 2, fig. 3 and fig. 4, and fig. 1 is a flow of implementation of the deep learning urban road scene semantic segmentation system. FIG. 2 is an overall model design of a residual U-net network used in a deep learning urban road scene semantic segmentation system. FIG. 3 is a network form of second-level to fifth-level residual error networks in a residual error U-net network used by the deep learning urban road scene semantic segmentation system. FIG. 4 is a diagram showing the semantic segmentation effect of deep learning urban road scenes.
The above illustrates the excellent deep learning urban road scene semantic segmentation effect exhibited by one embodiment of the present invention. It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that any modifications made within the spirit and scope of the appended claims are intended to be within the scope of the invention.

Claims (1)

1. A deep learning-based urban road scene semantic segmentation method is characterized by comprising the following steps:
1) Image acquisition at the front of the vehicle: urban road images are collected at a regular time interval T, and image detection is performed on the images with resolution h × w to obtain valid images; the valid images are then annotated with the publicly available annotation software Labelme 3.11.2, whose scene-segmentation annotation function is used to outline and label vehicles, pedestrians, bicycles, traffic lights, and neon lights in the image as different categories; the generated annotation image encodes the different object categories as different gray levels, from which a gray-level list and the number of object categories K are obtained;
2) Expanding the annotated images and the original input images: the images are randomly cropped, stitched, or corrupted with different types of noise, and then transformed with an image affine matrix; the affine transformation is given by equation (1):
$$\begin{pmatrix} a' \\ b' \\ 1 \end{pmatrix} = \begin{pmatrix} c_1 & c_2 & s_x \\ c_3 & c_4 & s_y \\ 0 & 0 & 1 \end{pmatrix}\begin{pmatrix} a \\ b \\ 1 \end{pmatrix} \qquad (1)$$

in the affine matrix, s_x is the horizontal translation and s_y the vertical translation, c_1 scales the image abscissa, c_4 scales the ordinate, and c_2 and c_3 control the shear transformation; (a, b) is the original pixel position and (a', b') the transformed position; finally, padding and cropping restore the original image resolution, yielding the data set;
3) Network training with the expanded images and the annotated images: the residual U-net network consists of four parts, namely a down-sampling part, a bridge part, an up-sampling part, and a classification part;
the method comprises the steps of image length h, image width w, loss function size L, network iteration times epochs, batch processing of batch _ size and verification set proportion rate, dividing a data set into a training set and a verification set through the rate, inputting batch _ size into a residual U-net network for training according to the batch _ size during training, calculating L through predicted images output by the network and actual label images, reversely propagating and adjusting parameters in the network to enable the output of the L to tend to be minimized, repeatedly training the network to the iteration times, adjusting network parameters through the verification set in the iteration process, and finally obtaining an optimal network model;
4) Road condition classification: the acquisition time interval T is adjusted, subsequently captured images are fed into the trained deep learning model, the predicted semantic segmentation images are output, the gray levels in those images are returned to the processor, and the vehicle identifies the categories of objects present ahead;
in step 3), the down-sampling part is divided into four stages, each consisting of a residual network, namely the first- to fourth-stage residual networks; the layers of the first-stage residual network are connected in the order: convolution layer, batch normalization layer, softmax function layer, convolution layer, fusion layer, and the input image and the processed feature image are fused in the fusion layer through an identity connection; the second- to fourth-stage residual networks share the same form, with the connection order: batch normalization layer, softmax function layer, convolution layer, batch normalization layer, softmax function layer, convolution layer, fusion layer, and the input feature image and the processed feature image are fused in the fusion layer through an identity connection; the convolution layers use 3 × 3 kernels, the two convolution layers of each stage have 64, 128, 256, and 512 channels respectively, the stages are connected by 2 × 2 pooling layers with stride 2, and the channel dimension of each pooling layer matches that of the convolution layers of its stage;
in step 3), the bridge part prepares the network's high- and low-level feature information for splicing; it comprises two batch normalization layers, two softplus function layers, and two 3 × 3 convolution layers with 1024 channels, has no fusion layer, its layers are connected in the same order as in the second-stage residual network, and finally an up-sampling layer resizes the feature image to the size required for splicing;
in step 3), the up-sampling part also consists of four stages of residual networks, namely the fifth- to eighth-stage residual networks; their form and the connection order of the layers are essentially the same as in the down-sampling stages, except that the identity connections of the fifth- to seventh-stage residual networks are replaced by 1 × 1 convolution layers, the eighth-stage residual network being unchanged; the convolution layers of the up-sampling residual networks have 512, 256, 128, and 64 channels respectively; the stages are connected by up-sampling layers and splicing layers, and each splicing layer concatenates high- and low-level features of matching size, as follows:
(3.1) the feature image output by the fourth-stage residual network, after the pooling layer, is spliced with the feature image output by the bridge part;
(3.2) the feature image output by the third-stage residual network, after the pooling layer, is spliced with the feature image output by the fifth-stage residual network after the up-sampling layer;
(3.3) the feature image output by the second-stage residual network, after the pooling layer, is spliced with the feature image output by the sixth-stage residual network after the up-sampling layer;
(3.4) the feature image output by the first-stage residual network, after the pooling layer, is spliced with the feature image output by the seventh-stage residual network after the up-sampling layer;
because splicing changes the channel dimension of the feature images, the 1 × 1 convolution layers that replace the identity connections adjust it; the four 1 × 1 convolution layers have 512, 256, 128, and 64 channels respectively, and finally the feature images are fused in the fusion layers;
in step 3), the classification part consists of a 1 × 1 convolution layer and a softmax layer; since urban road image segmentation involves six classes (vehicle, pedestrian, bicycle, traffic light, neon light, and background), the 1 × 1 convolution layer produces a 6-channel feature image, but its raw pixel values are not probabilities, so the softmax layer converts the output into a probability distribution; the softmax function is given by equation (2):
$$g_k(x) = \frac{\exp\left(d_k(x)\right)}{\sum_{k'=1}^{K}\exp\left(d_{k'}(x)\right)} \qquad (2)$$

where d_k(x) is the value of pixel x on channel k, K is the number of object classes, and g_k(x) ∈ [0, 1] is the probability that pixel x belongs to class k; the channel with the highest probability gives the predicted class;
the deviation of the prediction from the actual is then evaluated using a cross-entropy loss function, see equation (3):
$$L = -\sum_{x}\log g_{t(x)}(x) \qquad (3)$$

where t(x) is the true class of pixel x, so g_{t(x)}(x) is the predicted probability that pixel x belongs to the class recorded in the annotation image; the smaller the loss, the closer the predicted image is to the annotation image, and back-propagating the loss continuously optimizes the internal parameters of the neural network, so the loss keeps decreasing towards an ideal value.
CN202010156966.XA 2020-03-09 2020-03-09 Urban road scene semantic segmentation method based on deep learning Active CN111598095B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010156966.XA CN111598095B (en) 2020-03-09 2020-03-09 Urban road scene semantic segmentation method based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010156966.XA CN111598095B (en) 2020-03-09 2020-03-09 Urban road scene semantic segmentation method based on deep learning

Publications (2)

Publication Number Publication Date
CN111598095A CN111598095A (en) 2020-08-28
CN111598095B true CN111598095B (en) 2023-04-07

Family

ID=72181296

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010156966.XA Active CN111598095B (en) 2020-03-09 2020-03-09 Urban road scene semantic segmentation method based on deep learning

Country Status (1)

Country Link
CN (1) CN111598095B (en)

Families Citing this family (35)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018176000A1 (en) 2017-03-23 2018-09-27 DeepScale, Inc. Data synthesis for autonomous control systems
US11157441B2 (en) 2017-07-24 2021-10-26 Tesla, Inc. Computational array microprocessor system using non-consecutive data formatting
US10671349B2 (en) 2017-07-24 2020-06-02 Tesla, Inc. Accelerated mathematical engine
US11893393B2 (en) 2017-07-24 2024-02-06 Tesla, Inc. Computational array microprocessor system with hardware arbiter managing memory requests
US11409692B2 (en) 2017-07-24 2022-08-09 Tesla, Inc. Vector computational unit
US11561791B2 (en) 2018-02-01 2023-01-24 Tesla, Inc. Vector computational unit receiving data elements in parallel from a last row of a computational array
US11215999B2 (en) 2018-06-20 2022-01-04 Tesla, Inc. Data pipeline and deep learning system for autonomous driving
US11361457B2 (en) 2018-07-20 2022-06-14 Tesla, Inc. Annotation cross-labeling for autonomous control systems
US11636333B2 (en) 2018-07-26 2023-04-25 Tesla, Inc. Optimizing neural network structures for embedded systems
US11562231B2 (en) 2018-09-03 2023-01-24 Tesla, Inc. Neural networks for embedded devices
CN113039556B (en) 2018-10-11 2022-10-21 特斯拉公司 System and method for training machine models using augmented data
US11196678B2 (en) 2018-10-25 2021-12-07 Tesla, Inc. QOS manager for system on a chip communications
US11816585B2 (en) 2018-12-03 2023-11-14 Tesla, Inc. Machine learning models operating at different frequencies for autonomous vehicles
US11537811B2 (en) 2018-12-04 2022-12-27 Tesla, Inc. Enhanced object detection for autonomous vehicles based on field view
US11610117B2 (en) 2018-12-27 2023-03-21 Tesla, Inc. System and method for adapting a neural network model on a hardware platform
US11150664B2 (en) 2019-02-01 2021-10-19 Tesla, Inc. Predicting three-dimensional features for autonomous driving
US10997461B2 (en) 2019-02-01 2021-05-04 Tesla, Inc. Generating ground truth for machine learning from time series elements
US11567514B2 (en) 2019-02-11 2023-01-31 Tesla, Inc. Autonomous and user controlled vehicle summon to a target
US10956755B2 (en) 2019-02-19 2021-03-23 Tesla, Inc. Estimating object properties using visual image data
CN112070049B (en) * 2020-09-16 2022-08-09 福州大学 Semantic segmentation method under automatic driving scene based on BiSeNet
CN112348839B (en) * 2020-10-27 2024-03-15 重庆大学 Image segmentation method and system based on deep learning
CN112329780B (en) * 2020-11-04 2023-10-27 杭州师范大学 Depth image semantic segmentation method based on deep learning
CN112767361B (en) * 2021-01-22 2024-04-09 重庆邮电大学 Reflected light ferrograph image segmentation method based on lightweight residual U-net
CN112819688A (en) * 2021-02-01 2021-05-18 西安研硕信息技术有限公司 Conversion method and system for converting SAR (synthetic aperture radar) image into optical image
CN113076837A (en) * 2021-03-25 2021-07-06 高新兴科技集团股份有限公司 Convolutional neural network training method based on network image
CN113034598B (en) * 2021-04-13 2023-08-22 中国计量大学 Unmanned aerial vehicle power line inspection method based on deep learning
CN112949617B (en) * 2021-05-14 2021-08-06 江西农业大学 Rural road type identification method, system, terminal equipment and readable storage medium
CN113468963A (en) * 2021-05-31 2021-10-01 山东信通电子股份有限公司 Road raise dust identification method and equipment
CN113269276A (en) * 2021-06-28 2021-08-17 深圳市英威诺科技有限公司 Image recognition method, device, equipment and storage medium
CN113657174A (en) * 2021-07-21 2021-11-16 北京中科慧眼科技有限公司 Vehicle pseudo-3D information detection method and device and automatic driving system
CN113569774B (en) * 2021-08-02 2022-04-08 清华大学 Semantic segmentation method and system based on continuous learning
CN113705498B (en) * 2021-09-02 2022-05-27 山东省人工智能研究院 Wheel slip state prediction method based on distribution propagation diagram network
CN113689436B (en) * 2021-09-29 2024-02-02 平安科技(深圳)有限公司 Image semantic segmentation method, device, equipment and storage medium
CN113808128B (en) * 2021-10-14 2023-07-28 河北工业大学 Intelligent compaction whole process visualization control method based on relative coordinate positioning algorithm
CN114495236B (en) * 2022-02-11 2023-02-28 北京百度网讯科技有限公司 Image segmentation method, apparatus, device, medium, and program product

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108062756A (en) * 2018-01-29 2018-05-22 重庆理工大学 Image, semantic dividing method based on the full convolutional network of depth and condition random field
CN109145983A (en) * 2018-08-21 2019-01-04 电子科技大学 A kind of real-time scene image, semantic dividing method based on lightweight network
CN110111335A (en) * 2019-05-08 2019-08-09 南昌航空大学 A kind of the urban transportation Scene Semantics dividing method and system of adaptive confrontation study
CN110147794A (en) * 2019-05-21 2019-08-20 东北大学 A kind of unmanned vehicle outdoor scene real time method for segmenting based on deep learning

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Research on traffic scene semantic segmentation method based on convolutional neural network; Li Linhui et al.; Journal on Communications; 2018-04-25 (No. 04); full text *
Image semantic segmentation based on multi-scale feature extraction; Xiong Zhiyong et al.; Journal of South-Central University for Nationalities (Natural Science Edition); 2017-09-15 (No. 03); full text *
Scene semantic segmentation network based on color-depth images and deep learning; Dai Juting et al.; Science Technology and Engineering; 2018-07-18 (No. 20); full text *
Semantic segmentation of newly added buildings in remote sensing images based on deep learning; Chen Yiming et al.; Computer and Digital Engineering; 2019-12-20 (No. 12); full text *

Also Published As

Publication number Publication date
CN111598095A (en) 2020-08-28

Similar Documents

Publication Publication Date Title
CN111598095B (en) Urban road scene semantic segmentation method based on deep learning
Serna et al. Classification of traffic signs: The european dataset
Alghmgham et al. Autonomous traffic sign (ATSR) detection and recognition using deep CNN
CN113506300B (en) Picture semantic segmentation method and system based on rainy day complex road scene
CN110263786B (en) Road multi-target identification system and method based on feature dimension fusion
CN112183203B (en) Real-time traffic sign detection method based on multi-scale pixel feature fusion
CN109886066A (en) Fast target detection method based on the fusion of multiple dimensioned and multilayer feature
CN112396607A (en) Streetscape image semantic segmentation method for deformable convolution fusion enhancement
CN113902915A (en) Semantic segmentation method and system based on low-illumination complex road scene
CN112990065B (en) Vehicle classification detection method based on optimized YOLOv5 model
CN111860269A (en) Multi-feature fusion tandem RNN structure and pedestrian prediction method
CN114495029A (en) Traffic target detection method and system based on improved YOLOv4
Al Mamun et al. Lane marking detection using simple encode decode deep learning technique: SegNet
Gupta et al. Image-based road pothole detection using deep learning model
CN114913498A (en) Parallel multi-scale feature aggregation lane line detection method based on key point estimation
CN114612883A (en) Forward vehicle distance detection method based on cascade SSD and monocular depth estimation
Naik et al. Implementation of YOLOv4 algorithm for multiple object detection in image and video dataset using deep learning and artificial intelligence for urban traffic video surveillance application
BARODI et al. Improved deep learning performance for real-time traffic sign detection and recognition applicable to intelligent transportation systems
CN114973199A (en) Rail transit train obstacle detection method based on convolutional neural network
Kadav et al. Development of Computer Vision Models for Drivable Region Detection in Snow Occluded Lane Lines
Yildiz et al. Hybrid image improving and CNN (HIICNN) stacking ensemble method for traffic sign recognition
Dong et al. Intelligent pixel-level pavement marking detection using 2D laser pavement images
CN114495050A (en) Multitask integrated detection method for automatic driving forward vision detection
CN112085001B (en) Tunnel identification model and method based on multi-scale edge feature detection
CN117058641A (en) Panoramic driving perception method based on deep learning

Legal Events

Code Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant