CN113011429B - Real-time street view image semantic segmentation method based on staged feature semantic alignment - Google Patents

Real-time street view image semantic segmentation method based on staged feature semantic alignment

Info

Publication number
CN113011429B
CN113011429B (application CN202110295657.5A)
Authority
CN
China
Prior art keywords
semantic
feature
network
module
semantic segmentation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110295657.5A
Other languages
Chinese (zh)
Other versions
CN113011429A (en)
Inventor
严严
翁熙
王菡子
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xiamen University
Original Assignee
Xiamen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xiamen University filed Critical Xiamen University
Priority to CN202110295657.5A priority Critical patent/CN113011429B/en
Publication of CN113011429A publication Critical patent/CN113011429A/en
Application granted granted Critical
Publication of CN113011429B publication Critical patent/CN113011429B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/26Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V10/267Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/243Classification techniques relating to the number of classes
    • G06F18/2431Multiple classes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/082Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/35Categorising the entire scene, e.g. birthday party or wedding scene
    • G06V20/38Outdoor scenes
    • G06V20/39Urban scenes
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

A real-time street view image semantic segmentation method based on staged feature semantic alignment relates to computer vision technology. An encoder is first constructed from the lightweight image classification network ResNet-18 and an efficient spatial-channel attention module, and a decoder is constructed from several differently designed feature semantic alignment modules together with a global average pooling layer. Next, a semantic segmentation network model based on an encoder-decoder structure is built from the encoder and decoder obtained above. Finally, features from the encoder are aggregated with the output features of the decoder and sent to a semantic segmentation result generation module to obtain the final semantic segmentation result. The method efficiently generates segmentation results at a real-time rate for high-resolution input images without reducing image resolution. Compared with existing real-time semantic segmentation methods, it achieves higher segmentation accuracy and a better balance between speed and accuracy.

Description

Real-time street view image semantic segmentation method based on staged feature semantic alignment
Technical Field
The invention relates to computer vision technology, and in particular to a real-time street view image semantic segmentation method based on staged feature semantic alignment.
Background
Semantic segmentation is one of the key technologies for scene understanding: it predicts a class for every pixel in an image, i.e., it performs pixel-level semantic classification. In recent years, applications such as autonomous driving and intelligent transportation have attracted considerable attention. A problem these applications need to address is how to provide a comprehensive understanding of traffic conditions at the semantic level. It is therefore of exceptional importance for these applications to study street view image semantic segmentation methods that provide pixel-level street scene understanding.
In recent years, benefiting from the development of convolutional neural networks, a large number of deep-learning-based semantic segmentation methods have been proposed. These methods obtain excellent segmentation results by capturing rich semantic information and spatial detail information. However, their base networks employ complex deep neural networks to capture the semantic information in the input image. The commonly used ResNet-101 (K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 770-778.) provides powerful semantic information extraction, but the depth and width of this bulky network also make it inefficient. Moreover, applications such as autonomous driving and intelligent transportation require not only high-resolution input images to cover a wide field of view, but also fast interaction or response. Therefore, semantic segmentation methods that maintain high segmentation accuracy under real-time constraints have received extensive attention from researchers.
Many efforts have been made so far to achieve efficient or real-time semantic segmentation. These methods typically reduce the resolution of the input image or use a lightweight base network to increase efficiency. Although this greatly reduces the computational complexity of semantic segmentation, context information or spatial details are lost to some extent, resulting in a significant decrease in accuracy. How to achieve a good balance between network prediction speed and segmentation accuracy has therefore become a key challenge for real-time semantic segmentation.
Against this background, the invention provides a real-time street view image semantic segmentation method based on staged feature semantic alignment. The representation capability of the features is enhanced while only a lightweight base network is employed, so that the semantic segmentation network model maintains excellent segmentation accuracy at a real-time network prediction speed.
Disclosure of Invention
The invention aims to solve the above technical problems in the prior art by providing a real-time street view image semantic segmentation method based on staged feature semantic alignment, which efficiently generates segmentation results at a real-time rate with high segmentation accuracy.
The invention comprises the following steps:
A. dividing the street view image semantic segmentation data set into a training set, a validation set and a test set;
B. based on a lightweight image classification network structure, combining a specially designed efficient spatial-channel attention module to construct the base network of a semantic segmentation network model;
C. designing feature semantic alignment modules with different network structures according to the characteristics of the features at different stages of the base network obtained in step B;
D. taking the base network obtained in step B as the encoder and combining a global average pooling layer with the feature semantic alignment modules designed in step C as the decoder, constructing a semantic segmentation network model with a symmetric encoder-decoder structure;
E. aggregating the final-stage output features of the network obtained in step D with the first-stage features of the encoder, and sending them to a semantic segmentation result generation module to form a prediction result;
F. training the parameters of the semantic segmentation network obtained in step E with the semantic segmentation training set;
G. during training, selectively sending the output features of some of the feature semantic alignment modules to mutually independent semantic segmentation result generation modules to produce additional prediction results, and jointly updating the network parameters with all of these predictions, so as to explicitly address the multi-scale objects in street view images;
H. inputting the test set into the trained network to obtain the semantic segmentation result of the corresponding street view image.
In step A: the street view image semantic segmentation data set can adopt the public data set Cityscapes, which contains 25000 street view images divided by annotation fineness into a fine annotation subset (5000 images) and a coarse annotation subset (20000 images); the fine annotation subset is further divided into a training set (2975 images), a validation set (500 images) and a test set (1525 images); each image has a resolution of 1024×2048, and each pixel is labeled as one of 19 predefined categories: road, sidewalk, building, wall, fence, pole, traffic light, traffic sign, vegetation, terrain, sky, person, rider, car, truck, bus, train, motorcycle and bicycle.
In step B, the base network of the semantic segmentation network model is constructed in the following two sub-steps:
B1. adopting the lightweight image classification network ResNet-18 as the base; ResNet-18 is the lightest version of the ResNet family. Because semantic segmentation is a pixel-level classification task, ResNet-18 cannot be used directly, so all network layers after its last basic residual block are removed to obtain the preliminary base network of the semantic segmentation network model; this base network contains 8 basic residual blocks in total and is divided into four stages of 2 consecutive basic residual blocks each: Res-1, Res-2, Res-3 and Res-4;
B2. embedding an efficient spatial-channel attention module between the two residual blocks of each of Res-2, Res-3 and Res-4 to improve the feature representation capability of the base network and reduce the information loss caused by the downsampling operations, thereby obtaining the base network part of the semantic segmentation network model; the efficient spatial-channel attention module contains two branch paths: the spatial branch contains a 1×1 standard convolution and a Sigmoid activation function, and the channel branch contains a global average pooling operation, a 1×1 1-D convolution and a Sigmoid activation function.
In step C, each of the feature semantic alignment modules with different network structures contains two input features and one output feature. The two input features have different sizes: the small-size input feature comes from the previous module connected to this module, and the large-size input feature comes from the corresponding stage of the base network obtained in step B. To increase the speed of the network, features from the base network pass through an additional CBR module to reduce their number of channels; the CBR module contains a 3×3 standard convolution, a batch normalization operation and a ReLU activation function.
Then, the large-size input feature passes through a feature enhancement module of stage-specific design and an efficient spatial-channel attention module, so that its representation capability is enhanced according to its own characteristics. The feature enhancement module (FEB) passes the input feature through a series of convolution layers and normalization operations to enhance its semantic information or spatial detail information; the enhanced feature is then aggregated with the input feature and passed through a ReLU activation function. For features from Res-4, the convolution layers in the feature enhancement module (FEB-4) are several depthwise separable convolutions with different dilation rates, to enhance semantic information. For features from Res-2, the feature enhancement module (FEB-2) employs standard convolutions to improve the capture of spatial detail information. For features from Res-3, the feature enhancement module (FEB-3) uses depthwise separable convolutions without dilation, to balance enhanced feature representation against module computational complexity. For features from Res-1, no feature enhancement module is used because of their large spatial size.
At the same time, the small-size input feature passes through a CBR module and an upsampling operation to obtain the same size and number of channels as the processed large-size input feature. The two processed input features are then concatenated and fed into a 3×3 standard convolution to learn a semantic offset field between them; the learned semantic offset field performs a semantic alignment operation on the processed small-size input feature. Finally, the processed large-size input feature and the semantically aligned small-size input feature are aggregated and sent into another efficient spatial-channel attention module to generate the output feature of the feature semantic alignment module.
In step D, the semantic segmentation network model is constructed as follows: the base network obtained in step B serves as the encoder, providing four features from the four encoder stages Res-1, Res-2, Res-3 and Res-4; feature semantic alignment module-1 through feature semantic alignment module-4, designed in step C according to the characteristics of the features from Res-1 through Res-4, are obtained; finally, a global average pooling layer, feature semantic alignment module-4, module-3, module-2 and module-1 are appended in sequence at the end of the base network, the newly added modules forming the decoder of the semantic segmentation network model and yielding a symmetric encoder-decoder structure; branch paths are established between Res-1 through Res-4 and feature semantic alignment module-1 through module-4 to pass the output features of the corresponding base network stages to the corresponding feature semantic alignment modules.
In step E, the aggregation is implemented as follows: the final output of the semantic segmentation network model obtained in step D is channel-concatenated with the output features of Res-1, and the concatenated features are sent into a semantic segmentation result generation module comprising one CBR operation, one 3×3 standard convolution and one upsampling operation; the CBR operation reduces the number of channels to 64, the 3×3 standard convolution further reduces the 64 channels to the number of categories of the semantic segmentation data set (19), and the upsampling operation restores the resulting features to the same size as the original input image to obtain the final semantic segmentation result.
In step F, training applies data enhancement to the original data set with three methods: random horizontal flipping, random scaling (scale factor 0.5-2.0) and random cropping (768×1536); the initial learning rate of the network is set to 0.005, the weight decay to 0.0005 and the momentum to 0.9, with stochastic gradient descent (SGD) as the optimizer; the learning rate follows the poly policy, updated with a polynomial power of 0.9; the whole network is trained for 120000 iterations with 12 samples per iteration.
In step G, the specific method of selectively sending the output features of some feature semantic alignment modules to mutually independent semantic segmentation result generation modules and jointly updating the network parameters with the resulting predictions may be: from the outputs of feature semantic alignment module-1 through module-4 of the model obtained in step D, select some to feed into semantic segmentation result generation modules of the same structure as in step E; here the outputs of feature semantic alignment module-3 and module-4 are selected, each passing through its own semantic segmentation result generation module to obtain an auxiliary semantic segmentation result; the whole network therefore produces three final outputs, each compared with the annotation images provided by the data set to obtain a corresponding cross-entropy loss; finally, the three cross-entropy losses are summed and, together with step F, the network parameters are updated by the backpropagation algorithm.
In step H, the test set is input into the trained network: the test images are fed directly into the network at their original size, without any test-time tricks, to obtain semantic segmentation results of the corresponding size.
The invention first constructs an encoder from the lightweight image classification network ResNet-18 and an efficient spatial-channel attention module, and constructs a decoder from several differently designed feature semantic alignment modules and a global average pooling layer. Next, a semantic segmentation network model based on an encoder-decoder structure is built from the encoder and decoder obtained above. Finally, features from the encoder are aggregated with the output features of the decoder and sent to a semantic segmentation result generation module to obtain the final semantic segmentation result.
Compared with the prior art, the invention has the following outstanding advantages:
the present invention can efficiently generate corresponding segmentation results at a real-time rate without degrading image resolution while maintaining a high-resolution (1024×2048) input image. Meanwhile, compared with the existing real-time semantic segmentation method, the method can obtain more excellent segmentation precision and better balance between speed and precision.
Drawings
Fig. 1 is a flowchart of the entire implementation of an embodiment of the present invention.
Fig. 2 is a diagram of the entire network structure according to an embodiment of the present invention. In the figure, 'C' denotes a channel concatenation operation, and 'UP' denotes an upsampling operation.
Fig. 3 (a) is a network structure diagram of a feature semantic alignment module according to an embodiment of the present invention; (b) shows the feature enhancement modules designed for the different network stages. '+' in the figure represents element-wise addition.
Fig. 4 is a network structure diagram of the efficient spatial-channel attention module according to an embodiment of the present invention.
Detailed Description
The following embodiment further illustrates the present invention with reference to the accompanying drawings. It describes an implementation and specific operating procedure based on the technical scheme of the invention, but the scope of the invention is not limited to this embodiment.
Referring to fig. 1, the implementation of the embodiment of the present invention includes the following steps:
A. A semantic segmentation training set, a validation set and a test set of street view images are prepared.
The data set used in the invention is Cityscapes, a large-scale street view image data set collected from fifty different cities in Germany. The data set contains 25000 street view images and is divided by annotation fineness into a fine annotation subset (5000 images with fine semantic annotations) and a coarse annotation subset (20000 images with coarse semantic annotations). Each image has a resolution of 1024×2048, and each pixel is labeled as one of 19 predefined categories: road, sidewalk, building, wall, fence, pole, traffic light, traffic sign, vegetation, terrain, sky, person, rider, car, truck, bus, train, motorcycle and bicycle. The present invention uses only the fine annotation subset, further divided into three parts following the data set provider's division: the training set (2975 images), the validation set (500 images) and the test set (1525 images).
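As a concrete illustration, a minimal sketch of loading these splits with torchvision's built-in Cityscapes wrapper follows; the data directory and the transform are assumptions, since the patent only specifies the data set and its division.

```python
import torchvision.transforms as T
from torchvision.datasets import Cityscapes

transform = T.ToTensor()

# mode="fine" exposes the 2975/500/1525 train/val/test fine-annotation splits.
train_set = Cityscapes("data/cityscapes", split="train", mode="fine",
                       target_type="semantic", transform=transform)
val_set = Cityscapes("data/cityscapes", split="val", mode="fine",
                     target_type="semantic", transform=transform)
test_set = Cityscapes("data/cityscapes", split="test", mode="fine",
                      target_type="semantic", transform=transform)
```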
B. Based on a lightweight image classification network structure, a specially designed efficient spatial-channel attention module is combined to construct the base network of the semantic segmentation network model.
The base network of the semantic segmentation network model is constructed in the following two sub-steps:
step B1. Based on a lightweight image classification network ResNet-18, resNet-18 is the most lightweight version of the ResNet network, which is faster and has fewer model parameters than other ResNet networks. In addition, since the semantic segmentation is a pixel-level classification network, the ResNet-18 network cannot be directly used, and therefore all network layers after the last basic residual block of the ResNet-18 are removed in the invention to obtain a basic network of the preliminary semantic segmentation network model. The basic network contains 8 basic residual blocks in total, and the network is divided into four stages by taking 2 continuous basic residual blocks as a group: res-1, res-2, res-3 and Res-4.
Step B2. The base network obtained in step B1 still contains an operation detrimental to the semantic segmentation task: the downsampling operation in the first residual block of each of Res-2, Res-3 and Res-4. Although this operation helps extract high-level semantic information, it also loses spatial detail information that is equally important for semantic segmentation. Therefore, to reduce the information loss from the downsampling operations, a specially designed efficient spatial-channel attention module is embedded between the two residual blocks of each of Res-2, Res-3 and Res-4 to improve the feature representation capability of the base network, thereby obtaining the base network part of the semantic segmentation network model used in the invention. Referring to fig. 4, the efficient spatial-channel attention module contains two branch paths: the spatial branch contains a 1×1 standard convolution and a Sigmoid activation function, and the channel branch contains a global average pooling operation, a 1×1 1-D convolution and a Sigmoid activation function.
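The following is a hedged sketch of such an efficient spatial-channel attention module; the patent specifies the two branch paths, while the kernel size of the 1-D convolution and the way the two attention maps are applied to the input (here, sequential element-wise reweighting) are assumptions.

```python
import torch
import torch.nn as nn

class EfficientSCAttention(nn.Module):
    def __init__(self, channels, k=3):  # k: assumed 1-D kernel size
        super().__init__()
        # spatial branch: 1x1 standard convolution + Sigmoid
        self.spatial = nn.Sequential(nn.Conv2d(channels, 1, kernel_size=1),
                                     nn.Sigmoid())
        # channel branch: global average pooling + 1-D convolution + Sigmoid
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.conv1d = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        s = self.spatial(x)                           # (N, 1, H, W) spatial map
        c = self.pool(x).squeeze(-1).transpose(1, 2)  # (N, C, 1, 1) -> (N, 1, C)
        c = self.sigmoid(self.conv1d(c))              # channel attention weights
        c = c.transpose(1, 2).unsqueeze(-1)           # back to (N, C, 1, 1)
        return x * s * c                              # reweight the input feature
```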
C. Designing feature semantic alignment modules with different network structures according to the characteristics of the features at different stages of the base network obtained in step B.
Referring to fig. 3, feature semantic alignment modules with different network structures are designed according to the characteristics of their input features, which effectively resolves the misalignment among features of different levels and enhances feature representation capability. Each feature semantic alignment module contains two input features and one output feature. The two input features have different sizes: the small-size feature comes from the previous module connected to this module, and the large-size input feature comes from the corresponding stage of the base network obtained in step B. To increase the speed of the network, features from the base network first pass through an additional CBR module (a 3×3 standard convolution, a batch normalization operation and a ReLU activation function) to reduce their number of channels.
The large-size input feature is then passed through a feature enhancement module of stage-specific design and an efficient spatial-channel attention module, so that its representation capability is enhanced according to its own characteristics. The feature enhancement module (FEB) passes the input feature through a series of convolution layers and normalization operations to enhance its semantic information or spatial detail information, then aggregates the enhanced feature with the input feature and applies a ReLU activation function. For features from Res-4, the convolution layers in the feature enhancement module (FEB-4) are several depthwise separable convolutions with different dilation rates, to enhance semantic information. For features from Res-2, the feature enhancement module (FEB-2) employs standard convolutions to improve the capture of spatial detail information. For features from Res-3, the feature enhancement module (FEB-3) uses depthwise separable convolutions without dilation, to balance enhanced feature representation against module computational complexity. For features from Res-1, no feature enhancement module is used because of their large spatial size.
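A sketch of a feature enhancement block under these descriptions follows; the number of convolution layers and the particular dilation rates are assumptions not fixed by the text.

```python
import torch.nn as nn

def dw_sep_conv(c, dilation):
    # depthwise 3x3 (optionally dilated) followed by a pointwise 1x1
    return nn.Sequential(
        nn.Conv2d(c, c, 3, padding=dilation, dilation=dilation,
                  groups=c, bias=False),
        nn.Conv2d(c, c, 1, bias=False),
        nn.BatchNorm2d(c))

class FEB(nn.Module):
    def __init__(self, channels, dilations=(1, 2, 4)):
        super().__init__()
        self.branch = nn.Sequential(*[dw_sep_conv(channels, d) for d in dilations])
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # the enhanced feature is aggregated with the input, then passed to a ReLU
        return self.relu(x + self.branch(x))

# Assumed configurations: FEB-4 = FEB(c, dilations=(1, 2, 4));
# FEB-3 = FEB(c, dilations=(1, 1)); FEB-2 would swap dw_sep_conv
# for standard 3x3 convolutions.
```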
At the same time, the small-size input feature passes through a CBR module and an upsampling operation to obtain the same size and number of channels as the processed large-size input feature. The two processed input features are then concatenated and fed into a 3×3 standard convolution to learn a semantic offset field between them. The learned semantic offset field performs a semantic alignment operation on the processed small-size input feature. Finally, the processed large-size input feature and the semantically aligned small-size input feature are aggregated and sent into another efficient spatial-channel attention module to generate the output feature of the feature semantic alignment module.
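The alignment step might be realized as in the following sketch, where a two-channel offset field warps the upsampled small-size feature via grid sampling; warping with F.grid_sample and the offset normalization are assumptions, since the patent only states that a learned semantic offset field aligns the features.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SemanticAlign(nn.Module):
    def __init__(self, channels):
        super().__init__()
        # predict a 2-channel offset field (assumed: channel 0 = x, channel 1 = y)
        self.offset = nn.Conv2d(2 * channels, 2, kernel_size=3, padding=1)

    def forward(self, low, high):
        # low: processed small-size feature, already upsampled to high's size
        delta = self.offset(torch.cat([low, high], dim=1))     # (N, 2, H, W)
        n, _, h, w = low.shape
        ys, xs = torch.meshgrid(torch.linspace(-1, 1, h, device=low.device),
                                torch.linspace(-1, 1, w, device=low.device),
                                indexing="ij")
        grid = torch.stack((xs, ys), dim=-1).expand(n, h, w, 2)
        # shift the sampling grid by the pixel offsets, normalized to [-1, 1]
        grid = grid + delta.permute(0, 2, 3, 1) / torch.tensor(
            [w / 2.0, h / 2.0], device=low.device)
        return F.grid_sample(low, grid, align_corners=True)
```

The aligned output would then be aggregated with the large-size feature and passed through another attention module, as described above.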
D. Taking the base network obtained in step B as the encoder and combining a global average pooling layer with the feature semantic alignment modules designed in step C as the decoder, a semantic segmentation network model with a symmetric encoder-decoder structure is constructed.
Referring to fig. 2, the semantic segmentation network model is constructed as follows. The base network obtained in step B serves as the encoder, from whose four stages Res-1, Res-2, Res-3 and Res-4 four features are obtained. Next, feature semantic alignment module-1 through feature semantic alignment module-4, designed in step C according to the characteristics of the features from Res-1 through Res-4, are obtained. Finally, a global average pooling layer, feature semantic alignment module-4, module-3, module-2 and module-1 are appended in sequence to the base network; these newly added modules form the decoder of the semantic segmentation network model, yielding a symmetric encoder-decoder structure. In addition, branch paths are established between Res-1 through Res-4 and feature semantic alignment module-1 through module-4 to pass the output features of the corresponding base network stages to the corresponding feature semantic alignment modules.
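A high-level sketch of this decoder assembly follows; the interface of the feature semantic alignment modules (here, callables taking the small-size and the large-size feature) and the broadcasting of the pooled global context are assumptions.

```python
import torch.nn as nn

class Decoder(nn.Module):
    # fam1..fam4: feature semantic alignment modules; each takes the
    # small-size feature from the previous module and the large-size
    # feature from the matching encoder stage (interface assumed).
    def __init__(self, fam1, fam2, fam3, fam4):
        super().__init__()
        self.gap = nn.AdaptiveAvgPool2d(1)   # global average pooling layer
        self.fam1, self.fam2, self.fam3, self.fam4 = fam1, fam2, fam3, fam4

    def forward(self, res1, res2, res3, res4):
        g = self.gap(res4).expand_as(res4)   # broadcast global context
        d4 = self.fam4(g, res4)
        d3 = self.fam3(d4, res3)
        d2 = self.fam2(d3, res2)
        return self.fam1(d2, res1)           # final-stage decoder output
```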
E. The final-stage output features of the network obtained in step D are aggregated with the first-stage features of the encoder and sent into a semantic segmentation result generation module to form a prediction result.
The aggregation operation in the invention proceeds as follows: the final output of the semantic segmentation network model obtained in step D is channel-concatenated with the output features of Res-1. The concatenated features are fed into a semantic segmentation result generation module containing one CBR operation, one 3×3 standard convolution and one upsampling operation. The CBR operation reduces the number of channels to 64, the 3×3 standard convolution then reduces the 64 channels to the number of categories of the semantic segmentation data set (19), and the upsampling operation finally restores the result to the same size as the original input image to obtain the final semantic segmentation result.
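A sketch of this semantic segmentation result generation module follows; the bilinear upsampling mode and the padding choices are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SegHead(nn.Module):
    def __init__(self, in_channels, num_classes=19):
        super().__init__()
        # CBR: convolution + batch normalization + ReLU, reducing channels to 64
        self.cbr = nn.Sequential(
            nn.Conv2d(in_channels, 64, 3, padding=1, bias=False),
            nn.BatchNorm2d(64), nn.ReLU(inplace=True))
        # 3x3 convolution from 64 channels to the 19 Cityscapes categories
        self.classifier = nn.Conv2d(64, num_classes, 3, padding=1)

    def forward(self, decoder_out, res1, out_size):
        x = torch.cat([decoder_out, res1], dim=1)   # channel concatenation
        x = self.classifier(self.cbr(x))
        # restore the prediction to the original input resolution
        return F.interpolate(x, size=out_size, mode="bilinear",
                             align_corners=False)
```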
F. The parameters of the semantic segmentation network obtained in step E are trained with the semantic segmentation training set.
During training, data enhancement is applied to the original data set with three methods: random horizontal flipping, random scaling (scale factor 0.5-2.0) and random cropping (768×1536). The initial learning rate of the network is set to 0.005, the weight decay to 0.0005 and the momentum to 0.9, with stochastic gradient descent (SGD) as the optimizer. For the learning rate schedule, the popular "poly" policy is employed, updating the learning rate with a polynomial power of 0.9. The whole network is trained for 120000 iterations with 12 samples per iteration.
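A sketch of this optimizer and poly learning-rate setup follows; the placeholder model stands in for the full segmentation network, and stepping the scheduler once per iteration is implied by the per-iteration poly policy.

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 19, 1)   # placeholder for the full segmentation network

max_iter = 120000
optimizer = torch.optim.SGD(model.parameters(), lr=0.005,
                            momentum=0.9, weight_decay=0.0005)
# poly policy: lr = base_lr * (1 - iter / max_iter) ** 0.9
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lambda it: (1.0 - it / max_iter) ** 0.9)
```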
G. During training, the output features of some of the feature semantic alignment modules are selectively sent into mutually independent semantic segmentation result generation modules to produce additional prediction results, and the network parameters are updated with these predictions jointly, so as to explicitly address the multi-scale objects in street view images.
From the outputs of feature semantic alignment module-1 through module-4 of the model obtained in step D, some are selectively fed into semantic segmentation result generation modules of the same structure as in step E. In the invention, the outputs of feature semantic alignment module-3 and module-4 are selected, each passing through its own semantic segmentation result generation module to obtain an auxiliary semantic segmentation result. The whole network thus produces three final outputs, each compared with the annotation images provided by the data set to obtain a corresponding cross-entropy loss. Finally, the three cross-entropy losses are summed and, in conjunction with step F, the network parameters are updated by the backpropagation algorithm.
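A sketch of the joint loss follows; the ignore index of 255 for unlabeled Cityscapes pixels is an assumption, as the patent does not mention it.

```python
import torch.nn.functional as F

def joint_loss(main_out, aux3_out, aux4_out, target, ignore_index=255):
    # main prediction plus the two auxiliary predictions from feature
    # semantic alignment modules 3 and 4, each compared with the labels
    loss = F.cross_entropy(main_out, target, ignore_index=ignore_index)
    loss = loss + F.cross_entropy(aux3_out, target, ignore_index=ignore_index)
    loss = loss + F.cross_entropy(aux4_out, target, ignore_index=ignore_index)
    return loss
```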
H. Inputting the test set into the trained network to obtain the semantic segmentation result of the corresponding street view image.
The images used for testing are fed directly into the network at their original size, without any test-time tricks, to obtain semantic segmentation results of the corresponding size.
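A sketch of this test-time procedure follows; the random tensor merely stands in for a full-resolution test image, and the placeholder model for the network trained in step F.

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 19, 1)   # placeholder for the trained network from step F
model.eval()
with torch.no_grad():
    image = torch.rand(1, 3, 1024, 2048)   # stand-in full-resolution input
    logits = model(image)                  # (1, 19, 1024, 2048)
    prediction = logits.argmax(dim=1)      # per-pixel class indices
```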
Table 1 compares the performance and speed of the present invention with some other semantic segmentation methods on the Cityscapes test set. As can be seen from Table 1, the invention obtains a segmentation accuracy of 78.0% mIoU and a prediction speed of 37 FPS while using high-resolution (1024×2048) input images. Compared with most methods, it offers better accuracy and prediction speed; in particular, it achieves the best segmentation accuracy among methods that meet the real-time requirement (more than 30 FPS). While running about 47 times faster than the accuracy-oriented PSPNet, the invention maintains comparable segmentation accuracy. The invention therefore maintains excellent segmentation accuracy at a real-time prediction speed.
TABLE 1
DeepLab corresponds to the method proposed by L. C. Chen et al. (L. C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, "Semantic image segmentation with deep convolutional nets and fully connected CRFs," in Proc. Int. Conf. Learn. Represent. (ICLR), May 2015.);
PSPNet corresponds to the method proposed by H. Zhao et al. (H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia, "Pyramid scene parsing network," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 2881-2890.);
SegNet corresponds to the method proposed by V. Badrinarayanan et al. (V. Badrinarayanan, A. Kendall, and R. Cipolla, "SegNet: A deep convolutional encoder-decoder architecture for image segmentation," IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 12, pp. 2481-2495, Dec. 2017.);
ENet corresponds to the method proposed by A. Paszke et al. (A. Paszke, A. Chaurasia, S. Kim, and E. Culurciello, "ENet: A deep neural network architecture for real-time semantic segmentation," Jun. 2016, arXiv:1606.02147. [Online]. Available: https://arxiv.org/abs/1606.02147);
ESPNet corresponds to the method proposed by S. Mehta et al. (S. Mehta, M. Rastegari, A. Caspi, L. Shapiro, and H. Hajishirzi, "ESPNet: Efficient spatial pyramid of dilated convolutions for semantic segmentation," in Proc. Eur. Conf. Comput. Vis. (ECCV), Sep. 2018, pp. 552-568.);
ERFNet corresponds to the method proposed by E. Romera et al. (E. Romera, J. M. Alvarez, L. M. Bergasa, and R. Arroyo, "ERFNet: Efficient residual factorized ConvNet for real-time semantic segmentation," IEEE Trans. Intell. Transp. Syst., vol. 19, no. 1, pp. 263-272, Jan. 2018.);
ICNet corresponds to the method proposed by H. Zhao et al. (H. Zhao, X. Qi, X. Shen, J. Shi, and J. Jia, "ICNet for real-time semantic segmentation on high-resolution images," in Proc. Eur. Conf. Comput. Vis. (ECCV), Sep. 2018, pp. 405-420.);
DABNet corresponds to the method proposed by G. Li et al. (G. Li, I. Yun, J. Kim, and J. Kim, "DABNet: Depth-wise asymmetric bottleneck for real-time semantic segmentation," in Proc. Brit. Mach. Vis. Conf. (BMVC), Sep. 2019, pp. 1-12.);
GUN corresponds to the method proposed by D. Mazzini (D. Mazzini, "Guided upsampling network for real-time semantic segmentation," in Proc. Brit. Mach. Vis. Conf. (BMVC), Sep. 2018, p. 117.);
EDANet corresponds to the method proposed by S. Y. Lo et al. (S. Y. Lo, H. M. Hang, S. W. Chan, and J. J. Lin, "Efficient dense modules of asymmetric convolution for real-time semantic segmentation," in Proc. ACM Multimedia Asia (MMAsia), Dec. 2019, pp. 1-6.);
LEDNet corresponds to the method proposed by Y. Wang et al. (Y. Wang et al., "LEDNet: A lightweight encoder-decoder network for real-time semantic segmentation," in Proc. IEEE Int. Conf. Image Process. (ICIP), Aug. 2019, pp. 1860-1864.);
DFANet corresponds to the method proposed by H. Li et al. (H. Li, P. Xiong, H. Fan, and J. Sun, "DFANet: Deep feature aggregation for real-time semantic segmentation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 9522-9531.);
DF1-Seg corresponds to the method proposed by X. Li et al. (X. Li, Y. Zhou, Z. Pan, and J. Feng, "Partial order pruning: For best speed/accuracy trade-off in neural architecture search," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 9145-9153.);
DF2-Seg corresponds to the method proposed by X. Li et al. (X. Li, Y. Zhou, Z. Pan, and J. Feng, "Partial order pruning: For best speed/accuracy trade-off in neural architecture search," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 9145-9153.);
LRNNet corresponds to the method proposed by W. Jiang et al. (W. Jiang, Z. Xie, Y. Li, C. Liu, and H. Lu, "LRNNet: A light-weighted network with efficient reduced non-local operation for real-time semantic segmentation," in Proc. IEEE Int. Conf. Multimedia and Expo Workshops (ICMEW), Jul. 2020, pp. 1-6.);
RTHP corresponds to the method proposed by G. Dong et al. (G. Dong, Y. Yan, C. Shen, and H. Wang, "Real-time high-performance semantic image segmentation of urban street scenes," IEEE Trans. Intell. Transp. Syst., pp. 1-17, Jan. 2020.);
SwiftNet corresponds to the method proposed by M. Orsic et al. (M. Orsic, I. Kreso, P. Bevandic, and S. Segvic, "In defense of pre-trained ImageNet architectures for real-time semantic segmentation of road-driving images," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 12607-12616.);
SwiftNet-ens corresponds to the method proposed by M. Orsic et al. (M. Orsic, I. Kreso, P. Bevandic, and S. Segvic, "In defense of pre-trained ImageNet architectures for real-time semantic segmentation of road-driving images," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 12607-12616.);
SFNet (DF2) corresponds to the method proposed by X. Li et al. (X. Li et al., "Semantic flow for fast and accurate scene parsing," in Proc. Eur. Conf. Comput. Vis. (ECCV), Nov. 2020, pp. 775-793.);
SFNet (ResNet-18) corresponds to the method proposed by X. Li et al. (X. Li et al., "Semantic flow for fast and accurate scene parsing," in Proc. Eur. Conf. Comput. Vis. (ECCV), Nov. 2020, pp. 775-793.);
BiSeNet (ResNet-18) corresponds to the method proposed by C. Yu et al. (C. Yu, J. Wang, C. Peng, C. Gao, G. Yu, and N. Sang, "BiSeNet: Bilateral segmentation network for real-time semantic segmentation," in Proc. Eur. Conf. Comput. Vis. (ECCV), Sep. 2018, pp. 325-341.);
BiSeNetV2 corresponds to the method proposed by C. Yu et al. (C. Yu, C. Gao, J. Wang, G. Yu, C. Shen, and N. Sang, "BiSeNetV2: Bilateral network with guided aggregation for real-time semantic segmentation," Apr. 2020, arXiv:2004.02147. [Online]. Available: https://arxiv.org/abs/2004.02147).

Claims (8)

1. A real-time street view image semantic segmentation method based on staged feature semantic alignment is characterized by comprising the following steps:
A. dividing the street view image semantic segmentation data set into a training set, a validation set and a test set;
B. based on a lightweight image classification network structure, combining a specially designed spatial-channel attention module to construct the base network of a semantic segmentation network model;
the base network of the semantic segmentation network model is constructed in the following two sub-steps:
B1. taking the lightweight image classification network ResNet-18 as the base and removing all network layers after its last basic residual block to obtain the preliminary base network of the semantic segmentation network model; the base network contains 8 basic residual blocks in total and is divided into four stages of 2 consecutive basic residual blocks each: Res-1, Res-2, Res-3 and Res-4;
B2. embedding a spatial-channel attention module between the two residual blocks of each of Res-2, Res-3 and Res-4 to improve the feature representation capability of the base network and reduce the information loss caused by the downsampling operations, thereby obtaining the base network part of the semantic segmentation network model; the spatial-channel attention module comprises two branch paths, wherein the spatial branch comprises a 1×1 standard convolution and a Sigmoid activation function, and the channel branch comprises a global average pooling operation, a 1×1 1-D convolution and a Sigmoid activation function;
C. designing feature semantic alignment modules with different network structures according to the characteristics of the features at different stages of the base network obtained in step B;
D. taking the base network obtained in step B as the encoder and combining a global average pooling layer with the feature semantic alignment modules designed in step C as the decoder, constructing a semantic segmentation network model with a symmetric encoder-decoder structure;
E. aggregating the final-stage output features of the network obtained in step D with the first-stage features of the encoder, and sending them to a semantic segmentation result generation module to form a prediction result;
F. training the parameters of the semantic segmentation network obtained in step E with the semantic segmentation training set;
G. during training, selectively sending the output features of some of the feature semantic alignment modules to mutually independent semantic segmentation result generation modules to produce additional prediction results, and jointly updating the network parameters with these predictions, so as to explicitly address the multi-scale objects in street view images;
H. inputting the test set into the trained network to obtain the semantic segmentation result of the corresponding street view image.
2. The real-time street view image semantic segmentation method based on staged feature semantic alignment according to claim 1, wherein in step A the street view image semantic segmentation data set adopts the public data set Cityscapes, which contains 25000 street view images divided by annotation fineness into a fine annotation subset of 5000 images and a coarse annotation subset of 20000 images; the fine annotation subset is further divided into a training set of 2975 images, a validation set of 500 images and a test set of 1525 images; each image has a resolution of 1024×2048, and each pixel is labeled as one of 19 predefined categories: road, sidewalk, building, wall, fence, pole, traffic light, traffic sign, vegetation, terrain, sky, person, rider, car, truck, bus, train, motorcycle and bicycle.
3. The real-time street view image semantic segmentation method based on staged feature semantic alignment according to claim 1, wherein in step C each of the feature semantic alignment modules with different network structures contains two input features and one output feature; the two input features have different sizes, the small-size input feature coming from the previous module connected to this module and the large-size input feature coming from the corresponding stage of the base network obtained in step B; features from the base network pass through an additional CBR module to reduce their number of channels, the CBR module comprising a 3×3 standard convolution, a batch normalization operation and a ReLU activation function;
then, the large-size input feature passes through a feature enhancement module of stage-specific design and a spatial-channel attention module, so that its representation capability is enhanced according to its own characteristics; the feature enhancement module FEB passes the input feature through a series of convolution layers and normalization operations to enhance its semantic information or spatial detail information, then aggregates the enhanced feature with the input feature and applies a ReLU activation function; for features from Res-4, the convolution layers in feature enhancement module FEB-4 are several depthwise separable convolutions with different dilation rates, to enhance semantic information; for features from Res-2, feature enhancement module FEB-2 employs standard convolutions to improve the capture of spatial detail information; for features from Res-3, feature enhancement module FEB-3 uses depthwise separable convolutions without dilation, to balance enhanced feature representation against module computational complexity; for features from Res-1, no feature enhancement module is used;
meanwhile, the small-size input feature passes through a CBR module and an upsampling operation to obtain the same size and number of channels as the processed large-size input feature; the two processed input features are then concatenated and fed into a 3×3 standard convolution to learn a semantic offset field between them; the learned semantic offset field performs a semantic alignment operation on the processed small-size input feature; finally, the processed large-size input feature and the semantically aligned small-size input feature are aggregated and sent into another spatial-channel attention module to generate the output feature of the feature semantic alignment module.
4. The real-time street view image semantic segmentation method based on staged feature semantic alignment according to claim 1, wherein in step D the semantic segmentation network model is constructed as follows: the base network obtained in step B serves as the encoder, providing four features from the four encoder stages Res-1, Res-2, Res-3 and Res-4; feature semantic alignment module-1 through feature semantic alignment module-4, designed in step C according to the characteristics of the features from Res-1 through Res-4, are obtained; finally, a global average pooling layer, feature semantic alignment module-4, module-3, module-2 and module-1 are appended in sequence at the end of the base network, the newly added modules forming the decoder of the semantic segmentation network model and yielding a symmetric encoder-decoder structure; branch paths are established between Res-1 through Res-4 and feature semantic alignment module-1 through module-4 to pass the output features of the corresponding base network stages to the corresponding feature semantic alignment modules.
5. The real-time street view image semantic segmentation method based on staged feature semantic alignment according to claim 1, wherein in step E the aggregation is implemented as follows: the final output of the semantic segmentation network model obtained in step D is channel-concatenated with the output features of Res-1, and the concatenated features are sent into a semantic segmentation result generation module comprising one CBR operation, one 3×3 standard convolution and one upsampling operation; the CBR operation reduces the number of channels to 64, the 3×3 standard convolution further reduces the 64 channels to the number of categories of the semantic segmentation data set, and the upsampling operation restores the resulting features to the same size as the original input image to obtain the final semantic segmentation result.
6. The real-time street view image semantic segmentation method based on staged feature semantic alignment according to claim 1, wherein in step F the training applies data enhancement to the original data set with three methods: random horizontal flipping, random scaling and random cropping; the scale factor of the random scaling is 0.5-2.0 and the size of the random cropping is 768×1536; the initial learning rate of the network is set to 0.005, the weight decay to 0.0005 and the momentum to 0.9, with stochastic gradient descent as the optimizer; the learning rate follows the poly policy, updated with a polynomial power of 0.9; the whole network is trained for 120000 iterations with 12 samples per iteration.
7. The real-time street view image semantic segmentation method based on staged feature semantic alignment according to claim 1, wherein in step G the specific method of selectively sending the output features of some feature semantic alignment modules to mutually independent semantic segmentation result generation modules and jointly updating the network parameters with the resulting predictions is: from the outputs of feature semantic alignment module-1 through module-4 of the model obtained in step D, selecting some to feed into semantic segmentation result generation modules of the same structure as in step E; the outputs of feature semantic alignment module-3 and module-4 are selected, each passing through its own semantic segmentation result generation module to obtain an auxiliary semantic segmentation result; the whole network produces three final outputs, each compared with the annotation images provided by the data set to obtain a corresponding cross-entropy loss; finally, the three cross-entropy losses are summed and, together with step F, the network parameters are updated by the backpropagation algorithm.
8. The real-time street view image semantic segmentation method based on staged feature semantic alignment according to claim 1, wherein in step H the test set is input into the trained network, the test images being fed directly into the network without any test-time tricks to obtain semantic segmentation results of the corresponding size.
CN202110295657.5A 2021-03-19 2021-03-19 Real-time street view image semantic segmentation method based on staged feature semantic alignment Active CN113011429B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110295657.5A CN113011429B (en) 2021-03-19 2021-03-19 Real-time street view image semantic segmentation method based on staged feature semantic alignment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110295657.5A CN113011429B (en) 2021-03-19 2021-03-19 Real-time street view image semantic segmentation method based on staged feature semantic alignment

Publications (2)

Publication Number Publication Date
CN113011429A CN113011429A (en) 2021-06-22
CN113011429B true CN113011429B (en) 2023-07-25

Family

ID=76403150

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110295657.5A Active CN113011429B (en) 2021-03-19 2021-03-19 Real-time street view image semantic segmentation method based on staged feature semantic alignment

Country Status (1)

Country Link
CN (1) CN113011429B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113989773B (en) * 2021-10-27 2024-05-31 智道网联科技(北京)有限公司 BiSeNet-based traffic sign recognition method and BiSeNet-based traffic sign recognition device for automatic driving

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110188817A (en) * 2019-05-28 2019-08-30 厦门大学 A kind of real-time high-performance street view image semantic segmentation method based on deep learning
CN110781895A (en) * 2019-10-10 2020-02-11 湖北工业大学 Image semantic segmentation method based on convolutional neural network
CN111563909A (en) * 2020-05-10 2020-08-21 中国人民解放军91550部队 Semantic segmentation method for complex street view image
CN112101150A (en) * 2020-09-01 2020-12-18 北京航空航天大学 Multi-feature fusion pedestrian re-identification method based on orientation constraint
CN112102303A (en) * 2020-09-22 2020-12-18 中国科学技术大学 Semantic image analogy method for generating countermeasure network based on single image
DE102019123756A1 (en) * 2019-09-05 2021-03-11 Connaught Electronics Ltd. Neural network for performing semantic segmentation of an input image

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110188817A (en) * 2019-05-28 2019-08-30 厦门大学 A kind of real-time high-performance street view image semantic segmentation method based on deep learning
DE102019123756A1 (en) * 2019-09-05 2021-03-11 Connaught Electronics Ltd. Neural network for performing semantic segmentation of an input image
CN110781895A (en) * 2019-10-10 2020-02-11 湖北工业大学 Image semantic segmentation method based on convolutional neural network
CN111563909A (en) * 2020-05-10 2020-08-21 中国人民解放军91550部队 Semantic segmentation method for complex street view image
CN112101150A (en) * 2020-09-01 2020-12-18 北京航空航天大学 Multi-feature fusion pedestrian re-identification method based on orientation constraint
CN112102303A (en) * 2020-09-22 2020-12-18 中国科学技术大学 Semantic image analogy method for generating countermeasure network based on single image

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
DFANet: Deep Feature Aggregation for Real-Time Semantic Segmentation; Hanchao Li et al.; arXiv; pp. 1-10 *

Also Published As

Publication number Publication date
CN113011429A (en) 2021-06-22

Similar Documents

Publication Publication Date Title
Liu et al. Teinet: Towards an efficient architecture for video recognition
Dong et al. Real-time high-performance semantic image segmentation of urban street scenes
Zhou et al. Contextual ensemble network for semantic segmentation
Deng et al. RFBNet: deep multimodal networks with residual fusion blocks for RGB-D semantic segmentation
Li et al. Traffic scene segmentation based on RGB-D image and deep learning
Zhu et al. Towards high performance video object detection for mobiles
Hu et al. Joint pyramid attention network for real-time semantic segmentation of urban scenes
Wang et al. ADSCNet: asymmetric depthwise separable convolution for semantic segmentation in real-time
CN111666948B (en) Real-time high-performance semantic segmentation method and device based on multipath aggregation
Zhang et al. Lightweight and efficient asymmetric network design for real-time semantic segmentation
CN113011336B (en) Real-time street view image semantic segmentation method based on deep multi-branch aggregation
Zhou et al. Deep road scene understanding
Hu et al. Efficient fast semantic segmentation using continuous shuffle dilated convolutions
Zhou et al. RSANet: towards real-time object detection with residual semantic-guided attention feature pyramid network
Nan et al. A joint object detection and semantic segmentation model with cross-attention and inner-attention mechanisms
Zhang et al. Lightweight and progressively-scalable networks for semantic segmentation
Ruan et al. Vision transformers: state of the art and research challenges
CN113011429B (en) Real-time street view image semantic segmentation method based on staged feature semantic alignment
Yi et al. Elanet: effective lightweight attention-guided network for real-time semantic segmentation
Lu et al. MFNet: Multi-feature fusion network for real-time semantic segmentation in road scenes
Tang et al. EPRNet: Efficient pyramid representation network for real-time street scene segmentation
Jiang et al. Mirror complementary transformer network for RGB‐thermal salient object detection
Li et al. EFRNet: Efficient feature reuse network for real-time semantic segmentation
Wang et al. A lightweight network with attention decoder for real-time semantic segmentation
Ye et al. Efficient joint-dimensional search with solution space regularization for real-time semantic segmentation

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant