CN116051977A - Multi-branch fusion-based lightweight foggy weather street view semantic segmentation algorithm - Google Patents

Multi-branch fusion-based lightweight foggy weather street view semantic segmentation algorithm

Info

Publication number
CN116051977A
Authority
CN
China
Prior art keywords
network
feature map
street view
training
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211523734.9A
Other languages
Chinese (zh)
Inventor
刘丽伟
王芮
王玲
杜磊
赵强
候德彪
侯阿临
李秀华
梁超
杨冬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Changchun University of Technology
Original Assignee
Changchun University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Changchun University of Technology filed Critical Changchun University of Technology
Priority to CN202211523734.9A priority Critical patent/CN116051977A/en
Publication of CN116051977A publication Critical patent/CN116051977A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/10 Terrestrial scenes
    • G06V20/182 Network patterns, e.g. roads or rivers
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/42 Global feature extraction by analysis of the whole pattern, e.g. using frequency domain transformations or autocorrelation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00 Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10 Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a lightweight foggy weather street view semantic segmentation algorithm based on multi-branch fusion, aiming to improve urban street view segmentation accuracy in foggy weather and to speed up network training. Because foggy urban street scenes contain numerous targets whose features are difficult to distinguish, the classical road semantic segmentation models cannot meet the accuracy and real-time requirements of urban street view recognition in practical applications. The proposed algorithm uses an encoder-decoder framework with MobileNetV2 as the backbone network; the high-level feature map extracted by the backbone is fed into an atrous spatial pyramid pooling layer and a global information extraction layer, whose outputs are fused by addition; a 1×1 convolution then adjusts the channel count of the output before the upsampling operation, which facilitates fusion and concatenation with the shallow feature map. The network model segments urban street views more accurately, greatly shortens training time, and provides effective help for the field of unmanned driving.

Description

Multi-branch fusion-based lightweight foggy weather street view semantic segmentation algorithm
Technical Field
The invention provides a lightweight foggy weather street view semantic segmentation algorithm based on multi-branch fusion, which adopts a segmentation algorithm improved from the DeepLabV3+ network model; the network body is similar to the DeepLabV3+ structure but adopts a lightweight backbone network; a global information extraction layer is designed, constructed and connected in parallel with the atrous spatial pyramid pooling layer, so that the two branches extract global information and multi-scale information respectively before being fused; the improved DeepLabV3+ network model can segment blurred foggy-day object images more accurately, improves feature reuse efficiency while guaranteeing overall segmentation accuracy, solves the slow training caused by the complex structure of the original DeepLabV3+ network, and effectively addresses the information blurring and low segmentation accuracy caused by low visibility in foggy weather.
Background
Streets are important components of cities, and understanding urban street scenes is an important basis for emerging smart-city applications such as automatic driving, intelligent navigation and intelligent monitoring; in recent years, with the rapid development of vehicle-mounted cameras, monitoring cameras and similar equipment, the quantity and quality of collected urban street scene images have greatly improved; however, objects in road scenes differ greatly in size, object types are varied, weather conditions change and scenes are complex, which leads to inaccurate and slow road scene segmentation; therefore, under the condition of limited training data, researching how to comprehensively combine model improvement, data augmentation and data generation to improve the performance and generalization capability of foggy-day urban street view segmentation models, so that the models are more robust under different weather conditions, is of very important significance for the practical application of urban street view segmentation algorithms.
With the progress of deep learning technology and the development of large-scale data sets, semantic segmentation has advanced rapidly; in particular, the appearance of the DeepLab series of networks has greatly advanced street view semantic segmentation; DeepLabV3+ is a context-aggregating semantic segmentation network based on a spatial feature pyramid, which acquires context information using dilated (atrous) convolution; however, DeepLabV3+ generates a large number of parameters at run time and consumes considerable computation time, considering only segmentation accuracy and not the real-time performance of the network; yet intelligent driving, the largest application of street view segmentation, not only requires segmentation accuracy but is also very sensitive to algorithm latency, demanding real-time processing speed and rapid interaction and response capability; how to guarantee accuracy while improving running speed is therefore the key issue for urban street view segmentation algorithms.
Disclosure of Invention
Aiming at the low segmentation accuracy, large network parameter count and slow running speed caused by the incomplete semantic information and insufficient context connection of the DeepLabV3+ network, the invention provides a lightweight foggy weather street view semantic segmentation algorithm based on multi-branch fusion; the designed global information extraction layer solves the problems of incomplete semantic information and insufficient context connection of the DeepLabV3+ network, so that foggy urban street view targets are segmented better; the lightweight backbone network MobileNetV2 solves the slow training caused by a complex network structure while guaranteeing target segmentation accuracy.
In order to achieve the above object, the technical scheme of the present invention is as follows:
a lightweight foggy weather street view semantic segmentation algorithm based on multi-branch fusion comprises the following steps:
step one: data preprocessing, namely resizing the data set images to a network-trainable size as required and applying data augmentation;
step two: constructing a lightweight backbone network MobileNet V2 and a global information extraction layer;
step three: training the model on the training set to obtain foggy weather street view image segmentation results, and saving the best network model;
step four: loading the network model and testing the test set to obtain the foggy weather street view image segmentation results.
The specific process in the first step is as follows:
(1) Collecting a semantic segmentation data set of foggy weather street view images, and dividing it into a training set, a validation set and a test set;
(2) Selecting the synthetic Foggy Driving data set, which uses road scene images with fine Cityscapes annotations and applies the atmospheric scattering model to simulate fog of 3 different concentrations (visibility of 200, 100 and 50 meters respectively, i.e. fog attenuation coefficient values of 0.005 to 0.02), constructing a mixed foggy road scene data set with several different concentrations (a sketch of this fog synthesis is given after this step list); selecting the Foggy Cityscapes data set, divided into light fog, medium fog and dense fog scenes, each part containing 19 categories such as cars, people and roads, with 13900 images in total across the three parts;
(3) In order to improve the robustness of the neural network and expand the data set, data augmentation is applied to the foggy data set; geometric transformations are performed on the existing images, including image flipping, random rotation, translation, random cropping, deformation scaling and noise disturbance; this increases the directional invariance of the network and reduces the probability of misjudgment by the network model; foggy images of three different concentrations are used as training samples to improve the generalization capability of the model.
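The atmospheric scattering model referenced above can be rendered as follows; this is a minimal illustration, and the NumPy implementation, the per-pixel depth map (e.g. derived from Cityscapes disparity data) and the airlight value are assumptions not specified in the patent:

```python
import numpy as np

def synthesize_fog(clean_rgb: np.ndarray, depth_m: np.ndarray,
                   beta: float = 0.01, airlight: float = 255.0) -> np.ndarray:
    """Render synthetic fog with the atmospheric scattering model
    I(x) = J(x) * t(x) + A * (1 - t(x)),  t(x) = exp(-beta * d(x)).

    clean_rgb: H x W x 3 clean street-view image (uint8).
    depth_m:   H x W per-pixel scene depth in meters (assumed available).
    beta:      fog attenuation coefficient; 0.005-0.02 in the patent.
    """
    t = np.exp(-beta * depth_m)[..., None]                 # transmission map
    foggy = clean_rgb.astype(np.float32) * t + airlight * (1.0 - t)
    return foggy.clip(0, 255).astype(np.uint8)

# Three fog densities, matching the light / medium / dense mix above:
# for beta in (0.005, 0.01, 0.02):
#     foggy = synthesize_fog(img, depth, beta=beta)
```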
The specific process in the second step is as follows:
(1) Constructing a lightweight network MobileNetV2 as a backbone network to extract high-level semantic features:
(1) the core operation of the MobileNetV2 network is the introduction of depthwise separable convolutions in place of standard convolutions; the depthwise separable convolution is more efficient at controlling the network parameter count and speed;
(2) the depthwise separable convolution comprises two parts, a Depthwise convolution and a Pointwise convolution; the Depthwise convolution is performed entirely in the two-dimensional plane, with channels and convolution kernels in one-to-one correspondence; the Pointwise convolution is an ordinary convolution with a 1×1 kernel, placed after the Depthwise convolution, and is used to fuse information across channels and enhance the expressive capability of the network;
In the convolution operation, if the number of input channels is $C_i$, the convolution kernel size is $k \times k$, the number of output channels is $C_o$, and the output feature map size is $H \times W$, then the ratio of the parameter count of the depthwise separable convolution to that of the standard convolution is

$$\frac{C_i \cdot k^2 + C_i \cdot C_o}{C_i \cdot k^2 \cdot C_o} = \frac{1}{C_o} + \frac{1}{k^2}$$

and the ratio of the computational cost is

$$\frac{(C_i \cdot k^2 + C_i \cdot C_o) \cdot H \cdot W}{C_i \cdot k^2 \cdot C_o \cdot H \cdot W} = \frac{1}{C_o} + \frac{1}{k^2}$$

As the two formulas show, the computational complexity of the depthwise separable convolution is greatly reduced compared with the standard convolution, meeting the requirements of fewer parameters and higher computation speed;
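To illustrate the two ratios above, the following sketch builds a standard convolution and a depthwise separable convolution and compares their parameter counts; PyTorch is our assumed framework, and the layer names are illustrative:

```python
import torch.nn as nn

def depthwise_separable(c_in: int, c_out: int, k: int = 3) -> nn.Sequential:
    return nn.Sequential(
        # Depthwise: one k x k kernel per input channel (groups = c_in).
        nn.Conv2d(c_in, c_in, k, padding=k // 2, groups=c_in, bias=False),
        # Pointwise: 1x1 convolution fusing information across channels.
        nn.Conv2d(c_in, c_out, 1, bias=False),
    )

def n_params(m: nn.Module) -> int:
    return sum(p.numel() for p in m.parameters())

ci, co, k = 64, 128, 3
standard = nn.Conv2d(ci, co, k, padding=1, bias=False)
separable = depthwise_separable(ci, co, k)
# Ratio matches 1/Co + 1/k^2 = 1/128 + 1/9 ~= 0.119
print(n_params(separable) / n_params(standard))
```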
(3) the MobileNetV2 network is formed by stacking several inverted residual modules, which help improve accuracy and build deeper networks; first, a 1×1 convolution increases the number of channels of the feature map, expanding it, enriching the features and improving accuracy; second, a 3×3 depthwise convolution extracts the features of each channel, reducing the amount of computation; finally, a 1×1 convolution reduces the channel count; the activation function used after the expansion convolution and the depthwise convolution is ReLU6, while the activation after the compression convolution is a linear function, preventing ReLU6 from further damaging the compressed features;
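A minimal sketch of such an inverted residual module follows, assuming the usual MobileNetV2 expansion factor of 6 (the expansion factor is not stated in this paragraph):

```python
import torch.nn as nn

class InvertedResidual(nn.Module):
    """1x1 expansion -> 3x3 depthwise -> 1x1 linear projection,
    with ReLU6 after the first two stages and no activation after the last."""
    def __init__(self, c_in: int, c_out: int, stride: int = 1, expand: int = 6):
        super().__init__()
        c_mid = c_in * expand
        self.use_skip = stride == 1 and c_in == c_out
        self.block = nn.Sequential(
            nn.Conv2d(c_in, c_mid, 1, bias=False),            # expansion
            nn.BatchNorm2d(c_mid), nn.ReLU6(inplace=True),
            nn.Conv2d(c_mid, c_mid, 3, stride, 1,
                      groups=c_mid, bias=False),               # depthwise
            nn.BatchNorm2d(c_mid), nn.ReLU6(inplace=True),
            nn.Conv2d(c_mid, c_out, 1, bias=False),            # compression
            nn.BatchNorm2d(c_out),                             # linear, no ReLU6
        )

    def forward(self, x):
        out = self.block(x)
        return x + out if self.use_skip else out
```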
(2) Building a global information extraction layer:
(1) the global information extraction layer has the functions of supplementing target edge information, carrying out edge prediction and improving the small target segmentation performance of the model;
(2) the global information extraction layer consists of a convolution layer and a polarized attention mechanism; the high-level semantic feature map generated by the backbone network is passed into a 1×1 convolution, preserving the integrity of the target information; it then enters the polarized attention mechanism, whose orthogonal design maintains high channel resolution and high spatial resolution while keeping the parameter count low; a composition of Softmax and Sigmoid is added on the channel branch and the spatial branch to increase nonlinearity, so as to fit a more realistic and finer output distribution;
(3) the polarized attention mechanism is divided into two branches, a channel branch and a spatial branch; the weight of the channel branch is computed as

$$A^{ch}(X) = \mathrm{Sigmoid}\left[\mathrm{LN}\left(\mathrm{Conv}_{1\times1}\left(V \times \mathrm{Softmax}(Q)\right)\right)\right], \qquad Q = \mathrm{Conv}^{q}_{1\times1}(X),\; V = \mathrm{Conv}^{v}_{1\times1}(X)$$

the input feature X is first transformed into Q and V by 1×1 convolutions, where the channel dimension of Q is fully compressed while the channel dimension of V remains at a relatively high level (i.e. C/2); because the channel dimension of Q is compressed, its information must be enhanced over a high dynamic range, so Softmax is applied to Q; Q and V are then matrix-multiplied, followed by a 1×1 convolution and LN that raise the channel dimension from C/2 back to C; finally, a Sigmoid function keeps all weights between 0 and 1;
the weight of the spatial branch is computed as

$$A^{sp}(X) = \mathrm{Sigmoid}\left[\mathrm{Reshape}\left(\mathrm{Softmax}(Q) \times V\right)\right], \qquad Q = \mathrm{GP}\left(\mathrm{Conv}^{q}_{1\times1}(X)\right),\; V = \mathrm{Conv}^{v}_{1\times1}(X)$$

where GP denotes global pooling; similar to the channel branch, the input features are first converted into Q and V by 1×1 convolutions; for the Q features, global pooling additionally compresses the spatial dimension to a size of 1×1, while the spatial dimension of the V features is maintained at a relatively large level (H×W); because the spatial dimension of Q is compressed, the information of Q is enhanced with Softmax; Q and V are then matrix-multiplied, followed by a Reshape and a Sigmoid so that all weights remain between 0 and 1.
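A hedged PyTorch reconstruction of the two branches just described is sketched below; the module structure, the LayerNorm placement and the parallel additive fusion of the two re-weighted maps are our reading of the text, not the patent's reference implementation:

```python
import torch
import torch.nn as nn

class PolarizedSelfAttention(nn.Module):
    """Channel and spatial branches, each with a Softmax-Sigmoid pair."""
    def __init__(self, c: int):
        super().__init__()
        # Channel branch
        self.ch_q = nn.Conv2d(c, 1, 1)           # fully compress channels
        self.ch_v = nn.Conv2d(c, c // 2, 1)      # keep C/2 channels
        self.ch_up = nn.Conv2d(c // 2, c, 1)     # restore C channels
        self.ln = nn.LayerNorm([c, 1, 1])
        # Spatial branch
        self.sp_q = nn.Conv2d(c, c // 2, 1)
        self.sp_v = nn.Conv2d(c, c // 2, 1)
        self.softmax = nn.Softmax(dim=-1)

    def forward(self, x):
        b, c, h, w = x.shape
        # Channel branch: weights of shape (B, C, 1, 1)
        q = self.softmax(self.ch_q(x).reshape(b, 1, h * w))   # Softmax on Q
        v = self.ch_v(x).reshape(b, c // 2, h * w)
        z = torch.matmul(v, q.transpose(1, 2)).unsqueeze(-1)  # V x Softmax(Q)
        ch_w = torch.sigmoid(self.ln(self.ch_up(z)))          # conv, LN, Sigmoid
        # Spatial branch: weights of shape (B, 1, H, W)
        q = self.sp_q(x).mean(dim=(2, 3))                     # global pooling
        q = self.softmax(q.reshape(b, 1, c // 2))
        v = self.sp_v(x).reshape(b, c // 2, h * w)
        sp_w = torch.sigmoid(torch.matmul(q, v).reshape(b, 1, h, w))
        # Fuse the two re-weighted feature maps (assumed parallel layout).
        return ch_w * x + sp_w * x
```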
The specific process in the third step is as follows:
(1) Constructing a multi-branch fusion light-weight network model;
(2) The method improves on the original DeepLabV3+ semantic segmentation model, using an encoder-decoder framework with MobileNetV2 as the backbone network; in the encoder stage, the image first passes through the backbone to extract complete information, and the resulting high-level feature maps are sent to the atrous spatial pyramid pooling (ASPP) layer and the global information extraction layer respectively; the ASPP layer consists of 3 atrous convolutions with dilation rates of 6, 12 and 18, one 1×1 convolution and one global average pooling layer; the resulting 5 feature maps are then concatenated directly along the channel dimension to complete the multi-scale sampling process; atrous convolutions of different scales plus the additional global average pooling effectively extract key information and enlarge the receptive field; the global information extraction layer compensates for the edge information lost by the multi-scale dilated convolutions; the fused feature map passes through a 1×1 convolution to reduce the channel count and is finally output to the next layer together with the low-level feature map; the low-level feature map provides detail information, while the high-level feature map provides semantic information (a sketch of the ASPP layer follows this paragraph);
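A minimal sketch of the ASPP layer described above, under the assumption of 256 output channels per branch (the width used later in step2.2); batch normalization and activations are omitted for brevity:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASPP(nn.Module):
    """One 1x1 conv, three 3x3 atrous convs (rates 6/12/18) and a global
    average pooling branch, concatenated along the channel dimension."""
    def __init__(self, c_in: int, c_out: int = 256):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Conv2d(c_in, c_out, 1, bias=False)] +
            [nn.Conv2d(c_in, c_out, 3, padding=r, dilation=r, bias=False)
             for r in (6, 12, 18)]
        )
        self.pool = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                  nn.Conv2d(c_in, c_out, 1, bias=False))
        self.project = nn.Conv2d(5 * c_out, c_out, 1, bias=False)

    def forward(self, x):
        h, w = x.shape[2:]
        feats = [b(x) for b in self.branches]
        g = F.interpolate(self.pool(x), size=(h, w),
                          mode='bilinear', align_corners=False)
        return self.project(torch.cat(feats + [g], dim=1))
```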
(3) The decoder stage adopts a simple and efficient module; first, the high-level feature layer output by the encoder is bilinearly upsampled, enlarging it to 4 times its size; then a 1×1 convolution is applied to the corresponding low-level feature layer of the same resolution from the feature extraction backbone to reduce its channel count; the two resulting feature layers are then concatenated together, the features are refined by a 3×3 convolution, and a final 4× upsampling completes the decoding operation.
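A minimal PyTorch sketch of this decoder; the 48-channel reduction width and the 19-class output are assumptions borrowed from common DeepLabV3+ practice rather than stated here:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Decoder(nn.Module):
    """4x bilinear upsampling of the encoder output, 1x1 reduction of the
    low-level features, concatenation, 3x3 refinement, final 4x upsampling."""
    def __init__(self, c_high: int, c_low: int, n_classes: int = 19):
        super().__init__()
        self.reduce = nn.Conv2d(c_low, 48, 1, bias=False)   # assumed width
        self.refine = nn.Conv2d(c_high + 48, n_classes, 3, padding=1)

    def forward(self, high, low):
        high = F.interpolate(high, scale_factor=4,
                             mode='bilinear', align_corners=False)
        x = self.refine(torch.cat([high, self.reduce(low)], dim=1))
        return F.interpolate(x, scale_factor=4,
                             mode='bilinear', align_corners=False)
```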
The specific process in the fourth step is as follows:
(1) The training process adopts a stochastic gradient descent optimization algorithm; momentum is set to 0.9, the exponential decay rate of the second-moment estimate is 0.999, the initial learning rate is 0.01, and the weight decay is $5\times10^{-4}$; the poly learning-rate decay policy is selected, with a decay power of 0.9; the loss function is a cross-entropy loss based on the softmax function; the cross-entropy function is a loss function commonly used for classification problems, with the specific formula

$$L = -\sum_{i=1}^{N} y_i \log(p_i), \qquad p_i = \frac{e^{z_i}}{\sum_{j=1}^{N} e^{z_j}}$$

where $z_i$ is the network output for class $i$, $p_i$ is its softmax probability, and $y_i$ is the one-hot ground-truth label; the softmax function processes the output so that the predicted values of the classes sum to 1, and the loss is then computed by cross entropy;
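Under the hyperparameters just listed, a hedged training-loop sketch looks as follows; the placeholder model, the iteration budget max_iter and the train_loader are assumptions, and plain SGD is used as stated even though the 0.999 second-moment decay would normally belong to an Adam-style optimizer:

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 19, 1)   # stand-in for the multi-branch segmentation network
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=5e-4)
max_iter = 30000              # assumed iteration budget, not given in the patent
# Poly policy: lr = lr0 * (1 - iter / max_iter) ** 0.9
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lambda it: (1 - it / max_iter) ** 0.9)
criterion = nn.CrossEntropyLoss()  # cross entropy over softmax outputs

# train_loader is assumed to yield (image, label) batches of the foggy data set
for images, labels in train_loader:
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()            # backpropagation
    optimizer.step()
    scheduler.step()
```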
(2) The data set is put into a network for training and evaluation to obtain an optimal network segmentation result, and a network model of the optimal network segmentation result is stored;
(3) Testing the test set, and retaining the test results and the generated street view segmentation map.
Compared with the prior art, the technical scheme of the invention has the beneficial effects that:
(1) The invention is based on a convolutional neural network and replaces the original Xception backbone with a lightweight backbone network, solving the problems of the large parameter count and slow operation of the DeepLabV3+ network model;
(2) By designing the global information extraction layer module, the invention extracts the edge information of blurred targets at a fine granularity and fuses it with the multi-scale information, preserving the integrity of the target information, alleviating inaccurate and missed segmentation, and improving the accuracy of the network.
Drawings
FIG. 1 is a flow chart of the method of the present invention;
FIG. 2 is a global information extraction layer module constructed in accordance with the present invention;
FIG. 3 is the network model of the improved lightweight DeepLabV3+.
Detailed Description
It will be appreciated by those skilled in the art that certain well-known structures in the drawings and descriptions thereof may be omitted; the technical scheme of the invention is further described below with reference to the accompanying drawings and the examples;
the invention provides a light foggy weather street view semantic segmentation algorithm based on multi-branch fusion, which realizes foggy weather city street view semantic segmentation and provides a more accurate street view segmentation map for the automatic driving field;
FIG. 1 is a flow chart of a method of the invention, which provides a lightweight foggy weather street view semantic segmentation algorithm based on multi-branch fusion;
FIG. 2 is the global information extraction layer constructed by the invention, in which a 1×1 convolution preserves the global features and is followed by a PSA polarized attention mechanism that accurately extracts edge information, compensating for the loss of target edge detail caused by multi-scale extraction;
FIG. 3 shows the lightweight DeepLabV3+ network constructed by the invention, which is used to train on the data, saving the best network weights; finally, the trained network model is tested on the test set to complete the segmentation task.
The specific implementation steps are as follows:
step1.1, obtaining a city street view image, inputting it into the lightweight backbone network MobileNetV2, and extracting the shallow feature map output by the first four layers of MobileNetV2 and the high-level feature map output by the later layers;
step1.2, passing the extracted high-level feature map into the global information extraction layer and the atrous spatial pyramid pooling layer respectively, and fusing the two output feature maps;
step1.3, passing the fused feature map through a 1×1 convolution to adjust the channel count; the resulting feature map is first upsampled 4 times, concatenated and fused with the shallow features obtained earlier, and then upsampled 4 times again to obtain the final prediction map (an end-to-end sketch of step1.1 to step1.3 is given below);
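The flow of step1.1 to step1.3 can be assembled from the earlier sketches (ASPP, PolarizedSelfAttention); the composition below is a hedged illustration using torchvision's MobileNetV2 as the backbone, with an assumed split after the fourth stage, assumed 256/48 channel widths and 19 classes, and bilinear resizing to explicit target sizes so the sketch stays shape-safe under torchvision's default strides:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import mobilenet_v2

class MultiBranchSegNet(nn.Module):
    def __init__(self, n_classes: int = 19):
        super().__init__()
        feats = mobilenet_v2(weights='DEFAULT').features  # pretrained backbone
        self.low = feats[:4]        # shallow stages -> 24-channel detail map
        self.high = feats[4:]       # deeper stages -> 1280-channel semantic map
        self.aspp = ASPP(1280, 256)                  # multi-scale branch (above)
        self.gie = nn.Sequential(                    # global information branch
            nn.Conv2d(1280, 256, 1, bias=False),
            PolarizedSelfAttention(256))
        self.fuse = nn.Conv2d(256, 256, 1, bias=False)    # adjust channel count
        self.reduce = nn.Conv2d(24, 48, 1, bias=False)    # shrink shallow map
        self.refine = nn.Conv2d(256 + 48, n_classes, 3, padding=1)

    def forward(self, x):
        low = self.low(x)                       # shallow feature map
        high = self.high(low)                   # high-level feature map
        fused = self.fuse(self.aspp(high) + self.gie(high))  # two-branch fusion
        fused = F.interpolate(fused, size=low.shape[2:],
                              mode='bilinear', align_corners=False)
        out = self.refine(torch.cat([fused, self.reduce(low)], dim=1))
        return F.interpolate(out, size=x.shape[2:],
                             mode='bilinear', align_corners=False)
```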
step2.1, constructing a global information extraction layer;
step2.1.1, extracting the high-level semantic feature map generated by the backbone network, passing it into the global feature extraction layer and the atrous spatial pyramid pooling layer respectively to obtain two feature maps, and fusing them;
step2.1.2, combining a 1×1 convolution with the PSA polarized attention mechanism to form the global information extraction layer; the high-level semantic information obtained from the backbone network is put through the 1×1 convolution layer, ensuring the integrity of the target information;
step2.1.3, the PSA polarized attention mechanism is used to extract the key edge information of targets; its two parallel branches allow the feature map to keep high information integrity in both the spatial and channel dimensions, and the Softmax-Sigmoid composition of nonlinear functions enables a model with polarized self-attention to achieve better performance on pixel-level tasks;
step2.1.4, the self-attention mechanism of the channel dimension extracts the attention weights over the channels of the high-level feature map and multiplies them element by element with the input high-level feature map to obtain the image feature map in the channel dimension;
step2.1.5, the self-attention mechanism of the spatial dimension extracts the attention weights over the spatial positions of the high-level feature map and multiplies them element by element with the input high-level feature map to obtain the image feature map in the spatial dimension;
step2.1.6, fusing the spatial-domain feature map and the channel-domain feature map to obtain the feature map output by the PSA polarized attention module;
step2.2, meanwhile the feature map with 2048 channels obtained from the backbone network MobileNetV2 is processed in parallel by a 1×1 convolution, atrous convolutions with dilation rates {6,12,18} and global average pooling, yielding 5 feature maps with 256 channels each; after the 5 feature maps are concatenated and fused along the channel dimension, the feature map generated by the atrous spatial pyramid pooling module is obtained;
step2.3, the feature map from the global information extraction layer is fused with the feature map from the atrous spatial pyramid pooling layer along the channel dimension, and the result is passed into a 1×1 convolution for channel dimensionality reduction;
step3.1, training on the data set using the improved lightweight DeepLabV3+ network model;
step3.1.1, inputting a fixed-size foggy city street view image into the improved lightweight DeepLabV3+ network;
step3.1.2, the MobileNetV2 network loads the pretrained model weights, preprocesses the image, extracts useful image information and generates feature maps, which are passed to the improved global information extraction layer, the ASPP layer and the decoder part respectively;
step3.1.3, the feature maps enter the improved modules; in the global information extraction layer, a 1×1 convolution adjusts the channel count while ensuring the information integrity of the feature map; the PSA polarized attention mechanism then extracts target edge detail information, describing the edge information in depth and compensating for the information loss caused by the multiple dilated convolution layers;
step3.1.4, the feature map entering the ASPP module is divided into 5 branches for atrous convolution and global average pooling to extract features; the extracted 5 feature layers are concatenated, deep feature information continues to be extracted in the parallel branches, and finally multi-scale fusion through a 1×1 convolution yields a feature map 1/16 the size of the original city street image, which is input to the decoder part;
step3.1.5, performing a 4-times upsampling operation on the feature map processed by the encoder structure, concatenating and fusing it with the shallow feature map, and further extracting features from the resulting feature map with a 3×3 convolution to obtain the fused feature map;
step3.1.6, upsampling the fused feature map 4 times to restore the original city street image size, outputting the prediction map and completing image segmentation;
step4.1, setting the hyperparameters of the network and using the Poly training strategy to set the learning rate;
step4.2, putting the data set into the network for training to obtain the optimal network segmentation result, and saving its weights; training uses the cross-entropy loss function CrossEntropyLoss, a loss function commonly used for classification problems;
step4.3, loading the trained network model weights, putting the data set into the network for training and verification, and obtaining the network segmentation results; saving the best segmentation weights for testing on the test set; obtaining the segmentation result data and the foggy city street segmentation map (a minimal testing sketch follows).
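A minimal testing sketch matching step4.3, reusing the MultiBranchSegNet sketch above; the checkpoint filename and test_loader are illustrative assumptions:

```python
import torch

model = MultiBranchSegNet()                           # sketch defined above
model.load_state_dict(torch.load('best_model.pth'))   # illustrative checkpoint
model.eval()
with torch.no_grad():
    for images, _ in test_loader:                     # test_loader is assumed
        pred = model(images).argmax(dim=1)            # per-pixel class indices
        # save `pred` as the foggy city street segmentation map
```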

Claims (5)

1. A lightweight foggy weather street view semantic segmentation algorithm based on multi-branch fusion is characterized by comprising the following steps:
step 1: data preprocessing, namely resizing the data to a network-trainable size as required;
step 2: constructing an improved deep V3+ network structure;
step 3: setting network training parameters, training a training set by using the model, obtaining a foggy weather street view image segmentation result, and storing the best network model;
step 4: loading the network model and testing the test set to obtain the foggy weather street view image segmentation data and segmentation map.
2. The multi-branch fusion-based lightweight foggy weather street view semantic segmentation algorithm according to claim 1, wherein the specific process in Step1 is as follows:
the step1.1 simulates fog attenuation coefficient values according to the atmospheric scattering model; the fog level is divided into three concentrations, with attenuation values of 0.005, 0.01 and 0.02, corresponding to light fog, medium fog and dense fog respectively;
step1.2, dividing the data set into a training set, a verification set and a test set;
step1.3, increasing the number of data set samples by using data enhancement, performing geometric transformation on an image based on an original data image, and performing various operations such as image overturning, random rotation, translation transformation, random cutting, deformation scaling, noise disturbance and the like;
because the original images are of very large scale, the step1.4 crops the pictures according to the network requirements; the cropped picture size is 512×512.
3. The multi-branch fusion-based lightweight foggy weather street view semantic segmentation algorithm according to claim 1, wherein the specific process in Step2 is as follows:
step2.1, inputting the processed foggy city street view image into the backbone network MobileNetV2, and extracting the low-level feature map output by its shallow layers and the high-level feature map output by its deep layers;
step2.2, inputting the extracted high-level feature map into the atrous spatial pyramid pooling module and the global information extraction module respectively, and adding their outputs element by element;
step2.2.1, wherein the global information extraction module is composed of a 1×1 convolution and a PSA polarized attention mechanism, the 1×1 convolution ensuring the information integrity of the high-level feature map output by the backbone network;
step2.2.2, the self-attention mechanism of the channel dimension extracts the attention weights over the channels of the high-level feature map and multiplies them element by element with the input high-level feature map to obtain the image feature map in the channel dimension;
step2.2.3, the self-attention mechanism of the spatial dimension extracts the attention weights over the spatial positions of the high-level feature map and multiplies them element by element with the input high-level feature map to obtain the image feature map in the spatial dimension;
step2.2.4, fusing the spatial domain feature map and the channel domain feature map to obtain a feature map output by the PSA polarization attention module;
step2.3, inputting the low-level feature map into the first 1×1 convolution layer of the decoding module, fusing the high-level feature map generated by the encoder module with the low-level feature map after 4-times upsampling, passing the result into the next 3×3 convolution layer followed by a 4-times upsampling operation, and outputting the semantically segmented, enhanced image.
4. The multi-branch fusion-based lightweight foggy weather street view semantic segmentation algorithm according to claim 1, wherein the specific process in Step3 is as follows:
step3.1, setting super parameters of a network, and setting a learning rate by using a Poly training strategy;
step3.2, putting the data set into the network for training to obtain the optimal network segmentation result, and saving its weights; training uses the cross-entropy loss function CrossEntropyLoss, a loss function commonly used for classification problems.
5. The multi-branch fusion-based lightweight foggy weather street view semantic segmentation algorithm according to claim 1, wherein the specific process in Step4 is as follows:
step4.1, loading trained network model weights, putting the data set into a network for training and verification, and obtaining a network segmentation result;
step4.2 saves the best segmentation weights for testing on the test set, obtaining the segmentation result data and the foggy city street segmentation map.
CN202211523734.9A 2022-12-01 2022-12-01 Multi-branch fusion-based lightweight foggy weather street view semantic segmentation algorithm Pending CN116051977A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211523734.9A CN116051977A (en) 2022-12-01 2022-12-01 Multi-branch fusion-based lightweight foggy weather street view semantic segmentation algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211523734.9A CN116051977A (en) 2022-12-01 2022-12-01 Multi-branch fusion-based lightweight foggy weather street view semantic segmentation algorithm

Publications (1)

Publication Number Publication Date
CN116051977A true CN116051977A (en) 2023-05-02

Family

ID=86114967

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211523734.9A Pending CN116051977A (en) 2022-12-01 2022-12-01 Multi-branch fusion-based lightweight foggy weather street view semantic segmentation algorithm

Country Status (1)

Country Link
CN (1) CN116051977A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116703834A (en) * 2023-05-22 2023-09-05 浙江大学 Method and device for judging and grading excessive sintering ignition intensity based on machine vision
CN116703834B (en) * 2023-05-22 2024-01-23 浙江大学 Method and device for judging and grading excessive sintering ignition intensity based on machine vision
CN117197415A (en) * 2023-11-08 2023-12-08 四川泓宝润业工程技术有限公司 Method, device and storage medium for detecting target in inspection area of natural gas long-distance pipeline
CN117197415B (en) * 2023-11-08 2024-01-30 四川泓宝润业工程技术有限公司 Method, device and storage medium for detecting target in inspection area of natural gas long-distance pipeline

Similar Documents

Publication Publication Date Title
CN111563508B (en) Semantic segmentation method based on spatial information fusion
CN113850825B (en) Remote sensing image road segmentation method based on context information and multi-scale feature fusion
CN111563909B (en) Semantic segmentation method for complex street view image
CN116051977A (en) Multi-branch fusion-based lightweight foggy weather street view semantic segmentation algorithm
CN110766098A (en) Traffic scene small target detection method based on improved YOLOv3
CN113642390B (en) Street view image semantic segmentation method based on local attention network
CN111915592A (en) Remote sensing image cloud detection method based on deep learning
Ye et al. Real-time object detection network in UAV-vision based on CNN and transformer
CN111666948B (en) Real-time high-performance semantic segmentation method and device based on multipath aggregation
CN116189180A (en) Urban streetscape advertisement image segmentation method
CN112508960A (en) Low-precision image semantic segmentation method based on improved attention mechanism
CN113256649B (en) Remote sensing image station selection and line selection semantic segmentation method based on deep learning
CN113034506B (en) Remote sensing image semantic segmentation method and device, computer equipment and storage medium
CN112819000A (en) Streetscape image semantic segmentation system, streetscape image semantic segmentation method, electronic equipment and computer readable medium
CN114092917A (en) MR-SSD-based shielded traffic sign detection method and system
CN112861727A (en) Real-time semantic segmentation method based on mixed depth separable convolution
CN114782949B (en) Traffic scene semantic segmentation method for boundary guide context aggregation
Jin et al. A semi-automatic annotation technology for traffic scene image labeling based on deep learning preprocessing
CN112634289B (en) Rapid feasible domain segmentation method based on asymmetric void convolution
CN114973199A (en) Rail transit train obstacle detection method based on convolutional neural network
CN113096133A (en) Method for constructing semantic segmentation network based on attention mechanism
CN115995002B (en) Network construction method and urban scene real-time semantic segmentation method
CN116740362A (en) Attention-based lightweight asymmetric scene semantic segmentation method and system
Zhou et al. Multi-scale and attention residual network for single image dehazing
CN113255574B (en) Urban street semantic segmentation method and automatic driving method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination