CN113902915B - Semantic segmentation method and system based on low-light complex road scene - Google Patents

Semantic segmentation method and system based on low-light complex road scene

Info

Publication number
CN113902915B
CN113902915B
Authority
CN
China
Prior art keywords
low
module
semantic segmentation
network
segmentation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111190065.3A
Other languages
Chinese (zh)
Other versions
CN113902915A
Inventor
王海
陈妍妍
蔡英凤
陈龙
李祎承
刘擎超
孙晓强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangsu University
Original Assignee
Jiangsu University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangsu University filed Critical Jiangsu University
Priority to CN202111190065.3A priority Critical patent/CN113902915B/en
Publication of CN113902915A publication Critical patent/CN113902915A/en
Application granted granted Critical
Publication of CN113902915B publication Critical patent/CN113902915B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F30/00 Computer-aided design [CAD]
    • G06F30/20 Design optimisation, verification or simulation
    • G06F30/27 Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Medical Informatics (AREA)
  • Computer Hardware Design (AREA)
  • Geometry (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a semantic segmentation method and system based on low-light complex road scenes. Synthetic data acquisition and style migration of well-lit data are carried out through an automatic driving simulation platform and a generative adversarial network, respectively, so that different low-light datasets are constructed. The invention further provides a new semantic segmentation algorithm, built on the SFNet network, to improve segmentation performance in low-light scenes: an improved ResNet50+ structure is used as the backbone network for feature extraction, a channel-spatial attention mechanism is introduced into each ResNet block to enhance the characterization capability of pixels, and, in view of the differences caused by different resolutions during up-sampling, a feature alignment module (FAM) is introduced; this module learns the pixel offsets from high-level low-resolution features to shallow high-resolution features so as to realize an accurate shift of pixels, thereby avoiding the loss of detail to the greatest extent. Finally, a multi-scale attention module is introduced to further improve segmentation performance. The invention pre-trains the semantic segmentation network with an offline method so as to improve the safety of the system.

Description

Semantic segmentation method and system based on low-light complex road scene
Technical Field
The invention relates to the technical field of intelligent automobile automatic driving, in particular to a semantic segmentation method and system based on a low-light complex road scene.
Background
Thanks to the rapid development of deep neural networks, semantic segmentation has made great progress in fields such as autonomous landing of unmanned aerial vehicles, medical imaging and automatic driving, and the perception capability of unmanned vehicles in particular has grown rapidly. For intelligent-vehicle environment perception, semantic segmentation unifies different detection tasks in an efficient manner, thereby avoiding the complexity of multi-sensor fusion. Image segmentation is essentially a fine-grained pixel-by-pixel prediction task that classifies each pixel in a picture, for example mapping the background to 0 and the foreground to the other N-1 categories.
Existing CNN algorithms (such as the DeepLab series or HRNet-OCR) can perform the segmentation task with high accuracy while also meeting certain real-time requirements, but most of them operate on pictures taken by a visible-light camera under good illumination and weather conditions. Under adverse conditions such as low illumination, rain and fog, segmentation performance drops markedly because the overall contrast of the image decreases and object semantic boundaries become blurred. Real-world scenes, however, can hardly avoid such severe working conditions, so widening the range of scenes in which segmentation can be applied is a problem that must be solved before unmanned vehicles can be put into practical use. This invention focuses on solving the semantic segmentation problem in low-light scenes.
Because pictures taken by a visible-light camera in a low-illumination scene suffer from under-exposure, noise, motion blur and the like, the features extracted from them by convolution differ considerably in structure and texture from those extracted from pictures acquired under good illumination. Models trained on authoritative daytime datasets (e.g., Cityscapes) are therefore not directly applicable to low-light scenes. Deep learning is essentially a data-driven approach, and the current standard strategy for achieving high-performance semantic segmentation is to train the neural network on a large number of labeled real low-light scene pictures. Collecting and annotating such a low-light dataset, however, incurs extremely high labor costs. Using synthetic data appears to be a solution, so we use an automatic driving simulation platform to simulate low-light scenes in different cities and under different weather conditions and use the simulator's on-board visible-light camera to collect the corresponding synthetic data. In practice, synthetic data and real scene data still differ somewhat in structure, color and other characteristics. In view of this, we also use a generative adversarial network to style-convert an existing public dataset collected under good light, rendering a low-light style while fully preserving the features of the real scenes.
Under the combined interference of insufficient illumination and artificial light sources, the features of different classes in a low-illumination dataset are not clearly distinguishable, so the designed neural network needs a stronger feature-extraction capability. One feasible approach is to make full use of a pixel's context information to improve its characterization capability. Therefore, a relational context is introduced into the backbone network used for feature extraction: the relations among pixels are fully considered in both the spatial and the channel dimensions of the features, and corresponding attention-enhanced dense feature maps are learned. In general, stacking multiple convolutions yields powerful features but is accompanied by a reduction in resolution, which causes a loss of image detail that is unacceptable for night-time images with few features. Semantic segmentation algorithms designed for good illumination recover resolution by up-sampling in the decoder, but up-sampling cannot compensate for the lost detail. One feasible approach is to introduce the shallow high-resolution feature maps from the encoder into the high-level decoder and to learn the offsets of the up-sampled pixels efficiently through a feature alignment module. This feature alignment module is closely related to the flow alignment module in SFNet. General segmentation algorithms also use a multi-scale approach to improve segmentation performance at inference time. Similarly, we use multi-scale inference, but our multi-scale method learns the weights between adjacent scales, so additional scales can be added flexibly at inference time.
Disclosure of Invention
To solve the above technical problems, the invention provides a semantic segmentation method and system based on low-light complex road scenes. For the problem of missing low-light data, synthetic data acquisition and style migration of well-lit data are carried out through an automatic driving simulation platform and a generative adversarial network, respectively, so that different low-light datasets are constructed. In addition, for the difficult problem of blurred boundaries caused by the low semantic contrast of low-illumination images, an improved algorithm based on SFNet is designed: a dual-attention mechanism is introduced to enhance the characterization capability of pixels, and a feature alignment module is introduced in the up-sampling stage of the decoder to avoid the loss of pixel detail. A multi-scale attention mechanism is then introduced at the end of the network to efficiently learn the relative weights between adjacent scales and further improve segmentation performance. Finally, the segmentation network is trained end-to-end on the low-illumination scenes to obtain the training weights; a real-time road scene picture is then obtained from the vehicle-mounted camera, taken as the input of the neural network, and the segmentation result is obtained.
The technical scheme of the semantic segmentation system based on low-illumination complex road scenes disclosed by the invention is as follows: the system comprises a low-light dataset construction module, a semantic segmentation improved-algorithm SFNet-N module, an offline end-to-end training module and a vehicle-mounted camera real-time segmentation module.
The low-light dataset construction module is used for acquiring complex low-light road scene pictures and provides two construction methods, namely synthesizing virtual data based on a simulation platform and style conversion of real scene data, so that corresponding datasets are constructed from different data perspectives. In view of the time and enormous labeling cost required to construct an accurate and effective low-light dataset from real scenes, and the limitations of the laboratory's hardware resources, the low-light dataset required for the experiments is acquired with the simulation platform CARLA. Meanwhile, to guarantee the diversity of the data, the CycleGAN algorithm is also used to convert the existing authoritative daytime dataset Cityscapes into a low-illumination style.
The semantic segmentation improved-algorithm SFNet-N module is used for obtaining the final label map, i.e. assigning to each pixel the class label of the class to which it belongs, yielding a pixel-level segmentation result. The module adopts an improved ResNet50+ as the backbone feature-extraction network; a feature alignment module (FAM) is added during up-sampling in the decoder, so that the motion direction of each pixel is learned and the high resolution of the image is restored layer by layer while details are retained as far as possible, avoiding the loss of pixel detail; finally, the semantic segmentation result is further improved by a multi-scale attention module.
The offline end-to-end training module is used for training the built semantic segmentation network on the pixel-level labeled pictures so that the loss function is minimized and the optimal segmentation weights are obtained.
The vehicle-mounted camera real-time segmentation module obtains real-time road scene pictures through the vehicle-mounted camera and feeds them into the trained neural network to obtain the real-time low-light road scene segmentation result.
The invention discloses a picture semantic segmentation method based on low-illumination complex road scenes, which sequentially comprises the following steps:
Step 1) Synthetic data acquisition and style migration of well-lit data are carried out through an automatic driving simulation platform and a generative adversarial network, respectively, so that two different low-illumination datasets are constructed.
Step 2) A semantic segmentation neural network structure is constructed, taking a low-illumination dataset image as the input and outputting a pixel-level label image, i.e. predicting the category of each pixel given the pixel-category label map.
Step 3) The neural network algorithm structure is built using the PyTorch deep learning framework.
Step 4) The built semantic segmentation deep-learning network framework is trained end-to-end with the obtained low-light dataset, and the weights that minimize the loss function are obtained.
The training in this step uses a mini-batch gradient descent back-propagation method on multiple GPUs, as sketched below.
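The following is a minimal sketch of such a multi-GPU mini-batch training loop using PyTorch DistributedDataParallel; the model class, dataset class and hyper-parameters (SFNetN, LowLightDataset, learning rate, batch size, ignore index) are illustrative placeholders rather than the exact configuration of the invention.

```python
# Hedged sketch: multi-GPU mini-batch SGD training with PyTorch DDP.
# SFNetN and LowLightDataset are placeholder names, not the patent's actual classes.
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler

def train(rank, world_size, model_fn, dataset, epochs=100):
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    model = DDP(model_fn().cuda(rank), device_ids=[rank])
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
    loader = DataLoader(dataset, batch_size=4, sampler=sampler, num_workers=4)

    criterion = torch.nn.CrossEntropyLoss(ignore_index=255)   # unlabeled pixels (assumed)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                                momentum=0.9, weight_decay=5e-4)

    for epoch in range(epochs):
        sampler.set_epoch(epoch)              # reshuffle shards across GPUs each epoch
        for images, labels in loader:         # one mini-batch per step
            images, labels = images.cuda(rank), labels.cuda(rank)
            logits = model(images)            # B x num_classes x H x W
            loss = criterion(logits, labels)
            optimizer.zero_grad()
            loss.backward()                   # back propagation
            optimizer.step()
    dist.destroy_process_group()
```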
Step 5) A real-time image of the low-light road scene is obtained with a vehicle-mounted camera, which may be a network camera, a USB camera or a GoPro.
Step 6) The low-light road scene images acquired in real time are classified with the pre-trained weights, and the different categories are located to form a segmentation result map.
For the problem of missing low-light scene data in Step 1), different low-light datasets are constructed by carrying out synthetic data acquisition and style migration of well-lit data through the automatic driving simulation platform CARLA and a generative adversarial network, respectively.
By constructing the datasets and an improved semantic segmentation deep-learning framework, the invention provides a picture semantic segmentation method based on low-illumination complex road scenes and widens the range of scenes to which semantic segmentation can be applied.
The invention has the beneficial effects that:
1. For the problem of missing low-light scene data, the invention constructs two low-light automatic-driving road scene datasets, SynthesisCARLA and CycleCityscapes, by carrying out synthetic data acquisition and style migration of well-lit data through an automatic driving simulation platform and a generative adversarial network, respectively.
2. The invention provides SFNet-N, an improved low-light image semantic segmentation model based on SFNet: a dual-attention mechanism is introduced to enhance the characterization capability of pixels, a feature alignment module is introduced to solve the loss of detail in low-light image pixels, and a multi-scale attention module is designed to further improve segmentation performance.
3. The invention trains the semantic segmentation network with an offline method, so that the safety of the system is improved.
Drawings
FIG. 1 is an example diagram of samples from the two constructed low-light datasets.
FIG. 2 is a diagram of an overall framework of a semantic segmentation network based on low-light scenes.
Fig. 3 is a diagram of an improved encoder network architecture.
FIG. 4 is a specific network diagram of the Residual block+.
FIG. 5 is a diagram of the feature alignment module.
FIG. 6 is a flow chart of semantic segmentation in low light scenes.
Detailed Description
The invention is further described below with reference to the accompanying drawings.
As shown in FIG. 1, (a) is an example of different road scenes and different weather conditions acquired with the simulation platform CARLA 0.9.9 software, and (b) is an example after style conversion of Cityscapes using the CycleGAN algorithm.
As shown in FIG. 2, the overall framework of the semantic segmentation neural network is as follows. The encoder part consists of an improved backbone network (ResNet50+) and a pyramid pooling module (PPM): the picture resolution is reduced layer by layer to obtain higher-level feature maps, and global context information is obtained by enlarging the receptive field through the PPM. The decoder part consists of 4 decoders (Dec) with feature alignment modules (FAM); given a high-level feature map and a low-level feature map, the network learns the motion direction of each pixel through the feature alignment module and restores the high resolution of the image layer by layer while retaining as much detail as possible. The encoder and decoder parts are collectively referred to as the Trunk, i.e. the grey area of the figure. The original picture is fed into the Trunk as the network input, and the initial segmentation result is obtained through the segmentation head, which consists of several convolution layers whose last layer has as many channels as there are categories. To avoid the common problems of category confusion and detail loss in the segmentation results, another scale of the picture is introduced as an additional input during training; an attention mechanism allows the network to learn relative attention weights between adjacent scales, and the segmentation results of several scales are then optimally fused. Notably, since the module acts hierarchically, only one extra scale needs to be trained, chosen here as r=0.5, and the training process can be expressed mathematically as:
L''(r=1) = Up(L(r=0.5)) × A(r=0.5) + (1 − A(r=0.5)) × L(r=1)    (1)
Since the training learns the relative weights between adjacent scales, inference can flexibly add further scales and is not limited to the extra training scale; mathematically this can be expressed as follows:
where r is a scale factor, r=0.5 denotes reduction by a factor of 2 and r=2 denotes magnification by a factor of 2; Up(·) denotes up-sampling and Do(·) denotes down-sampling; Attn(α) and Attn(β) are learned attention maps; A(·) denotes the attention map at a given scale, i.e. one dimension of Attn(·); Norm(Z) denotes the relative weight of Z with respect to Attn(β); L''(·) and L'''(·) denote the logit values at two and three scales, respectively, before the Softmax function; X(·) denotes the feature map before the segmentation head at a given scale; F3×3(·) and F1×1(·) denote 3×3 and 1×1 convolutions, respectively.
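As an illustration of equation (1), the following is a minimal sketch of the two-scale attention fusion, assuming a trunk that returns per-scale segmentation logits together with an attention map; the function and variable names are placeholders, not the patent's actual interfaces.

```python
# Hedged sketch of the two-scale attention fusion of equation (1).
# `seg_and_attn` is a placeholder for the shared trunk + segmentation/attention heads.
import torch.nn.functional as F

def fuse_two_scales(seg_and_attn, image):
    """L''(r=1) = Up(L(0.5)) * A(0.5) + (1 - A(0.5)) * L(1)."""
    # full-resolution prediction
    logits_1, _ = seg_and_attn(image)                        # L(r=1)
    # half-resolution prediction and its attention map
    image_05 = F.interpolate(image, scale_factor=0.5,
                             mode="bilinear", align_corners=False)
    logits_05, attn_05 = seg_and_attn(image_05)               # L(r=0.5), A(r=0.5)
    # upsample the low-scale outputs back to full resolution
    logits_05 = F.interpolate(logits_05, size=logits_1.shape[-2:],
                              mode="bilinear", align_corners=False)
    attn_05 = F.interpolate(attn_05, size=logits_1.shape[-2:],
                            mode="bilinear", align_corners=False)
    # weighted combination of the adjacent scales
    return logits_05 * attn_05 + (1.0 - attn_05) * logits_1
```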
As shown in FIG. 3, the encoder section consists mainly of a backbone network, which extracts high-level features of the image through a series of convolution operations, and a PPM, which further optimizes these high-level features. In detail, a shallow feature-extraction part (Stem) first reduces the resolution of the picture to 1/4 of the original image; the resolution is then reduced to 1/32 of the original image while high-level semantic features are extracted through 4 stages; finally, to obtain more abstract semantic features, the PPM is used to fuse context information.
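For illustration, a compact sketch of a PSPNet-style pyramid pooling module of the kind described here is given below; the bin sizes (1, 2, 3, 6) and channel widths are assumptions, not values stated by the invention.

```python
# Hedged sketch of a pyramid pooling module (PPM); bins and widths are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PPM(nn.Module):
    def __init__(self, in_ch, branch_ch=128, bins=(1, 2, 3, 6)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(nn.AdaptiveAvgPool2d(b),
                          nn.Conv2d(in_ch, branch_ch, 1, bias=False),
                          nn.BatchNorm2d(branch_ch),
                          nn.ReLU(inplace=True))
            for b in bins])
        self.project = nn.Sequential(
            nn.Conv2d(in_ch + branch_ch * len(bins), in_ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(in_ch), nn.ReLU(inplace=True))

    def forward(self, x):
        h, w = x.shape[-2:]
        # pool at several bin sizes, project, and upsample back to the input size
        feats = [x] + [F.interpolate(b(x), size=(h, w), mode="bilinear",
                                     align_corners=False)
                       for b in self.branches]
        return self.project(torch.cat(feats, dim=1))  # fused global context
```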
To obtain a better balance between parameter count and feature-extraction performance, we choose ResNet50 as the backbone network. The ResNet50 network has 50 layers in total, comprising 49 convolutional layers and a final fully connected layer. The fully connected layer is removed and only the preceding convolutional layers are used. The convolutional part of the network consists of a Stem and 4 stages, which contain 3, 4, 6 and 3 residual blocks respectively, each block adopting a residual structure. To reduce the number of parameters and increase computational efficiency, we use 3 small 3×3 convolution kernels instead of 1 large 7×7 convolution kernel. Stacking multiple small convolution kernels keeps the receptive field while capturing more context information, and more convolutions mean more activation functions, more nonlinearity and stronger discrimination. To further enhance the feature-extraction capability of the backbone for complex urban traffic scenes, we also use a dual attention module, similar to CBAM, in each stage to refine the block; the refined block is referred to as Residual Block+, and the refined overall backbone network is referred to as ResNet50+.
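A minimal sketch of such a stem, replacing the single 7×7 convolution with three stacked 3×3 convolutions, is shown below; the stride placement and channel widths are assumptions following common ResNet variants.

```python
# Hedged sketch of the modified stem: three stacked 3x3 convolutions instead of
# the single 7x7 convolution of the original ResNet-50 (strides/widths assumed).
import torch.nn as nn

def conv_bn_relu(in_ch, out_ch, stride=1):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=stride, padding=1, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True))

stem = nn.Sequential(
    conv_bn_relu(3, 64, stride=2),                       # 1/2 resolution
    conv_bn_relu(64, 64),
    conv_bn_relu(64, 64),
    nn.MaxPool2d(kernel_size=3, stride=2, padding=1))    # 1/4 resolution, fed to stage 1
```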
FIG. 4 shows the specific network structure of Residual Block+, where (a) is the original residual block, (b) is the modified Residual Block+, and (c) is the channel-spatial attention.
As shown in (a), the core idea of the residual block is to realize an identity mapping of features from input to output through a skip connection. X denotes the input features and "weight layer" denotes a weight layer, consisting mainly of a convolution layer and a batch-normalization layer. To reduce computation and increase nonlinear transformation, the residual block adopts a bottleneck design: the first weight layer uses a 1×1 convolution kernel to reduce the channel dimension, the second uses a 3×3 convolution kernel to extract features, and the third weight layer restores the dimension. ReLU denotes the activation function used, and F(X) denotes the residual function the network needs to learn. Assuming H(X) is the final mapping from the input X to the summation, the residual function to be learned becomes F(X) = H(X) − X. Because the residual function focuses on learning small changes, it is easier to optimize than directly learning the identity mapping and alleviates the vanishing-gradient problem caused by increasing network depth. The modified Residual Block+ in (b) sequentially introduces channel attention and spatial attention after the third weight layer.
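A minimal sketch of such a Residual Block+ bottleneck is given below, with the channel-spatial attention (sketched after the following paragraphs) applied to the residual branch before the shortcut addition; channel widths and strides are assumptions, and the down-sampling variant is omitted.

```python
# Hedged sketch of the Residual Block+ bottleneck: 1x1 reduce -> 3x3 -> 1x1 expand,
# with channel-spatial attention applied to F(X) before the identity shortcut.
# mid_ch is typically in_ch // 4 in ResNet-50 bottlenecks (assumption).
import torch.nn as nn

class ResidualBlockPlus(nn.Module):
    def __init__(self, in_ch, mid_ch, attention):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, 1, bias=False), nn.BatchNorm2d(mid_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, mid_ch, 3, padding=1, bias=False), nn.BatchNorm2d(mid_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, in_ch, 1, bias=False), nn.BatchNorm2d(in_ch))
        self.attention = attention           # e.g. the channel-spatial module below
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        f = self.attention(self.body(x))     # F(X), refined by attention
        return self.relu(x + f)              # H(X) = F(X) + X (identity shortcut)
```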
The channel-spatial attention structure is specifically described as follows: an intermediate feature map X ∈ R^(C×H×W) is taken as input, and a channel attention map A_C ∈ R^(C×1×1) is learned by the channel attention module and multiplied element-wise with the original feature map to obtain the feature Z = A_C ⊗ X. The channel attention map A_C can be expressed as follows:
A_C = σ(F(f_avg(X)) + F(f_max(X)))    (6)
The feature Z optimized by the channel attention mechanism is then taken as input to the spatial attention module, which learns a spatial attention map A_S ∈ R^(1×H×W); multiplying A_S with the optimized feature Z gives the final output Y = A_S ⊗ Z. The spatial attention map A_S can be expressed as follows:
A_S = σ(F''([f'_avg(Z); f''_max(Z)]))    (7)
where f_avg(·) denotes spatial average pooling of the input X and f_max(·) denotes spatial max pooling of X, both yielding C×1×1 descriptors; F(·) denotes the shared network applied after pooling, consisting of two-dimensional convolutions with kernel size 1×1; f'_avg(·) denotes channel-wise average pooling of Z and f''_max(·) denotes channel-wise max pooling of Z, both yielding 1×H×W maps; F''(·) consists of 1 two-dimensional convolution with kernel size 7×7; σ(·) denotes the sigmoid activation function; ⊗ denotes element-by-element multiplication; [·;·] denotes the concatenation of features along the channel dimension.
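The following is a hedged sketch of the channel-spatial attention described by equations (6) and (7), in the CBAM style; the reduction ratio of the shared 1×1-convolution bottleneck is an assumption not stated here.

```python
# Hedged sketch of the channel-spatial attention (CBAM-like); reduction=16 assumed.
import torch
import torch.nn as nn

class ChannelSpatialAttention(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        # F(.): shared 1x1-conv bottleneck applied to the pooled descriptors, eq. (6)
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1, bias=False))
        # F''(.): single 7x7 convolution over the concatenated channel-wise poolings
        self.spatial = nn.Conv2d(2, 1, kernel_size=7, padding=3, bias=False)

    def forward(self, x):
        # channel attention A_C = sigmoid(F(avgpool(X)) + F(maxpool(X)))
        avg = self.mlp(torch.mean(x, dim=(2, 3), keepdim=True))
        mx = self.mlp(torch.amax(x, dim=(2, 3), keepdim=True))
        z = x * torch.sigmoid(avg + mx)                 # channel-refined feature Z
        # spatial attention A_S = sigmoid(F''([avgpool_c(Z); maxpool_c(Z)]))
        avg_s = torch.mean(z, dim=1, keepdim=True)
        max_s = torch.amax(z, dim=1, keepdim=True)
        a_s = torch.sigmoid(self.spatial(torch.cat([avg_s, max_s], dim=1)))
        return z * a_s                                  # final output Y
```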
As shown in FIG. 5, the feature alignment module learns the pixel transformation offsets between feature maps of different resolutions from adjacent levels, aligns the up-sampled high-level features that carry context information through a coordinate transformation between pixels, and fuses them with the shallow features to obtain a feature map rich in both semantic and spatial information.
Given feature maps of adjacent levels with different resolutions, X_(l-1) ∈ R^(B×C×H×W) and X_l ∈ R^(B×C'×H'×W'), a 1×1 convolution is applied to X_(l-1) and an up-sampling operation is applied to X_l. The resulting feature maps, which now have the same number of channels and the same resolution, are concatenated, and the offset field Off_(l-1) ∈ R^(B×2×H×W) is learned through a 3×3 convolution, where B is the batch size, C and C' respectively denote the channel numbers of the two feature maps, and H, W are the height and width of the feature map. This can be expressed mathematically as
Off_(l-1) = F3×3([F1×1(X_(l-1)); Up(X_l)])
where [·;·] denotes the concatenation of features along the channel dimension, F3×3(·) denotes a 3×3 convolution, F1×1(·) denotes a 1×1 convolution, and Up(·) denotes up-sampling.
Each position P_(l-1) on the spatial grid of X_(l-1) is mapped to a position P_l on the spatial grid of X_l according to the offset Off_(l-1), which can be expressed mathematically as
P_l = P_(l-1) + Off_(l-1)(P_(l-1))
Finally, the value at P_(l-1) is obtained by bilinear interpolation of X_l around position P_l and added to the original X_(l-1) to obtain the final output X̃_(l-1). Mathematically, this can be expressed as:
X̃_(l-1)(P_(l-1)) = X_(l-1)(P_(l-1)) + Σ_(P∈N(P_l)) w_P · X_l(P)
where N(P_l) denotes the four neighboring positions of P_l and w_P is the bilinear weight estimated from the distance between point P and the sampled position.
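The following is a hedged sketch of such a feature alignment module: a 2-channel offset field is learned from the concatenated features and the up-sampled high-level feature is warped with bilinear sampling via grid_sample; the channel width and the offset normalization details are assumptions.

```python
# Hedged sketch of a feature alignment module (FAM): learn Off_{l-1} with a 3x3 conv,
# then warp the up-sampled high-level feature by the offsets and fuse with X_{l-1}.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureAlign(nn.Module):
    def __init__(self, low_ch, high_ch, out_ch=128):
        super().__init__()
        self.reduce_low = nn.Conv2d(low_ch, out_ch, 1, bias=False)    # F_1x1 on X_{l-1}
        self.reduce_high = nn.Conv2d(high_ch, out_ch, 1, bias=False)
        self.offset = nn.Conv2d(out_ch * 2, 2, 3, padding=1, bias=False)  # -> Off_{l-1}

    def forward(self, x_low, x_high):
        # x_low: shallow, high-resolution X_{l-1}; x_high: deep, low-resolution X_l
        b, _, h, w = x_low.shape
        low = self.reduce_low(x_low)
        high_up = F.interpolate(self.reduce_high(x_high), size=(h, w),
                                mode="bilinear", align_corners=False)  # Up(X_l)
        off = self.offset(torch.cat([low, high_up], dim=1))            # B x 2 x H x W

        # build a normalized sampling grid and shift it by the learned pixel offsets
        ys, xs = torch.meshgrid(torch.linspace(-1, 1, h, device=x_low.device),
                                torch.linspace(-1, 1, w, device=x_low.device),
                                indexing="ij")
        grid = torch.stack((xs, ys), dim=-1).unsqueeze(0).expand(b, -1, -1, -1)
        grid = grid + off.permute(0, 2, 3, 1) / torch.tensor(
            [w / 2.0, h / 2.0], device=x_low.device)                   # pixels -> [-1, 1]

        warped = F.grid_sample(high_up, grid, mode="bilinear",         # bilinear sampling
                               align_corners=False)
        return low + warped                                            # fuse with X_{l-1}
```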
As shown in FIG. 6, the semantic segmentation flow in the low-light scene comprises low-light dataset construction, the improved SFNet-N segmentation network module, the offline end-to-end training module and the vehicle-mounted camera real-time segmentation module.
A picture semantic segmentation method based on a low-illumination complex road scene comprises the following steps:
Step 1) Low-light dataset construction: we collected the low-light dataset required for this experiment with the simulation platform CARLA and named it SynthesisCARLA. The dataset is synthesized by rendering, and the pixel-level labels can be completed by a partially automated method, so the cost of data collection is low. Meanwhile, to guarantee the diversity of the data, the data are collected in 7 different town and rural scenes in the CARLA 0.9.9 software, including different types of town streets, expressways, tunnels and narrow rural roads. To simulate real scenes more realistically, four kinds of weather, namely sunny, rainy, foggy and cloudy, are also considered when acquiring the data. We also use the CycleGAN algorithm to low-light process the existing authoritative daytime dataset Cityscapes; the new dataset is named CycleCityscapes.
SynthesisCARLA: the dataset is simply divided into a training set and a validation set. The training set contains 3338 images and the validation set contains 371 pictures, each with a resolution of 1024x2048 pixels. High-quality pixel-level labeling is performed on the 3k pictures following the Cityscapes dataset format, and the low-light scenes are roughly divided into 8 major categories and 13 classes, namely pedestrian, traffic sign, other, lane line, pole, vehicle, fence, wall, sidewalk, building, vegetation, road and unlabeled background.
CycleCityscapes: the dataset uses the CycleGAN algorithm to style-convert the Cityscapes pictures acquired under good daytime illumination, so that their accurate semantic segmentation labels can be reused. The Cityscapes dataset is a large urban scene dataset containing high-quality pixel-level label maps for 5k images and coarse label maps for 20k images. The finely annotated 5k images are further divided into a training set, a validation set and a test set; the training set contains 2975 pictures, the validation set 500 pictures and the test set 1525 pictures, all at a resolution of 1024x2048 pixels. A total of 30 categories are included, 19 of which are used in the training and validation process.
Step 2) The acquired low-light pictures are used to train the SFNet-N segmentation algorithm. The method consists essentially of the improved ResNet50+, the FAM and the multi-scale module (Multi-Scale). We use the classical ResNet50 as the feature-extraction module in the encoder and then introduce a spatial-dimension and channel-dimension attention mechanism in each residual block to fuse context information to the greatest extent and thus enhance the characterization capability of pixels. Meanwhile, to avoid as far as possible the loss of detail caused by coarse up-sampling in the decoder, a feature alignment module is introduced in the decoder to efficiently learn pixel offsets and realize high-resolution features. Finally, to improve segmentation performance to the greatest extent, a multi-scale mechanism is introduced during training, and the best performance at inference time is ensured by learning the relative weights between adjacent scales. The improved SFNet-N mainly comprises a data import module, a data pre-processing module, a forward propagation module, an activation function, a loss function, a back-propagation module and an optimization module.
Wherein, the loss function adopts a softmax+cross entropy loss function:
(1) Softmax function: the output vector of length K predicted by the neural network is mapped into another vector and normalized into probability values, so that the value of each element lies in the interval (0, 1) and all elements sum to 1. The output of the j-th neuron, i.e. the probability that the sample belongs to the j-th class, can be expressed as:
S_j = e^(Z_j) / Σ_(i=1..K) e^(Z_i)    (13)
where K is the number of classes and Z_i is the predicted value of the i-th neuron node before the Softmax function.
(2) Cross-entropy loss function: L = −Σ p(x) log q(x)    (14), where p(x) is the ground-truth value and q(x) is the predicted value, i.e. the probability obtained by the Softmax function described above.
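A minimal sketch of this Softmax + cross-entropy loss for pixel-wise segmentation logits, written with PyTorch primitives, is given below; the ignore index for unlabeled pixels is an assumption.

```python
# Hedged sketch of the Softmax + cross-entropy loss of equations (13)-(14),
# written out explicitly for pixel-wise segmentation logits.
import torch.nn.functional as F

def segmentation_loss(logits, target, ignore_index=255):
    """logits: B x K x H x W raw scores; target: B x H x W class indices."""
    log_q = F.log_softmax(logits, dim=1)      # eq. (13): Softmax, in log space
    # eq. (14): L = -sum p(x) log q(x), with p a one-hot ground-truth distribution
    return F.nll_loss(log_q, target, ignore_index=ignore_index)

# Equivalent shortcut commonly used in practice:
# loss = F.cross_entropy(logits, target, ignore_index=255)
```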
The constructed dataset is fed into the built neural network model for end-to-end training, and the trained model is deployed on the intelligent vehicle through ROS software.
Step 3) A real-time low-light road scene picture is acquired by the vehicle-mounted camera and input into the trained segmentation model integrated on the intelligent vehicle to obtain the segmentation result of the real-time scene. To prevent the vehicle-mounted camera from being affected by the environment and the quality of the collected pictures from degrading, the camera is mounted at the windshield inside the vehicle.
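For illustration, a minimal sketch of real-time inference from the on-board camera with OpenCV is given below; the camera index, input size, normalization constants and grey-level visualization are placeholders rather than the invention's actual deployment code.

```python
# Hedged sketch of real-time inference from the on-board camera with OpenCV.
import cv2
import numpy as np
import torch

def run_realtime(model, device="cuda", cam_id=0, size=(2048, 1024)):
    model.eval().to(device)
    cap = cv2.VideoCapture(cam_id)             # network / USB camera or GoPro stream
    mean = np.array([0.485, 0.456, 0.406])     # ImageNet statistics (assumed)
    std = np.array([0.229, 0.224, 0.225])
    with torch.no_grad():
        while cap.isOpened():
            ok, frame = cap.read()
            if not ok:
                break
            img = cv2.cvtColor(cv2.resize(frame, size), cv2.COLOR_BGR2RGB) / 255.0
            img = (img - mean) / std
            x = torch.from_numpy(img.transpose(2, 0, 1)).float().unsqueeze(0).to(device)
            pred = model(x).argmax(dim=1)[0].cpu().numpy()      # H x W class indices
            vis = (pred.astype(np.int32) * 12).clip(0, 255).astype(np.uint8)
            cv2.imshow("segmentation", vis)     # crude grey-level visualization
            if cv2.waitKey(1) & 0xFF == ord("q"):
                break
    cap.release()
    cv2.destroyAllWindows()
```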
The above detailed description only sets forth specific practical embodiments of the present invention and is not intended to limit its scope; all equivalent implementations or modifications that do not depart from the technical spirit of the present invention shall be included within the scope of the present invention.

Claims (7)

1. The semantic segmentation system based on the low-light complex road scene is characterized by comprising a low-light data set construction module, a semantic segmentation network module, an offline end-to-end training module and a vehicle-mounted camera real-time segmentation module;
The low-illumination dataset construction module is used for acquiring complex low-illumination road scene pictures, and the low-illumination dataset constructed by the module comprises virtual data synthesized based on a simulation platform and data obtained by style conversion of real scene data, wherein the virtual data synthesized based on the simulation platform are low-illumination data acquired with the simulation platform CARLA, and the data obtained by style conversion of real scene data are obtained by style-converting the existing daytime dataset Cityscapes with the CycleGAN algorithm;
The semantic segmentation network module is used for acquiring the final label map, i.e. assigning to each pixel the class label of the class to which it belongs, so as to obtain a pixel-level segmentation result; the module adopts an improved ResNet50+ as the backbone feature-extraction network, a feature alignment module is added during up-sampling in the decoder so that the motion direction of each pixel is learned and the high resolution of the image is restored layer by layer while details are maintained, avoiding the loss of pixel detail, and finally the semantic segmentation result is further improved by a multi-scale attention module;
The offline end-to-end training module is used for training the built semantic segmentation network according to the pixel-level labeling picture so as to minimize a loss function and obtain the optimal segmentation weight;
The vehicle-mounted camera real-time segmentation module acquires a real-time road scene picture through the vehicle-mounted camera and sends the real-time road scene picture into the trained semantic segmentation neural network to acquire a real-time low-light road scene segmentation result;
The semantic segmentation network module comprises: an encoder part, comprising an improved backbone network and a pyramid pooling module, which reduces the resolution of the picture layer by layer to obtain higher-level feature maps and obtains global context information by enlarging the receptive field through the PPM; and a decoder part, comprising 4 decoder modules with feature alignment modules, in which, given a high-level feature map and a low-level feature map, the network learns the motion direction of each pixel through the feature alignment module and restores the high resolution of the image layer by layer while retaining as much detail as possible;
After the original picture passes through an encoder and a decoder of the semantic segmentation network module, an initial segmentation result is obtained through a segmentation head part which consists of a plurality of convolution layers and has the number of the convolution channels of the last layer as the category number;
the attention module of the multi-scale semantic segmentation network introduces another scale picture as an original input part of the network module during training, allows the network to learn relative attention weights between adjacent scales through an attention mechanism, and then optimally fuses segmentation results of a plurality of scales;
for the hierarchical operation adopted by the module, only one extra scale needs to be trained during training; the extra scale is selected as r=0.5, and the training process is expressed as:
L''(r=1) = Up(L(r=0.5)) × A(r=0.5) + (1 − A(r=0.5)) × L(r=1)    (1)
Since training learns the relative weights between adjacent scales, the inference process of the model is expressed as:
where r is a scale factor, r=0.5 denotes reduction by a factor of 2 and r=2 denotes magnification by a factor of 2; Up(·) denotes up-sampling and Do(·) denotes down-sampling; Attn(α) and Attn(β) are learned attention maps; A(·) denotes the attention map at a given scale, i.e. one dimension of Attn(·); Norm(Z) denotes the relative weight of Z with respect to Attn(β); L''(·) and L'''(·) denote the logit values at two and three scales, respectively, before the Softmax function; X(·) denotes the feature map before the segmentation head at a given scale; F3×3(·) and F1×1(·) denote 3×3 and 1×1 convolutions, respectively.
2. The semantic segmentation system based on low-light complex road scene according to claim 1, wherein the encoder structure comprises: the shallow layer feature extraction part (Stem) reduces the resolution of the picture to 1/4 of the original image, then reduces the resolution to 1/32 of the original image while extracting the high layer semantic features through 4 stages, and finally fuses the context information by using PPM in order to obtain more abstract semantic features;
The network structure of the encoder selects ResNet50 as the backbone network and only uses the convolutional layers in front of the fully connected layer; the convolutional part consists of a Stem and 4 stages, which respectively contain 3, 4, 6 and 3 blocks, each block adopting a residual structure, and 3 small 3×3 convolution kernels are used to replace 1 large 7×7 convolution kernel; a dual attention module is used in each stage to improve the block and thereby optimize the backbone network, the improved block being called Residual Block+ and the improved overall backbone network being called ResNet50+.
3. The semantic segmentation system based on low-light complex road scene according to claim 2, wherein the Residual Block+ mainly introduces a channel attention and spatial attention module after the third weight layer of the original residual block; the weight layer comprises a convolution layer and a batch-normalization layer; to increase the nonlinear capability of the network, the residual block adopts a bottleneck module, i.e. a 1×1 convolution kernel is used in the first weight layer to reduce the channel dimension, features are then extracted by a 3×3 convolution kernel, and finally the corresponding dimension increase is carried out in the third weight layer;
The channel attention and spatial attention module is specifically designed as follows: an intermediate feature map X ∈ R^(C×H×W) is taken as input, a channel attention map A_C ∈ R^(C×1×1) is learned by the channel attention module and multiplied element-wise with the original feature map to obtain the feature Z = A_C ⊗ X; the channel attention map A_C is expressed as follows:
A_C = σ(F(f_avg(X)) + F(f_max(X)))    (6)
the feature Z optimized by the channel attention mechanism is taken as input to the spatial attention module, which learns a spatial attention map A_S ∈ R^(1×H×W); multiplying A_S with the optimized feature Z gives the final output Y = A_S ⊗ Z; the spatial attention map A_S is expressed as follows:
A_S = σ(F''([f'_avg(Z); f''_max(Z)]))    (7)
where f_avg(·) denotes spatial average pooling of the input X and f_max(·) denotes spatial max pooling of X, both yielding C×1×1 descriptors; F(·) denotes the shared network applied after pooling, consisting of two-dimensional convolutions with kernel size 1×1; f'_avg(·) denotes channel-wise average pooling of Z and f''_max(·) denotes channel-wise max pooling of Z, both yielding 1×H×W maps; F''(·) consists of 1 two-dimensional convolution with kernel size 7×7; σ(·) denotes the sigmoid activation function; ⊗ denotes element-by-element multiplication; [·;·] denotes the concatenation of features along the channel dimension.
4. The semantic segmentation system based on low-light complex road scene according to claim 1, wherein the feature alignment module learns the pixel transformation offsets between feature maps of different resolutions from adjacent levels, aligns the up-sampled high-level features carrying context information through a coordinate transformation among pixels, and fuses them with the shallow features to obtain feature maps rich in semantic and spatial information; it is specifically designed as follows:
feature maps of adjacent levels with different resolutions are given, X_(l-1) ∈ R^(B×C×H×W) and X_l ∈ R^(B×C'×H'×W'); a 1×1 convolution is applied to X_(l-1) and an up-sampling operation is applied to X_l; the feature maps, now with the same number of channels and the same resolution, are then concatenated, and the offset field Off_(l-1) ∈ R^(B×2×H×W) is learned through a 3×3 convolution, wherein B is the batch size, C and C' respectively denote the channel numbers of the feature maps of different sizes, and H, W respectively denote the height and width of the feature map; this is expressed as
Off_(l-1) = F3×3([F1×1(X_(l-1)); Up(X_l)])
wherein [·;·] denotes the concatenation of features along the channel dimension, F3×3(·) denotes a 3×3 convolution, F1×1(·) denotes a 1×1 convolution, and Up(·) denotes up-sampling;
each position P_(l-1) on the spatial grid of X_(l-1) is mapped to a position P_l on the spatial grid of X_l according to the offset Off_(l-1), expressed as:
P_l = P_(l-1) + Off_(l-1)(P_(l-1))
finally, the value at P_(l-1) is obtained by bilinear interpolation of X_l around position P_l and added to the original X_(l-1) to obtain the final output X̃_(l-1), expressed as:
X̃_(l-1)(P_(l-1)) = X_(l-1)(P_(l-1)) + Σ_(P∈N(P_l)) w_P · X_l(P)
where N(P_l) denotes the four neighboring positions of P_l and w_P is the bilinear weight estimated from the distance between point P and the sampled position.
5. The semantic segmentation method based on the low-illumination complex road scene is characterized by comprising the following steps of:
S1, carrying out synthetic data acquisition and style migration of well-lit data through an automatic driving simulation platform and a generative adversarial network, respectively, thereby constructing two different low-illumination datasets;
S2, constructing a semantic segmentation neural network structure with the PyTorch deep learning framework, taking a low-illumination dataset image as the input and outputting a pixel-level label image, i.e. predicting the category of each pixel given the pixel-category label map; performing end-to-end training on the built semantic segmentation deep-learning network framework with the obtained low-light dataset, and acquiring the weights that minimize the loss function;
the training in this step uses a mini-batch gradient descent back-propagation method on multiple GPUs;
In step S2, the semantic segmentation neural network model is realized with the improved SFNet-N algorithm and comprises the improved ResNet50+, the FAM and the multi-scale module; ResNet50 is taken as the feature-extraction module in the encoder, and an attention mechanism over the spatial and channel dimensions is introduced into each residual block to fuse context information to the greatest extent and enhance the characterization capability of pixels; meanwhile, to avoid as far as possible the loss of detail caused by coarse up-sampling in the decoder, a feature alignment module is introduced in the decoder to efficiently learn pixel offsets and realize high-resolution features; finally, to improve segmentation performance to the greatest extent, a multi-scale mechanism is introduced during training, and the best performance at inference time is ensured by learning the relative weights between adjacent scales, wherein the loss function adopts a Softmax + cross-entropy loss function:
(1) Softmax function: the output vector of length K predicted by the neural network is mapped into another vector and normalized into probability values, so that the value of each element lies in the interval (0, 1) and all elements sum to 1; the output of the j-th neuron, i.e. the probability that the sample belongs to the j-th class, is expressed as:
S_j = e^(Z_j) / Σ_(i=1..K) e^(Z_i)    (13)
where K is the number of classes and Z_i is the predicted value of the i-th neuron node before the Softmax function;
(2) Cross-entropy loss function: L = −Σ p(x) log q(x)    (14), where p(x) is the ground-truth value and q(x) is the predicted value, i.e. the probability obtained by the Softmax function described above;
the data set constructed in the step S1 is sent to the neural network model constructed in the step S2 for end-to-end training, and the trained model is transplanted into the intelligent vehicle through ROS software;
And S3, acquiring a real-time image of the low-light road scene by using an on-board camera, classifying the real-time acquired image of the low-light road scene by using a pre-trained semantic segmentation neural network model, and positioning different categories to form a segmentation result graph.
6. The semantic segmentation method based on the low-light complex road scene according to claim 5, wherein the specific implementation of S1 comprises:
The simulation platform CARLA is used to collect a low-illumination dataset named SynthesisCARLA, which is synthesized by rendering; the pixel-level labels of the dataset can be completed by a partially automated method, and meanwhile, to guarantee the diversity of the data, data are collected in 7 different town and rural scenes, including different types of town streets, expressways, tunnels and narrow rural roads; to simulate real scenes more realistically, four kinds of weather, namely sunny, rainy, foggy and cloudy, are taken into account when acquiring the data; the existing daytime dataset Cityscapes is low-light processed with the CycleGAN algorithm, and the new dataset is named CycleCityscapes;
SynthesisCARLA: the dataset is divided into a training set and a validation set, wherein the training set contains 3338 images and the validation set contains 371 pictures, all at a resolution of 1024x2048 pixels; high-quality pixel-level labeling is performed on the 3k pictures following the Cityscapes dataset format, and the low-light scenes are divided into 8 major categories and 13 classes, namely pedestrian, traffic sign, other, lane line, pole, vehicle, fence, wall, sidewalk, building, vegetation, road and unlabeled background;
CycleCityscapes: the dataset uses the CycleGAN algorithm to style-convert the Cityscapes pictures collected under good daytime illumination, so that their accurate semantic segmentation labels can be reused; the Cityscapes dataset is a large urban scene dataset comprising high-quality pixel-level label maps for 5k images and coarse label maps for 20k images, wherein the finely annotated 5k images are further divided into a training set, a validation set and a test set; the training set contains 2975 pictures, the validation set 500 pictures and the test set 1525 pictures, all at a resolution of 1024x2048 pixels.
7. The semantic segmentation method based on the low-light complex road scene according to claim 5, wherein in step S3 the vehicle-mounted camera is installed at the windshield inside the vehicle, and the vehicle-mounted camera can be a network camera, a USB camera or a GoPro.
CN202111190065.3A 2021-10-12 2021-10-12 Semantic segmentation method and system based on low-light complex road scene Active CN113902915B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111190065.3A CN113902915B (en) 2021-10-12 2021-10-12 Semantic segmentation method and system based on low-light complex road scene

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111190065.3A CN113902915B (en) 2021-10-12 2021-10-12 Semantic segmentation method and system based on low-light complex road scene

Publications (2)

Publication Number Publication Date
CN113902915A CN113902915A (en) 2022-01-07
CN113902915B true CN113902915B (en) 2024-06-11

Family

ID=79191617

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111190065.3A Active CN113902915B (en) 2021-10-12 2021-10-12 Semantic segmentation method and system based on low-light complex road scene

Country Status (1)

Country Link
CN (1) CN113902915B (en)

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114120069B (en) * 2022-01-27 2022-04-12 四川博创汇前沿科技有限公司 Lane line detection system, method and storage medium based on direction self-attention
CN114897718B (en) * 2022-04-29 2023-09-19 重庆理工大学 Low-light image enhancement method capable of balancing context information and space detail simultaneously
CN115170481A (en) * 2022-06-20 2022-10-11 中国地质大学(武汉) Natural resource image analysis method and system based on visual saliency
CN115294548B (en) * 2022-07-28 2023-05-02 烟台大学 Lane line detection method based on position selection and classification method in row direction
CN115063639B (en) * 2022-08-11 2022-12-09 小米汽车科技有限公司 Model generation method, image semantic segmentation device, vehicle and medium
CN115527134A (en) * 2022-10-27 2022-12-27 浙江九烁光电工程技术有限公司 Urban garden landscape lighting monitoring system and method based on big data
CN115983140B (en) * 2023-03-16 2023-06-09 河北工业大学 Electromagnetic field numerical prediction method based on big data deep learning
CN116245716A (en) * 2023-05-06 2023-06-09 中国第一汽车股份有限公司 Method and device for processing road condition image of vehicle
CN116311107B (en) * 2023-05-25 2023-08-04 深圳市三物互联技术有限公司 Cross-camera tracking method and system based on reasoning optimization and neural network
CN116758617B (en) * 2023-08-16 2023-11-10 四川信息职业技术学院 Campus student check-in method and campus check-in system under low-illuminance scene
CN117274762B (en) * 2023-11-20 2024-02-06 西南交通大学 Real-time track extraction method based on vision under subway tunnel low-illumination scene

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110188765A (en) * 2019-06-05 2019-08-30 京东方科技集团股份有限公司 Image, semantic parted pattern generation method, device, equipment and storage medium
CN110458844A (en) * 2019-07-22 2019-11-15 大连理工大学 A kind of semantic segmentation method of low illumination scene
CN111931905A (en) * 2020-07-13 2020-11-13 江苏大学 Graph convolution neural network model and vehicle track prediction method using same
CN111931902A (en) * 2020-07-03 2020-11-13 江苏大学 Countermeasure network generation model and vehicle track prediction method using the same
CN112287940A (en) * 2020-10-30 2021-01-29 西安工程大学 Semantic segmentation method of attention mechanism based on deep learning
AU2020103901A4 (en) * 2020-12-04 2021-02-11 Chongqing Normal University Image Semantic Segmentation Method Based on Deep Full Convolutional Network and Conditional Random Field
CN113313657A (en) * 2021-07-29 2021-08-27 北京航空航天大学杭州创新研究院 Unsupervised learning method and system for low-illumination image enhancement

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110188765A (en) * 2019-06-05 2019-08-30 京东方科技集团股份有限公司 Image, semantic parted pattern generation method, device, equipment and storage medium
CN110458844A (en) * 2019-07-22 2019-11-15 大连理工大学 A kind of semantic segmentation method of low illumination scene
CN111931902A (en) * 2020-07-03 2020-11-13 江苏大学 Countermeasure network generation model and vehicle track prediction method using the same
CN111931905A (en) * 2020-07-13 2020-11-13 江苏大学 Graph convolution neural network model and vehicle track prediction method using same
CN112287940A (en) * 2020-10-30 2021-01-29 西安工程大学 Semantic segmentation method of attention mechanism based on deep learning
AU2020103901A4 (en) * 2020-12-04 2021-02-11 Chongqing Normal University Image Semantic Segmentation Method Based on Deep Full Convolutional Network and Conditional Random Field
CN113313657A (en) * 2021-07-29 2021-08-27 北京航空航天大学杭州创新研究院 Unsupervised learning method and system for low-illumination image enhancement

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Research on a semantic segmentation model with low-illuminance edge enhancement; Luo Hui; Lu Chunyu; Journal of East China Jiaotong University; 2020-08-15 (No. 04); full text *
Semantic segmentation of complex scenes fusing ASPP-Attention and context; Yang Xin; Yu Chongchong; Wang Xin; Chen Xiuxin; Computer Simulation; 2020-09-15 (No. 09); full text *

Also Published As

Publication number Publication date
CN113902915A (en) 2022-01-07

Similar Documents

Publication Publication Date Title
CN113902915B (en) Semantic segmentation method and system based on low-light complex road scene
Wang et al. SFNet-N: An improved SFNet algorithm for semantic segmentation of low-light autonomous driving road scenes
JP7218805B2 (en) Semantic segmentation using soft cross-entropy loss
Zhao et al. Pyramid scene parsing network
CN111259905B (en) Feature fusion remote sensing image semantic segmentation method based on downsampling
CN106599773B (en) Deep learning image identification method and system for intelligent driving and terminal equipment
CN113506300B (en) Picture semantic segmentation method and system based on rainy day complex road scene
CN112686207B (en) Urban street scene target detection method based on regional information enhancement
CN111563909A (en) Semantic segmentation method for complex street view image
CN112712138B (en) Image processing method, device, equipment and storage medium
CN110781850A (en) Semantic segmentation system and method for road recognition, and computer storage medium
CN110706239A (en) Scene segmentation method fusing full convolution neural network and improved ASPP module
WO2023030182A1 (en) Image generation method and apparatus
CN114048822A (en) Attention mechanism feature fusion segmentation method for image
Rafique et al. Smart traffic monitoring through pyramid pooling vehicle detection and filter-based tracking on aerial images
CN113792641A (en) High-resolution lightweight human body posture estimation method combined with multispectral attention mechanism
CN116189191A (en) Variable-length license plate recognition method based on yolov5
CN117157679A (en) Perception network, training method of perception network, object recognition method and device
Jiang et al. Semantic segmentation network combined with edge detection for building extraction in remote sensing images
CN116863227A (en) Hazardous chemical vehicle detection method based on improved YOLOv5
CN116740516A (en) Target detection method and system based on multi-scale fusion feature extraction
CN115965783A (en) Unstructured road segmentation method based on point cloud and image feature fusion
CN114757819A (en) Structure-guided style deviation correction type style migration method and system
CN113255459A (en) Image sequence-based lane line detection method
Alshammari et al. Multi-task learning for automotive foggy scene understanding via domain adaptation to an illumination-invariant representation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant