CN113705453A - Driving scene segmentation method based on thermal infrared attention mechanism neural network - Google Patents


Info

Publication number
CN113705453A
CN113705453A (application CN202111001405.3A)
Authority
CN
China
Prior art keywords
thermal infrared
image
network
composite image
infrared information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111001405.3A
Other languages
Chinese (zh)
Inventor
桂媛媛
李伟
陶然
陈正超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Institute of Technology BIT
Aerospace Information Research Institute of CAS
Original Assignee
Beijing Institute of Technology BIT
Aerospace Information Research Institute of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Institute of Technology BIT, Aerospace Information Research Institute of CAS filed Critical Beijing Institute of Technology BIT
Priority to CN202111001405.3A priority Critical patent/CN113705453A/en
Publication of CN113705453A publication Critical patent/CN113705453A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Processing (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a driving scene segmentation method based on a thermal infrared attention mechanism neural network. First, visible light and thermal infrared images of the same scene and the same resolution are acquired and annotated to form a composite image training data set. A thermal infrared information attention network is then constructed and trained on this data set. Once a well-trained model is obtained, the network is used to segment driving images. The method segments driving environment images stably across different environments and with high segmentation accuracy.

Description

Driving scene segmentation method based on thermal infrared attention mechanism neural network
Technical Field
The invention relates to the technical field of driving environment image segmentation, in particular to a driving scene segmentation method based on a thermal infrared attention mechanism neural network.
Background
When driving, a driver must continually judge the surrounding environment in order to drive safely. In recent years, vehicle-mounted vision sensors in the automobile industry have greatly helped drivers assess complex driving environments. These sensors come in many forms: a rear-facing sensor lets the driver observe the situation behind the vehicle and assists with reversing, while side-mounted sensors reveal the blind areas to the front and side during driving, helping to avoid traffic accidents. Moreover, the rapid development of driverless vehicles has made vision sensors an essential basis for a vehicle's judgment of its surroundings. Rapidly analyzing the images captured by the vision sensors therefore realizes their full value, supports the driver's judgment, and reduces driving accidents.
Current vehicle vision sensing schemes fall into two main categories. The first relies chiefly on laser radar (LiDAR) to build a 3D map and determine the real-time stereoscopic scene around the car, from which judgments are made. The second is camera-based: by acquiring large numbers of images and processing them rapidly, it perceives drivable lanes and obstacles and then judges the environment. The camera-based scheme costs less overall than the LiDAR scheme and its hardware is mature, which has attracted broad attention from automobile manufacturers.
Vision-based semantic segmentation of driving scenes classifies the sensor image semantically at the pixel level, so that objects in a traffic scene can be identified quickly and the driving system can make subsequent judgments; its speed and large information content have drawn much attention. However, driving scenes are complex, the images contain many objects, and image quality varies greatly across environments. In particular, visible light images are almost useless at night compared with the daytime, which makes semantic segmentation of driving scenes difficult.
Disclosure of Invention
The invention provides a driving scene segmentation method based on a thermal infrared attention mechanism neural network aiming at the defects in the prior art, and solves the defects in the prior art.
In order to realize the purpose, the technical scheme adopted by the invention is as follows:
a driving scene segmentation method based on a thermal infrared attention mechanism neural network comprises the following steps:
step 1, acquiring a plurality of pairs of visible light images and thermal infrared images of the same driving environment and the same resolution provided by a vehicle-mounted camera, and performing category marking to obtain a composite image training data set;
step 2, constructing a thermal infrared information attention network;
step 3, training the thermal infrared information attention network according to the composite image training data set to obtain a trained thermal infrared information attention network;
step 4, segmenting the driving environment image by using the trained thermal infrared information attention network, wherein the driving environment image is a visible light and thermal infrared image pair with the same resolution as the composite image training data set.
Further, step 1 comprises the following sub-steps:
step 1.1: acquiring driving environment visible light and thermal infrared images of the same resolution and the same scene from the vehicle-mounted camera, so that corresponding pixels in the two images show the same content. The visible light image is formed from the red, green and blue spectral bands captured by an ordinary color camera; the thermal infrared image is captured by a thermal infrared imager; together they form a composite image pair. Many such same-resolution image pairs must be prepared, covering multiple scenes and multiple environments;
step 1.2: cropping the composite image pairs to the same size, the same image orientation, and the same length and width, to form a composite image training data set;
further, step 2 comprises the following sub-steps:
step 2.1: determining the basic parameters of the thermal infrared information attention network according to factors such as the quality and quantity of the composite image pairs;
step 2.2: building a thermal infrared information attention network structure based on a deep learning environment, and setting a network according to the parameters determined in the step 2.1;
further, step 3 comprises the following sub-steps:
step 3.1: setting the training times of the thermal infrared information attention network according to the quality and the quantity of the composite image training data set;
step 3.2: training a thermal infrared information attention network by using a composite image training data set;
step 3.3: saving the network parameters of the trained thermal infrared information attention network;
further, step 4 comprises the following sub-steps:
step 4.1: acquiring a driving environment composite image pair which is provided by a vehicle-mounted camera and has the same resolution as the composite image training data set, and normalizing the composite image pair to ensure that the length and the width of the composite image pair are the same as those of the image pair in the training data set;
step 4.2: inputting the normalized composite image to be segmented into the trained thermal infrared information attention network to obtain the segmentation result of the network;
step 4.3: and sorting and storing the driving scene segmentation result of the thermal infrared information attention network.
Compared with the prior art, the invention has the following beneficial effects:
1. The method provides a systematic driving environment semantic segmentation method that judges the driving environment by combining the high resolution and rich color information of visible light images with the wide environmental applicability and high brightness of thermal infrared images. Moreover, the driving environment segmentation system provided by the method is suitable for various vehicles and offers low cost, simple operation and good segmentation results.
2. The method provides a new deep learning semantic segmentation network, the thermal infrared information attention network, which fuses the features extracted from the visible light and thermal infrared images. Through the thermal infrared information attention block, the extracted thermal infrared features supervise the learning of the network, and the weight given to the visible light image during segmentation is determined according to the basic driving environment information in the thermal infrared image. Compared with most semantic segmentation networks, the thermal infrared information attention network has a simple structure, can quickly process the large numbers of images generated by the driving sensors, and improves segmentation efficiency.
Drawings
FIG. 1 is a flow chart of a driving scene segmentation method based on a thermal infrared attention mechanism neural network according to the present invention;
FIG. 2 is a schematic view of a visible-thermal infrared binocular camera according to the present invention;
FIG. 3 is a diagram of a thermal infrared information attention network constructed in accordance with the present invention;
FIG. 4 is a block diagram of a thermal infrared message attention block designed in accordance with the present invention;
FIG. 5 is a schematic diagram of thermal infrared information attention network training designed by the present invention;
FIG. 6 is a schematic diagram illustrating the segmentation of a thermal infrared information attention network in a driving environment according to the present invention;
FIG. 7 is a schematic diagram of a driving scene segmentation result using the public data set according to the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be further described in detail below with reference to the accompanying drawings by way of examples.
The following description of the embodiments is provided to help those skilled in the art understand the invention, but it should be understood that the invention is not limited to the scope of these embodiments. Various changes apparent to those skilled in the art that do not depart from the spirit and scope of the invention as defined in the appended claims, and all matter produced using the inventive concept, are protected.
As shown in fig. 1, a driving scene segmentation method based on a thermal infrared attention mechanism neural network includes the following steps:
step 1: collecting visible light images and thermal infrared images with the same scene and the same resolution ratio to manufacture a composite image training data set;
step 2: constructing a thermal infrared information attention network based on deep learning, and setting network parameters;
step 3: training the thermal infrared information attention network with the composite image training data set;
step 4: segmenting composite images of other driving environments with the trained thermal infrared information attention network.
Step 1 of the present embodiment comprises the steps of:
step 1.1: acquiring a composite image pair of good image quality generated by the vehicle-mounted vision sensor, comprising a visible light image and a thermal infrared image. The two images in the composite image pair have the same resolution and show the same driving scene. Generally, a visible light-thermal infrared binocular camera can be used to acquire the image pair, which guarantees the same resolution and the same scene; an imaging schematic of the visible light-thermal infrared binocular camera is shown in fig. 2;
step 1.2: on the basis of step 1.1, marking the composite image pairs pixel by pixel into seven classes (vehicle, pedestrian, bicycle, traffic ground mark, roadblock, railing, other). The labels are pixel-level semantic labels: every pixel in the image must have a corresponding semantic label.
Step 1.3: on the basis of step 1.1 and step 1.2, all image pairs are cut to the same size, and the length and the width of each image pair are ensured to be the same. Under the condition of 256 pixels in length and 256 pixels in width, at least 200 pairs of labels are needed, and the number and quality of the labels directly influence the training effect of the deep learning network. In addition, the number of the various marked objects should be as close as possible to avoid the phenomenon of class imbalance caused by different training effects. The marked images correspond to weather conditions (sunny, cloudy, rainy, snowy, foggy, etc.) and different time conditions (morning, midday, evening, night, etc.) as much as possible.
Step 2 of this embodiment comprises the steps of:
step 2.1: constructing the deep learning network, the Thermal Infrared Information Attention Network (TIAttNet). The thermal infrared information attention network designed by the invention is a semantic segmentation model based on deep learning and differs from most semantic segmentation networks. Its structure is shown in fig. 3: it is an end-to-end network built on the basic encoder-decoder architecture, consisting of a visible light image encoder, a composite image decoder and a thermal infrared down-sampling system.
The visible light image encoder is composed of a series of down blocks. Each down block comprises a convolution operation, a batch normalization operation and an activation function operation, and every block except the last also includes a down-sampling operation. In this way the network extracts the shape and texture features of the visible light image. The operation formula of a down block of the visible light image encoder is
X_out2 = AveragePooling(LeakyReLU(BN(Conv(X_in))))    (1)
In the formula, X_in is the input of the visible light encoder down block and X_out2 is its output; Conv represents a convolution operation with a convolution kernel size of 3 × 3; BN is the batch normalization operation, whose formulas are
μ = (1/m) Σ x_i    (2)
σ² = (1/m) Σ (x_i − μ)²    (3)
x̂_i = (x_i − μ) / sqrt(σ² + ε)    (4)
y_i = γ·x̂_i + β    (5)
LeakyReLU is the leaky rectified linear unit activation operation; its formula is
f(x)=max(0.01x,x) (6)
The leaky linear rectification keeps a small response in negative regions so that they are not discarded by the network during computation. AveragePooling represents the down-sampling operation using the average.
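The down block operations just described (batch normalization, the leaky ReLU of Eq. (6), and average pooling) can be sketched in NumPy. This is a stand-alone illustration of the math, not the patent's implementation; function names and the toy batch are my own:

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Batch normalization: per-batch mean and variance, normalize,
    then scale and shift by the learned gamma and beta."""
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta

def leaky_relu(x, slope=0.01):
    """f(x) = max(0.01x, x): negative regions are scaled, not zeroed."""
    return np.maximum(slope * x, x)

def average_pool_2x2(x):
    """2x2 average pooling: the AveragePooling down-sampling step."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

feats = np.array([[1.0, -2.0], [3.0, -4.0], [5.0, -6.0]])  # toy batch
normed = batch_norm(feats)
print(np.round(normed.mean(axis=0), 6))   # per-feature mean -> ~0
print(leaky_relu(np.array([-1.0, 2.0])))  # negatives scaled by 0.01
print(average_pool_2x2(np.array([[-1.0, 2.0], [3.0, -4.0]])))  # [[0.]]
```

Note how leaky_relu keeps a 1% response for negative inputs, which is exactly why negative regions are not discarded during computation.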
The thermal infrared down-sampling system is not a standard encoder but a down-sampling feature extraction system whose main purpose is to extract features of the thermal infrared image at different scales. It comprises several thermal infrared down blocks; each consists of a down-sampling operation of a different size, a convolution operation, a batch normalization operation and an activation function operation, with the formula
x_out2 = LeakyReLU(BN(Conv(MaxPooling(x_in))))    (7)
In the formula, x_in represents the input of a thermal infrared down block (the thermal infrared image) and x_out2 its output; MaxPooling represents the maximum pooling operation, with different down blocks using different down-sampling sizes; Conv represents a convolution operation with a convolution kernel size of 3 × 3; BN represents the batch normalization operation; and LeakyReLU represents the leaky rectified linear unit activation operation.
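As an illustration of the multi-scale MaxPooling used by the thermal infrared down blocks, a minimal NumPy sketch follows (function name and toy feature map are my own; different pooling sizes k stand in for the different down-sampling sizes):

```python
import numpy as np

def max_pool(x, k):
    """k x k maximum pooling: the MaxPooling step of a thermal infrared
    down block; different down blocks use different k to extract
    features of the thermal infrared image at different scales."""
    h, w = x.shape
    return x.reshape(h // k, k, w // k, k).max(axis=(1, 3))

tir = np.arange(16.0).reshape(4, 4)  # stand-in thermal infrared map
print(max_pool(tir, 2))  # maxima of each 2x2 block: 5, 7, 13, 15
print(max_pool(tir, 4))  # a single global maximum: 15
```

Varying k yields feature maps of different resolutions from the same thermal infrared input, which is what the decoder's attention blocks later consume at matching sizes.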
The composite image decoder includes a series of up blocks and a series of thermal infrared information attention blocks. The up block of the decoder restores the features and the size of the image and consists of an up-sampling operation, two convolution operations, two batch normalization operations and two activation functions. The operation formula of the decoder up block is
X_out = LeakyReLU(BN(Conv(LeakyReLU(BN(Conv(Concatenate(X_in, UpSampling(X'_out))))))))    (8)
In the formula, X_in represents the visible light image features extracted by the corresponding down block of the visible light encoder, X'_out represents the output of the thermal infrared information attention block located before the up block (0 if there is none), Concatenate represents the matrix joining operation, and UpSampling represents the up-sampling operation, which expands the length and width of the feature to twice the input.
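The UpSampling and Concatenate operations of the up block can be sketched as follows. This is an illustrative NumPy stand-in (nearest-neighbour up-sampling is assumed; the patent does not specify the interpolation), with made-up shapes:

```python
import numpy as np

def upsample_2x(x):
    """Nearest-neighbour up-sampling: expands the length and width of
    the feature to twice the input, as UpSampling does in the up block."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

def concatenate_features(a, b):
    """Matrix joining (Concatenate) along the channel axis."""
    return np.concatenate([a, b], axis=-1)

skip = np.zeros((4, 4, 8))  # X_in: features from the encoder down block
low = np.ones((2, 2, 8))    # X'_out: output of the previous attention block
merged = concatenate_features(skip, upsample_2x(low))
print(merged.shape)  # (4, 4, 16)
```

Up-sampling first brings X'_out to the spatial size of the encoder features, and concatenation then doubles the channel count before the two convolutions of Eq. (8)'s up block.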
The thermal infrared information attention block of the decoder uses the extracted thermal infrared features as a reference, making the network focus on the more salient features of the thermal infrared image. Its inputs are visible light features and thermal infrared image features of the same size. The block first applies convolution and batch normalization to both features and then an activation function. A gate operation is then applied to the visible light features to judge the quality of the visible light image, which determines the proportion with which the thermal infrared features participate in training. For example, in a clear daytime environment the visible light image quality is generally good, so the thermal infrared image participates less and the network mainly learns feature expressions from the visible light image; in a cloudy or night-time driving environment the visible light quality is generally poor, while the thermal infrared image remains stable across environments and therefore outperforms the visible light, so the network mainly learns feature expressions from the thermal infrared image. Finally, the two features are combined by addition, activation, convolution and related operations to obtain the output. The attention operation of the thermal infrared information attention block is
X' = LeakyReLU(BN(Conv(X_in))), x' = LeakyReLU(BN(Conv(x_in)))
G = Gate(X', x')
X_out2 = Conv(LeakyReLU(G ⊙ X' + (1 − G) ⊙ x'))    (9)
In the formula, x_in represents the output of the corresponding thermal infrared down block, X_in represents the output of the up block structure preceding the thermal infrared information attention block, X_out2 represents the output of the thermal infrared information attention block, and Gate represents the gate operation, whose formula is
G(X,x)=sigmoid(W1X+W2x) (10)
In the formula, W1 and W2 are weight values. Sigmoid is the sigmoid function, with the formula
sigmoid(x) = 1 / (1 + e^(−x))    (11)
The specific structure of the thermal infrared information attention block is shown in fig. 4;
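As an illustrative NumPy sketch of the gate operation G(X, x) = sigmoid(W1X + W2x) of Eq. (10): in the network W1 and W2 are learned, so the fixed scalar weights and toy feature values below are stand-ins for illustration only:

```python
import numpy as np

def sigmoid(x):
    """sigmoid(x) = 1 / (1 + e^(-x)): squashes values into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def gate(X, x, w1=0.5, w2=0.5):
    """Sketch of the gate G(X, x) = sigmoid(W1*X + W2*x). The output
    is a weight in (0, 1) that sets how strongly the visible light
    features (versus the thermal infrared features) are used."""
    return sigmoid(w1 * X + w2 * x)

vis = np.array([0.0, 4.0])    # stand-in visible light feature responses
tir = np.array([0.0, -4.0])   # stand-in thermal infrared responses
print(gate(vis, tir))  # [0.5 0.5]: opposing responses cancel here
```

Because the sigmoid output lies strictly between 0 and 1, it can act directly as the mixing proportion between the two feature streams.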
step 2.2: setting the input size, output size, loss function, optimizer function, learning rate and other parameters of the thermal infrared information attention network according to the network structure of step 2.1 and the parameters of the composite image training data set. Generally, the input size can be set to 256 × 256, the output size must be consistent with the input size, the loss function can be the cross entropy loss function, the optimizer function can be the adaptive moment estimation function (Adam), and the learning rate can be set to 0.001.
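The cross entropy loss suggested in step 2.2 can be sketched in NumPy for the seven-class pixel labeling. This is a generic illustration (names and the toy probabilities are my own), not the patent's code:

```python
import numpy as np

def cross_entropy(probs, labels):
    """Pixel-wise cross entropy over the seven classes: the mean
    negative log-probability assigned to each pixel's true class.
    probs: (N, 7) softmax outputs; labels: (N,) integer class ids."""
    n = probs.shape[0]
    return -np.log(probs[np.arange(n), labels] + 1e-12).mean()

# Two pixels: the first prediction is confident and correct, the
# second is uniform over the seven classes.
probs = np.array([[0.94] + [0.01] * 6,
                  [1 / 7.0] * 7])
labels = np.array([0, 3])
print(round(cross_entropy(probs, labels), 3))  # 1.004
```

A confident correct prediction contributes almost nothing to the loss, while the uniform prediction contributes about ln 7 ≈ 1.95, so minimizing this loss pushes the network toward confident, correct per-pixel labels.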
Step 3 of this embodiment comprises the steps of:
step 3.1: setting the number of training rounds and training the thermal infrared information attention network with the composite image training data set. The number of rounds can be judged from the number and quality of the composite image pairs: generally, the more image pairs, the more training rounds; the worse the image pair quality, the more training rounds. For a normal training data set, 200 rounds may be set. During training the thermal infrared information attention network aims to minimize the value of the loss function, so training can be stopped if the loss value does not decrease for several consecutive rounds.
Step 3.2: and inputting the composite image training data set into the network in batches during each training turn. The number of image pairs input per batch can be set according to the performance of the computer, and generally, the better the performance of the computer, the greater the number of image pairs input per batch. During each round of training, the thermal infrared information attention network learns according to the direction which enables the loss value to be minimum, and therefore parameters of the network are adjusted. Training requires saving network parameters that minimize the loss value.
Step 3.3: and (3.1) repeating the step 3.1 and the step 3.2, training the thermal infrared information attention network for multiple times, comparing different training results, and storing a group of network parameters which can minimize the loss value. A schematic diagram of the training of the thermal infrared information attention network is shown in fig. 5.
Step 4 of the present embodiment includes the following substeps:
step 4.1: preparing a composite image test data set. The composite image test set has the same resolution and size as the training set, and the driving environments they capture are approximately the same. The image pairs in the composite image test data set likewise require visible light and thermal infrared images of the same resolution and the same scene; a visible light-thermal infrared binocular camera can be used for image acquisition.
Step 4.2: and inputting the image pairs of the composite image test data set into a trained thermal infrared information attention network, and outputting the driving environment segmentation result of each image pair by the network. A schematic diagram of the division of the thermal infrared information attention network in the driving environment is shown in fig. 6. Fig. 7 shows the segmentation results generated by the thermal infrared information attention network for the real pair of running images, wherein the left image is the visible light image, the middle image is the thermal infrared image, and the right image is the segmentation result. In the segmentation result graph, black is a background, and the other parts are various types of segmented objects.

Claims (5)

1. A driving scene segmentation method based on a thermal infrared attention mechanism neural network is characterized by comprising the following steps:
step 1, acquiring a plurality of pairs of visible light images and thermal infrared images of the same driving environment and the same resolution provided by a vehicle-mounted camera, and performing category marking to obtain a composite image training data set;
step 2, constructing a thermal infrared information attention network;
step 3, training the thermal infrared information attention network according to the composite image training data set to obtain a trained thermal infrared information attention network;
step 4, segmenting the driving environment image by using the trained thermal infrared information attention network, wherein the driving environment image is a visible light and thermal infrared image pair with the same resolution as the composite image training data set.
2. The driving scene segmentation method based on the thermal infrared attention mechanism neural network as claimed in claim 1, wherein the step 1 comprises the following sub-steps:
step 1.1: acquiring driving environment visible light and thermal infrared images of the same resolution and the same scene provided by the vehicle-mounted camera, wherein corresponding pixels in the two images show the same content, the visible light image is formed from the red, green and blue spectral bands captured by an ordinary color camera, the thermal infrared image is captured by a thermal infrared imager, the two images form a composite image pair, and many such same-resolution image pairs must be prepared, covering multiple scenes and multiple environments;
step 1.2: cropping the composite image pairs to the same size, the same image orientation, and the same length and width, to form a composite image training data set.
3. The driving scene segmentation method based on the thermal infrared attention mechanism neural network as claimed in claim 1, wherein the step 2 comprises the following sub-steps:
step 2.1: determining the basic parameters of the thermal infrared information attention network according to factors such as the quality and quantity of the composite image pairs;
step 2.2: and (3) building a thermal infrared information attention network structure based on the deep learning environment, and setting a network according to the parameters determined in the step 2.1.
4. The driving scene segmentation method based on the thermal infrared attention mechanism neural network as claimed in claim 1, wherein the step 3 comprises the following sub-steps:
step 3.1: setting the training times of the thermal infrared information attention network according to the quality and the quantity of the composite image training data set;
step 3.2: training a thermal infrared information attention network by using a composite image training data set;
step 3.3: and saving the network parameters of the trained thermal infrared information attention network.
5. The driving scene segmentation method based on the thermal infrared attention mechanism neural network as claimed in claim 1, wherein the step 4 comprises the following sub-steps:
step 4.1: acquiring a driving environment composite image pair which is provided by a vehicle-mounted camera and has the same resolution as the composite image training data set, and normalizing the composite image pair to ensure that the length and the width of the composite image pair are the same as those of the image pair in the training data set;
step 4.2: inputting the normalized composite image to be segmented into the trained thermal infrared information attention network to obtain the segmentation result of the network;
step 4.3: and sorting and storing the driving scene segmentation result of the thermal infrared information attention network.
CN202111001405.3A 2021-08-30 2021-08-30 Driving scene segmentation method based on thermal infrared attention mechanism neural network Pending CN113705453A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111001405.3A CN113705453A (en) 2021-08-30 2021-08-30 Driving scene segmentation method based on thermal infrared attention mechanism neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111001405.3A CN113705453A (en) 2021-08-30 2021-08-30 Driving scene segmentation method based on thermal infrared attention mechanism neural network

Publications (1)

Publication Number Publication Date
CN113705453A 2021-11-26

Family

ID=78656527

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111001405.3A Pending CN113705453A (en) 2021-08-30 2021-08-30 Driving scene segmentation method based on thermal infrared attention mechanism neural network

Country Status (1)

Country Link
CN (1) CN113705453A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116757988A (en) * 2023-08-17 2023-09-15 齐鲁工业大学(山东省科学院) Infrared and visible light image fusion method based on semantic enrichment and segmentation tasks
CN116757988B (en) * 2023-08-17 2023-12-22 齐鲁工业大学(山东省科学院) Infrared and visible light image fusion method based on semantic enrichment and segmentation tasks

Similar Documents

Publication Publication Date Title
CN109740465B (en) Lane line detection algorithm based on example segmentation neural network framework
CN111274976B (en) Lane detection method and system based on multi-level fusion of vision and laser radar
CN109977812B (en) Vehicle-mounted video target detection method based on deep learning
CN106599773B (en) Deep learning image identification method and system for intelligent driving and terminal equipment
CN110069986B (en) Traffic signal lamp identification method and system based on hybrid model
CN110516633B (en) Lane line detection method and system based on deep learning
CN107016362B (en) Vehicle weight recognition method and system based on vehicle front windshield pasted mark
CN113780132A (en) Lane line detection method based on convolutional neural network
CN115019043B (en) Cross-attention mechanism-based three-dimensional object detection method based on image point cloud fusion
CN106295645A (en) A kind of license plate character recognition method and device
CN112766056A (en) Method and device for detecting lane line in low-light environment based on deep neural network
CN111079675A (en) Driving behavior analysis method based on target detection and target tracking
CN114445442B (en) Multispectral image semantic segmentation method based on asymmetric cross fusion
CN116189191A (en) Variable-length license plate recognition method based on yolov5
CN113705453A (en) Driving scene segmentation method based on thermal infrared attention mechanism neural network
CN112785610B (en) Lane line semantic segmentation method integrating low-level features
CN113221957A (en) Radar information fusion characteristic enhancement method based on Centernet
CN105023269A (en) Vehicle-mounted infrared image colorization method
CN109800693B (en) Night vehicle detection method based on color channel mixing characteristics
CN116630702A (en) Pavement adhesion coefficient prediction method based on semantic segmentation network
CN113221823B (en) Traffic signal lamp countdown identification method based on improved lightweight YOLOv3
CN114882469A (en) Traffic sign detection method and system based on DL-SSD model
Nataprawira et al. Pedestrian detection on multispectral images in different lighting conditions
Zakaria et al. Fully convolutional neural network for Malaysian road lane detection
CN114882205A (en) Target detection method based on attention mechanism

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination