CN115511759A - Point cloud image depth completion method based on cascade feature interaction

Point cloud image depth completion method based on cascade feature interaction

Info

Publication number
CN115511759A
Authority
CN
China
Prior art keywords
scene
point cloud
image
depth
neural network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211167454.9A
Other languages
Chinese (zh)
Inventor
梁韵基
陈能真
刘磊
於志文
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northwestern Polytechnical University
Original Assignee
Northwestern Polytechnical University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northwestern Polytechnical University filed Critical Northwestern Polytechnical University
Priority to CN202211167454.9A priority Critical patent/CN115511759A/en
Publication of CN115511759A publication Critical patent/CN115511759A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00 - Image enhancement or restoration
    • G06T5/50 - Image enhancement or restoration using two or more images, e.g. averaging or subtraction
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 - Image analysis
    • G06T7/50 - Depth or shape recovery
    • G06T7/55 - Depth or shape recovery from multiple images
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 - Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 - Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806 - Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T2207/10 - Image acquisition modality
    • G06T2207/10004 - Still image; Photographic image
    • G06T2207/10012 - Stereo images
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T2207/10 - Image acquisition modality
    • G06T2207/10028 - Range image; Depth image; 3D point clouds
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T2207/10 - Image acquisition modality
    • G06T2207/10032 - Satellite or aerial image; Remote sensing
    • G06T2207/10044 - Radar image
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T2207/20 - Special algorithmic details
    • G06T2207/20081 - Training; Learning
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T2207/20 - Special algorithmic details
    • G06T2207/20084 - Artificial neural networks [ANN]
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T2207/20 - Special algorithmic details
    • G06T2207/20212 - Image combination
    • G06T2207/20221 - Image fusion; Image merging

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Multimedia (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a point cloud image depth completion method based on cascade feature interaction, belonging to the field of automatic driving. The method comprises the following steps: acquiring a three-dimensional point cloud and a two-dimensional RGB image of an automatic driving scene; constructing encoders from a plurality of residual modules and decoders from a plurality of up-sampling modules, so as to build separate neural networks for the point cloud and the image; constructing a plurality of cascade feature interaction modules between the point cloud and image neural networks to obtain a feature-interaction point cloud and image double-branch neural network model; inputting the scene three-dimensional point cloud and the scene two-dimensional RGB image into the model, each branch outputting a dense scene depth map; and fusing the depth maps output by the two branches by confidence map weighting to obtain a depth map with higher reliability. Compared with other models based on image and point cloud fusion, the method achieves better depth perception performance when the image and a low-beam lidar point cloud are taken as input.

Description

Point cloud image depth completion method based on cascade feature interaction
Technical Field
The invention relates to the technical field of automatic driving, in particular to a point cloud image depth completion method based on cascade feature interaction.
Background
Depth perception is a fundamental and important perception technology in automatic driving systems. Its purpose is to acquire accurate and dense depth information of the surrounding scene; based on such dense depth information, many high-level perception tasks of automatic driving, such as semantic segmentation, target detection and three-dimensional scene reconstruction, can obtain large performance improvements. Automatic driving at the present stage mainly relies on two sensors, the camera and the lidar, for depth perception, and the two have complementary advantages and disadvantages: image data collected by the camera provides rich texture and color information of the scene but is strongly affected by illumination conditions, while point cloud data collected by the lidar provides accurate depth information of the scene and is not affected by illumination, but is very sparse and cannot provide sufficient effective information.
In the prior art, there are depth perception schemes based on pure images and depth perception schemes based on images and lidar point clouds.
Among the depth perception schemes based on pure images is monocular depth estimation, which, as the name implies, estimates the distance of each pixel in an image from the capturing camera using an RGB image from a single viewing angle. Monocular depth estimation based on supervised learning directly takes a two-dimensional image as input, a depth map as output and a ground truth depth map as supervision to train the depth model. In addition, because depth label data are difficult to acquire, many current algorithms are based on unsupervised models, i.e., they are trained jointly using only binocular image data collected by two cameras. The two binocular views can be predicted from each other to obtain the corresponding disparity data, from which the depth is derived using the relation between disparity and depth; alternatively, the correspondence of each pixel between the binocular images is treated as a stereo matching problem for training.
In depth perception schemes based on images and lidar point clouds, considering that the camera and the lidar have complementary advantages and disadvantages, existing automatic driving perception systems are usually based on multi-sensor fusion, complementing the strengths of the two kinds of sensor data by fusing image and point cloud data so as to improve depth perception capability. According to the stage at which fusion takes place, existing heterogeneous multi-sensor fusion schemes can be divided into three modes: early fusion, middle fusion and late fusion. Early fusion, also called data-layer fusion, fuses the two kinds of sensing data at the raw data level; it is simple to implement but has obvious drawbacks: information interaction and complementarity between the two modalities are not fully realized, the improvement from fusion is limited, and the fused result is sometimes even worse than that of a single perception modality. Feature-layer fusion, also called middle fusion, first extracts features from each kind of sensing data separately and then fuses the extracted features; its advantage is that a network can be designed for each sensing modality to fully extract its features, but it still cannot effectively realize full interaction between the two kinds of sensing data.
Current depth estimation methods based on pure images can be divided into traditional methods, machine learning-based methods and deep learning-based methods. Traditional methods are based on binocular or multi-view images: using stereo matching, the disparity between two images is converted into depth by triangulation, so that scene depth is estimated from the images. In monocular depth estimation based on machine learning, a probabilistic graphical model such as a Markov Random Field (MRF) is built over the depth relations, and depth estimation is realized by minimizing an energy function. Deep learning-based methods are currently the most widely used: a model is trained on input RGB images to learn the mapping from an image to a depth map. The disadvantage of this approach is that model performance depends heavily on data quality, so performance may degrade severely under poor lighting conditions, such as at night or in tunnels.
Schemes based on the fusion of images and lidar point clouds are the mainstream of automatic driving depth perception at the present stage and overcome the shortcomings of pure image schemes. In current point cloud and image fusion depth perception techniques, schemes based on early fusion can retain the original information of the data to the greatest extent, but existing techniques find it difficult to achieve fine-grained spatial alignment and fusion of heterogeneous sensing data, which often leads to poor fusion results. Schemes based on late fusion combine the perception results of the two sensors at the decision level; they are simple to implement, but each sensor remains limited on its own, the two modalities lack interaction and their advantages cannot complement each other, so the fusion effect is poor and is sometimes even worse because the perception results of the two sensors contradict each other. The more widely used scheme at present is multi-modal fusion at the feature level, whose advantage is that spatial alignment of the data does not need to be considered; however, the fusion granularity of the various existing feature-level fusion techniques is still not fine enough, and one modality is often used merely as auxiliary supplementary information for the other or fused only by simple addition, so the interaction between the two modalities is insufficient and the fusion is inadequate.
Disclosure of Invention
In order to solve the problems of insufficient depth perception accuracy and poor fusion of heterogeneous perception data in the above schemes, and to achieve fine-grained fusion and sufficient interaction of point cloud and image sensor data, a double-branch heterogeneous perception data cascade interaction network is provided. Corresponding features of the two modalities are fused at multiple scales, and the fused features are fed back into the branch networks of the respective modalities, which improves the information richness and the depth perception capability of both branch networks. In addition, the idea of an auxiliary task is introduced: an image reconstruction task guides the model to learn the scene structure information in the image, so that the structure information of the output depth map is more complete. Finally, the high-confidence depth values in the output depth maps of the two branch networks are selected through confidence maps and output as the final fusion perception result of the model.
The embodiment of the invention provides a point cloud image depth completion method based on cascade feature interaction, which comprises the following steps:
acquiring an automatic driving scene three-dimensional point cloud and a scene two-dimensional RGB image;
constructing two encoders for extracting characteristics of scene three-dimensional point cloud and scene two-dimensional RGB images according to a plurality of Resnet34 residual modules;
constructing two decoders for performing feature restoration on the scene three-dimensional point cloud and the scene two-dimensional RGB image according to the plurality of up-sampling modules;
connecting the encoder and the decoder of the scene three-dimensional point cloud extraction and restoration branch to construct a scene three-dimensional point cloud branch neural network;
connecting the encoder and the decoder of the scene two-dimensional RGB image extraction and restoration branch to construct a scene two-dimensional RGB image branch neural network;
setting the levels of the residual modules of the two encoders in the scene three-dimensional point cloud branch neural network and the scene two-dimensional RGB image branch neural network so that they correspond to each other;
constructing a plurality of cascade feature interaction modules, wherein the input of each cascade feature interaction module is connected with the output of the corresponding level of the residual modules of the two encoders and the output of each cascade feature interaction module is connected with the next corresponding level of the two encoders, so as to construct a feature-interaction point cloud and image double-branch neural network model;
inputting the scene three-dimensional point cloud and the scene two-dimensional RGB image into the feature-interaction point cloud and image double-branch neural network model, and outputting two scene depth maps (one from each branch);
and fusing the two scene depth maps by confidence map weighting to obtain a new scene depth map.
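For illustration, the following is a minimal PyTorch sketch of the double-branch layout described above. It is a simplification under stated assumptions, not the patented implementation: the class and parameter names (EncoderStage, DualBranchDepthNet, the channel widths) are illustrative, the point cloud branch uses standard rather than sparse convolutions, the level-wise fusion is reduced to a 1x1 convolution placeholder (the cascade feature interaction module itself is sketched further below), and each upsampling module uses a ReLU activation in place of the pooling layer.

```python
import torch
import torch.nn as nn

class EncoderStage(nn.Module):
    """One residual encoder stage (3x3 convolutions, stride-2 downsampling)."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        self.skip = nn.Conv2d(in_ch, out_ch, 1, stride=2, bias=False)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + self.skip(x))

class DualBranchDepthNet(nn.Module):
    """Point cloud and image branches coupled level by level by fusion modules."""
    def __init__(self, channels=(32, 64, 128, 256, 512)):
        super().__init__()
        pc_chs, im_chs = (1,) + channels, (3,) + channels   # sparse depth: 1 ch, RGB: 3 ch
        self.pc_enc = nn.ModuleList([EncoderStage(pc_chs[i], pc_chs[i + 1]) for i in range(5)])
        self.im_enc = nn.ModuleList([EncoderStage(im_chs[i], im_chs[i + 1]) for i in range(5)])
        # placeholder fusion; the cascade feature interaction module is sketched after its description below
        self.fuse = nn.ModuleList([nn.Conv2d(2 * c, c, 1) for c in channels])
        self.pc_dec, self.im_dec = self._decoder(channels), self._decoder(channels)

    @staticmethod
    def _decoder(channels):
        chs = list(channels[::-1]) + [32]
        layers = []
        for i in range(5):  # five upsampling modules: transposed conv + batch norm + activation
            layers += [nn.ConvTranspose2d(chs[i], chs[i + 1], 4, stride=2, padding=1),
                       nn.BatchNorm2d(chs[i + 1]), nn.ReLU(inplace=True)]
        layers.append(nn.Conv2d(32, 2, 3, padding=1))  # per-branch depth map + confidence map
        return nn.Sequential(*layers)

    def forward(self, sparse_depth, rgb):
        f_pc, f_im = sparse_depth, rgb
        for pc_stage, im_stage, fuse in zip(self.pc_enc, self.im_enc, self.fuse):
            f_pc, f_im = pc_stage(f_pc), im_stage(f_im)
            fused = fuse(torch.cat([f_pc, f_im], dim=1))  # fused features re-enter both branches
            f_pc, f_im = f_pc + fused, f_im + fused
        return self.pc_dec(f_pc), self.im_dec(f_im)
```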
Preferably, the two encoders for extracting the features of the scene three-dimensional point cloud and the scene two-dimensional RGB image each comprise five cascaded residual modules, and the two decoders for restoring the features of the scene three-dimensional point cloud and the scene two-dimensional RGB image each comprise five cascaded upsampling modules. In the encoder for extracting the features of the scene three-dimensional point cloud, the convolutional neural network of the residual modules adopts a sparse convolutional neural network with a 3x3 convolution kernel; in the encoder for extracting the features of the scene two-dimensional RGB image, the convolutional neural network of the residual modules adopts a standard convolutional neural network with a 3x3 convolution kernel.
Preferably, the three-dimensional point cloud branched neural network and the scene two-dimensional RGB image branched neural network each comprise a plurality of different convolutional layers, pooling layers, activation layers, transpose convolutional layers and cross-scale feature connection layers.
Preferably, each decoder comprises five cascaded upsampling modules, each comprising one transposed convolution, one batch normalization layer and one pooling layer.
Preferably, the number of the cascade feature interaction modules is five, each cascade feature interaction module comprises a 1x1 convolution, three dilated convolutions with dilation rates of 1, 2 and 4 respectively, and a 1x1 convolution, and the output of the last cascade feature interaction module is used as the input of the first up-sampling layer of the feature-interaction point cloud and image double-branch neural network model.
Preferably, the method further comprises the following steps:
taking the reconstructed image output by the last up-sampling module as an auxiliary task, calculating the difference between the reconstructed image and the input scene two-dimensional RGB image with the L2 loss function, and training the model to learn the structural information of the image.
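A possible form of this auxiliary reconstruction head, sketched under assumptions: the channel widths and the use of bilinear upsampling before each convolution are illustrative choices, since the text only states that each of the five upsampling modules consists of a convolution layer, a normalization layer and an activation layer and that the head outputs the reconstructed RGB image.

```python
import torch.nn as nn

def reconstruction_head(in_channels=512, base=256):
    """Auxiliary decoder that reconstructs the input RGB image from the output
    of the last cascade feature interaction module (channel widths are assumptions)."""
    chs = [in_channels, base, base // 2, base // 4, base // 8, base // 16]
    layers = []
    for i in range(5):  # five upsampling modules: upsample + conv + batch norm + activation
        layers += [nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
                   nn.Conv2d(chs[i], chs[i + 1], 3, padding=1),
                   nn.BatchNorm2d(chs[i + 1]),
                   nn.ReLU(inplace=True)]
    layers.append(nn.Conv2d(chs[5], 3, 3, padding=1))  # 3-channel reconstructed image
    return nn.Sequential(*layers)
```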
Preferably, the L2 loss function comprises:
L2 = Σ_i (D_i - D_i^gt)^2

wherein D_i represents the depth value at the i-th position of the predicted depth map, and D_i^gt represents the depth value at the i-th position of the ground truth depth map.
Preferably, the method further includes training the feature-interaction point cloud and image double-branch neural network model, which includes:
taking point cloud and image data pairs as the training data set;
performing enhancement processing on the images in the data set, the enhancement processing comprising flipping, cropping, brightness adjustment, normalization and conversion to tensors;
initializing the model parameters with a random Gaussian distribution;
and setting a loss function for model training and a loss function for the reconstructed image, adding the two loss functions with respective coefficients, taking the minimized loss as the optimization target, and training the model through a gradient update strategy to obtain the optimal model parameters.
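A hedged sketch of one training update with the combined loss follows. The loss weights W_DEPTH and W_RECON, the masking of invalid (zero) ground-truth pixels, and the assumed model interface (two branch depth maps plus the reconstructed image) are illustrative assumptions, not values given in the text.

```python
import torch
import torch.nn.functional as F

# hypothetical loss weights; the text says each loss gets its own coefficient
# but does not give values
W_DEPTH, W_RECON = 1.0, 0.1

def masked_l2(pred, gt):
    """Squared error on pixels with valid (non-zero) ground-truth depth."""
    valid = gt > 0
    return ((pred[valid] - gt[valid]) ** 2).mean()

def training_step(model, optimizer, sparse_depth, rgb, gt_depth):
    """One gradient update on the combined depth + reconstruction loss.
    Assumes a model interface returning the two branch depth maps and the
    reconstructed RGB image; this interface is illustrative."""
    optimizer.zero_grad()
    depth_pc, depth_im, recon_rgb = model(sparse_depth, rgb)
    loss = (W_DEPTH * (masked_l2(depth_pc, gt_depth) + masked_l2(depth_im, gt_depth))
            + W_RECON * F.mse_loss(recon_rgb, rgb))   # auxiliary reconstruction term
    loss.backward()
    optimizer.step()
    return loss.item()
```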
Preferably, fusing the two scene depth maps by confidence map weighting to obtain a new scene depth map includes:
respectively acquiring the two estimated depth values at corresponding positions of the two scene depth maps output by the point cloud and image double-branch neural network;
calculating the confidences of the two estimated depth values at the corresponding positions of the two scene depth maps;
multiplying each confidence by the depth value at the corresponding position of its scene depth map;
and adding the two products at each position to obtain the fused depth value, thereby obtaining a new scene depth map.
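A minimal sketch of this confidence-map weighted fusion, assuming each branch outputs a depth map and a confidence map of the same resolution; normalizing the two confidences with a pixel-wise softmax is an assumption, since the text only states that the confidences weight the depth values before they are added.

```python
import torch

def fuse_depth_maps(depth_pc, conf_pc, depth_im, conf_im):
    """Confidence-weighted fusion of the two branch outputs at every pixel."""
    weights = torch.softmax(torch.stack([conf_pc, conf_im], dim=0), dim=0)
    return weights[0] * depth_pc + weights[1] * depth_im
```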
The embodiment of the invention provides a point cloud image depth completion method based on cascade feature interaction, which has the following beneficial effects compared with the prior art:
In the point cloud image depth completion method based on cascade feature interaction, the fine-grained fusion of multi-scale point cloud and image features greatly improves the degree of interaction between the two heterogeneous sensing data and realizes their complementary advantages. The fused features are fed back into the corresponding branch networks, which enriches the information of the two branches and improves their perception capability. Finally, the outputs of the two branch networks are fused, and at each position the depth value with the higher confidence in the two output depth maps is taken as the depth value of the final output depth map. In addition, the outputs of the two modalities are independent of each other, with neither branch serving merely as an auxiliary to the other, which improves the robustness of the model to noise. Finally, compared with other models based on image and point cloud fusion, the model performs better when the image and a low-beam lidar point cloud are taken as input, which also proves that the model can be applied on resource-limited devices equipped with only a camera and a low-beam, low-cost lidar.
Drawings
FIG. 1 is a model structure diagram of a point cloud image depth completion method based on cascade feature interaction according to an embodiment of the present invention;
FIG. 2 is a structural diagram of the cascade feature interaction module according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Example:
referring to fig. 1-2, the embodiment provides a point cloud image depth complementing method based on cascade feature interaction, and a multi-scale fine-grained fusion model is constructed by using the characteristics that an image has color and texture information, but is influenced by illumination, and point cloud is not influenced by illumination and information is sparse, so that full fusion and advantage complementation of image and point cloud data are realized, and the sensing capability of small objects is greatly improved. In the embodiment, the image reconstruction auxiliary task is introduced to guide the model to learn the structural information in the image, so that the object contour of the output depth map is more complete. And fusing the output depth maps of the two branches by using a confidence map weighting mode to obtain a depth map with higher credibility.
Step1: building a multi-scale double-branch neural network model using the residual modules of Resnet34;
Step101: the encoder part of each branch network consists of five residual blocks, and the decoder part consists of five upsampling modules;
Step102: the two branch networks comprise a number of different convolution layers, pooling layers, activation layers, transposed convolution layers and cross-scale feature connections; in the encoder part of the point cloud branch the five residual blocks use sparse convolutions, in the encoder part of the image branch the five residual blocks use standard convolutions, and all convolution kernels are of size 3x3;
Step103: each upsampling module of the decoder consists of a transposed convolution, a batch normalization layer and a pooling layer;
Step104: five cascade feature interaction modules are arranged between the two branch networks; from top to bottom, each module consists of a 1x1 convolution, three dilated convolutions with dilation rates of 1, 2 and 4, and a 1x1 convolution, and its input is the feature maps of the corresponding level of the two branch networks;
Step105: five upsampling modules are connected after the last cascade feature interaction module, each consisting of a convolution layer, a normalization layer and an activation layer; the output of the last upsampling module is the reconstructed input image, which serves as an auxiliary task: the difference between the reconstructed image and the input RGB image is computed with the L2 loss function so that the model is trained to learn the structural information of the image;
the working process is as follows: each network of the double-branch network has an output depth map, and the depth values of the corresponding positions of the final output depth map are obtained by calculating the confidence degrees of two estimated depth values of the corresponding positions of the two depth maps, multiplying the confidence degrees by the depth values of the corresponding depth maps respectively and adding the two estimated depth values.
Step2: enhancing the images in the dataset, where the enhancement operations include flipping, cropping and brightness adjustment, followed by normalization and conversion to tensors, yielding a training dataset convenient for processing by the deep learning convolutional neural network;
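A possible augmentation pipeline using torchvision, under assumptions: the crop size, flip probability and brightness jitter strength are illustrative, and in practice the geometric transforms (flip, crop) must be applied jointly to the RGB image, the sparse depth input and the ground truth so that they stay aligned; the sketch below shows the RGB side only.

```python
from torchvision import transforms

# illustrative augmentation for the RGB input; exact parameters are not given in the text
rgb_augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomCrop((352, 1216)),
    transforms.ColorJitter(brightness=0.2),
    transforms.ToTensor(),                      # HWC uint8 -> CHW float in [0, 1]
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])
```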
step3: the real scene autopilot dataset used in this example is a KITTI2015 depth estimation and depth completion dataset that includes left and right images of a binocular camera, a lidar point cloud, and a ground truth depth map. The image input into the model is cropped to HxW to 325x1216 resolution. In order to accelerate the training speed of the model, the input image is normalized by zero mean value in the present example. The model parameters are initialized with random Gaussian distribution before training is started, and the model performance can be enhanced by enough randomness. The specific parameter settings for this example during training are as follows:
Parameter name                    Parameter value
Batch size                        16
Input image resolution (H x W)    352x1216
Number of training epochs         30
Learning rate                     1e-4
Effective depth value range (m)   0-80
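The same settings expressed as an illustrative configuration dictionary (the key names are assumptions, the values are taken from the table above):

```python
TRAIN_CONFIG = {
    "batch_size": 16,
    "input_resolution_hw": (352, 1216),
    "epochs": 30,
    "learning_rate": 1e-4,
    "valid_depth_range_m": (0.0, 80.0),
}
```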
Step4: according to the established double-branch network model, set the loss function of model training and the loss function of image reconstruction, add the two loss functions with their respective coefficients, take the minimized total loss as the optimization target, and train the model with a gradient update strategy to obtain the optimal model parameters.
The loss function used in this example is the L2 loss function:
L2 = Σ_i (D_i - D_i^gt)^2

where D_i represents the depth value at the i-th position of the predicted depth map and D_i^gt represents the depth value at the i-th position of the ground truth depth map. In this example, an Adam optimizer is used to optimize the model parameters so as to minimize the loss function. The Adam optimization process can be summarized as follows: at each iteration, the first-moment (mean) and second-moment (uncentered variance) estimates of the gradients are used to dynamically adjust the learning rate of each parameter, so that parameter updates are more stable during training and the model loss decreases steadily.
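The corresponding optimizer setup as a sketch; betas and epsilon are left at the PyTorch defaults since the text does not specify them:

```python
import torch

model = DualBranchDepthNet()  # the double-branch sketch defined earlier in this description
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
```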
Step5: input the point cloud and image data pairs into the double-branch network model to obtain the final output depth map.
In the point cloud image fusion depth completion method based on cascade feature interaction of this embodiment, networks specifically designed for the two kinds of sensing data are built on convolutional neural networks to extract the features of the two modalities separately, and the cascade feature interaction modules realize fine-grained fusion of multi-scale point cloud and image features. This fully improves the degree of interaction between the two modalities, enhances the depth perception capability of the two branch networks, and improves the robustness of the model to noise. After the image reconstruction task is introduced, it guides the model to learn the scene structure information in the image, and the output depth map of the model preserves object contours more completely. In comparisons on the KITTI2015 depth completion and depth estimation tasks, the model achieves the best performance on the depth estimation task and competitive performance on the depth completion task, and it also achieves strongly competitive results in a robustness experiment with added Gaussian noise, which demonstrates the practicality of the method.
Although the present invention has been described in detail with reference to the specific embodiments, it should be understood that various changes and modifications can be made by those skilled in the art without departing from the spirit and scope of the invention.

Claims (9)

1. A point cloud image depth completion method based on cascade feature interaction is characterized by comprising the following steps:
acquiring a three-dimensional point cloud of an automatic driving scene and a two-dimensional RGB image of the scene;
according to a plurality of cascaded residual modules of Resnet34, two encoders for extracting characteristics of scene three-dimensional point cloud and scene two-dimensional RGB images are constructed;
according to a plurality of cascaded up-sampling modules, two decoders for performing feature restoration on the scene three-dimensional point cloud and the scene two-dimensional RGB image are constructed;
connecting the output of the encoder of the scene three-dimensional point cloud extraction and restoration branch with the input of the corresponding decoder to construct a scene three-dimensional point cloud branch neural network;
connecting the output of the encoder of the scene two-dimensional RGB image extraction and restoration branch with the input of the corresponding decoder to construct a scene two-dimensional RGB image branch neural network;
setting levels of residual modules of two encoders in a scene three-dimensional point cloud branched neural network and a scene two-dimensional RGB image branched neural network in a one-to-one correspondence manner;
building each feature interaction module from, in sequence, a 1x1 convolution, three dilated convolutions with dilation rates of 1, 2 and 4 respectively, and a 1x1 convolution, and cascading them to obtain a plurality of cascade feature interaction modules, wherein the input of each cascade feature interaction module is connected with the output of the corresponding level of the residual modules of the two encoders and the output of each cascade feature interaction module is connected with the next corresponding level of the two encoders, so as to construct a feature-interaction point cloud and image double-branch neural network model;
inputting the scene three-dimensional point cloud and the scene two-dimensional RGB image into the feature-interaction point cloud and image double-branch neural network model, and outputting two scene depth maps;
and fusing the two scene depth maps by confidence map weighting to obtain a new scene depth map.
2. The point cloud image depth completion method based on cascade feature interaction as claimed in claim 1, wherein the two encoders for extracting the features of the scene three-dimensional point cloud and the scene two-dimensional RGB image each comprise five cascaded residual modules, and the two decoders for restoring the features of the scene three-dimensional point cloud and the scene two-dimensional RGB image each comprise five cascaded upsampling modules; in the encoder for extracting the features of the scene three-dimensional point cloud, the convolutional neural network of the residual modules adopts a sparse convolutional neural network with a 3x3 convolution kernel; in the encoder for extracting the features of the scene two-dimensional RGB image, the convolutional neural network of the residual modules adopts a standard convolutional neural network with a 3x3 convolution kernel.
3. The method of claim 1, wherein the three-dimensional point cloud branched neural network and the scene two-dimensional RGB image branched neural network each comprise a plurality of different convolutional layers, pooling layers, activation layers, transpose convolutional layers, and cross-scale feature connection layers.
4. The method of claim 1, wherein each decoder comprises five cascaded upsampling modules, each upsampling module comprising a transposed convolution, a batch normalization layer, and a pooling layer.
5. The method of claim 4, wherein the number of the cascade feature interaction modules is five, and the output of the last cascade feature interaction module is used as the input of the first up-sampling layer of the feature-interaction point cloud and image double-branch neural network model.
6. The method of claim 4, wherein the method of point cloud image depth completion based on cascade feature interaction further comprises:
and taking the reconstructed image output by the last up-sampling module as an auxiliary task, calculating the difference between the reconstructed image and the input scene two-dimensional RGB image according to the L2 loss function, and training a model to learn the structural information of the image.
7. The method of claim 6, wherein the L2 loss function comprises:
L2 = Σ_i (D_i - D_i^gt)^2

wherein D_i represents the depth value at the i-th position of the predicted depth map, and D_i^gt represents the depth value at the i-th position of the ground truth depth map.
8. The method of claim 7, further comprising training a point cloud and image dual-branch neural network model of feature interaction, which comprises:
taking point cloud and image data pairs as the training data set;
performing enhancement processing on the images in the data set, the enhancement processing comprising flipping, cropping, brightness adjustment, normalization and conversion to tensors;
initializing the model parameters with a random Gaussian distribution;
and setting a loss function for model training and a loss function for the reconstructed image, adding the two loss functions with respective coefficients, taking the minimized loss as the optimization target, and training the model through a gradient update strategy to obtain the optimal model parameters.
9. The method of claim 1, wherein fusing the two scene depth maps by confidence map weighting to obtain a new scene depth map comprises:
respectively acquiring the two estimated depth values at corresponding positions of the two scene depth maps output by the point cloud and image double-branch neural network;
calculating the confidences of the two estimated depth values at the corresponding positions of the two scene depth maps;
multiplying each confidence by the depth value at the corresponding position of its scene depth map;
and adding the two products at each position to obtain the fused depth value, thereby obtaining a new scene depth map.
CN202211167454.9A 2022-09-23 2022-09-23 Point cloud image depth completion method based on cascade feature interaction Pending CN115511759A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211167454.9A CN115511759A (en) 2022-09-23 2022-09-23 Point cloud image depth completion method based on cascade feature interaction

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211167454.9A CN115511759A (en) 2022-09-23 2022-09-23 Point cloud image depth completion method based on cascade feature interaction

Publications (1)

Publication Number Publication Date
CN115511759A true CN115511759A (en) 2022-12-23

Family

ID=84506860

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211167454.9A Pending CN115511759A (en) 2022-09-23 2022-09-23 Point cloud image depth completion method based on cascade feature interaction

Country Status (1)

Country Link
CN (1) CN115511759A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115861401A (en) * 2023-02-27 2023-03-28 之江实验室 Binocular and point cloud fusion depth recovery method, device and medium
CN116503418A (en) * 2023-06-30 2023-07-28 贵州大学 Crop three-dimensional target detection method under complex scene
CN116503418B (en) * 2023-06-30 2023-09-01 贵州大学 Crop three-dimensional target detection method under complex scene


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination