CN114549297A - Unsupervised monocular depth estimation method based on uncertain analysis - Google Patents

Unsupervised monocular depth estimation method based on uncertain analysis

Info

Publication number
CN114549297A
CN114549297A
Authority
CN
China
Prior art keywords
depth
depth estimation
uncertainty
function
unsupervised
Prior art date
Legal status
Pending
Application number
CN202111185472.5A
Other languages
Chinese (zh)
Inventor
宋传学
齐春阳
彭思仑
宋世欣
肖峰
王达
Current Assignee
Jilin University
Original Assignee
Jilin University
Priority date
Filing date
Publication date
Application filed by Jilin University filed Critical Jilin University
Priority to CN202111185472.5A priority Critical patent/CN114549297A/en
Publication of CN114549297A publication Critical patent/CN114549297A/en


Classifications

    • G06T3/4007: Scaling of whole images or parts thereof, e.g. expanding or contracting, based on interpolation, e.g. bilinear interpolation (G: Physics; G06: Computing; G06T: Image data processing or generation, in general)
    • G06N3/02: Neural networks (G06N: Computing arrangements based on specific computational models; G06N3/00: Computing arrangements based on biological models)
    • G06N3/08: Learning methods
    • G06T7/50: Image analysis; depth or shape recovery
    • G06T7/73: Determining position or orientation of objects or cameras using feature-based methods
    • G06T7/90: Determination of colour characteristics
    • Y02T10/40: Engine management systems (Y02T: Climate change mitigation technologies related to transportation; Y02T10/10: Internal combustion engine [ICE] based vehicles)

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses an unsupervised monocular depth estimation method based on uncertainty analysis. It first proposes an uncertainty-based unsupervised depth estimation network to address the low accuracy of predicted depth in monocular depth estimation, and the uncertainty learning method addresses the fact that the convolutional neural networks currently used for monocular depth estimation, for all their expressive power, cannot assess the reliability of their outputs. The method trains the deep network in an unsupervised manner, avoiding the need for depth labels. The loss function consists of three parts, namely a brightness loss, a smoothness loss, and an uncertainty loss, so the method not only estimates depth but also obtains a confidence measure for the estimated depth by predicting its variance.

Description

Unsupervised monocular depth estimation method based on uncertain analysis
Technical Field
The invention relates to the technical field of computer vision, in particular to an unsupervised monocular depth estimation method based on uncertain analysis.
Background
Depth estimation is critical to various high-level tasks in computer vision, such as autonomous driving and augmented reality. It is also a core technology in intelligent driver-assistance systems and intelligent vehicle-mounted vision systems. By incorporating the depth information from depth estimation, the running state of a vehicle can be monitored and forward-collision warning systems can be improved. In general, the faster a vehicle travels, the higher the requirements on inter-vehicle distance measurement; and because the coordination between the cameras of binocular and multi-view systems is difficult to adjust, the invention is developed on the basis of a monocular camera. Monocular vision is more cost-effective than binocular and multi-view systems, and its computational load in data processing is much smaller, so real-time performance is easier to achieve; it is a research hotspot and frontier in modern intelligent driver-assistance and vehicle-mounted vision systems. At present, monocular depth datasets are scarce, so training is possible only on scenes from specific datasets. In addition, in monocular depth estimation the details and depths around objects in the depth map are unclear, dynamic objects cause considerable interference, and a convolutional neural network cannot evaluate the reliability of its output, which is an urgent problem to be solved. Based on these problems, the invention provides an unsupervised monocular depth estimation method based on uncertainty.
Disclosure of Invention
In order to overcome the defects in the prior art, the embodiment of the invention provides an unsupervised monocular depth estimation method based on uncertain analysis, which is used for solving the problems in the background art.
The invention discloses an unsupervised monocular depth estimation method based on uncertainty analysis, characterized by comprising the following steps:
step 1: an uncertainty-based unsupervised depth estimation network is proposed, improving the depth prediction accuracy in monocular depth estimation;
step 2: based on step 1, the confidence of the estimated depth is predicted by modeling uncertainty, improving the prediction accuracy of the model while quantifying the uncertainty of the output result;
step 3: based on steps 1-2, a brightness loss function is constructed using Retinex illumination theory, and the interference caused by dynamic objects in the scene is resolved through operational transformations according to the basic theory of the Retinex algorithm.
Preferably, step 1 specifically comprises:
first, a likelihood function is defined:
p(y | y*) = N(y*, σ²);
where y denotes the observed depth, y* the depth output by the model, and σ² the observation noise variance;
secondly, the likelihood function is solved by taking the negative log-likelihood:
−log p(y | y*) ∝ ||y − y*||² / (2σ²) + (1/2) log σ²;
again, an objective function is established by minimizing this quantity over the N pixels:
min (1/N) Σ_i [ ||y_i − y_i*||² / (2σ_i²) + (1/2) log σ_i² ];
depth estimation from images is a regression task, for which the L1 loss is superior to the L2 loss;
then, replacing the squared residual with an L1 residual, the loss function for the uncertainty analysis is as follows:
L_U = (1/N) Σ_i ( ||y_i − y_i*||_1 / σ_i + log σ_i );
finally, the resulting loss function is:
L = L_R + L_S + L_U.
preferably, two consecutive frames I sampled from a given video that is unmarkedtAnd It-1First, estimate its depth map D using a depth networktAnd Dt-1
Then using the pose network PabTraining relative 6D poses between cameras; depth map D with predictiontAnd relative camera pose PabBy differentiable bilinear interpolationt-1Is formed by deformation
Figure RE-GDA0003577100850000024
Similarly, an image is obtained
Figure RE-GDA0003577100850000025
Finally, will
Figure RE-GDA0003577100850000026
Input into a depth net to obtain
Figure RE-GDA0003577100850000027
By means of uncertainty analysis, in
Figure RE-GDA0003577100850000028
And
Figure RE-GDA0003577100850000029
form a loss function L therebetweenU
Preferably, for the depth estimation network, the invention improves on DispNet, which takes a single RGB image as input and outputs a depth map.
Preferably, for the pose network, the invention uses the DispNet network without the mask prediction branch.
Preferably, step 3 specifically comprises:
firstly, according to the basic theory of the Retinex algorithm, the following expression is obtained:
I(x,y) = R(x,y) × L(x,y);
secondly, the process of solving for the incident component by a convolution operation with a low-pass filter can be expressed as:
L(x,y) = I(x,y) * G(x,y);
thirdly, the mathematical solution of the single-scale Retinex algorithm is given by:
log R_i(x,y) = log I_i(x,y) − log[G(x,y) * I_i(x,y)];
where i ∈ {R, G, B} indexes the color channels, R_i(x,y) denotes the pixel value of the reflectance image in the i-th color channel, I_i(x,y) denotes the pixel value of the original image I at (x,y) in the i-th color channel, * denotes the Gaussian convolution operation, and G(x,y) denotes the Gaussian surround function:
G(x,y) = (1 / (2πσ²)) exp(−(x² + y²) / (2σ²));
where σ represents the standard deviation of the Gaussian function, referred to herein as the scale parameter;
then, since depth estimation is a regression task, the loss functions most commonly used in optimizing regression tasks are the L2 and L1 losses:
L2 = (1/N) Σ_i (y_i − y_i*)²;
L1 = (1/N) Σ_i |y_i − y_i*|;
finally, from the above, conversion yields the brightness loss as an L1 penalty between the Retinex incident components of the real and synthesized images:
r_i(x,y) = log I_i(x,y) − log R_i(x,y);
L_R = (1/N) Σ_{x,y} |r_i(x,y) − r̂_i(x,y)|;
where N represents the number of pixels in the image and r_i(x,y) represents the incident light of an image.
Preferably, a smoothness prior is incorporated to regularize the estimated depth map, using an edge-aware smoothness loss given by the formula:
L_S = (1/N) Σ_p ( |∂_x d_p| e^{−|∂_x I_p|} + |∂_y d_p| e^{−|∂_y I_p|} );
where d_p and I_p denote the depth and image values at pixel p.
the invention has the following beneficial effects: the invention provides an uncertainty-based monocular depth estimation method, which solves the problem that the detail depth around a depth map object is unclear, and the most advanced performance can be obtained on a KITTI data set.
In order to make the aforementioned and other objects, features and advantages of the invention comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings required in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description are only some embodiments of the present invention, and that those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a basic diagram of the Retinex algorithm of the present invention;
FIG. 2 is a schematic diagram of an unsupervised monocular depth estimation method of the present invention;
FIG. 3 is a schematic diagram of the DispNet network structure of the present invention;
FIG. 4 is a first graph of the experimental results of the present invention;
FIG. 5 is a second graph showing the results of the experiment according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
An unsupervised monocular depth estimation method based on uncertain analysis comprises the following steps:
step 1: an unsupervised depth estimation network based on uncertainty is provided, and the depth prediction precision in monocular depth estimation is improved;
step 2: based on the step 1, the confidence coefficient of the estimated depth is predicted through modeling uncertainty, meanwhile, the model prediction precision is improved, and the uncertainty of an output result is quantified;
and step 3: based on the step 1-2, a Retinex illumination theory is used for constructing a brightness loss function, and the interference of dynamic objects in the scene is solved through operation and conversion according to the basic theory of a Retinex algorithm.
Further, step 1 specifically comprises: uncertainty in neural networks falls largely into two categories: model uncertainty and random uncertainty. Model uncertainty mainly refers to the uncertainty of the model parameters; when several models all perform well, the final model parameters need to be selected among them, and when the amount of input data is large enough, the model uncertainty becomes small. In general there is enough training data during training, so random uncertainty accounts for the main part of the uncertainty analysis;
first, a likelihood function is defined:
p(y | y*) = N(y*, σ²);
where y denotes the observed depth, y* the depth output by the model, and σ² the observation noise variance;
secondly, the likelihood function is solved by taking the negative log-likelihood:
−log p(y | y*) ∝ ||y − y*||² / (2σ²) + (1/2) log σ²;
then, an objective function is established by minimizing this quantity over the N pixels:
min (1/N) Σ_i [ ||y_i − y_i*||² / (2σ_i²) + (1/2) log σ_i² ];
depth estimation from images is a regression task, for which the L1 loss is superior to the L2 loss, because L1 optimizes small prediction errors well, which exactly matches the characteristics of depth estimation; therefore, replacing the squared residual with an L1 residual, the loss function for the uncertainty analysis is as follows:
L_U = (1/N) Σ_i ( ||y_i − y_i*||_1 / σ_i + log σ_i );
finally, the resulting loss function is:
L = L_R + L_S + L_U;
where L_R is the brightness loss, L_S the smoothness loss, and L_U the uncertainty loss.
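By way of illustration only, a minimal PyTorch sketch of the uncertainty loss L_U is given below; it assumes the depth network is extended to output a per-pixel log-variance map alongside the depth, and all names are illustrative rather than part of the claimed method.

```python
import torch

def uncertainty_loss(depth, depth_recon, log_sigma):
    """Heteroscedastic L1 uncertainty loss L_U: an uncertainty-weighted
    residual regression term plus an uncertainty regularization term.

    depth       : (B,1,H,W) depth map D predicted from the real image
    depth_recon : (B,1,H,W) depth map predicted from the synthesized image
    log_sigma   : (B,1,H,W) predicted per-pixel log standard deviation;
                  predicting log(sigma) keeps the division numerically stable
    """
    sigma = torch.exp(log_sigma)
    residual = torch.abs(depth - depth_recon) / sigma  # ||y - y*||_1 / sigma
    regularizer = log_sigma                            # log(sigma) penalty term
    return torch.mean(residual + regularizer)
```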
further, referring to FIG. 2, two consecutive frames I sampled from a given unmarked videotAnd It-1First, estimate its depth map D using a depth networktAnd Dt-1Then using the pose network PabTraining relative 6D poses between cameras; depth map D with predictiontAnd relative camera pose PabThe invention uses differentiable bilinear interpolation to convert It-1Is transformed to synthesize
Figure RE-GDA0003577100850000054
Similarly, an image is obtained
Figure RE-GDA0003577100850000055
Finally, will
Figure RE-GDA0003577100850000056
Input into a depth net to obtain
Figure RE-GDA0003577100850000057
By means of uncertainty analysis, in
Figure RE-GDA0003577100850000058
And
Figure RE-GDA0003577100850000059
form a loss function L therebetweenU
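The following PyTorch sketch shows one standard way to realize the differentiable bilinear warping described above; it assumes known camera intrinsics K and a 4×4 relative pose matrix, and the function and parameter names are illustrative.

```python
import torch
import torch.nn.functional as F

def synthesize_view(img_src, depth_tgt, pose, K, K_inv):
    """Warp a source frame I_{t-1} into the target view: back-project the
    target pixels with the predicted depth D_t, transform them with the
    relative camera pose P_ab, reproject, and sample I_{t-1} with
    differentiable bilinear interpolation (F.grid_sample).

    img_src   : (B,3,H,W) source frame
    depth_tgt : (B,1,H,W) predicted target depth
    pose      : (B,4,4) relative camera pose as a homogeneous transform
    K, K_inv  : (B,3,3) camera intrinsics and their inverse
    """
    B, _, H, W = img_src.shape
    dtype, device = img_src.dtype, img_src.device
    # Homogeneous pixel grid (u, v, 1) for every target pixel
    ys, xs = torch.meshgrid(torch.arange(H, dtype=dtype, device=device),
                            torch.arange(W, dtype=dtype, device=device),
                            indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], 0).view(1, 3, -1).expand(B, 3, H * W)
    # Back-project to 3D camera points and move into the source camera frame
    cam = (K_inv @ pix) * depth_tgt.view(B, 1, H * W)
    cam = torch.cat([cam, torch.ones(B, 1, H * W, dtype=dtype, device=device)], 1)
    proj = K @ (pose @ cam)[:, :3, :]
    z = proj[:, 2].clamp(min=1e-6)
    # Normalize projected coordinates to [-1, 1] for grid_sample
    u = 2.0 * (proj[:, 0] / z) / (W - 1) - 1.0
    v = 2.0 * (proj[:, 1] / z) / (H - 1) - 1.0
    grid = torch.stack([u, v], -1).view(B, H, W, 2)
    return F.grid_sample(img_src, grid, mode="bilinear",
                         padding_mode="border", align_corners=True)
```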
Further, for the depth estimation network, the present invention improves on DispNet, which takes a single RGB image as input and outputs a depth map.
Further, referring to fig. 3, for the pose network, the invention uses the DispNet network without the mask prediction branch.
Further, step 3 specifically comprises: the Retinex algorithm is based on three assumptions:
(1) The real physical world is colorless; the colors humans see are the result of the interaction of light with matter in the objective world.
(2) Each color region in an object is composed of red, green, and blue components of given wavelengths.
(3) The color of each region is determined by its red, green, and blue components.
Referring to fig. 1, unlike linear and nonlinear methods that can only enhance one particular characteristic of an image, the Retinex algorithm achieves a balance among dynamic-range compression, edge enhancement, and color normalization, so that different types of images can be enhanced adaptively, with clear improvements in color fidelity, enhancement of image edges, and dynamic-range compression.
By analyzing the illumination component and the reflectance component, the illumination information and the reflection information in an image can be distinguished, which resolves the changes in image brightness and color caused by changes in illumination: the illumination information that disturbs human vision is removed through various transformations, while the reflection information of the object, which contains the object's attribute information, is retained to the greatest extent. Modeled on the human visual system, the Retinex algorithm has developed from the single-scale Retinex algorithm to the multi-scale weighted-average Retinex algorithm, and further into the multi-scale Retinex algorithm with color restoration;
firstly, according to the basic theory of the Retinex algorithm, the following expression is obtained:
I(x,y) = R(x,y) × L(x,y);
where R(x,y) is the reflectance component and L(x,y) the illumination (incident) component;
secondly, the main principle of the single-scale Retinex algorithm is to convolve each of the three channels of the image with a center-surround function, the convolved image being regarded as an estimate of the illumination component of the original image; the process of solving for the incident component by a convolution operation with a low-pass filter can be expressed as:
L(x,y) = I(x,y) * G(x,y);
thirdly, the mathematical solution of the single-scale Retinex algorithm is given by:
log R_i(x,y) = log I_i(x,y) − log[G(x,y) * I_i(x,y)];
where i ∈ {R, G, B} indexes the color channels, R_i(x,y) denotes the pixel value of the reflectance image in the i-th color channel, I_i(x,y) denotes the pixel value of the original image I at (x,y) in the i-th color channel, * denotes the Gaussian convolution operation, and G(x,y) denotes the Gaussian surround function:
G(x,y) = (1 / (2πσ²)) exp(−(x² + y²) / (2σ²));
where σ represents the standard deviation of the Gaussian function, referred to herein as the scale parameter. The size of this standard deviation has a great influence on the Retinex algorithm: the smaller σ is, the better the detail information of the enhanced image, but image distortion and halo phenomena occur easily; the larger σ is, the better the color of the enhanced image is preserved, but the degree of sharpening is greater and the contrast-enhancement effect is poorer;
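As an illustration of the single-scale Retinex computation just described, the following sketch (using OpenCV and NumPy; the default parameter value is illustrative) estimates the illumination by Gaussian blurring and recovers the log-domain reflectance:

```python
import cv2
import numpy as np

def single_scale_retinex(img, sigma=80.0):
    """Single-scale Retinex: log R_i = log I_i - log(G * I_i), per channel.
    The Gaussian blur G * I estimates the illumination component; sigma is
    the scale parameter (small sigma keeps detail but risks halos, large
    sigma preserves color but enhances contrast less)."""
    img = img.astype(np.float64) + 1.0                   # avoid log(0)
    illumination = cv2.GaussianBlur(img, (0, 0), sigma)  # G(x,y) * I(x,y)
    return np.log(img) - np.log(illumination)            # log-domain reflectance
```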
then, since depth estimation is a regression task, the loss functions most commonly used in optimizing regression tasks are the L2 and L1 losses:
L2 = (1/N) Σ_i (y_i − y_i*)²;
L1 = (1/N) Σ_i |y_i − y_i*|;
the squaring operation makes the L2 loss sensitive to outliers: it optimizes large prediction errors well but has poor ability to further optimize small prediction errors, whereas the L1 loss optimizes small prediction errors well and is only average on large ones; in actual training the L1 loss performs slightly better. The uncertainty loss function provided by the invention combines the L1 loss with heteroscedastic random (aleatoric) uncertainty in the neural network, and consists of a residual regression term and an uncertainty regularization term;
finally, from the above, conversion yields the brightness loss as an L1 penalty between the Retinex incident components of the real and synthesized images:
r_i(x,y) = log I_i(x,y) − log R_i(x,y);
L_R = (1/N) Σ_{x,y} |r_i(x,y) − r̂_i(x,y)|;
where N represents the number of pixels in the image and r_i(x,y) represents the incident light of an image.
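One plausible realization of this brightness loss is sketched below in PyTorch, comparing the log-domain incident components of the real and synthesized frames under an L1 penalty; since the original equation images are formulation-specific, the exact form here is an assumption, and the helper names are illustrative.

```python
import torch
import torch.nn.functional as F

def brightness_loss(img_real, img_synth, sigma=1.5, ksize=7):
    """A sketch of the Retinex-based brightness loss L_R (assumed form):
    take the log-domain incident component of each frame via a separable
    Gaussian blur, then apply an L1 penalty between the two frames."""
    coords = torch.arange(ksize, dtype=img_real.dtype,
                          device=img_real.device) - ksize // 2
    g = torch.exp(-coords ** 2 / (2 * sigma ** 2))
    g = (g / g.sum()).view(1, 1, 1, ksize)  # normalized 1-D Gaussian kernel

    def incident(img):
        c = img.shape[1]
        blur = F.conv2d(img, g.expand(c, 1, 1, ksize),
                        padding=(0, ksize // 2), groups=c)   # horizontal pass
        blur = F.conv2d(blur, g.view(1, 1, ksize, 1).expand(c, 1, ksize, 1),
                        padding=(ksize // 2, 0), groups=c)   # vertical pass
        return torch.log(blur + 1e-6)                        # log L(x,y)

    return torch.abs(incident(img_real) - incident(img_synth)).mean()
```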
Further, because photometric loss does not provide sufficient information in low-texture or homogeneous regions of a scene, existing work incorporates a smoothness prior to regularize the estimated depth map; the edge-aware smoothness loss used is given by:
L_S = (1/N) Σ_p ( |∂_x d_p| e^{−|∂_x I_p|} + |∂_y d_p| e^{−|∂_y I_p|} );
where d_p and I_p denote the depth and image values at pixel p.
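A minimal PyTorch sketch of this edge-aware smoothness loss follows; normalization choices for the depth (or disparity) vary across implementations and are omitted here.

```python
import torch

def smoothness_loss(depth, img):
    """Edge-aware smoothness loss L_S: penalize depth gradients, down-weighted
    by exp(-|image gradient|) so depth is allowed to change at true edges.
    depth: (B,1,H,W) predicted depth; img: (B,3,H,W) corresponding image."""
    dx_d = torch.abs(depth[:, :, :, 1:] - depth[:, :, :, :-1])
    dy_d = torch.abs(depth[:, :, 1:, :] - depth[:, :, :-1, :])
    dx_i = torch.mean(torch.abs(img[:, :, :, 1:] - img[:, :, :, :-1]), 1, keepdim=True)
    dy_i = torch.mean(torch.abs(img[:, :, 1:, :] - img[:, :, :-1, :]), 1, keepdim=True)
    return (dx_d * torch.exp(-dx_i)).mean() + (dy_d * torch.exp(-dy_i)).mean()
```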
the invention carries out the experiment of monocular depth estimation, which comprises the following steps: the experimental environment provided by the invention comprises a software environment and a hardware environment, wherein the software environment comprises: windows 1064-bit operating system, CUDA 9.1, cuDNN 7.1, Pythrch deep learning framework, Python 3.7.0 and MATLAB R2018 a; the hardware environment is as follows: intel (R) core (TM) i7-7700 CPU @3.60GHz processor, 32GB RAM and NVIDIA GeForce GTX 1080Ti GPU, 11 GB;
according to the invention, a KITTI data set is adopted to carry out monocular depth estimation experiments, about 42000 pictures are included in a training set, and no ground true value exists; in the training process, a random gradient descent method is adopted for optimization solution, the weight parameters of the training network are continuously updated by using a back propagation algorithm, the initial learning rate is set to be 0.001, the momentum factor is set to be 0.9, and the weight attenuation factor is set to be 0.0005; the learning rate is closely related to the convergence speed of the training network, the network model cannot be converged if the learning rate is too large, and the convergence speed of the network model becomes slow if the learning rate is too small; in the invention, the maximum iteration number of the training network is 20000, the learning rate of the previous 12000 times is set to be 0.001, the learning rate of the previous 12000 times to 16000 times is set to be 0.0001, and the learning rate of the previous 16000 times is set to be 0.00001;
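The optimizer and learning-rate schedule above map directly onto standard PyTorch components; a sketch follows, assuming `model` holds the combined depth and pose networks and that `scheduler.step()` is called once per training iteration.

```python
import torch

# Stochastic gradient descent with the stated hyperparameters
optimizer = torch.optim.SGD(model.parameters(),
                            lr=0.001,             # initial learning rate
                            momentum=0.9,         # momentum factor
                            weight_decay=0.0005)  # weight decay factor

# 0.001 for the first 12000 iterations, 0.0001 until 16000, 0.00001 after,
# over a maximum of 20000 training iterations
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer,
                                                 milestones=[12000, 16000],
                                                 gamma=0.1)
```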
in order to objectively evaluate the proposed monocular depth estimation model, it is quantitatively analyzed using the following four criteria:
mean absolute relative error (Rel):
Rel = (1/N) Σ_i |d_i − d_i*| / d_i*;
root mean square error (RMS):
RMS = sqrt( (1/N) Σ_i (d_i − d_i*)² );
root mean square logarithmic error (RMSlog):
RMSlog = sqrt( (1/N) Σ_i (log d_i − log d_i*)² );
accuracy within a threshold range: the percentage of predictions d_i such that max(d_i/d_i*, d_i*/d_i) = δ < thr;
where d_i and d_i* denote the predicted and ground-truth depths at pixel i, and N the number of pixels;
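The four criteria can be computed as below (a NumPy sketch over flattened arrays of valid depths); the thresholds thr = 1.25, 1.25², 1.25³ are the values conventionally used with this accuracy metric on KITTI and are an assumption here.

```python
import numpy as np

def depth_metrics(pred, gt):
    """Rel, RMS, RMSlog, and threshold accuracies for predicted depths `pred`
    against ground-truth depths `gt` (1-D arrays of positive values)."""
    rel = np.mean(np.abs(pred - gt) / gt)                         # mean abs. relative error
    rms = np.sqrt(np.mean((pred - gt) ** 2))                      # root mean square error
    rms_log = np.sqrt(np.mean((np.log(pred) - np.log(gt)) ** 2))  # RMS log error
    delta = np.maximum(pred / gt, gt / pred)
    acc = [np.mean(delta < 1.25 ** k) for k in (1, 2, 3)]         # threshold accuracies
    return rel, rms, rms_log, acc
```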
referring to fig. 4, through the unsupervised depth estimation method provided by the invention, experiments are carried out on random pictures in a test set, the topmost picture in fig. 4 is an original picture, the bottommost picture is the algorithm provided by the invention, and at the position of a dotted line frame in fig. 4, the phenomenon that the depth of the algorithm provided by the invention does not excessively drift around traffic lights is obviously seen, the precision is improved, and the reason of dynamic object interference is also solved;
referring to fig. 5, four pictures of adjacent frames are also realized by the algorithm provided by the present invention, and it can be easily seen that the depth around the traffic light is well improved.
Referring to Tables 1 and 2, to further verify the utility of the invention, experiments are carried out with the base network, the base network with brightness loss, the base network with uncertainty, and the base network with both brightness loss and uncertainty; all of these experimental methods use the same smoothness loss. The table data show that the brightness loss and the uncertainty loss of the invention resolve the problem of low accuracy.
TABLE 1 Ablation experiment 1 (resolution of input image: 416×128)
TABLE 2 Ablation experiment 2 (resolution of input image: 832×256)
The principle and implementation of the invention are explained herein through specific embodiments, and the description of the embodiments is only intended to help in understanding the method and its core idea. Meanwhile, for those skilled in the art, there may be variations in the specific embodiments and the scope of application according to the idea of the invention. In summary, the content of this specification should not be construed as limiting the invention.

Claims (7)

1. An unsupervised monocular depth estimation method based on uncertainty analysis, characterized by comprising the following steps:
step 1: an uncertainty-based unsupervised depth estimation network is proposed, improving the depth prediction accuracy in monocular depth estimation;
step 2: based on step 1, the confidence of the estimated depth is predicted by modeling uncertainty, improving the prediction accuracy of the model while quantifying the uncertainty of the output result;
step 3: based on steps 1-2, a brightness loss function is constructed using Retinex illumination theory, and the interference of dynamic objects in the scene is resolved through operational transformations according to the basic theory of the Retinex algorithm.
2. The unsupervised monocular depth estimation method based on uncertainty analysis of claim 1, wherein step 1 specifically comprises:
first, a likelihood function is defined:
p(y | y*) = N(y*, σ²);
where y denotes the observed depth, y* the depth output by the model, and σ² the observation noise variance;
secondly, the likelihood function is solved by taking the negative log-likelihood:
−log p(y | y*) ∝ ||y − y*||² / (2σ²) + (1/2) log σ²;
again, an objective function is established:
min (1/N) Σ_i [ ||y_i − y_i*||² / (2σ_i²) + (1/2) log σ_i² ];
depth estimation from images is a regression task, for which the L1 loss is superior to the L2 loss;
then, the loss function for the uncertainty analysis is as follows:
L_U = (1/N) Σ_i ( ||y_i − y_i*||_1 / σ_i + log σ_i );
finally, the resulting loss function is:
L = L_R + L_S + L_U.
3. The unsupervised monocular depth estimation method based on uncertainty analysis of claim 2, wherein, given two consecutive frames I_t and I_{t−1} sampled from an unlabeled video, their depth maps D_t and D_{t−1} are first estimated using the depth network; the pose network P_ab is then used to learn the relative 6D pose between the cameras; with the predicted depth map D_t and relative camera pose P_ab, I_{t−1} is warped by differentiable bilinear interpolation to synthesize Î_t, and similarly the image Î_{t−1} is obtained; finally, (Î_t, Î_{t−1}) are input into the depth network to obtain (D̂_t, D̂_{t−1}), and by means of the uncertainty analysis the loss function L_U is formed between (D_t, D_{t−1}) and (D̂_t, D̂_{t−1}).
4. The unsupervised monocular depth estimation method based on uncertainty analysis of claim 1, wherein, for the depth estimation network, the invention improves on DispNet, which takes a single RGB image as input and outputs a depth map.
5. The unsupervised monocular depth estimation method based on uncertainty analysis of claim 3, wherein, for the pose network, the invention uses the DispNet network without the mask prediction branch.
6. The unsupervised monocular depth estimation method based on uncertainty analysis of claim 1, wherein step 3 specifically comprises:
firstly, according to the basic theory of the Retinex algorithm, the following expression is obtained:
I(x,y) = R(x,y) × L(x,y);
secondly, the process of solving for the incident component by a convolution operation with a low-pass filter can be expressed as:
L(x,y) = I(x,y) * G(x,y);
thirdly, the mathematical solution of the single-scale Retinex algorithm is given by:
log R_i(x,y) = log I_i(x,y) − log[G(x,y) * I_i(x,y)];
where i ∈ {R, G, B} indexes the color channels, R_i(x,y) denotes the pixel value of the reflectance image in the i-th color channel, I_i(x,y) denotes the pixel value of the original image I at (x,y) in the i-th color channel, * denotes the Gaussian convolution operation, and G(x,y) denotes the Gaussian surround function:
G(x,y) = (1 / (2πσ²)) exp(−(x² + y²) / (2σ²));
where σ represents the standard deviation of the Gaussian function, referred to herein as the scale parameter;
then, since depth estimation is a regression task, the loss functions most commonly used in optimizing regression tasks are the L2 and L1 losses:
L2 = (1/N) Σ_i (y_i − y_i*)²;
L1 = (1/N) Σ_i |y_i − y_i*|;
finally, from the above, conversion yields:
r_i(x,y) = log I_i(x,y) − log R_i(x,y);
L_R = (1/N) Σ_{x,y} |r_i(x,y) − r̂_i(x,y)|;
where N represents the number of pixels in the image and r_i(x,y) represents the incident light of an image.
7. The unsupervised monocular depth estimation method based on uncertainty analysis of claim 1, wherein a smoothness prior is incorporated to regularize the estimated depth map, the edge-aware smoothness loss used being expressed by the following formula:
L_S = (1/N) Σ_p ( |∂_x d_p| e^{−|∂_x I_p|} + |∂_y d_p| e^{−|∂_y I_p|} );
where d_p and I_p denote the depth and image values at pixel p.
CN202111185472.5A 2021-10-12 2021-10-12 Unsupervised monocular depth estimation method based on uncertain analysis Pending CN114549297A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111185472.5A CN114549297A (en) 2021-10-12 2021-10-12 Unsupervised monocular depth estimation method based on uncertain analysis

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111185472.5A CN114549297A (en) 2021-10-12 2021-10-12 Unsupervised monocular depth estimation method based on uncertain analysis

Publications (1)

Publication Number Publication Date
CN114549297A true CN114549297A (en) 2022-05-27

Family

ID=81668506

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111185472.5A Pending CN114549297A (en) 2021-10-12 2021-10-12 Unsupervised monocular depth estimation method based on uncertain analysis

Country Status (1)

Country Link
CN (1) CN114549297A (en)


Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114782782A (en) * 2022-06-20 2022-07-22 武汉大学 Uncertainty quantification method for learning performance of monocular depth estimation model
CN114782782B (en) * 2022-06-20 2022-10-04 武汉大学 Uncertainty quantification method for learning performance of monocular depth estimation model
CN114820755A (en) * 2022-06-24 2022-07-29 武汉图科智能科技有限公司 Depth map estimation method and system


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination