CN111325782A - Unsupervised monocular view depth estimation method based on multi-scale unification - Google Patents

Unsupervised monocular view depth estimation method based on multi-scale unification

Info

Publication number
CN111325782A
Authority
CN
China
Prior art keywords
image
input
loss
network
original
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010099283.5A
Other languages
Chinese (zh)
Inventor
丁萌
姜欣言
曹云峰
李旭
张振振
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Aeronautics and Astronautics
Original Assignee
Nanjing University of Aeronautics and Astronautics
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Aeronautics and Astronautics filed Critical Nanjing University of Aeronautics and Astronautics
Priority to CN202010099283.5A priority Critical patent/CN111325782A/en
Publication of CN111325782A publication Critical patent/CN111325782A/en
Pending legal-status Critical Current

Classifications

    • G06T7/50 Image analysis: depth or shape recovery
    • G06N3/045 Neural networks: combinations of networks
    • G06N3/08 Neural networks: learning methods
    • G06T2207/10004 Image acquisition modality: still image; photographic image
    • G06T2207/10012 Image acquisition modality: stereo images
    • G06T2207/20016 Hierarchical, coarse-to-fine, multiscale or multiresolution image processing; pyramid transform
    • G06T2207/20228 Disparity calculation for image-based rendering

Abstract

The invention belongs to the technical field of image processing and discloses an unsupervised monocular view depth estimation method based on multi-scale unification, which comprises the following steps. S1: pyramid multi-scale processing is performed on the input stereo image pair. S2: an encoding-decoding network framework is constructed. S3: the features extracted in the encoding stage are passed to a deconvolution neural network, so that features are extracted from input images of different scales. S4: the disparity maps of the different scales are uniformly up-sampled to the original input size. S5: images are reconstructed from the original input images and the corresponding disparity maps. S6: the accuracy of the image reconstruction is constrained. S7: the network model is trained by gradient descent. S8: the corresponding disparity map is fitted from the input image and the pre-trained model. The design of the invention needs no real depth data to supervise network training; easily obtained binocular images serve as training samples, which greatly reduces the difficulty of acquiring training data and solves the problem of depth-map holes caused by blurred low-scale disparity maps.

Description

Unsupervised monocular view depth estimation method based on multi-scale unification
Technical Field
The invention relates to the technical field of image processing, in particular to an unsupervised monocular view depth estimation method based on multi-scale unification.
Background
With the development of science and technology and the explosive growth of information, attention to image scenes is gradually shifting from two dimensions to three dimensions, and three-dimensional information about objects brings great convenience to daily life; it is most widely applied in driver-assistance systems for driving scenes. Because images contain rich information, visual sensors cover almost all the relevant information required for driving, including but not limited to lane geometry, traffic signs, traffic lights, and object positions and speeds. Among all forms of visual information, depth plays a particularly important role in a driver-assistance system. For example, a collision-avoidance system issues collision warnings by computing the depth between an obstacle and the vehicle, and when the distance between a pedestrian and the vehicle becomes too small, a pedestrian-protection system automatically takes measures to decelerate the vehicle. Therefore, only by acquiring the depth between the current vehicle and the other traffic participants in the driving scene can the driver-assistance system accurately establish its relationship with the external environment, so that the early-warning subsystems can work normally.
Many sensors that can obtain depth information are currently on the market, such as lidar. Lidar can generate sparse three-dimensional point-cloud data, but it has the disadvantages of high cost and limited usable scenarios, so attention has turned to recovering the three-dimensional structure of a scene from images.
Traditional image-based depth estimation methods mostly rely on geometric constraints and hand-crafted features based on assumptions about the shooting environment; a widely applied example is recovering structure from motion.
As convolutional neural networks have excelled at other visual tasks, many researchers have begun exploring deep-learning methods for monocular image depth estimation. Various models have been designed to exploit the strong learning capacity of neural networks and fully mine the relationship between an original image and its depth map, so that a network can be trained to predict scene depth from an input image. However, as mentioned above, the true depth of a scene is very difficult to obtain, which means the training must move away from true depth labels and adopt an unsupervised method to complete the depth estimation task. One class of unsupervised methods uses the temporal information of monocular video as the supervision signal. However, because the video is acquired while the camera itself is moving and the relative pose of the camera between image frames is unknown, such methods need to train an additional pose estimation network alongside the depth estimation network, which undoubtedly increases the difficulty of an already complex task. In addition, because of the scale ambiguity of monocular video, such methods can only obtain relative depth, i.e., the relative distances between pixels in the image, and cannot obtain the actual distance from an object in the image to the camera. Furthermore, existing unsupervised depth estimation methods suffer from lost texture or even holes in the depth map caused by blurred details in low-scale feature maps, which directly affects the accuracy of depth estimation.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides an unsupervised monocular view depth estimation method based on multi-scale unification.
In order to achieve the purpose, the invention adopts the following technical scheme:
an unsupervised monocular view depth estimation method based on multi-scale unification comprises the following steps:
step S1: carrying out pyramid multi-scale processing on the input stereo image pair so as to extract features of multiple scales;
step S2: constructing a network framework of coding and decoding to obtain a disparity map which can be used for obtaining a depth map;
step S3: the features extracted in the encoding stage are passed to a deconvolution neural network to realize feature extraction for the input images of different scales, and the disparity maps of the input images of different scales are fitted in the decoding stage;
step S4: uniformly up-sampling disparity maps of different scales to an original input size;
step S5: reconstructing images by using the original input stereo images and the corresponding disparity maps;
step S6: the accuracy of image reconstruction is constrained through appearance matching loss, left-right parallax conversion loss and parallax smoothing loss;
step S7: training the network model by gradient descent, following the idea of minimizing the loss;
step S8: in the testing stage, fitting the corresponding disparity map according to the input image and the pre-trained model; and calculating the corresponding scene depth map from the disparity map using the binocular-imaging triangulation principle.
Preferably, in step S1, the input image is scaled to four sizes (1, 1/2, 1/4 and 1/8 of the original image) to form a pyramid input structure, which is then sent to the coding model for feature extraction.
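For illustration only (not part of the original patent text), a minimal sketch of this pyramid construction is given below, assuming an OpenCV/NumPy environment; the function and variable names are hypothetical:

```python
# Illustrative sketch of step S1: build a 4-level input pyramid at
# 1, 1/2, 1/4 and 1/8 of the original size. Assumes OpenCV is available.
import cv2

def build_pyramid(image, num_scales=4):
    """Return a list [full, 1/2, 1/4, 1/8] of resized copies of `image` (H x W x C)."""
    pyramid = [image]
    height, width = image.shape[:2]
    for s in range(1, num_scales):
        factor = 2 ** s
        resized = cv2.resize(image, (width // factor, height // factor),
                             interpolation=cv2.INTER_AREA)
        pyramid.append(resized)
    return pyramid
```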
Preferably, in step S2, a ResNet-101 network structure is used as a network model in the encoding stage, and the ResNet network structure adopts residual design, so that information loss is reduced while the network deepens.
Preferably, in step S3, in the encoding stage, features are extracted from the input images of different scales, and the extracted features are passed to the deconvolution neural network in the decoding stage to fit the disparity maps, specifically:
step S41: in the encoding stage, features are extracted from each image of the pyramid-structured input by the ResNet-101 network; relative to its own input size, each image is reduced to 1/16 during extraction, yielding features at 1/16, 1/32, 1/64 and 1/128 of the original input size;
step S42: the features of the four sizes obtained in the encoding stage are fed into the decoding-stage network, in which the input features are deconvolved layer by layer to restore the pyramid structure at 1, 1/2, 1/4 and 1/8 of the original input size, and the disparity maps of the images at the 4 sizes are fitted from the input features and the deconvolution network;
preferably, in the step S4, the disparity maps with the sizes of 1, 1/2, 1/4 and 1/8 of the original input image are collectively up-sampled to the size of the original input image.
Preferably, in step S5, since the disparity maps of the 4 sizes are uniformly up-sampled to the original input size, the originally input left image $I^l$ and the right disparity map $d^r$ are used to reconstruct a right image $\tilde{I}^r$, and the original right image $I^r$ and the left disparity map $d^l$ are used to reconstruct a left image $\tilde{I}^l$.
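The reconstruction in step S5 amounts to resampling one view horizontally according to the disparity of the other view. The following NumPy sketch is illustrative only; it is not the patent's implementation, and the sign convention and pixel-unit disparity are assumptions:

```python
# Illustrative sketch of step S5: reconstruct one view by sampling the other
# view at horizontally shifted positions given by a disparity map (in pixels).
import numpy as np

def warp_with_disparity(source, disparity, sign=1):
    """source: H x W x C image; disparity: H x W map.
    For each pixel (row, x), samples `source` at (row, x + sign * disparity)."""
    h, w = disparity.shape
    xs = np.arange(w, dtype=np.float32)
    warped = np.zeros_like(source, dtype=np.float32)
    for row in range(h):
        sample_x = np.clip(xs + sign * disparity[row], 0.0, w - 1.0)
        x0 = np.floor(sample_x).astype(int)
        x1 = np.clip(x0 + 1, 0, w - 1)
        frac = (sample_x - x0)[:, None]
        warped[row] = (1.0 - frac) * source[row, x0] + frac * source[row, x1]
    return warped

# e.g. reconstructed_right = warp_with_disparity(left_image, right_disparity)
```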
Preferably, in step S6, the accuracy of image reconstruction is constrained by computing losses between the original input left and right views and the reconstructed left and right views;
the loss function is minimized by gradient descent, and the image reconstruction network is trained in this way, specifically:
step S71: the loss function is composed of three parts, namely the appearance matching loss $C_a$, the smoothing loss $C_s$ and the disparity conversion loss $C_t$; for each loss term, the left and right images are treated in the same way, and the final loss function is composed of the three terms:
$$C = \alpha_a C_a + \alpha_s C_s + \alpha_t C_t$$
step S72: the losses are computed at the original input size between each of the different disparity maps and the original input image, giving 4 losses $C_i$, $i = 1, 2, 3, 4$; the total loss function is
$$C_{total} = \sum_{i=1}^{4} C_i$$
Preferably, in step S7, the network model is trained by gradient descent, following the concept of minimizing the loss.
Preferably, in step S8, in the testing stage, the input single image and the pre-trained model are used to fit the disparity map corresponding to the input image, and, according to the binocular-imaging triangulation principle, a corresponding depth image is generated from the disparity map, specifically:
$$D(i,j) = \frac{b \cdot f}{d(i,j)}$$
where $(i, j)$ are the pixel coordinates of any point in the image, $D(i, j)$ is the depth value at that point, $d(i, j)$ is the disparity value at that point, $b$ is the known distance between the two cameras, and $f$ is the known focal length of the camera.
In the unsupervised monocular view depth estimation method based on multi-scale unification according to the invention, a common deep-learning approach to depth estimation would require the real depth map corresponding to each input image, but real depth data are expensive to obtain, usually yield only sparse point-cloud depth, and cannot fully meet application requirements. Under these conditions, the training process of the model is supervised by the image reconstruction loss, and binocular images, which are comparatively easy to acquire, are used for training instead of real depth, thereby realizing unsupervised depth estimation;
in the unsupervised monocular view depth estimation method based on multi-scale unification provided by the invention, pyramid multi-scale processing of the input stereo image pair in the encoding stage reduces the influence of targets of different sizes on the depth estimation;
in the unsupervised monocular view depth estimation method based on multi-scale unification, to counter the blurriness of low-scale depth maps, all disparity maps are uniformly up-sampled to the original input size, and image reconstruction and loss computation are carried out at that size, which solves the problem of holes in the depth map;
the method is reasonable in design: real depth data are not needed to supervise network training, and easily obtained binocular images are used as training samples, which greatly reduces the difficulty of acquiring training data while solving the problem of depth-map holes caused by blurred low-scale disparity maps.
Drawings
FIG. 1 is a flowchart of an unsupervised monocular view depth estimation method based on multi-scale unification according to the present invention;
FIG. 2 is a network model structure diagram of an unsupervised monocular view depth estimation method based on multi-scale unification according to the present invention;
FIG. 3 is a schematic diagram of a bottleneck module of a network structure of an unsupervised monocular view depth estimation method based on multi-scale unification according to the present invention;
FIG. 4 is a unified scale diagram of an unsupervised monocular view depth estimation method based on multi-scale unification according to the present invention;
fig. 5 is an estimation result graph of the unsupervised monocular view depth estimation method based on multi-scale unification on the classic driving data set KITTI, (a) is an input image, and (b) is a depth estimation result graph;
fig. 6 is a generalized effect diagram of a road scene real-time picture taken by an unsupervised monocular view depth estimation method based on multi-scale unification, where (a) is an input image and (b) is a depth estimation result diagram.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments.
Referring to figs. 1 to 6, in this unsupervised monocular depth estimation method based on multi-scale unification, the unsupervised monocular depth estimation network model is trained on a laboratory desktop workstation: the graphics card is an NVIDIA GeForce GTX 1080Ti, the training system is Ubuntu 14.04, and TensorFlow 1.4.0 is adopted as the framework-building platform; training is performed on a classic driving data set, the KITTI 2015 stereo data set.
As shown in fig. 1, the unsupervised monocular view depth estimation method based on multi-scale unification of the present invention specifically includes the following steps:
step S1: the binocular data set of the classic driving benchmark KITTI is adopted as the training set; the scale parameter is set to 4, the images are down-sampled to 1/2, 1/4 and 1/8 of the input size, these are combined with the original-size inputs to form a 4-level pyramid structure, and the pyramid is then sent to the ResNet-101 neural network model for feature extraction;
step S2: constructing a network framework of coding and decoding to obtain a disparity map which can be used for obtaining a depth map; the specific process is as follows:
the residual structure in the ResNet network is shown in figure 3(a), firstly convolution of 1 × 1 is used for reducing characteristic dimensionality, and then convolution recovery of 1 × 1 is carried out, so that the parameter quantity is as follows:
1×1×256×64+3×3×64×64+1×1×64×256=69632
while a normal ResNet module is shown in fig. 3(b), the parameters are:
3×3×256×256×2=1179648
therefore, using the residual module with the bottleneck structure greatly reduces the number of parameters;
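For illustration, a bottleneck residual block with the structure counted above might be written as follows; this is a sketch in tf.keras style, not code from the patent, and the channel sizes simply follow the example parameter count:

```python
# Illustrative bottleneck residual block: 1x1 reduce -> 3x3 -> 1x1 restore,
# plus an identity shortcut (batch normalization omitted for brevity).
from tensorflow.keras import layers

def bottleneck_block(x, mid_channels=64, out_channels=256):
    shortcut = x
    y = layers.Conv2D(mid_channels, 1, padding='same', activation='relu')(x)  # 1x1 reduce
    y = layers.Conv2D(mid_channels, 3, padding='same', activation='relu')(y)  # 3x3
    y = layers.Conv2D(out_channels, 1, padding='same')(y)                     # 1x1 restore
    y = layers.Add()([shortcut, y])
    return layers.Activation('relu')(y)

# Ignoring biases, the weight count is 1*1*256*64 + 3*3*64*64 + 1*1*64*256 = 69632,
# matching the figure given above.
```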
step S3: the features extracted in the encoding stage are passed to a deconvolution neural network to realize feature extraction for the input images of different scales, and the disparity maps of the input images of different scales are fitted in the decoding stage, which specifically comprises the following steps:
step S31: in the network decoding process, in order to ensure that the sizes of the feature maps in the deconvolution neural network correspond to the sizes of the ResNet-101 residual-network feature maps, the network uses skip connections to pass part of the feature maps from the ResNet-101 encoding process directly to the deconvolution neural network;
step S32: in the encoding stage, features are extracted from each image of the pyramid-structured input by the ResNet-101 network; relative to its own input size, each image is reduced to 1/16 during extraction, yielding features at 1/16, 1/32, 1/64 and 1/128 of the original input size;
step S33: the features of the four sizes obtained in the encoding stage are fed into the decoding-stage network, in which the input features are deconvolved layer by layer to restore the pyramid structure at 1, 1/2, 1/4 and 1/8 of the original input size, and the approximate disparity maps of the images at the 4 sizes are fitted from the input features and the deconvolution network;
step S4: uniformly up-sampling disparity maps with the sizes of 1, 1/2, 1/4 and 1/8 of the original input image to the size of the original input image;
step S5: images are reconstructed by using the original input images and the corresponding disparity maps: the right view is reconstructed from the right disparity map and the corresponding original left view, the left view is reconstructed from the original right view and the left disparity map, and finally the reconstructed left and right images are compared with the input left and right originals respectively;
step S6: then, the accuracy of image synthesis is constrained by using appearance matching loss, left-right parallax conversion loss and parallax smoothing loss; the method specifically comprises the following steps:
step S61: the loss function is composed of three parts, namely the appearance matching loss $C_a$, the smoothing loss $C_s$ and the disparity conversion loss $C_t$.
In the image reconstruction process, the appearance matching loss $C_a$ is first used to measure the pixel-by-pixel accuracy between the reconstructed image and the corresponding input image; this loss is composed jointly of a structural similarity measure and an $L_1$ loss. Taking the input left image as an example:
$$C_a^l = \frac{1}{N}\sum_{i,j}\left[\alpha\,\frac{1 - S\left(I^l_{ij}, \tilde{I}^l_{ij}\right)}{2} + (1-\alpha)\left|I^l_{ij} - \tilde{I}^l_{ij}\right|\right]$$
where $S$ is the structural similarity index, composed of a luminance measure, a contrast measure and a structure comparison, which measures the similarity between two images (the more similar the two images, the higher the index value); the $L_1$ loss is the minimum-absolute-error loss, which compares two images pixel by pixel and, relative to the $L_2$ distance, is less sensitive to outliers; $\alpha$ is the weight coefficient of the structural similarity term within the appearance matching loss, and $N$ is the total number of pixels in the image.
Second, the smoothing loss $C_s$ alleviates discontinuities in the disparity map caused by excessively large local gradients and ensures the smoothness of the produced disparity map; taking the left image as an example, the specific formula is:
$$C_s^l = \frac{1}{N}\sum_{i,j}\left(\left|\partial_x d^l_{ij}\right| e^{-\left\|\partial_x I^l_{ij}\right\|} + \left|\partial_y d^l_{ij}\right| e^{-\left\|\partial_y I^l_{ij}\right\|}\right)$$
The disparity conversion loss $C_t$ aims to reduce the conversion error between the right disparity map generated from the left image and the left disparity map generated from the right image, ensuring consistency between the two disparity maps; taking the left image as an example, the specific formula is:
$$C_t^l = \frac{1}{N}\sum_{i,j}\left|d^l_{ij} - d^r_{ij + d^l_{ij}}\right|$$
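For illustration, a simple left-right consistency check of this kind could look as follows; this NumPy sketch uses nearest-neighbour resampling for brevity, the sign convention is an assumption, and it is not the patent's implementation:

```python
# Illustrative left-right disparity consistency term: compare the left disparity
# map with the right disparity map resampled into the left view.
import numpy as np

def lr_consistency_loss(disp_left, disp_right):
    h, w = disp_left.shape
    xs = np.arange(w)
    total = 0.0
    for row in range(h):
        sample_x = np.clip(np.round(xs - disp_left[row]).astype(int), 0, w - 1)
        total += np.mean(np.abs(disp_left[row] - disp_right[row, sample_x]))
    return total / h
```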
for each term loss, the left and right graphs are computed in the same way, and the final loss function is composed of three terms:
$$C = \alpha_a C_a + \alpha_s C_s + \alpha_t C_t$$
where $\alpha_a$ is the weight of the appearance matching loss in the total loss, $\alpha_s$ is the weight of the smoothing loss in the total loss, and $\alpha_t$ is the weight of the conversion loss in the total loss;
step S62: the losses are computed at the original input size between each of the different disparity maps and the original input image, giving 4 losses $C_i$, $i = 1, 2, 3, 4$; the total loss function is
$$C_{total} = \sum_{i=1}^{4} C_i$$
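Combining the three terms per scale and summing over the four scales could be sketched as below; the weight values are placeholders, since the patent's Table 1 is reproduced only as an image:

```python
# Illustrative aggregation of the loss terms over the four scales (steps S71-S72).
def per_scale_loss(appearance, smooth, lr_consistency,
                   alpha_a=1.0, alpha_s=0.1, alpha_t=1.0):
    """Weighted sum C = alpha_a*C_a + alpha_s*C_s + alpha_t*C_t for one scale."""
    return alpha_a * appearance + alpha_s * smooth + alpha_t * lr_consistency

def total_loss(per_scale_losses):
    """Sum of the per-scale losses C_1..C_4, all computed at the original input size."""
    return sum(per_scale_losses)
```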
Step S7: the network model is trained by gradient descent, following the idea of minimizing the loss, specifically: for the training on stereo image pairs, the depth estimation model is built on the open-source TensorFlow 1.4.0 platform, and the KITTI data set containing stereo image pairs is used as the training set, of which 29000 pairs are used to train the model. During training, an initial learning rate lr is set; after 40 epochs, the learning rate is halved every 10 epochs, and a total of 70 epochs are trained. The batch size is set to bs, i.e. bs pictures are processed at a time. An Adam optimizer is used to optimize the model, with β1 and β2 set to control the decay rates of the moving averages of the weight coefficients. All training is completed within 34 hours on the GTX 1080Ti experimental platform;
TABLE 1 Loss function and training parameters (reproduced only as an image in the original publication)
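An illustrative sketch of the optimizer and learning-rate schedule described in step S7 follows (TensorFlow 1.x style; the initial learning rate and the Adam β values are placeholders, since the patent leaves lr, bs, β1 and β2 symbolic):

```python
# Illustrative training setup: Adam optimizer with the rate halved every
# 10 epochs after epoch 40, for 70 epochs in total.
import tensorflow as tf

def learning_rate_for_epoch(epoch, initial_lr=1e-4):
    if epoch < 40:
        return initial_lr
    return initial_lr * 0.5 ** ((epoch - 40) // 10 + 1)

learning_rate = tf.placeholder(tf.float32, shape=[])
optimizer = tf.train.AdamOptimizer(learning_rate, beta1=0.9, beta2=0.999)
# train_op = optimizer.minimize(total_loss_tensor)  # total_loss_tensor is hypothetical
```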
Step S8: in the testing stage, the corresponding disparity map is fitted from the input image and the pre-trained model, and the corresponding scene depth map is calculated from the disparity map using the binocular-imaging triangulation principle. In the KITTI road-driving data set used in this experiment, the camera baseline distance is fixed at 0.54 m, while the camera focal length varies with the camera model; different camera models correspond to different image sizes in the KITTI data set, with the correspondence as follows:
(Correspondence table between KITTI image sizes and camera focal lengths; reproduced only as an image in the original publication.)
the conversion formula of depth and parallax is specifically:
$$D(i,j) = \frac{b \cdot f}{d(i,j)}$$
where $(i, j)$ are the pixel coordinates of any point in the image, $D(i, j)$ is the depth value at that point, and $d(i, j)$ is the disparity value at that point;
therefore, according to the input image and the network model pre-trained with the binocular image reconstruction principle, the disparity map corresponding to the input image is fitted, and the scene depth map corresponding to the input image captured by the camera can then be calculated from the known camera focal length and baseline distance.
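For illustration, the disparity-to-depth conversion at test time could be implemented as below (NumPy sketch; the 0.54 m baseline is the KITTI value stated above, while the focal-length value is a placeholder because the size-to-focal-length table is not reproduced here):

```python
# Illustrative conversion from a fitted disparity map to a metric depth map
# using D = b * f / d; a small epsilon guards against division by zero.
import numpy as np

def disparity_to_depth(disparity, baseline_m=0.54, focal_px=721.0, eps=1e-6):
    """disparity: H x W map in pixels; returns depth in metres."""
    return baseline_m * focal_px / np.maximum(disparity, eps)
```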
The standard parts used in the invention are commercially available, the special-shaped parts can be customized according to the description and the drawings, the specific connections between the parts use conventional means mature in the prior art such as bolts, rivets and welding, the machines, parts and equipment use conventional models of the prior art, and the circuit connections adopt the conventional connection modes of the prior art, so detailed descriptions are omitted.

Claims (9)

1. An unsupervised monocular view depth estimation method based on multi-scale unification is characterized by comprising the following steps:
step S1: carrying out pyramid multi-scale processing on the input stereo image pair so as to extract features of multiple scales;
step S2: constructing a network framework of coding and decoding to obtain a disparity map which can be used for calculating a depth map;
step S3: the features extracted in the encoding stage are passed to a deconvolution neural network to realize feature extraction for the input images of different scales, and the disparity maps of the input images of different scales are fitted in the decoding stage;
step S4: uniformly up-sampling disparity maps of different scales to an original input size;
step S5: reconstructing an image by using the input original image and a corresponding disparity map;
step S6: the accuracy of image reconstruction is constrained through appearance matching loss, left-right parallax conversion loss and parallax smoothing loss;
step S7: training the network model by gradient descent, following the idea of minimizing the loss;
step S8: in the testing stage, fitting the corresponding disparity map according to the input image and the pre-trained model; and calculating the corresponding scene depth map from the disparity map using the binocular-imaging triangulation principle.
2. The method as claimed in claim 1, wherein in step S1, the input image is down-sampled to four sizes of 1, 1/2, 1/4, 1/8 of the original image to form a pyramid input structure, and then sent to the coding model for feature extraction.
3. The unsupervised monocular view depth estimation method based on multi-scale unification as claimed in claim 1, wherein in step S2, a ResNet-101 network structure is adopted as a network model in an encoding stage.
4. The unsupervised monocular view depth estimation method based on multi-scale unification as claimed in claim 1, wherein in step S3, feature extraction is performed on input images of different scales in an encoding stage, and the extracted features are transmitted to a deconvolution neural network in a decoding stage to implement disparity map fitting, specifically:
step S41: in the encoding stage, features are extracted from each image of the pyramid-structured input by the ResNet-101 network; relative to its own input size, each image is reduced to 1/16 during extraction, yielding features at 1/16, 1/32, 1/64 and 1/128 of the original input size;
step S42: the features of the four sizes obtained in the encoding stage are fed into the decoding-stage network, in which the input features are deconvolved layer by layer to restore the pyramid structure at 1, 1/2, 1/4 and 1/8 of the original input size, and the disparity maps of the images at the 4 sizes are fitted from the input features and the deconvolution network.
5. The unsupervised monocular view depth estimation method based on multi-scale unification as claimed in claim 1, wherein in the step S4, the disparity maps with the size of 1, 1/2, 1/4, 1/8 of the original input image are unified up-sampled to the size of the original input image.
6. The unsupervised monocular view depth estimation method based on multi-scale unification as claimed in claim 1, wherein in step S5, since the disparity maps of the 4 sizes are uniformly up-sampled to the original input size, the originally input left image $I^l$ and the right disparity map $d^r$ are used to reconstruct a right image $\tilde{I}^r$, and the original right image $I^r$ and the left disparity map $d^l$ are used to reconstruct a left image $\tilde{I}^l$.
7. The unsupervised monocular view depth estimation method based on multi-scale unification as claimed in claim 1, wherein in step S6, accuracy of image reconstruction is constrained by calculating loss using the original input left and right views and the reconstructed left and right views;
minimizing a loss function by adopting a gradient descent method, and training an image reconstruction network by adopting the method, specifically:
step S71: the loss function is composed of three parts, namely the appearance matching loss $C_a$, the smoothing loss $C_s$ and the disparity conversion loss $C_t$; for each loss term, the left and right images are treated in the same way, and the final loss function is composed of the three terms:
$$C = \alpha_a C_a + \alpha_s C_s + \alpha_t C_t$$
step S72: the losses are computed at the original input size between each of the different disparity maps and the original input image, giving 4 losses $C_i$, $i = 1, 2, 3, 4$; the total loss function is
$$C_{total} = \sum_{i=1}^{4} C_i$$
8. The unsupervised monocular view depth estimation method based on multi-scale unification as claimed in claim 1, wherein in step S7, a network model is trained by using a gradient descent method using an idea of minimizing loss.
9. The unsupervised monocular view depth estimation method based on multi-scale unification as claimed in claim 1, wherein in the step S8, in the testing stage, an input single image and a pre-trained model are used to fit a disparity map corresponding to the input image, and according to a principle of triangulation of binocular imaging, the disparity map is used to generate a corresponding depth image, specifically:
$$D(i,j) = \frac{b \cdot f}{d(i,j)}$$
where $(i, j)$ are the pixel coordinates of any point in the image, $D(i, j)$ is the depth value at that point, $d(i, j)$ is the disparity value at that point, $b$ is the known distance between the two cameras, and $f$ is the known focal length of the camera.
CN202010099283.5A 2020-02-18 2020-02-18 Unsupervised monocular view depth estimation method based on multi-scale unification Pending CN111325782A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010099283.5A CN111325782A (en) 2020-02-18 2020-02-18 Unsupervised monocular view depth estimation method based on multi-scale unification

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010099283.5A CN111325782A (en) 2020-02-18 2020-02-18 Unsupervised monocular view depth estimation method based on multi-scale unification

Publications (1)

Publication Number Publication Date
CN111325782A true CN111325782A (en) 2020-06-23

Family

ID=71172765

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010099283.5A Pending CN111325782A (en) 2020-02-18 2020-02-18 Unsupervised monocular view depth estimation method based on multi-scale unification

Country Status (1)

Country Link
CN (1) CN111325782A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111915660A (en) * 2020-06-28 2020-11-10 华南理工大学 Binocular disparity matching method and system based on shared features and attention up-sampling
CN112396645A (en) * 2020-11-06 2021-02-23 华中科技大学 Monocular image depth estimation method and system based on convolution residual learning
CN112700532A (en) * 2020-12-21 2021-04-23 杭州反重力智能科技有限公司 Neural network training method and system for three-dimensional reconstruction
CN113139999A (en) * 2021-05-14 2021-07-20 广东工业大学 Transparent object single-view multi-scale depth estimation method and system
CN113313732A (en) * 2021-06-25 2021-08-27 南京航空航天大学 Forward-looking scene depth estimation method based on self-supervision learning
CN114283089A (en) * 2021-12-24 2022-04-05 北京的卢深视科技有限公司 Jump acceleration based depth recovery method, electronic device, and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110163246A (en) * 2019-04-08 2019-08-23 杭州电子科技大学 The unsupervised depth estimation method of monocular light field image based on convolutional neural networks
CN110443843A (en) * 2019-07-29 2019-11-12 东北大学 A kind of unsupervised monocular depth estimation method based on generation confrontation network
CN110490919A (en) * 2019-07-05 2019-11-22 天津大学 A kind of depth estimation method of the monocular vision based on deep neural network

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110163246A (en) * 2019-04-08 2019-08-23 杭州电子科技大学 The unsupervised depth estimation method of monocular light field image based on convolutional neural networks
CN110490919A (en) * 2019-07-05 2019-11-22 天津大学 A kind of depth estimation method of the monocular vision based on deep neural network
CN110443843A (en) * 2019-07-29 2019-11-12 东北大学 A kind of unsupervised monocular depth estimation method based on generation confrontation network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
王欣盛 et al.: "Monocular Depth Estimation Based on Convolutional Neural Networks" *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111915660A (en) * 2020-06-28 2020-11-10 华南理工大学 Binocular disparity matching method and system based on shared features and attention up-sampling
CN111915660B (en) * 2020-06-28 2023-01-06 华南理工大学 Binocular disparity matching method and system based on shared features and attention up-sampling
CN112396645A (en) * 2020-11-06 2021-02-23 华中科技大学 Monocular image depth estimation method and system based on convolution residual learning
CN112396645B (en) * 2020-11-06 2022-05-31 华中科技大学 Monocular image depth estimation method and system based on convolution residual learning
CN112700532A (en) * 2020-12-21 2021-04-23 杭州反重力智能科技有限公司 Neural network training method and system for three-dimensional reconstruction
CN112700532B (en) * 2020-12-21 2021-11-16 杭州反重力智能科技有限公司 Neural network training method and system for three-dimensional reconstruction
CN113139999A (en) * 2021-05-14 2021-07-20 广东工业大学 Transparent object single-view multi-scale depth estimation method and system
CN113313732A (en) * 2021-06-25 2021-08-27 南京航空航天大学 Forward-looking scene depth estimation method based on self-supervision learning
CN114283089A (en) * 2021-12-24 2022-04-05 北京的卢深视科技有限公司 Jump acceleration based depth recovery method, electronic device, and storage medium

Similar Documents

Publication Publication Date Title
CN111325782A (en) Unsupervised monocular view depth estimation method based on multi-scale unification
CN109685842B (en) Sparse depth densification method based on multi-scale network
US20210142095A1 (en) Image disparity estimation
CN113936139B (en) Scene aerial view reconstruction method and system combining visual depth information and semantic segmentation
CN108062769B (en) Rapid depth recovery method for three-dimensional reconstruction
CN110689562A (en) Trajectory loop detection optimization method based on generation of countermeasure network
AU2021103300A4 (en) Unsupervised Monocular Depth Estimation Method Based On Multi- Scale Unification
CN110517306B (en) Binocular depth vision estimation method and system based on deep learning
CN112991413A (en) Self-supervision depth estimation method and system
US20220051425A1 (en) Scale-aware monocular localization and mapping
CN110009675B (en) Method, apparatus, medium, and device for generating disparity map
CN113313732A (en) Forward-looking scene depth estimation method based on self-supervision learning
CN114119889B (en) Cross-modal fusion-based 360-degree environmental depth completion and map reconstruction method
CN110942484A (en) Camera self-motion estimation method based on occlusion perception and feature pyramid matching
CN111027415A (en) Vehicle detection method based on polarization image
CN117593702B (en) Remote monitoring method, device, equipment and storage medium
CN117058474B (en) Depth estimation method and system based on multi-sensor fusion
CN110889868A (en) Monocular image depth estimation method combining gradient and texture features
CN116342675B (en) Real-time monocular depth estimation method, system, electronic equipment and storage medium
CN115187959B (en) Method and system for landing flying vehicle in mountainous region based on binocular vision
CN112927139B (en) Binocular thermal imaging system and super-resolution image acquisition method
CN113706599B (en) Binocular depth estimation method based on pseudo label fusion
Mathew et al. Monocular depth estimation with SPN loss
CN115249269A (en) Object detection method, computer program product, storage medium, and electronic device
CN116883770A (en) Training method and device of depth estimation model, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200623

RJ01 Rejection of invention patent application after publication