AU2021103300A4 - Unsupervised Monocular Depth Estimation Method Based On Multi-Scale Unification - Google Patents
- Publication number
- AU2021103300A4 (application AU2021103300A)
- Authority
- AU
- Australia
- Prior art keywords
- input
- image
- images
- loss
- disparity
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/50—Depth or shape recovery
- G06T7/55—Depth or shape recovery from multiple images
- G06T7/593—Depth or shape recovery from multiple images from stereo images
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20081—Training; Learning
Landscapes
- Engineering & Computer Science (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Image Analysis (AREA)
- Image Processing (AREA)
ABSTRACT OF THE DISCLOSURE
The present disclosure belongs to the technical field of image processing, and discloses an
unsupervised monocular depth estimation method based on multi-scale unification, including: S1:
performing pyramid multi-scale processing on an input stereo image pair; S2: constructing a network
framework for encoding and decoding; S3: transmitting features extracted in an encoding stage to a
deconvolutional neural network to extract features of input images of different scales; S4: uniformly
upsampling disparity maps of different scales to an original input size; S5: performing image
reconstruction by using the input original images and corresponding disparity maps; S6: constraining
accuracy of the image reconstruction; S7: training a network model by using a gradient descent
method; and S8: fitting corresponding disparity maps based on the input images and the pre-trained
model. The present disclosure does not need to use true depth data to supervise network training, but
uses easily accessible binocular images as training samples. This greatly reduces the difficulty of
network training, and solves a problem of depth map holes caused by blurred low-scale disparity
maps.
DRAWINGS
[FIG. 1: flowchart of the method. Training stage: the network forms a loss from the reconstructed images and is updated by back propagation; testing stage: input an image to the pre-trained model, obtain an accurate disparity map, and calculate a depth map through a formula.]
FIG. 1
Description
[01] The present disclosure relates to the technical field of image processing, and in particular, to an unsupervised monocular depth estimation method based on multi-scale unification.
[02] With the development of science and technology and the explosive growth of information, people are gradually shifting their focus from two-dimensional (2D) image scenes to three-dimensional (3D) image scenes. The 3D information of objects brings great convenience to daily life, and is most widely used in assisted driving systems. Since images contain a wealth of information, a visual sensor covers almost all the information needed for driving, including but not limited to the geometric shapes of lanes, traffic signs, lights, and object positions and speeds. Among all forms of visual information, depth information plays an important role in a driver assistance system. For example, a collision avoidance system issues a collision warning by calculating the depth between an obstacle and the vehicle, and when the distance between a pedestrian and the vehicle becomes too small, a pedestrian protection system automatically takes measures to slow the vehicle down. Therefore, a driving assistance system can accurately interact with the external environment only after acquiring depth information between the current vehicle and other traffic participants in the driving scene, so that its warning subsystems can operate properly.
[03] At present, there are many sensors capable of obtaining depth information on the market, such as SICK's LIDAR. A LIDAR can generate sparse 3D point cloud data, but it is costly and can be used only in limited scenes. Therefore, people are turning to 3D structure information that restores scenes from images.
[04] Traditional image-based depth estimation methods are mostly built on geometric constraints and hand-crafted features that assume a particular photographing environment. The most widely used of these is structure from motion, which has the advantages of low implementation cost, low requirements on the photographing environment, and simple operation. However, it is extremely susceptible to feature extraction and matching errors, and can recover only relatively sparse depth data.
[05] As convolutional neural networks perform prominently in other vision tasks, many researchers have begun to explore the application of deep learning to monocular depth estimation. Building on the powerful learning ability of neural networks, various models have been designed to fully explore the connection between original images and depth maps, so as to learn to predict a scene depth from an input image. However, as mentioned above, it is difficult to obtain the true depth of a scene, which means that the depth estimation task should be completed in an unsupervised way, without true depth labels. One unsupervised approach uses the temporal information of a monocular video as the supervisory signal. However, because the video is collected during motion, the camera itself moves and its relative poses between image frames are unknown, so a pose estimation network must be trained in addition to the depth estimation network, making an already complicated task even more difficult. In addition, due to the scale ambiguity of monocular video, this approach can recover only relative depth, that is, relative distances between pixels in an image, but not the actual distance between an object and the camera. Moreover, in unsupervised depth estimation, the blurred details of low-scale feature maps may cause missing textures or even holes in the depth maps, which directly harms the accuracy of the estimation.
[06] In order to overcome the prior-art shortcomings, the present disclosure proposes an unsupervised monocular depth estimation method based on multi-scale unification.
[07] To achieve the above objective, the present disclosure adopts the following technical solutions:
[08] An unsupervised monocular depth estimation method based on multi-scale unification includes the following steps:
[09] step S1: performing pyramid multi-scale processing on an input stereo image pair, to extract features of multiple scales;
[10] step S2: constructing a network framework for encoding and decoding, and obtaining disparity maps that can be used to obtain a depth map;
[11] step S3: transmitting features extracted in an encoding stage to a deconvolutional neural network to extract features of the input images of different scales, and fitting disparity maps of the input images of different scales in a decoding stage;
[12] step S4: uniformly upsampling the disparity maps of different scales to an original input size;
[13] step S5: performing image reconstruction by using the input original stereo images and corresponding disparity maps;
[14] step S6: constraining accuracy of the image reconstruction through an appearance matching loss, a left-right disparity transform loss, and a disparity smoothness loss;
[15] step S7: based on a loss minimization principle, training a network model by using a gradient descent method; and
[16] step S8: in a testing stage, fitting corresponding disparity maps based on the input images and the pre-trained model, and calculating a corresponding scene depth map from the disparity maps based on a triangulation principle of binocular imaging.
[17] Preferably, in step S1, the input image is downsampled to 1, 1/2, 1/4, and 1/8 of an original image size to form a pyramid input structure, and then sent to an encoding model for feature extraction.
[18] Preferably, in step S2, a ResNet-101 network structure is used as a network model for the encoding stage, where the ResNet network structure adopts a residual design to deepen the network and reduce information loss.
[19] Preferably, in step S3, feature extraction is performed on input images of different scales in the encoding stage, and the extracted features are sent to the deconvolutional neural network in the decoding stage to implement disparity map fitting, which specifically includes:
[20] step S41: in the encoding stage, performing feature extraction on the input images of the pyramid structure through a ResNet-101 network, and reducing the input images of different sizes to 1/16 during the extraction process, to obtain features whose sizes are 1/16, 1/32, 1/64, and 1/128 of the original input images; and
[21] step S42: inputting the features of the four sizes obtained in the encoding stage into the network in the decoding stage, deconvolving the input features layer by layer to restore the pyramid structure of the images whose sizes are 1, 1/2, 1/4, and 1/8 of the original input images, and separately fitting disparity maps for the images of the four sizes based on the input features and the deconvolutional network.
[22] Preferably, in step S4, the disparity maps whose sizes are 1, 1/2, 1/4, and 1/8 of the original input images are uniformly upsampled to the original input image size.
[23] Preferably, in step S5, because the disparity maps of the four sizes are uniformly upsampled to the original input size, an original input left image I^l and a right disparity map d^r are used to reconstruct a right image Ĩ^r, and an original input right image I^r and a left disparity map d^l are used to reconstruct a left image Ĩ^l.
[24] Preferably, in step S6, the original input left and right images and the reconstructed left and right images are used to constrain the accuracy of the image reconstruction; and
[251 a loss function is minimized by using the gradient descent method, to train an image reconstruction network, which specifically includes:
[26] step S71: calculating three parts of the loss function: an appearance matching loss C_a, a smoothness loss C_s, and a disparity transform loss C_t, where the calculation method of each loss is the same for the left and right images; and the final loss function is obtained by combining the three losses:
[27] C = α_a(C_a^l + C_a^r) + α_s(C_s^l + C_s^r) + α_t(C_t^l + C_t^r)
[28] step S72: calculating the loss separately for each disparity map of the original input size against the original input image to obtain four losses C_i, where i = 1, 2, 3, 4, and the total loss function is C_total = Σ_{i=1}^{4} C_i.
[29] Preferably, in step S7, based on the loss minimization principle, the network model is trained by using the gradient descent method.
[30] Preferably, in step S8, in the testing stage, the disparity map corresponding to the input image is fitted by using the input single image and the pre-trained model, and the corresponding depth image is generated from the disparity map based on the triangulation principle of binocular imaging, which is specifically:
[31] D(i, j) = (b × f) / d(i, j)
[32] where (i, j) are the pixel-level coordinates of any point in an image, D(i, j) is the depth value of the point, d(i, j) is the disparity value of the point, b is the known distance between the two cameras, and f is the known focal length of the camera.
[33] The common deep learning method requires input of a true depth image corresponding to an image for depth estimation, but it is costly to acquire the true depth data, and only sparse point cloud depths can be obtained, which cannot fully meet application requirements. By contrast, the unsupervised monocular depth estimation method based on multi-scale unification according to the present disclosure supervises a model training process through image reconstruction and loss, and uses easily accessible binocular images in place of the true depth for training, thereby implementing unsupervised depth estimation.
[34] The unsupervised monocular depth estimation method based on multi-scale unification according to the present disclosure reduces impact of targets of different sizes on depth estimation by performing pyramid multi-scale processing on the input stereo images in the encoding stage.
[35] The unsupervised monocular depth estimation method based on multi-scale unification according to the present disclosure uniformly samples all disparity maps to the original input size in the case of blurred low-scale depth maps, and performs image reconstruction and loss calculation under this size, alleviating the problem of holes in the depth maps.
[36] The present disclosure is properly designed, and does not need to use true depth data to supervise network training, but uses easily accessible binocular images as training samples. This greatly reduces the difficulty of network training, and also solves the problem of depth map holes caused by blurred low-scale disparity maps.
[37] FIG. 1 is a flowchart of an unsupervised monocular depth estimation method based on multi-scale unification according to the present disclosure.
[38] FIG. 2 is a structural diagram of a network model for an unsupervised monocular depth estimation method based on multi-scale unification according to the present disclosure.
[39] FIG. 3 is a schematic diagram of a bottleneck module of a network structure for an unsupervised monocular depth estimation method based on multi-scale unification according to the present disclosure.
[40] FIG. 4 is a schematic diagram of scale unification for an unsupervised monocular depth estimation method based on multi-scale unification according to the present disclosure.
[41] FIG. 5 is an estimation result diagram of an unsupervised monocular depth estimation method based on multi-scale unification according to the present disclosure on a classic driving data set KITTI, where (a) are input images, and (b) are depth estimation results.
[42] FIG. 6 is a generalization effect diagram of an unsupervised monocular depth estimation method based on multi-scale unification according to the present disclosure on images captured in real time in a road scene, where (a) are input images, and (b) are depth estimation results.
[43] The technical solutions of the embodiments of the present disclosure are clearly and completely described below with reference to the accompanying drawings. Apparently, the described embodiments are merely a part rather than all of the embodiments of the present disclosure.
[44] Referring to FIG. 1 to FIG. 6, in an unsupervised monocular depth estimation method based on multi-scale unification, the unsupervised monocular depth estimation network model runs on a desktop workstation in the laboratory with an NVIDIA GeForce GTX 1080Ti graphics card, Ubuntu 14.04 as the training system, and TensorFlow 1.4.0 as the framework; training is performed on the classic KITTI 2015 stereo driving data set.
[45] As shown in FIG. 1, an unsupervised monocular depth estimation method based on multi-scale unification specifically includes the following steps.
[46] Step S1: Use the binocular data in the classic KITTI driving data set as a training set, set the scale parameter to 4, downsample each input image to 1/2, 1/4, and 1/8 of its original size, form a pyramid structure from the images of these three sizes together with the original image, and then send the images to a ResNet-101 neural network model for feature extraction.
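The pyramid construction of step S1 can be sketched in a few lines. This is an illustrative pure-Python sketch, not the disclosure's implementation: images are stood in for by nested lists of grayscale values, and the 2×2 average-pooling kernel is an assumption, since the disclosure does not specify the downsampling filter.

```python
def downsample2x(img):
    """Halve height and width by 2x2 average pooling (assumed kernel)."""
    h, w = len(img), len(img[0])
    return [[(img[2 * i][2 * j] + img[2 * i][2 * j + 1] +
              img[2 * i + 1][2 * j] + img[2 * i + 1][2 * j + 1]) / 4.0
             for j in range(w // 2)]
            for i in range(h // 2)]

def build_pyramid(img, num_scales=4):
    """Return the [1, 1/2, 1/4, 1/8] pyramid described in step S1."""
    pyramid = [img]
    for _ in range(num_scales - 1):
        pyramid.append(downsample2x(pyramid[-1]))
    return pyramid

img = [[float(i + j) for j in range(8)] for i in range(8)]
pyr = build_pyramid(img)
print([(len(p), len(p[0])) for p in pyr])  # [(8, 8), (4, 4), (2, 2), (1, 1)]
```

Each level is then fed to the encoder, so targets of different sizes appear at comparable scales somewhere in the pyramid.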
[47] Step S2: Construct a network framework for encoding and decoding, and obtain disparity maps that can be used to obtain a depth map. The specific process is as follows:
[48] The ResNet-101 network structure is used as the network model for the encoding stage, where the ResNet structure adopts a residual design that deepens the network and reduces information loss. The residual structure in the ResNet network is shown in FIG. 3(a). First, a 1×1 convolution reduces the feature dimension from 256 to 64, then a 3×3 convolution processes the reduced features, and finally a 1×1 convolution restores the dimension to 256; the quantity of parameters is:
[49] 1×1×256×64 + 3×3×64×64 + 1×1×64×256 = 69632
[50] A common ResNet module is shown in FIG. 3(b), and a quantity of parameters is:
[51] 3×3×256×256×2 = 1179648
[52] As seen from above, the use of a bottleneck residual module can greatly reduce the quantity of parameters.
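The two parameter counts above are simple arithmetic and can be verified directly; `conv_params` is a hypothetical helper (not from the disclosure), and biases are ignored, as in the figures.

```python
def conv_params(k, c_in, c_out):
    """Weight count of a k x k convolution with c_in -> c_out channels, no bias."""
    return k * k * c_in * c_out

# Bottleneck residual block (FIG. 3(a)): 1x1 reduce 256->64, 3x3 at 64, 1x1 restore 64->256.
bottleneck = conv_params(1, 256, 64) + conv_params(3, 64, 64) + conv_params(1, 64, 256)

# Plain residual block (FIG. 3(b)): two 3x3 convolutions at 256 channels.
plain = 2 * conv_params(3, 256, 256)

print(bottleneck, plain)  # 69632 1179648
```

The bottleneck block uses roughly 17× fewer parameters than the plain block, which is why it is preferred for deep encoders.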
[53] Step S3: Transmit features extracted in the encoding stage to a deconvolutional neural network to extract features of the input images of different scales, and fit disparity maps of the input images of different scales in a decoding stage, which specifically includes the following steps:
[54] step S31: in a network decoding process, to ensure that sizes of feature maps in the deconvolutional neural network correspond to sizes of feature maps of the ResNet-101 residual network, directly connecting some of the feature maps in the ResNet-101 encoding process to the deconvolutional neural network through skip connection;
[55] step S32: in the encoding stage, performing feature extraction on the input images of the pyramid structure through the ResNet-101 network, and reducing the input images of different sizes to 1/16 during the extraction process, to obtain features whose sizes are 1/16, 1/32, 1/64, and 1/128 of the original input images; and
[56] step S33: inputting the features of the four sizes obtained in the encoding stage into the network in the decoding stage, deconvolving the input features layer by layer to restore the pyramid structure of the images whose sizes are 1, 1/2, 1/4, and 1/8 of the original input images, and separately fitting approximate disparity maps for the images of the four sizes based on the input features and the deconvolutional network.
[57] Step S4: Uniformly upsample disparity maps whose sizes are 1, 1/2, 1/4, and 1/8 of the original input image to the size of the original input image.
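The unification of step S4 can be sketched as follows. The disclosure only says the maps are "uniformly upsampled" to the input size; nearest-neighbour interpolation is an assumption of this sketch, as is the rescaling of disparity values by the width ratio (a disparity is a horizontal offset measured in pixels of its own resolution, so enlarging the map should enlarge the values accordingly).

```python
def upsample_nn(dmap, out_h, out_w):
    """Nearest-neighbour upsample of a disparity map to (out_h, out_w),
    rescaling disparity values by the width ratio (assumed convention)."""
    h, w = len(dmap), len(dmap[0])
    scale = out_w / w
    return [[dmap[int(i * h / out_h)][int(j * w / out_w)] * scale
             for j in range(out_w)]
            for i in range(out_h)]

quarter = [[1.0, 2.0], [3.0, 4.0]]  # a tiny 2x2 disparity map
full = upsample_nn(quarter, 4, 4)   # restored to 4x4, values doubled
```

After this step all four maps share the original resolution, so every reconstruction and loss in steps S5–S6 is computed at full size, which is what suppresses the low-scale hole artefacts.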
[58] Step S5: Perform image reconstruction by using the input original images and the corresponding disparity maps: use a right disparity map and its corresponding left image to reconstruct a right image, then use an original right image and a left disparity map to reconstruct a left image, and finally compare the reconstructed left and right images with the input original left and right images.
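The reconstruction of step S5 is a horizontal warp. The sketch below rebuilds a left image by sampling the right image with the left disparity map; it assumes rectified stereo with the convention that a left-image pixel at column j appears in the right image at column j − d(i, j), uses integer-pixel sampling for brevity (real implementations sample sub-pixel positions bilinearly), and clamps at image borders.

```python
def reconstruct_left(right_img, left_disp):
    """Warp the right image with the left disparity map to approximate
    the left image (integer-pixel sketch of the step-S5 reconstruction)."""
    h, w = len(right_img), len(right_img[0])
    recon = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            src = j - int(round(left_disp[i][j]))     # assumed sign convention
            src = min(max(src, 0), w - 1)             # clamp at the borders
            recon[i][j] = right_img[i][src]
    return recon

right = [[0.0, 1.0, 2.0, 3.0]]       # one image row
disp = [[1.0, 1.0, 1.0, 1.0]]        # uniform 1-pixel disparity
recon = reconstruct_left(right, disp)
print(recon)  # [[0.0, 0.0, 1.0, 2.0]]
```

Reconstructing the right image from the left image and the right disparity map is the mirror-image operation.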
[59] Step S6: Constrain accuracy of the image reconstruction through an appearance matching loss, a left-right disparity transform loss, and a disparity smoothness loss, which specifically includes the following steps:
[60] step S61: calculating three parts of a loss function: an appearance matching loss Ca, a smoothness loss Cs, and a disparity transform loss Ct;
[61] in the image reconstruction process, first determining the accuracy of the reconstructed image against the corresponding input image pixel by pixel by using the appearance matching loss C_a, where the loss combines a structural similarity index (SSIM) and an L1 loss; taking the input left image as an example:
[62] C_a^l = (1/N) Σ_{i,j} [ α(1 − SSIM(I_{ij}^l, Ĩ_{ij}^l))/2 + (1 − α)|I_{ij}^l − Ĩ_{ij}^l| ]
[63] where SSIM is the structural similarity index, which comprises three parts, luminance measurement, contrast measurement, and structure comparison, and measures the similarity of two images: the more similar the two images, the higher the index value; the L1 loss is a least absolute error loss, which compares the difference between two images pixel by pixel and is less sensitive to outliers than an L2 loss; α is the weight coefficient of the structural similarity term in the appearance matching loss, and N is the total quantity of pixels in an image;
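The appearance matching loss can be sketched as below. This is an illustrative version, not the disclosure's implementation: SSIM is computed globally over a flattened patch (real implementations use a small sliding window), and the weight α defaults to the 0.8 reported in Table 1.

```python
def mean(xs):
    return sum(xs) / len(xs)

def ssim(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """Global SSIM over two flattened patches (sliding window omitted)."""
    mx, my = mean(x), mean(y)
    vx = mean([(a - mx) ** 2 for a in x])
    vy = mean([(b - my) ** 2 for b in y])
    cov = mean([(a - mx) * (b - my) for a, b in zip(x, y)])
    return ((2 * mx * my + c1) * (2 * cov + c2)) / \
           ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2))

def appearance_loss(orig, recon, alpha=0.8):
    """C_a of step S61: alpha-weighted mix of an SSIM term and a mean L1 term."""
    ssim_term = (1.0 - ssim(orig, recon)) / 2.0
    l1_term = sum(abs(a - b) for a, b in zip(orig, recon)) / len(orig)
    return alpha * ssim_term + (1 - alpha) * l1_term
```

An identical original and reconstruction give a loss of zero, and the loss grows as the reconstruction diverges, which is exactly the supervisory signal the unsupervised training relies on.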
[64] the smoothness loss C_s can reduce discontinuity of the disparity maps caused by excessively large local gradients and ensure smoothness of the disparity maps; taking the left image as an example, a specific formula is as follows:
[65] C_s^l = (1/N) Σ_{i,j} ( |∂_x d_{ij}^l| e^{−|∂_x I_{ij}^l|} + |∂_y d_{ij}^l| e^{−|∂_y I_{ij}^l|} )
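A one-dimensional sketch of this loss for a single image row follows; the edge-aware exponential weighting (down-weighting disparity gradients where the image itself has strong gradients, i.e. at object edges) is the standard form of this term and is assumed here.

```python
import math

def smoothness_loss(disp, img):
    """C_s of step S61, sketched in 1-D for one row: penalize disparity
    gradients, attenuated by e^{-|image gradient|} at likely object edges."""
    n = len(disp) - 1
    total = 0.0
    for j in range(n):
        d_grad = abs(disp[j + 1] - disp[j])
        i_grad = abs(img[j + 1] - img[j])
        total += d_grad * math.exp(-i_grad)
    return total / n

flat = smoothness_loss([1.0, 1.0, 1.0], [0.1, 0.5, 0.2])
print(flat)  # 0.0 for a constant disparity row
```

The full 2-D loss simply sums the analogous terms over both the x and y directions.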
[66] the disparity transform loss C_t is used to reduce the transform error between the right disparity map generated from the left image and the left disparity map generated from the right image, ensuring consistency of the two disparity maps; a specific formula is as follows:
[67] C_t^l = (1/N) Σ_{i,j} |d_{ij}^l − d_{i, j−d_{ij}^l}^r|
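The consistency check can be sketched for a single row as below. The sign convention (the left disparity at column j points to column j − d in the right map) is an assumption, and integer sampling with border clamping stands in for the interpolation a real implementation would use.

```python
def lr_consistency_loss(d_left, d_right):
    """C_t of step S61 for one row: compare each left disparity with the
    right disparity sampled at the column it points to (j - d, assumed)."""
    n = len(d_left)
    total = 0.0
    for j in range(n):
        k = min(max(j - int(round(d_left[j])), 0), n - 1)  # clamp at borders
        total += abs(d_left[j] - d_right[k])
    return total / n

print(lr_consistency_loss([1.0, 1.0, 1.0, 1.0], [1.0, 1.0, 1.0, 1.0]))  # 0.0
```

A pair of mutually consistent disparity maps yields zero loss; any disagreement between the two views is penalized directly.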
[68] the calculation method of each loss is the same for the left and right images; and the final loss function is obtained by combining the three losses:
[69] C = α_a(C_a^l + C_a^r) + α_s(C_s^l + C_s^r) + α_t(C_t^l + C_t^r)
[70] where α_a is the weight of the appearance matching loss in the total loss, α_s is the weight of the smoothness loss in the total loss, and α_t is the weight of the transform loss in the total loss; and
[71] step S62: calculating the loss separately for each disparity map of the original input size against the original input image to obtain four losses C_i, where i = 1, 2, 3, 4, and the total loss function is C_total = Σ_{i=1}^{4} C_i.
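The combination in steps S71 and S72 is a weighted sum followed by a sum over scales; a minimal sketch, with the weights of Table 1 (all α set to 1) as defaults:

```python
def combined_loss(Ca, Cs, Ct, wa=1.0, ws=1.0, wt=1.0):
    """Step S71: combine (left, right) pairs of appearance, smoothness and
    transform losses with the weights alpha_a, alpha_s, alpha_t."""
    return wa * sum(Ca) + ws * sum(Cs) + wt * sum(Ct)

def total_loss(per_scale_losses):
    """Step S72: the four per-scale losses, all already computed at the
    unified original resolution, are summed into C_total."""
    return sum(per_scale_losses)

C1 = combined_loss((1.0, 2.0), (3.0, 4.0), (5.0, 6.0))
print(C1)  # 21.0
```

Because every C_i is evaluated at the same full resolution, no scale's gradient is dominated by blurred low-resolution maps.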
[72] Step S7: Based on a loss minimization principle, train the network model by using a gradient descent method. Specifically, for training on stereo image pairs, build the depth estimation model on the open-source TensorFlow 1.4.0 platform, use the KITTI stereo image pairs as the training set, with 29000 pairs used for model training; during training, set an initial learning rate lr, and after 40 cycles, halve the current learning rate every 10 cycles, for a total of 70 training cycles; set the batch size to bs, meaning bs images are processed at a time; optimize the model with an Adam optimizer, setting β1 and β2 to control the decay of the moving averages of the gradient moment estimates; all training is completed on the GTX 1080Ti experimental platform in 34 hours.
[73] Table 1 Loss function and training parameters
[74]
Algorithm parameter name and symbol | Value
---|---
Weight coefficients α_a, α_s, α_t | 1
Weight coefficient α | 0.8
Initial learning rate lr | 0.0001
Batch size bs | 8
Hyperparameter β1 | 0.9
Hyperparameter β2 | 0.999
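The learning-rate schedule described in step S7 (a constant rate for the first 40 cycles, then halving every 10 cycles until training ends at cycle 70) can be written as a small function; this is a sketch of the stated schedule, not the disclosure's training script.

```python
def learning_rate(epoch, lr0=1e-4):
    """Schedule from step S7: lr0 for the first 40 cycles, then halved
    every 10 cycles (cycles 40-49: lr0/2, 50-59: lr0/4, 60-69: lr0/8)."""
    if epoch < 40:
        return lr0
    return lr0 / (2 ** ((epoch - 40) // 10 + 1))

print(learning_rate(0), learning_rate(45), learning_rate(65))
# 0.0001 5e-05 1.25e-05
```

Decaying the rate only after the initial 40 cycles lets the network make large early updates and then settle gently near a minimum.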
[75] Step S8: In a testing stage, fit corresponding disparity maps based on the input images and the pre-trained model, and calculate a corresponding scene depth map from the disparity maps based on a triangulation principle of binocular imaging. In the KITTI road driving data set used in this experiment, a baseline distance of a camera is fixed at 0.54 m, and a focal length of the camera varies with camera models. Different camera models are reflected as different image sizes in the KITTI data set. A corresponding relationship is shown in the following table.
[76]
Image size/pixel | f/pixel
---|---
1242 | 721.5377
1241 | 718.856
1224 | 707.0493
1238 | 718.3351
[77] A formula for the depth-disparity transform is specifically:
[78] D(i, j) = (b × f) / d(i, j) = (0.54 × f) / d(i, j)
[79] where (i, j) are the pixel coordinates of any point in an image, D(i, j) is the depth value of the point, and d(i, j) is the disparity value of the point.
[80] In this way, based on the input images and the network model pre-trained by using the binocular image reconstruction principle, the disparity maps corresponding to the input images are fitted, and a scene depth map corresponding to an input image taken by the camera can be calculated based on the known focal length and baseline distance of the camera.
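The step-S8 conversion is a one-line formula once the focal length is looked up by image width; the sketch below uses the baseline and focal-length table given above (the dictionary name and function are illustrative, not from the disclosure).

```python
# Focal length per KITTI image width, from the table in step S8.
KITTI_FOCAL = {1242: 721.5377, 1241: 718.856, 1224: 707.0493, 1238: 718.3351}
BASELINE_M = 0.54  # fixed stereo baseline of the KITTI rig, in metres

def depth_from_disparity(disp_px, image_width):
    """D(i, j) = b * f / d(i, j): disparity in pixels -> depth in metres,
    via the binocular triangulation formula of step S8."""
    f = KITTI_FOCAL[image_width]
    return BASELINE_M * f / disp_px

print(depth_from_disparity(38.963, 1242))  # roughly 10 metres
```

Large disparities map to nearby points and small disparities to distant ones, which is why disparity errors on far objects translate into the largest depth errors.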
[81] The standard parts used in the present disclosure can be purchased from the market, and the special-shaped parts can be customized according to this specification and the accompanying drawings. The parts are connected by using the conventional methods in the prior art, such as bolts, rivets, and welding, machinery, parts, and devices are the conventional models in the prior art, and the circuit connection adopts the conventional connection methods in the prior art, which are not described in detail herein.
[82] In the claims which follow, and in the preceding description, except where the context requires otherwise due to express language or necessary implication, the word "comprise" and variations such as "comprises" or "comprising" are used in an inclusive sense, i.e. to specify the presence of the stated features but not to preclude the presence or addition of further features in various embodiments of the apparatus and method as disclosed herein.
Claims (5)
1. An unsupervised monocular depth estimation method based on multi-scale unification, comprising the following steps: step S1: performing pyramid multi-scale processing on an input stereo image pair, to extract features of multiple scales; step S2: constructing a network framework for encoding and decoding, and obtaining disparity maps that can be used to calculate a depth map; step S3: transmitting features extracted in an encoding stage to a deconvolutional neural network to extract features of the input images of different scales, and fitting disparity maps of the input images of different scales in a decoding stage; step S4: uniformly upsampling the disparity maps of different scales to an original input size; step S5: performing image reconstruction by using the input original images and corresponding disparity maps; step S6: constraining accuracy of the image reconstruction through an appearance matching loss, a left-right disparity transform loss, and a disparity smoothness loss; step S7: based on a loss minimization principle, training a network model by using a gradient descent method; and step S8: in a testing stage, fitting corresponding disparity maps based on the input images and the pre-trained model, and calculating a corresponding scene depth map from the disparity maps based on a triangulation principle of binocular imaging.
2. The unsupervised monocular depth estimation method based on multi-scale unification according to claim 1, wherein in step S1, the input image is downsampled to 1, 1/2, 1/4, and 1/8 of the original image to form a pyramid input structure, and then sent to an encoding model for feature extraction.
3. The unsupervised monocular depth estimation method based on multi-scale unification according to claim 1, wherein in step S2, a ResNet-101 network structure is used as a network model for the encoding stage.
4. The unsupervised monocular depth estimation method based on multi-scale unification according to claim 1, wherein in step S3, feature extraction is performed on the input images of different scales in the encoding stage, and the extracted features are sent to the deconvolutional neural network in the decoding stage to implement disparity map fitting, which specifically comprises: step S41: in the encoding stage, performing feature extraction on the input images of the pyramid structure through a ResNet-101 network, and reducing the input images of different sizes to 1/16 during the extraction process, to obtain features whose sizes are 1/16, 1/32, 1/64, and 1/128 of the original input images; and step S42: inputting the features of the four sizes obtained in the encoding stage into the network in the decoding stage, deconvolving the input features layer by layer to restore the pyramid structure of the images whose sizes are 1, 1/2, 1/4, and 1/8 of the original input images, and separately fitting disparity maps for the images of the four sizes based on the input features and the deconvolutional network.
5. The unsupervised monocular depth estimation method based on multi-scale unification according to claim 1, wherein in step S4, the disparity maps whose sizes are 1, 1/2, 1/4, and 1/8 of the original input image are uniformly upsampled to a size of the original input image; wherein in step S5, because the disparity maps of the four sizes are uniformly upsampled to the original input size, an original input left image I^l and a right disparity map d^r are used to reconstruct a right image Ĩ^r, and an original right image I^r and a left disparity map d^l are used to reconstruct a left image Ĩ^l; wherein in step S6, the original input left and right images and the reconstructed left and right images are used to calculate the losses to constrain the accuracy of the image reconstruction; and a loss function is minimized by using the gradient descent method, to train an image reconstruction network, which specifically comprises: step S71: calculating three parts of the loss function: an appearance matching loss C_a, a smoothness loss C_s, and a disparity transform loss C_t, wherein the calculation method of each loss is the same for the left and right images; and the final loss function is obtained by combining the three losses: C = α_a(C_a^l + C_a^r) + α_s(C_s^l + C_s^r) + α_t(C_t^l + C_t^r); step S72: calculating the loss separately for each disparity map of the original input size against the original input image to obtain four losses C_i, wherein i = 1, 2, 3, 4, and the total loss function is C_total = Σ_{i=1}^{4} C_i; wherein in step S7, based on the loss minimization principle, the network model is trained by using the gradient descent method; and wherein in step S8, in the testing stage, the disparity maps corresponding to the input image are fitted by using the input single image and the pre-trained model, and the corresponding depth map is generated from the disparity maps based on the triangulation principle of binocular imaging, which is specifically: D(i, j) = (b × f) / d(i, j), wherein (i, j) are pixel-level coordinates of any point in an image, D(i, j) is a depth value of the point, d(i, j) is a disparity value of the point, b is a known distance between two cameras, and f is a known focal length of the camera.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
AU2021103300A AU2021103300A4 (en) | 2021-06-11 | 2021-06-11 | Unsupervised Monocular Depth Estimation Method Based On Multi-Scale Unification |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
AU2021103300A AU2021103300A4 (en) | 2021-06-11 | 2021-06-11 | Unsupervised Monocular Depth Estimation Method Based On Multi-Scale Unification |
Publications (1)
Publication Number | Publication Date |
---|---|
AU2021103300A4 true AU2021103300A4 (en) | 2021-08-05 |
Family
ID=77076030
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
AU2021103300A Ceased AU2021103300A4 (en) | 2021-06-11 | 2021-06-11 | Unsupervised Monocular Depth Estimation Method Based On Multi-Scale Unification |
Country Status (1)
Country | Link |
---|---|
AU (1) | AU2021103300A4 (en) |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113902658A (en) * | 2021-09-01 | 2022-01-07 | 西安电子科技大学 | RGB image-to-hyperspectral image reconstruction method based on dense multiscale network |
CN113902658B (en) * | 2021-09-01 | 2023-02-10 | 西安电子科技大学 | RGB image-to-hyperspectral image reconstruction method based on dense multiscale network |
CN114627351A (en) * | 2022-02-18 | 2022-06-14 | 电子科技大学 | Fusion depth estimation method based on vision and millimeter wave radar |
CN114627351B (en) * | 2022-02-18 | 2023-05-16 | 电子科技大学 | Fusion depth estimation method based on vision and millimeter wave radar |
CN114998453A (en) * | 2022-08-08 | 2022-09-02 | 国网浙江省电力有限公司宁波供电公司 | Stereo matching model based on high-scale unit and application method thereof |
CN115115898A (en) * | 2022-08-31 | 2022-09-27 | 南京航空航天大学 | Small sample target detection method based on unsupervised feature reconstruction |
CN115115898B (en) * | 2022-08-31 | 2022-11-15 | 南京航空航天大学 | Small sample target detection method based on unsupervised feature reconstruction |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
AU2021103300A4 (en) | Unsupervised Monocular Depth Estimation Method Based On Multi-Scale Unification | |
Xu et al. | Cobevt: Cooperative bird's eye view semantic segmentation with sparse transformers | |
JP6745328B2 (en) | Method and apparatus for recovering point cloud data | |
CN109685842B (en) | Sparse depth densification method based on multi-scale network | |
CN112634341B (en) | Method for constructing depth estimation model of multi-vision task cooperation | |
WO2020020160A1 (en) | Image parallax estimation | |
US20220051425A1 (en) | Scale-aware monocular localization and mapping | |
CN111563415A (en) | Binocular vision-based three-dimensional target detection system and method | |
CN110517306B (en) | Binocular depth vision estimation method and system based on deep learning | |
CN111325782A (en) | Unsupervised monocular view depth estimation method based on multi-scale unification | |
CN111028285A (en) | Depth estimation method based on binocular vision and laser radar fusion | |
CN113936139A (en) | Scene aerial view reconstruction method and system combining visual depth information and semantic segmentation | |
KR20200060194A (en) | Method of predicting depth values of lines, method of outputting 3d lines and apparatus thereof | |
CN114495064A (en) | Monocular depth estimation-based vehicle surrounding obstacle early warning method | |
CN114119889B (en) | Cross-modal fusion-based 360-degree environmental depth completion and map reconstruction method | |
CN111105451B (en) | Driving scene binocular depth estimation method for overcoming occlusion effect | |
Chen et al. | Multitarget vehicle tracking and motion state estimation using a novel driving environment perception system of intelligent vehicles | |
CN117745944A (en) | Pre-training model determining method, device, equipment and storage medium | |
CN117058474A (en) | Depth estimation method and system based on multi-sensor fusion | |
Mathew et al. | Monocular depth estimation with SPN loss | |
Chen et al. | Monocular image depth prediction without depth sensors: An unsupervised learning method | |
CN110717457A (en) | Pedestrian pose calculation method for vehicle | |
CN113706599B (en) | Binocular depth estimation method based on pseudo label fusion | |
CN115330935A (en) | Three-dimensional reconstruction method and system based on deep learning | |
CN110245553B (en) | Road surface distance measuring method and device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
FGI | Letters patent sealed or granted (innovation patent) | ||
MK22 | Patent ceased section 143a(d), or expired - non payment of renewal fee or expiry |