CN109472830A - A monocular visual positioning method based on unsupervised learning - Google Patents

A monocular visual positioning method based on unsupervised learning

Info

Publication number
CN109472830A
Authority
CN
China
Prior art keywords
image frame
depth map
network
pose
frame
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201811141754.3A
Other languages
Chinese (zh)
Inventor
黄镇业
吴贺俊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sun Yat Sen University
Original Assignee
Sun Yat Sen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sun Yat Sen University
Priority to CN201811141754.3A
Publication of CN109472830A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/80 Analysis of captured images to determine intrinsic or extrinsic camera parameters, i.e. camera calibration
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/10 Image acquisition modality
    • G06T2207/10016 Video; Image sequence
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20081 Training; Learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The present invention proposes a monocular visual positioning method based on unsupervised learning, with the following steps: obtain video stream information and cut it evenly into image frames; input adjacent first and second image frames into a pose estimation network and into first and second depth estimation networks, obtaining the pose transformation between the poses corresponding to the first and second image frames, and the first and second depth maps corresponding to those frames; reconstruct a first reconstructed image frame from the first depth map and the pose transformation, and a second reconstructed image frame from the second depth map and the pose transformation; compute the reconstruction error and, taking its minimisation as the objective, fit the pose estimation network and the depth-map estimation networks; apply the deep neural network combining the fitted pose estimation network and depth-map estimation networks to monocular visual positioning. By fitting the depth-map neural networks with two adjacent image frames as input, the present invention can effectively improve positioning performance and offers strong scalability.

Description

A monocular visual positioning method based on unsupervised learning
Technical field
The present invention relates to the field of visual positioning, and more particularly to a monocular visual positioning method based on unsupervised learning.
Background technique
Monocular visual positioning addresses the problem of recovering the motion trajectory of a monocular camera from the video stream the camera captures.

Existing deep-learning-based visual positioning methods generally fall into two categories, supervised and unsupervised. The drawback of supervised methods is that they require large quantities of manually labelled samples, which consumes substantial manpower, and they need expensive high-precision equipment, so their cost is high.

Existing unsupervised visual positioning methods, for their part, remain immature. Reference 1 proposes a binocular visual positioning framework based on unsupervised learning: a CNN with an auto-encoder structure maps the left image of a binocular camera to a corresponding depth map, the left image is then reconstructed from the depth map and the right image of the binocular camera, and the resulting reconstruction error drives unsupervised learning. Reference 2 proposes a binocular left-right consistency method that alleviates the insufficient constraint of the reconstruction error in the first paper. Both methods, however, require binocular camera hardware and cannot work on the video stream of a monocular camera. Reference 3, which can be used on its own for monocular visual positioning, proposes an unsupervised monocular visual positioning framework that adds a CNN with an auto-encoder structure on top of the binocular framework, using it to estimate the pose transformation between two successive images captured by the monocular camera in place of the binocular camera's extrinsic matrix. The drawback of this method is that its depth-estimation convolutional neural network takes only single frames of the video stream as input and does not account for the influence of adjacent images on the depth estimate, so the positioning performance is poor.
Reference 1: Ravi Garg, Vijay Kumar, Gustavo Carneiro, Ian Reid. "Unsupervised CNN for Single View Depth Estimation: Geometry to the Rescue", ECCV 2016.
Reference 2: Clement Godard, Oisin Mac Aodha, Gabriel Brostow. "Unsupervised Monocular Depth Estimation with Left-Right Consistency", CVPR 2017.
Reference 3: Tinghui Zhou, Matthew Brown, Noah Snavely, David Lowe. "Unsupervised Learning of Depth and Ego-Motion from Video", CVPR 2017.
Summary of the invention
To overcome at least one defect of the above prior art, namely that the influence of adjacent image frames is not taken into account when estimating the depth map, the present invention provides a monocular visual positioning method based on unsupervised learning that can effectively improve scalability and positioning performance.
To solve the above technical problems, the technical solution of the present invention is as follows:
A monocular visual positioning method based on unsupervised learning, comprising the following steps:
S1: obtain video stream information and cut the video stream evenly into image frames;
S2: stack any two adjacent image frames, a first image frame and a second image frame, and input them into a pose estimation network to obtain the pose transformation between the poses corresponding to the first and second image frames;
S3: input the stacked first and second image frames separately into a first depth-map estimation network and a second depth-map estimation network to obtain a first depth map corresponding to the first image frame and a second depth map corresponding to the second image frame;
S4: reconstruct a first reconstructed image frame from the first depth map and the pose transformation, and a second reconstructed image frame from the second depth map and the pose transformation;
S5: compute a reconstruction error L from the first image frame and the first reconstructed image frame, and from the second image frame and the second reconstructed image frame; taking the minimisation of L as the objective, use L to fit the pose estimation network, the first depth-map estimation network and the second depth-map estimation network;
S6: apply the deep neural network combining the fitted pose estimation network, first depth-map estimation network and second depth-map estimation network to monocular visual positioning.
In step S1 of this technical solution, the video stream information can be obtained from a monocular camera or from an existing data set. All the information in the input video stream is fully exploited: the two adjacent image frames are taken together as input, reconstructed image frames are obtained through the pose estimation network and the depth-map estimation networks, and the networks are fitted with the minimisation of the reconstruction error as the objective. The deep neural network composed of the fitted pose estimation network, first depth-map estimation network and second depth-map estimation network is finally applied to monocular visual positioning. Compared with existing methods that take single frames as input, this technical solution carries more geometric meaning and improves positioning performance more effectively; the sketch below illustrates the overall fitting procedure.
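The following is a minimal sketch, in PyTorch, of one fitting step of the procedure just described. The module and helper names (PoseNet, DepthNet, se3_exp, inverse_warp, reconstruction_loss) refer to the sketches given later in this document and are illustrative assumptions, not the patent's own code; batch size 1 is assumed for brevity.

```python
import torch

def fit_step(pose_net, depth_net1, depth_net2, optimizer, frame1, frame2, K):
    """One unsupervised fitting step on a pair of adjacent frames, each (1, 3, H, W)."""
    pair = torch.cat([frame1, frame2], dim=1)        # stacked 6-channel input (S2/S3)
    xi = pose_net(pair)[0]                           # 6-dim se(3) pose twist (S2)
    # The depth nets are assumed to output positive inverse depth (see the
    # sigmoid decoding sketched later); convert to depth z = 1/d for warping.
    depth1 = 1.0 / depth_net1(pair).clamp(min=1e-6)
    depth2 = 1.0 / depth_net2(pair).clamp(min=1e-6)

    # S4: reconstruct each frame from the other view. Negating the twist gives
    # the exact inverse transform, since exp(-xi) = exp(xi)^-1 in SE(3).
    recon1 = inverse_warp(frame2, depth1, xi, K)
    recon2 = inverse_warp(frame1, depth2, -xi, K)

    # S5: minimise the photometric reconstruction error L.
    loss = reconstruction_loss(frame1, recon1, frame2, recon2)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss
```

In use, this step would be repeated over all adjacent frame pairs cut from the video stream (S1), after which the fitted networks are deployed for positioning (S6).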
Preferably, the pose estimation network in step S2 comprises a convolutional neural network (CNN) and fully connected layers. In this technical solution, the image frames first pass through the CNN, which extracts image features, and then through the fully connected layers, which output the corresponding pose transformation.
Preferably, the specific steps of step S2 include:
S2.1: extract image features from the stacked first and second image frames with the convolutional neural network CNN;
S2.2: pass the extracted image features through the fully connected layers and output the pose transformation between the poses corresponding to the first and second image frames.
Preferably, the pose transformation in step S2.2 is represented in the Lie algebra, as the sketch below illustrates.
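The patent does not spell out the Lie-algebra mapping, so the following is a minimal sketch, under the standard convention, of how the 6-dimensional se(3) vector output by the pose network can be turned into a 4x4 rigid transform via the exponential map (Rodrigues' formula). The function name se3_exp is an assumption.

```python
import torch

def se3_exp(xi: torch.Tensor) -> torch.Tensor:
    """Map a 6-dim twist xi = [v | w] (translation part v, rotation part w)
    to a 4x4 SE(3) matrix T = [[R, t], [0, 1]]."""
    v, w = xi[:3], xi[3:]
    theta = torch.linalg.norm(w)
    W = torch.zeros(3, 3)                      # skew-symmetric matrix of w
    W[0, 1], W[0, 2] = -w[2], w[1]
    W[1, 0], W[1, 2] = w[2], -w[0]
    W[2, 0], W[2, 1] = -w[1], w[0]
    I = torch.eye(3)
    if theta < 1e-8:                           # near-zero rotation: first-order terms
        R, V = I + W, I + 0.5 * W
    else:
        A = torch.sin(theta) / theta
        B = (1.0 - torch.cos(theta)) / theta ** 2
        C = (1.0 - A) / theta ** 2
        R = I + A * W + B * (W @ W)            # Rodrigues' rotation formula
        V = I + B * W + C * (W @ W)            # left Jacobian, scales the translation
    T = torch.eye(4)
    T[:3, :3] = R
    T[:3, 3] = V @ v
    return T
```

The benefit of the Lie-algebra representation is that the network regresses an unconstrained 6-dimensional vector while the exponential map guarantees a valid rotation.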
Preferably, the depth estimation networks in step S3 each comprise a convolutional neural network (CNN) and a decoder with a deconvolution structure. In this technical solution, the image frames first pass through the CNN, which extracts image features, and then through the deconvolution decoder, which outputs the depth maps corresponding to the two adjacent image frames.
Preferably, the specific steps of step S3 include:
S3.1: pass the first and second image frames through the convolutional neural network CNN of the first depth-map estimation network to complete the extraction of image features;
S3.2: pass the image features extracted in S3.1 through the deconvolution decoder of the first depth-map estimation network and output the first depth map corresponding to the first image frame;
S3.3: pass the first and second image frames through the convolutional neural network CNN of the second depth-map estimation network to complete the extraction of image features;
S3.4: pass the image features extracted in S3.3 through the deconvolution decoder of the second depth-map estimation network and output the second depth map corresponding to the second image frame.
Preferably, the depth values in the depth map of step S3.2 are inverse depths, i.e. the reciprocals of the depths. The benefit of using inverse depth is that it represents the case of infinite depth more gracefully (infinity maps to zero) and simplifies computation; the snippet below illustrates the parameterisation.
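A small illustration of the inverse-depth parameterisation: a bounded inverse depth d = 1/z is decoded from the raw network output, so arbitrarily distant scenery stays representable. The sigmoid decoding and the bounds are illustrative assumptions, not values fixed by the patent.

```python
import torch

def decode_inverse_depth(logits: torch.Tensor,
                         min_inv: float = 0.01, max_inv: float = 10.0):
    """Map raw decoder output to inverse depth d in [min_inv, max_inv]."""
    d = min_inv + (max_inv - min_inv) * torch.sigmoid(logits)
    depth = 1.0 / d                 # metric depth z = 1/d, recovered when needed
    return d, depth
```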
Preferably, the relationship between the first reconstructed image frame and the first image frame in step S4 satisfies:
p2 = K * T * D1(p1) * K^(-1) * p1
where D1 = f_D1(I1, I2) denotes the first depth map and f_D1 the first depth-map estimation network, I1 and I2 are the first and second image frames respectively, T = f_T(I1, I2) denotes the output of the pose estimation network, p2 is the coordinate of each pixel in the first reconstructed image frame, p1 is the coordinate of the corresponding pixel in the first image frame, and K is the intrinsic matrix of the monocular camera.
Preferably, the reconstruction error L in step S5 is calculated as:
L = Σ_p1 || I1(p1) - I2(p2) ||
where the sum runs over all pixels, I1(p1) is the pixel value at coordinate p1 in the first image frame I1, and I2(p2) is the pixel value at coordinate p2 in the second image frame I2.
Compared with the prior art, the beneficial effects of the technical solution of the present invention are: fitting the depth-map estimation function with two adjacent image frames as input effectively improves positioning performance; and the proposed framework combining the pose estimation network with the depth-map estimation networks allows its network modules to be replaced on demand, giving strong scalability.
Brief description of the drawings
Fig. 1 is a flowchart of the present invention.
Fig. 2 is a schematic diagram of the structure of the pose estimation network of this embodiment.
Fig. 3 is a schematic diagram of the structure of the depth-map estimation network of this embodiment.
Detailed description of the embodiments
The accompanying drawings are for illustrative purposes only and shall not be construed as limiting this patent;
To better illustrate this embodiment, certain components in the drawings are omitted, enlarged or reduced, and do not represent the dimensions of the actual product;
Those skilled in the art will understand that certain well-known structures and their descriptions may be omitted from the drawings.
The technical solution of the present invention is further described below with reference to the accompanying drawings and embodiments.
Fig. 1 shows the flowchart of the present invention.
Step 1: obtain video stream information with a monocular camera and cut the video stream into image frames.
Step 2: stack any two adjacent image frames, the first and second image frames, and input them into the pose estimation network to obtain the pose transformation between the monocular camera poses corresponding to the first and second image frames. Fig. 2 is a schematic diagram of the structure of the pose estimation network of this embodiment; the network comprises a convolutional neural network CNN and fully connected layers. The specific steps are as follows:
S1: stack the two input image frames to form an input image with 6 channels;
S2: apply a convolution with a 5×5 kernel to the 6-channel input image, then apply one downsampling operation to the feature map, halving its size;
S3: apply six 3×3 convolutions to the feature map in succession, with one downsampling operation after every two convolutions, halving the size each time;
S4: apply one convolution with a 3×3 kernel to the feature map obtained in S3; the resulting feature map is the feature extracted from the two input frames;
S5: process the extracted feature through several fully connected layers and output a 6-dimensional vector, which is the pose transformation between the monocular camera poses corresponding to the first and second image frames; this pose transformation is represented in the Lie algebra (a minimal sketch of the whole network is given after this list).
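The following is a minimal PyTorch sketch of the pose estimation network just described. The patent fixes the kernel sizes, the downsampling schedule and the 6-dimensional output; the channel widths, the activation functions, the pooling operator and the input resolution are illustrative assumptions.

```python
import torch
import torch.nn as nn

class PoseNet(nn.Module):
    """Stacked 6-channel frame pair -> 6-dim Lie-algebra pose vector."""

    def __init__(self, h: int = 128, w: int = 416):  # H, W must be divisible by 16
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(6, 16, 5, padding=2), nn.ReLU(),    # 5x5 conv on the pair
            nn.MaxPool2d(2),                              # halve the size
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),   # six 3x3 convs,
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),   # downsampling after
            nn.MaxPool2d(2),                              # every two of them
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(),
            nn.Conv2d(128, 128, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(128, 128, 3, padding=1), nn.ReLU(), # final 3x3 conv (S4)
        )
        feat = 128 * (h // 16) * (w // 16)
        self.fc = nn.Sequential(nn.Linear(feat, 256), nn.ReLU(),
                                nn.Linear(256, 6))        # 6-dim pose twist (S5)

    def forward(self, pair: torch.Tensor) -> torch.Tensor:
        """pair: (B, 6, H, W) two stacked frames -> (B, 6) pose transformation."""
        return self.fc(self.encoder(pair).flatten(1))
```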
Step 3: input the stacked first and second image frames separately into the first depth-map estimation network and the second depth-map estimation network to obtain the first depth map corresponding to the first image frame and the second depth map corresponding to the second image frame. Fig. 3 is a schematic diagram of the structure of the depth-map estimation network of this embodiment; the network comprises a convolutional neural network CNN and a decoder with a deconvolution structure. The specific process when the first and second image frames are input into the first depth-map estimation network is as follows:
S1: stack the input first and second image frames to form an input image with 6 channels;
S2: apply a convolution with a 5×5 kernel to the 6-channel input image and retain a copy of the resulting feature map, then apply one downsampling operation to the feature map, halving its size;
S3: apply six 3×3 convolutions to the feature map in succession; after every two convolutions, retain a copy of the feature map and then apply one downsampling operation, halving its size;
S4: apply one convolution with a 3×3 kernel to the resulting feature map; the result is the feature extracted from the two input frames;
S5: apply one more convolution with a 3×3 kernel to the feature obtained in S4;
S6: repeat the following operation three times: apply a deconvolution with a 3×3 kernel to the feature map so that its size doubles, stack it once with the retained feature map of the corresponding size, and then apply one convolution with a 3×3 kernel;
S7: apply a deconvolution with a 5×5 kernel to the feature map finally obtained in S6, stack it once with the retained feature map of the corresponding size, and then apply one convolution with a 5×5 kernel, obtaining an image of the same size as the input image. This is the output first depth map, in which each depth value is the inverse depth of the corresponding pixel.
The process when the first and second image frames are input into the second depth-map estimation network is identical to the above; the difference is that, after training, the parameters of the first and second depth-map estimation networks differ.
In Fig. 2 and Fig. 3, the scaling of the shapes indicates feature maps shrinking or growing by a factor of 2, and the connecting lines indicate the stacking of feature maps. The reason for this design is that the output depth map must match the size of the input image: convolutional layers keep the feature-map size constant or reduce it, while deconvolution layers increase it.
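The following is a minimal PyTorch sketch of this encoder-decoder with the retained-copy (skip) connections described in S2, S3, S6 and S7 above. The kernel sizes, the doubling/halving schedule and the stacking points follow the text; the channel widths are illustrative assumptions, and H and W must be divisible by 16.

```python
import torch
import torch.nn as nn

def conv(ci, co, k):      # convolution + ReLU, resolution preserved
    return nn.Sequential(nn.Conv2d(ci, co, k, padding=k // 2), nn.ReLU())

def up(ci, co, k):        # deconvolution + ReLU, resolution doubled
    return nn.Sequential(
        nn.ConvTranspose2d(ci, co, k, stride=2, padding=k // 2, output_padding=1),
        nn.ReLU())

class DepthNet(nn.Module):
    """Stacked 6-channel frame pair -> (B, 1, H, W) inverse-depth map."""

    def __init__(self):
        super().__init__()
        self.pool = nn.MaxPool2d(2)
        self.e1 = conv(6, 16, 5)                                    # S2: 5x5 conv
        self.e2 = nn.Sequential(conv(16, 32, 3), conv(32, 32, 3))   # S3: 3x3 pairs
        self.e3 = nn.Sequential(conv(32, 64, 3), conv(64, 64, 3))
        self.e4 = nn.Sequential(conv(64, 128, 3), conv(128, 128, 3))
        self.mid = nn.Sequential(conv(128, 128, 3), conv(128, 128, 3))  # S4 + S5
        self.d3, self.c3 = up(128, 128, 3), conv(128 + 128, 64, 3)     # S6, x3
        self.d2, self.c2 = up(64, 64, 3), conv(64 + 64, 32, 3)
        self.d1, self.c1 = up(32, 32, 3), conv(32 + 32, 16, 3)
        self.d0 = up(16, 16, 5)                                     # S7: 5x5 deconv
        self.out = nn.Conv2d(16 + 16, 1, 5, padding=2)              # S7: 5x5 conv

    def forward(self, pair: torch.Tensor) -> torch.Tensor:
        f1 = self.e1(pair)                  # H,   copy retained for stacking
        f2 = self.e2(self.pool(f1))         # H/2
        f3 = self.e3(self.pool(f2))         # H/4
        f4 = self.e4(self.pool(f3))         # H/8
        x = self.mid(self.pool(f4))         # H/16
        x = self.c3(torch.cat([self.d3(x), f4], dim=1))       # H/8
        x = self.c2(torch.cat([self.d2(x), f3], dim=1))       # H/4
        x = self.c1(torch.cat([self.d1(x), f2], dim=1))       # H/2
        return self.out(torch.cat([self.d0(x), f1], dim=1))   # H, inverse depth
```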
Step 4: reconstruct the first reconstructed image frame from the first depth map and the pose transformation, and the second reconstructed image frame from the second depth map and the pose transformation. Taking the first reconstructed image frame as an example, the reconstruction formula using the first depth map and the pose transformation is as follows:
p2 = K * T * D1(p1) * K^(-1) * p1
where D1 = f_D1(I1, I2) is the first depth map and f_D1 the depth-map computation function, T = f_T(I1, I2) is the pose transformation between the first and second image frames and f_T the pose-transformation computation function, p2 is the coordinate of each pixel in the first reconstructed image frame, p1 is the coordinate of the corresponding pixel in the first image frame, and K is the intrinsic matrix of the monocular camera.
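The following is a minimal sketch of this reconstruction: each pixel p1 of the first frame is back-projected with its depth D1(p1), moved by the pose transform T, re-projected with the intrinsics K to a coordinate p2 in the second frame, and the second frame is bilinearly sampled at p2 to form the reconstructed frame. se3_exp is the Lie-algebra sketch given earlier; the function name inverse_warp and the batch-size-1 shapes are assumptions.

```python
import torch
import torch.nn.functional as F

def inverse_warp(frame2, depth1, xi, K):
    """frame2: (1, 3, H, W); depth1: (1, 1, H, W) depth (1/d if the network
    outputs inverse depth d); xi: (6,) pose twist; K: (3, 3) intrinsic matrix."""
    _, _, H, W = frame2.shape
    T = se3_exp(xi)                                            # (4, 4) pose transform
    ys, xs = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                            torch.arange(W, dtype=torch.float32), indexing="ij")
    p1 = torch.stack([xs, ys, torch.ones_like(xs)], 0).reshape(3, -1)  # homogeneous p1
    cam = (torch.linalg.inv(K) @ p1) * depth1.reshape(1, -1)   # D1(p1) * K^(-1) * p1
    cam = torch.cat([cam, torch.ones(1, H * W)], dim=0)        # homogeneous 3-D point
    p2 = K @ (T @ cam)[:3]                                     # K * T * (...)
    p2 = p2[:2] / p2[2].clamp(min=1e-6)                        # perspective division
    gx = 2.0 * p2[0] / (W - 1) - 1.0                           # normalise to [-1, 1]
    gy = 2.0 * p2[1] / (H - 1) - 1.0
    grid = torch.stack([gx, gy], dim=-1).reshape(1, H, W, 2)
    return F.grid_sample(frame2, grid, align_corners=True)     # bilinear sampling
```

Pixels that project outside the second frame are filled with zeros by grid_sample; a practical implementation would mask them out of the reconstruction error.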
Step 5: compute the reconstruction error L from the first image frame and the first reconstructed image frame, and from the second image frame and the second reconstructed image frame; taking the minimisation of L as the objective, use L to fit the pose estimation network, the first depth-map estimation network and the second depth-map estimation network.
The deep neural network can be trained by computing the reconstruction error from the output first and second reconstructed image frames. The reconstruction error L is calculated as:
L = Σ_p1 || I1(p1) - I2(p2) ||
where the sum runs over all pixels, I1(p1) is the pixel value at coordinate p1 in the first image frame I1, and I2(p2) is the pixel value at coordinate p2 in the second image frame I2.
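The following is a minimal sketch of this error, reading it as a per-pixel photometric difference between each frame and its reconstruction. The L1 norm and the averaging over both frame reconstructions are assumptions, since the formula image is not reproduced in the text above.

```python
import torch

def reconstruction_loss(frame1, recon1, frame2, recon2):
    """Mean absolute photometric error over both reconstructed frames."""
    return (frame1 - recon1).abs().mean() + (frame2 - recon2).abs().mean()
```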
Step 6: apply the deep neural network combining the fitted pose estimation network, first depth-map estimation network and second depth-map estimation network to monocular visual positioning.
The same or similar reference labels correspond to the same or similar components;
The terms describing positional relationships in the drawings are for illustration only and shall not be construed as limiting this patent;
Obviously, the above embodiments of the present invention are merely examples given to clearly illustrate the present invention and are not intended to limit its embodiments. For those of ordinary skill in the art, other variations or changes in different forms may be made on the basis of the above description. It is neither necessary nor possible to exhaust all embodiments here. Any modification, equivalent replacement or improvement made within the spirit and principle of the present invention shall be included within the protection scope of the claims of the present invention.

Claims (9)

1. A monocular visual positioning method based on unsupervised learning, characterised by comprising the following steps:
S1: obtain video stream information and cut the video stream evenly into image frames;
S2: stack any two adjacent image frames, a first image frame and a second image frame, and input them into a pose estimation network to obtain the pose transformation between the poses corresponding to the first and second image frames;
S3: input the stacked first and second image frames separately into a first depth-map estimation network and a second depth-map estimation network to obtain a first depth map corresponding to the first image frame and a second depth map corresponding to the second image frame;
S4: reconstruct a first reconstructed image frame from the first depth map and the pose transformation, and a second reconstructed image frame from the second depth map and the pose transformation;
S5: compute a reconstruction error L from the first image frame and the first reconstructed image frame, and from the second image frame and the second reconstructed image frame; taking the minimisation of L as the objective, use L to fit the pose estimation network, the first depth-map estimation network and the second depth-map estimation network;
S6: apply the deep neural network combining the fitted pose estimation network, first depth-map estimation network and second depth-map estimation network to monocular visual positioning.
2. The monocular visual positioning method based on unsupervised learning according to claim 1, characterised in that: the pose estimation network in step S2 comprises a convolutional neural network CNN and fully connected layers.
3. The monocular visual positioning method based on unsupervised learning according to claim 2, characterised in that: the specific steps of step S2 include:
S2.1: input the stacked first and second image frames into the convolutional neural network CNN to extract image features;
S2.2: pass the extracted image features through the fully connected layers and output the pose transformation between the poses corresponding to the first and second image frames.
4. The monocular visual positioning method based on unsupervised learning according to claim 3, characterised in that: the pose transformation in step S2.2 is represented in the Lie algebra.
5. The monocular visual positioning method based on unsupervised learning according to any one of claims 1 to 4, characterised in that: the depth estimation networks in step S3 each comprise a convolutional neural network CNN and a decoder with a deconvolution structure.
6. The monocular visual positioning method based on unsupervised learning according to claim 5, characterised in that: the specific steps of step S3 include:
S3.1: pass the stacked first and second image frames through the convolutional neural network CNN of the first depth-map estimation network to complete the extraction of image features;
S3.2: pass the image features extracted in S3.1 through the deconvolution decoder of the first depth-map estimation network and output the first depth map corresponding to the first image frame;
S3.3: pass the stacked first and second image frames through the convolutional neural network CNN of the second depth-map estimation network to complete the extraction of image features;
S3.4: pass the image features extracted in S3.3 through the deconvolution decoder of the second depth-map estimation network and output the second depth map corresponding to the second image frame.
7. The monocular visual positioning method based on unsupervised learning according to claim 6, characterised in that: the depth values in the depth map of step S3.2 are the reciprocals of the depths.
8. The monocular visual positioning method based on unsupervised learning according to claim 1, characterised in that: the relationship between the first reconstructed image frame and the first image frame in step S4 satisfies:
p2 = K * T * D1(p1) * K^(-1) * p1
where D1 = f_D1(I1, I2) denotes the first depth map and f_D1 the first depth-map estimation network, I1 and I2 are the first and second image frames respectively, T = f_T(I1, I2) denotes the output of the pose estimation network, p2 is the coordinate of each pixel in the first reconstructed image frame, p1 is the coordinate of the corresponding pixel in the first image frame, and K is the intrinsic matrix of the monocular camera.
9. The monocular visual positioning method based on unsupervised learning according to claim 1, characterised in that: the reconstruction error L in step S5 is calculated as:
L = Σ_p1 || I1(p1) - I2(p2) ||
where the sum runs over all pixels, I1(p1) is the pixel value at coordinate p1 in the first image frame I1, and I2(p2) is the pixel value at coordinate p2 in the second image frame I2.
CN201811141754.3A 2018-09-28 2018-09-28 A monocular visual positioning method based on unsupervised learning Pending CN109472830A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811141754.3A CN109472830A (en) 2018-09-28 2018-09-28 A monocular visual positioning method based on unsupervised learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811141754.3A CN109472830A (en) 2018-09-28 2018-09-28 A monocular visual positioning method based on unsupervised learning

Publications (1)

Publication Number Publication Date
CN109472830A true CN109472830A (en) 2019-03-15

Family

ID=65664428

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811141754.3A Pending CN109472830A (en) 2018-09-28 2018-09-28 A monocular visual positioning method based on unsupervised learning

Country Status (1)

Country Link
CN (1) CN109472830A (en)

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110009674A (en) * 2019-04-01 2019-07-12 厦门大学 Monocular image depth of field real-time computing technique based on unsupervised deep learning
CN111325768A (en) * 2020-01-31 2020-06-23 武汉大学 Free floating target capture method based on 3D vision and simulation learning
CN111325784A (en) * 2019-11-29 2020-06-23 浙江省北大信息技术高等研究院 Unsupervised pose and depth calculation method and system
CN111340867A (en) * 2020-02-26 2020-06-26 清华大学 Depth estimation method and device for image frame, electronic equipment and storage medium
CN112085776A (en) * 2020-07-31 2020-12-15 山东科技大学 Method for estimating scene depth of unsupervised monocular image by direct method
CN112232152A (en) * 2020-09-30 2021-01-15 墨奇科技(北京)有限公司 Non-contact fingerprint identification method and device, terminal and storage medium
CN112307810A (en) * 2019-07-26 2021-02-02 北京初速度科技有限公司 Visual positioning effect self-checking method and vehicle-mounted terminal
CN113033582A (en) * 2019-12-09 2021-06-25 杭州海康威视数字技术股份有限公司 Model training method, feature extraction method and device
CN113496503A (en) * 2020-03-18 2021-10-12 广州极飞科技股份有限公司 Point cloud data generation and real-time display method, device, equipment and medium
WO2023004727A1 (en) * 2021-07-30 2023-02-02 华为技术有限公司 Video processing method, video processing device, and electronic device
CN117115786A (en) * 2023-10-23 2023-11-24 青岛哈尔滨工程大学创新发展中心 Depth estimation model training method for joint segmentation tracking and application method

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106600650A (en) * 2016-12-12 2017-04-26 杭州蓝芯科技有限公司 Binocular visual sense depth information obtaining method based on deep learning
CN106780543A (en) * 2017-01-13 2017-05-31 深圳市唯特视科技有限公司 A kind of double framework estimating depths and movement technique based on convolutional neural networks
CN107274445A (en) * 2017-05-19 2017-10-20 华中科技大学 A kind of image depth estimation method and system

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106600650A (en) * 2016-12-12 2017-04-26 杭州蓝芯科技有限公司 Binocular visual sense depth information obtaining method based on deep learning
CN106780543A (en) * 2017-01-13 2017-05-31 深圳市唯特视科技有限公司 A kind of double framework estimating depths and movement technique based on convolutional neural networks
CN107274445A (en) * 2017-05-19 2017-10-20 华中科技大学 A kind of image depth estimation method and system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
T. ZHOU et al.: "Unsupervised Learning of Depth and Ego-Motion from Video", 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) *
YULIANG ZOU et al.: "DF-Net: Unsupervised Joint Learning of Depth and Flow using Cross-Task Consistency", arXiv:1809.01649v1 *

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110009674A (en) * 2019-04-01 2019-07-12 厦门大学 Monocular image depth of field real-time computing technique based on unsupervised deep learning
CN112307810A (en) * 2019-07-26 2021-02-02 北京初速度科技有限公司 Visual positioning effect self-checking method and vehicle-mounted terminal
CN112307810B (en) * 2019-07-26 2023-08-04 北京魔门塔科技有限公司 Visual positioning effect self-checking method and vehicle-mounted terminal
CN111325784A (en) * 2019-11-29 2020-06-23 浙江省北大信息技术高等研究院 Unsupervised pose and depth calculation method and system
CN113033582B (en) * 2019-12-09 2023-09-26 杭州海康威视数字技术股份有限公司 Model training method, feature extraction method and device
CN113033582A (en) * 2019-12-09 2021-06-25 杭州海康威视数字技术股份有限公司 Model training method, feature extraction method and device
CN111325768A (en) * 2020-01-31 2020-06-23 武汉大学 Free floating target capture method based on 3D vision and simulation learning
CN111340867A (en) * 2020-02-26 2020-06-26 清华大学 Depth estimation method and device for image frame, electronic equipment and storage medium
CN111340867B (en) * 2020-02-26 2022-10-18 清华大学 Depth estimation method and device for image frame, electronic equipment and storage medium
CN113496503B (en) * 2020-03-18 2022-11-08 广州极飞科技股份有限公司 Point cloud data generation and real-time display method, device, equipment and medium
CN113496503A (en) * 2020-03-18 2021-10-12 广州极飞科技股份有限公司 Point cloud data generation and real-time display method, device, equipment and medium
CN112085776A (en) * 2020-07-31 2020-12-15 山东科技大学 Method for estimating scene depth of unsupervised monocular image by direct method
CN112085776B (en) * 2020-07-31 2022-07-19 山东科技大学 Direct method unsupervised monocular image scene depth estimation method
CN112232152A (en) * 2020-09-30 2021-01-15 墨奇科技(北京)有限公司 Non-contact fingerprint identification method and device, terminal and storage medium
WO2023004727A1 (en) * 2021-07-30 2023-02-02 华为技术有限公司 Video processing method, video processing device, and electronic device
CN117115786A (en) * 2023-10-23 2023-11-24 青岛哈尔滨工程大学创新发展中心 Depth estimation model training method for joint segmentation tracking and application method
CN117115786B (en) * 2023-10-23 2024-01-26 青岛哈尔滨工程大学创新发展中心 Depth estimation model training method for joint segmentation tracking and application method

Similar Documents

Publication Publication Date Title
CN109472830A (en) A monocular visual positioning method based on unsupervised learning
CN111739077B (en) Monocular underwater image depth estimation and color correction method based on depth neural network
CN110490928A (en) A kind of camera Attitude estimation method based on deep neural network
CN107767413A (en) A kind of image depth estimation method based on convolutional neural networks
CN110569768B (en) Construction method of face model, face recognition method, device and equipment
CN110490919A (en) A kind of depth estimation method of the monocular vision based on deep neural network
CN111626159B (en) Human body key point detection method based on attention residual error module and branch fusion
CN109447919B (en) Light field super-resolution reconstruction method combining multi-view angle and semantic texture features
CN113205595B (en) Construction method and application of 3D human body posture estimation model
CN106780588A (en) A kind of image depth estimation method based on sparse laser observations
CN106408524A (en) Two-dimensional image-assisted depth image enhancement method
CN111028150A (en) Rapid space-time residual attention video super-resolution reconstruction method
CN106101535A (en) A kind of based on local and the video stabilizing method of mass motion disparity compensation
CN108470324A (en) A kind of binocular stereo image joining method of robust
CN101916455A (en) Method and device for reconstructing three-dimensional model of high dynamic range texture
CN112634163A (en) Method for removing image motion blur based on improved cycle generation countermeasure network
CN112308918A (en) Unsupervised monocular vision odometer method based on pose decoupling estimation
CN107169928A (en) A kind of human face super-resolution algorithm for reconstructing learnt based on deep layer Linear Mapping
CN113077505A (en) Optimization method of monocular depth estimation network based on contrast learning
CN112950475A (en) Light field super-resolution reconstruction method based on residual learning and spatial transformation network
CN109658361A (en) A kind of moving scene super resolution ratio reconstruction method for taking motion estimation error into account
CN116030498A (en) Virtual garment running and showing oriented three-dimensional human body posture estimation method
Zhang et al. Recurrent interaction network for stereoscopic image super-resolution
CN117115359B (en) Multi-view power grid three-dimensional space data reconstruction method based on depth map fusion
CN114463237A (en) Real-time video rain removing method based on global motion compensation and inter-frame time domain correlation

Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
RJ01: Rejection of invention patent application after publication (application publication date: 20190315)