CN110060331A - Three-dimensional rebuilding method outside a kind of monocular camera room based on full convolutional neural networks - Google Patents
Three-dimensional rebuilding method outside a kind of monocular camera room based on full convolutional neural networks Download PDFInfo
- Publication number
- CN110060331A (application CN201910193450.XA)
- Authority
- CN
- China
- Prior art keywords
- picture
- plane
- pixel
- convolutional neural
- neural networks
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/22—Matching criteria, e.g. proximity measures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T17/00—Three dimensional [3D] modelling, e.g. data description of 3D objects
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/50—Depth or shape recovery
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/46—Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
- G06V10/462—Salient features, e.g. scale invariant feature transforms [SIFT]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/10—Image acquisition modality
- G06T2207/10028—Range image; Depth image; 3D point clouds
Abstract
The invention discloses an outdoor three-dimensional reconstruction method for a monocular camera based on a fully convolutional neural network. The method comprises the following steps. Step 1: train a fully convolutional neural network by means of supervised learning. Step 2: perform depth estimation on each picture with the fully convolutional neural network; a series of consecutive pictures of an outdoor scene is shot with a monocular camera, then each picture is taken as input and depth estimation is performed on it with the previously trained network, yielding its three-dimensional point cloud model. Step 3: fuse the three-dimensional models of all pictures into one complete three-dimensional model with the ICP algorithm. The invention solves the problem of three-dimensional reconstruction with a monocular camera, and can be realized on hardware systems such as an ordinary PC or workstation.
Description
Technical field
The invention belongs to the technical fields of computer vision and computer graphics, and in particular relates to an outdoor three-dimensional reconstruction method for a monocular camera based on a fully convolutional neural network.
Background technique
Three-dimensional reconstruction is an important and fundamental problem in computer vision and computer graphics, with a very wide range of applications in fields such as agriculture, industry, medicine, aerospace, the military, environmental observation and terrain exploration. One of its sub-branches, outdoor three-dimensional reconstruction of city scenes, plays an important role in fields such as map navigation and urban planning. Once a three-dimensional map of a city is available, people can conveniently view any corner of the city through various electronic devices; Google Maps is a very successful example in this respect. Combined with virtual reality and augmented reality, and integrated with living information, e-commerce, virtual communities and other services, it can bring people a more immersive experience. The research of outdoor three-dimensional reconstruction therefore has high scientific and application value.
In computer graphics, three-dimensional reconstruction with a monocular camera has always been an important and challenging problem. A monocular camera cannot directly obtain the depth of each pixel through triangulation, as a binocular camera does, or through time-of-flight or structured-light principles, as a depth camera does. However, after long development, monocular cameras are relatively mature in technology, low in cost, simple in structure, undemanding of computing resources and easier to commercialize; for example, the standard camera of almost every smartphone is a good monocular camera. Therefore, the method of the invention trains a fully convolutional neural network by means of supervised learning, performs depth estimation on each picture obtained by the monocular camera, and then fuses the results into one complete three-dimensional model, thereby completing the three-dimensional reconstruction.
Summary of the invention
The present invention aims to provide a useful solution to the three-dimensional reconstruction problem of monocular cameras. The input is a set of pictures of an outdoor scene shot by a monocular camera; the method of the invention performs depth estimation on each picture individually and finally fuses the results into one complete three-dimensional model.
The process proposed by the present invention comprises the following steps:
Step 1: train a fully convolutional neural network by means of supervised learning.
Step 2: perform depth estimation on each picture with the fully convolutional neural network. A series of consecutive pictures of an outdoor scene is shot with a monocular camera; then, taking each picture as input, depth estimation is performed on it with the previously trained network, yielding its three-dimensional point cloud model.
Step 3: fuse the three-dimensional model of each picture into one complete three-dimensional model with the ICP algorithm.
Step 1 is implemented as follows:
1-1. Prepare a large number of training pictures for training the network parameters. Each group of training pictures includes an ordinary color picture of the outdoor scene shot from a certain angle, the depth picture corresponding to that color picture, and pixel-level semantic segmentation information. Redundant data is rejected through the pixel-level semantic segmentation information in the SYNTHIA dataset.
1-2. Perform mathematical modeling on the picture data. Let {I_i, D_i}, i = 1, …, N denote the N groups of color pictures and depth pictures in the dataset, with known camera intrinsic matrix K. For any pixel q in color picture I_i, its homogeneous coordinates are [x, y, 1]^T, where T denotes transposition. Its corresponding point Q in three-dimensional space is then calculated with the following formula:
Q = D_i(q) · K^{-1} q   (Formula 1)
Suppose the normal vector of a plane in three-dimensional space is n̄, a real 1×3 vector. To make the normal vector of each plane unique, n is calculated as n = n̄/d, where n̄ denotes the unit normal vector of the plane, pointing from the origin toward the plane, and d denotes the distance of the plane from the origin. If a point Q lies on some plane, then n^T Q = 1 is satisfied.
Suppose there are M planes in color picture I_i; a pixel probability matrix S_i is then constructed for the color picture. S_i(q) is an (M+1)-dimensional vector whose j-th element, denoted S_i^j(q), indicates the probability that pixel q falls on the j-th plane, with j = 0 indicating non-planar. The plane parameters of the i-th picture can be obtained by minimizing an objective function consisting of a data term and a regularization term L_reg, where L_reg prevents the network from producing the trivial result in which all pixels are assigned to the non-planar class, and α is a weighting coefficient. When pixel q is projected from a picture into three-dimensional space, its corresponding point must, owing to the perspective structure, lie on a ray from q; denoting the depth of the intersection of this ray with the plane by λ, the three-dimensional coordinate of pixel q in space is λ K^{-1} q.
The regularization term L_reg is computed from the probability that pixel q falls on a plane (regardless of which plane), whose value range is [0, 1].
The semantic information in the dataset is divided into two classes: "retain" = {building, road, sidewalk, lane marking} and "discard" = {pedestrian, car, sky, bicycle}. If a pixel belongs to the "retain" class, let z(q) = 1; if it belongs to the "discard" class, let z(q) = 0. The regularization term above is then rewritten with z(q) as a per-pixel mask.
The fully convolutional neural network is divided into two parts: one part is used to segment the planes in the picture, and the other part is used to generate the three-dimensional point cloud model of the picture. The two parts share the same abstract feature maps.
Fusing the three-dimensional model of each picture into one complete three-dimensional model with the ICP algorithm, as described in Step 3, is implemented as follows:
3-1. Solve for the overlapping part of the two point clouds.
First, the SIFT algorithm is used to extract and match the feature points of two pictures, obtaining the matched point sets Q and Q′. From the transformation between these two point sets, the homography matrix H is obtained, i.e. Q′ = HQ.
Then the four vertex coordinates of the registered picture are calculated and image registration is carried out, yielding the pixel set of the overlapping region of the two pictures; through the semantic information, only the pixels belonging to the "retain" class are kept, finally obtaining the pixel set N′ = {1, …, n′}.
For the two known point clouds, the overlapping regions can be expressed as:
P = {p_1, …, p_{n′}},  P′ = {p′_1, …, p′_{n′}}   (Formula 9)
3-2. Find the rotation matrix R and translation vector t of a Euclidean transformation that matches the two point clouds, that is, p_i = R p′_i + t.
R and t are solved with the ICP algorithm by minimizing the sum of squared errors.
First, to calculate the rotation matrix R, the centroid positions p̄ and p̄′ of the two groups of point clouds are computed.
Then the de-centroided coordinates q_i and q′_i of every point in each group are calculated:
q_i = p_i − p̄,  q′_i = p′_i − p̄′   (Formula 13)
Define the matrix W = Σ_{i=1}^{n′} q_i q′_i^T; W is a 3×3 matrix. Performing SVD decomposition on W gives:
W = U Σ V^T
Then R is
R = U V^T
and the translation vector t is then calculated as
t = p̄ − R p̄′.
3-3. After the translation and rotation transformation, the point cloud in P′ is transformed into the coordinate system of P using p_i = R p′_i + t, thereby realizing the fusion of the two point clouds. This operation is applied to all point clouds until only one three-dimensional point cloud model remains, so as to complete the three-dimensional reconstruction of the entire outdoor scene.
Features and beneficial effects of the invention:
The present invention realizes an outdoor three-dimensional reconstruction method for a monocular camera based on a fully convolutional neural network, which is of considerable significance for three-dimensional reconstruction. The fully convolutional neural network trained by means of supervised learning can directly perform depth estimation on a color picture to obtain its three-dimensional point cloud model; all point cloud models are then fused, completing the three-dimensional reconstruction of the outdoor scene.
Compared with binocular cameras and depth cameras, the monocular camera has, after long development, relatively mature technology, low cost and a simple structure, and its demand on computing resources is not as high as that of the triangulation required by binocular cameras, so it is easier to commercialize. For example, the standard camera on almost every smartphone today is a monocular camera, and its imaging quality is good enough to be used directly.
This technique can be realized on hardware systems such as an ordinary PC or workstation.
Detailed description of the invention
Fig. 1 is an overall flow chart of the method of the present invention.
Fig. 2 shows the model of the fully convolutional neural network used in the present invention.
Specific embodiment
In order to make the above objects, features and advantages of the present invention clearer and more comprehensible, the present invention is described in further detail below with reference to the accompanying drawings and specific embodiments.
As shown in Fig. 1, the outdoor three-dimensional reconstruction method for a monocular camera based on a fully convolutional neural network includes the following steps:
Step 1: train the fully convolutional neural network by means of supervised learning.
As with other neural network models, a large number of training pictures must first be prepared for training the network parameters. Each group of training pictures includes an ordinary color picture of the outdoor scene shot from a certain angle, the depth picture corresponding to that color picture, and pixel-level semantic segmentation information. Since collecting and labeling a dataset manually would cost a great deal of time and effort, the SYNTHIA dataset can be used. Although its data all come from a virtual city, the scenes simulated by computer have a certain similarity to the real world. It should be noted that the original purpose of this dataset is autonomous driving: the data are acquired by simulating a car driving under real traffic conditions and shooting a photo at regular intervals from a fixed position and angle on the vehicle. The dataset therefore contains many groups of nearly identical pictures, and redundant data can be rejected based on the vehicle speed to avoid meaningless computation. In addition, unwanted parts of each picture must also be rejected: information such as the pedestrians and cars in a picture need not be included in the reconstructed three-dimensional model, while information such as roads and building surfaces must be retained. This step can be completed through the pixel-level semantic segmentation information in the SYNTHIA dataset, and the detailed process is embodied in the regularization term below.
Before introducing the neural network model, mathematical modeling of the picture data is needed. Let {I_i, D_i}, i = 1, …, N denote the N groups of color pictures and depth pictures in the dataset, with known camera intrinsic matrix K. For any pixel q in color picture I_i, its homogeneous coordinates are [x, y, 1]^T, where T denotes transposition. Its corresponding point Q in three-dimensional space is then calculated with the following formula:
Q = D_i(q) · K^{-1} q   (Formula 1)
Almost all of the data involved in the three-dimensional reconstruction process concern plane information. Suppose the normal vector of a plane in three-dimensional space is n̄, a real 1×3 vector. To make the normal vector of each plane unique, n is calculated as n = n̄/d, where n̄ denotes the unit normal vector of the plane, pointing from the origin toward the plane, and d denotes the distance of the plane from the origin. If a point Q lies on some plane, then n^T Q = 1 is satisfied.
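Formula 1 and the plane-incidence condition n^T Q = 1 can be sketched with NumPy. This is only an illustration, not the patent's implementation; the intrinsic matrix below is a made-up example:

```python
import numpy as np

# Example camera intrinsics (fx, fy, cx, cy are illustrative values).
K = np.array([[500.0,   0.0, 320.0],
              [  0.0, 500.0, 240.0],
              [  0.0,   0.0,   1.0]])

def back_project(q_pixel, depth, K):
    """Formula 1: Q = D_i(q) * K^{-1} q for the homogeneous pixel q = [x, y, 1]^T."""
    q = np.array([q_pixel[0], q_pixel[1], 1.0])
    return depth * (np.linalg.inv(K) @ q)

def plane_param(unit_normal, d):
    """Unique plane parameter n = n_bar / d, so points on the plane satisfy n^T Q = 1."""
    return np.asarray(unit_normal, dtype=float) / d

# A plane 4 units from the origin along +z, and the pixel at the principal point,
# whose ray runs along the optical axis.
n = plane_param([0.0, 0.0, 1.0], d=4.0)
Q = back_project((320.0, 240.0), depth=4.0, K=K)
print(n @ Q)  # the back-projected point satisfies n^T Q = 1
```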
Suppose there are M planes in color picture I_i; a pixel probability matrix S_i is then constructed for the picture. S_i(q) is an (M+1)-dimensional vector whose j-th element, denoted S_i^j(q), indicates the probability that pixel q falls on the j-th plane, with j = 0 indicating non-planar. The plane parameters of the i-th picture can be obtained by minimizing an objective function consisting of a data term and a regularization term L_reg, where L_reg prevents the network from producing the trivial result in which all pixels are assigned to the non-planar class, and α is a weighting coefficient. When pixel q is projected from a picture into three-dimensional space, its corresponding point must, owing to the perspective structure, lie on a ray from q; denoting the depth of the intersection of this ray with the plane by λ, the three-dimensional coordinate of pixel q in space is λ K^{-1} q.
The regularization term L_reg is computed from the probability that pixel q falls on a plane (regardless of which plane), whose value range is [0, 1]. It should be noted that not all pixels participate in the three-dimensional reconstruction process. Whether pixels with different semantic information logically need to be reconstructed differs: pixels with semantic information such as roads and building exteriors should be included in the reconstructed three-dimensional point cloud model, while pixels with semantic information such as pedestrians and cars should be removed. Therefore, the semantic information in the dataset can be divided into two classes: "retain" = {building, road, sidewalk, lane marking, …} and "discard" = {pedestrian, car, sky, bicycle, …}. If a pixel belongs to the "retain" class, let z(q) = 1; if it belongs to the "discard" class, let z(q) = 0. The regularization term above can then be rewritten with z(q) as a per-pixel mask.
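The regularization formulas appear as images in the original patent and are not reproduced here. As a loose sketch only, assuming the masked regularizer simply sums the non-planar probability S_i^0(q) over "retain" pixels (an assumption, not the patent's exact formula), it could look like:

```python
import numpy as np

def masked_plane_regularizer(S, z):
    """Hypothetical reconstruction of the masked regularization term.

    S : (H, W, M+1) per-pixel plane probabilities; channel 0 is "non-planar".
    z : (H, W) semantic mask, 1 for "retain" pixels, 0 for "discard" pixels.
    Penalizes retained pixels that the network marks as non-planar, which
    discourages the trivial all-non-planar solution described in the text.
    """
    return float(np.sum(z * S[..., 0]))

# Toy example: a 2x2 picture with M = 1 plane.
S = np.array([[[0.75, 0.25], [0.25, 0.75]],
              [[0.50, 0.50], [0.00, 1.00]]])
z = np.array([[1, 1],
              [0, 1]])
print(masked_plane_regularizer(S, z))  # 0.75 + 0.25 + 0.0 = 1.0
```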
The fully convolutional neural network model used in the present invention is trained from scratch in the openly available TensorFlow framework; its structure is shown in Fig. 2. The whole network architecture is divided into two parts. One part is used to segment the planes in the picture: because the planes of an outdoor scene account for a substantial portion of the data in the whole reconstruction process, they are computed separately to guarantee the accuracy of the final result. In this part, the activation function of the prediction layer is the Softmax function, and all other layers use the ReLU function. The other part generates the three-dimensional point cloud model of the picture; it shares the same abstract feature maps with the first part. It contains two stride-2 convolutional layers (3×3×512), followed by a 1×1×3M convolutional layer that outputs the parameters of the M planes, and then a global average pooling layer. Except for the last layer, which needs no activation function, all layers use the ReLU function. In the final parameter design, α = 0.1 and the number of planes M = 5. When training the model, the Adam optimization algorithm can be used, with β₁ = 0.99, β₂ = 0.9999, a learning rate of 0.0001 and a batch size of 4.
Step 2: perform depth estimation on each picture with the fully convolutional neural network.
A series of consecutive pictures of an outdoor scene is shot with a monocular camera; then, taking each picture as input, depth estimation is performed on it with the previously trained fully convolutional neural network, obtaining its three-dimensional point cloud model.
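Turning a predicted depth map into the point cloud of Step 2 amounts to applying Formula 1 to every pixel at once. A minimal NumPy sketch (the toy intrinsics and the flat depth map are made-up illustrations):

```python
import numpy as np

def depth_to_point_cloud(depth, K):
    """Back-project an (H, W) depth map into an (H*W, 3) point cloud: Q = D(q) * K^{-1} q."""
    H, W = depth.shape
    xs, ys = np.meshgrid(np.arange(W), np.arange(H))
    # Homogeneous pixel coordinates [x, y, 1]^T, one column per pixel (3 x HW).
    pix = np.stack([xs, ys, np.ones_like(xs)], axis=-1).reshape(-1, 3).T
    rays = np.linalg.inv(K) @ pix.astype(float)       # viewing ray per pixel
    return (rays * depth.reshape(-1)).T               # scale each ray by its depth

K = np.array([[500.0,   0.0, 2.0],
              [  0.0, 500.0, 1.5],
              [  0.0,   0.0, 1.0]])   # toy intrinsics for a 4x3 "picture"
depth = np.full((3, 4), 2.0)          # a flat wall 2 units in front of the camera
cloud = depth_to_point_cloud(depth, K)
print(cloud.shape)                    # (12, 3); every point ends up at z = 2.0
```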
Step 3: fuse the three-dimensional model of each picture into one complete three-dimensional model with the ICP algorithm.
After the three-dimensional point cloud model of each picture is obtained, the models must be fused into a single point cloud model. The Iterative Closest Point (hereinafter ICP) algorithm is a point cloud matching algorithm for solving the 3D-3D pose estimation problem. The point clouds of two pictures with adjacent shooting times are taken: since their shooting times are close, the difference between them is small, their three-dimensional point clouds overlap greatly, and they are well suited for matching and fusion.
Before applying the ICP algorithm, however, the overlapping part of the two point clouds must be determined. Since the point cloud models are estimated from the color pictures, the overlapping region of the two color pictures is calculated directly here to guarantee the accuracy of the overlapping pixel set. First, the SIFT algorithm is used to extract and match the feature points of the two pictures, obtaining the matched point sets Q and Q′. From the transformation between these two point sets, the homography matrix H can be obtained, i.e. Q′ = HQ. Then the four vertex coordinates of the registered picture are calculated and image registration is carried out, yielding the pixel set of the overlapping region of the two pictures; through the semantic information, only the pixels belonging to the "retain" class are kept, finally obtaining the pixel set N′ = {1, …, n′}.
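The overlap computation of step 3-1 can be illustrated as follows: given the homography H between two pictures (here a made-up pure translation rather than one estimated from SIFT matches), warping the four corners of one picture and clipping against the other picture's bounds yields the overlapping region:

```python
import numpy as np

def warp_points(H, pts):
    """Apply homography H to Nx2 points (Q' = HQ on homogeneous coordinates)."""
    homog = np.hstack([pts, np.ones((len(pts), 1))])
    mapped = (H @ homog.T).T
    return mapped[:, :2] / mapped[:, 2:3]      # de-homogenize

def overlap_box(H, w, h):
    """Warp the four corners of a w x h picture and clip to the target picture bounds."""
    corners = np.array([[0, 0], [w, 0], [w, h], [0, h]], dtype=float)
    warped = warp_points(H, corners)
    x0, y0 = np.maximum(warped.min(axis=0), 0.0)
    x1, y1 = np.minimum(warped.max(axis=0), [float(w), float(h)])
    return tuple(float(v) for v in (x0, y0, x1, y1))

# Toy homography: the second picture sits 100 px to the right of the first.
H = np.array([[1.0, 0.0, 100.0],
              [0.0, 1.0,   0.0],
              [0.0, 0.0,   1.0]])
print(overlap_box(H, w=640, h=480))  # (100.0, 0.0, 640.0, 480.0)
```

In practice the axis-aligned box would be replaced by the exact warped quadrilateral and intersected with the semantic "retain" mask, as the text describes.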
For the two known point clouds, the overlapping regions can be expressed as:
P = {p_1, …, p_{n′}},  P′ = {p′_1, …, p′_{n′}}
If the rotation matrix R and translation vector t of a Euclidean transformation are found, the two point clouds can be matched, that is, p_i = R p′_i + t. R and t can be solved with the ICP algorithm; the present invention uses a linear-algebra solution, whose purpose is to obtain R and t by minimizing the sum of squared errors.
First, to calculate the rotation matrix R, the centroid positions p̄ and p̄′ of the two groups of point clouds are computed, and then the de-centroided coordinates q_i and q′_i of every point in each group are calculated:
q_i = p_i − p̄,  q′_i = p′_i − p̄′
Define the matrix W = Σ_{i=1}^{n′} q_i q′_i^T; it is a 3×3 matrix. Performing SVD decomposition on W gives:
W = U Σ V^T
Then R is
R = U V^T
and the translation vector t can then be calculated as
t = p̄ − R p̄′.
After the translation and rotation transformation, the point cloud in P′ is transformed into the coordinate system of P using p_i = R p′_i + t, thereby realizing the fusion of the two point clouds. This operation is applied to all point clouds until only one three-dimensional point cloud model remains, so as to complete the three-dimensional reconstruction of the entire outdoor scene.
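The closed-form alignment of steps 3-2 and 3-3 (centroids, W = Σ q_i q′_i^T, SVD, R = UV^T, t = p̄ − Rp̄′) can be sketched and checked on synthetic data; this is an illustrative implementation of the SVD step only, with made-up test data, not the patent's code:

```python
import numpy as np

def align_svd(P, P_prime):
    """Solve p_i ≈ R p'_i + t in closed form via SVD (the linear-algebra step of ICP).

    P, P_prime : (n, 3) arrays of corresponding overlap points.
    """
    c, c_prime = P.mean(axis=0), P_prime.mean(axis=0)   # centroids
    Q, Q_prime = P - c, P_prime - c_prime               # de-centroided coordinates
    W = Q.T @ Q_prime                                   # 3x3, equals sum_i q_i q'_i^T
    U, _, Vt = np.linalg.svd(W)
    R = U @ Vt
    if np.linalg.det(R) < 0:                            # guard against a reflection
        U[:, -1] *= -1
        R = U @ Vt
    t = c - R @ c_prime
    return R, t

# Synthetic check: rotate a cloud 90 degrees about z and shift it, then recover R and t.
rng = np.random.default_rng(0)
P_prime = rng.random((50, 3))
R_true = np.array([[0.0, -1.0, 0.0],
                   [1.0,  0.0, 0.0],
                   [0.0,  0.0, 1.0]])
t_true = np.array([1.0, 2.0, 3.0])
P = P_prime @ R_true.T + t_true
R, t = align_svd(P, P_prime)
merged = P_prime @ R.T + t        # transform P' into P's coordinate system and fuse
print(np.allclose(merged, P))     # True
```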
Claims (2)
1. An outdoor three-dimensional reconstruction method for a monocular camera based on a fully convolutional neural network, characterized by comprising the following steps:
Step 1: train a fully convolutional neural network by means of supervised learning;
Step 2: perform depth estimation on each picture with the fully convolutional neural network; a series of consecutive pictures of an outdoor scene is shot with a monocular camera, then, taking each picture as input, depth estimation is performed on it with the previously trained fully convolutional neural network, obtaining its three-dimensional point cloud model;
Step 3: fuse the three-dimensional model of each picture into one complete three-dimensional model with the ICP algorithm;
wherein Step 1 is implemented as follows:
1-1. prepare a large number of training pictures for training the network parameters; each group of training pictures includes an ordinary color picture of the outdoor scene shot from a certain angle, the depth picture corresponding to that color picture, and pixel-level semantic segmentation information; redundant data is rejected through the pixel-level semantic segmentation information in the SYNTHIA dataset;
1-2. perform mathematical modeling on the picture data; let {I_i, D_i} denote the N groups of color pictures and depth pictures in the dataset, with known camera intrinsic matrix K; for any pixel q in color picture I_i, its homogeneous coordinates are [x, y, 1]^T, where T denotes transposition; its corresponding point Q in three-dimensional space is calculated with the following formula:
Q = D_i(q) · K^{-1} q   (Formula 1)
suppose the normal vector of a plane in three-dimensional space is n̄, a real 1×3 vector; to make the normal vector of each plane unique, n is calculated as n = n̄/d, where n̄ denotes the unit normal vector of the plane, pointing from the origin toward the plane, and d denotes the distance of the plane from the origin; if a point Q lies on some plane, then n^T Q = 1 is satisfied;
suppose there are M planes in color picture I_i; a pixel probability matrix S_i is then constructed for the color picture; S_i(q) is an (M+1)-dimensional vector whose j-th element, denoted S_i^j(q), indicates the probability that pixel q falls on the j-th plane, with j = 0 indicating non-planar; the plane parameters of the i-th picture are obtained by minimizing an objective function consisting of a data term and a regularization term L_reg, where L_reg prevents the network from producing the trivial result in which all pixels are assigned to the non-planar class, and α is a weighting coefficient; when pixel q is projected from a picture into three-dimensional space, its corresponding point must, owing to the perspective structure, lie on a ray from q; denoting the depth of the intersection of this ray with the plane by λ, the three-dimensional coordinate of pixel q in space is λ K^{-1} q;
the regularization term L_reg is computed from the probability that pixel q falls on a plane, whose value range is [0, 1];
the semantic information in the dataset is divided into two classes: "retain" = {building, road, sidewalk, lane marking} and "discard" = {pedestrian, car, sky, bicycle}; if a pixel belongs to the "retain" class, let z(q) = 1; if it belongs to the "discard" class, let z(q) = 0; the regularization term above is then rewritten with z(q) as a per-pixel mask;
the fully convolutional neural network is divided into two parts: one part is used to segment the planes in the picture, and the other part is used to generate the three-dimensional point cloud model of the picture; the two parts share the same abstract feature maps.
2. The outdoor three-dimensional reconstruction method for a monocular camera based on a fully convolutional neural network according to claim 1, characterized in that fusing the three-dimensional model of each picture into one complete three-dimensional model with the ICP algorithm in Step 3 is implemented as follows:
3-1. solve for the overlapping part of the two point clouds;
first, the SIFT algorithm is used to extract and match the feature points of two pictures, obtaining the matched point sets Q and Q′; from the transformation between these two point sets, the homography matrix H is obtained, i.e. Q′ = HQ;
then the four vertex coordinates of the registered picture are calculated and image registration is carried out, yielding the pixel set of the overlapping region of the two pictures; through the semantic information, only the pixels belonging to the "retain" class are kept, finally obtaining the pixel set N′ = {1, …, n′};
for the two known point clouds, the overlapping regions can be expressed as:
P = {p_1, …, p_{n′}},  P′ = {p′_1, …, p′_{n′}}   (Formula 9)
3-2. find the rotation matrix R and translation vector t of a Euclidean transformation that matches the two point clouds, that is, p_i = R p′_i + t;
R and t are solved with the ICP algorithm by minimizing the sum of squared errors;
first, to calculate the rotation matrix R, the centroid positions p̄ and p̄′ of the two groups of point clouds are computed;
then the de-centroided coordinates q_i and q′_i of every point in each group are calculated:
q_i = p_i − p̄,  q′_i = p′_i − p̄′   (Formula 13)
define the matrix W = Σ_{i=1}^{n′} q_i q′_i^T; W is a 3×3 matrix; performing SVD decomposition on W gives:
W = U Σ V^T
then R is
R = U V^T
and the translation vector t is then calculated as
t = p̄ − R p̄′;
3-3. after the translation and rotation transformation, the point cloud in P′ is transformed into the coordinate system of P using p_i = R p′_i + t, thereby realizing the fusion of the two point clouds; this operation is applied to all point clouds until only one three-dimensional point cloud model remains, so as to complete the three-dimensional reconstruction of the entire outdoor scene.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910193450.XA CN110060331A (en) | 2019-03-14 | 2019-03-14 | Three-dimensional rebuilding method outside a kind of monocular camera room based on full convolutional neural networks |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910193450.XA CN110060331A (en) | 2019-03-14 | 2019-03-14 | Three-dimensional rebuilding method outside a kind of monocular camera room based on full convolutional neural networks |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110060331A true CN110060331A (en) | 2019-07-26 |
Family
ID=67316063
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910193450.XA Pending CN110060331A (en) | 2019-03-14 | 2019-03-14 | Outdoor three-dimensional reconstruction method for a monocular camera based on a fully convolutional neural network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110060331A (en) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP3349176A1 (en) * | 2017-01-17 | 2018-07-18 | Facebook, Inc. | Three-dimensional scene reconstruction from set of two-dimensional images for consumption in virtual reality |
CN109461180A (en) * | 2018-09-25 | 2019-03-12 | 北京理工大学 | A kind of method for reconstructing three-dimensional scene based on deep learning |
Non-Patent Citations (2)
Title |
---|
FENGTING YANG ET AL: "Recovering 3D Planes from a Single Image via Convolutional Neural Networks", 《RECOVERING 3D PLANES FROM A SINGLE IMAGE VIA CONVOLUTIONAL NEURAL NETWORKS》 * |
陈英博 (CHEN Yingbo): "Three-dimensional reconstruction technology combining Kinect point cloud data with sequential images", 《China Masters' Theses Full-text Database, Information Science and Technology》 *
Cited By (17)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110781937A (en) * | 2019-10-16 | 2020-02-11 | 广州大学 | Point cloud feature extraction method based on global visual angle |
CN110781937B (en) * | 2019-10-16 | 2022-05-17 | 广州大学 | Point cloud feature extraction method based on global visual angle |
CN111340864B (en) * | 2020-02-26 | 2023-12-12 | 浙江大华技术股份有限公司 | Three-dimensional scene fusion method and device based on monocular estimation |
CN111340864A (en) * | 2020-02-26 | 2020-06-26 | 浙江大华技术股份有限公司 | Monocular estimation-based three-dimensional scene fusion method and device |
CN111918049A (en) * | 2020-08-14 | 2020-11-10 | 广东申义实业投资有限公司 | Three-dimensional imaging method and device, electronic equipment and storage medium |
CN111918049B (en) * | 2020-08-14 | 2022-09-06 | 广东申义实业投资有限公司 | Three-dimensional imaging method and device, electronic equipment and storage medium |
CN111709976A (en) * | 2020-08-24 | 2020-09-25 | 湖南国科智瞳科技有限公司 | Rapid registration method and system for microscopic image and computer equipment |
CN112085801A (en) * | 2020-09-08 | 2020-12-15 | 清华大学苏州汽车研究院(吴江) | Calibration method for three-dimensional point cloud and two-dimensional image fusion based on neural network |
CN112085801B (en) * | 2020-09-08 | 2024-03-19 | 清华大学苏州汽车研究院(吴江) | Calibration method for fusion of three-dimensional point cloud and two-dimensional image based on neural network |
CN112381887A (en) * | 2020-11-17 | 2021-02-19 | 广东电科院能源技术有限责任公司 | Multi-depth camera calibration method, device, equipment and medium |
CN112381887B (en) * | 2020-11-17 | 2021-09-03 | 南方电网电力科技股份有限公司 | Multi-depth camera calibration method, device, equipment and medium |
CN113180832A (en) * | 2021-04-21 | 2021-07-30 | 上海盼研机器人科技有限公司 | Semi-surface short and small operation tractor positioning system based on mechanical arm |
CN113674421A (en) * | 2021-08-25 | 2021-11-19 | 北京百度网讯科技有限公司 | 3D target detection method, model training method, related device and electronic equipment |
CN113674421B (en) * | 2021-08-25 | 2023-10-13 | 北京百度网讯科技有限公司 | 3D target detection method, model training method, related device and electronic equipment |
CN114937122A (en) * | 2022-06-16 | 2022-08-23 | 黄冈强源电力设计有限公司 | Rapid three-dimensional model reconstruction method for cement fiberboard house |
CN116012564B (en) * | 2023-01-17 | 2023-10-20 | 宁波艾腾湃智能科技有限公司 | Equipment and method for intelligent fusion of three-dimensional model and live-action photo |
CN116012564A (en) * | 2023-01-17 | 2023-04-25 | 宁波艾腾湃智能科技有限公司 | Equipment and method for intelligent fusion of three-dimensional model and live-action photo |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110060331A (en) | Outdoor three-dimensional reconstruction method for a monocular camera based on a fully convolutional neural network | |
CN108596101B (en) | Remote sensing image multi-target detection method based on convolutional neural network | |
CN110622213B (en) | System and method for depth localization and segmentation using 3D semantic maps | |
CN107679537B (en) | Texture-free space target pose estimation algorithm based on contour point ORB feature matching | |
CN112150575B (en) | Scene data acquisition method, model training method and device and computer equipment | |
Vineet et al. | Incremental dense semantic stereo fusion for large-scale semantic scene reconstruction | |
CN108898676B (en) | Method and system for detecting collision and shielding between virtual and real objects | |
Tian et al. | Depth estimation using a self-supervised network based on cross-layer feature fusion and the quadtree constraint | |
CN108665496A (en) | End-to-end semantic simultaneous localization and mapping method based on deep learning | |
WO2022165809A1 (en) | Method and apparatus for training deep learning model | |
CN107292965A (en) | Mutual occlusion processing method based on depth image data streams | |
WO2019239211A2 (en) | System and method for generating simulated scenes from open map data for machine learning | |
CN106780592A (en) | Kinect depth reconstruction algorithm based on camera motion and image shading | |
CN106803267A (en) | Kinect-based indoor scene three-dimensional reconstruction method | |
CN113256778B (en) | Method, device, medium and server for generating vehicle appearance part identification sample | |
CN115272591B (en) | Geographic entity polymorphic expression method based on three-dimensional semantic model | |
CN104537705A (en) | Augmented reality based mobile platform three-dimensional biomolecule display system and method | |
Li et al. | Three-dimensional traffic scenes simulation from road image sequences | |
CN116580161B (en) | Building three-dimensional model construction method and system based on image and NeRF model | |
Hospach et al. | Simulation of falling rain for robustness testing of video-based surround sensing systems | |
CN109727314A (en) | Augmented reality scene fusion and display method | |
Zhao et al. | Autonomous driving simulation for unmanned vehicles | |
CN111599007B (en) | Smart city CIM road mapping method based on unmanned aerial vehicle aerial photography | |
CN110378250A (en) | Training method, device and the terminal device of neural network for scene cognition | |
CN114677479A (en) | Natural landscape multi-view three-dimensional reconstruction method based on deep learning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
Application publication date: 20190726 |