CN106780592B - Kinect depth reconstruction method based on camera motion and image shading - Google Patents

Kinect depth reconstruction method based on camera motion and image shading

Info

Publication number
CN106780592B
CN106780592B (application CN201611061364.6A)
Authority
CN
China
Prior art keywords
depth
pixel
point
camera
value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201611061364.6A
Other languages
Chinese (zh)
Other versions
CN106780592A (en)
Inventor
青春美
黄韬
袁书聪
徐向民
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China University of Technology SCUT filed Critical South China University of Technology SCUT
Publication of CN106780592A
Application granted
Publication of CN106780592B
Legal status: Active


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T2207/10 - Image acquisition modality
    • G06T2207/10016 - Video; Image sequence
    • G06T2207/10021 - Stereoscopic video; Stereoscopic image sequence
    • G06T2207/10024 - Color image
    • G06T2207/10028 - Range image; Depth image; 3D point clouds

Landscapes

  • Image Processing (AREA)

Abstract

The invention discloses a Kinect depth reconstruction method based on camera motion and image shading, which comprises the following steps: 1) with the Kinect depth camera and the RGB camera calibrated and aligned, upload the data collected by the Kinect to a computer through a third-party interface; 2) recover the three-dimensional scene structure and the motion track of the Kinect RGB camera from the RGB video sequence, obtaining a point cloud and the camera motion relation; 3) reconstruct the image depth by combining the point cloud and camera motion relation obtained in step 2) with the shading information of the image. The method requires no physical modification of the depth camera, no complex combination of devices, and none of the complex and demanding illumination calibration steps that traditional depth reconstruction methods rely on, which confine them to laboratory conditions with little practical value; compared with traditional methods, it therefore has greater practical application value and significance.

Description

Kinect depth reconstruction method based on camera motion and image shading
Technical Field
The invention relates to the field of depth reconstruction in computer image processing, in particular to a Kinect depth reconstruction method based on camera motion and image shading.
Background
With the advent and popularization in recent years of relatively inexpensive consumer depth cameras such as the Microsoft Kinect and the ASUS Xtion Pro, depth information has been widely applied in fields such as motion-sensing games, real-time three-dimensional reconstruction, augmented reality and virtual reality, and has become an important support for the development of new human-computer interaction modes. However, most consumer depth cameras currently on the market suffer from insufficient depth detection precision and excessive interference noise, which seriously affects the quality of application products based on depth information. Therefore, how to acquire more accurate depth information is of great significance for applications developed on top of depth information.
Driven by these requirements, depth reconstruction algorithms are receiving more and more attention from academia and industry. A recent line of work reconstructs the depth map with the help of ideas from three-dimensional reconstruction in computer graphics, and this patent follows the same idea. The main approaches to three-dimensional reconstruction currently include recovering the three-dimensional scene structure from motion information (structure from motion), reconstructing object shape from the shading of an image (shape from shading), photometric stereo, and so on. This patent mainly uses the first two: recovering the three-dimensional scene structure from motion information and reconstructing object shape from image shading.
The method for recovering the three-dimensional scene structure from the motion information mainly utilizes the motion process of a camera to dynamically generate and correct a three-dimensional point cloud, and the typical representation of the method is a monocular camera-based SLAM system. The method for reconstructing the shape of the object from the brightness of the image is to establish an effective illumination model by using the brightness of the image and solve the illumination model by using an optimization method, so that the surface shape information of the target can be acquired.
By utilizing and improving the ideas of the two methods and utilizing the close relation between the depth map and the point cloud, the depth map can be effectively optimized and reconstructed, and a more accurate depth result is obtained.
Disclosure of Invention
The invention aims to overcome the defect of insufficient depth detection precision of the existing civil depth camera, and provides a Kinect depth reconstruction method based on camera motion and image shading.
In order to achieve the purpose, the technical scheme provided by the invention is as follows: the Kinect depth reconstruction method based on camera motion and image shading comprises the following steps:
1) under the condition that the Kinect depth camera and the RGB camera are calibrated and aligned, uploading data collected by the Kinect to a computer through a third-party interface;
2) recovering a three-dimensional scene structure and a motion track of a kinect RGB camera from an RGB video sequence to obtain a point cloud and camera motion relation;
3) and (3) reconstructing the image depth by combining the point cloud obtained in the step 2) and the camera motion relation and utilizing the light and shade condition information of the image.
The step 2) comprises the following steps:
2.1) reading an RGB picture as a key frame when a system is initialized, binding a depth map to the key frame, traversing the depth map, and assigning a random value to each pixel position, wherein the depth map and the gray map have the same dimensionality;
2.2) for each read RGB picture, the following cost function is constructed:
E(ζji) = Σp ‖ rp²(p, ζji) / σrp² ‖δ

wherein ‖·‖δ is the Huber norm, rp represents the photometric error, and σrp² represents the variance of that error;
the Huber norm is defined as follows:

‖r²‖δ = r²/(2δ) for |r| ≤ δ, and |r| - δ/2 otherwise

δ is the parameter of the Huber norm;
the error function rp is defined as follows:
rp(p,ζji)=Ii(p)-Ij(w(p,Di(p),ζji))
Ii(p) represents the gray value at the position of pixel p in frame i, ζji is the Lie algebra of the rigid-body transformation that rotationally translates a three-dimensional point from the i coordinate system to the j coordinate system, Di(p) denotes the depth value at the position corresponding to pixel p in the depth map of the reference frame, and w(p, Di(p), ζji) transforms the three-dimensional point corresponding to the position of pixel p in reference frame i to its position in current frame j through the rotation-translation rigid-body transformation; the warp back-projects p using the depth Di(p), applies the rigid-body transformation, and re-projects the point with the projection formulas shown as images in the original, wherein X, Y and Z respectively represent the coordinates of the three-dimensional point along the X, Y and Z axes of the camera coordinate system, u and v represent pixel coordinates, and fx, fy respectively represent the focal lengths in the X and Y directions;
the variance σrp² is defined as follows:

σrp² = 2σI² + (∂rp/∂Di(p))²·Vi(p)

wherein σI² represents the variance of the image gray values and Vi(p) represents the variance of pixel point p of the reference frame depth map;
2.3) solving for the ζ that minimizes the cost function of step 2.2) by the Gauss-Newton iteration method, so as to obtain the rotation-translation relation between the reference frame and the current frame;
2.4) computing the gradient of all points on the reference-frame gray image, and selecting the points whose gradient is greater than a threshold; then screening these points; traversing all the points that meet the requirements, and searching for their corresponding points on the epipolar line of the current frame according to the epipolar geometry; calculating the spatial coordinates of the points using monocular three-dimensional reconstruction geometry;
2.5) fusing the newly obtained depth value with the depth value in the depth map of the reference frame by using a Kalman filter.
The step 3) comprises the following steps:
3.1) aligning the depth image collected by the depth camera in the current frame with the color image collected by the monocular color camera; because the color camera and the depth camera have different fields of view, only the overlapping part of the two fields of view has valid depth values, so an incomplete depth map is obtained after alignment;
3.2) generating a three-dimensional point cloud from the incomplete depth map according to the pinhole camera model of the depth camera; the pinhole camera model is briefly described as follows: the relationship between the spatial coordinates [ x, y, z ] of a spatial point and its pixel coordinates [ u, v, d ] in the image is expressed as:
z=d/s
x=(u-cx)·z/fx
y=(v-cy)·z/fy
where d is the depth value of each pixel in the depth map, s is the scaling factor of the depth map, cx and cy are the abscissa and ordinate of the principal point, and fx and fy are the focal length components in the abscissa and ordinate directions;
converting the pixel coordinate of each pixel into a corresponding space coordinate by using the formula, and then completing the conversion from the depth map to the three-dimensional point cloud;
3.3) registering the point cloud generated by the monocular algorithm and the point cloud generated from the incomplete depth map by using a point-to-point iterative closest point (ICP) algorithm, so as to obtain the rotation matrix R and translation matrix T between the two point clouds;
3.4) converting the point cloud obtained by the monocular algorithm to the coordinate system of the point cloud generated from the incomplete depth map according to the obtained rotation and translation matrices, and merging the two into one large point cloud;
3.5) for the regions of the depth map whose depth values are invalid because of the non-overlapping fields of view, calculating the spatial position of each pixel in the invalid region; if the spatial point corresponding to the pixel coincides exactly with a point in the large point cloud, directly assigning the z coordinate of that point, i.e. its depth, to the pixel as its depth value; if the spatial point corresponding to the pixel does not coincide with any point in the point cloud, calculating the average distance between the spatial point and its neighboring points in the large point cloud, and if this value is greater than a certain threshold, taking it as the depth value of that pixel;
3.6) detecting whether there are pixels in the depth map without a valid value; if such pixels exist, filling in their depth values by using a joint bilateral filter, so that every pixel in the depth map has a depth value;
3.7) using the extended intrinsic image decomposition model function with the normal vector of each pixel point as a variable as an illumination model function of each pixel point; the extended intrinsic image decomposition model function used is:
[the extended intrinsic image decomposition model, shown as an image in the original, expresses the intensity of each pixel in terms of its albedo, its shading as a function of the surface normal, and a local illumination-difference term]
3.8) calculating shading information for each pixel in the image; the shading information function is expressed by using a matrix form of a linear polynomial of zero-order and first-order spherical harmonic coefficients and a point cloud surface normal vector, namely:
s(n) = lᵀ·[1, nx, ny, nz]ᵀ, where l is the vector of zero-order and first-order spherical harmonic coefficients and n = (nx, ny, nz) is the surface normal of the corresponding point-cloud point;
firstly, calculating a normal vector of each point, and then solving a parameter vector of a light and shade function through a target function minimizing a difference value between the light and shade function and the real illumination intensity so as to determine a light and shade function value for each pixel point;
3.9) calculating the albedo of each pixel in the image; since the shading function considers only distant light sources and the ambient light source, it is only a preliminary estimate of the illumination, and therefore a separate albedo needs to be introduced for each pixel in order to account for the effects caused by specular reflection, shadows and nearby light sources;
the minimization objective function is constructed as:
[objective function shown as an image in the original]
where ρ is the albedo of each pixel, I is the illumination intensity of each pixel, N is a neighborhood of the pixel currently operated on in the full-image iteration, ρk is the albedo at the pixels in that neighborhood, and λρ is a parameter; two auxiliary definitions are shown as images in the original;
3.10) calculating a value of the illumination difference for each pixel in the image due to the local illumination difference;
the minimization objective function is constructed as:
[objective function shown as an image in the original]
where β is the illumination-difference value of each pixel, βk is the value at the pixels in a neighborhood of the pixel currently operated on in the full-image iteration, and the remaining symbol shown as an image in the original is a parameter;
3.11) constructing an objective function between the light and shade model and the actually measured illumination intensity, and minimizing the objective function by using an improved depth enhancement acceleration algorithm so as to obtain an optimized depth map;
the normal of the point on the point cloud corresponding to each pixel is first represented in the form of the gradient of the depth map, i.e.:
[formula shown as an image in the original, expressing the normal in terms of the gradient of the depth map]
where ∇z is the gradient of the depth map;
then, an objective function of depth optimization is established as follows:
[objective function shown as an image in the original]
where Δz is the Laplacian of the depth map and the remaining symbol shown as an image in the original is a parameter;
the depth is then iteratively optimized using a depth-enhanced acceleration algorithm, as follows:
① input the initial depth map, the spherical harmonic coefficient vector (shown as an image in the original), the vectorized albedo vector ρ, and the vector of illumination difference values β;
② while the value of the depth optimization objective function keeps decreasing, steps ③ to ⑤ are executed in a loop;
③ update the variable shown as an image in the original according to the formulas shown as images in the original;
④ update the variable shown as an image in the original;
⑤ update zk so that f(zk) becomes smaller;
after this step is finished, the method returns to steps 2.1) to 2.5) of the monocular algorithm, and the program keeps running until it detects that the user has performed the operation to stop the method.
Compared with the prior art, the invention has the following advantages and beneficial effects:
1. The method substantially improves the depth map obtained by the Kinect depth sensor, solving the problem that depth values are inaccurate when only the data collected by the depth sensor are used.
2. The constraint that the Kinect device imposes on the depth range of the measured object is removed, widening the range of measurable depth.
3. The invention uses a specially designed monocular distance measurement point screening method, so that the depth precision is far higher than that of the common method.
4. The Kinect can adapt to complex, changing illumination environments, and the problem that the Kinect is unsuitable for outdoor use is alleviated.
5. When calculating the global brightness of the object, the invention uses an illumination model that combines the surface normal vector with the three-dimensional representation, which describes the global illumination better.
6. The invention considers and utilizes the global illumination effect and the local illumination effect when the light irradiates on the object under the real illumination environment, and has robustness and practical significance for processing the depth reconstruction under different illumination.
7. The invention uses the depth enhancement acceleration algorithm to carry out optimization on the depth optimization step, thereby greatly reducing the calculated amount of the optimization step and the running time of the method.
Detailed Description
The present invention will be further described with reference to the following specific examples.
The Kinect depth reconstruction method based on camera motion and image shading comprises the following steps:
1) under the condition that the Kinect depth camera and the RGB camera are aligned in a calibration mode, data collected by the Kinect are uploaded to a computer through a third-party interface.
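As an illustration of step 1) (not part of the claimed method), the sketch below grabs one RGB/depth frame pair through the libfreenect Python bindings, one possible third-party interface; the patent does not name a specific interface, so the library, the function names freenect.sync_get_depth / freenect.sync_get_video, and the 640x480 frame size are assumptions.

    # Hypothetical sketch: acquiring one RGB/depth pair from a Kinect v1
    # through the libfreenect Python bindings and handing it to the computer
    # side of the pipeline as numpy arrays.
    import freenect
    import numpy as np

    def grab_frame_pair():
        depth, _ = freenect.sync_get_depth()   # depth map, roughly 480 x 640, uint16
        rgb, _ = freenect.sync_get_video()     # RGB image, roughly 480 x 640 x 3, uint8
        return np.asarray(depth, dtype=np.uint16), np.asarray(rgb, dtype=np.uint8)

    if __name__ == "__main__":
        depth, rgb = grab_frame_pair()
        print(depth.shape, rgb.shape)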
2.1) reading an RGB picture as a key frame when a system is initialized, binding a depth map to the key frame, traversing the depth map, and assigning a random value to each pixel position, wherein the depth map and the gray map have the same dimensionality;
2.2) for each read RGB picture, the following cost function is constructed:
E(ζji) = Σp ‖ rp²(p, ζji) / σrp² ‖δ

wherein ‖·‖δ is the Huber norm, rp represents the photometric error, and σrp² represents the variance of that error.
The Huber norm is defined as follows:

‖r²‖δ = r²/(2δ) for |r| ≤ δ, and |r| - δ/2 otherwise

δ is the parameter of the Huber norm.
The error function rp is defined as follows:
rp(p,ζji)=Ii(p)-Ij(w(p,Di(p),ζji))
Ii(p) represents the gray value at the position of pixel p in frame i, ζji is the Lie algebra of the rigid-body transformation that rotationally translates a three-dimensional point from the i coordinate system to the j coordinate system, Di(p) denotes the depth value at the position corresponding to pixel p in the depth map of the reference frame, and w(p, Di(p), ζji) transforms the three-dimensional point corresponding to the position of pixel p in reference frame i to its position in current frame j through the rotation-translation rigid-body transformation. The warp back-projects p using the depth Di(p), applies the rigid-body transformation, and re-projects the point with the projection formulas shown as images in the original, wherein X, Y and Z respectively represent the coordinates of the three-dimensional point along the X, Y and Z axes of the camera coordinate system, u and v represent pixel coordinates, and fx, fy respectively represent the focal lengths in the X and Y directions.
The variance σrp² is defined as follows:

σrp² = 2σI² + (∂rp/∂Di(p))²·Vi(p)

wherein σI² represents the variance of the image gray values and Vi(p) represents the variance of pixel point p of the reference frame depth map.
2.3) Solve for the ζ that minimizes the cost function of step 2.2) by the Gauss-Newton iteration method, so as to obtain the rotation-translation relation between the reference frame and the current frame.
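To make steps 2.2) and 2.3) concrete, the sketch below evaluates the photometric residual rp = Ii(p) - Ij(w(p, Di(p), ζji)) for a single pixel, assuming the warp is "back-project, apply the rigid-body transform, re-project" and passing the pose as a rotation matrix R and translation t rather than as a Lie-algebra vector; this is an illustrative reading of the formulas shown as images in the original, not the patent's own code, and the principal point (cx, cy) is an added assumption.

    # Minimal sketch of the per-pixel photometric residual used in the cost function.
    import numpy as np

    def photometric_residual(I_i, I_j, p, depth, K, R, t):
        fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
        u, v = p
        # back-project pixel p of the reference frame i using its depth
        X = np.array([(u - cx) * depth / fx,
                      (v - cy) * depth / fy,
                      depth], dtype=np.float64)
        # rigid-body transform into the coordinate system of the current frame j
        Xj = R @ X + t
        # re-project into frame j
        uj = int(round(fx * Xj[0] / Xj[2] + cx))
        vj = int(round(fy * Xj[1] / Xj[2] + cy))
        if not (0 <= uj < I_j.shape[1] and 0 <= vj < I_j.shape[0]):
            return None  # warped point falls outside the current frame
        return float(I_i[v, u]) - float(I_j[vj, uj])

Summing the Huber-weighted squares of such residuals over the selected pixels gives the cost that step 2.3) minimizes with Gauss-Newton.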
2.4) Calculate the gradient of all points on the reference-frame gray image and select the points whose gradient is greater than a threshold. These points are then screened: for each such point the unit gradient direction g and the gradient magnitude g0 are calculated, the inner product of g and the epipolar direction l is computed, and the screening quantities shown as images in the original are evaluated; if the condition shown as an image in the original does not hold, the point is passed over.
Traverse all the points that meet the requirements and, according to the epipolar geometry:
F = K1^(-T)·E·K1^(-1)
E = [t]x·R
l = F·x
calculate the epipolar line in the current frame on which each point whose depth is to be solved lies. Here K1 is the intrinsic matrix of the RGB camera, t and R are the translation and rotation obtained by decomposing the Lie algebra ζ of step 2.2) into a rigid-body transformation ([t]x denotes the skew-symmetric matrix of t), l is the corresponding epipolar line in the current frame, and x is the coordinate, in the reference frame, of the pixel whose depth is to be solved.
The corresponding points of these points are then searched along the epipolar line of the current frame, and the spatial coordinates of the points are calculated using monocular three-dimensional reconstruction geometry:
According to
x(p3TX) - (p1TX) = 0
y(p3TX) - (p2TX) = 0
x(p2TX) - y(p1TX) = 0
a system of the form AX = 0 is formed, where P is the product of the rotation-translation matrix and the intrinsic matrix, piT is the i-th row of the P matrix, and X is the three-dimensional coordinate of the point to be solved; x and y are the horizontal- and vertical-axis coordinates of the corresponding point in the current frame. Applying singular value decomposition (SVD) to the matrix A, the singular vector corresponding to the smallest singular value (equivalently, the eigenvector of AᵀA with the smallest eigenvalue) gives the spatial coordinates of the point.
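A minimal sketch of the linear triangulation just described: the equations are stacked into a matrix A, AX = 0 is solved by SVD, and the homogeneous solution is de-homogenized. Here P_ref and P_cur are assumed to be the 3x4 matrices (intrinsics times rotation-translation) of the reference and current frames; the patent writes the equations for a single projection matrix P, so using both views is an assumption.

    import numpy as np

    def triangulate(x_ref, x_cur, P_ref, P_cur):
        # x_ref, x_cur: pixel coordinates (x, y) in the reference and current frames
        (x1, y1), (x2, y2) = x_ref, x_cur
        A = np.vstack([
            x1 * P_ref[2] - P_ref[0],   # x * (p3T X) - (p1T X) = 0
            y1 * P_ref[2] - P_ref[1],   # y * (p3T X) - (p2T X) = 0
            x2 * P_cur[2] - P_cur[0],
            y2 * P_cur[2] - P_cur[1],
        ])
        _, _, Vt = np.linalg.svd(A)
        X = Vt[-1]               # right singular vector of the smallest singular value
        return X[:3] / X[3]      # homogeneous -> Euclidean coordinates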
2.5) fusing the newly obtained depth value with the depth value in the depth map of the reference frame by using a Kalman filter.
The updated depth value is:
dnew = (σp·do + σo·dp) / (σo + σp)
and the variance of the updated depth is:
σnew = (σo·σp) / (σo + σp)
where do is the original depth in the depth map, dp is the newly calculated depth, σo is the depth variance maintained in the depth map, and σp is the newly calculated depth variance.
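The fusion of step 2.5) can be written per pixel with numpy as below; this is a sketch of the inverse-variance (Kalman-style) update described above, with the depths and their variances passed as whole arrays.

    import numpy as np

    def fuse_depth(d_o, var_o, d_p, var_p):
        # d_o, var_o: depth and variance kept in the reference-frame depth map
        # d_p, var_p: newly triangulated depth and its variance
        d_new = (var_p * d_o + var_o * d_p) / (var_o + var_p)
        var_new = (var_o * var_p) / (var_o + var_p)
        return d_new, var_new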
3.1) The depth image collected by the depth camera in the current frame is aligned with the color image collected by the monocular color camera. Because the color camera and the depth camera have different fields of view, only the overlapping part of the two fields of view has valid depth values, so an incomplete depth map is obtained after alignment.
And 3.2) generating a three-dimensional point cloud from the incomplete depth map according to a model of a pinhole camera in the depth camera. The pinhole camera model can be briefly described as follows, and the relationship between the spatial coordinates [ x, y, z ] of a spatial point and its pixel coordinates [ u, v, d ] in the image can be expressed as:
z=d/s
x=(u-cx)·z/fx
y=(v-cy)·z/fy
where d is the depth value of each pixel in the depth map, s is the scaling factor of the depth map, cx and cy are the abscissa and ordinate of the principal point, and fx and fy are the focal length components in the abscissa and ordinate directions.
The pixel coordinates of each pixel are converted into corresponding space coordinates by using the formula, and the conversion from the depth map to the three-dimensional point cloud can be completed.
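A minimal sketch of this conversion (step 3.2)) with numpy; the intrinsics fx, fy, cx, cy and the scale factor s are placeholders to be replaced by the calibrated values.

    import numpy as np

    def depth_to_point_cloud(depth, fx, fy, cx, cy, s=1000.0):
        h, w = depth.shape
        u, v = np.meshgrid(np.arange(w), np.arange(h))
        z = depth.astype(np.float64) / s          # z = d / s
        valid = z > 0                             # skip pixels without a valid depth
        x = (u - cx) * z / fx                     # x = (u - cx) * z / fx
        y = (v - cy) * z / fy                     # y = (v - cy) * z / fy
        return np.stack([x[valid], y[valid], z[valid]], axis=1)   # N x 3 point cloud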
3.3) The point cloud generated by the monocular algorithm and the point cloud generated from the incomplete depth map are registered using a point-to-point iterative closest point (ICP) algorithm, obtaining the rotation matrix R and translation matrix T between the two point clouds.
3.4) According to the obtained rotation and translation matrices, the point cloud obtained by the monocular algorithm is transformed into the coordinate system of the point cloud generated from the incomplete depth map, and the two are merged into one large point cloud.
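Steps 3.3) and 3.4) can be sketched with Open3D's point-to-point ICP as below; the library choice and the 5 cm correspondence threshold are illustrative assumptions, not prescribed by the patent.

    import numpy as np
    import open3d as o3d

    def register_and_merge(src_pts, dst_pts, max_dist=0.05):
        # src_pts: point cloud from the monocular algorithm (N x 3)
        # dst_pts: point cloud generated from the incomplete depth map (M x 3)
        src = o3d.geometry.PointCloud(o3d.utility.Vector3dVector(src_pts))
        dst = o3d.geometry.PointCloud(o3d.utility.Vector3dVector(dst_pts))
        result = o3d.pipelines.registration.registration_icp(
            src, dst, max_dist, np.eye(4),
            o3d.pipelines.registration.TransformationEstimationPointToPoint())
        T = result.transformation                       # 4 x 4 matrix holding R and T
        src_h = np.c_[src_pts, np.ones(len(src_pts))]   # homogeneous coordinates
        src_in_dst = (T @ src_h.T).T[:, :3]             # monocular cloud in the depth-map frame
        return np.vstack([dst_pts, src_in_dst]), T      # the merged large point cloud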
3.5) For the regions of the depth map whose depth values are invalid because of the non-overlapping fields of view, the spatial position of each pixel in the invalid region is calculated. If the spatial point corresponding to the pixel coincides exactly with a point in the large point cloud, the z coordinate of that point, i.e. its depth, is directly assigned to the pixel as its depth value. If the spatial point corresponding to the pixel does not coincide with any point in the point cloud, the average distance between the spatial point and its neighboring points in the large point cloud is calculated, and if this value is greater than a certain threshold, it is taken as the depth value of that pixel.
3.6) It is detected whether there are still pixels in the depth map without a valid value. If such pixels exist, their depth values are filled in using a joint bilateral filter, ensuring that every pixel in the depth map has a depth value.
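A sketch of the hole filling in step 3.6), guided by the color image; it relies on the jointBilateralFilter from the opencv-contrib ximgproc module, and the filter parameters as well as the simple copy-into-holes strategy are illustrative assumptions.

    import cv2
    import numpy as np

    def fill_depth_holes(depth, bgr):
        guide = cv2.cvtColor(bgr, cv2.COLOR_BGR2GRAY).astype(np.float32)
        src = depth.astype(np.float32)
        smoothed = cv2.ximgproc.jointBilateralFilter(guide, src, 9, 25, 7)
        holes = depth <= 0                      # pixels with no valid depth value
        out = src.copy()
        out[holes] = smoothed[holes]            # write filtered values only into holes
        return out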
3.7) using the extended intrinsic image decomposition model function with the normal vector for each pixel point as a variable as the illumination model function for each pixel point. The extended intrinsic image decomposition model function used is:
[the extended intrinsic image decomposition model, shown as an image in the original, expresses the intensity of each pixel in terms of its albedo, its shading as a function of the surface normal, and a local illumination-difference term]
3.8) calculate shading information for each pixel in the image. The shading information function is expressed by using a matrix form of a linear polynomial of zero-order and first-order spherical harmonic coefficients and a point cloud surface normal vector. That is:
s(n) = lᵀ·[1, nx, ny, nz]ᵀ, where l is the vector of zero-order and first-order spherical harmonic coefficients and n = (nx, ny, nz) is the surface normal of the corresponding point-cloud point.
firstly, a normal vector of each point is calculated, and then a parameter vector of the light and shade function is solved through a target function which minimizes the difference between the light and shade function and the real illumination intensity, so that a light and shade function value is determined for each pixel point.
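As an illustration of step 3.8), the sketch below fits the zero- and first-order spherical-harmonic shading s(n) = l·[1, nx, ny, nz] to the measured intensities by plain least squares; the patent's exact objective is shown only as an image, so the unweighted least-squares form is an assumption.

    import numpy as np

    def fit_shading(normals, intensity):
        # normals: N x 3 unit surface normals, intensity: N measured gray values
        H = np.c_[np.ones(len(normals)), normals]            # SH basis [1, nx, ny, nz]
        l, *_ = np.linalg.lstsq(H, intensity, rcond=None)    # spherical-harmonic coefficients
        return H @ l, l                                      # per-pixel shading and coefficient vector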
3.9) The albedo of each pixel in the image is calculated. Since the shading function considers only distant light sources and the ambient light source, it is only a preliminary estimate of the illumination, so a separate albedo needs to be introduced for each pixel in order to account for the effects of specular reflection, shadows and nearby light sources.
The minimization objective function is constructed as:
[objective function shown as an image in the original]
where ρ is the albedo of each pixel, I is the illumination intensity of each pixel, N is a neighborhood of the pixel currently operated on in the full-image iteration, ρk is the albedo at the pixels in that neighborhood, and λρ is a parameter; two auxiliary definitions are shown as images in the original.
3.10) calculate for each pixel in the image the illumination difference value due to the local illumination difference.
The minimization objective function is constructed as:
[objective function shown as an image in the original]
where β is the illumination-difference value of each pixel, βk is the value at the pixels in a neighborhood of the pixel currently operated on in the full-image iteration, and the remaining symbol shown as an image in the original is a parameter.
3.11) constructing an objective function between the light and shade model and the actually measured illumination intensity, and minimizing the objective function by using an improved depth enhancement acceleration algorithm, thereby obtaining an optimized depth map.
The normal of the point on the point cloud corresponding to each pixel is first represented in the form of the gradient of the depth map, i.e.:
[formula shown as an image in the original, expressing the normal in terms of the gradient of the depth map]
where ∇z is the gradient of the depth map.
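A sketch of obtaining per-pixel normals from the depth-map gradient; it uses the common approximation that n is proportional to (-dz/du, -dz/dv, 1), which is an assumption since the patent's exact expression is shown only as an image.

    import numpy as np

    def normals_from_depth(z):
        dz_dv, dz_du = np.gradient(z.astype(np.float64))   # per-pixel depth gradients
        n = np.dstack([-dz_du, -dz_dv, np.ones_like(z, dtype=np.float64)])
        n /= np.linalg.norm(n, axis=2, keepdims=True)      # normalize to unit length
        return n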
Then, an objective function of depth optimization is established as follows:
[objective function shown as an image in the original]
where Δz is the Laplacian of the depth map and the remaining symbol shown as an image in the original is a parameter.
Then, the depth is optimized iteratively by using a depth enhancement acceleration algorithm:
① input the initial depth map, the spherical harmonic coefficient vector (shown as an image in the original), the vectorized albedo vector ρ, and the vector of illumination difference values β.
② steps ③ to ⑤ are executed in a loop while the value of the depth optimization objective function keeps decreasing.
③ update the variable shown as an image in the original according to the formulas shown as images in the original.
④ update the variable shown as an image in the original.
⑤ update zk so that f(zk) becomes smaller.
After this step is finished, the method returns to steps 2.1) to 2.5) of the monocular algorithm, and the program keeps running until it detects that the user has performed the operation to stop the method.
The above-mentioned embodiments are merely preferred embodiments of the present invention, and the scope of the present invention is not limited thereto, so that variations based on the shape and principle of the present invention should be covered within the scope of the present invention.

Claims (2)

1. The Kinect depth reconstruction method based on camera motion and image shading is characterized by comprising the following steps of:
1) under the condition that the Kinect depth camera and the RGB camera are calibrated and aligned, uploading data collected by the Kinect to a computer through a third-party interface;
2) recovering a three-dimensional scene structure and a motion track of a kinect RGB camera from an RGB video sequence to obtain a point cloud and camera motion relation; the method comprises the following steps:
2.1) reading an RGB picture as a key frame when a system is initialized, binding a depth map to the key frame, traversing the depth map, and assigning a random value to each pixel position, wherein the depth map and the gray map have the same dimensionality;
2.2) for each read RGB picture, the following cost function is constructed:
E(ζji) = Σp ‖ rp²(p, ζji) / σrp² ‖δ

wherein ‖·‖δ is the Huber norm, rp represents the photometric error, and σrp² represents the variance of that error;
the Huber norm is defined as follows:

‖r²‖δ = r²/(2δ) for |r| ≤ δ, and |r| - δ/2 otherwise

δ is the parameter of the Huber norm;
the error function rp is defined as follows:
rp(p,ζji)=Ii(p)-Ij(w(p,Di(p),ζji))
Ii(p) represents the gray value at the position of pixel p in frame i, ζji is the Lie algebra of the rigid-body transformation that rotationally translates a three-dimensional point from the i coordinate system to the j coordinate system, Di(p) denotes the depth value at the position corresponding to pixel p in the depth map of the reference frame, and w(p, Di(p), ζji) transforms the three-dimensional point corresponding to the position of pixel p in reference frame i to its position in current frame j through the rotation-translation rigid-body transformation; the warp back-projects p using the depth Di(p), applies the rigid-body transformation, and re-projects the point with the projection formulas shown as images in the original, wherein X, Y and Z respectively represent the coordinates of the three-dimensional point along the X, Y and Z axes of the camera coordinate system, u and v represent pixel coordinates, and fx, fy respectively represent the focal lengths in the X and Y directions;
the variance σrp² is defined as follows:

σrp² = 2σI² + (∂rp/∂Di(p))²·Vi(p)

wherein σI² represents the variance of the image gray values and Vi(p) represents the variance of pixel point p of the reference frame depth map;
2.3) solving for the ζ that minimizes the cost function of step 2.2) by the Gauss-Newton iteration method, so as to obtain the rotation-translation relation between the reference frame and the current frame;
2.4) computing the gradient of all points on the reference-frame gray image, and selecting the points whose gradient is greater than a threshold; then screening these points; traversing all the points that meet the requirements, and searching for their corresponding points on the epipolar line of the current frame according to the epipolar geometry; calculating the spatial coordinates of the points using monocular three-dimensional reconstruction geometry;
2.5) fusing the newly obtained depth value with the depth value in the depth image of the reference frame by using a Kalman filter;
3) and (3) reconstructing the image depth by combining the point cloud obtained in the step 2) and the camera motion relation and utilizing the light and shade condition information of the image.
2. The Kinect depth reconstruction method based on camera motion and image shading as claimed in claim 1, wherein the step 3) comprises the steps of:
3.1) aligning the depth image collected by the depth camera in the current frame with the color image collected by the monocular color camera; because the color camera and the depth camera have different fields of view, only the overlapping part of the two fields of view has valid depth values, so an incomplete depth map is obtained after alignment;
3.2) generating a three-dimensional point cloud from the incomplete depth map according to the pinhole camera model of the depth camera; the pinhole camera model is briefly described as follows: the relationship between the spatial coordinates [ x, y, z ] of a spatial point and its pixel coordinates [ u, v, d ] in the image is expressed as:
z=d/s
x=(u-cx)·z/fx
y=(v-cy)·z/fy
where d is the depth value of each pixel in the depth map, s is the scaling factor of the depth map, cx and cy are the abscissa and ordinate of the principal point, and fx and fy are the focal length components in the abscissa and ordinate directions;
converting the pixel coordinate of each pixel into a corresponding space coordinate by using the formula, and then completing the conversion from the depth map to the three-dimensional point cloud;
3.3) registering the point cloud generated by the monocular algorithm and the point cloud generated from the incomplete depth map by using a point-to-point iterative closest point (ICP) algorithm, so as to obtain the rotation matrix R and translation matrix T between the two point clouds;
3.4) converting the point cloud obtained by the monocular algorithm to the coordinate system of the point cloud generated from the incomplete depth map according to the obtained rotation and translation matrices, and merging the two into one large point cloud;
3.5) for the regions of the depth map whose depth values are invalid because of the non-overlapping fields of view, calculating the spatial position of each pixel in the invalid region; if the spatial point corresponding to the pixel coincides exactly with a point in the large point cloud, directly assigning the z coordinate of that point, i.e. its depth, to the pixel as its depth value; if the spatial point corresponding to the pixel does not coincide with any point in the point cloud, calculating the average distance between the spatial point and its neighboring points in the large point cloud, and if this value is greater than a certain threshold, taking it as the depth value of that pixel;
3.6) detecting whether there are pixels in the depth map without a valid value; if such pixels exist, filling in their depth values by using a joint bilateral filter, so that every pixel in the depth map has a depth value;
3.7) using the extended intrinsic image decomposition model function with the normal vector of each pixel point as a variable as an illumination model function of each pixel point; the extended intrinsic image decomposition model function used is:
[the extended intrinsic image decomposition model, shown as an image in the original, expresses the intensity of each pixel in terms of its albedo, its shading as a function of the surface normal, and a local illumination-difference term]
3.8) calculating shading information for each pixel in the image; the shading information function is expressed by using a matrix form of a linear polynomial of zero-order and first-order spherical harmonic coefficients and a point cloud surface normal vector, namely:
s(n) = lᵀ·[1, nx, ny, nz]ᵀ, where l is the vector of zero-order and first-order spherical harmonic coefficients and n = (nx, ny, nz) is the surface normal of the corresponding point-cloud point;
firstly, calculating a normal vector of each point, and then solving a parameter vector of a light and shade function through a target function minimizing a difference value between the light and shade function and the real illumination intensity so as to determine a light and shade function value for each pixel point;
3.9) calculating the albedo of each pixel in the image; since the shading function considers only distant light sources and the ambient light source, it is only a preliminary estimate of the illumination, and therefore a separate albedo needs to be introduced for each pixel in order to account for the effects caused by specular reflection, shadows and nearby light sources;
the minimization objective function is constructed as:
[objective function shown as an image in the original]
where ρ is the albedo of each pixel, I is the illumination intensity of each pixel, N is a neighborhood of the pixel currently operated on in the full-image iteration, ρk is the albedo at the pixels in that neighborhood, and λρ is a parameter; two auxiliary definitions are shown as images in the original;
3.10) calculating a value of the illumination difference for each pixel in the image due to the local illumination difference;
the minimization objective function is constructed as:
[objective function shown as an image in the original]
where β is the illumination-difference value of each pixel, βk is the value at the pixels in a neighborhood of the pixel currently operated on in the full-image iteration, and the remaining symbol shown as an image in the original is a parameter;
3.11) constructing an objective function between the light and shade model and the actually measured illumination intensity, and minimizing the objective function by using an improved depth enhancement acceleration algorithm so as to obtain an optimized depth map;
the normal of the point on the point cloud corresponding to each pixel is first represented in the form of the gradient of the depth map, i.e.:
[formula shown as an image in the original, expressing the normal in terms of the gradient of the depth map]
where ∇z is the gradient of the depth map;
then, an objective function of depth optimization is established as follows:
[objective function shown as an image in the original]
where Δz is the Laplacian of the depth map and the remaining symbol shown as an image in the original is a parameter;
the depth is then iteratively optimized using a depth-enhanced acceleration algorithm, as follows:
① input the initial depth map, the spherical harmonic coefficient vector (shown as an image in the original), the vectorized albedo vector ρ, and the vector of illumination difference values β;
② while the value of the depth optimization objective function keeps decreasing, steps ③ to ⑤ are executed in a loop;
③ update the variable shown as an image in the original according to the formulas shown as images in the original;
④ update the variable shown as an image in the original;
⑤ update zk so that f(zk) becomes smaller;
after this step is finished, the method returns to steps 2.1) to 2.5) of the monocular algorithm, and the program keeps running until it detects that the user has performed the operation to stop the method.
CN201611061364.6A 2016-06-30 2016-11-28 Kinect depth reconstruction method based on camera motion and image shading Active CN106780592B (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN2016105115439 2016-06-30
CN201610511543 2016-06-30

Publications (2)

Publication Number Publication Date
CN106780592A CN106780592A (en) 2017-05-31
CN106780592B true CN106780592B (en) 2020-05-22

Family

ID=58910978

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611061364.6A Active CN106780592B (en) 2016-06-30 2016-11-28 Kinect depth reconstruction method based on camera motion and image shading

Country Status (1)

Country Link
CN (1) CN106780592B (en)

Families Citing this family (28)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107169475B (en) * 2017-06-19 2019-11-19 电子科技大学 A kind of face three-dimensional point cloud optimized treatment method based on kinect camera
CN109708636B (en) * 2017-10-26 2021-05-14 广州极飞科技股份有限公司 Navigation chart configuration method, obstacle avoidance method and device, terminal and unmanned aerial vehicle
CN107845134B (en) * 2017-11-10 2020-12-29 浙江大学 Three-dimensional reconstruction method of single object based on color depth camera
US10529086B2 (en) * 2017-11-22 2020-01-07 Futurewei Technologies, Inc. Three-dimensional (3D) reconstructions of dynamic scenes using a reconfigurable hybrid imaging system
CN108151728A (en) * 2017-12-06 2018-06-12 华南理工大学 A kind of half dense cognitive map creation method for binocular SLAM
CN108053445A (en) * 2017-12-08 2018-05-18 中南大学 The RGB-D camera motion methods of estimation of Fusion Features
CN108230381B (en) * 2018-01-17 2020-05-19 华中科技大学 Multi-view stereoscopic vision method combining space propagation and pixel level optimization
CN108447060B (en) * 2018-01-29 2021-07-09 上海数迹智能科技有限公司 Foreground and background separation method based on RGB-D image and foreground and background separation device thereof
CN108335325A (en) * 2018-01-30 2018-07-27 上海数迹智能科技有限公司 A kind of cube method for fast measuring based on depth camera data
CN108470323B (en) * 2018-03-13 2020-07-31 京东方科技集团股份有限公司 Image splicing method, computer equipment and display device
CN108830925B (en) * 2018-05-08 2020-09-15 中德(珠海)人工智能研究院有限公司 Three-dimensional digital modeling method based on spherical screen video stream
CN109255819B (en) * 2018-08-14 2020-10-13 清华大学 Kinect calibration method and device based on plane mirror
CN109579731B (en) * 2018-11-28 2019-12-24 华中科技大学 Method for performing three-dimensional surface topography measurement based on image fusion
CN109657580B (en) * 2018-12-07 2023-06-16 南京高美吉交通科技有限公司 Urban rail transit gate traffic control method
CN109727282A (en) * 2018-12-27 2019-05-07 南京埃克里得视觉技术有限公司 A kind of Scale invariant depth map mapping method of 3-D image
CN109872355B (en) * 2019-01-25 2022-12-02 合肥哈工仞极智能科技有限公司 Shortest distance acquisition method and device based on depth camera
CN109949397A (en) * 2019-03-29 2019-06-28 哈尔滨理工大学 A kind of depth map reconstruction method of combination laser point and average drifting
US10510155B1 (en) 2019-06-11 2019-12-17 Mujin, Inc. Method and processing system for updating a first image generated by a first camera based on a second image generated by a second camera
CN110455815B (en) * 2019-09-05 2023-03-24 西安多维机器视觉检测技术有限公司 Method and system for detecting appearance defects of electronic components
CN111223053A (en) * 2019-11-18 2020-06-02 北京邮电大学 Data enhancement method based on depth image
CN111402392A (en) * 2020-01-06 2020-07-10 香港光云科技有限公司 Illumination model calculation method, material parameter processing method and material parameter processing device
CN111400869B (en) * 2020-02-25 2022-07-26 华南理工大学 Reactor core neutron flux space-time evolution prediction method, device, medium and equipment
CN113085896B (en) * 2021-04-19 2022-10-04 暨南大学 Auxiliary automatic driving system and method for modern rail cleaning vehicle
CN114067206B (en) * 2021-11-16 2024-04-26 哈尔滨理工大学 Spherical fruit identification positioning method based on depth image
CN114612534B (en) * 2022-03-17 2023-04-07 南京航空航天大学 Whole-machine point cloud registration method of multi-view aircraft based on spherical harmonic characteristics
CN115375827B (en) * 2022-07-21 2023-09-15 荣耀终端有限公司 Illumination estimation method and electronic equipment
CN115049813B (en) * 2022-08-17 2022-11-01 南京航空航天大学 Coarse registration method, device and system based on first-order spherical harmonics
CN116758170B (en) * 2023-08-15 2023-12-22 北京市农林科学院智能装备技术研究中心 Multi-camera rapid calibration method for livestock and poultry phenotype 3D reconstruction and storage medium

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103400409A (en) * 2013-08-27 2013-11-20 华中师范大学 3D (three-dimensional) visualization method for coverage range based on quick estimation of attitude of camera

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
"基于Kinect深度图像的三维重建";李务军等;《微型机与应用》;20160331;第35卷(第5期);第55-57页 *

Also Published As

Publication number Publication date
CN106780592A (en) 2017-05-31

Similar Documents

Publication Publication Date Title
CN106780592B (en) Kinect depth reconstruction method based on camera motion and image shading
CN107767442B (en) Foot type three-dimensional reconstruction and measurement method based on Kinect and binocular vision
Delaunoy et al. Photometric bundle adjustment for dense multi-view 3d modeling
Zhou et al. Canny-vo: Visual odometry with rgb-d cameras based on geometric 3-d–2-d edge alignment
KR102647351B1 (en) Modeling method and modeling apparatus using 3d point cloud
US9245344B2 (en) Method and apparatus for acquiring geometry of specular object based on depth sensor
González-Aguilera et al. An automatic procedure for co-registration of terrestrial laser scanners and digital cameras
CN107240129A (en) Object and indoor small scene based on RGB D camera datas recover and modeling method
Pan et al. Dense 3D reconstruction combining depth and RGB information
US11568601B2 (en) Real-time hand modeling and tracking using convolution models
JP5484133B2 (en) Method for estimating the 3D pose of a specular object
Matsuki et al. Codemapping: Real-time dense mapping for sparse slam using compact scene representations
TW200907826A (en) System and method for locating a three-dimensional object using machine vision
CN115345822A (en) Automatic three-dimensional detection method for surface structure light of aviation complex part
Zhang et al. A new high resolution depth map estimation system using stereo vision and kinect depth sensing
Xu et al. Survey of 3D modeling using depth cameras
US10559085B2 (en) Devices, systems, and methods for reconstructing the three-dimensional shapes of objects
Wang et al. Plane-based optimization of geometry and texture for RGB-D reconstruction of indoor scenes
CN111275764B (en) Depth camera visual mileage measurement method based on line segment shadows
Wan et al. A study in 3d-reconstruction using kinect sensor
Guo et al. Patch-based uncalibrated photometric stereo under natural illumination
Mi et al. 3D reconstruction based on the depth image: A review
Corsini et al. Stereo light probe
JP6579659B2 (en) Light source estimation apparatus and program
Sang et al. High-quality rgb-d reconstruction via multi-view uncalibrated photometric stereo and gradient-sdf

Legal Events

Date Code Title Description
PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant