CN114943776A - Three-dimensional reconstruction method and device based on cross-correlation function and normal vector loss - Google Patents

Three-dimensional reconstruction method and device based on cross-correlation function and normal vector loss

Info

Publication number
CN114943776A
CN114943776A
Authority
CN
China
Prior art keywords
cost
plane
hypothesis
image
normal vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210606204.4A
Other languages
Chinese (zh)
Inventor
朱翱宇
陈珺
罗林波
官文俊
熊永华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China University of Geosciences
Original Assignee
China University of Geosciences
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China University of Geosciences filed Critical China University of Geosciences
Priority to CN202210606204.4A priority Critical patent/CN114943776A/en
Publication of CN114943776A publication Critical patent/CN114943776A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/80 Analysis of captured images to determine intrinsic or extrinsic camera parameters, i.e. camera calibration
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T15/00 3D [Three Dimensional] image rendering
    • G06T15/005 General purpose rendering architectures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00 Geometric image transformations in the plane of the image
    • G06T3/40 Scaling of whole images or parts thereof, e.g. expanding or contracting
    • G06T3/4007 Scaling of whole images or parts thereof, e.g. expanding or contracting based on interpolation, e.g. bilinear interpolation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00 Image enhancement or restoration
    • G06T5/20 Image enhancement or restoration using local operators
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74 Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/761 Proximity, similarity or dissimilarity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20024 Filtering details
    • G06T2207/20028 Bilateral filtering

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Computer Graphics (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention provides a three-dimensional reconstruction method and device based on cross-correlation function and normal vector loss, comprising the following steps: acquiring multi-view images of the scene to be reconstructed and the corresponding camera parameters; down-sampling the images and scaling down the camera intrinsic parameters; establishing an initial plane hypothesis by random initialization at the minimum scale; calculating the multi-view matching cost from the center-value normalized cross-correlation function and the normal vector loss function; propagating low-cost plane hypotheses in the neighborhood to the plane hypothesis of the current point; finding better hypotheses through random perturbation; and upsampling the optimal hypothesis of the current scale by joint bilateral upsampling to serve as the plane-hypothesis initialization of the next scale, continuing the cost calculation, hypothesis propagation, plane perturbation optimization and upsampling until the original scale of the image is reached. The method thereby improves the accuracy of multi-view dense matching in weakly textured regions and improves the accuracy and completeness of the weakly textured regions of the reconstructed point cloud.

Description

Three-dimensional reconstruction method and device based on cross-correlation function and normal vector loss
Technical Field
The invention relates to the technical field of computer vision and image processing, in particular to three-dimensional reconstruction technology, and provides a three-dimensional reconstruction method and device based on a cross-correlation function and normal vector loss.
Background
Three-dimensional reconstruction refers to accurately restoring the three-dimensional spatial shape of a scene or an object from sensor sampling data. According to the sensor used, three-dimensional reconstruction can be roughly classified into three types: reconstruction based on lidar, reconstruction based on structured light, and reconstruction based on Multi-View Stereo (MVS). MVS-based three-dimensional reconstruction places low demands on the sensor, is inexpensive, and is suitable for reconstructing large scenes. MVS photographs a three-dimensional scene from multiple views, recovers the depth information lost during imaging using solid-geometry principles, and reconstructs a point cloud from the depths and the camera parameters.
At present, MVS three-dimensional reconstruction is widely applied in fields such as geological mapping and urban mapping. The quality of the reconstructed point cloud mainly comprises completeness and accuracy, and improving point-cloud quality benefits subsequent geological disaster analysis and similar applications.
Although MVS can reconstruct a high-quality point-cloud model for most scenes, in regions with insufficient color information the texture details are few and large similar areas exist, so dense matching becomes ambiguous, the photometric consistency cost function used to evaluate depth-estimation accuracy degenerates, and unacceptable errors occur in the depth estimates. The ambiguity of MVS dense matching is especially severe when a spatial point appears in only a few of the multi-view images.
In the MVS field there are two existing basic ideas for reconstructing weakly textured regions. First, a weakly textured region may be approximated by a spatial plane: the parameters of a plane can be computed from only three points with correct depth inside the region, so approximating the region by a plane solves the problem that it otherwise cannot be reconstructed. However, such methods usually apply only to indoor scenes or man-made buildings, and are infeasible when the weakly textured region cannot be approximated by a plane. Second, the idea of enlarging the matching window: when the texture information within a fixed window is insufficient, the matching degree between windows cannot be evaluated accurately, and enlarging the window by an appropriate amount increases the color information inside it and improves the accuracy of dense matching. However, the photometric cost function actually measures the matching degree between two whole windows, which approximates that between the center pixels when the window is small; when the window is too large, this approximation introduces unacceptable errors.
Disclosure of Invention
In order to solve the problem that existing MVS methods cannot accurately reconstruct weakly textured regions, the invention provides a three-dimensional reconstruction method and device based on a cross-correlation function and normal vector loss.
According to an aspect of the present invention, there is provided a three-dimensional reconstruction method based on a cross-correlation function and normal vector loss, comprising the steps of:
S1: acquiring multi-view images of a scene to be reconstructed and the corresponding camera parameters;
S2: down-sampling the images and scaling down the camera intrinsic parameters;
S3: adjusting the images to the minimum scale and establishing an initial plane hypothesis π_p in a random initialization manner;
S4: selecting one of the multi-view images to be reconstructed in turn as the reference image, performing pairwise dense matching between the reference image and the remaining images, and calculating the photometric consistency cost among the multi-view images using the center-value normalized cross-correlation function; the remaining images are the source images;
S5: selecting, in a checkerboard fashion, the lowest-cost point in each of the eight directions in the neighborhood of the pixel point p to be processed as a candidate point p_j; the plane hypothesis corresponding to each candidate point is denoted π_pj, and the candidate plane hypotheses together with the initial plane hypothesis π_p form the candidate plane hypothesis set S_π;
S6: performing pixel-level view selection on the images using the photometric consistency cost, and calculating the weights of the cost values of the different views;
S7: forming an image pair from the reference image and a source image and performing binocular stereo matching; calculating the photometric consistency cost, depth consistency cost and normal vector consistency cost corresponding to each plane hypothesis in S_π, and weight-averaging them over the multi-view images according to the weights obtained in step S6 to obtain the comprehensive cost;
S8: selecting the plane hypothesis with the lowest comprehensive cost as the new plane hypothesis of the pixel point p;
S9: finding a lower-cost plane hypothesis through random perturbation and updating the new plane hypothesis of step S8;
S10: repeating steps S4-S9 four times at the current scale, the comprehensive cost decreasing gradually as the number of iterations increases;
S11: upsampling the optimal plane hypothesis at the current scale by the joint bilateral upsampling method to serve as the initialization result of the plane hypothesis at the next scale, and scaling up the camera intrinsic parameters;
S12: continuing the comprehensive cost calculation (S4-S7), hypothesis propagation (S8), plane perturbation optimization (S9), and iteration and upsampling (S10-S11) until the current scale reaches the original scale of the image.
Preferably, the S4 comprises:
S4.1: for the multi-view image set I to be reconstructed, selecting one image from it in turn as the reference image I_ref, the remaining images being collectively denoted as the source images I_src, and performing pairwise dense matching between I_ref and I_src;
S4.2: calculating the homography matrix H:

H_j = K_src_j ( R_src_j R_ref^T + ( R_src_j (c_ref − c_src_j) n^T ) / dist ) K_ref^{-1}

where K_ref is the camera intrinsic matrix of I_ref, K_src_j is that of the j-th source image I_src_j, R denotes the camera rotation matrix of the corresponding image (R_src_j for the source image, R_ref^T being the transpose of the camera rotation matrix of the reference image), c denotes the column vector of the corresponding camera optical center in the world coordinate system (c_ref for the reference image, c_src_j for the source image), n^T is a row vector representing the normal vector, and dist is the distance from c_ref to the plane hypothesis;
S4.3: mapping, through the homography matrix H, all pixel points x_i in a fixed-size window centered on p in I_ref to pixel points y_i in I_src_j, i.e. y_i = H x_i;
S4.4: using the weights of the joint bilateral filtering algorithm as the weights of the pixels in the window, the weight calculation formula being:

w(x_i) = exp( − ||p − x_i||_2 / (2σ_s²) − |C_p − C_{x_i}| / (2σ_c²) )

where ||p − x_i||_2 denotes the L2 distance between the coordinates of x_i and p, |C_p − C_{x_i}| denotes the absolute value of the difference between the pixel values of the two points, and σ_s and σ_c are fixed parameters; the similarity of the pixel values in the two corresponding windows of I_ref and I_src_j is finally calculated by the weighted Normalized Cross-Correlation function (NCC):

NCC_j(p, π_p) = Σ_{x_i ∈ W_p} w(x_i)(C_{x_i} − \bar{C}_{W_p})(C_{y_i} − \bar{C}_{W_{p'}}) / √( Σ_{x_i ∈ W_p} w(x_i)(C_{x_i} − \bar{C}_{W_p})² · Σ_{x_i ∈ W_p} w(x_i)(C_{y_i} − \bar{C}_{W_{p'}})² )

where j indicates that the corresponding source image is I_src_j, p is a pixel point in I_ref, π_p is the plane hypothesis of that point, W_p is an 11 × 11 window centered on p, C_p denotes the pixel value of the point, and \bar{C} denotes the mean of the pixel values within the window;
S4.5: improving the window mean in the NCC by replacing it with the pixel value of the window center, the result being named NCCC, with the calculation formula:

NCCC_j(p, π_p) = Σ_{x_i ∈ W_p} w(x_i)(C_{x_i} − C_p)(C_{y_i} − C_{p'}) / √( Σ_{x_i ∈ W_p} w(x_i)(C_{x_i} − C_p)² · Σ_{x_i ∈ W_p} w(x_i)(C_{y_i} − C_{p'})² )

where p' denotes the point to which p maps in I_src_j; the similarity results calculated by NCC and NCCC are compared, the higher value is selected as the similarity of the two points and rewritten in cost-function form, and the photometric consistency cost is calculated as:

e_p^j(p, π_p) = 1 − max( NCC_j(p, π_p), NCCC_j(p, π_p) )

where e_p^j(p, π_p) has a variation range of [0, 2].
Preferably, the S5 comprises:
S5.1: counting the number of source images I_src_j for which e_p^j(p, π_p) is less than 2, denoted N; if N is greater than zero, the photometric cost of π_p is:

e_p(p, π_p) = (1/N) Σ_{j: e_p^j(p,π_p) < 2} e_p^j(p, π_p)

If N equals zero:

e_p(p, π_p) = 2

Eight candidate hypothesis points are selected within the neighborhood of the point p.
Preferably, the S6 comprises:
S6.1: for the m source images and the eight candidate plane hypotheses corresponding to the eight candidate points, calculating the cost loss yields a cost matrix of size 8 × m:

A = (a_{i,j}) ∈ R^{8×m}, a_{i,j} = e_p(p, π_{n_i}), i = 1, 2, …, 8

where j is a positive integer not greater than m, π_{n_i} denotes the i-th of the eight plane hypotheses, and j indicates dense matching between I_ref and the j-th source image I_src_j; when a_{i,j} is less than the threshold τ(t), the point is judged a good point and added to S_g; when a_{i,j} is greater than τ_1, it is judged a bad point, t denoting the current iteration number; for a particular view src_j, the weight of its cost value is computed from the good points of its column, where |S_g(j)| denotes the number of good points of src_j and σ_v is a parameter used to adjust the magnitude of the weights.
Preferably, the S7 comprises:
S7.1: calculating the depth consistency cost between the multiple views:
first calculating the two-dimensional coordinates of the projection onto I_src of a certain pixel point in I_ref, then back-projecting the I_src coordinate point onto I_ref, and calculating the distance between the starting point and the end point as the depth consistency cost;
S7.2: calculating the normal vector loss term to obtain the normal vector consistency cost e_n^j(p, π_p), which penalizes the deviation between the normal vector hypothesized at p and the normal vector hypothesized at its corresponding point in I_src_j;
S7.3: integrating all cost terms to obtain the composite cost, namely:

e_j(p, π_p) = e_p^j(p, π_p) + λ_d e_d^j(p, π_p) + λ_n e_n^j(p, π_p)

where λ_d and λ_n are respectively the weights adjusting the depth error and the normal vector error, e_p^j is the photometric consistency cost, e_d^j is the depth consistency cost, and e_n^j is the normal vector consistency cost;
S7.4: performing cost aggregation according to the weight of each view in step S6 to obtain the comprehensive cost over the multiple views:

e(p, π_{n_i}) = Σ_{j=1}^{m} ω(src_j) e_j(p, π_{n_i}) / Σ_{j=1}^{m} ω(src_j)

where ω(src_j) is the weight of each view in S6, e_j(p, π_{n_i}) is the sum of the costs between the point p in the reference image I_ref and the corresponding point in I_src_j, i = 1, 2, …, 8, j is a positive integer not greater than m, π_{n_i} denotes the i-th of the eight plane hypotheses, and j indicates dense matching between I_ref and the j-th source image I_src_j.
Preferably, the S8 comprises: if the smallest value of e(p, π_{n_i}) is less than e(p, π_p), and the depth of the point corresponding to π_{n_i} lies within the interval (d_min, d_max) between the minimum and maximum depths, then π_{n_i} replaces π_p, namely:

π_p ← argmin_{π ∈ S_π} e(p, π)
preferably, step S9 includes: for plane hypothesis pi p =(n x ,n y ,n z D) performing a perturbation, calculating a planar assumption of the perturbation pi Corresponding to the sum of the costs in step S7.4, if π pi Has a composite cost less than the planar hypothesis pi p The composite cost of (2) is pi pi Substituted n p
According to another aspect of the present invention, there is also provided a three-dimensional reconstruction apparatus based on cross-correlation function and normal vector loss, comprising the following modules:
The data preprocessing module is used for down-sampling the original image and reducing the camera parameters;
the initialization plane hypothesis module is used for randomly initializing a plane hypothesis corresponding to each pixel point on the minimum scale;
the luminosity consistency cost calculation module is used for calculating luminosity consistency costs among the multi-view pictures according to the plane hypothesis;
the pixel-level visual angle selection module is used for calculating the weights of different visual angles according to the luminosity consistency cost;
the hypothesis propagation module is used for selecting one plane hypothesis with the minimum comprehensive cost in the candidate plane hypotheses of the eight neighborhoods;
the plane perturbation module is used for trying to find a better plane hypothesis through random perturbation;
and the up-sampling module is used for up-sampling the plane hypothesis with low scale through a joint bilateral up-sampling algorithm.
The method starts from the construction of a more accurate and reliable cost function, relying on the fact that the imaging points of a three-dimensional point under different views share the same normal vector and that their neighborhood pixel values have high similarity. First, the images are down-sampled and a plane hypothesis is randomly initialized for the down-sampled image; the quality of the plane hypothesis is evaluated by the cost function; a checkerboard propagation algorithm then propagates accurate plane hypotheses to their neighborhoods, followed by plane perturbation optimization; finally the image and the corresponding plane hypotheses are up-sampled and updated by further propagation and optimization until the image reaches its original size. When evaluating the quality of a plane hypothesis, the center-value normalized cross-correlation function evaluates the similarity between the center points of the two windows more accurately. Meanwhile, the normal vector loss among multiple views is introduced, which further alleviates the ambiguity of dense matching in weakly textured regions and improves the completeness and accuracy of the reconstructed point cloud.
The technical scheme provided by the invention has the following beneficial effects:
1. the method can solve the problem that dense matching of the image weak texture area has ambiguity, and improve the accuracy of dense matching.
2. Dense matching among multiple pictures can be completed rapidly using GPU parallel computing, generating large-scene three-dimensional point clouds containing tens of millions of points.
Drawings
The specific effects of the present invention will be further explained with reference to the drawings and examples, wherein:
FIG. 1 is a flow chart of a method for three-dimensional reconstruction based on cross-correlation function and normal vector loss in accordance with the present invention;
FIG. 2 is a general block diagram of the three-dimensional reconstruction method based on cross-correlation function and normal vector loss according to the present invention;
FIG. 3 is a screenshot of the point cloud generated by the present invention;
FIG. 4 is a depth contrast plot for the present invention and conventional methods;
fig. 5 is a diagram of normal vector comparison between the present invention and the conventional method.
Detailed Description
For a more clear understanding of the technical features, objects, and effects of the present invention, embodiments of the present invention will now be described in detail with reference to the accompanying drawings.
Referring to fig. 1 and fig. 2, this embodiment provides a more accurate dense-matching similarity measurement formula and introduces the normal vector loss between multi-view images; that is, it provides a three-dimensional reconstruction method based on cross-correlation function and normal vector loss, which mainly comprises the following steps:
S1: acquiring multi-view images of a scene to be reconstructed and the corresponding camera parameters;
Step S1 specifically includes:
The camera poses (K, R, t) and the sparse point cloud are obtained by sparse reconstruction through Structure from Motion (SfM), where K is the camera intrinsic matrix, a 3 × 3 matrix representing the correspondence between three-dimensional coordinates and their two-dimensional projections; R is the camera rotation matrix, a 3 × 3 orthonormal matrix whose three row vectors form an orthonormal basis and respectively represent the coordinates, in the world coordinate system, of unit vectors along the x, y and z axes of the camera coordinate system; and t is the translation vector, a 3 × 1 column vector representing the coordinates of the world-coordinate-system origin in the camera coordinate system.
The depth variation range (d_min, d_max) of each picture is determined from the depths of the feature points obtained by the SfM reconstruction.
The matched feature points in the multi-view images are counted to obtain the initial similarity between the multi-view images; specifically, the more feature points with the same name two images share, the higher their degree of similarity.
S2: down-sampling the images and correspondingly scaling down the camera intrinsic parameters;
Step S2 specifically comprises:
The image is down-sampled multiple times by bilinear interpolation until neither its length nor its width exceeds 1200 pixels; at each down-sampling, the focal length and the offset of the optical center on the imaging plane in the camera intrinsic parameters are multiplied by the down-sampling ratio.
S3: establishing an initial plane hypothesis π_p in a random initialization manner at the minimum scale;
Step S3 specifically comprises:
A plane hypothesis π_p contains four elements: a depth value d and a normal vector n = (n_x, n_y, n_z). Each pixel point in I_ref corresponds to one plane hypothesis; all depths, arranged by two-dimensional coordinates, form a depth map, and the three RGB channels of a three-channel normal-vector map respectively correspond to the three components of the normal vector.
The reciprocals of the maximum and minimum depths determined by the sparse reconstruction form the interval (1/d_max, 1/d_min), and the depth value is initialized by uniformly distributed random sampling within this interval, i.e.

d_init = 1 / u, u ~ U(1/d_max, 1/d_min)

Since most points are concentrated close to the camera optical center and fewer points lie far from it, uniform sampling over the inverse depth concentrates on smaller depth values and appropriately neglects larger ones.
In order to distribute the initialized normal vectors uniformly over the unit hemisphere, random samples are first drawn on the interval (−1, 1) for the initialization parameters q_1 and q_2, subject to

q_1² + q_2² < 1

The normal vector can then be initialized in the form

n = ( 2 q_1 √(1 − q_1² − q_2²), 2 q_2 √(1 − q_1² − q_2²), 1 − 2(q_1² + q_2²) )

Since an object can only be observed by the camera when the normal vector of its surface is oriented toward the camera's imaging plane, n is replaced by −n if the dot product of the normal vector and the camera's principal optical axis is greater than zero. The plane hypothesis initialization is then complete, i.e.

π_p = (n_x, n_y, n_z, d_init)
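A minimal sketch of this initialization follows; the Marsaglia-style hemisphere construction matches the q_1/q_2 description above, while the function names and the example depth range are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_depth(d_min: float, d_max: float) -> float:
    """Step S3: sample the depth uniformly in inverse-depth space, which
    favors the smaller depths where most scene points concentrate."""
    return 1.0 / rng.uniform(1.0 / d_max, 1.0 / d_min)

def init_normal(principal_axis: np.ndarray) -> np.ndarray:
    """Step S3: sample a unit normal uniformly (Marsaglia construction with
    q1, q2 drawn on (-1, 1)), then flip it to face the camera."""
    while True:
        q1, q2 = rng.uniform(-1.0, 1.0, size=2)
        s = q1 * q1 + q2 * q2
        if s < 1.0:
            break
    n = np.array([2.0 * q1 * np.sqrt(1.0 - s),
                  2.0 * q2 * np.sqrt(1.0 - s),
                  1.0 - 2.0 * s])
    if np.dot(n, principal_axis) > 0.0:  # must point toward the imaging plane
        n = -n
    return n

# One initialized plane hypothesis (n_x, n_y, n_z, d_init):
pi_p = (*init_normal(np.array([0.0, 0.0, 1.0])), init_depth(0.5, 50.0))
```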
S4: one image is selected in turn from the multi-view images to be reconstructed as the reference image; the reference image and the remaining images (the source images) are densely matched in pairs, and the photometric consistency cost is calculated using the center-value normalized cross-correlation function.
Step S4 specifically comprises the following steps:
For a multi-view image set I to be reconstructed, one image is selected from it in turn as the reference image I_ref; the remaining images are collectively denoted as the source images I_src. In order to recover the depth information lost by the camera during the three-dimensional-to-two-dimensional projection, I_ref and I_src are densely matched in pairs using the multi-view geometry principle.
From the camera poses in S1 and the initialized plane hypothesis in S3, the homography matrix H can be calculated. H is a 3 × 3 matrix that maps a pixel point p in I_ref to a pixel point p' in I_src. The formula for the homography matrix is:

H_j = K_src_j ( R_src_j R_ref^T + ( R_src_j (c_ref − c_src_j) n^T ) / dist ) K_ref^{-1}

where K_ref is the camera intrinsic matrix of I_ref, K_src_j is that of the j-th source image I_src_j, R denotes the camera rotation matrix of the corresponding image (R_src_j for the source image, R_ref^T being the transpose of the camera rotation matrix of the reference image), c is the column vector of the corresponding camera optical center in the world coordinate system (c_ref for the reference image, c_src_j for the source image), n^T is a row vector representing the normal vector of the plane hypothesis in S3, and dist is the distance from c_ref to the plane hypothesis, i.e.:

dist = d_init · n^T ( K_ref^{-1} p̃ )

where d_init is the depth initialized in S3 and p̃ is the homogeneous pixel coordinate of the point p in I_ref.
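A minimal sketch of the plane-induced homography above, assuming the world-coordinate pose convention x = K R (X − c) used in this description; the sign convention for n and dist may absorb a minus sign under other conventions, so this is an illustration rather than the definitive implementation.

```python
import numpy as np

def plane_homography(K_ref, R_ref, c_ref, K_src, R_src, c_src, n, dist):
    """Plane-induced homography H mapping reference pixels to source pixels
    for the plane hypothesis (n, dist)."""
    R_rel = R_src @ R_ref.T                 # relative rotation
    t_rel = R_src @ (c_ref - c_src)         # relative translation
    return K_src @ (R_rel + np.outer(t_rel, n) / dist) @ np.linalg.inv(K_ref)

def warp(H, p):
    """Map pixel p = (u, v) through H using homogeneous coordinates (y_i = H x_i)."""
    y = H @ np.array([p[0], p[1], 1.0])
    return y[:2] / y[2]
```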
Although the direct purpose of the invention is to evaluate the matching degree between two pixel points, the information in two single pixel values is too small to evaluate it accurately. Considering that pixel values in an image vary continuously, a window centered on a point is rich in information, and the neighborhood windows of the same point have good similarity across views; therefore, all pixel points x_i in a fixed-size window centered on p in I_ref can be mapped through the homography matrix H to pixel points y_i in I_src_j, i.e. y_i = H x_i.
The similarity of the pixel values in the two windows is calculated by the weighted Normalized Cross-Correlation (NCC). Considering that pixels in the window whose color changes markedly may have a large influence on the result, the weight of the joint bilateral filtering algorithm is used as the weight of each pixel in the window, with the weight calculation formula:

w(x_i) = exp( − ||p − x_i||_2 / (2σ_s²) − |C_p − C_{x_i}| / (2σ_c²) )

where the exponential term describes the degree of approximation between x_i and the center point p, ||p − x_i||_2 denotes the L2 distance between the coordinates of x_i and p, |C_p − C_{x_i}| denotes the absolute value of the difference between the pixel values of the two points, and σ_s and σ_c are fixed parameters. The closer x_i and p are in color and distance, the greater the contribution of x_i to the final result, and vice versa. The function finally used to calculate the similarity of the two corresponding windows in I_ref and I_src_j is:

NCC_j(p, π_p) = Σ_{x_i ∈ W_p} w(x_i)(C_{x_i} − \bar{C}_{W_p})(C_{y_i} − \bar{C}_{W_{p'}}) / √( Σ_{x_i ∈ W_p} w(x_i)(C_{x_i} − \bar{C}_{W_p})² · Σ_{x_i ∈ W_p} w(x_i)(C_{y_i} − \bar{C}_{W_{p'}})² )

The higher the similarity, the closer the corresponding plane hypothesis is to the true value; the numerator is the covariance of the elements between the two windows, and the denominator plays a normalizing role.
Because the NCC approximates the similarity between the center pixels of two windows by the similarity between the windows, this approximation inevitably brings unacceptable errors as the window size grows. In order to measure the similarity relation between the center points more accurately, the window mean in the NCC is improved: the pixel value of the window center replaces the window mean, so that the cost function evaluates the similarity between the center points more accurately. The modified formula is named the center-value Normalized Cross-Correlation function (NCCC) and is calculated as:

NCCC_j(p, π_p) = Σ_{x_i ∈ W_p} w(x_i)(C_{x_i} − C_p)(C_{y_i} − C_{p'}) / √( Σ_{x_i ∈ W_p} w(x_i)(C_{x_i} − C_p)² · Σ_{x_i ∈ W_p} w(x_i)(C_{y_i} − C_{p'})² )

where p' denotes the point to which p maps in I_src_j. Although the NCCC is more accurate, it is not as robust as the NCC; therefore, for the plane hypothesis of a given point, the similarity results calculated by NCC and NCCC are compared, the higher value is selected as the similarity of the two points, and, for convenient integration with the other cost terms, the similarity is rewritten in cost-function form, namely:

e_p^j(p, π_p) = 1 − max( NCC_j(p, π_p), NCCC_j(p, π_p) )

where e_p^j(p, π_p) has a variation range of [0, 2]; it is named the photometric consistency cost and denotes the cost of the plane hypothesis π_p calculated between I_ref and I_src_j.
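A self-contained sketch of this photometric term for a single pair of corresponding patches follows; the unsquared exponent form of the bilateral weight is an assumption reconstructed from the text, and the helper names are illustrative.

```python
import numpy as np

def bilateral_weights(patch, sigma_s=1.0, sigma_c=20.5):
    """Joint-bilateral weights over a square patch (step S4.4); the unsquared
    exponent form is an assumption reconstructed from the text."""
    r = patch.shape[0] // 2
    yy, xx = np.mgrid[-r:r + 1, -r:r + 1]
    d_space = np.sqrt(xx ** 2 + yy ** 2)           # ||p - x_i||_2
    d_color = np.abs(patch - patch[r, r])          # |C_p - C_{x_i}|
    w = np.exp(-d_space / (2 * sigma_s ** 2) - d_color / (2 * sigma_c ** 2))
    return w / w.sum()

def photometric_cost(ref_patch, src_patch):
    """e_p^j = 1 - max(NCC, NCCC) for two corresponding (e.g. 11x11) patches."""
    r = ref_patch.shape[0] // 2
    w = bilateral_weights(ref_patch)

    def corr(mu_a, mu_b):
        da, db = ref_patch - mu_a, src_patch - mu_b
        den = np.sqrt(np.sum(w * da ** 2) * np.sum(w * db ** 2)) + 1e-12
        return np.sum(w * da * db) / den

    ncc = corr(np.sum(w * ref_patch), np.sum(w * src_patch))   # weighted window means
    nccc = corr(ref_patch[r, r], src_patch[r, r])              # center-value means
    return 1.0 - max(ncc, nccc)                                # range [0, 2]

patch = np.random.default_rng(1).random((11, 11))
assert abs(photometric_cost(patch, patch)) < 1e-9   # identical patches -> cost ~ 0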
S5: selecting, in a checkerboard fashion, the plane hypothesis π_pj corresponding to the lowest-cost point in each of the eight directions in the neighborhood of the center point as a candidate plane hypothesis;
Step S5 specifically comprises:
For the plane hypothesis π_p in I_ref, the photometric consistency costs e_p^j(p, π_p) under the different views I_src_j are aggregated. The number of source images I_src_j for which e_p^j(p, π_p) is less than 2 is counted and denoted N. If N is greater than zero, the cost of π_p is:

e_p(p, π_p) = (1/N) Σ_{j: e_p^j(p,π_p) < 2} e_p^j(p, π_p)

If N equals zero:

e_p(p, π_p) = 2

In each of the eight candidate regions (the four directions up, down, left and right at near range, and the same four directions at far range), the point with the smallest e_p(p, π_p) is taken as the candidate hypothesis point of that region, yielding eight candidate hypothesis points, as sketched below.
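A sketch of the per-pixel photometric aggregation and of one possible checkerboard neighborhood; the far-neighbor offset of 3 pixels is an illustrative assumption, since the patent does not state the exact sampling pattern.

```python
import numpy as np

def aggregate_photometric(per_view_costs: np.ndarray) -> float:
    """Average e_p^j over the views where it is informative (< 2);
    return 2 when no view matches (step S5)."""
    valid = per_view_costs[per_view_costs < 2.0]
    return float(valid.mean()) if valid.size else 2.0

# Hypothetical checkerboard neighborhood: the four axis directions at near
# range (offset 1) and at far range (offset 3; the exact far offset is an
# illustrative assumption).
OFFSETS = [(-1, 0), (1, 0), (0, -1), (0, 1),
           (-3, 0), (3, 0), (0, -3), (0, 3)]

def candidate_points(cost_map: np.ndarray, p):
    """Return up to eight candidate hypothesis points around pixel p,
    one representative per direction in this simplified sketch."""
    h, w = cost_map.shape
    return [(p[0] + dy, p[1] + dx) for dy, dx in OFFSETS
            if 0 <= p[0] + dy < h and 0 <= p[1] + dx < w]
```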
S6: performing pixel-level view selection using the photometric loss and calculating the weights of the different views;
Step S6 specifically comprises:
For the m source images and the eight candidate plane hypotheses, calculating the cost loss yields a cost matrix of size 8 × m:

A = (a_{i,j}) ∈ R^{8×m}, a_{i,j} = e_p(p, π_{n_i})

where π_{n_i} denotes the i-th of the eight plane hypotheses and j indicates dense matching between I_ref and the j-th source image I_src (dense matching differs from feature matching: feature matching needs to find the same feature points in the pictures, whereas dense matching tries to find, for every pixel point in the reference image, the corresponding pixel point in the other source images). When a_{i,j} is less than the threshold τ(t), the point is judged a good point and added to S_g; when a_{i,j} is greater than τ_1, it is judged a bad point, t denoting the current iteration number. For a particular view src_j, i.e. the j-th column of the matrix A: if the number of good points is greater than 2 and the number of bad points is less than 3, this view is used for multi-view stereo matching (multi-view stereo matching is in fact a set of binocular stereo matchings: the reference image is densely matched pairwise with each of the other source images), and the weight of its cost is computed from the good points of the column, where |S_g(j)| denotes the number of good points of src_j and σ_v is a parameter adjusting the weights, its value set according to experiment and experience.
If the number of good points is less than 2 and the number of bad points is less than 3, the view is still used for multi-view stereo matching, and the weight of its cost value is:

ω(src_j) = τ(t)

If the number of bad points is greater than 3, the view is not used for multi-view stereo matching and its weight is 0. A sketch of this selection rule follows.
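The selection rule can be sketched as below. The decreasing schedule for τ(t) and the exponential weight expression are assumptions (the patent only states that the weight depends on |S_g(j)| and is tuned by σ_v); the good/bad counting follows the text above.

```python
import numpy as np

def view_weights(A: np.ndarray, t: int, tau0: float = 1.2, tau1: float = 1.2,
                 sigma_v: float = 0.28) -> np.ndarray:
    """Pixel-level view selection over the 8 x m cost matrix A (step S6)."""
    tau_t = tau0 * 0.9 ** t                     # assumed decreasing threshold tau(t)
    weights = np.zeros(A.shape[1])
    for j in range(A.shape[1]):
        good = A[:, j] < tau_t                  # good points S_g(j)
        bad = A[:, j] > tau1                    # bad points
        if good.sum() > 2 and bad.sum() < 3:
            mean_cost = A[good, j].mean()
            weights[j] = (good.sum() / 8.0) * np.exp(-mean_cost ** 2 /
                                                     (2 * sigma_v ** 2))  # assumed form
        elif bad.sum() < 3:
            weights[j] = tau_t                  # fallback weight omega = tau(t)
        # else: weight stays 0 and the view is not used
    return weights
```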
S7: calculating the photometric consistency cost, depth consistency cost and normal vector consistency cost of the binocular stereo matching, and weight-averaging the costs of the multi-view images according to the weights of S6;
Step S7 specifically comprises the following steps:
The depth consistency cost between the multiple views is calculated. Both NCC and NCCC compute similarity from color information alone; in addition, the geometric relation between the depth values corresponding to the multi-view images can be used to improve the accuracy of dense matching. First, the three-dimensional coordinate of a pixel point in I_ref is computed from its two-dimensional coordinate through the depth and the camera parameters, namely:

X_ref(p) = R_ref^T ( d · K_ref^{-1} p̃ ) + c_ref

Then, using the camera parameters of I_src, the coordinates of the imaging point of the three-dimensional point X_ref(p) in I_src are calculated, namely p_src = P_src X_ref(p), where P_src = K_src [R_src | t] = K_src [R_src | −R_src c_src]. A similar operation is performed on the two-dimensional coordinate p_src to calculate its corresponding point x in I_ref, and the distance between x and p is used as the depth cost function of dense matching to improve the matching accuracy. The depth cost is calculated as:

e_d^j(p, π_p) = min( ||x − p||_2, δ )

When occlusion occurs among the multiple views, the depth cost becomes very large; a truncation threshold δ is therefore introduced to make the cost function more robust.
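A sketch of this forward-backward reprojection check, under the same pose convention as above; depth_src is a hypothetical lookup into the source depth map (bilinear interpolation of that map is one natural choice).

```python
import numpy as np

def project(K, R, c, X):
    """World point X -> pixel, pose convention x = K R (X - c)."""
    x = K @ (R @ (X - c))
    return x[:2] / x[2]

def backproject(K, R, c, p, depth):
    """Pixel p at the given depth -> world point."""
    ray = np.linalg.inv(K) @ np.array([p[0], p[1], 1.0])
    return R.T @ (depth * ray) + c

def depth_consistency_cost(cam_ref, cam_src, p, d_ref, depth_src, delta=1.0):
    """Forward-backward reprojection distance, truncated at delta (step S7.1).
    `cam_*` are (K, R, c) tuples; `depth_src` maps a source pixel to a depth."""
    X = backproject(*cam_ref, p, d_ref)
    p_src = project(*cam_src, X)
    X_back = backproject(*cam_src, p_src, depth_src(p_src))
    p_back = project(*cam_ref, X_back)
    return min(float(np.linalg.norm(p_back - np.asarray(p, float))), delta)
```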
The normal vector loss term is then calculated, yielding the normal vector consistency cost e_n^j(p, π_p), which penalizes the deviation between the normal vector hypothesized at p and the normal vector hypothesized at its corresponding point in I_src_j.
All cost terms are integrated to obtain the composite cost, namely:

e_j(p, π_p) = e_p^j(p, π_p) + λ_d e_d^j(p, π_p) + λ_n e_n^j(p, π_p)

where λ_d and λ_n are the weights adjusting the depth error and the normal vector error.
Cost aggregation is performed on the composite cost values of the multiple views according to the weight of each view in step S6, that is:

e(p, π_{n_i}) = Σ_{j=1}^{m} ω(src_j) e_j(p, π_{n_i}) / Σ_{j=1}^{m} ω(src_j)

where ω(src_j) is the weight of each view in S6, e_j(p, π_{n_i}) is the sum of the costs between the point p in the reference image I_ref and the corresponding point in I_src_j, i = 1, 2, …, 8, j is a positive integer not greater than m, π_{n_i} denotes the i-th of the eight plane hypotheses, and j indicates dense matching between I_ref and the j-th source image I_src_j;
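Combining the three terms and aggregating over views can be sketched in a few lines; the additive composite form follows the description of λ_d and λ_n as weights on the two extra error terms, and the normalized weighted average is an assumption.

```python
import numpy as np

def composite_cost(e_photo, e_depth, e_normal, lambda_d=0.8, lambda_n=0.42):
    """Per-view composite cost e^j = e_p^j + lambda_d*e_d^j + lambda_n*e_n^j."""
    return e_photo + lambda_d * e_depth + lambda_n * e_normal

def aggregate_views(costs: np.ndarray, weights: np.ndarray) -> float:
    """Normalized weighted average of the per-view composite costs (S7.4)."""
    wsum = weights.sum()
    return float((weights * costs).sum() / wsum) if wsum > 0 else 2.0
```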
S8: selecting the plane hypothesis with the lowest comprehensive cost as the new plane hypothesis of the pixel point p;
Step S8 specifically comprises:
If the smallest value of e(p, π_{n_i}) is less than e(p, π_p), and the depth of the point corresponding to π_{n_i} lies within the interval (d_min, d_max), then π_{n_i} replaces π_p, namely:

π_p ← argmin_{π ∈ S_π} e(p, π)
S9: finding a lower-cost plane hypothesis through random perturbation and updating the plane hypothesis.
Step S9 specifically comprises:
The plane hypothesis π_p = (n_x, n_y, n_z, d) is perturbed. Let d_perturb ~ U(d − γ, d + γ), where γ is the perturbation amplitude. A random depth d_rand and a random normal vector n_rand are additionally drawn in the same manner as the initialization of S3, with q_1 ~ U(−1, 1) and q_2 ~ U(−1, 1) subject to

q_1² + q_2² < 1

The perturbed normal vector is obtained as

n_perturb = R n

where R is the rotation matrix obtained by rotating the vector about the x, y and z axes by the angles θ_x, θ_y and θ_z; to speed up the computation, the three rotations are combined into a single matrix R.
The new normal vectors and depths obtained by perturbation and random sampling are combined to form new plane hypotheses:

π_p1 = (n, d_rand),
π_p2 = (n_rand, d),
π_p3 = (n_perturb, d_rand),
π_p4 = (n_perturb, d),
π_p5 = (n, d_perturb)

The composite cost corresponding to each new plane hypothesis is calculated; together with the initial π_p, the plane hypothesis with the least cost is selected as the new π_p, completing the checkerboard hypothesis propagation. A sketch of the candidate generation follows.
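A sketch of the candidate generation of step S9; the small-angle (linearized) rotation matrix and the range of the fully random depth draw are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def small_rotation(theta: float = 0.04 * np.pi) -> np.ndarray:
    """Single matrix combining small rotations about x, y, z; the
    small-angle linearization is an assumption."""
    tx, ty, tz = rng.uniform(-theta, theta, size=3)
    return np.array([[1.0, -tz,  ty],
                     [ tz, 1.0, -tx],
                     [-ty,  tx, 1.0]])

def perturb_candidates(n: np.ndarray, d: float, gamma: float = 0.2):
    """Build the five candidate hypotheses pi_p1..pi_p5 of step S9."""
    d_perturb = rng.uniform(d - gamma, d + gamma)
    d_rand = rng.uniform(0.5 * d, 1.5 * d)            # assumed random-depth range
    n_rand = rng.standard_normal(3)
    n_rand /= np.linalg.norm(n_rand)                  # random unit normal
    n_pert = small_rotation() @ n
    n_pert /= np.linalg.norm(n_pert)
    return [(n, d_rand), (n_rand, d), (n_pert, d_rand),
            (n_pert, d), (n, d_perturb)]
```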
S10, repeatedly executing the steps S4-S9 four times under the current scale, wherein the comprehensive cost is gradually reduced along with the increase of the iteration times;
S11: upsampling the optimal hypothesis of the current scale by the joint bilateral upsampling method to serve as the initialization result of the plane hypothesis at the next scale, and scaling up the camera intrinsic parameters;
Step S11 specifically includes:
The obtained plane hypotheses are jointly bilaterally upsampled. Let o be the point of the low-scale image corresponding to a pixel point p of the upsampled image, and let k be the upsampling ratio. Every point p_i in a k × k window centered on o is traversed, and the spatial similarity is calculated:

S_i = exp( − ||o − p_i||_2 / (2σ_s²) )

together with the color similarity:

C_i = exp( − |C_{p_i} − C_o| / (2σ_c²) )

where C_i is the color similarity and C_{p_i} and C_o are the pixel values of the points p_i and o, respectively. The depth of the point p is then:

d_p = Σ_i S_i C_i d_i / Σ_i S_i C_i

where d_i is the depth value corresponding to the point p_i; the three normal-vector components of the point p are calculated in the same way, e.g.:

n_x(p) = Σ_i S_i C_i n_{x_i} / Σ_i S_i C_i

where n_{x_i} is the x-component of the normal vector corresponding to p_i. The offsets of the focal length and the optical center on the imaging plane in the camera intrinsic parameters are then multiplied by the magnification.
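A direct (unoptimized) sketch of this joint bilateral upsampling for a single-channel map; the kernel exponents mirror the similarity terms above and are assumptions, and plain loops are used for clarity rather than speed.

```python
import numpy as np

def jbu_upsample(low_map, guide, k, sigma_s=1.0, sigma_c=20.5):
    """Joint bilateral upsampling of one low-scale channel (depth or a
    normal component) guided by the high-scale grayscale image (step S11)."""
    H, W = guide.shape
    h, w = low_map.shape
    out = np.zeros((H, W))
    r = k // 2
    for y in range(H):
        for x in range(W):
            oy, ox = min(y // k, h - 1), min(x // k, w - 1)   # low-scale point o
            num = den = 0.0
            for dy in range(-r, r + 1):
                for dx in range(-r, r + 1):
                    py, px = oy + dy, ox + dx
                    if not (0 <= py < h and 0 <= px < w):
                        continue
                    gy, gx = min(py * k, H - 1), min(px * k, W - 1)
                    s_i = np.exp(-np.hypot(dy, dx) / (2 * sigma_s ** 2))
                    c_i = np.exp(-abs(float(guide[y, x]) - float(guide[gy, gx]))
                                 / (2 * sigma_c ** 2))
                    num += s_i * c_i * low_map[py, px]
                    den += s_i * c_i
            out[y, x] = num / den
    return out
```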
S12: the cost calculation, hypothesis propagation, plane perturbation optimization and upsampling steps are continued until the current scale reaches the original scale of the image, i.e. the maximum scale; inconsistent depth estimates are removed according to the depth maps and normal vectors, and the point cloud is synthesized from the depth maps and camera parameters to obtain the reconstructed three-dimensional map.
In order to test the performance of the method, experiments were carried out on the self-collected unmanned-aerial-vehicle surface dataset ZIGUI and on the public three-dimensional reconstruction dataset ETH3D.
The ZIGUI dataset was collected by drone aerial photography; the photographed object is the earth's surface, with more than a hundred images of an area taken from different views, most regions in the images being strongly textured. In FIG. 3, the left image shows a screenshot of the reconstructed point cloud of a pipe scene in ETH3D, demonstrating the qualitative performance of the method on an indoor dataset; the right image shows the reconstructed point cloud of an area in the ZIGUI dataset, demonstrating the performance of the method on an outdoor drone dataset.
(1) Experimental parameters
The evaluation indexes in the quantitative experiments mainly comprise three: accuracy, completeness, and F-score.
Under a tolerance ε (2 cm by default), if a point G in the ground truth falls within a sphere of radius ε centered on some point P of the reconstructed point cloud, G is judged an interior point. Let the number of points in the ground truth be N_G and the number of interior points be N_in; completeness is then defined as:

completeness = N_in / N_G

Likewise, exchanging the roles of the point cloud and the ground truth: if a point P of the point cloud falls within a sphere of radius ε centered on some point G of the ground truth, P is judged an interior point. Let the number of points in the point cloud be N_P and the number of interior points be N_in; accuracy is then defined as:

accuracy = N_in / N_P

F-score is a comprehensive assessment of accuracy and completeness, defined as:

F-score = 2 · accuracy · completeness / (accuracy + completeness)
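The three indexes can be computed with nearest-neighbor queries, as sketched below using SciPy's cKDTree; this is an evaluation-side illustration, not part of the patented method.

```python
import numpy as np
from scipy.spatial import cKDTree

def point_cloud_scores(points: np.ndarray, gt: np.ndarray, eps: float = 0.02):
    """Completeness, accuracy and F-score at tolerance eps (default 2 cm).
    `points` and `gt` are (N, 3) arrays; nearest-neighbor distances stand in
    for the within-a-sphere test described above."""
    d_gt = cKDTree(points).query(gt)[0]       # each GT point to the cloud
    completeness = float((d_gt <= eps).mean())
    d_pc = cKDTree(gt).query(points)[0]       # each cloud point to the GT
    accuracy = float((d_pc <= eps).mean())
    f_score = 2 * accuracy * completeness / (accuracy + completeness + 1e-12)
    return completeness, accuracy, f_score
```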
after multiple times of parameter debugging of experiments, the hyper-parameter in the experiments is set as sigma s =1、σ c =20.5、σ v =0.28、γ=0.2、θ x =θ y =θ z =0.04π、τ 0 =1.2、σ 1 =0.5、σ t =20、λ d =0.8、λ n =0.42。
(2) Quantitative and qualitative three-dimensional reconstruction tests
Finally, the quantitative results on ETH3D are shown in Table 1; the method provided by the invention brings clear gains in both completeness and accuracy, and the F-score improves by about two points, significantly improving the quality of the reconstructed point cloud.
Qualitative experimental results are shown in fig. 3 and fig. 4, with the results of the baseline method on the left and the results of the present method on the right; fig. 5 shows the normal vector comparison between the invention and the conventional method. The quantitative experimental results obtained are shown in Table 1:
Method                                    Completeness   Accuracy   F-score
NCC                                       0.726445       0.903786   0.805469
Center-value cost                         0.736656       0.906098   0.812638
Center-value cost + normal vector cost    0.753447       0.904288   0.822005
In conclusion, whether in depth, normal vectors, or the finally generated point cloud, the experimental results of this method are better than those of the original method; the center-value normalized cross-correlation function and the normal vector consistency cost function are very helpful to improving the quality of the reconstructed point cloud.
In some embodiments, a three-dimensional reconstruction apparatus of a center-valued normalized cross-correlation function and normal vector loss is also provided, comprising the following modules:
and the data preprocessing module is used for down-sampling the original image and reducing the camera parameters.
And the plane hypothesis initializing module is used for randomly initializing the plane hypothesis corresponding to each pixel point on the minimum scale.
The luminosity consistency cost calculation module is used for calculating luminosity consistency costs among the multi-view pictures according to the plane hypothesis;
the pixel-level visual angle selection module is used for calculating the weights of different visual angles according to the luminosity consistency cost;
the hypothesis propagation module is used for selecting one hypothesis with the minimum comprehensive cost in the candidate plane hypotheses of the eight neighborhoods;
the plane perturbation module is used for trying to find a better plane hypothesis through random perturbation;
and the upsampling module is used for upsampling the low-scale plane hypotheses to the high scale through the joint bilateral upsampling algorithm. For example, for a low-scale picture of 400 × 400, each point corresponds to one plane hypothesis, so the plane hypotheses have the same 400 × 400 dimension as the picture; when the low-scale picture is upsampled, the plane hypotheses are regarded as a "multi-channel picture" and upsampled by the same method to obtain a higher-dimensional "multi-channel picture".
The method starts from the construction of a more accurate and reliable cost function, relying on the fact that the imaging points of a three-dimensional point under different views share the same normal vector and that their neighborhood pixel values have high similarity. First, the images are down-sampled and a plane hypothesis is randomly initialized for the down-sampled image; the quality of the plane hypothesis is evaluated by the cost function; a checkerboard propagation algorithm then propagates accurate plane hypotheses to their neighborhoods, followed by plane perturbation optimization; finally the image and the corresponding plane hypotheses are up-sampled and updated by further propagation and optimization until the image reaches its original size. When evaluating the quality of a plane hypothesis, the center-value normalized cross-correlation function evaluates the similarity between the center points of the two windows more accurately. Meanwhile, the normal vector loss among multiple views is introduced, which further alleviates the ambiguity of dense matching in weakly textured regions and improves the completeness and accuracy of the reconstructed point cloud.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or system. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or system that comprises the element.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments. In the unit claims enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, third and the like do not denote any order, but rather the words first, second and the like may be interpreted as indicating any order.
The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims (8)

1. A three-dimensional reconstruction method based on cross-correlation function and normal vector loss, characterized by comprising the following steps:
S1: acquiring multi-view images of a scene to be reconstructed and the corresponding camera parameters;
S2: down-sampling the images and scaling down the camera intrinsic parameters;
S3: adjusting the images to the minimum scale and establishing an initial plane hypothesis π_p in a random initialization manner;
S4: selecting one of the multi-view images to be reconstructed in turn as the reference image, performing pairwise dense matching between the reference image and the remaining images, and calculating the photometric consistency cost among the multi-view images using the center-value normalized cross-correlation function; the remaining images are the source images;
S5: selecting, in a checkerboard fashion, the lowest-cost point in each of the eight directions in the neighborhood of the pixel point p to be processed as a candidate point p_j; the plane hypothesis corresponding to each candidate point being denoted π_pj, the candidate plane hypotheses and the initial plane hypothesis π_p forming the candidate plane hypothesis set S_π;
S6: performing pixel-level view selection on the images using the photometric consistency cost, and calculating the weights of the cost values of the different views;
S7: forming an image pair from the reference image and a source image and performing binocular stereo matching; calculating the photometric consistency cost, depth consistency cost and normal vector consistency cost corresponding to each plane hypothesis in S_π, and weight-averaging them over the multi-view images according to the weights obtained in step S6 to obtain the comprehensive cost;
S8: selecting the plane hypothesis with the lowest comprehensive cost as the new plane hypothesis of the pixel point p;
S9: finding a lower-cost plane hypothesis through random perturbation and updating the new plane hypothesis of step S8;
S10: repeating steps S4-S9 four times at the current scale, the comprehensive cost decreasing gradually as the number of iterations increases;
S11: processing the optimal plane hypothesis at the current scale by the joint bilateral upsampling method to serve as the initialization result of the plane hypothesis at the next scale, and scaling up the camera intrinsic parameters;
S12: continuing the comprehensive cost calculation of steps S4-S7, the hypothesis propagation of step S8, the plane perturbation optimization of step S9, and the iteration and upsampling operations of steps S10-S11 until the current scale reaches the original scale of the image.
2. The three-dimensional reconstruction method based on cross-correlation function and normal vector loss according to claim 1, wherein said S4 comprises:
S4.1: for the multi-view image set I to be reconstructed, selecting one image from it in turn as the reference image I_ref, the remaining images being collectively denoted as the source images I_src, and performing pairwise dense matching between I_ref and I_src;
S4.2: calculating the homography matrix H:

H_j = K_src_j ( R_src_j R_ref^T + ( R_src_j (c_ref − c_src_j) n^T ) / dist ) K_ref^{-1}

where K_ref is the camera intrinsic matrix of I_ref, K_src_j is that of the j-th source image I_src_j, R denotes the camera rotation matrix of the corresponding image (R_src_j for the source image, R_ref^T being the transpose of the camera rotation matrix of the reference image), c is the column vector of the corresponding camera optical center in the world coordinate system (c_ref for the reference image, c_src_j for the source image), n^T is a row vector representing the normal vector, and dist is the distance from c_ref to the plane hypothesis;
S4.3: mapping, through the homography matrix H, all pixel points x_i in a fixed-size window centered on p in I_ref to pixel points y_i in I_src_j, i.e. y_i = H x_i;
S4.4: using the weights of the joint bilateral filtering algorithm as the weights of the pixels in the window, the weight calculation formula being:

w(x_i) = exp( − ||p − x_i||_2 / (2σ_s²) − |C_p − C_{x_i}| / (2σ_c²) )

where ||p − x_i||_2 denotes the L2 distance between the coordinates of x_i and p, |C_p − C_{x_i}| denotes the absolute value of the difference between the pixel values of the two points, and σ_s and σ_c are fixed parameters; the similarity of the pixel values in the two corresponding windows of I_ref and I_src_j is finally calculated by the weighted normalized cross-correlation function:

NCC_j(p, π_p) = Σ_{x_i ∈ W_p} w(x_i)(C_{x_i} − \bar{C}_{W_p})(C_{y_i} − \bar{C}_{W_{p'}}) / √( Σ_{x_i ∈ W_p} w(x_i)(C_{x_i} − \bar{C}_{W_p})² · Σ_{x_i ∈ W_p} w(x_i)(C_{y_i} − \bar{C}_{W_{p'}})² )

where j indicates that the corresponding source image is I_src_j, p is a pixel point in I_ref, π_p is the plane hypothesis of that point, W_p is an 11 × 11 window centered on p, C_p denotes the pixel value of the point, and \bar{C} denotes the mean of the pixel values within the window;
S4.5: improving the window mean in the NCC by replacing it with the pixel value of the window center, the result being named NCCC, with the calculation formula:

NCCC_j(p, π_p) = Σ_{x_i ∈ W_p} w(x_i)(C_{x_i} − C_p)(C_{y_i} − C_{p'}) / √( Σ_{x_i ∈ W_p} w(x_i)(C_{x_i} − C_p)² · Σ_{x_i ∈ W_p} w(x_i)(C_{y_i} − C_{p'})² )

where p' denotes the point to which p maps in I_src_j; the similarity results calculated by NCC and NCCC are compared, the higher value is selected as the similarity of the two points and rewritten in cost-function form, and the photometric consistency cost is calculated as:

e_p^j(p, π_p) = 1 − max( NCC_j(p, π_p), NCCC_j(p, π_p) )

where e_p^j(p, π_p) has a variation range of [0, 2].
3. The three-dimensional reconstruction method based on cross-correlation function and normal vector loss according to claim 2, wherein said S5 comprises:
S5.1: counting the number of source images I_src_j for which e_p^j(p, π_p) is less than 2, denoted N; if N is greater than zero, the photometric consistency cost of π_p is:

e_p(p, π_p) = (1/N) Σ_{j: e_p^j(p,π_p) < 2} e_p^j(p, π_p)

If N equals zero:

e_p(p, π_p) = 2

Eight candidate hypothesis points are selected within the neighborhood of the point p.
4. The three-dimensional reconstruction method based on cross-correlation function and normal vector loss according to claim 1, wherein said S6 comprises:
S6.1: for the m source images and the eight candidate plane hypotheses corresponding to the eight candidate points, calculating the cost loss yields a cost matrix of size 8 × m:

A = (a_{i,j}) ∈ R^{8×m}, a_{i,j} = e_p(p, π_{n_i}), i = 1, 2, …, 8

where j is a positive integer not greater than m, π_{n_i} denotes the i-th of the eight plane hypotheses, and j indicates dense matching between I_ref and the j-th source image I_src_j; when a_{i,j} is less than the threshold τ(t), the point is judged a good point and added to S_g; when a_{i,j} is greater than τ_1, it is judged a bad point, t denoting the current iteration number; for a particular view src_j, the weight of its cost value is computed from the good points of its column, where |S_g(j)| denotes the number of good points of src_j and σ_v is a parameter used to adjust the magnitude of the weights.
5. The method for three-dimensional reconstruction based on cross-correlation function and normal vector penalty of claim 4, wherein said S7 includes:
s7.1: calculating the depth consistency cost between multiple views:
first calculate I ref A certain pixel point in (1) corresponds to src Two-dimensional coordinates of the projection of (a) followed by inverting I src Coordinate points are projected to I ref Calculating the distance between a starting point and an end point as the depth consistency cost;
s7.2: and (3) calculating a normal vector loss term to obtain the consistent cost of the normal vector as follows:
Figure FDA0003671452040000044
s7.3: and integrating all the cost items to obtain a comprehensive cost, namely:
Figure FDA0003671452040000045
wherein λ d And λ n Respectively weights for adjusting the depth error and the normal vector error,
Figure FDA0003671452040000046
in order to achieve a consistent cost in terms of luminosity,
Figure FDA0003671452040000047
in order to achieve a consistent cost in depth,
Figure FDA0003671452040000048
the cost is consistent with the normal vector;
s7.4: perform cost aggregation according to the weight of each view from step S6 to obtain the composite cost of the multiple views:

$$e(p, \pi_{n_i}) = \sum_{j=1}^{m} \omega(src_j)\, e_j(p, \pi_{n_i})$$

wherein $\omega(src_j)$ is the weight of each view from S6, $e_j(p, \pi_{n_i})$ is the sum of the costs between point $p$ on the reference image $I_{ref}$ and the corresponding point on $I_{src_j}$, $i = 1, 2, \ldots, 8$, $j$ is a positive integer no greater than $m$, $\pi_{n_i}$ denotes the $i$th of the eight plane hypotheses, and $j$ indexes the dense matching between $I_{ref}$ and the $j$th source image $I_{src_j}$.
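A sketch of the forward-backward reprojection of S7.1 and the weighted aggregation of S7.4, assuming pinhole intrinsics, a relative pose $(R, t)$ from the reference camera to the source camera, and a per-pixel source depth map; the weighted average in `multi_view_cost` (rather than a plain weighted sum) is an assumption:

```python
import numpy as np

def depth_consistency_cost(p_ref, depth_ref, K_ref, K_src, R, t,
                           depth_src_map):
    """Distance between a reference pixel and its forward-backward
    reprojection (S7.1).  Image-bounds checks are omitted for brevity."""
    u, v = p_ref
    # lift the reference pixel to 3D with its hypothesised depth
    X = depth_ref * (np.linalg.inv(K_ref) @ np.array([u, v, 1.0]))
    # project into the source view
    x_src = K_src @ (R @ X + t)
    u2, v2 = x_src[:2] / x_src[2]
    # back-project the source pixel with the source view's own depth
    d_src = depth_src_map[int(round(v2)), int(round(u2))]
    X_back = d_src * (np.linalg.inv(K_src) @ np.array([u2, v2, 1.0]))
    # map back into the reference view and measure the drift
    x_ref = K_ref @ (R.T @ (X_back - t))
    u3, v3 = x_ref[:2] / x_ref[2]
    return float(np.hypot(u3 - u, v3 - v))

def multi_view_cost(e_matrix, weights):
    """e_matrix: 8 x m per-view composite costs e_j(p, pi_ni);
    weights: the m view weights omega(src_j) from S6.  Returns the
    eight aggregated costs e(p, pi_ni)."""
    w = np.asarray(weights, dtype=float)
    w = w / max(w.sum(), 1e-12)     # normalise so the cost keeps its scale
    return e_matrix @ w
```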
6. The three-dimensional reconstruction method based on the cross-correlation function and the normal vector loss of claim 5, wherein said S8 comprises: if the smallest value of $e(p, \pi_{n_i})$ is less than $e(p, \pi_p)$, and the depth of the point corresponding to that $\pi_{n_i}$ lies within the interval $(d_{min}, d_{max})$ between the minimum and maximum depths, then $\pi_{n_i}$ replaces $\pi_p$, namely:

$$\pi_p = \mathop{\arg\min}_{\pi_{n_i}} e(p, \pi_{n_i})$$
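A minimal sketch of the propagation rule of S8, assuming the eight neighbour hypotheses, their aggregated costs $e(p, \pi_{n_i})$, and the depths they induce at $p$ are already available:

```python
import numpy as np

def propagate(pi_p, cost_p, candidates, cand_costs, cand_depths,
              d_min, d_max):
    """Adopt the cheapest neighbour hypothesis if it beats the current
    one and its induced depth lies inside (d_min, d_max)."""
    i = int(np.argmin(cand_costs))
    if cand_costs[i] < cost_p and d_min < cand_depths[i] < d_max:
        return candidates[i]      # adopt the better neighbour hypothesis
    return pi_p                   # otherwise keep the current hypothesis
```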
7. The three-dimensional reconstruction method based on the cross-correlation function and the normal vector loss of claim 6, wherein said S9 comprises: perturb the plane hypothesis $\pi_p = (n_x, n_y, n_z, d)$ and compute the composite cost of step S7.4 for the perturbed plane hypothesis $\pi_{p'}$; if the composite cost of $\pi_{p'}$ is less than the composite cost of the plane hypothesis $\pi_p$, then $\pi_{p'}$ replaces $\pi_p$.
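A sketch of the perturbation step of S9; the perturbation magnitudes `scale_n` and `scale_d` are assumptions, since the claim does not fix them:

```python
import numpy as np

def perturb(pi_p, cost_fn, scale_n=0.1, scale_d=0.05, rng=None):
    """pi_p = (nx, ny, nz, d); cost_fn evaluates the composite cost of
    S7.4 for a hypothesis.  Keep the perturbed plane only if cheaper."""
    rng = np.random.default_rng() if rng is None else rng
    n = np.asarray(pi_p[:3], dtype=float) + scale_n * rng.standard_normal(3)
    n /= np.linalg.norm(n)                    # keep the normal unit length
    cand = np.array([*n, pi_p[3] + scale_d * rng.standard_normal()])
    return cand if cost_fn(cand) < cost_fn(pi_p) else pi_p
```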
8. A three-dimensional reconstruction device based on the cross-correlation function and normal vector loss, characterized by comprising the following modules:
the data preprocessing module is used for down-sampling the original images and scaling the camera parameters accordingly;
the initialization plane hypothesis module is used for adjusting the images to the smallest scale and then randomly initializing the plane hypothesis corresponding to each pixel point;
the photometric consistency cost calculation module is used for calculating the photometric consistency costs between the multi-view images according to the plane hypotheses;
the pixel-level view selection module is used for calculating the weights of the different views according to the photometric consistency costs;
the hypothesis propagation module is used for selecting, among the candidate plane hypotheses of the eight neighborhoods, the plane hypothesis with the smallest composite cost;
the plane perturbation module is used for attempting to find a better plane hypothesis through random perturbation;
and the upsampling module is used for upsampling the low-scale plane hypotheses through a joint bilateral upsampling algorithm.
CN202210606204.4A 2022-05-31 2022-05-31 Three-dimensional reconstruction method and device based on cross-correlation function and normal vector loss Pending CN114943776A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210606204.4A CN114943776A (en) 2022-05-31 2022-05-31 Three-dimensional reconstruction method and device based on cross-correlation function and normal vector loss

Publications (1)

Publication Number Publication Date
CN114943776A true CN114943776A (en) 2022-08-26

Family

ID=82909838

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210606204.4A Pending CN114943776A (en) 2022-05-31 2022-05-31 Three-dimensional reconstruction method and device based on cross-correlation function and normal vector loss

Country Status (1)

Country Link
CN (1) CN114943776A (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100315412A1 (en) * 2009-06-15 2010-12-16 Microsoft Corporation Piecewise planar reconstruction of three-dimensional scenes
CN112734915A (en) * 2021-01-19 2021-04-30 北京工业大学 Multi-view stereoscopic vision three-dimensional scene reconstruction method based on deep learning
CN113160390A (en) * 2021-04-28 2021-07-23 北京理工大学 Three-dimensional dense reconstruction method and system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ZHANG PING; WANG SHANDONG; HUANG JINPING; ZHOU MINGMING: "Research on Building Point Cloud Reconstruction Methods Based on SFM and CMVS/PMVS", Journal of Suzhou University of Science and Technology (Natural Science Edition), no. 03, 15 September 2015 (2015-09-15) *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117092206A (en) * 2023-08-09 2023-11-21 国网四川省电力公司电力科学研究院 Defect detection method for cable lead sealing area, computer equipment and storage medium
CN117456114A (en) * 2023-12-26 2024-01-26 北京智汇云舟科技有限公司 Multi-view-based three-dimensional image reconstruction method and system
CN117456114B (en) * 2023-12-26 2024-04-30 北京智汇云舟科技有限公司 Multi-view-based three-dimensional image reconstruction method and system

Similar Documents

Publication Publication Date Title
CN110738697B (en) Monocular depth estimation method based on deep learning
CN106780590B (en) Method and system for acquiring depth map
CN106910242B (en) Method and system for carrying out indoor complete scene three-dimensional reconstruction based on depth camera
Long et al. Adaptive surface normal constraint for depth estimation
CN108932536B (en) Face posture reconstruction method based on deep neural network
CN114943776A (en) Three-dimensional reconstruction method and device based on cross-correlation function and normal vector loss
US20110176722A1 (en) System and method of processing stereo images
CN110517306B (en) Binocular depth vision estimation method and system based on deep learning
CN105205858A (en) Indoor scene three-dimensional reconstruction method based on single depth vision sensor
CN103106688A (en) Indoor three-dimensional scene rebuilding method based on double-layer rectification method
CN112132958A (en) Underwater environment three-dimensional reconstruction method based on binocular vision
CN110910437B (en) Depth prediction method for complex indoor scene
CN110197505B (en) Remote sensing image binocular stereo matching method based on depth network and semantic information
WO2018133119A1 (en) Method and system for three-dimensional reconstruction of complete indoor scene based on depth camera
CN116129037B (en) Visual touch sensor, three-dimensional reconstruction method, system, equipment and storage medium thereof
CN114429555A (en) Image density matching method, system, equipment and storage medium from coarse to fine
CN107679542B (en) Double-camera stereoscopic vision identification method and system
CN110910456A (en) Stereo camera dynamic calibration algorithm based on Harris angular point mutual information matching
CN116310131A (en) Three-dimensional reconstruction method considering multi-view fusion strategy
CN116958434A (en) Multi-view three-dimensional reconstruction method, measurement method and system
CN115471749A (en) Multi-view multi-scale target identification method and system for extraterrestrial detection unsupervised learning
CN111951339A (en) Image processing method for performing parallax calculation by using heterogeneous binocular cameras
CN113781311A (en) Image super-resolution reconstruction method based on generation countermeasure network
CN117830520A (en) Multi-view three-dimensional reconstruction method based on depth residual error and neural implicit surface learning
US20230177771A1 (en) Method for performing volumetric reconstruction

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination