CN115619951A - Dense synchronous positioning and mapping method based on voxel neural implicit surface - Google Patents

Dense synchronous positioning and mapping method based on voxel neural implicit surface

Info

Publication number
CN115619951A
Authority
CN
China
Prior art keywords
voxel
ray
dimensional
dimensional point
depth
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211263616.9A
Other languages
Chinese (zh)
Inventor
章国锋
杨兴锐
李海
翟宏佳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU filed Critical Zhejiang University ZJU
Priority to CN202211263616.9A priority Critical patent/CN115619951A/en
Publication of CN115619951A publication Critical patent/CN115619951A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T17/00Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G06T17/05Geographic models
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T1/00General purpose image data processing
    • G06T1/0007Image acquisition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T15/003D [Three Dimensional] image rendering
    • G06T15/04Texture mapping
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T15/003D [Three Dimensional] image rendering
    • G06T15/10Geometric effects
    • G06T15/20Perspective computation
    • G06T15/205Image-based rendering
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T17/00Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G06T17/005Tree description, e.g. octree, quadtree
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00Geometric image transformations in the plane of the image
    • G06T3/40Scaling of whole images or parts thereof, e.g. expanding or contracting
    • G06T3/4023Scaling of whole images or parts thereof, e.g. expanding or contracting based on decimating pixels or lines of pixels; based on inserting pixels or lines of pixels
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/70Determining position or orientation of objects or cameras
    • G06T7/73Determining position or orientation of objects or cameras using feature-based methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Geometry (AREA)
  • Software Systems (AREA)
  • Computer Graphics (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Remote Sensing (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Generation (AREA)

Abstract

The invention discloses a dense synchronous positioning and mapping method based on a voxel neural implicit surface. The invention decomposes a three-dimensional scene into geometric units of voxel blocks, stores the geometric and texture information inside each voxel block in the form of feature vectors, obtains the features of a three-dimensional point by interpolation, and obtains its Signed Distance Field (SDF) value and corresponding color through a geometric analysis network and a texture analysis network. On this basis, the invention alternately and iteratively optimizes two processes, localization and mapping, and passes the latent map feature vectors between the two processes through variable sharing; the invention further introduces an octree based on Morton coding to improve the efficiency of map updating. By interactively editing the generated voxel blocks, the invention can render the edited surface and texture, and can therefore be applied to applications such as virtual reality and augmented reality.

Description

Dense synchronous positioning and mapping method based on voxel neural implicit surface
Technical Field
The invention relates to the field of computer vision and computer graphics, in particular to a dense positioning and mapping method based on a voxel neural implicit surface.
Background
Dense simultaneous localization and mapping (DSLAM) is the basis of many three-dimensional applications. Based on an accurately reconstructed three-dimensional map, interactive effects such as occlusion and collision can be produced in virtual-real fusion scenes, achieving a more realistic result in augmented reality applications.
Traditional DSLAM methods usually solve the camera pose and optimize the map structure through feature matching and by minimizing an energy function. These methods typically represent the dense map with discrete point clouds, surfels, or a continuous Signed Distance Field (SDF), but their limitations are obvious.
Methods based on deep features, such as CodeSLAM and DI-Fusion, store local scene information in compressed codes and optimize the coded fields through multi-view constraints, thereby updating the map.
With the rise of neural radiance fields (NeRF), a new trend is to store the scene information in an MLP network and generate photorealistic renderings from each viewpoint. For example, the iMAP method has built a DSLAM system based on a neural implicit field with this idea. However, such a system stores the whole scene in a single MLP and requires prior information about the scene size, so these methods cannot model unknown scenes; moreover, because the scene is stored implicitly in the MLP, further operations such as scene editing become very difficult.
Disclosure of Invention
In order to solve the problems in the prior art, the invention provides a dense synchronous positioning and mapping method based on a voxel neural implicit surface. When the system starts up, the global map is initialized by running a number of mapping iterations on the first frame. The system receives a sequence of RGB-D images as input and creates voxels only for regions with depth information, optimizing the surface and texture information inside the voxels. The front-end tracking process aligns the current frame to the surface and texture of the existing map and gradually optimizes the camera pose; the back-end mapping process jointly optimizes the frames with estimated poses and the existing map, and updates the map.
In order to achieve the above purpose, the invention adopts the following technical solution:
The invention first provides a dense synchronous positioning and mapping method based on a voxel neural implicit surface, which comprises the following steps:
Step 1: acquiring an RGB-D image of the first frame, and back-projecting the depth corresponding to each pixel of the first-frame image into three-dimensional space, thereby obtaining an initial three-dimensional point cloud of the map; setting the coordinate system of the initial three-dimensional point cloud as the reference coordinate system, and constructing, based on the initial three-dimensional point cloud, a number of non-overlapping voxel blocks aligned with the coordinate axes of the reference coordinate system; constructing an octree structure based on the voxel blocks, and inserting the Morton codes corresponding to the voxel blocks into the octree; meanwhile, allocating fixed-length feature vectors to the 8 vertices of each voxel block, the fixed-length feature vectors being used to store the geometric and texture information of the scene to be constructed;
Step 2: randomly sampling M pixels from the acquired image, generating rays passing through each sampled pixel from the camera center corresponding to the image, and computing the intersections of the rays with the constructed voxel blocks; sampling uniformly in the region where a ray intersects the voxel blocks to obtain sampled three-dimensional points; for each sampled three-dimensional point, obtaining the feature vectors of the 8 vertices of the voxel block containing it from its three-dimensional coordinates, and obtaining the feature vector corresponding to the three-dimensional point through a feature extraction function; obtaining a Signed Distance Field (SDF) value and intermediate information through a geometric analysis network, and obtaining a color from the intermediate information through a texture analysis network; computing a spatial density value of the three-dimensional point from the SDF, and accumulating the colors and depths of the three-dimensional points on the ray with weights in a volume rendering manner, finally obtaining the predicted color and depth of the pixel corresponding to the ray; comparing the predicted color and depth with the true color and depth, thereby optimizing the fixed-length feature vectors on the vertices of the voxel blocks as well as the geometric analysis network and the texture analysis network;
Step 3: after step 2 is finished, starting the tracking process, which is as follows: repeating step 2 for each image starting from the second frame, keeping the fixed-length feature vectors on the voxel block vertices, the geometric analysis network and the texture analysis network unchanged, and optimizing only the 6-degree-of-freedom (6-DoF) camera pose corresponding to the image; localization is completed after this optimization, and the optimized 6-DoF camera pose and the corresponding RGB-D image are assembled into a frame and placed into a candidate keyframe list;
Step 4: starting the mapping process, which is as follows: obtaining the candidate keyframe list from step 3, traversing the candidate keyframe list, and back-projecting the depth corresponding to the pixels of each frame's image into three-dimensional space according to the 6-DoF camera pose of that image, obtaining the three-dimensional point cloud corresponding to each frame; for each three-dimensional point in the point cloud, judging whether it is contained in an already created voxel block, and if not, creating a new voxel block and updating the octree structure of step 1, thereby dynamically creating voxel blocks and expanding the mapping area;
selecting a number of suitable frames from the keyframe list as keyframes and optimizing them together with the latest frame in the candidate keyframe list; repeating step 2 for the images of all frames to be optimized, and optimizing the 6-DoF poses of these frames while optimizing the fixed-length feature vectors on the voxel block vertices, the geometric analysis network and the texture analysis network.
Further, constructing, in step 1, a number of non-overlapping voxel blocks aligned with the coordinate axes of the reference coordinate system based on the initial three-dimensional point cloud is specifically:
The initial three-dimensional point cloud is divided by a set of voxel blocks {V_k}, each voxel block having three-dimensional coordinates V_k = (x, y, z); these three-dimensional coordinates are converted into 64-bit binary codes by Morton coding. Each voxel block has 8 vertices, and each vertex stores a fixed-length feature vector e ∈ ℝ^{L_e} representing the geometric and texture information of the scene to be constructed, where L_e is the length of the feature vector. Thus, any three-dimensional point p inside a voxel V_i is described by the 8 vertex feature vectors of that voxel, and neighboring voxel blocks share the feature vectors of their 4 common vertices.
As a preferred embodiment of the present invention, computing the intersections of the rays in step 2 with the voxel blocks constructed in step 1 is specifically:
A ray passing from the camera center o through a pixel on the image in direction d is defined as r(t) = o + dt, where t is the depth along the ray; for each ray, the depths of its intersection points with the voxel blocks are computed by a Ray-AABB intersection detection algorithm, thereby determining the segments of the ray that intersect the voxel blocks.
Further, in step 2, obtaining the feature vector corresponding to the three-dimensional point through a feature extraction function, obtaining a Signed Distance Field (SDF) value and intermediate information through the geometric analysis network, and obtaining a color from the intermediate information through the texture analysis network is specifically:
A feature extraction function E: ℝ³ → ℝ^{L_e} maps a three-dimensional point p to a feature vector e ∈ ℝ^{L_e} of length L_e. The feature extraction function is realized by trilinear interpolation: the feature vectors stored at the 8 vertices of the voxel block containing p are interpolated according to the three-dimensional coordinates of p and its relative position within that voxel block, yielding the feature vector e of p.
The geometric analysis network F_σ and the texture analysis network F_c are represented by multi-layer perceptron (MLP) networks. The geometric analysis network F_σ: ℝ^{L_e} → ℝ × ℝ^{L_f} generates from the feature vector e of p its signed distance σ and a geometric feature vector f ∈ ℝ^{L_f} of length L_f. The sign of σ indicates whether p is inside or outside the surface S; the surface S of the scene is extracted as the zero level set
S = { p ∈ ℝ³ | F_σ(E(p))[0] = 0 }
where the operation [0] means taking the signed distance σ at position p from the output of F_σ. The geometric feature vector f of the three-dimensional point p, the ray direction d, and the feature vector e of p are concatenated as the input of the texture analysis network F_c to obtain the color c at p.
Further, in step 2, accumulating the colors and depths of the three-dimensional points on the ray with weights in a volume rendering manner, finally obtaining the predicted color and depth of the pixel corresponding to the ray, is specifically:
A function φ_s(σ) of the signed distance σ of the three-dimensional point p, built from the Sigmoid function τ and a predefined truncation distance tr, is used to convert the SDF value of p into a density, so that points close to the surface receive larger weights than points far from it.
Based on φ_s(σ), the densities on the same ray are normalized, and volume rendering over the N_p three-dimensional sample points on the ray yields the accumulated color C(r) and depth D(r):
C(r) = Σ_{i=1}^{N_p} w_i c_i
D(r) = Σ_{i=1}^{N_p} w_i d_i
with normalized weights w_i = φ_s(σ_i) / Σ_{j=1}^{N_p} φ_s(σ_j), where c_i is the color of point i on the ray and d_i is the distance from point i on the ray to the optical center.
Compared with the prior art, the invention has the advantages that:
1) The invention uses a Morton-coded scene voxel structure to accelerate the indexing of voxel blocks, thereby speeding up localization and mapping.
2) The voxel neural implicit surface based method can construct a more complete surface structure with realistic colors, and supports dynamic voxel block creation and map expansion.
Drawings
FIG. 1 is a schematic diagram of the process of the present invention;
FIG. 2 is a diagram showing the reconstruction effect of the present invention.
Detailed Description
The invention is described in detail below with reference to the accompanying drawings. The technical features of the embodiments of the present invention can be combined correspondingly without mutual conflict.
Referring to FIG. 1, the invention uses two processes, front-end tracking and back-end mapping; the information about the scene is stored in a data area shared by the front end and the back end and is dynamically updated as the system runs. The front-end tracking process aligns the current frame to the surface and texture of the existing map and gradually optimizes the camera pose, while the back-end mapping process jointly optimizes the frames with estimated poses and the existing map and updates the map. The invention is described in detail below. The dense synchronous positioning and mapping method based on the voxel neural implicit surface comprises the following steps:
Step 1: An RGB-D image of the first frame is acquired, and the depth corresponding to each pixel of the first-frame image is back-projected into three-dimensional space, obtaining the initial three-dimensional point cloud of the map. The coordinate system of the initial three-dimensional point cloud is set as the reference coordinate system, and the point cloud is divided into a number of non-overlapping voxel blocks aligned with the coordinate axes of the reference coordinate system. Specifically, each voxel block V_k has three-dimensional coordinates V_k = (x, y, z), which are converted into a 64-bit binary code by Morton coding. Each voxel block has 8 vertices, and each vertex stores a fixed-length feature vector e ∈ ℝ^{L_e} representing the geometric and texture information of the scene to be constructed, where L_e is the length of the feature vector; thus, any three-dimensional point p inside a voxel V_i is described by the 8 vertex feature vectors of that voxel, and neighboring voxel blocks share the feature vectors of their 4 common vertices. Based on these voxel blocks, an octree structure is constructed and the Morton code corresponding to each voxel block is inserted into the octree; meanwhile, fixed-length feature vectors are allocated to the 8 vertices of each voxel block to store the geometric and texture information of the scene to be constructed, as shown in the left part of FIG. 2.
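The following is a minimal sketch, for illustration only, of how integer voxel coordinates could be packed into 64-bit Morton codes and used to index voxel blocks; the bit width (21 bits per axis), the voxel size, the feature length and the VoxelMap container are assumptions, and a plain dictionary stands in for the octree described above.

```python
import numpy as np

def part1by2(v: int) -> int:
    """Spread the lower 21 bits of v so two zero bits sit between consecutive bits."""
    v &= (1 << 21) - 1
    v = (v | (v << 32)) & 0x1F00000000FFFF
    v = (v | (v << 16)) & 0x1F0000FF0000FF
    v = (v | (v << 8))  & 0x100F00F00F00F00F
    v = (v | (v << 4))  & 0x10C30C30C30C30C3
    v = (v | (v << 2))  & 0x1249249249249249
    return v

def morton_encode(x: int, y: int, z: int) -> int:
    """Interleave three 21-bit voxel indices into one 63-bit Morton code
    (negative indices would need an offset in practice)."""
    return part1by2(x) | (part1by2(y) << 1) | (part1by2(z) << 2)

class VoxelMap:
    """Voxel blocks keyed by Morton code; each block stores 8 vertex feature vectors."""
    def __init__(self, voxel_size: float = 0.2, feat_len: int = 16):
        self.voxel_size = voxel_size
        self.feat_len = feat_len
        self.blocks = {}                                  # Morton code -> (8, L_e) array

    def allocate_from_points(self, points: np.ndarray) -> int:
        """Create a voxel block for every back-projected point not covered yet."""
        created = 0
        idx = np.floor(points / self.voxel_size).astype(np.int64)
        for x, y, z in np.unique(idx, axis=0):
            key = morton_encode(int(x), int(y), int(z))
            if key not in self.blocks:
                self.blocks[key] = np.zeros((8, self.feat_len), np.float32)
                created += 1
        return created

# Usage: the back-projected first-frame point cloud initializes the map.
points = np.random.rand(1000, 3) * 2.0                    # stand-in for real depth back-projection
vmap = VoxelMap()
print(vmap.allocate_from_points(points), "voxel blocks created")
```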
Step 2: As shown in the volume rendering part of FIG. 1, M pixels are randomly sampled from the image, a ray passing through each sampled pixel from the camera center corresponding to the image is generated, and the intersections of the rays with the voxel blocks constructed in step 1 are computed. Specifically, a ray passing from the camera center o through a pixel on the image in direction d is defined as r(t) = o + dt, where t is the depth along the ray; for each ray, the depths of its intersection points with the voxel blocks are computed by a Ray-AABB intersection detection algorithm, thereby determining the segments of the ray that intersect the voxel blocks.
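The following is a minimal sketch of the Ray-AABB (slab) intersection test mentioned above, operating on a single axis-aligned voxel box; the handling of rays parallel to an axis and the uniform sampling of the intersected interval are simplified illustrations, not the exact implementation.

```python
import numpy as np

def ray_aabb(o: np.ndarray, d: np.ndarray, box_min: np.ndarray, box_max: np.ndarray):
    """Return (t_near, t_far) of the ray r(t) = o + d * t against one axis-aligned box,
    or None if the ray misses the box."""
    inv_d = 1.0 / np.where(np.abs(d) < 1e-12, 1e-12, d)   # avoid division by zero
    t0 = (box_min - o) * inv_d
    t1 = (box_max - o) * inv_d
    t_near = np.max(np.minimum(t0, t1))
    t_far = np.min(np.maximum(t0, t1))
    if t_far < max(t_near, 0.0):
        return None
    return max(t_near, 0.0), t_far

# Usage: the intersected depth interval is then sampled uniformly (step 2).
o = np.zeros(3)
d = np.array([0.0, 0.0, 1.0])
hit = ray_aabb(o, d, np.array([-0.1, -0.1, 1.0]), np.array([0.1, 0.1, 1.2]))
if hit is not None:
    t_near, t_far = hit
    samples = o + d * np.linspace(t_near, t_far, 8)[:, None]   # uniform samples inside the voxel
```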
A three-dimensional point p is sampled with uniform probability in the region where the ray intersects the voxel blocks, and its feature vector e is obtained through the feature extraction function. Specifically, a feature extraction function E: ℝ³ → ℝ^{L_e} is defined that maps the three-dimensional point p to a feature vector e ∈ ℝ^{L_e} of length L_e. The feature extraction function is realized by trilinear interpolation: the feature vectors stored at the 8 vertices of the voxel block containing p are interpolated according to the three-dimensional coordinates of p and its relative position within that voxel block, yielding the feature vector of p.
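The feature extraction function can be illustrated with the following minimal trilinear interpolation sketch; the vertex ordering and the (8, L_e) feature layout are assumptions carried over from the VoxelMap sketch above, not the exact data layout of the invention.

```python
import numpy as np

# Vertex offsets of a unit cube, one row per vertex, ordered as (x, y, z) bits.
CUBE_OFFSETS = np.array([[i, j, k] for i in (0, 1) for j in (0, 1) for k in (0, 1)], np.float32)

def trilinear_features(p: np.ndarray, voxel_origin: np.ndarray, voxel_size: float,
                       vertex_feats: np.ndarray) -> np.ndarray:
    """Interpolate the (8, L_e) vertex feature vectors at point p inside one voxel block."""
    u = (p - voxel_origin) / voxel_size                        # relative position in [0, 1]^3
    u = np.clip(u, 0.0, 1.0)
    # Weight of each vertex: product over axes of u (offset bit 1) or 1 - u (offset bit 0).
    w = np.prod(np.where(CUBE_OFFSETS > 0.5, u, 1.0 - u), axis=1)   # (8,)
    return w @ vertex_feats                                    # (L_e,) feature vector e of p

# Usage: e then feeds the geometric analysis network F_sigma.
feats = np.random.randn(8, 16).astype(np.float32)
e = trilinear_features(np.array([0.05, 0.1, 0.02]), np.zeros(3), 0.2, feats)
```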
A Signed Distance Field (SDF) value and intermediate information are obtained through the geometric analysis network, and the intermediate information is then passed through the texture analysis network to obtain a color. A spatial density value of the three-dimensional point is computed from the SDF, and the colors and depths of the three-dimensional points on the ray are accumulated with weights in a volume rendering manner, finally yielding the predicted color and depth of the pixel corresponding to the ray. Specifically, a function φ_s(σ) of the signed distance σ of the three-dimensional point p, built from the Sigmoid function τ and a predefined truncation distance tr, is used to convert the SDF value of p into a density, so that points close to the surface receive larger weights than points far from it.
Based on φ_s(σ), the densities on the same ray are normalized, and volume rendering over the N_p three-dimensional sample points on the ray yields the accumulated color C(r) and depth D(r):
C(r) = Σ_{i=1}^{N_p} w_i c_i
D(r) = Σ_{i=1}^{N_p} w_i d_i
with normalized weights w_i = φ_s(σ_i) / Σ_{j=1}^{N_p} φ_s(σ_j), where c_i is the color of point i on the ray and d_i is the distance from point i on the ray to the optical center.
The predicted color and depth are compared with the true color and depth, thereby optimizing the fixed-length feature vectors on the vertices of the voxel blocks as well as the geometric analysis network and the texture analysis network.
Step 3: As shown in the tracking process of FIG. 1, this process repeats the procedure of step 2 for each image starting from the second frame, but keeps the fixed-length feature vectors on the voxel block vertices, the geometric analysis network and the texture analysis network unchanged, and optimizes only the 6-DoF camera pose corresponding to the image; the optimized 6-DoF camera pose and the corresponding RGB-D image are assembled into a frame and placed into the candidate keyframe list, as shown in the shared data area of FIG. 1.
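The tracking step can be sketched as follows: the map parameters stay frozen and only the 6-DoF camera pose is optimized against the rendering loss. The pose parametrization, the Adam optimizer, the loss weights and the render_fn interface are assumptions for illustration, not the exact formulation of the invention.

```python
import torch

def track_frame(render_fn, rays, gt_color, gt_depth, init_pose6,
                iters: int = 20, lambda_d: float = 0.5, lr: float = 1e-3):
    """render_fn(pose6, rays) must be the differentiable volume renderer of step 2,
    returning per-ray predicted colors (N, 3) and depths (N,)."""
    pose6 = torch.nn.Parameter(init_pose6.clone())      # 6-DoF pose (e.g. axis-angle + translation)
    opt = torch.optim.Adam([pose6], lr=lr)               # map features / MLPs are not in this optimizer
    for _ in range(iters):
        pred_c, pred_d = render_fn(pose6, rays)
        valid = gt_depth > 0                              # supervise depth only where it is observed
        loss = (pred_c - gt_color).abs().mean() \
             + lambda_d * (pred_d - gt_depth).abs()[valid].mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return pose6.detach()
```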
Step 4: As shown in FIG. 1, this process selects a number of suitable frames from the candidate keyframe list of step 3 as keyframes, constructs new voxel blocks, and optimizes them together with the latest frame in the candidate keyframe list. The procedure of step 2 is repeated for the images of all frames to be optimized, and the 6-DoF poses of these frames are optimized while optimizing the fixed-length feature vectors on the voxel block vertices, the geometric analysis network and the texture analysis network; the map gradually grows during the mapping process until the scene to be reconstructed is covered by voxel blocks.
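The back-end mapping step can be sketched as follows: the depth map of each candidate keyframe is back-projected with its estimated pose, new voxel blocks are allocated for uncovered points (see the VoxelMap sketch in step 1), and a set of keyframes is then selected for the joint optimization of step 2. The function names, the intrinsics and pose conventions, and the keyframe selection rule (simply the most recent frames) are assumptions for illustration.

```python
import numpy as np

def back_project(depth: np.ndarray, K: np.ndarray, T_wc: np.ndarray) -> np.ndarray:
    """Lift a depth image to world-frame 3-D points using intrinsics K and camera-to-world pose T_wc."""
    h, w = depth.shape
    v, u = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    valid = depth > 0
    z = depth[valid]
    x = (u[valid] - K[0, 2]) * z / K[0, 0]
    y = (v[valid] - K[1, 2]) * z / K[1, 1]
    pts_c = np.stack([x, y, z], axis=1)
    return pts_c @ T_wc[:3, :3].T + T_wc[:3, 3]

def mapping_step(vmap, candidate_keyframes, num_keyframes: int = 4):
    # 1) Expand the map: allocate voxel blocks for points not yet covered.
    for kf in candidate_keyframes:                       # kf assumed to hold "depth", "K", "T_wc"
        pts_w = back_project(kf["depth"], kf["K"], kf["T_wc"])
        vmap.allocate_from_points(pts_w)
    # 2) Select keyframes and jointly optimize their poses, the vertex features and the MLPs
    #    by repeating the ray sampling / rendering / loss of step 2 (omitted here).
    selected = candidate_keyframes[-num_keyframes:]
    return selected
```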
Examples
Compared with existing methods, the invention clearly improves reconstruction accuracy, completeness and camera pose estimation accuracy, while running faster and using less memory. The invention was also tested for generalization in outdoor scenes and obtains good reconstruction results.
The following table shows the localization performance of the invention on the Replica dataset; it compares trajectory accuracy metrics (RMSE, Mean) over 8 scenes, where smaller values indicate higher accuracy. Compared with two existing methods (iMAP and NICE-SLAM), the proposed method is superior on these metrics, showing that the effect of the invention is better.
[Table: camera trajectory accuracy (RMSE, Mean) on 8 Replica scenes, comparing the proposed method with iMAP and NICE-SLAM]
The reconstruction performance of the invention on the Replica dataset is shown in the following table, which compares the reconstruction accuracy and completeness (Acc, Comp, Comp. Ratio) over 8 scenes; smaller values of the first two metrics indicate higher accuracy and completeness, while a larger value of the last metric indicates higher completeness. Compared with the two existing methods, the proposed method is superior on all three metrics, showing that the effect of the invention is better.
[Table: reconstruction accuracy and completeness (Acc, Comp, Comp. Ratio) on 8 Replica scenes, comparing the proposed method with iMAP and NICE-SLAM]
The method can be used for localization and mapping in indoor and outdoor environments, for editing reconstructed scenes, and for virtual-real fusion in augmented reality. The foregoing merely illustrates specific embodiments of the invention. Obviously, the invention is not limited to the above embodiments, and many variations are possible. All modifications that a person skilled in the art can derive or deduce from the disclosure of the present invention are to be considered within the scope of the invention.

Claims (5)

1. A dense synchronous positioning and mapping method based on a voxel neural implicit surface, characterized by comprising the following steps:
step 1: acquiring an RGB-D image of the first frame, and back-projecting the depth corresponding to each pixel of the first-frame image into three-dimensional space, thereby obtaining an initial three-dimensional point cloud of the map; setting the coordinate system of the initial three-dimensional point cloud as the reference coordinate system, and constructing, based on the initial three-dimensional point cloud, a number of non-overlapping voxel blocks aligned with the coordinate axes of the reference coordinate system; constructing an octree structure based on the voxel blocks, and inserting the Morton codes corresponding to the voxel blocks into the octree; meanwhile, allocating fixed-length feature vectors to the 8 vertices of each voxel block, the fixed-length feature vectors being used to store the geometric and texture information of the scene to be constructed;
step 2: randomly sampling M pixels from the acquired image, generating rays passing through each sampled pixel from the camera center corresponding to the image, and computing the intersections of the rays with the constructed voxel blocks; sampling uniformly in the region where a ray intersects the voxel blocks to obtain sampled three-dimensional points; for each sampled three-dimensional point, obtaining the feature vectors of the 8 vertices of the voxel block containing it from its three-dimensional coordinates, and obtaining the feature vector corresponding to the three-dimensional point through a feature extraction function; obtaining a Signed Distance Field (SDF) value and intermediate information through a geometric analysis network, and obtaining a color from the intermediate information through a texture analysis network; computing a spatial density value of the three-dimensional point from the SDF, and accumulating the colors and depths of the three-dimensional points on the ray with weights in a volume rendering manner, finally obtaining the predicted color and depth of the pixel corresponding to the ray; comparing the predicted color and depth with the true color and depth, thereby optimizing the fixed-length feature vectors on the vertices of the voxel blocks as well as the geometric analysis network and the texture analysis network;
step 3: after step 2 is completed, starting the tracking process, which is as follows: repeating step 2 for each image starting from the second frame, keeping the fixed-length feature vectors on the voxel block vertices, the geometric analysis network and the texture analysis network unchanged, and optimizing only the 6-degree-of-freedom (6-DoF) camera pose corresponding to the image; localization is completed after this optimization, and the optimized 6-DoF camera pose and the corresponding RGB-D image are assembled into a frame and placed into a candidate keyframe list;
step 4: starting the mapping process, which is as follows: obtaining the candidate keyframe list from step 3, traversing the candidate keyframe list, and back-projecting the depth corresponding to the pixels of each frame's image into three-dimensional space according to the 6-DoF camera pose of that image, obtaining the three-dimensional point cloud corresponding to each frame; for each three-dimensional point in the point cloud, judging whether it is contained in an already created voxel block, and if not, creating a new voxel block and updating the octree structure of step 1, thereby dynamically creating voxel blocks and expanding the mapping area;
selecting a number of suitable frames from the keyframe list as keyframes and optimizing them together with the latest frame in the candidate keyframe list; and repeating step 2 for the images of all frames to be optimized, and optimizing the 6-DoF poses of these frames while optimizing the fixed-length feature vectors on the voxel block vertices, the geometric analysis network and the texture analysis network.
2. The dense synchronous positioning and mapping method based on a voxel neural implicit surface according to claim 1, wherein constructing, in step 1, a number of non-overlapping voxel blocks aligned with the coordinate axes of the reference coordinate system based on the initial three-dimensional point cloud is specifically:
the initial three-dimensional point cloud is divided by a set of voxel blocks {V_k}, each voxel block having three-dimensional coordinates V_k = (x, y, z); the three-dimensional coordinates are converted into 64-bit binary codes by Morton coding; each voxel block has 8 vertices, and each vertex stores a fixed-length feature vector e ∈ ℝ^{L_e} representing the geometric and texture information of the scene to be constructed, where L_e is the length of the feature vector; thus, any three-dimensional point p inside a voxel V_i is described by the 8 vertex feature vectors of that voxel, and neighboring voxel blocks share the feature vectors of 4 vertices.
3. The dense synchronous positioning and mapping method based on a voxel neural implicit surface according to claim 1, wherein computing, in step 2, the intersections of the rays with the voxel blocks constructed in step 1 is specifically:
a ray passing from the camera center o through a pixel on the image in direction d is defined as r(t) = o + dt, where t is the depth along the ray; for each ray, the depths of its intersection points with the voxel blocks are computed by a Ray-AABB intersection detection algorithm, thereby determining the segments of the ray that intersect the voxel blocks.
4. The dense synchronous positioning and mapping method based on a voxel neural implicit surface according to claim 1, wherein obtaining, in step 2, the feature vector corresponding to the three-dimensional point through a feature extraction function, obtaining a Signed Distance Field (SDF) value and intermediate information through the geometric analysis network, and obtaining a color from the intermediate information through the texture analysis network is specifically:
a feature extraction function E: ℝ³ → ℝ^{L_e} maps a three-dimensional point p to a feature vector e ∈ ℝ^{L_e} of length L_e; the feature extraction function is realized by trilinear interpolation, the feature vectors stored at the 8 vertices of the voxel block containing p being interpolated according to the three-dimensional coordinates of p and its relative position within that voxel block, yielding the feature vector e of p;
the geometric analysis network F_σ and the texture analysis network F_c are represented by multi-layer perceptron (MLP) networks; the geometric analysis network F_σ: ℝ^{L_e} → ℝ × ℝ^{L_f} generates from the feature vector e of p its signed distance σ and a geometric feature vector f ∈ ℝ^{L_f} of length L_f; the sign of σ indicates whether p is inside or outside the surface S; the surface S of the scene is extracted as the zero level set
S = { p ∈ ℝ³ | F_σ(E(p))[0] = 0 }
where the operation [0] means taking the signed distance σ at position p from the output of F_σ; the geometric feature vector f of the three-dimensional point p, the ray direction d, and the feature vector e of p are concatenated as the input of the texture analysis network F_c to obtain the color c at p.
5. The dense synchronous positioning and mapping method based on a voxel neural implicit surface according to claim 1, wherein accumulating, in step 2, the colors and depths of the three-dimensional points on the ray with weights in a volume rendering manner, finally obtaining the predicted color and depth of the pixel corresponding to the ray, is specifically:
a function φ_s(σ) of the signed distance σ of the three-dimensional point p, built from the Sigmoid function τ and a predefined truncation distance tr, is used to convert the SDF value of p into a density, so that points close to the surface receive larger weights than points far from it;
based on φ_s(σ), the densities on the same ray are normalized, and volume rendering over the N_p three-dimensional sample points on the ray yields the accumulated color C(r) and depth D(r):
C(r) = Σ_{i=1}^{N_p} w_i c_i
D(r) = Σ_{i=1}^{N_p} w_i d_i
with normalized weights w_i = φ_s(σ_i) / Σ_{j=1}^{N_p} φ_s(σ_j), where c_i is the color of point i on the ray and d_i is the distance from point i on the ray to the optical center.
CN202211263616.9A 2022-10-16 2022-10-16 Dense synchronous positioning and mapping method based on voxel neural implicit surface Pending CN115619951A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211263616.9A CN115619951A (en) 2022-10-16 2022-10-16 Dense synchronous positioning and mapping method based on voxel neural implicit surface

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211263616.9A CN115619951A (en) 2022-10-16 2022-10-16 Dense synchronous positioning and mapping method based on voxel neural implicit surface

Publications (1)

Publication Number Publication Date
CN115619951A true CN115619951A (en) 2023-01-17

Family

ID=84862451

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211263616.9A Pending CN115619951A (en) 2022-10-16 2022-10-16 Dense synchronous positioning and mapping method based on voxel neural implicit surface

Country Status (1)

Country Link
CN (1) CN115619951A (en)


Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116468767A (en) * 2023-03-28 2023-07-21 南京航空航天大学 Airplane surface reconstruction method based on local geometric features and implicit distance field
CN116468767B (en) * 2023-03-28 2023-10-13 南京航空航天大学 Airplane surface reconstruction method based on local geometric features and implicit distance field
CN117036639A (en) * 2023-08-21 2023-11-10 北京大学 Multi-view geometric scene establishment method and device oriented to limited space
CN117036639B (en) * 2023-08-21 2024-04-30 北京大学 Multi-view geometric scene establishment method and device oriented to limited space
CN117893693A (en) * 2024-03-15 2024-04-16 南昌航空大学 Dense SLAM three-dimensional scene reconstruction method and device
CN117893693B (en) * 2024-03-15 2024-05-28 南昌航空大学 Dense SLAM three-dimensional scene reconstruction method and device
CN118212372A (en) * 2024-05-21 2024-06-18 成都信息工程大学 Mapping method for fusing implicit surface characterization and volume rendering of nerve
CN118212372B (en) * 2024-05-21 2024-07-23 成都信息工程大学 Mapping method for fusing implicit surface characterization and volume rendering of nerve

Similar Documents

Publication Publication Date Title
Zhang et al. Nerfusion: Fusing radiance fields for large-scale scene reconstruction
CN108921926B (en) End-to-end three-dimensional face reconstruction method based on single image
CN110458939B (en) Indoor scene modeling method based on visual angle generation
CN112085844B (en) Unmanned aerial vehicle image rapid three-dimensional reconstruction method for field unknown environment
CN115619951A (en) Dense synchronous positioning and mapping method based on voxel neural implicit surface
CN110853075B (en) Visual tracking positioning method based on dense point cloud and synthetic view
CN110009674B (en) Monocular image depth of field real-time calculation method based on unsupervised depth learning
KR20000068660A (en) Method of reconstruction of tridimensional scenes and corresponding reconstruction device and decoding system
CN103559737A (en) Object panorama modeling method
CN113822993B (en) Digital twinning method and system based on 3D model matching
CN112927359A (en) Three-dimensional point cloud completion method based on deep learning and voxels
GB2573170A (en) 3D Skeleton reconstruction from images using matching 2D skeletons
CN116543117B (en) High-precision large-scene three-dimensional modeling method for unmanned aerial vehicle images
US20220139036A1 (en) Deferred neural rendering for view extrapolation
CN109191554A (en) A kind of super resolution image reconstruction method, device, terminal and storage medium
CN113160420A (en) Three-dimensional point cloud reconstruction method and device, electronic equipment and storage medium
CN114627237B (en) Front-view image generation method based on live-action three-dimensional model
CN113962858A (en) Multi-view depth acquisition method
CN114359509A (en) Multi-view natural scene reconstruction method based on deep learning
CN115170741A (en) Rapid radiation field reconstruction method under sparse visual angle input
Jiang et al. H₂-Mapping: Real-time Dense Mapping Using Hierarchical Hybrid Representation
Gadasin et al. A Model for Representing the Color and Depth Metric Characteristics of Objects in an Image
CN113034681B (en) Three-dimensional reconstruction method and device for spatial plane relation constraint
Hara et al. Enhancement of novel view synthesis using omnidirectional image completion
CN113920270B (en) Layout reconstruction method and system based on multi-view panorama

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination