CN115953551A - Sparse grid radiation field representation method based on point cloud initialization and depth supervision - Google Patents

Sparse grid radiation field representation method based on point cloud initialization and depth supervision

Info

Publication number
CN115953551A
Authority
CN
China
Prior art keywords
grid
voxel
radiation field
sparse
depth
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211653775.XA
Other languages
Chinese (zh)
Inventor
Wang Yue (王越)
Zhang Qunkang (张群康)
Xiong Rong (熊蓉)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU filed Critical Zhejiang University ZJU
Priority to CN202211653775.XA
Publication of CN115953551A
Pending legal-status Critical Current

Landscapes

  • Image Generation (AREA)

Abstract

The invention discloses a sparse grid radiation field representation method based on point cloud initialization and depth supervision; the resulting radiation field can be used for novel-view rendering. The method comprises obtaining at least two camera color images, obtaining two camera depth images, converting the depth images into a three-dimensional point cloud, and generating a ray for each pixel from the camera parameters (intrinsics and extrinsics) of all acquired camera images to obtain the ray set of all pixels, among other steps. Because most of a real scene is blank space, the scene is represented with a sparse grid, reducing memory cost; the completed three-dimensional reconstruction serves both as a geometric constraint that strengthens geometry learning and as initialization guidance that reduces the parameter count in the grid-generation stage, improving optimization efficiency and lowering storage cost.

Description

Sparse grid radiation field representation method based on point cloud initialization and depth supervision
Technical Field
The invention relates to techniques for representing a scene radiation field, and in particular to a sparse grid radiation field representation method based on point cloud initialization and depth supervision.
Background
Perception of and interaction with three-dimensional space have long been research hotspots. Tasks such as robot localization, mapping, and navigation, as well as XR tasks such as AR, VR, and MR, all require sensing and interacting with the environment. XR in particular has become a market focus in recent years: on November 1, 2022, five departments including the Ministry of Industry and Information Technology jointly issued the "Virtual Reality and Industry Application Integration Development Action Plan (2022-2026)", which proposes that by 2026 the overall scale of China's virtual reality industry should exceed 350 billion yuan and that sales of virtual reality terminals should exceed 25 million units. XR stands for Extended Reality, an umbrella term covering Virtual Reality (VR), Augmented Reality (AR), and Mixed Reality (MR). XR creates a virtual human-computer interaction environment through the combination of software and hardware. Realizing such an environment involves two core steps: environment perception and environment interaction. First, three-dimensional reconstruction of the environment is completed, i.e., active perception of the environment; then interaction is performed within the known environment, i.e., interactive rendering of the environment. Existing radiation fields have the following problems:
1. Neural radiance fields based on multilayer perceptrons usually require long training times and are inefficient;
2. Supervision currently uses only visual images, so correct geometry can be learned only from a large number of pictures;
3. Radiation fields based on dense voxel grids often require large amounts of memory. The invention provides an efficient radiation field representation method for three-dimensional scenes that overcomes these problems; it can be used for novel-view rendering and can help XR applications develop better.
Disclosure of Invention
In order to solve the problems of existing radiation fields, the invention addresses each of them through the following approaches. The core ideas of the invention are: 1. a voxel-based representation of the radiation field; 2. geometric initialization and grid sparsification; 3. a sampling scheme based on the sparse grid; 4. depth supervision (geometric constraints). The invention is realized by the following technical scheme:
the invention discloses a sparse grid radiation field representation method based on point cloud initialization and depth supervision, which comprises the following steps:
obtaining at least 2 or more camera color images;
obtaining 2 camera depth images, and generating the depth images into three-dimensional point clouds;
generating rays for each pixel by using camera parameters (including internal parameters and external parameters) of all the obtained camera images to obtain a ray set of all the pixels;
inputting the three-dimensional point cloud into an occupation network to generate occupation grids, storing the occupation probability (0-1) of the point at each grid vertex, and using the occupation grids as the geometric initialization prior of the voxel radiation field to sparsify the voxel grids in the voxel radiation field to obtain a sparse voxel radiation field;
sampling all rays, using a sparse voxel grid as an auxiliary in sampling, optimizing a sampling strategy, only sampling a non-blank part in the sparse grid, and performing interpolation sampling on adjacent voxel vertexes by utilizing a trilinear interpolation value on a sampling point to obtain a geometric parameter and a color parameter of each point;
performing voxel rendering on all sampling points on a ray to respectively obtain an RGB color image and a depth image;
constructing a loss function (color constraint and geometric constraint) by using the acquired color image and depth image and the rendered RGB color image and depth image;
and (4) gradient transfer is carried out by using a loss function, parameters of the voxel radiation field are optimized until all the parameters are converged, and finally sparse voxel radiation field representation of the current scene is obtained.
As a further improvement, the sparse voxel radiation field of the invention is specifically:
the sparse voxels are obtained by sparsifying a dense voxel grid and retain only the part of space occupied by objects in the scene, in order to greatly reduce unnecessary information storage cost; meanwhile, the stored information comprises 1-dimensional geometric information and 3-or-more-dimensional color information on each voxel, and the voxel grid together with the stored geometric and color information constitutes the sparse voxel radiation field.
As a further improvement, the method of the invention uses the sparse voxel grid as an aid, and the optimized sampling strategy specifically comprises:
for a ray passing through the sparse voxel field in space, all blank cells in space are skipped during ray sampling, and sampling is performed only within the sparse grid. This greatly reduces the number of samples while preserving sampling quality.
As a further improvement, constructing the loss function from the acquired color and depth images and the rendered RGB color and depth images specifically comprises:
computing the squared error between the RGB color image rendered by the radiation field and the actually captured color image to form the color constraint, and computing the squared error between the depth image rendered by the radiation field and the actually captured depth image to form the depth constraint, according to the following formula:
$$Loss = Loss_{color} + Loss_{depth} = \sum_{r \in R} \left\| \hat{C}(r) - C(r) \right\|_2^2 + \sum_{r \in R} \left\| \hat{D}(r) - D(r) \right\|_2^2$$

where $R$ is the set of training rays, $\hat{C}(r)$ and $\hat{D}(r)$ are the color and depth rendered for ray $r$, and $C(r)$ and $D(r)$ are the captured color and depth.
As a further improvement, using the occupancy grid as the geometric initialization prior of the voxel radiation field specifically comprises:
generating a three-dimensional point cloud from the acquired depth images via the camera parameters; passing the three-dimensional point cloud through an occupancy network to generate an occupancy probability grid; and setting an occupancy probability threshold and deleting the cells of the occupancy probability grid below the threshold, thereby obtaining the sparse voxel radiation field described above. In addition, at initialization the occupancy probability values in the occupancy probability grid are used as the initial values of the geometric density.
Compared with existing neural radiance field representation methods, the invention has the following beneficial effects:
1) Because most of a real scene is blank, the invention represents the scene with a sparse grid, reducing memory cost;
2) Because implicitly learning geometry through volume rendering requires a large number of images, the invention uses the completed three-dimensional reconstruction as a geometric constraint to strengthen geometry learning;
3) The invention uses the completed three-dimensional reconstruction as initialization guidance, reducing the parameter count in the grid-generation stage, improving optimization efficiency, and lowering storage cost;
4) Because invalid samples are avoided within the sparse grid during ray sampling, the number of samples required per ray is greatly reduced, so the invention is significantly faster in both training and rendering: about 1000x faster than multilayer-perceptron-based neural radiance fields and about 5x faster than dense-voxel-grid radiation fields. At rendering time the invention reaches about 20 Hz, close to real time.
5) Because the invention uses the scene's three-dimensional point cloud as a geometric prior, the target effect can be recovered from fewer images, avoiding large-scale data collection and improving scene reconstruction efficiency.
Drawings
FIG. 1 is a schematic diagram of the sparse initialization of the grid from a point cloud introduced by the invention;
101 denotes the three-dimensional point cloud of the scene; 102 denotes placing the scene point cloud into a dense grid according to its maximum bounding box; 103 denotes sparsifying the dense grid with the three-dimensional point cloud to obtain the sparse grid;
FIG. 2 is a schematic diagram of the invention applied to rendering;
204 denotes the sparse grid radiation field; 201 denotes the current camera position; 202 denotes the ray direction of the current pixel; 203 denotes the final rendered images (a color map and a depth map);
FIG. 3 is a flow chart of the algorithm during actual training and rendering.
Detailed Description
The invention discloses a sparse grid radiation field representation method based on point cloud initialization and depth supervision, which comprises the following steps:
First, the invention uses a voxel-based scene representation, i.e., a voxel grid represents the scene, avoiding a multilayer perceptron and its redundant network parameters and computation, thereby improving speed. Meanwhile, the invention uses spherical harmonic coefficients as the color parameters on the voxel grid, which effectively models the fact that the same point can show different colors when observed from different viewing angles.
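The patent leaves the spherical-harmonic degree open (step four below allows c = 3, 12, or 27 color parameters per voxel, i.e., 1, 4, or 9 basis functions per RGB channel). As a hedged illustration, a degree-2 evaluation of view-dependent color could look like the following sketch; the function names and tensor layout are assumptions, not part of the patent.

```python
import torch

def sh_basis_deg2(dirs: torch.Tensor) -> torch.Tensor:
    """The 9 real spherical-harmonic basis functions of degree <= 2,
    evaluated at unit view directions `dirs` of shape (S, 3)."""
    x, y, z = dirs[:, 0], dirs[:, 1], dirs[:, 2]
    return torch.stack([
        0.282095 * torch.ones_like(x),              # l = 0
        0.488603 * y, 0.488603 * z, 0.488603 * x,   # l = 1
        1.092548 * x * y, 1.092548 * y * z,         # l = 2
        0.315392 * (3.0 * z * z - 1.0),
        1.092548 * x * z,
        0.546274 * (x * x - y * y),
    ], dim=-1)

def sh_color(coeffs: torch.Tensor, dirs: torch.Tensor) -> torch.Tensor:
    """View-dependent RGB at S points: coeffs (S, 3, 9), dirs (S, 3)."""
    basis = sh_basis_deg2(dirs)                      # (S, 9)
    return (coeffs * basis[:, None, :]).sum(dim=-1)  # (S, 3)
```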
Second, the method incorporates the scene's existing three-dimensional reconstruction information (such as a point cloud and depth maps). As shown in FIG. 1, 101 is the reconstructed three-dimensional point cloud of the scene, and the scene point cloud is aligned with the dense voxel grid 102.
Further, an image is rendered using the grid as shown in FIG. 2: 201 is the current camera position and 202 is the ray direction of the current pixel. Points in space are sampled along the ray, and the colors and distances of all sample points are weighted and summed to obtain the color and depth of the current pixel; sampling all pixels yields the color image and depth image of the camera at the current position, as shown at 203. Because the voxel grid 204 is sparse, only the grid cells the ray actually passes through need to be sampled, and a ray generally passes through few cells; the sparse-grid-based sampling method therefore greatly accelerates training convergence, and thanks to this acceleration the radiation field of the invention can be rendered at real-time speed.
Finally, the color and depth images rendered by the current radiation field are compared with the real color and depth images provided in the data set; the RGB error and the geometric error are computed, their gradients are calculated and propagated to each voxel vertex, and the geometric and color parameters of the voxels are optimized. As shown in FIG. 3, the color constraint and the geometric constraint cooperate to train the radiation field effectively, so that it can finally render realistic scene images and structure.
The technical solution of the present invention is further illustrated by the following specific embodiments:
the method comprises the following steps: the invention uses the sensor to collect the color image and the depth image respectively, and records the collected color image as I color Depth image is marked as I depth . After data acquisition is completed, positioning all acquired images by using an image positioning algorithm, converting the depth image into a three-dimensional point cloud of a scene, and recording the three-dimensional point cloud as P scene . In the conversion of the depth image to a three-dimensional point cloud, the following formula is used for calculation:
$$P_c = d \, K^{-1} \begin{bmatrix} u \\ v \\ 1 \end{bmatrix}, \qquad P_w = R \, P_c + t$$

where $(u, v)$ is a pixel coordinate, $d$ its depth value, $K$ the camera intrinsic matrix, and $[R \mid t]$ the camera-to-world extrinsic; all pixels with valid depth are back-projected in this way to form P_scene.
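As a concrete illustration of this back-projection, here is a minimal NumPy sketch; the function name and the camera-to-world convention T_wc are illustrative assumptions.

```python
import numpy as np

def depth_to_pointcloud(depth: np.ndarray, K: np.ndarray, T_wc: np.ndarray) -> np.ndarray:
    """Back-project a depth image (H, W) into a world-frame point cloud (M, 3).
    K: 3x3 intrinsic matrix; T_wc: 4x4 camera-to-world extrinsic."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    d = depth.reshape(-1)
    valid = d > 0                                   # discard missing depth
    # pixel -> camera frame: P_c = d * K^{-1} [u, v, 1]^T
    pix = np.stack([u.reshape(-1), v.reshape(-1), np.ones(H * W)], axis=0)
    P_c = (np.linalg.inv(K) @ pix) * d
    P_c = P_c[:, valid]
    # camera frame -> world frame: P_w = R P_c + t
    P_w = T_wc[:3, :3] @ P_c + T_wc[:3, 3:4]
    return P_w.T
```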
step two: and converting each pixel of the color image and the depth image into a ray under a camera coordinate system by utilizing the positioned color image and the positioned depth image and combining internal reference and external reference of the camera, and then expressing each ray under a world coordinate system by using the external reference of the camera to obtain a ray set under the world coordinate system corresponding to all image pixels.
Step three: P_scene is fed into an occupancy network, which consists of a three-dimensional convolutional neural network and a fully-connected network, to generate an occupancy probability grid, denoted G_occ. The resolution of the occupancy probability grid is set to N x N x N, and each grid vertex stores the occupancy probability of the corresponding spatial point, in the range 0-1, where 1 means the point is occupied by an object and 0 means the point is blank space.
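The text fixes only the ingredients of the occupancy network (a 3D CNN plus a fully-connected part); the sketch below is one plausible instantiation under those assumptions, with illustrative layer sizes and a per-voxel 1x1x1 convolution playing the role of the fully-connected head.

```python
import torch
import torch.nn as nn

class OccupancyNet(nn.Module):
    """Illustrative occupancy network: a small 3D CNN over a binary
    voxelization of P_scene, followed by a per-voxel head that outputs
    an occupancy probability in [0, 1]."""
    def __init__(self, channels: int = 16):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv3d(1, channels, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv3d(channels, channels, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.head = nn.Conv3d(channels, 1, kernel_size=1)  # per-voxel FC

    def forward(self, vox: torch.Tensor) -> torch.Tensor:
        # vox: (B, 1, N, N, N) binary point occupancy -> (B, 1, N, N, N) probs
        return torch.sigmoid(self.head(self.conv(vox)))
```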
Step four: using the occupancy probability grid G_occ, a suitable threshold is selected (e.g., 0.5) and all cells whose occupancy probability falls below it are removed, thereby constructing the sparse voxel grid radiation field; the geometric information at the vertices of the sparse-grid radiation field is initialized with the occupancy probability values. The sparsity of the grid is stored in an N x N x N x 1 tensor: for a cell (i, j, k) of the dense grid, a stored value of -1 means the cell is empty, and the n retained cells store distinct integer indices from 0 to n-1 (values greater than -1). The geometry of the scene is then represented by an n x 1 matrix whose entries are the occupancy probabilities o of the corresponding cells, and an n x c matrix represents the color of the scene, where c = 3 when RGB is stored directly; if spherical harmonic coefficients are used, c may be 3, 12, 27, and so on. Together these geometric and color parameters constitute the sparse voxel grid radiation field of the invention. After initialization, the geometric parameters of the sparse voxel grid radiation field are the occupancy probabilities, initialized to the values of the corresponding occupancy probability grid, and the color parameters are uniformly initialized to 0.
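The index-tensor construction just described can be sketched as follows; `build_sparse_index` and the choice c = 27 (degree-2 spherical harmonics) are illustrative.

```python
import torch

def build_sparse_index(occ_prob: torch.Tensor, threshold: float = 0.5):
    """From an (N, N, N) occupancy-probability grid, build the N x N x N x 1
    index tensor: -1 marks an empty cell; the n retained cells receive
    distinct integer indices 0..n-1 into the parameter matrices."""
    keep = occ_prob >= threshold
    n = int(keep.sum())
    index = torch.full(occ_prob.shape, -1, dtype=torch.long)
    index[keep] = torch.arange(n)
    geometry = occ_prob[keep].clone().unsqueeze(-1)  # (n, 1), init = occupancy
    color = torch.zeros(n, 27)                       # (n, c), SH coeffs init 0
    return index.unsqueeze(-1), geometry, color
```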
Step five: all rays from step two are sampled. The nearest and farthest sample points are preset according to the scene; generally, the nearest point of a ray is the ray origin and the farthest point is determined by the intersection of the ray with the scene grid. During ray sampling, the sparse voxel grid from step four is used for acceleration: if the cell containing the current sample point does not fall within the sparse voxel grid, the sample is skipped. Using the sparse voxel grid thus avoids invalid samples and greatly accelerates sampling while preserving sampling quality. For sample points inside the sparse voxel grid, trilinear interpolation over the surrounding grid vertices yields the parameters of the sample point (occupancy probability and color parameters). Trilinear interpolation computes a weight for each vertex from the distance between the sample point and that voxel vertex; generally, the closer the vertex, the larger the weight. The vertex parameters are then weighted and summed with these weights to obtain the parameters of the current sample point.
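A hedged sketch of this trilinear sampling with empty-cell skipping, assuming a cubic grid and the index tensor from step four (all names are illustrative); the returned mask tells the caller which samples fell in non-blank cells.

```python
import torch

def sample_in_sparse_grid(pts, index, geometry, color, grid_min, voxel_size):
    """Trilinearly interpolate vertex parameters at sample points.
    pts: (S, 3) world positions; index: (N, N, N, 1) sparse index tensor
    (-1 = empty); geometry: (n, 1); color: (n, c)."""
    N = index.shape[0]
    rel = (pts - grid_min) / voxel_size           # continuous grid coordinates
    base = rel.floor().long().clamp(0, N - 2)     # lower corner of the cell
    frac = rel - base.float()                     # offset inside the cell
    occ = torch.zeros(pts.shape[0], 1)
    rgb = torch.zeros(pts.shape[0], color.shape[1])
    for dx in (0, 1):                             # the 8 corners of the cell
        for dy in (0, 1):
            for dz in (0, 1):
                idx = index[base[:, 0] + dx, base[:, 1] + dy, base[:, 2] + dz, 0]
                w = (frac[:, 0] if dx else 1 - frac[:, 0]) \
                  * (frac[:, 1] if dy else 1 - frac[:, 1]) \
                  * (frac[:, 2] if dz else 1 - frac[:, 2])
                w = w * (idx >= 0).float()        # empty vertices contribute 0
                safe = idx.clamp(min=0)
                occ += w[:, None] * geometry[safe]
                rgb += w[:, None] * color[safe]
    non_blank = index[base[:, 0], base[:, 1], base[:, 2], 0] >= 0
    return occ.squeeze(-1), rgb, non_blank
```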
Step six: volume rendering is performed over all the sample points from step five. Suppose a ray effectively samples k points; denote the occupancy probability of the i-th point by o_i, its color by c_i, and its depth distance by t_i. First, the sampling weight of the current point is computed from the occupancy probabilities: the weight is the occupancy probability of the current point multiplied by the non-occupancy probabilities of all points before it. After the weights are computed, for depth rendering the depth value of a pixel is the weighted sum of the depth values of all sample points on its ray; for color rendering the color value of a pixel is the weighted sum of the color values of all sample points on its ray. In the volume integration, the following formulas are therefore used for the weight computation and the weighted summation to obtain the volume rendering result of the ray:
$$w_i = o_i \prod_{j=1}^{i-1} \left( 1 - o_j \right)$$

$$\hat{C}(r) = \sum_{i=1}^{k} w_i \, c_i$$

$$\hat{D}(r) = \sum_{i=1}^{k} w_i \, t_i$$
and performing voxel rendering on rays corresponding to all pixels, and generating a prediction image and a prediction depth image.
Step seven: errors (a color error and a depth error) are constructed between the images predicted in step six and the images actually acquired in step one. The color error is the difference over the three RGB channels at each pixel and the depth error is the difference in depth at each pixel; they correspond respectively to the texture information and the geometric information of the scene. The color and depth errors are computed as follows:
$$Loss = Loss_{color} + Loss_{depth}$$

$$Loss_{color} = \sum_{r \in R} \left\| \hat{C}(r) - C(r) \right\|_2^2$$

$$Loss_{depth} = \sum_{r \in R} \left\| \hat{D}(r) - D(r) \right\|_2^2$$

where $R$ is the set of rays and $C(r)$, $D(r)$ are the captured color and depth of the pixel corresponding to ray $r$.
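A direct transcription of these two loss terms (the function name is illustrative):

```python
import torch

def rgbd_loss(C_pred, C_gt, D_pred, D_gt):
    """Summed squared color and depth errors over all rays.
    C_*: (R, 3) pixel colors; D_*: (R,) pixel depths."""
    loss_color = ((C_pred - C_gt) ** 2).sum()
    loss_depth = ((D_pred - D_gt) ** 2).sum()
    return loss_color + loss_depth
```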
step eight: and carrying out gradient back propagation on the error function for updating parameters in the sparse grid radiation field. When the gradient of all errors is solved, one method is to automatically solve by using a library which can be automatically derived such as a Pythrch, and the other method is to theoretically derive the first derivative of each loss term to each voxel parameter and then correct all voxel parameters by using an optimization method.
Step nine: once parameter training of the sparse grid radiation field is finished and the current radiation field represents the current scene, then given a brand-new camera pose and camera intrinsics, the invention can render the corresponding camera image using steps two, five, and six, so an image rendering result can be obtained from any viewpoint in the scene.
The above examples are only preferred embodiments of the invention; the invention is not limited to them, and other modifications and variations directly derived or suggested to those skilled in the art without departing from the spirit and concept of the invention should be considered as falling within the protection scope of the invention.

Claims (5)

1. A sparse grid radiation field representation method based on point cloud initialization and depth supervision, characterized by comprising the following steps:
obtaining at least two camera color images;
obtaining two camera depth images and converting the depth images into a three-dimensional point cloud;
generating a ray for each pixel using the camera parameters (intrinsics and extrinsics) of all acquired camera images, to obtain the ray set of all pixels;
inputting the three-dimensional point cloud into an occupancy network to generate an occupancy grid that stores, at each grid vertex, the occupancy probability (0-1) of the corresponding point, and using the occupancy grid as a geometric initialization prior of the voxel radiation field to sparsify the voxel grid, obtaining a sparse voxel radiation field;
sampling all rays, using the sparse voxel grid to assist and optimize the sampling strategy so that only the non-blank part of the sparse grid is sampled, and trilinearly interpolating the parameters of the neighboring voxel vertices at each sample point to obtain the geometric parameter and color parameters of that point;
performing volume rendering over all sample points on each ray to obtain an RGB color image and a depth image;
constructing a loss function (color constraint and geometric constraint) from the acquired color and depth images and the rendered RGB color and depth images;
propagating gradients with the loss function and optimizing the parameters of the voxel radiation field until all parameters converge, finally obtaining the sparse voxel radiation field representation of the current scene.
2. The sparse grid radiation field representation method based on point cloud initialization and depth supervision according to claim 1, wherein the sparse voxel radiation field is specifically:
the sparse voxels are obtained by sparsifying a dense voxel grid and retain only the part of space occupied by objects in the scene, in order to greatly reduce unnecessary information storage cost; meanwhile, the stored information comprises 1-dimensional geometric information and 3-or-more-dimensional color information on each voxel, and the voxel grid together with the stored geometric and color information constitutes the sparse voxel radiation field.
3. The sparse grid radiation field representation method based on point cloud initialization and depth supervision according to claim 1 or 2, wherein using the sparse voxel grid to assist and optimize the sampling strategy specifically comprises:
for a ray passing through the sparse voxel field in space, skipping all blank cells in space during ray sampling and sampling only within the sparse grid.
4. The method according to claim 3, wherein constructing the loss function from the acquired color and depth images and the rendered RGB color and depth images specifically comprises:
computing the squared error between the RGB color image rendered by the radiation field and the actually captured color image to form the color constraint, and computing the squared error between the depth image rendered by the radiation field and the actually captured depth image to form the depth constraint, according to the following formulas:

$$Loss_{color} = \sum_{r \in R} \left\| \hat{C}(r) - C(r) \right\|_2^2$$

$$Loss_{depth} = \sum_{r \in R} \left\| \hat{D}(r) - D(r) \right\|_2^2$$

where $R$ is the set of rays, $\hat{C}(r)$ and $\hat{D}(r)$ are the color and depth rendered for ray $r$, and $C(r)$ and $D(r)$ are the captured color and depth.
5. The sparse grid radiation field representation method based on point cloud initialization and depth supervision according to claim 1, 2 or 4, wherein using the occupancy grid as the geometric initialization prior of the voxel radiation field specifically comprises:
generating a three-dimensional point cloud from the acquired depth images via the camera parameters; passing the three-dimensional point cloud through an occupancy network to generate an occupancy probability grid; setting an occupancy probability threshold and deleting the cells of the occupancy probability grid below the threshold, thereby obtaining the sparse voxel radiation field; and, at initialization, using the occupancy probability values in the occupancy probability grid as the initial values of the geometric density.
CN202211653775.XA 2022-12-22 2022-12-22 Sparse grid radiation field representation method based on point cloud initialization and depth supervision Pending CN115953551A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211653775.XA CN115953551A (en) 2022-12-22 2022-12-22 Sparse grid radiation field representation method based on point cloud initialization and depth supervision

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211653775.XA CN115953551A (en) 2022-12-22 2022-12-22 Sparse grid radiation field representation method based on point cloud initialization and depth supervision

Publications (1)

Publication Number Publication Date
CN115953551A (en) 2023-04-11

Family

ID=87297523

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211653775.XA Pending CN115953551A (en) 2022-12-22 2022-12-22 Sparse grid radiation field representation method based on point cloud initialization and depth supervision

Country Status (1)

Country Link
CN (1) CN115953551A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116452758A (en) * 2023-06-20 2023-07-18 擎翌(上海)智能科技有限公司 Neural radiation field model acceleration training method, device, equipment and medium
CN116452758B (en) * 2023-06-20 2023-10-20 擎翌(上海)智能科技有限公司 Neural radiation field model acceleration training method, device, equipment and medium
CN117745924A (en) * 2024-02-19 2024-03-22 北京渲光科技有限公司 Neural rendering method, system and equipment based on depth unbiased estimation
CN117745924B (en) * 2024-02-19 2024-05-14 北京渲光科技有限公司 Neural rendering method, system and equipment based on depth unbiased estimation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination