CN116805349A - Indoor scene reconstruction method and device, electronic equipment and medium


Info

Publication number: CN116805349A
Application number: CN202310544904.XA
Priority application: CN202310544904.XA
Authority: CN (China)
Other languages: Chinese (zh)
Prior art keywords: training, point, point cloud, scene, points
Legal status: Pending
Inventors: 齐越, 曲延松, 王君义, 段宛彤, 王宇泽
Current and original assignee: Beihang University
Application filed by: Beihang University

Classifications

    • G06T 17/00: Three-dimensional [3D] modelling, e.g. data description of 3D objects
    • G06N 3/0464: Computing arrangements based on biological models; neural networks; convolutional networks [CNN, ConvNet]
    • G06N 3/08: Computing arrangements based on biological models; neural networks; learning methods
    • G06T 15/005: 3D image rendering; general purpose rendering architectures
    • G06T 15/50: 3D image rendering; lighting effects
    • G06T 19/006: Manipulating 3D models or images for computer graphics; mixed reality
    • G06V 10/82: Image or video recognition or understanding using pattern recognition or machine learning; using neural networks
    • G06V 20/70: Scenes; scene-specific elements; labelling scene content, e.g. deriving syntactic or semantic representations


Abstract

The application provides an indoor scene reconstruction method and device, an electronic device, and a medium. The method comprises the following steps: acquiring training view angle parameters from training images and establishing a scene point cloud; selecting a plurality of position points as training sampling points on rays that start from the training view angle origin and pass through the position point corresponding to each pixel of the training image; determining the point cloud points of the scene point cloud within a predetermined range around the training sampling points as a training point set, and inputting the current input information of the training point set into the current scene reconstruction model to obtain the colors and light field intensities of the training point set; obtaining the colors and light field intensities of the training sampling points by interpolation, and then rendering to obtain a rendered image under the training view angle parameters; and training the current input information and the scene reconstruction model according to the loss between the rendered image and the training image under the training view angle parameters. The method solves the problem of poor quality of the rendered indoor scene images generated when the number of training images is small.

Description

Indoor scene reconstruction method and device, electronic equipment and medium
Technical Field
The present application relates to the field of virtual reality technologies, and in particular, to a method and apparatus for reconstructing an indoor scene, an electronic device, and a medium.
Background
Virtual reality (VR) technology is primarily a computer technology: it utilizes and integrates the latest achievements of three-dimensional graphics, multimedia, simulation, display, servo, and other high technologies, and uses computers and related equipment to generate a realistic virtual world with three-dimensional visual, tactile, olfactory, and other sensory experiences, giving people in the virtual world an immersive feeling. With the continuous development of social productivity and of science and technology, the demand for VR technology from all kinds of industries keeps increasing. VR technology has also made tremendous progress and has gradually become a new field of science and technology. Novel view synthesis of indoor scenes plays an important role in the interaction between virtual reality and people, so this task has great potential in many VR applications, such as virtual roaming of indoor scenes or visitor navigation.
In the prior art, neural radiance fields are widely adopted for novel view synthesis of indoor scenes: a continuous multi-layer perceptron encodes the radiance and density of the three-dimensional scene, and the light field is reconstructed through ray tracing.
In the above scheme, reconstructing a room-sized scene requires at least several hundred RGB images densely acquired around the scene, which demands considerable labor and time and places high requirements on shooting quality. If the number of input images is insufficient, many holes and rendering errors appear in the generated renderings; hence the quality of the rendered indoor scene images is poor when the number of training images is small.
Disclosure of Invention
The application provides an indoor scene reconstruction method, an indoor scene reconstruction device, electronic equipment and a medium, which are used for solving the problem of poor quality of an indoor scene rendering image generated when the number of training images is small.
In one aspect, the present application provides an indoor scene reconstruction method, including:
acquiring images indoors to obtain training images; based on a structure-from-motion technique, acquiring training view angle parameters corresponding to each training image according to the training images and establishing a scene point cloud; wherein the scene point cloud is a point cloud representing indoor scene structure information;
selecting a plurality of position points as training sampling points on rays which start from the training view angle origin and pass through the position points corresponding to each pixel point in the training image under the training view angle parameters in the scene point cloud; determining point cloud points in the scene point cloud positioned in a preset range near the training sampling points based on a semantic prediction technology as a training point set, and acquiring current input information of the training point set; wherein the input information includes spatial location, neural characteristics, and ray direction;
inputting the current input information of the training point set to a current scene reconstruction model to obtain the color and the light field intensity of the training point set output by the scene reconstruction model; obtaining the color and the light field intensity of the training sampling point through interpolation according to the color and the light field intensity of the training point set; according to the color and the light field intensity of the training sampling points, rendering images under the training visual angle parameters are obtained through rendering;
And training and correcting the current input information and the current scene reconstruction model according to the loss between the rendered image under the training view angle parameter and the training image under the training view angle parameter until the trained input information and the trained scene reconstruction model are obtained.
Optionally, the determining, based on the semantic prediction technology, point cloud points in the scene point cloud located in a predetermined range around the plurality of training sampling points as a training point set includes:
based on a semantic prediction technology, semantic information of each pixel point in the training image under the training view angle parameters and semantic information of each point cloud point in the scene point cloud are obtained; the semantic information of the pixel points comprises object categories, and the semantic information of the point cloud points comprises the object categories and corresponding probability values;
for each training sampling point, acquiring all point cloud points of the scene point cloud within a preset range around the training sampling point; calculating the selection probability of each of these point cloud points, and selecting a preset number of point cloud points as the training point set in order of selection probability from high to low; wherein calculating the selection probability comprises: if the object class of the point cloud point is the same as the object class of the pixel point corresponding to the training sampling point, the selection probability of the point cloud point is one; if the object class of the point cloud point is different from the object class of the pixel point corresponding to the training sampling point, the selection probability of the point cloud point is one minus the probability value corresponding to the object class of the pixel point corresponding to that point cloud point.
Optionally, the neural features include a first neural feature and a second neural feature, and the acquiring current input information of the training point set includes:
inputting all the training images and the scene point cloud into a semantic prediction network to obtain first neural characteristics of each point cloud point in the scene point cloud;
and acquiring the training image corresponding to the training view angle origin closest to the point cloud point, and inputting the training image into a convolutional neural network to obtain a second neural feature of the point cloud point.
Optionally, the method further comprises:
acquiring target view angle parameters of a scene to be reconstructed; determining virtual pixel points according to the target visual angle parameters;
selecting a plurality of position points as target sampling points on rays which start from the target view angle origin and pass through the position points corresponding to each virtual pixel point in the scene point cloud; and determining point cloud points in the scene point cloud positioned in a preset range near the plurality of target sampling points as a target point set, and acquiring the input information of the target point set;
inputting the input information of the target point set into the scene reconstruction model to obtain the color and the light field intensity of the target point set output by the scene reconstruction model; obtaining the color and the light field intensity of the target sampling point through interpolation according to the color and the light field intensity of the target point set; and rendering according to the color and the light field intensity of the target sampling point to obtain a rendered image under the target visual angle parameter.
Optionally, the determining, as the target point set, a point cloud point in the scene point cloud located in a predetermined range around the plurality of target sampling points includes:
and acquiring all point cloud points of the scene point cloud located within a preset range around the plurality of target sampling points, and selecting a preset number of point cloud points closest to each sampling point as the target point set.
Optionally, the training image satisfies the following conditions: the training image includes a room surface; the overlap between training images at any two adjacent training perspective parameters is not less than thirty percent of the overall image size.
In another aspect, the present application provides an indoor scene reconstruction apparatus, including:
the acquisition module is used for acquiring images indoors to obtain training images; based on a structure-from-motion technique, acquiring training view angle parameters corresponding to each training image according to the training images and establishing a scene point cloud; wherein the scene point cloud is a point cloud representing indoor scene structure information;
the sampling module is used for selecting a plurality of position points as training sampling points on rays which start from the training view angle origin and pass through the position points corresponding to each pixel point in the training image under the training view angle parameters in the scene point cloud; determining point cloud points in the scene point cloud positioned in a preset range near the training sampling points based on a semantic prediction technology as a training point set, and acquiring current input information of the training point set; wherein the input information includes spatial location, neural characteristics, and ray direction;
The rendering module is used for inputting the current input information of the training point set to a current scene reconstruction model to obtain the color and the light field intensity of the training point set output by the scene reconstruction model; obtaining the color and the light field intensity of the training sampling point through interpolation according to the color and the light field intensity of the training point set; according to the color and the light field intensity of the training sampling points, rendering images under the training visual angle parameters are obtained through rendering;
and the correction module is used for carrying out training correction on the current input information and the current scene reconstruction model according to the loss between the rendered image under the training view angle parameter and the training image under the training view angle parameter until the trained input information and the trained scene reconstruction model are obtained.
Optionally, the sampling module is specifically configured to:
based on a semantic prediction technology, semantic information of each pixel point in the training image under the training view angle parameters and semantic information of each point cloud point in the scene point cloud are obtained; the semantic information of the pixel points comprises object categories, and the semantic information of the point cloud points comprises the object categories and corresponding probability values;
For each training sampling point, acquiring all point cloud points of the scene point cloud within a preset range around the training sampling point; calculating the selection probability of each of these point cloud points, and selecting a preset number of point cloud points as the training point set in order of selection probability from high to low; wherein calculating the selection probability comprises: if the object class of the point cloud point is the same as the object class of the pixel point corresponding to the training sampling point, the selection probability of the point cloud point is one; if the object class of the point cloud point is different from the object class of the pixel point corresponding to the training sampling point, the selection probability of the point cloud point is one minus the probability value corresponding to the object class of the pixel point corresponding to that point cloud point.
Optionally, the neural features include a first neural feature and a second neural feature, and the sampling module is further specifically configured to:
inputting all the training images and the scene point cloud into a semantic prediction network to obtain first neural characteristics of each point cloud point in the scene point cloud;
and acquiring the training image corresponding to the training view angle origin closest to the point cloud point, and inputting the training image into a convolutional neural network to obtain a second neural feature of the point cloud point.
Optionally, the apparatus further includes a reconstruction module, configured to:
acquiring target view angle parameters of a scene to be reconstructed; determining virtual pixel points according to the target visual angle parameters;
selecting a plurality of position points as target sampling points on rays which start from the target view angle origin and pass through the position points corresponding to each virtual pixel point in the scene point cloud; and determining point cloud points in the scene point cloud positioned in a preset range near the plurality of target sampling points as a target point set, and acquiring the input information of the target point set;
inputting the input information of the target point set into the scene reconstruction model to obtain the color and the light field intensity of the target point set output by the scene reconstruction model; obtaining the color and the light field intensity of the target sampling point through interpolation according to the color and the light field intensity of the target point set; and rendering according to the color and the light field intensity of the target sampling point to obtain a rendered image under the target visual angle parameter.
Optionally, the reconstruction module is specifically configured to:
and acquiring all point cloud points of the scene point cloud located within a preset range around the plurality of target sampling points, and selecting a preset number of point cloud points closest to each sampling point as the target point set.
Optionally, the training image under the training view angle parameters satisfies the following conditions: the training image comprises a room surface; the overlap between the training images at any two adjacent training view angle parameters is not less than thirty percent of the overall image size.
In yet another aspect, the present application provides an electronic device, comprising: a processor, and a memory communicatively coupled to the processor; the memory stores computer-executable instructions; the processor executes computer-executable instructions stored in the memory to implement the method as described above.
In yet another aspect, the application provides a computer-readable storage medium having stored therein computer-executable instructions for performing the method as described above when executed by a processor.
In the indoor scene reconstruction method and device, the electronic device, and the medium provided by the application, images are acquired indoors to obtain training images; based on a structure-from-motion technique, the training view angle parameters corresponding to each training image are acquired from the training images and a scene point cloud is established, the scene point cloud being a point cloud representing indoor scene structure information; a plurality of position points are selected as training sampling points on rays that start from the training view angle origin and pass, within the scene point cloud, through the position point corresponding to each pixel of the training image under the training view angle parameters; the point cloud points of the scene point cloud located within a predetermined range around the training sampling points are determined, based on a semantic prediction technique, as a training point set, and the current input information of the training point set is acquired, the input information including spatial position, neural features, and ray direction; the current input information of the training point set is input into the current scene reconstruction model to obtain the colors and light field intensities of the training point set output by the scene reconstruction model; the colors and light field intensities of the training sampling points are obtained by interpolation from the colors and light field intensities of the training point set; a rendered image under the training view angle parameters is obtained by rendering according to the colors and light field intensities of the training sampling points; and the current input information and the current scene reconstruction model are trained and corrected according to the loss between the rendered image under the training view angle parameters and the training image under the training view angle parameters, until the trained input information and the trained scene reconstruction model are obtained. In this scheme, the training point set is selected around the training sampling points based on a semantic prediction technique, which increases the likelihood that the object classes of the selected point cloud points are the same as the object classes corresponding to the training sampling points; the colors and light field intensities of the training sampling points interpolated from the colors and light field intensities of these point cloud points are therefore more accurate, which improves the quality of the generated rendered image, and the input information of the point cloud points and the scene reconstruction model can be better trained and corrected from the training images and the corresponding rendered images, which reduces the required number of training images. The scheme thus solves the problem of poor quality of the rendered indoor scene images generated when the number of training images is small.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and together with the description, serve to explain the principles of the application.
Fig. 1 is a schematic flow chart illustrating an indoor scene reconstruction method according to a first embodiment of the present application;
Fig. 2 is an example diagram of generating a scene point cloud according to the first embodiment of the present application;
Fig. 3 is an example diagram of a ray tracing scene according to the first embodiment of the present application;
Fig. 4 is an example diagram of a semantic prediction scene according to the first embodiment of the present application;
Fig. 5 is a schematic flow chart of acquiring neural features of point cloud points according to the first embodiment of the present application;
Fig. 6 is an example diagram of a scene rendered in real time with dynamic resolution according to the first embodiment of the present application;
Fig. 7 is a schematic structural diagram of an indoor scene reconstruction device according to a second embodiment of the present application;
Fig. 8 is a schematic structural diagram of an electronic device for indoor scene reconstruction according to a third embodiment of the present application.
Specific embodiments of the present application have been shown by way of the above drawings and will be described in more detail below. The drawings and the written description are not intended to limit the scope of the inventive concepts in any way, but rather to illustrate the inventive concepts to those skilled in the art by reference to the specific embodiments.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary examples do not represent all implementations consistent with the application. Rather, they are merely examples of apparatus and methods consistent with aspects of the application as detailed in the accompanying claims.
Virtual reality (VR) technology is primarily a computer technology: it utilizes and integrates the latest achievements of three-dimensional graphics, multimedia, simulation, display, servo, and other high technologies, and uses computers and related equipment to generate a realistic virtual world with three-dimensional visual, tactile, olfactory, and other sensory experiences, giving people in the virtual world an immersive feeling. With the continuous development of social productivity and of science and technology, the demand for VR technology from all kinds of industries keeps increasing. VR technology has also made tremendous progress and has gradually become a new field of science and technology. Novel view synthesis of indoor scenes plays an important role in the interaction between virtual reality and people, so this task has great potential in many VR applications, such as virtual roaming of indoor scenes or visitor navigation.
To reconstruct a room-sized scene, existing novel view synthesis techniques for indoor scenes need at least several hundred RGB images densely collected around the scene, which demands considerable labor and time and places high requirements on shooting quality. If the number of input images is insufficient, many holes and rendering errors appear in the generated renderings; hence the quality of the rendered indoor scene images is poor when the number of training images is small.
The technical scheme of the application is illustrated in the following specific embodiments. The following embodiments may be combined with each other, and the same or similar concepts or processes may not be described in detail in some embodiments.
Example 1
Fig. 1 is a flowchart of an indoor scene reconstruction method according to an embodiment of the present application. As shown in fig. 1, the indoor scene reconstruction method provided in this embodiment may include:
S101, acquiring images indoors to obtain training images; based on a structure-from-motion technique, acquiring training view angle parameters corresponding to each training image according to the training images and establishing a scene point cloud; wherein the scene point cloud is a point cloud representing indoor scene structure information;
S102, selecting a plurality of position points as training sampling points on rays which start from the training view angle origin and pass through the position points corresponding to each pixel point in the training image under the training view angle parameters in the scene point cloud; determining point cloud points in the scene point cloud positioned in a preset range near the training sampling points based on a semantic prediction technology as a training point set, and acquiring current input information of the training point set; wherein the input information includes spatial location, neural characteristics, and ray direction;
s103, inputting the current input information of the training point set into a current scene reconstruction model to obtain the color and the light field intensity of the training point set output by the scene reconstruction model; obtaining the color and the light field intensity of the training sampling point through interpolation according to the color and the light field intensity of the training point set; according to the color and the light field intensity of the training sampling points, rendering images under the training visual angle parameters are obtained through rendering;
and S104, training and correcting the current input information and the current scene reconstruction model according to the loss between the rendered image under the training view angle parameter and the training image under the training view angle parameter until the trained input information and the trained scene reconstruction model are obtained.
In practical application, the execution body of the embodiment may be an indoor scene reconstruction device, and the device may be implemented by a computer program, for example, application software or the like; alternatively, the computer program may be implemented as a medium storing a related computer program, for example, a usb disk, a cloud disk, or the like; still alternatively, it may be implemented by a physical device, e.g., a chip, a server, etc., integrated with or installed with the relevant computer program.
Specifically, a camera is used to shoot inside the room to obtain RGB training images; based on a structure-from-motion (Structure from Motion, SfM) technique such as COLMAP, the training view angle parameters corresponding to each training image are obtained from the training images, and a 3D scene point cloud representing the indoor scene structure is established.
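For illustration only, the SfM step can be driven from Python by calling the standard COLMAP command-line pipeline; the sketch below is an assumption about tooling, paths, and flags, not a required part of the method:

    import os
    import subprocess

    def run_colmap_sfm(image_dir: str, workspace: str) -> None:
        """Sketch: recover per-image camera (view angle) parameters and a sparse
        scene point cloud with COLMAP's standard SfM pipeline."""
        db = os.path.join(workspace, "database.db")
        sparse = os.path.join(workspace, "sparse")
        os.makedirs(sparse, exist_ok=True)
        subprocess.run(["colmap", "feature_extractor",
                        "--database_path", db, "--image_path", image_dir], check=True)
        subprocess.run(["colmap", "exhaustive_matcher",
                        "--database_path", db], check=True)
        subprocess.run(["colmap", "mapper",
                        "--database_path", db, "--image_path", image_dir,
                        "--output_path", sparse], check=True)

The mapper output contains the estimated intrinsics and extrinsics per training image together with the sparse 3D points used as the scene point cloud.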
In practical application, the viewing angle parameter may be a camera parameter, where the camera parameter includes a camera internal parameter and a camera external parameter; the camera internal parameters comprise an internal parameter matrix, and the camera external parameters represent the pose of the camera and comprise a rotation matrix and a translation vector. Based on a certain determined view angle parameter, the corresponding spatial position of each pixel point in the image imaged by the corresponding camera under the view angle parameter under the world coordinate system can be uniquely determined.
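A minimal NumPy sketch of this mapping, assuming a pinhole model with intrinsic matrix K and world-to-camera extrinsics (R, t); the convention and variable names are assumptions of this illustration:

    import numpy as np

    def pixel_to_world_ray(u, v, K, R, t):
        """Return the camera optical center (ray origin) and the unit ray direction,
        both in world coordinates, for pixel (u, v).
        Assumes world-to-camera extrinsics: x_cam = R @ x_world + t."""
        # Ray direction in camera coordinates from the intrinsic matrix K.
        d_cam = np.linalg.inv(K) @ np.array([u, v, 1.0])
        # Camera center (training view angle origin) and direction in world coordinates.
        origin = -R.T @ t
        d_world = R.T @ d_cam
        return origin, d_world / np.linalg.norm(d_world)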
Starting from the origin of the training visual angle, emitting a ray to a position point corresponding to each pixel point in the training image under the training visual angle parameter as light rays, and selecting a plurality of training sampling points on each light ray; the training view angle origin, the position points corresponding to the pixel points and the scene point cloud are all located in the same coordinate system, and each ray passes through the scene point cloud; in practical applications, the origin of the training view angle may be the optical center of the camera, and the spatial position thereof may be determined according to the training view angle parameter. Selecting a plurality of training point cloud points in a preset range near each training sampling point based on a semantic prediction technology, taking the training point cloud points as a training point set, and acquiring current input information of the training point set; wherein the input information includes spatial location, neural characteristics, and ray direction.
Inputting the current input information of the training point set into a current scene reconstruction model to obtain the color and the light field intensity of the training point set output by the scene reconstruction model; in practical application, the scene reconstruction model can comprise a light field intensity decoder and a color decoder, and specific parameters can be selected according to the actual production requirement; obtaining the color and the light field intensity of the training sampling points through interpolation according to the color and the light field intensity of the training point set; and according to the color of the training sampling point and the light field intensity, the color of each pixel point in the rendered image under the training visual angle parameter is obtained through rendering, and the rendered image is generated.
And correcting the input information of the point cloud point and the scene reconstruction model by calculating loss according to the training image and the rendering image under the training visual angle parameters until the trained input information of the point cloud point and the scene reconstruction model are obtained.
For example, in a room of about 10 square meters, 40 training images under different training viewing angle parameters are acquired for indoor scene reconstruction.
Based on the SfM technique, the training view angle parameters corresponding to the training images are obtained from the training images, and the scene point cloud is generated. Fig. 2 is an example diagram of generating a scene point cloud according to the first embodiment of the present application. As shown in fig. 2, from training images containing the indoor scene, a 3D scene point cloud characterizing the indoor scene structure can be obtained based on the SfM technique.
Fig. 3 is an example diagram of a ray tracing scene according to the first embodiment of the present application. As shown in fig. 3, the training view angle origin and the position point corresponding to each pixel of the training image are determined, in the coordinate system of the scene point cloud, according to the training view angle parameters; rays are cast from the training view angle origin through the position point of each pixel, N sampling points q_1, q_2, ..., q_N are uniformly sampled on each ray, and a preset number of point cloud points inside the sphere of radius r around each sampling point are selected, based on a semantic prediction technique, as the training point set. In one possible implementation, 128 training sampling points are selected on each ray, and 8 point cloud points are selected around each training sampling point as the training point set.
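A minimal sketch of this sampling step, assuming near/far ray bounds and a KD-tree over the SfM point cloud; the bounds, radius value, and data structures are illustrative assumptions:

    import numpy as np
    from scipy.spatial import cKDTree

    def sample_ray_neighbourhoods(origin, direction, cloud_xyz,
                                  n_samples=128, radius=0.1,
                                  t_near=0.1, t_far=8.0):
        """Uniformly place n_samples points q_1..q_N on the ray and collect, for each,
        the indices of point cloud points inside a sphere of radius r around it."""
        ts = np.linspace(t_near, t_far, n_samples)
        samples = origin[None, :] + ts[:, None] * direction[None, :]
        tree = cKDTree(cloud_xyz)
        neighbours = tree.query_ball_point(samples, r=radius)  # one index list per sample
        return samples, neighbours

In the implementation described above, n_samples is 128 and 8 of the returned neighbours are kept per sampling point.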
The current spatial position, neural features, and ray direction of the training point set are input into the decoders to obtain the color and light field intensity of each point cloud point of the training point set output by the decoders. The decoders are divided into two branches, a light field intensity (σ) decoder and a color (c) decoder, which generate the light field intensity and the color respectively. Each decoder consists of six fully connected hidden layers plus input and output layers; the hidden-layer widths of the light field intensity decoder are 128-256-256-96-32-16, and those of the color decoder are 128-256-256-128-64-8.
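A hedged PyTorch sketch of the two decoder branches with the hidden-layer widths given above; the input dimensionality, activation function, and output heads are assumptions of this illustration:

    import torch.nn as nn

    def mlp(in_dim, hidden, out_dim):
        """Fully connected stack: input layer, hidden layers, output layer."""
        dims = [in_dim] + hidden
        layers = []
        for d_in, d_out in zip(dims[:-1], dims[1:]):
            layers += [nn.Linear(d_in, d_out), nn.ReLU()]
        layers.append(nn.Linear(dims[-1], out_dim))
        return nn.Sequential(*layers)

    # Assumed input layout: 3-D position + 32-D neural feature + 3-D ray direction.
    point_feat_dim = 3 + 32 + 3
    sigma_decoder = mlp(point_feat_dim, [128, 256, 256, 96, 32, 16], 1)  # light field intensity
    color_decoder = mlp(point_feat_dim, [128, 256, 256, 128, 64, 8], 3)  # RGB color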
The color and light field intensity of each training sampling point are then obtained by inverse-distance interpolation from the colors and light field intensities of its training point set, and the color of the corresponding pixel is obtained by volume rendering integration over the colors and light field intensities of the training sampling points.
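A minimal sketch of the inverse-distance interpolation step; the exact weighting (here 1/distance) is an assumption of this illustration:

    import numpy as np

    def inverse_distance_interpolate(sample_xyz, pts_xyz, pts_vals, eps=1e-8):
        """Weight each neighbouring point cloud point by the inverse of its distance
        to the sampling point and return the weighted average of its values
        (color or light field intensity)."""
        d = np.linalg.norm(pts_xyz - sample_xyz[None, :], axis=1)
        w = 1.0 / (d + eps)
        return (w[:, None] * pts_vals).sum(axis=0) / w.sum()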
For a pixel, the volume rendering integral for its color c is calculated as follows:

c = Σ_{j=1}^{N} τ_j (1 - exp(-σ_j δ_j)) r_j

where N is the number of sampling points selected on the ray cast from the camera origin through the pixel, σ_j is the light field intensity of the j-th sampling point, r_j is the color of the j-th sampling point, δ_j is the distance between adjacent sampling points, and τ_j is the accumulated transmittance up to the j-th sampling point (in the standard volume rendering formulation, τ_j = exp(-Σ_{k<j} σ_k δ_k)).
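A PyTorch sketch of this volume rendering integral; the cumulative-product form of the transmittance τ_j follows the standard formulation and is otherwise an assumption of this illustration:

    import torch

    def volume_render_color(sigma, rgb, delta):
        """sigma, delta: (N,) light field intensities and adjacent-sample spacings;
        rgb: (N, 3) per-sample colors. Returns the pixel color c."""
        alpha = 1.0 - torch.exp(-sigma * delta)            # 1 - exp(-sigma_j * delta_j)
        trans = torch.cumprod(1.0 - alpha + 1e-10, dim=0)  # running product of (1 - alpha)
        tau = torch.cat([torch.ones(1), trans[:-1]])       # transmittance before sample j
        weights = tau * alpha
        return (weights[:, None] * rgb).sum(dim=0)

The same weights can be reused to accumulate any other per-sample quantity along the ray.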
After the colors of all pixels of the rendered image are obtained, the rendered image can be generated. A loss function L is then calculated between the rendered image and the real RGB values of the training view, for example the per-pixel squared error

L = || Î - I ||²

where Î is the rendered image and I is the corresponding real training image. The input information of the point cloud points and the scene reconstruction model are corrected according to L.
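A hedged sketch of one training step; the squared-error loss form and the use of a standard optimizer are illustrative assumptions:

    import torch

    def training_step(pred_pixels, gt_pixels, optimizer):
        """pred_pixels, gt_pixels: (B, 3) rendered vs. real RGB values for a batch of rays.
        Back-propagates the loss into the decoders and the per-point input features."""
        loss = ((pred_pixels - gt_pixels) ** 2).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()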
In practical application, the acquisition of the training image needs to meet certain requirements, and in one example, the training image under the training visual angle parameters meets the following conditions: the training image comprises a room surface; the overlap between training images at any two adjacent training perspective parameters is not less than thirty percent of the overall image size.
Specifically, the room surfaces are each captured in at least one training image, and the training images under two consecutively shot, adjacent training view angle parameters overlap by no less than 30% of the whole image size, so that the acquired training images can achieve a good training effect.
For example, for a room of about 10 square meters, 35-40 RGB images are captured using a monocular camera, satisfying that every room surface is captured by at least one RGB image and that the overlapping portion of the images at two consecutive camera view angles is not less than 30% of the image size.
The determining, based on the semantic prediction technique, a point cloud point in the scene point cloud located in a predetermined range around the plurality of training sampling points as the training point set may include:
based on a semantic prediction technology, semantic information of each pixel point in the training image under the training view angle parameters and semantic information of each point cloud point in the scene point cloud are obtained; the semantic information of the pixel points comprises object categories, and the semantic information of the point cloud points comprises the object categories and corresponding probability values;
for each training sampling point, acquiring all point cloud points of the scene point cloud within a preset range around the training sampling point; calculating the selection probability of each of these point cloud points, and selecting a preset number of point cloud points as the training point set in order of selection probability from high to low; wherein calculating the selection probability comprises: if the object class of the point cloud point is the same as the object class of the pixel point corresponding to the training sampling point, the selection probability of the point cloud point is one; if the object class of the point cloud point is different from the object class of the pixel point corresponding to the training sampling point, the selection probability of the point cloud point is one minus the probability value corresponding to the object class of the pixel point corresponding to that point cloud point.
Specifically, the training image and the scene point cloud are input into a semantic prediction network to obtain semantic information of each pixel point in the training image, wherein the semantic information comprises predicted object types of the pixel points, and semantic information of each point cloud point in the scene point cloud comprises predicted object types of the point cloud points and corresponding probability values.
And for each training sampling point, acquiring all the point cloud points in a preset range around the training sampling point, calculating the selection probability of each point cloud point, and selecting a preset number of point cloud points according to the sequence of the selection probability from high to low as a training point set.
For a point cloud point, the selection probability p_chosen is calculated as follows:

p_chosen = 1, if s_point = s_ray
p_chosen = 1 - p_ray, otherwise

where s_point is the predicted object class of the point cloud point output by the semantic prediction network; the predicted object class s_ray of the ray on which the current training sampling point lies is taken to be the predicted object class of the pixel corresponding to the position point that the ray passes through, and likewise the probability value p_ray corresponding to the ray's predicted object class is taken to be the probability value corresponding to that pixel's predicted object class.
For example, fig. 4 is an example diagram of a semantic prediction scene according to the first embodiment of the present application. As shown in fig. 4, a training image and the scene point cloud are input into the semantic prediction module to obtain the semantic information of each pixel of the training image and the semantic information of each point cloud point of the scene point cloud; in the figure, different predicted object classes of the pixels and point cloud points are represented by different gray levels, and the probability values corresponding to the predicted object classes of the point cloud points are not shown; the figure is only an illustration.
As an example, all the point cloud points in the sphere with the radius r around each training sampling point are obtained, the selection probability of the point cloud points is calculated, and 8 point cloud points with the highest selection probability are selected as a training point set.
Because the probability that the predicted object class of a point cloud point matches the predicted object class of the ray is positively correlated with its selection probability, using the point cloud points with the highest selection probabilities as the training point set increases the likelihood that the training point cloud points share the object class of the corresponding training sampling point, so that the color and light field intensity of the training sampling point obtained from the parameters of these training point cloud points are more accurate.
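A minimal sketch of the semantics-guided selection rule described above; the variable names and the reading of the selection-probability formula are assumptions of this illustration:

    import numpy as np

    def select_training_points(cand_idx, point_class, pixel_class, pixel_class_prob, k=8):
        """cand_idx: array of indices of point cloud points inside the sphere around a
        training sampling point. point_class[i]: predicted class of point i.
        pixel_class / pixel_class_prob: predicted class of the ray's pixel and the
        probability value associated with that class."""
        p_chosen = np.where(point_class[cand_idx] == pixel_class,
                            1.0, 1.0 - pixel_class_prob)
        order = np.argsort(-p_chosen)         # highest selection probability first
        return cand_idx[order[:k]]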
In one example, the neural features include a first neural feature and a second neural feature, and the obtaining current input information of the training point set may include:
inputting all the training images and the scene point cloud into a semantic prediction network to obtain first neural characteristics of each point cloud point in the scene point cloud;
and acquiring the training image corresponding to the training view angle origin with the nearest point cloud point, and inputting the training image into a convolutional neural network to obtain a second neural characteristic of the point cloud point.
Fig. 5 is a schematic flow chart of acquiring the neural features of point cloud points according to the first embodiment of the present application. As shown in fig. 5, the neural features of a point cloud point may include a first neural feature and a second neural feature. Acquiring the first neural feature of the point cloud point comprises: inputting all training images and the scene point cloud into a semantic prediction network to obtain the first neural feature of each point cloud point of the scene point cloud. Acquiring the second neural feature of the point cloud point comprises: acquiring the training image corresponding to the training view angle origin closest to the point cloud point, and inputting that training image into a convolutional neural network to obtain the second neural feature of the point cloud point. The semantic prediction network here is the same semantic prediction network used above for semantic prediction; in practical application, inputting the training images and the scene point cloud into the semantic prediction network yields the first neural features, predicted object classes, and corresponding probability values of all point cloud points of the scene point cloud output by the network.
For example, assuming the training images have a resolution of 640×480, the training images and the scene point cloud are input into a semantic prediction network to obtain the first neural feature s of each point cloud point of the scene point cloud, where the semantic prediction network is a pre-trained BP-Net. The training image corresponding to the training view angle origin closest to the point cloud point is input into a convolutional neural network to obtain a 32-dimensional second neural feature n. The convolutional neural network may be a pre-trained image feature extraction network consisting of six convolutional modules with channel numbers 64-128-128-256-128-32, where each convolutional module consists of a convolution layer, linear rectification (ReLU), and an instance normalization unit.
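A hedged PyTorch sketch of such a feature extraction network with the channel numbers given above; the kernel size, stride, and the way the 32-dimensional per-point feature is read out of the feature map (e.g. by projecting the point into the image) are assumptions:

    import torch.nn as nn

    def conv_block(c_in, c_out):
        """Convolution layer + linear rectification + instance normalization unit."""
        return nn.Sequential(nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),
                             nn.ReLU(inplace=True),
                             nn.InstanceNorm2d(c_out))

    channels = [3, 64, 128, 128, 256, 128, 32]   # six convolutional modules
    feature_cnn = nn.Sequential(*[conv_block(c_in, c_out)
                                  for c_in, c_out in zip(channels[:-1], channels[1:])])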
By convolving the training image corresponding to the training view angle origin closest to the point cloud point, and by using the semantic information of the point cloud point obtained by semantic prediction, structural and semantic information beyond the spatial position of the point cloud point is encoded into its neural features, so that the input information fed into the scene reconstruction model is richer and the color and light field intensity obtained for the point cloud point are more accurate.
Based on the corrected input information and the scene reconstruction model, the reconstruction of the indoor new view angle scene can be performed. In one example, the method may further comprise:
acquiring target view angle parameters of a scene to be reconstructed; determining virtual pixel points according to the target visual angle parameters;
selecting a plurality of position points as target sampling points on rays which start from the target view angle origin and pass through the position points corresponding to each virtual pixel point in the scene point cloud; and determining point cloud points in the scene point cloud positioned in a preset range near the plurality of target sampling points as a target point set, and acquiring the input information of the target point set;
inputting the input information of the target point set into the scene reconstruction model to obtain the color and the light field intensity of the target point set output by the scene reconstruction model; obtaining the color and the light field intensity of the target sampling point through interpolation according to the color and the light field intensity of the target point set; and rendering according to the color and the light field intensity of the target sampling point to obtain a rendered image under the target visual angle parameter.
Specifically, after determining a target view angle of a scene to be reconstructed, acquiring a target view angle parameter. And determining corresponding position points of the target view point origin and the virtual pixel points in a coordinate system where the scene point clouds are located according to the target view angle parameters, starting from the target view point origin, taking rays through the scene point clouds and through the position points corresponding to each virtual pixel point, selecting target sampling points in a preset area of the rays, and selecting point cloud points in the scene point clouds in a preset range around each target sampling point to obtain a target point set.
Inputting the trained input information of the target point set into a trained scene reconstruction model to obtain the color and light field intensity of the target point set; interpolating the color and the light field intensity of the target point set to obtain the color and the light field intensity of the corresponding target sampling point; and rendering according to the color and the light field intensity of the target sampling point to obtain a rendered image of the new view angle.
And rendering the new view angle image of the indoor scene based on the trained input information and the trained scene reconstruction model, and generating a rendering image with a good effect, thereby improving the quality of the rendering image of the new view angle of the indoor scene generated under the condition of less training images.
For example, assuming that a new view angle image is desired to be generated currently, a training image of an indoor scene under training view angle parameters is obtained, and input information of trained point cloud points and a trained scene reconstruction model are obtained according to the training image.
After the target view angle parameters of the new view are determined, the spatial positions of the target view angle origin and of the position points corresponding to the virtual pixels in the coordinate system of the scene point cloud are determined according to the target view angle parameters; rays are cast from the target view angle origin through the position point of each virtual pixel, 128 target sampling points are uniformly selected within a preset region of each ray, and point cloud points of the scene point cloud are selected within a sphere of radius r around each target sampling point to obtain the target point set.
Inputting the trained input information of the target point set into a trained scene reconstruction model to obtain the color and light field intensity of the target point set; performing inverse distance interpolation on the colors and the light field intensities of the target point sets to obtain the colors and the light field intensities of the corresponding target sampling points; and performing volume rendering integration according to the color of the target sampling point and the light field intensity to obtain a rendering image of the new view angle.
Furthermore, this scheme can provide an efficient indoor scene acquisition, reconstruction, and roaming system: a user inputs sparsely acquired RGB images of an indoor scene, and after training, real-time roaming and high-quality novel view rendering of the scene can be realized. The system can render new views, including translating the viewpoint, rotating the viewing direction, and enlarging or reducing the visible range of the view.
The system framework is implemented in Python, and the image corresponding to the current view angle is visualized through software equipped with an interactive control panel. The neural network models are implemented in PyTorch and trained on a single NVIDIA Quadro P6000 GPU with 24 GB of memory. The code is migrated to the GPU and accelerated through PyCUDA, so that real-time rendering can be realized.
Fig. 6 is an example diagram of a scene rendered in real time with dynamic resolution according to the first embodiment of the present application. As shown in fig. 6, the system enables a user to browse the scene in real time: when the user turns on the real-time rendering button, the system dynamically adjusts the resolution of the rendered image according to the rendering time, for example selecting a lower resolution when the rendering time exceeds a preset value; the position point of each pixel of the rendered image is then determined according to the resolution of the rendered image and the target view angle parameters, and the rendered image is generated, ensuring that the user can browse the indoor scene in real time.
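A minimal sketch of such a dynamic-resolution heuristic; the resolution ladder and time budget are assumptions of this illustration:

    def pick_resolution(last_render_time_s, current, budget_s=0.05,
                        ladder=((640, 480), (480, 360), (320, 240))):
        """Step down to a lower resolution when the previous frame exceeded the
        time budget, otherwise try to step back up."""
        i = ladder.index(current)
        if last_render_time_s > budget_s and i + 1 < len(ladder):
            return ladder[i + 1]
        if last_render_time_s < 0.5 * budget_s and i > 0:
            return ladder[i - 1]
        return current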
The method for selecting the target point set may be various, and in one example, the determining the target point set included in the predetermined range around the plurality of target sampling points may include:
and acquiring all point cloud points within a preset range around the plurality of target sampling points, and selecting a preset number of point cloud points closest to each sampling point as the target point set.
Specifically, for each target sampling point, all point cloud points in a preset range around the target sampling point are acquired, and a preset number of point cloud points are selected as a target point set according to the sequence from near to far.
For example, all point cloud points in a sphere with a radius r around the target sampling point are acquired, and 8 point cloud points closest to the target sampling point are selected as the target point set.
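A minimal sketch of this nearest-neighbour selection at rendering time, again assuming a KD-tree over the scene point cloud:

    from scipy.spatial import cKDTree

    def target_point_set(sample_xyz, cloud_xyz, radius=0.1, k=8):
        """Return up to k point cloud points that lie inside the radius-r sphere
        around the target sampling point, ordered from nearest to farthest."""
        tree = cKDTree(cloud_xyz)
        dists, idx = tree.query(sample_xyz, k=k, distance_upper_bound=radius)
        return idx[dists < float("inf")]  # drop padded entries beyond the radius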
In the indoor scene reconstruction method provided by the application, images are acquired indoors to obtain training images; based on a structure-from-motion technique, the training view angle parameters corresponding to each training image are acquired from the training images and a scene point cloud is established, the scene point cloud being a point cloud representing indoor scene structure information; a plurality of position points are selected as training sampling points on rays that start from the training view angle origin and pass, within the scene point cloud, through the position point corresponding to each pixel of the training image under the training view angle parameters; the point cloud points of the scene point cloud located within a predetermined range around the training sampling points are determined, based on a semantic prediction technique, as a training point set, and the current input information of the training point set is acquired, the input information including spatial position, neural features, and ray direction; the current input information of the training point set is input into the current scene reconstruction model to obtain the colors and light field intensities of the training point set output by the scene reconstruction model; the colors and light field intensities of the training sampling points are obtained by interpolation from the colors and light field intensities of the training point set; a rendered image under the training view angle parameters is obtained by rendering according to the colors and light field intensities of the training sampling points; and the current input information and the current scene reconstruction model are trained and corrected according to the loss between the rendered image under the training view angle parameters and the training image under the training view angle parameters, until the trained input information and the trained scene reconstruction model are obtained. In this scheme, the training point set is selected around the training sampling points based on a semantic prediction technique, which increases the likelihood that the object classes of the selected point cloud points are the same as the object classes corresponding to the training sampling points; the colors and light field intensities of the training sampling points interpolated from the colors and light field intensities of these point cloud points are therefore more accurate, which improves the quality of the generated rendered image, and the input information of the point cloud points and the scene reconstruction model can be better trained and corrected from the training images and the corresponding rendered images, which reduces the required number of training images. The scheme thus solves the problem of poor quality of the rendered indoor scene images generated when the number of training images is small.
Example two
Fig. 7 is a schematic structural diagram of an indoor scene reconstruction device according to an embodiment of the present application. As shown in fig. 7, the indoor scene reconstruction device 70 provided in this embodiment may include:
the acquisition module 71 is used for acquiring images indoors to obtain training images; based on a motion restoration structure technology, acquiring training visual angle parameters corresponding to each training image according to the training images and establishing scene point clouds; the scene point cloud is a point cloud representing indoor scene structure information;
the sampling module 72 is configured to select, as training sampling points, a plurality of location points on a ray that starts from the origin of the training view angle and passes through the location point corresponding to each pixel point in the training image under the training view angle parameter in the scene point cloud; determining point cloud points in the scene point cloud positioned in a preset range near the training sampling points based on a semantic prediction technology as a training point set, and acquiring current input information of the training point set; wherein the input information includes spatial location, neural characteristics, and ray direction;
the rendering module 73 is configured to input current input information of the training point set to a current scene reconstruction model, so as to obtain a color and a light field intensity of the training point set output by the scene reconstruction model; obtaining the color and the light field intensity of the training sampling point through interpolation according to the color and the light field intensity of the training point set; according to the color and the light field intensity of the training sampling points, rendering images under the training visual angle parameters are obtained through rendering;
And the correction module 74 is configured to perform training correction on the current input information and the current scene reconstruction model according to the loss between the rendered image under the training view angle parameter and the training image under the training view angle parameter until the trained input information and the trained scene reconstruction model are obtained.
In practical application, the indoor scene reconstruction device may be implemented by a computer program, for example, application software or the like; alternatively, it may be implemented by a medium storing the relevant computer program, for example, a USB flash drive, a cloud disk, or the like; still alternatively, it may be implemented by a physical device, e.g., a chip, a server, etc., integrated with or installed with the relevant computer program.
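As an illustration of the software form, the four modules could be organised roughly as follows; this is only a structural sketch, and all class and method names are assumptions rather than interfaces defined by the application.

```python
# Illustrative skeleton only: one way to organise the acquisition, sampling,
# rendering and correction modules of the device as a single class.
class IndoorSceneReconstructor:
    def __init__(self, scene_model, semantic_net, feature_cnn):
        self.scene_model = scene_model      # current scene reconstruction model
        self.semantic_net = semantic_net    # semantic prediction network
        self.feature_cnn = feature_cnn      # CNN for the second neural features

    def acquire(self, images):
        """Acquisition module: run SfM to get view angle parameters and a scene point cloud."""
        raise NotImplementedError

    def sample(self, image, view_params, point_cloud):
        """Sampling module: cast rays, pick training sampling points and their training point sets."""
        raise NotImplementedError

    def render(self, point_sets, ray_dirs):
        """Rendering module: predict colour / light field intensity and composite a rendered image."""
        raise NotImplementedError

    def correct(self, rendered, target):
        """Correction module: compute the loss and update the input information and model."""
        raise NotImplementedError
```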
Specifically, shooting is carried out in a room by using a camera to obtain RGB training images; based on a motion restoration structure (Structure from Motion, SfM) technology such as COLMAP, training view angle parameters corresponding to each training image are obtained according to the training images, and a 3D scene point cloud representing the indoor scene structure is established.
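For example, when COLMAP is used as the SfM implementation, the reconstruction can be driven from Python through COLMAP's command-line tools roughly as below; the paths are placeholders and the flags should be checked against the installed COLMAP version.

```python
import os
import subprocess

def run_sfm(image_dir: str, workspace: str) -> None:
    """Recover camera parameters and a sparse scene point cloud with the COLMAP CLI."""
    db = os.path.join(workspace, "database.db")
    sparse = os.path.join(workspace, "sparse")
    os.makedirs(sparse, exist_ok=True)
    # Detect keypoints and descriptors in every training image.
    subprocess.run(["colmap", "feature_extractor",
                    "--database_path", db, "--image_path", image_dir], check=True)
    # Match features between all image pairs.
    subprocess.run(["colmap", "exhaustive_matcher", "--database_path", db], check=True)
    # Incremental SfM: camera poses + sparse 3D points.
    subprocess.run(["colmap", "mapper", "--database_path", db,
                    "--image_path", image_dir, "--output_path", sparse], check=True)
```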
In practical application, the view angle parameter may be a camera parameter, where the camera parameter includes camera intrinsic parameters and camera extrinsic parameters; the intrinsic parameters comprise an intrinsic matrix, and the extrinsic parameters represent the pose of the camera and comprise a rotation matrix and a translation vector. Given a determined view angle parameter, the spatial position, in the world coordinate system, corresponding to each pixel point of the image captured by the camera under that view angle parameter can be uniquely determined.
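A minimal sketch of such a view angle parameter, assuming the common world-to-camera convention x_cam = R · x_world + t, is given below; the field and method names are illustrative only.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class ViewParams:
    K: np.ndarray   # 3x3 intrinsic matrix
    R: np.ndarray   # 3x3 rotation matrix (extrinsic)
    t: np.ndarray   # translation vector (extrinsic)

    def camera_center(self) -> np.ndarray:
        """World-space optical centre (view angle origin), assuming x_cam = R @ x_world + t."""
        return -self.R.T @ self.t

    def pixel_to_world(self, u: float, v: float, depth: float = 1.0) -> np.ndarray:
        """World-space position point corresponding to pixel (u, v) at the given depth."""
        x_cam = depth * (np.linalg.inv(self.K) @ np.array([u, v, 1.0]))
        return self.R.T @ (x_cam - self.t)
```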
A ray is emitted from the training view angle origin towards the position point corresponding to each pixel point in the training image under the training view angle parameters, and a plurality of training sampling points are selected on each ray; the training view angle origin, the position points corresponding to the pixel points and the scene point cloud are all located in the same coordinate system, and each ray passes through the scene point cloud. In practical applications, the training view angle origin may be the optical center of the camera, and its spatial position may be determined according to the training view angle parameters. Based on a semantic prediction technology, a plurality of training point cloud points in a preset range near each training sampling point are selected as a training point set, and the current input information of the training point set is acquired; wherein the input information includes spatial position, neural features, and ray direction.
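Building on the ViewParams sketch above, ray construction and the selection of training sampling points along each ray might look as follows; the uniform depth sampling between assumed near and far bounds is one possible choice, not a requirement of the application.

```python
import numpy as np

def ray_through_pixel(view, u: float, v: float):
    """Ray from the training view angle origin through the position point of pixel (u, v)."""
    origin = view.camera_center()
    direction = view.pixel_to_world(u, v) - origin
    return origin, direction / np.linalg.norm(direction)

def sample_along_ray(origin, direction, near=0.1, far=8.0, n_samples=64):
    """Evenly spaced position points (training sampling points) along the ray."""
    depths = np.linspace(near, far, n_samples)
    return origin[None, :] + depths[:, None] * direction[None, :]
```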
Inputting the current input information of the training point set into a current scene reconstruction model to obtain the color and the light field intensity of the training point set output by the scene reconstruction model; in practical application, the scene reconstruction model can comprise a light field intensity decoder and a color decoder, and specific parameters can be selected according to the actual production requirement; obtaining the color and the light field intensity of the training sampling points through interpolation according to the color and the light field intensity of the training point set; and according to the color of the training sampling point and the light field intensity, the color of each pixel point in the rendered image under the training visual angle parameter is obtained through rendering, and the rendered image is generated.
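The interpolation and rendering steps can be illustrated with a common choice of inverse-distance weighting and volume-rendering-style compositing; these particular formulas are assumptions for the sketch, since the application does not fix them.

```python
import numpy as np

def interpolate(point_colors, point_sigmas, point_positions, sample_position, eps=1e-8):
    """Inverse-distance-weighted colour and light field intensity for one sampling point."""
    d = np.linalg.norm(point_positions - sample_position, axis=1)
    w = 1.0 / (d + eps)
    w = w / w.sum()
    return (w[:, None] * point_colors).sum(axis=0), (w * point_sigmas).sum()

def composite_ray(colors, sigmas, deltas):
    """Composite per-sample colours and intensities along one ray into a pixel colour."""
    alphas = 1.0 - np.exp(-sigmas * deltas)                          # opacity of each segment
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alphas[:-1]]))   # accumulated transmittance
    weights = alphas * trans
    return (weights[:, None] * colors).sum(axis=0)
```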
And correcting the input information of the point cloud point and the scene reconstruction model by calculating loss according to the training image and the rendering image under the training visual angle parameters until the trained input information of the point cloud point and the scene reconstruction model are obtained.
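A minimal sketch of one correction step, assuming the scene reconstruction model and the per-point input information are PyTorch tensors with gradients enabled, is shown below; the plain MSE photometric loss and the optimiser are illustrative choices.

```python
import torch

def training_step(rendered: torch.Tensor, target: torch.Tensor,
                  optimizer: torch.optim.Optimizer) -> float:
    """One correction step: photometric loss between the rendered and the training image."""
    loss = torch.nn.functional.mse_loss(rendered, target)
    optimizer.zero_grad()
    loss.backward()       # gradients flow to the model weights and the point input features
    optimizer.step()
    return loss.item()
```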
In practical application, the acquisition of the training image needs to meet certain requirements, and in one example, the training image under the training visual angle parameters meets the following conditions: the training image comprises a room surface; the overlap between training images at any two adjacent training perspective parameters is not less than thirty percent of the overall image size.
Specifically, the room surface should be captured in at least one training image, and the training images under any two adjacent, consecutively shot training view angle parameters should overlap, with the overlapping area being no less than 30% of the whole image size, so that the acquired training images can achieve a good training effect.
There are a variety of ways to determine the training point set, and in one example, the sampling module 72 may be configured to:
based on a semantic prediction technology, semantic information of each pixel point in the training image under the training view angle parameters and semantic information of each point cloud point in the scene point cloud are obtained; the semantic information of the pixel points comprises object categories, and the semantic information of the point cloud points comprises the object categories and corresponding probability values;
For each training sampling point, acquiring all point cloud points in a preset range around the training sampling point in the scene point cloud; calculating the selection probability of each point cloud point in the preset range around the training sampling point, and selecting a preset number of point cloud points as the training point set in descending order of the selection probability; wherein calculating the selection probability comprises: if the object class of the point cloud point is the same as the object class of the pixel point corresponding to the training sampling point, the selection probability of the point cloud point is one; if the object class of the point cloud point is different from the object class of the pixel point corresponding to the training sampling point, the selection probability of the point cloud point is one minus the probability value that the point cloud point has for the object class of the pixel point corresponding to the training sampling point.
Specifically, the training image and the scene point cloud are input into a semantic prediction network to obtain semantic information of each pixel point in the training image, wherein the semantic information comprises predicted object types of the pixel points, and semantic information of each point cloud point in the scene point cloud comprises predicted object types of the point cloud points and corresponding probability values.
And for each training sampling point, acquiring all the point cloud points in a preset range around the training sampling point, calculating the selection probability of each point cloud point, and selecting a preset number of point cloud points according to the sequence of the selection probability from high to low as a training point set.
Wherein, for a point cloud point, the selection probability $p_{chosen}$ is calculated as follows:

$$p_{chosen}=\begin{cases}1, & \hat{c}_{point}=\hat{c}_{ray}\\ 1-P_{point}\left(\hat{c}_{ray}\right), & \hat{c}_{point}\neq\hat{c}_{ray}\end{cases}$$

where $\hat{c}_{point}$ is the predicted object class of the point cloud point output by the semantic prediction network, $\hat{c}_{ray}$ is the predicted object class of the current training sampling point on the ray, taken to be the predicted object class of the pixel point corresponding to the position point traversed by the ray, and $P_{point}(\hat{c}_{ray})$ is the probability value output by the semantic prediction network for the point cloud point belonging to class $\hat{c}_{ray}$.
The probability that the predicted object category of the point cloud point is the same as the predicted object category of the ray is positively correlated with the selection probability of the point cloud point, and the point cloud point with high selection probability is used as a training point set, so that the probability that the training point cloud point is the same as the object category corresponding to the training sampling point is improved, and the color and the light field intensity of the training sampling point obtained according to the training point cloud point parameters are more accurate.
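The selection rule above can be sketched as follows; the variable names and the array layout of the per-point class probabilities are assumptions for this illustration.

```python
import numpy as np

def selection_probability(point_class: int, point_probs: np.ndarray, ray_class: int) -> float:
    """Selection probability of one point cloud point for a ray whose predicted class is ray_class."""
    if point_class == ray_class:
        return 1.0
    # Otherwise: one minus the probability the point assigns to the ray's class.
    return 1.0 - float(point_probs[ray_class])

def pick_training_set(point_classes, point_probs, ray_class, k: int = 8):
    """Indices of the k point cloud points with the highest selection probability."""
    probs = np.array([selection_probability(c, p, ray_class)
                      for c, p in zip(point_classes, point_probs)])
    return np.argsort(-probs)[:k]
```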
In one example, the neural features include a first neural feature and a second neural feature, and the sampling module 72 may be further configured to:
inputting all the training images and the scene point cloud into a semantic prediction network to obtain first neural characteristics of each point cloud point in the scene point cloud;
And acquiring the training image corresponding to the training view angle origin with the nearest point cloud point, and inputting the training image into a convolutional neural network to obtain a second neural characteristic of the point cloud point.
In particular, the neural features of the point cloud point may include a first neural feature and a second neural feature. Acquiring the first neural feature of the point cloud point comprises: inputting all training images and the scene point cloud into a semantic prediction network to obtain the first neural features of all point cloud points in the scene point cloud. Acquiring the second neural feature of the point cloud point comprises: acquiring the training image corresponding to the training view angle origin closest to the point cloud point, and inputting this training image into a convolutional neural network to obtain the second neural feature of the point cloud point. The semantic prediction network here is the same semantic prediction network used above for obtaining the semantic information; in practical application, the training images and the scene point cloud are input into the semantic prediction network once, so that the first neural features, the predicted object classes and the corresponding probability values of all point cloud points in the scene point cloud output by the network can be obtained together.
By convolving the training image corresponding to the training view angle origin closest to the point cloud point, and by using the semantic information of the point cloud point obtained through semantic prediction as its neural features, structural information and semantic information beyond the spatial position of the point cloud point are obtained, so that the input information of the point cloud point fed into the scene reconstruction model is richer and a more accurate color and light field intensity of the point cloud point are obtained.
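A hedged sketch of how the two neural features of one point cloud point could be assembled is given below; semantic_net and feature_cnn stand in for the semantic prediction network and the convolutional neural network, and their call signatures are assumptions made only for this illustration.

```python
import numpy as np

def point_neural_features(point_idx, point_cloud, images, view_origins,
                          semantic_net, feature_cnn):
    # First neural feature: the semantic prediction network is run once on all
    # training images together with the scene point cloud and returns one feature
    # vector per point cloud point.
    first_features = semantic_net(images, point_cloud)        # shape: (num_points, d1)
    first_feat = first_features[point_idx]

    # Second neural feature: a CNN applied to the training image whose training
    # view angle origin is closest to this point cloud point.
    dists = np.linalg.norm(view_origins - point_cloud[point_idx], axis=1)
    nearest_view = int(np.argmin(dists))
    second_feat = feature_cnn(images[nearest_view])            # shape: (d2,)

    return first_feat, second_feat
```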
Based on the corrected input information and the scene reconstruction model, the reconstruction of the indoor new view angle scene can be performed. In one example, the apparatus further comprises a reconstruction module operable to:
acquiring target view angle parameters of a scene to be reconstructed; determining virtual pixel points according to the target visual angle parameters;
selecting a plurality of position points as target sampling points on rays which start from the target view angle origin and pass through the position points corresponding to each virtual pixel point in the scene point cloud; and determining point cloud points in the scene point cloud positioned in a preset range near the plurality of target sampling points as a target point set, and acquiring the input information of the target point set;
inputting the input information of the target point set into the scene reconstruction model to obtain the color and the light field intensity of the target point set output by the scene reconstruction model; obtaining the color and the light field intensity of the target sampling point through interpolation according to the color and the light field intensity of the target point set; and rendering according to the color and the light field intensity of the target sampling point to obtain a rendered image under the target visual angle parameter.
Specifically, after determining a target view angle of a scene to be reconstructed, acquiring a target view angle parameter. And determining corresponding position points of the target view point origin and the virtual pixel points in a coordinate system where the scene point clouds are located according to the target view angle parameters, starting from the target view point origin, taking rays through the scene point clouds and through the position points corresponding to each virtual pixel point, selecting target sampling points in a preset area of the rays, and selecting point cloud points in the scene point clouds in a preset range around each target sampling point to obtain a target point set.
Inputting the trained input information of the target point set into a trained scene reconstruction model to obtain the color and light field intensity of the target point set; interpolating the color and the light field intensity of the target point set to obtain the color and the light field intensity of the corresponding target sampling point; and rendering according to the color and the light field intensity of the target sampling point to obtain a rendered image of the new view angle.
And rendering a new view angle image of the indoor scene based on the trained input information and the trained scene reconstruction model, so that a rendering image with a good effect can be generated.
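Chaining the helpers sketched in the earlier examples, new-view rendering could look roughly like this; the call signature of the trained model is an assumption, and the per-point neural features are abbreviated for brevity.

```python
import numpy as np

def render_novel_view(view, H, W, scene_points, model):
    """Render an H x W image at the target view; `view` follows the ViewParams sketch."""
    image = np.zeros((H, W, 3))
    for v in range(H):
        for u in range(W):
            origin, direction = ray_through_pixel(view, u + 0.5, v + 0.5)
            samples = sample_along_ray(origin, direction)
            colors, sigmas = [], []
            for s in samples:
                pts = select_target_point_set(scene_points, s)   # target point set
                pc, ps = model(pts, direction)                   # per-point colour / intensity
                c, sg = interpolate(pc, ps, pts, s)              # onto the sampling point
                colors.append(c)
                sigmas.append(sg)
            # Segment lengths matching the default near/far of sample_along_ray.
            deltas = np.full(len(samples), (8.0 - 0.1) / len(samples))
            image[v, u] = composite_ray(np.array(colors), np.array(sigmas), deltas)
    return image
```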
The manner of selecting the target point set may be various, and in one example, the reconstruction module is specifically configured to:
And acquiring all point cloud points in a preset range near the target sampling points, and selecting a preset number of point cloud points closest to each sampling point as the target point set.
Specifically, for each target sampling point, all point cloud points in a preset range around the target sampling point are acquired, and a preset number of point cloud points are selected as the target point set in order from nearest to farthest.
In the indoor scene reconstruction device provided by the application, image acquisition is carried out indoors to obtain training images; based on a motion restoration structure technology, training view angle parameters corresponding to each training image are acquired according to the training images and a scene point cloud is established, the scene point cloud being a point cloud representing indoor scene structure information. A plurality of position points are selected as training sampling points on rays which start from the training view angle origin and pass through the position points, in the scene point cloud, corresponding to each pixel point in the training image under the training view angle parameters. Point cloud points in the scene point cloud located within a preset range near the training sampling points are determined as a training point set based on a semantic prediction technology, and the current input information of the training point set is acquired, the input information including spatial position, neural features, and ray direction. The current input information of the training point set is input to the current scene reconstruction model to obtain the color and the light field intensity of the training point set output by the scene reconstruction model; the color and the light field intensity of the training sampling points are obtained through interpolation according to the color and the light field intensity of the training point set; and the rendered image under the training view angle parameters is obtained through rendering according to the color and the light field intensity of the training sampling points. The current input information and the current scene reconstruction model are then trained and corrected according to the loss between the rendered image and the training image under the same training view angle parameters until the trained input information and the trained scene reconstruction model are obtained.

In this scheme, the training point set is selected around the training sampling points based on the semantic prediction technology, which increases the likelihood that the object classes of the selected point cloud points match the object classes of the corresponding training sampling points. The color and the light field intensity of the training sampling points obtained by interpolation from the color and the light field intensity of these point cloud points are therefore more accurate, which improves the quality of the generated rendered image; the input information of the point cloud points and the scene reconstruction model can thus be better trained and corrected from the training images and the corresponding rendered images, reducing the number of training images required and thereby solving the problem of poor rendered-image quality for indoor scenes when training images are scarce.
Example III
Fig. 8 is a schematic structural diagram of an electronic device provided in an embodiment of the disclosure, as shown in fig. 8, where the electronic device includes:
a processor 291 and a memory 292; a communication interface (Communication Interface) 293 and a bus 294 may further be included. The processor 291, the memory 292, and the communication interface 293 may communicate with each other via the bus 294. The communication interface 293 may be used for information transfer. The processor 291 may call logic instructions in the memory 292 to perform the methods of the above-described embodiments.
Further, the logic instructions in memory 292 described above may be implemented in the form of software functional units and stored in a computer-readable storage medium when sold or used as a stand-alone product.
The memory 292 is a computer-readable storage medium that may be used to store a software program, a computer-executable program, and program instructions/modules corresponding to the methods in the embodiments of the present disclosure. The processor 291 executes functional applications and data processing by running software programs, instructions and modules stored in the memory 292, i.e., implements the methods of the method embodiments described above.
Memory 292 may include a storage program area that may store an operating system, at least one application program required for functionality, and a storage data area; the storage data area may store data created according to the use of the terminal device, etc. Further, memory 292 may include high-speed random access memory, and may also include non-volatile memory.
The disclosed embodiments provide a non-transitory computer readable storage medium having stored therein computer-executable instructions that, when executed by a processor, are configured to implement the method of the previous embodiments.
Example IV
The disclosed embodiments provide a computer program product comprising a computer program which, when executed by a processor, implements the method provided by any of the embodiments of the disclosure described above.
Other embodiments of the application will be apparent to those skilled in the art from consideration of the specification and practice of the application disclosed herein. This application is intended to cover any variations, uses, or adaptations of the application following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the application pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the application being indicated by the following claims.
It is to be understood that the application is not limited to the precise arrangements and instrumentalities shown in the drawings, which have been described above, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the application is limited only by the appended claims.

Claims (10)

1. An indoor scene reconstruction method, comprising:
image acquisition is carried out indoors, and a training image is obtained; based on a motion restoration structure technology, acquiring training visual angle parameters corresponding to each training image according to the training images and establishing scene point clouds; the scene point cloud is a point cloud representing indoor scene structure information;
selecting a plurality of position points as training sampling points on rays which start from the training view angle origin and pass through the position points corresponding to each pixel point in the training image under the training view angle parameters in the scene point cloud; determining point cloud points in the scene point cloud positioned in a preset range near the training sampling points based on a semantic prediction technology as a training point set, and acquiring current input information of the training point set; wherein the input information includes spatial location, neural characteristics, and ray direction;
Inputting the current input information of the training point set to a current scene reconstruction model to obtain the color and the light field intensity of the training point set output by the scene reconstruction model; obtaining the color and the light field intensity of the training sampling point through interpolation according to the color and the light field intensity of the training point set; according to the color and the light field intensity of the training sampling points, rendering images under the training visual angle parameters are obtained through rendering;
and training and correcting the current input information and the current scene reconstruction model according to the loss between the rendered image under the training view angle parameter and the training image under the training view angle parameter until the trained input information and the trained scene reconstruction model are obtained.
2. The method of claim 1, wherein the determining, based on the semantic prediction technique, a point cloud point in the scene point cloud that is within a predetermined range around the plurality of training sample points as a training point set comprises:
based on a semantic prediction technology, semantic information of each pixel point in the training image under the training view angle parameters and semantic information of each point cloud point in the scene point cloud are obtained; the semantic information of the pixel points comprises object categories, and the semantic information of the point cloud points comprises the object categories and corresponding probability values;
For each training sampling point, acquiring all point cloud points in a preset range around the training sampling point in the scene point cloud; calculating the selection probability of each point cloud point in the preset range around the training sampling point, and selecting a preset number of point cloud points as the training point set in descending order of the selection probability; wherein calculating the selection probability comprises: if the object class of the point cloud point is the same as the object class of the pixel point corresponding to the training sampling point, the selection probability of the point cloud point is one; if the object class of the point cloud point is different from the object class of the pixel point corresponding to the training sampling point, the selection probability of the point cloud point is one minus the probability value that the point cloud point has for the object class of the pixel point corresponding to the training sampling point.
3. The method of claim 1, wherein the neural features include a first neural feature and a second neural feature, the obtaining current input information for the set of training points comprising:
inputting all the training images and the scene point cloud into a semantic prediction network to obtain first neural characteristics of each point cloud point in the scene point cloud;
And acquiring the training image corresponding to the training view angle origin with the nearest point cloud point, and inputting the training image into a convolutional neural network to obtain a second neural characteristic of the point cloud point.
4. The method according to claim 1, wherein the method further comprises:
acquiring target view angle parameters of a scene to be reconstructed; determining virtual pixel points according to the target visual angle parameters;
selecting a plurality of position points as target sampling points on rays which start from the target view angle origin and pass through the position points corresponding to each virtual pixel point in the scene point cloud; and determining point cloud points in the scene point cloud positioned in a preset range near the plurality of target sampling points as a target point set, and acquiring the input information of the target point set;
inputting the input information of the target point set into the scene reconstruction model to obtain the color and the light field intensity of the target point set output by the scene reconstruction model; obtaining the color and the light field intensity of the target sampling point through interpolation according to the color and the light field intensity of the target point set; and rendering according to the color and the light field intensity of the target sampling point to obtain a rendered image under the target visual angle parameter.
5. The method of claim 4, wherein the determining as the set of target points a point cloud point in the scene point cloud that is within a predetermined range around the plurality of target sampling points comprises:
and acquiring all point cloud points in the scene point cloud that are located in a preset range near the plurality of target sampling points, and selecting a preset number of point cloud points closest to each sampling point as the target point set.
6. The method according to any one of claims 1-5, wherein the training image satisfies the following condition: the training image includes a room surface; the overlap between training images at any two adjacent training perspective parameters is not less than thirty percent of the overall image size.
7. An indoor scene reconstruction device, comprising:
the acquisition module is used for acquiring images indoors to obtain training images; based on a motion restoration structure technology, acquiring training visual angle parameters corresponding to each training image according to the training images and establishing scene point clouds; the scene point cloud is a point cloud representing indoor scene structure information;
The sampling module is used for selecting a plurality of position points as training sampling points on rays which start from the training view angle origin and pass through the position points corresponding to each pixel point in the training image under the training view angle parameters in the scene point cloud; determining point cloud points in the scene point cloud positioned in a preset range near the training sampling points based on a semantic prediction technology as a training point set, and acquiring current input information of the training point set; wherein the input information includes spatial location, neural characteristics, and ray direction;
the rendering module is used for inputting the current input information of the training point set to a current scene reconstruction model to obtain the color and the light field intensity of the training point set output by the scene reconstruction model; obtaining the color and the light field intensity of the training sampling point through interpolation according to the color and the light field intensity of the training point set; according to the color and the light field intensity of the training sampling points, rendering images under the training visual angle parameters are obtained through rendering;
and the correction module is used for carrying out training correction on the current input information and the current scene reconstruction model according to the loss between the rendered image under the training view angle parameter and the training image under the training view angle parameter until the trained input information and the trained scene reconstruction model are obtained.
8. The apparatus of claim 7, wherein the sampling module is specifically configured to:
based on a semantic prediction technology, semantic information of each pixel point in the training image under the training view angle parameters and semantic information of each point cloud point in the scene point cloud are obtained; the semantic information of the pixel points comprises object categories, and the semantic information of the point cloud points comprises the object categories and corresponding probability values;
for each training sampling point, acquiring all point cloud points in a preset range around the training sampling point in the scene point cloud; calculating the selection probability of each point cloud point in the preset range around the training sampling point, and selecting a preset number of point cloud points as the training point set in descending order of the selection probability; wherein calculating the selection probability comprises: if the object class of the point cloud point is the same as the object class of the pixel point corresponding to the training sampling point, the selection probability of the point cloud point is one; if the object class of the point cloud point is different from the object class of the pixel point corresponding to the training sampling point, the selection probability of the point cloud point is one minus the probability value that the point cloud point has for the object class of the pixel point corresponding to the training sampling point.
9. An electronic device, comprising: a processor, and a memory communicatively coupled to the processor;
the memory stores computer-executable instructions;
the processor executes computer-executable instructions stored in the memory to implement the method of any one of claims 1-6.
10. A computer readable storage medium having stored therein computer executable instructions which when executed by a processor are adapted to carry out the method of any one of claims 1-6.
CN202310544904.XA 2023-05-15 2023-05-15 Indoor scene reconstruction method and device, electronic equipment and medium Pending CN116805349A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310544904.XA CN116805349A (en) 2023-05-15 2023-05-15 Indoor scene reconstruction method and device, electronic equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310544904.XA CN116805349A (en) 2023-05-15 2023-05-15 Indoor scene reconstruction method and device, electronic equipment and medium

Publications (1)

Publication Number Publication Date
CN116805349A true CN116805349A (en) 2023-09-26

Family

ID=88079196

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310544904.XA Pending CN116805349A (en) 2023-05-15 2023-05-15 Indoor scene reconstruction method and device, electronic equipment and medium

Country Status (1)

Country Link
CN (1) CN116805349A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117765171A (en) * 2023-12-12 2024-03-26 之江实验室 Three-dimensional model reconstruction method and device, storage medium and electronic equipment

Similar Documents

Publication Publication Date Title
US11379987B2 (en) Image object segmentation based on temporal information
JP2022524891A (en) Image processing methods and equipment, electronic devices and computer programs
US11748934B2 (en) Three-dimensional expression base generation method and apparatus, speech interaction method and apparatus, and medium
CN112700528B (en) Virtual object shadow rendering method for head-mounted augmented reality device
CN115082639A (en) Image generation method and device, electronic equipment and storage medium
CN111047506B (en) Environmental map generation and hole filling
CN113763231B (en) Model generation method, image perspective determination method, device, equipment and medium
CN116805349A (en) Indoor scene reconstruction method and device, electronic equipment and medium
DE102019005885A1 (en) Area map generation and hole filling
DE102021109050A1 (en) VIDEO COMPRESSION AND TRANSMISSION SUPPORTED BY A NEURONAL GENERATIVE ADVERSARIAL NETWORK
CN116416376A (en) Three-dimensional hair reconstruction method, system, electronic equipment and storage medium
CN117036571B (en) Image data generation, visual algorithm model training and evaluation method and device
CN112580213A (en) Method and apparatus for generating display image of electric field lines, and storage medium
CN115953524B (en) Data processing method, device, computer equipment and storage medium
CN115970275A (en) Projection processing method and device for virtual object, storage medium and electronic equipment
DE102019121570A1 (en) MOTION BLURING AND DEPTH OF DEPTH RECONSTRUCTION THROUGH TIME-STABLE NEURONAL NETWORKS
DE102022100517A1 (en) USING INTRINSIC SHADOW DENOISE FUNCTIONS IN RAYTRACING APPLICATIONS
CN111243099B (en) Method and device for processing image and method and device for displaying image in AR (augmented reality) equipment
CN113126944A (en) Depth map display method, display device, electronic device, and storage medium
Zhou Accurate depth based post-processing for perception enhancement in real time three-dimensional graphics
CN115983352B (en) Data generation method and device based on radiation field and generation countermeasure network
CN115714888B (en) Video generation method, device, equipment and computer readable storage medium
CN116991296B (en) Object editing method and device, electronic equipment and storage medium
CN113223144B (en) Processing method and system for three-dimensional display of mass data
CN117274296A (en) Training method, device, equipment and storage medium of target following model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination