CN115631489A - Three-dimensional semantic scene completion method, device, equipment and medium - Google Patents

Three-dimensional semantic scene completion method, device, equipment and medium

Info

Publication number
CN115631489A
Authority
CN
China
Prior art keywords
dimensional
feature
features
target
semantic
Prior art date
Legal status
Pending
Application number
CN202211371118.6A
Other languages
Chinese (zh)
Inventor
黄锐
陈勇全
孙宇翔
李�杰
宋琪
许振兴
许龙
Current Assignee
Chinese University of Hong Kong Shenzhen
Shenzhen Institute of Artificial Intelligence and Robotics
Original Assignee
Chinese University of Hong Kong Shenzhen
Shenzhen Institute of Artificial Intelligence and Robotics
Priority date
Filing date
Publication date
Application filed by Chinese University of Hong Kong Shenzhen, Shenzhen Institute of Artificial Intelligence and Robotics filed Critical Chinese University of Hong Kong Shenzhen
Priority to CN202211371118.6A
Publication of CN115631489A
Legal status: Pending


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/70 Labelling scene content, e.g. deriving syntactic or semantic representations
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V 10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level, of extracted features
    • G06V 10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The application discloses a three-dimensional semantic scene completion method, device, equipment and medium, which relate to the field of three-dimensional scene completion and comprise the following steps: performing depth completion on the depth image in a target RGB-D image by using a depth estimation image corresponding to the two-dimensional image in the target RGB-D image, and extracting two-dimensional target features from the completed depth image by using a preset feature extractor; inputting the two-dimensional target feature and the two-dimensional semantic feature obtained by semantically segmenting the two-dimensional image into a preset projection layer, and obtaining the three-dimensional target feature and the three-dimensional semantic feature output by the preset projection layer; inputting the three-dimensional target feature and the three-dimensional semantic feature into a 3D backbone network to obtain the extracted features output by the 3D backbone network; and performing feature fusion on the extracted features, the three-dimensional target features and the three-dimensional semantic features, and completing the three-dimensional semantic scene by using the fused features. The invention mines the completed depth prior, fuses it with the semantic prior, and thereby achieves three-dimensional semantic scene completion.

Description

Three-dimensional semantic scene completion method, device, equipment and medium
Technical Field
The invention relates to the field of three-dimensional scene completion, and in particular to a three-dimensional semantic scene completion method, device, equipment and medium.
Background
Semantic Scene Completion (SSC) refers to the task of completing missing structures in a three-dimensional scene while inferring the semantic label of each voxel in the scene. Understanding the geometry and semantic information of three-dimensional scenes is a core challenge in computer vision research, is crucial for mobile agents to interact with the real world, and has a wide range of applications such as augmented reality, robotic grasping and navigation, etc.
SSC was originally used to jointly infer scene geometry and semantics from a single depth image, and the present invention classifies such methods as depth-only methods. Some research work has shown that the performance of SSC tasks can be significantly improved using RGB and corresponding depth, i.e. RGB-D image pairs, which this invention refers to as RGB-D based methods. The RGB-D based SSC technique generally gives better results than the depth-only technique because the RGB part introduces additional color, texture and semantic information. Among the RGB-D based methods, some prior art techniques such as SATNet, TS3D, IMENet, etc., utilize 2D (i.e., two-dimensional) dense semantics in RGB images to help improve 3D (i.e., three-dimensional) SSC accuracy, which the present invention defines as a semantic-based method. However, the results of these semantic-based methods are still unsatisfactory, and there are problems of semantic inconsistency and scene incompleteness.
Therefore, in the process of semantic scene completion, how to avoid the semantic inconsistency and scene incompleteness that arise in the completed results of semantic-based methods is a problem to be solved in the field.
Disclosure of Invention
In view of this, the present invention aims to provide a method, an apparatus, a device, and a medium for three-dimensional semantic scene completion, which deeply mine the completed depth prior and explore its deep fusion with the semantic prior, so that the final fused features can be used for accurate prediction and the three-dimensional semantic scene completion can be accomplished. The specific scheme is as follows:
in a first aspect, the present application discloses a three-dimensional semantic scene completion method, including:
performing depth completion on a depth image in a target RGB-D image by using a depth estimation image corresponding to a two-dimensional image in the target RGB-D image containing a three-dimensional scene to determine a completed depth image, and extracting two-dimensional target features in the completed depth image by using a preset feature extractor;
inputting the two-dimensional target feature and a two-dimensional semantic feature obtained by semantically segmenting the two-dimensional image into a preset projection layer to obtain a three-dimensional target feature and a three-dimensional semantic feature which are output by the preset projection layer and respectively correspond to the two-dimensional target feature and the two-dimensional semantic feature;
inputting the three-dimensional target feature and the three-dimensional semantic feature into a 3D backbone network to obtain extracted features output by the 3D backbone network;
and performing feature fusion on the extracted features, the three-dimensional target features and the three-dimensional semantic features, and completing three-dimensional semantic scene completion by using the fused features.
Optionally, the extracting, by using a preset feature extractor, the two-dimensional target feature in the complemented depth image includes:
and extracting the two-dimensional target features in the supplemented depth image by using a preset feature extractor consisting of preset two-dimensional convolution layers and a preset number of cascaded dimension decomposition residual modules.
Optionally, the inputting the two-dimensional target feature and the two-dimensional semantic feature obtained by semantically segmenting the two-dimensional image into a preset projection layer to obtain a three-dimensional target feature and a three-dimensional semantic feature, which are output by the preset projection layer and respectively correspond to the two-dimensional target feature and the two-dimensional semantic feature, includes:
inputting the two-dimensional target feature, the two-dimensional semantic feature obtained by performing semantic segmentation on the two-dimensional image and the supplemented depth image into a preset projection layer so as to determine a three-dimensional surface voxel corresponding to each two-dimensional pixel in the supplemented depth image based on a camera projection equation predefined in the preset projection layer, establishing a three-dimensional space based on the three-dimensional surface voxel, discretizing the three-dimensional space, determining a projection coefficient between the supplemented depth image and the discretized three-dimensional space, and determining the three-dimensional target feature and the three-dimensional semantic feature respectively corresponding to the two-dimensional target feature and the two-dimensional semantic feature based on the projection coefficient.
Optionally, the inputting the three-dimensional target feature and the three-dimensional semantic feature into a 3D backbone network to obtain an extracted feature output by the 3D backbone network includes:
and inputting the three-dimensional target feature and the three-dimensional semantic feature into a 3D SATNet-TNet so as to obtain an extracted feature output by the 3D SATNet-TNet.
Optionally, the performing feature fusion on the extracted features, the three-dimensional target features, and the three-dimensional semantic features includes:
performing feature fusion on the extracted features and the three-dimensional target features by using a preset feature fusion method to obtain a target matrix; each element in the target matrix represents the probability that the corresponding voxel is occupied in three-dimensional space;
determining a predicted coarse feature based on the extracted features;
coding the three-dimensional semantic features based on a preset coding function to determine coded features;
and adding the predicted rough features and the coded features, performing pixel-level multiplication on the added features and the target matrix, and taking the result of the pixel-level multiplication as a fused feature.
Optionally, the performing feature fusion on the extracted features and the three-dimensional target features by using a preset feature fusion method to obtain a target matrix includes:
expanding the three-dimensional target features to determine first features to be connected in series, and inputting the extracted features into a preset conversion module to obtain second features to be connected in series output by the conversion module; the preset conversion module consists of a three-dimensional convolution layer, a regularization layer and an activation function layer;
connecting the first feature to be connected in series with the second feature to be connected in series to obtain a connected feature;
inputting the series-connected characteristics into the conversion module, connecting the conversion module with a preset convolution layer, and executing a preset activation function after the convolution layer to generate a target matrix.
Optionally, the method for completing a three-dimensional semantic scene further includes:
constructing a binary cross entropy loss function as a first auxiliary loss function to optimize the target matrix;
constructing a first cross entropy loss function as a second auxiliary loss function to optimize the predicted rough feature;
constructing a second cross entropy loss function as a main loss function to optimize the fused features;
constructing a total loss function using the first auxiliary loss function, the second auxiliary loss function, and the main loss function.
In a second aspect, the present application discloses a three-dimensional semantic scene completion apparatus, including:
the depth enhancement module is used for performing depth completion on a depth image in a target RGB-D image by using a depth estimation image corresponding to a two-dimensional image in the target RGB-D image containing a three-dimensional scene to determine a completed depth image, and extracting two-dimensional target features in the completed depth image by using a preset feature extractor;
the feature projection module is used for inputting the two-dimensional target feature and a two-dimensional semantic feature obtained by semantically segmenting the two-dimensional image into a preset projection layer so as to obtain a three-dimensional target feature and a three-dimensional semantic feature which are output by the preset projection layer and respectively correspond to the two-dimensional target feature and the two-dimensional semantic feature;
the feature extraction module is used for inputting the three-dimensional target features and the three-dimensional semantic features into a 3D backbone network so as to obtain extracted features output by the 3D backbone network;
and the feature fusion module is used for performing feature fusion on the extracted features, the three-dimensional target features and the three-dimensional semantic features and completing three-dimensional semantic scene completion by using the fused features.
In a third aspect, the present application discloses an electronic device, comprising:
a memory for storing a computer program;
and the processor is used for executing the computer program to realize the three-dimensional semantic scene completion method.
In a fourth aspect, the present application discloses a computer storage medium for storing a computer program; wherein the computer program, when executed by a processor, implements the steps of the three-dimensional semantic scene completion method disclosed above.
The method comprises the steps of conducting depth completion on a depth image in a target RGB-D image by using a depth estimation image corresponding to a two-dimensional image in the target RGB-D image containing a three-dimensional scene to determine a completed depth image, and extracting two-dimensional target features in the completed depth image by using a preset feature extractor; inputting the two-dimensional target feature and a two-dimensional semantic feature obtained by semantically segmenting the two-dimensional image into a preset projection layer to obtain a three-dimensional target feature and a three-dimensional semantic feature which are output by the preset projection layer and respectively correspond to the two-dimensional target feature and the two-dimensional semantic feature; inputting the three-dimensional target features and the three-dimensional semantic features into a 3D backbone network to obtain extracted features output by the 3D backbone network; and performing feature fusion on the extracted features, the three-dimensional target features and the three-dimensional semantic features, and completing three-dimensional semantic scene completion by using the fused features. Therefore, in the embodiment, an additional depth enhancement mode is introduced, the SSC precision is improved by complementing holes and denoising an original depth, the conversion from a two-dimensional target feature and a two-dimensional semantic feature to a three-dimensional feature is completed through a 2D-3D preset projection layer, the extracted feature in the three-dimensional feature is extracted by using a 3D backbone network, the feature fusion is performed by using the two-dimensional target feature, the two-dimensional semantic feature and the extracted feature, the complemented depth prior of the whole process is deeply excavated, the depth fusion with the semantic prior is explored, the final fused feature is used for performing accurate prediction, and the three-dimensional semantic scene complementation is completed.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.
Fig. 1 is a flowchart of a three-dimensional semantic scene completion method provided in the present application;
FIG. 2 is a schematic view of a depth enhancement process provided herein;
fig. 3 is a flowchart of a specific three-dimensional semantic scene completion method provided in the present application;
FIG. 4 is an overall view of semantic and depth fusion provided by the present application;
FIG. 5 is an overall frame view provided by the present application;
fig. 6 is a schematic structural diagram of a three-dimensional semantic scene completion apparatus provided in the present application;
fig. 7 is a block diagram of an electronic device provided in the present application.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The existing SSC technology usually ignores the noise and holes present in the original depth and fails to make good use of the two-dimensional dense semantic segmentation prior; these defects lead to semantic inconsistency, incomplete structure, and similar problems in the final SSC prediction. In the present application, the original depth is completed and enhanced based on an RGB-D method, and the semantic and structural branches are deeply coupled, so that the mIoU (Mean Intersection over Union) precision of the three-dimensional semantic segmentation task is greatly improved.
The embodiment of the invention discloses a three-dimensional semantic scene completion method, which is described with reference to FIG. 1 and comprises the following steps:
step S11: the method comprises the steps of conducting depth completion on a depth image in a target RGB-D image by using a depth estimation image corresponding to a two-dimensional image in the target RGB-D image containing a three-dimensional scene to determine a completed depth image, and extracting two-dimensional target features in the completed depth image by using a preset feature extractor.
In this step, the target RGB-D image containing the three-dimensional scene includes a two-dimensional image and a depth image, and the processes of depth completion and feature extraction on the depth image are carried out. In the present invention, the depth value of the original depth image is denoted D_raw, the depth value of the depth estimation image is denoted D_pre, the depth value of the completed depth image is denoted D_aug, and the two-dimensional target feature obtained by applying the preset feature extractor to the completed depth image is denoted F_aug.
It should be noted that, in this step, the depth estimation image is the predicted depth map generated by performing depth estimation on the two-dimensional image in the target RGB-D image. In a specific embodiment, the predicted depth map may be generated by monocular depth estimation using a pre-trained depth estimation model based on the DispNet network. The depth value D_pre of the depth estimation image is obtained at the same time as the predicted depth map.
In addition, the two-dimensional image is subjected to semantic segmentation while the depth estimation is performed on the two-dimensional image, and in a specific embodiment, advanced 2D semantic segmentation can be performed by using a pre-trained segmentation model based on IMENet, and two-dimensional semantic features are finally obtained.
It should be noted that, when performing the depth estimation and the semantic segmentation, the weights of these two pre-trained networks are fixed during training and do not participate in the parameter updates of the overall network.
In this embodiment, when the depth estimation image corresponding to the two-dimensional image in the target RGB-D image containing the three-dimensional scene is used to perform depth completion on the depth image in the target RGB-D image, the completion may specifically be based on the depth value D_pre of that depth estimation image and the depth value D_raw of the depth image in the target RGB-D image.
As shown in fig. 2, the left dotted frame is the depth completion process and the right dotted frame is the feature extraction process of the feature extractor. In the figure, D_raw and D_pre on the left are combined by depth enhancement to generate the enhanced depth map D_aug. In practice, D_raw is generally a depth map containing holes; in this embodiment D_pre may be used to replace the holes in D_raw to realize depth completion, and the depth at the hole boundaries is then smoothed to obtain the completed depth image with depth value D_aug. The two-dimensional target feature F_aug corresponding to the completed depth image is subsequently extracted by the feature extractor.
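The depth enhancement step can be sketched as follows. This is a minimal illustration rather than the patent's reference implementation: it assumes that holes in D_raw are encoded as zeros, that boundary smoothing is done with a simple Gaussian blend, and the function name complete_depth is hypothetical.
```python
import numpy as np
from scipy.ndimage import gaussian_filter, binary_dilation

def complete_depth(d_raw: np.ndarray, d_pre: np.ndarray, sigma: float = 1.0) -> np.ndarray:
    """Fill holes in D_raw with D_pre, then smooth the depth across hole boundaries.

    Assumes holes in d_raw are encoded as zeros (an illustrative convention).
    """
    hole_mask = d_raw <= 0
    d_aug = np.where(hole_mask, d_pre, d_raw)        # D_pre replaces the holes in D_raw
    smoothed = gaussian_filter(d_aug, sigma=sigma)   # low-pass version of the fused depth
    # Boundary band: pixels close to both the hole region and the valid region.
    band = binary_dilation(hole_mask, iterations=2) & binary_dilation(~hole_mask, iterations=2)
    d_aug[band] = smoothed[band]                     # smooth only across the seam
    return d_aug
```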
In this embodiment, extracting the two-dimensional target features in the completed depth image by using a preset feature extractor may include: extracting the two-dimensional target features in the completed depth image by using a preset feature extractor composed of a preset two-dimensional convolution layer and a preset number of cascaded dimension decomposition residual modules. As shown in fig. 2, the preset feature extractor includes a 2D convolutional layer and four cascaded DDR modules (i.e., Dimension Decomposition Residual modules) to extract higher-level enhanced depth features, where the parameters of the convolutional layer and the DDR modules are given as (kernel size, dilation, stride). The first 2D convolutional layer in the feature extractor increases the number of channels of the depth feature map, and the four 2D DDR modules perform residual learning while reducing the number of parameters.
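A minimal PyTorch-style sketch of such an extractor is given below, assuming a 2D dimension decomposition residual block that factorizes a 3x3 convolution into 3x1 and 1x3 convolutions; the channel counts, dilation rates, and block layout are illustrative assumptions, not the exact configuration of fig. 2.
```python
import torch
import torch.nn as nn

class DDR2d(nn.Module):
    """Illustrative 2D dimension decomposition residual block: 3x3 conv split into 3x1 and 1x3."""
    def __init__(self, channels, dilation=1):
        super().__init__()
        self.conv_h = nn.Conv2d(channels, channels, (3, 1), padding=(dilation, 0), dilation=(dilation, 1))
        self.conv_w = nn.Conv2d(channels, channels, (1, 3), padding=(0, dilation), dilation=(1, dilation))
        self.bn = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.conv_h(x))
        out = self.bn(self.conv_w(out))
        return self.relu(out + x)   # residual connection keeps the parameter count low

class DepthFeatureExtractor(nn.Module):
    """2D conv to raise the channel count, followed by four cascaded DDR blocks."""
    def __init__(self, out_channels=64):
        super().__init__()
        self.stem = nn.Conv2d(1, out_channels, kernel_size=3, padding=1)   # D_aug assumed single-channel
        self.ddrs = nn.Sequential(*[DDR2d(out_channels, dilation=d) for d in (1, 2, 3, 4)])

    def forward(self, d_aug):
        return self.ddrs(self.stem(d_aug))   # F_aug: enhanced 2D depth features
```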
Step S12: and inputting the two-dimensional target feature and the two-dimensional semantic feature obtained by semantically segmenting the two-dimensional image into a preset projection layer so as to obtain a three-dimensional target feature and a three-dimensional semantic feature which are output by the preset projection layer and respectively correspond to the two-dimensional target feature and the two-dimensional semantic feature.
In the present invention, the two-dimensional semantic feature obtained by performing semantic segmentation on the two-dimensional image is denoted S_pre.
In this embodiment, the preset projection layer can be understood as a 2D-3D projection layer, which maps the two-dimensional features output in the previous steps into three-dimensional space. The three-dimensional semantic feature generated by projecting the two-dimensional semantic feature is denoted S_proj, and the three-dimensional target feature generated by projecting the two-dimensional target feature is denoted F_proj.
In this embodiment, the inputting the two-dimensional target feature and the two-dimensional semantic feature obtained by semantically segmenting the two-dimensional image into a preset projection layer to obtain a three-dimensional target feature and a three-dimensional semantic feature output by the preset projection layer and respectively corresponding to the two-dimensional target feature and the two-dimensional semantic feature includes: inputting the two-dimensional target feature, the two-dimensional semantic feature obtained by performing semantic segmentation on the two-dimensional image and the supplemented depth image into a preset projection layer so as to determine a three-dimensional surface voxel corresponding to each two-dimensional pixel in the supplemented depth image based on a camera projection equation predefined in the preset projection layer, establishing a three-dimensional space based on the three-dimensional surface voxel, discretizing the three-dimensional space, determining a projection coefficient between the supplemented depth image and the discretized three-dimensional space, and determining the three-dimensional target feature and the three-dimensional semantic feature respectively corresponding to the two-dimensional target feature and the two-dimensional semantic feature based on the projection coefficient.
In this embodiment, the predefined camera projection equation is determined based on the intrinsic camera matrix K_{3×3} of the camera capturing the RGB-D image and the extrinsic camera matrix [R|t]_{3×4}. The camera projection equation may be written as: p_uv = K_{3×3} [R|t]_{3×4} P_XYZ. Each two-dimensional pixel p_uv ([u, v, 1]^T in homogeneous coordinates) in the 2D image can thus be easily projected to its corresponding 3D point P_XYZ ([X, Y, Z, 1]^T in homogeneous coordinates) through the camera projection equation.
When the projected 3D scene space is discretized into a volume with a certain voxel size (e.g., 0.02 m), an incomplete three-dimensional space is generated, in which each three-dimensional surface voxel is assigned its corresponding two-dimensional feature vector. Any voxel in the three-dimensional space that is not occupied by a depth value of the completed depth image D_aug has its feature vector set to zero. During training, the projection coefficients between the two-dimensional features and the three-dimensional space generated from the completed depth map are recorded in a table for gradient backpropagation.
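The 2D-3D projection can be sketched as follows. This is a minimal sketch under assumed conventions (a writable voxel grid, a 0.02 m voxel size, flattened scatter indices standing in for the recorded projection table); the function project_2d_to_3d and its grid dimensions are hypothetical, not the projection layer's exact implementation.
```python
import torch

def project_2d_to_3d(feat2d, depth, K, Rt, voxel_origin, voxel_size=0.02, grid=(240, 144, 240)):
    """Scatter per-pixel 2D features into a 3D voxel grid via the camera projection equation.

    feat2d: (C, H, W) 2D features; depth: (H, W) completed depth D_aug;
    K: (3, 3) intrinsics; Rt: (3, 4) extrinsics; voxel_origin: (3,) tensor, world origin of the grid.
    """
    C, H, W = feat2d.shape
    v, u = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    # Back-project pixels to camera coordinates, then to world coordinates.
    pix = torch.stack([u * depth, v * depth, depth], dim=0).reshape(3, -1)   # depth-scaled p_uv
    cam = torch.linalg.inv(K) @ pix                                          # camera-frame points
    R, t = Rt[:, :3], Rt[:, 3:]
    world = R.T @ (cam - t)                                                  # invert [R|t]
    # Discretize world points into voxel indices (the projection coefficients).
    idx = ((world.T - voxel_origin) / voxel_size).long()                     # (H*W, 3)
    valid = ((idx >= 0) & (idx < torch.tensor(grid))).all(dim=1) & (depth.reshape(-1) > 0)
    volume = torch.zeros(C, *grid)
    flat = idx[valid, 0] * grid[1] * grid[2] + idx[valid, 1] * grid[2] + idx[valid, 2]
    volume.reshape(C, -1)[:, flat] = feat2d.reshape(C, -1)[:, valid]         # assign features to surface voxels
    return volume, idx, valid   # idx/valid play the role of the recorded projection table
```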
Step S13: and inputting the three-dimensional target features and the three-dimensional semantic features into a 3D backbone network to obtain extracted features output by the 3D backbone network.
In this embodiment, the inputting the three-dimensional target feature and the three-dimensional semantic feature into a 3D backbone network to obtain an extracted feature output by the 3D backbone network includes: and inputting the three-dimensional target feature and the three-dimensional semantic feature into a 3D SATNet-TNet so as to obtain an extracted feature output by the 3D SATNet-TNet. That is, the 3D backbone network in the present invention is preferably 3D SATNet-TNet.
In a specific embodiment, the extracted feature is F_feat ∈ R^{C'×W×H×D}, where C' is the number of channels and W, H, D are the width, height, and depth of the three-dimensional space, respectively. F_feat can be seen as an aggregation of global geometric context information and semantic representations.
Step S14: and performing feature fusion on the extracted features, the three-dimensional target features and the three-dimensional semantic features, and completing three-dimensional semantic scene completion by using the fused features.
In this step, the extracted feature F_feat, the three-dimensional target feature F_proj, and the three-dimensional semantic feature S_proj are fused to achieve a better fusion of semantics and enhanced depth; the fused features are obtained and used for the final prediction.
In the embodiment, depth completion is performed on a depth image in a target RGB-D image by using a depth estimation image corresponding to a two-dimensional image in the target RGB-D image containing a three-dimensional scene to determine a completed depth image, and a preset feature extractor is used for extracting two-dimensional target features in the completed depth image; inputting the two-dimensional target feature and a two-dimensional semantic feature obtained by semantically segmenting the two-dimensional image into a preset projection layer to obtain a three-dimensional target feature and a three-dimensional semantic feature which are output by the preset projection layer and respectively correspond to the two-dimensional target feature and the two-dimensional semantic feature; inputting the three-dimensional target feature and the three-dimensional semantic feature into a 3D backbone network to obtain extracted features output by the 3D backbone network; and performing feature fusion on the extracted features, the three-dimensional target features and the three-dimensional semantic features, and completing three-dimensional semantic scene completion by using the fused features. Therefore, in the embodiment, an additional depth enhancement mode is introduced, the SSC precision is improved by complementing holes and denoising an original depth, the conversion from a two-dimensional target feature and a two-dimensional semantic feature to a three-dimensional feature is completed through a 2D-3D preset projection layer, the extracted feature in the three-dimensional feature is extracted by using a 3D backbone network, the feature fusion is performed by using the two-dimensional target feature, the two-dimensional semantic feature and the extracted feature, the complemented depth prior of the whole process is deeply excavated, the depth fusion with the semantic prior is explored, the final fused feature is used for performing accurate prediction, and the three-dimensional semantic scene complementation is completed.
Fig. 3 is a flowchart of a specific three-dimensional semantic scene completion method provided in an embodiment of the present application. Referring to fig. 3, the method includes:
step S21: the method comprises the steps of conducting depth completion on a depth image in a target RGB-D image by using a depth estimation image corresponding to a two-dimensional image in the target RGB-D image containing a three-dimensional scene to determine a completed depth image, and extracting two-dimensional target features in the completed depth image by using a preset feature extractor.
For a more specific processing procedure of step S21, reference may be made to corresponding contents disclosed in the foregoing embodiments, and details are not repeated here.
Step S22: and inputting the two-dimensional target feature and the two-dimensional semantic feature obtained by semantically segmenting the two-dimensional image into a preset projection layer so as to obtain a three-dimensional target feature and a three-dimensional semantic feature which are output by the preset projection layer and respectively correspond to the two-dimensional target feature and the two-dimensional semantic feature.
For a more specific processing procedure of step S22, reference may be made to corresponding contents disclosed in the foregoing embodiments, and details are not repeated here.
Step S23: and inputting the three-dimensional target features and the three-dimensional semantic features into a 3D backbone network to obtain extracted features output by the 3D backbone network.
For a more specific processing procedure of step S23, reference may be made to corresponding contents disclosed in the foregoing embodiments, and details are not repeated here.
Step S24: and performing feature fusion on the extracted features and the three-dimensional target features by using a preset feature fusion method to obtain a target matrix.
In this embodiment, performing feature fusion on the extracted features and the three-dimensional target features by using a preset feature fusion method to obtain a target matrix may include: expanding the three-dimensional target feature to determine a first feature to be connected in series, and inputting the extracted feature into a preset conversion module to obtain a second feature to be connected in series output by the conversion module, where the preset conversion module consists of a three-dimensional convolution layer, a regularization layer, and an activation function layer; connecting the first feature to be connected in series with the second feature to be connected in series to obtain a connected feature; and inputting the series-connected feature into the conversion module again, connecting the conversion module with a preset convolution layer, and executing a preset activation function after the convolution layer to generate the target matrix. The present invention denotes the target matrix as A_3d.
Fig. 4 is a schematic diagram of the overall semantic and depth fusion proposed by the present invention. In the figure, the upper box is the semantic branch and the lower box is the structure branch. Following the structure branch flow in this step, the three-dimensional target feature F_proj obtained in the previous step is first expanded to F'_3d ∈ R^{C'×W×H×D}. At the same time, a conversion module is applied to the extracted feature F_feat to generate F_3d. After that, F'_3d and F_3d are connected in series, the series-connected feature is input into the conversion module again, a preset convolution layer is added after the conversion module, and finally a sigmoid activation function is executed after the convolution layer to generate the target matrix A_3d. Each value of the target matrix A_3d lies in [0, 1] and represents the probability that the corresponding voxel in space is occupied, so A_3d can also be understood as a space occupancy matrix. A_3d can be specifically expressed as: A_3d = sigmoid(C_{1×1×1}(CBR([F'_3d, F_3d]))), where [·,·] denotes the series (concatenation) operation and the number of channels C' is 128. Note that the 3D convolution parameters in fig. 4 are given as (input channels, output channels, kernel size).
In this embodiment, the conversion module may also be called the CBR module, which specifically stands for Conv3d-BatchNorm3d-ReLU, i.e., a module composed of a three-dimensional convolution layer, a regularization layer, and an activation function layer.
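A minimal sketch of this structure branch is given below, assuming PyTorch modules; the channel sizes, the use of a CBR block to expand F_proj to F'_3d, and the class names are illustrative assumptions based on fig. 4, not the patent's exact layers.
```python
import torch
import torch.nn as nn

class CBR(nn.Module):
    """Conversion module: Conv3d -> BatchNorm3d -> ReLU."""
    def __init__(self, in_ch, out_ch, kernel=3):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv3d(in_ch, out_ch, kernel, padding=kernel // 2),
            nn.BatchNorm3d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.block(x)

class StructureBranch(nn.Module):
    """Produces the occupancy matrix A_3d from F_proj and F_feat (illustrative)."""
    def __init__(self, proj_ch, feat_ch, c=128):
        super().__init__()
        self.expand = CBR(proj_ch, c)    # expands F_proj to F'_3d with C' = 128 channels
        self.convert = CBR(feat_ch, c)   # conversion module applied to F_feat -> F_3d
        self.fuse = CBR(2 * c, c)        # conversion module after the series connection
        self.head = nn.Conv3d(c, 1, kernel_size=1)

    def forward(self, f_proj, f_feat):
        f3d_p = self.expand(f_proj)
        f3d = self.convert(f_feat)
        fused = self.fuse(torch.cat([f3d_p, f3d], dim=1))   # [F'_3d, F_3d] series connection
        return torch.sigmoid(self.head(fused))              # A_3d: per-voxel occupancy probability
```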
Step S25: and determining a predicted rough feature based on the extracted features, encoding the three-dimensional semantic features based on a preset encoding function to determine encoded features, then adding the predicted rough feature and the encoded features, performing pixel-level multiplication on the added features and the target matrix, and taking the pixel-level multiplication result as the feature after fusion.
This step can be understood with reference to the branch flows in fig. 4. The extracted feature F_feat is first passed through a 3D convolution layer with a 1×1×1 kernel (denoted C_{1×1×1}) to generate F_pre. A coarse SSC prediction feature F_coarse, namely the predicted rough feature mentioned above, is then generated using the softmax activation function, with the specific formula: F_coarse = softmax(C_{1×1×1}(F_feat)). The projected three-dimensional semantic feature is first encoded by a one-hot function to obtain the ROI (region of interest) of each specific class. In the present invention, the 3D semantic feature S_proj is encoded into 12 channels to obtain the encoded feature F_one-hot. In each channel, the voxel values within the ROI of the corresponding category are set to 1; otherwise, they are set to zero. In this step, the one-hot encoding introduces a spatial boundary constraint for each category into the network, thereby improving the prediction of 3D semantics.
In this embodiment, after obtaining the target matrix A_3d, the predicted rough feature F_coarse, and the encoded feature F_one-hot, F_coarse and F_one-hot are first added together, and the sum is then multiplied at the pixel level with A_3d to generate the final SSC result F_fine, i.e., F_fine = A_3d * (F_coarse + F_one-hot).
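Under the same assumptions, this fusion can be sketched as follows; the tensor layout (N, C, W, H, D), the 12-class setting, and the function name fuse_semantics are illustrative assumptions.
```python
import torch
import torch.nn.functional as F

def fuse_semantics(f_feat, s_proj_labels, a_3d, pred_head, num_classes=12):
    """Combine the coarse SSC prediction, one-hot semantic ROI, and occupancy A_3d.

    f_feat: (N, C', W, H, D) backbone features; s_proj_labels: (N, W, H, D) projected class ids;
    a_3d: (N, 1, W, H, D) occupancy; pred_head: 1x1x1 Conv3d mapping C' -> num_classes.
    """
    f_coarse = F.softmax(pred_head(f_feat), dim=1)             # coarse SSC prediction F_coarse
    f_onehot = F.one_hot(s_proj_labels.long(), num_classes)    # (N, W, H, D, num_classes)
    f_onehot = f_onehot.permute(0, 4, 1, 2, 3).float()         # per-class ROI constraint F_one-hot
    return a_3d * (f_coarse + f_onehot)                        # F_fine = A_3d * (F_coarse + F_one-hot)
```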
It is noted that the 3D backbone network in the present invention can be replaced by the 3D convolution part of any popular semantic scene completion model, provided that the final 12-channel prediction layer is removed and the number of channels of F_feat is preserved. However, the 3D SATNet-TNet is generally used as the 3D backbone network.
Step S26: and completing three-dimensional semantic scene completion by using the fused features.
In this embodiment, the method for completing a three-dimensional semantic scene may further include: constructing a binary cross entropy loss function as a first auxiliary loss function to optimize the target matrix; constructing a first cross entropy loss function as a second auxiliary loss function to optimize the predicted rough feature; constructing a second cross entropy loss function as a main loss function to optimize the fused features; constructing a total loss function using the first auxiliary loss function, the second auxiliary loss function, and the main loss function.
To improve performance and make the 3D network easier to optimize, the present invention employs multiple loss functions for supervision. In a specific embodiment, the main loss function L_fine is used to supervise the refined prediction F_fine, and two specific auxiliary loss functions are added, namely the first auxiliary loss function L_binary and the second auxiliary loss function L_coarse, which supervise A_3d and F_coarse, respectively. L_fine and L_coarse are cross-entropy loss functions, as shown below, and L_binary is a binary cross-entropy loss function.
The specific formula is as follows:
L = -(1/N) * Σ_{n=1}^{N} Σ_{c=1}^{C} w_c * y_nc * log(p_nc)
where y_nc is the one-hot ground-truth vector, i.e., y_nc = 1 if the n-th voxel belongs to class c and y_nc = 0 otherwise, and p_nc is the predicted probability that the n-th voxel belongs to class c. C and N are the total numbers of classes and voxels, respectively, and w_c is the category weight.
Finally, the total loss function L_total is defined using the main and auxiliary loss functions: L_total = L_fine + L_binary + L_coarse.
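A sketch of this supervision, assuming PyTorch losses, a per-voxel semantic ground truth sem_gt (class indices), an occupancy ground truth occ_gt, and class weights passed as a tensor, is given below; treating the CE inputs as logits is a simplifying assumption of this sketch.
```python
import torch
import torch.nn as nn

def total_loss(a_3d, f_coarse_logits, f_fine_logits, occ_gt, sem_gt, class_weights):
    """L_total = L_fine + L_binary + L_coarse (illustrative sketch)."""
    # L_binary: binary cross-entropy supervising the occupancy matrix A_3d.
    l_binary = nn.functional.binary_cross_entropy(a_3d.squeeze(1), occ_gt.float())
    # Weighted cross-entropy with class weights w_c for both SSC terms.
    ce = nn.CrossEntropyLoss(weight=class_weights)
    l_coarse = ce(f_coarse_logits, sem_gt)   # supervises F_coarse
    l_fine = ce(f_fine_logits, sem_gt)       # supervises F_fine
    return l_fine + l_binary + l_coarse
```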
This embodiment provides the specific workflow of the 3D semantic and depth fusion module, which comprises a structure branch and a semantic branch. The structure branch explicitly predicts the scene occupancy, while the semantic branch comprises a one-hot encoding flow and a coarse prediction flow; fusing the results of the two branches yields a more accurate SSC prediction. The losses are designed so that both the space occupancy and the SSC prediction are explicitly supervised. Overall, the three-dimensional semantic scene completion method based on semantic and depth enhancement provides a more comprehensive understanding of the three-dimensional scene by enhancing the original two-dimensional depth input containing holes and noise and combining it with two-dimensional dense semantics. In practical implementation, by completing and enhancing the original depth and deeply coupling the semantic and structural branches, the mIoU precision of the three-dimensional semantic segmentation task is more than 60%, and the IoU of the three-dimensional scene completion task exceeds 83%.
According to the processing flow, the invention can be divided into a preprocessing module, a depth enhancement module, and a 3D semantic and depth fusion module. Fig. 5 shows the overall framework of the invention, in which a is the preprocessing module, comprising a 2D semantic segmentation sub-network and a 2D monocular depth estimation sub-network that respectively predict semantic features and estimated depth from a single RGB image; b is the depth enhancement module, which uses the predicted depth to enhance the incomplete original depth and employs convolution blocks to learn an enhanced depth feature representation; the 2D features are then mapped into the corresponding 3D space by the 2D-3D projection layer; and finally, the 3D semantic and depth fusion module c achieves higher-level semantic scene completion by exploring deeper semantic-structure fusion through the structure and semantic branches.
Referring to fig. 6, an embodiment of the present application discloses a three-dimensional semantic scene completion apparatus, which may specifically include:
the depth enhancement module 11 is configured to perform depth completion on a depth image in a target RGB-D image containing a three-dimensional scene by using a depth estimation image corresponding to a two-dimensional image in the target RGB-D image to determine a completed depth image, and extract a two-dimensional target feature in the completed depth image by using a preset feature extractor;
the feature projection module 12 is configured to input the two-dimensional target feature and a two-dimensional semantic feature obtained by semantically segmenting the two-dimensional image into a preset projection layer, so as to obtain a three-dimensional target feature and a three-dimensional semantic feature, which are output by the preset projection layer and respectively correspond to the two-dimensional target feature and the two-dimensional semantic feature;
a feature extraction module 13, configured to input the three-dimensional target feature and the three-dimensional semantic feature into a 3D backbone network, so as to obtain an extracted feature output by the 3D backbone network;
and the feature fusion module 14 is configured to perform feature fusion on the extracted features, the three-dimensional target features, and the three-dimensional semantic features, and complete three-dimensional semantic scene completion by using the fused features.
The method comprises the steps of conducting depth completion on a depth image in a target RGB-D image by using a depth estimation image corresponding to a two-dimensional image in the target RGB-D image containing a three-dimensional scene to determine a completed depth image, and extracting two-dimensional target features in the completed depth image by using a preset feature extractor; inputting the two-dimensional target feature and a two-dimensional semantic feature obtained by semantically segmenting the two-dimensional image into a preset projection layer to obtain a three-dimensional target feature and a three-dimensional semantic feature which are output by the preset projection layer and respectively correspond to the two-dimensional target feature and the two-dimensional semantic feature; inputting the three-dimensional target features and the three-dimensional semantic features into a 3D backbone network to obtain extracted features output by the 3D backbone network; and performing feature fusion on the extracted features, the three-dimensional target features and the three-dimensional semantic features, and completing three-dimensional semantic scene completion by using the fused features. Therefore, in the embodiment, an additional depth enhancement mode is introduced, the SSC precision is improved by complementing holes and denoising an original depth, the conversion from a two-dimensional target feature and a two-dimensional semantic feature to a three-dimensional feature is completed through a 2D-3D preset projection layer, the extracted feature in the three-dimensional feature is extracted by using a 3D backbone network, the feature fusion is performed by using the two-dimensional target feature, the two-dimensional semantic feature and the extracted feature, the complemented depth prior of the whole process is deeply excavated, the depth fusion with the semantic prior is explored, the final fused feature is used for performing accurate prediction, and the three-dimensional semantic scene complementation is completed.
Further, an electronic device is also disclosed in the embodiments of the present application, fig. 7 is a block diagram of the electronic device 20 shown in the exemplary embodiments, and the content in the diagram cannot be considered as any limitation to the scope of the application.
Fig. 7 is a schematic structural diagram of an electronic device 20 according to an embodiment of the present disclosure. The electronic device 20 may specifically include: at least one processor 21, at least one memory 22, a power supply 23, a display 24, an input-output interface 25, a communication interface 26, and a communication bus 27. The memory 22 is configured to store a computer program, and the computer program is loaded and executed by the processor 21 to implement relevant steps in the three-dimensional semantic scene completing method disclosed in any one of the foregoing embodiments. In addition, the electronic device 20 in the present embodiment may be specifically an electronic computer.
In this embodiment, the power supply 23 is configured to provide a working voltage for each hardware device on the electronic device 20; the communication interface 26 can create a data transmission channel between the electronic device 20 and an external device, and a communication protocol followed by the communication interface is any communication protocol applicable to the technical solution of the present application, and is not specifically limited herein; the input/output interface 25 is configured to obtain external input data or output data to the outside, and a specific interface type thereof may be selected according to specific application requirements, which is not specifically limited herein.
In addition, the storage 22 may be a carrier for storing resources, such as a read-only memory, a random access memory, a magnetic disk or an optical disk, etc., the resources stored thereon may include an operating system 221, a computer program 222, virtual machine data 223, etc., and the virtual machine data 223 may include various data. The storage means may be a transient storage or a permanent storage.
The operating system 221 is used for managing and controlling each hardware device on the electronic device 20 and the computer program 222, and may be Windows Server, netware, unix, linux, or the like. The computer programs 222 may further include computer programs that can be used to perform other specific tasks in addition to the computer programs that can be used to perform the three-dimensional semantic scene completion method performed by the electronic device 20 disclosed in any of the foregoing embodiments.
Further, the present application discloses a computer-readable storage medium, wherein the computer-readable storage medium includes a Random Access Memory (RAM), a Memory, a Read-Only Memory (ROM), an electrically programmable ROM, an electrically erasable programmable ROM, a register, a hard disk, a magnetic disk, or an optical disk, or any other form of storage medium known in the art. Wherein the computer program, when executed by a processor, implements the three-dimensional semantic scene completion method disclosed above. For the specific steps of the method, reference may be made to corresponding contents disclosed in the foregoing embodiments, and details are not repeated here.
The embodiments are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same or similar parts among the embodiments are referred to each other. The device disclosed by the embodiment corresponds to the method disclosed by the embodiment, so that the description is simple, and the relevant points can be referred to the method part for description. Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in Random Access Memory (RAM), memory, read Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a … …" does not exclude the presence of another identical element in a process, method, article, or apparatus that comprises the element.
The method, the device, the equipment and the storage medium for completing the three-dimensional semantic scene provided by the invention are described in detail, a specific example is applied in the method to explain the principle and the implementation mode of the invention, and the description of the embodiment is only used for helping to understand the method and the core idea of the invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present invention.

Claims (10)

1. A three-dimensional semantic scene completion method is characterized by comprising the following steps:
performing depth completion on a depth image in a target RGB-D image by using a depth estimation image corresponding to a two-dimensional image in the target RGB-D image containing a three-dimensional scene to determine a completed depth image, and extracting two-dimensional target features in the completed depth image by using a preset feature extractor;
inputting the two-dimensional target feature and a two-dimensional semantic feature obtained by semantically segmenting the two-dimensional image into a preset projection layer to obtain a three-dimensional target feature and a three-dimensional semantic feature which are output by the preset projection layer and respectively correspond to the two-dimensional target feature and the two-dimensional semantic feature;
inputting the three-dimensional target features and the three-dimensional semantic features into a 3D backbone network to obtain extracted features output by the 3D backbone network;
and performing feature fusion on the extracted features, the three-dimensional target features and the three-dimensional semantic features, and completing three-dimensional semantic scene completion by using the fused features.
2. The method for completing three-dimensional semantic scene according to claim 1, wherein the extracting two-dimensional target features in the depth image after completing by using a preset feature extractor comprises:
and extracting the two-dimensional target features in the supplemented depth image by using a preset feature extractor consisting of preset two-dimensional convolution layers and a preset number of cascaded dimension decomposition residual modules.
3. The method for completing three-dimensional semantic scene according to claim 1, wherein the step of inputting the two-dimensional target feature and the two-dimensional semantic feature obtained by semantically segmenting the two-dimensional image into a preset projection layer to obtain a three-dimensional target feature and a three-dimensional semantic feature, which are output by the preset projection layer and respectively correspond to the two-dimensional target feature and the two-dimensional semantic feature, comprises:
inputting the two-dimensional target feature, the two-dimensional semantic feature obtained by performing semantic segmentation on the two-dimensional image and the supplemented depth image into a preset projection layer so as to determine a three-dimensional surface voxel corresponding to each two-dimensional pixel in the supplemented depth image based on a camera projection equation predefined in the preset projection layer, establishing a three-dimensional space based on the three-dimensional surface voxel, discretizing the three-dimensional space, determining a projection coefficient between the supplemented depth image and the discretized three-dimensional space, and determining the three-dimensional target feature and the three-dimensional semantic feature respectively corresponding to the two-dimensional target feature and the two-dimensional semantic feature based on the projection coefficient.
4. The method for completing three-dimensional semantic scene according to claim 1, wherein the inputting the three-dimensional target feature and the three-dimensional semantic feature into a 3D backbone network to obtain the extracted features output by the 3D backbone network comprises:
and inputting the three-dimensional target feature and the three-dimensional semantic feature into a 3D SATNet-TNet so as to obtain an extracted feature output by the 3D SATNet-TNet.
5. The method for completing three-dimensional semantic scene according to any one of claims 1 to 4, wherein the feature fusion of the extracted features, the three-dimensional target features and the three-dimensional semantic features comprises:
performing feature fusion on the extracted features and the three-dimensional target features by using a preset feature fusion method to obtain a target matrix; each element in the target matrix represents the probability that the corresponding voxel is occupied in three-dimensional space;
determining a predicted coarse feature based on the extracted features;
encoding the three-dimensional semantic features based on a preset encoding function to determine encoded features;
and adding the predicted coarse features and the encoded features, performing pixel-level multiplication between the added features and the target matrix, and taking the result of the pixel-level multiplication as the fused features.
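A minimal sketch of the fusion recited in claim 5, assuming the occupancy head, the coarse prediction head and the preset encoding function are available as callables; their names, the tensor shapes noted in the comments and the [0, 1] range of the target matrix are assumptions.

    import torch

    def fuse_features(extracted, feat3d_target, feat3d_semantic,
                      occupancy_head, coarse_head, encode):
        # Target matrix: per-voxel probability of being occupied (values assumed in [0, 1]).
        target_matrix = occupancy_head(extracted, feat3d_target)      # (B, 1, D, H, W)
        # Predicted coarse features derived from the extracted features.
        coarse = coarse_head(extracted)                               # (B, num_classes, D, H, W)
        # Encoded features produced by the preset encoding function.
        encoded = encode(feat3d_semantic)                             # (B, num_classes, D, H, W)
        # Add, then multiply pixel-wise (voxel-wise) with the target matrix.
        fused = (coarse + encoded) * target_matrix
        return fused, target_matrix, coarse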
6. The three-dimensional semantic scene completion method according to claim 5, wherein the performing of feature fusion on the extracted features and the three-dimensional target features by using a preset feature fusion method to obtain a target matrix comprises:
expanding the three-dimensional target features to determine first features to be concatenated, and inputting the extracted features into a preset conversion module to obtain second features to be concatenated that are output by the conversion module, wherein the preset conversion module consists of a three-dimensional convolution layer, a regularization layer and an activation function layer;
concatenating the first features to be concatenated with the second features to be concatenated to obtain concatenated features;
and inputting the concatenated features into the conversion module, connecting the conversion module to a preset convolution layer, and applying a preset activation function after the convolution layer to generate the target matrix.
7. The three-dimensional semantic scene completion method according to claim 6, further comprising:
constructing a binary cross entropy loss function as a first auxiliary loss function to optimize the target matrix;
constructing a first cross entropy loss function as a second auxiliary loss function to optimize the predicted coarse features;
constructing a second cross entropy loss function as a main loss function to optimize the fused features;
constructing a total loss function using the first auxiliary loss function, the second auxiliary loss function, and the main loss function.
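For illustration, a sketch of the loss terms in claim 7, assuming per-voxel semantic labels and a binary occupancy ground truth; the loss weights and the ignore index are assumptions.

    import torch.nn.functional as F

    def total_loss(target_matrix, occupancy_gt, coarse, fused, labels,
                   w_occ=0.5, w_coarse=0.5, ignore_index=255):
        # First auxiliary loss: binary cross entropy on the target matrix.
        l_occ = F.binary_cross_entropy(target_matrix, occupancy_gt)
        # Second auxiliary loss: cross entropy on the predicted coarse features.
        l_coarse = F.cross_entropy(coarse, labels, ignore_index=ignore_index)
        # Main loss: cross entropy on the fused features.
        l_main = F.cross_entropy(fused, labels, ignore_index=ignore_index)
        # Total loss: weighted sum of the main and auxiliary losses (weights are assumptions).
        return l_main + w_occ * l_occ + w_coarse * l_coarse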
8. A three-dimensional semantic scene completion device, comprising:
a depth enhancement module, configured to perform depth completion on a depth image in a target RGB-D image containing a three-dimensional scene by using a depth estimation image corresponding to a two-dimensional image in the target RGB-D image so as to determine a completed depth image, and to extract two-dimensional target features from the completed depth image by using a preset feature extractor;
a feature projection module, configured to input the two-dimensional target feature and a two-dimensional semantic feature obtained by semantically segmenting the two-dimensional image into a preset projection layer so as to obtain a three-dimensional target feature and a three-dimensional semantic feature which are output by the preset projection layer and respectively correspond to the two-dimensional target feature and the two-dimensional semantic feature;
a feature extraction module, configured to input the three-dimensional target feature and the three-dimensional semantic feature into a 3D backbone network so as to obtain extracted features output by the 3D backbone network;
and a feature fusion module, configured to perform feature fusion on the extracted features, the three-dimensional target features and the three-dimensional semantic features, and to perform three-dimensional semantic scene completion by using the fused features.
9. An electronic device comprising a processor and a memory; wherein the processor, when executing the computer program stored in the memory, implements the three-dimensional semantic scene completion method according to any one of claims 1 to 7.
10. A computer-readable storage medium for storing a computer program; wherein the computer program, when executed by a processor, implements the three-dimensional semantic scene completion method of any one of claims 1 to 7.
CN202211371118.6A 2022-11-03 2022-11-03 Three-dimensional semantic scene completion method, device, equipment and medium Pending CN115631489A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211371118.6A CN115631489A (en) 2022-11-03 2022-11-03 Three-dimensional semantic scene completion method, device, equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211371118.6A CN115631489A (en) 2022-11-03 2022-11-03 Three-dimensional semantic scene completion method, device, equipment and medium

Publications (1)

Publication Number Publication Date
CN115631489A true CN115631489A (en) 2023-01-20

Family

ID=84908926

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211371118.6A Pending CN115631489A (en) 2022-11-03 2022-11-03 Three-dimensional semantic scene completion method, device, equipment and medium

Country Status (1)

Country Link
CN (1) CN115631489A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117422629A (en) * 2023-12-19 2024-01-19 华南理工大学 Instance-aware monocular semantic scene completion method, medium and device
CN117422629B (en) * 2023-12-19 2024-04-26 华南理工大学 Instance-aware monocular semantic scene completion method, medium and device

Similar Documents

Publication Publication Date Title
Zhang et al. Ga-net: Guided aggregation net for end-to-end stereo matching
CN112927357B (en) 3D object reconstruction method based on dynamic graph network
CN109086683B (en) Human hand posture regression method and system based on point cloud semantic enhancement
WO2021249255A1 (en) Grabbing detection method based on rp-resnet
CN110378338A (en) A kind of text recognition method, device, electronic equipment and storage medium
CN108596919B (en) Automatic image segmentation method based on depth map
CN109005398B (en) Stereo image parallax matching method based on convolutional neural network
CN108648161A (en) The binocular vision obstacle detection system and method for asymmetric nuclear convolutional neural networks
US20220261659A1 (en) Method and Apparatus for Determining Neural Network
CN113487664B (en) Three-dimensional scene perception method, three-dimensional scene perception device, electronic equipment, robot and medium
CN113486887B (en) Target detection method and device in three-dimensional scene
CN115330940B (en) Three-dimensional reconstruction method, device, equipment and medium
CN112287824A (en) Binocular vision-based three-dimensional target detection method, device and system
CN113222033A (en) Monocular image estimation method based on multi-classification regression model and self-attention mechanism
CN115631489A (en) Three-dimensional semantic scene completion method, device, equipment and medium
CN114780768A (en) Visual question-answering task processing method and system, electronic equipment and storage medium
CN108986210B (en) Method and device for reconstructing three-dimensional scene
CN112508821B (en) Stereoscopic vision virtual image hole filling method based on directional regression loss function
CN108921852B (en) Double-branch outdoor unstructured terrain segmentation network based on parallax and plane fitting
CN116168393B (en) Automatic semantic annotation data generation method and device based on point cloud neural radiation field
CN113920270B (en) Layout reconstruction method and system based on multi-view panorama
CN116229448A (en) Three-dimensional target detection method, device, equipment and readable storage medium
CN116228850A (en) Object posture estimation method, device, electronic equipment and readable storage medium
CN114615505A (en) Point cloud attribute compression method and device based on depth entropy coding and storage medium
CN114723809A (en) Method and device for estimating object posture and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination