CN116630953A - Monocular image 3D object detection method based on neural volume rendering - Google Patents

Monocular image 3D object detection method based on neural volume rendering

Info

Publication number
CN116630953A
CN116630953A (application CN202310432912.5A)
Authority
CN
China
Prior art keywords
features
feature
image
depth
volume rendering
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310432912.5A
Other languages
Chinese (zh)
Inventor
徐骏凯
彭亮
程浩然
钱炜
杨政
何晓飞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Fabu Technology Co Ltd
Original Assignee
Hangzhou Fabu Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Fabu Technology Co Ltd filed Critical Hangzhou Fabu Technology Co Ltd
Priority to CN202310432912.5A priority Critical patent/CN116630953A/en
Publication of CN116630953A publication Critical patent/CN116630953A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/60Type of objects
    • G06V20/64Three-dimensional objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/09Supervised learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Multimedia (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses a monocular image 3D object detection method based on neural volume rendering. An original input RGB image is first passed through a 2D image backbone network to extract 2D image features; positional frustum features are obtained by multi-plane image interval sampling; the 2D image features and the positional frustum features are fused to obtain position-aware frustum features; the position-aware frustum features are processed with a 3D convolutional network to build neural volume rendering features based on a signed distance field; optimized neural volume rendering features are obtained from the neural volume rendering features together with the reconstruction consistency constraint of neural volume rendering and the zero level set constraint of the signed distance field; finally, 3D detection features are obtained by grid sampling and fed into a generic object detection head to obtain the 3D object detection result. The invention proposes, for the first time, a monocular image 3D object detection method based on neural volume rendering, which can effectively perform the monocular image 3D object detection task while simultaneously predicting 3D occupancy, and in particular achieves high detection accuracy.

Description

Monocular image 3D object detection method based on neural volume rendering
Technical Field
The invention relates to an image processing method in computer vision and autonomous driving, and in particular to a monocular 3D object detection method based on neural volume rendering.
Background
Monocular three-dimensional object detection is one of the most important and challenging problems in computer vision. Given a single image, the task is to detect objects of the categories of interest and to output their position and size in three-dimensional space. An effective monocular three-dimensional object detector can be widely applied in vision-based perception, such as autonomous vehicles and security robots, and can also provide data support for training other downstream computer vision tasks.
In recent years, monocular three-dimensional object detection has made remarkable progress. However, existing monocular methods typically rely on depth estimation to establish the relationship between the 2D image and 3D space.
The 3D representations currently obtained by converting explicit depth estimates have several limitations.
First, 3D features obtained from depth estimation or pseudo-LiDAR are non-uniformly distributed in 3D space: they are dense at close range, but the density decreases as the distance increases.
Second, the final 3D object detection performance depends heavily on the accuracy of the depth estimation, which remains a significant challenge.
Therefore, this kind of representation cannot provide dense and reasonable 3D features for the monocular image 3D object detection task.
In monocular three-dimensional object detection, some approaches enhance the detector with geometric cues of the scene. However, many existing methods exploit these cues explicitly, for example by estimating a depth map and back-projecting it into 3D space. Such explicit methods lead to a sparse 3D representation, because the dimensionality increases from 2D to 3D, and they lose a large amount of information, especially for distant and occluded objects.
Disclosure of Invention
To solve and alleviate the problems described in the background, the invention provides a monocular image 3D object detection method based on neural volume rendering, which can effectively perform the monocular image 3D object detection task while simultaneously generating 3D occupancy.
The invention re-expresses the intermediate 3D representation in monocular image 3D object detection as a 3D representation similar to a neural radiance field (NeRF), thereby producing dense and reasonable 3D geometry and occupancy information, and performs 3D object detection based on this information.
The technical scheme adopted by the invention is as follows:
1) For an original input RGB image, 2D image features are first extracted by a 2D image backbone network;
the 2D image backbone network adopts a ResNet-34 neural network.
2) Three-dimensional frustum coordinates corresponding to the input RGB image are obtained by multi-plane image interval sampling and normalized to serve as positional frustum features;
3) The 2D image features and the positional frustum features are fused by a query-key-value mechanism to obtain position-aware frustum features;
4) The position-aware frustum features are processed with a 3D convolutional network to build neural volume rendering features based on a signed distance field;
5) Optimized neural volume rendering features are obtained from the neural volume rendering features together with the reconstruction consistency constraint of volume rendering and the zero level set constraint of the signed distance field;
6) The optimized neural volume rendering features are grid-sampled to obtain 3D detection features, which are fed into a generic object detection head to obtain the 3D object detection result.
All the above steps together constitute a complete monocular image 3D object detection network.
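To make the data flow of these steps concrete, the following PyTorch-style skeleton sketches one possible composition of the pipeline. It is only an illustration under assumed module interfaces (the patent does not publish reference code); every sub-module is injected rather than prescribed, and all names are hypothetical.

```python
import torch.nn as nn

class MonoNVRDetector(nn.Module):
    """Illustrative composition of steps 1)-6); all sub-modules are assumptions."""
    def __init__(self, backbone, frustum, fusion, sdf_net, feat_net,
                 density_fn, voxelizer, det_head):
        super().__init__()
        self.backbone = backbone      # step 1: 2D image backbone (e.g. ResNet-34)
        self.frustum = frustum        # step 2: multi-plane positional frustum features
        self.fusion = fusion          # step 3: query-key-value fusion
        self.sdf_net = sdf_net        # step 4: first 3D conv net  -> SDF features
        self.feat_net = feat_net      # step 4: second 3D conv net -> 3D intermediate features
        self.density_fn = density_fn  # step 4: Laplace CDF, SDF -> volume density
        self.voxelizer = voxelizer    # step 6: grid sampling + density weighting
        self.det_head = det_head      # step 6: generic 3D object detection head

    def forward(self, image):
        f_img = self.backbone(image)            # 2D image features
        f_pos = self.frustum(f_img)             # positional frustum features
        f_p = self.fusion(f_pos, f_img)         # position-aware frustum features
        f_sdf = self.sdf_net(f_p)               # signed distance field features
        f_3d = self.feat_net(f_p)               # 3D intermediate features
        f_density = self.density_fn(f_sdf)      # volume density features
        v_3d = self.voxelizer(f_3d, f_density)  # 3D detection features
        return self.det_head(v_3d)              # 3D boxes (and 3D occupancy)
```

The rendering losses of step 5) act only at training time, so they do not appear in this inference-time forward pass.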
Step 2) is specifically as follows: a depth range is established according to a predefined near depth z_n and far depth z_f, i.e. from the near depth z_n to the far depth z_f, and D depths are sampled from this range at equal depth intervals with random perturbation. Each sampled depth defines a depth plane, and each depth plane is divided into a number of grid pixels at the same resolution (H×W) as the 2D image features of step 1), i.e. it has H×W pixel grids. Each grid pixel takes its own plane coordinates on its depth plane, together with the depth of that plane, as its three-dimensional coordinates. The three-dimensional coordinates of all grid pixels on all depth planes are normalized to obtain a three-dimensional matrix forming the 3D representation, which serves as the positional frustum features.
In each depth plane, the frustum 3D position coordinates p = [u, v, z]^T of each grid pixel are formed by stacking the pixel's own plane coordinates [u, v]^T on the depth plane with the depth [z]^T of that plane.
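A minimal sketch of this multi-plane sampling is given below, assuming PyTorch, equal depth intervals with a uniform random perturbation inside each interval, and normalization of u, v, z to [0, 1]; the concrete depth range, resolution, and normalization scheme are illustrative assumptions.

```python
import torch

def build_position_frustum(height, width, z_near=2.0, z_far=60.0,
                           num_depths=64, perturb=True):
    """Build normalized positional frustum features of shape (3, D, H, W).

    Each of the D depth planes shares the (H, W) resolution of the 2D image
    features; every grid pixel gets [u, v, z] as its 3D frustum coordinates.
    """
    # D depths at equal intervals between z_near and z_far, with random perturbation
    step = (z_far - z_near) / num_depths
    z = z_near + step * (torch.arange(num_depths, dtype=torch.float32) + 0.5)
    if perturb:
        z = z + (torch.rand(num_depths) - 0.5) * step

    # plane coordinates of the H x W pixel grid, shared by all depth planes
    v, u = torch.meshgrid(torch.arange(height, dtype=torch.float32),
                          torch.arange(width, dtype=torch.float32), indexing="ij")

    u = u[None].expand(num_depths, -1, -1) / max(width - 1, 1)     # u in [0, 1]
    v = v[None].expand(num_depths, -1, -1) / max(height - 1, 1)    # v in [0, 1]
    z = (z[:, None, None].expand(-1, height, width) - z_near) / (z_far - z_near)

    return torch.stack([u, v, z], dim=0)       # positional frustum features (3, D, H, W)

f_pos = build_position_frustum(height=24, width=80)   # e.g. a 1/16-resolution feature map
print(f_pos.shape)                                     # torch.Size([3, 64, 24, 80])
```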
The query-key-value method of step 3) specifically comprises a position query module, a key mapping module, a value mapping module, and a fusion module containing a softmax function, which together fuse the positional frustum features and the 2D image features to construct the position-aware frustum features;
the positional frustum features are input into the position query module to obtain the position query features Q, and the 2D image features are input into the key mapping module and the value mapping module to obtain the key mapping features K and the value mapping features V, respectively;
the position query features Q and the key mapping features K are then multiplied (element-wise at corresponding positions), the product is scaled along the depth dimension with a softmax function, and the scaled result is multiplied with the value mapping features V to obtain the position-aware frustum features.
The position query module f_q is:
f_q: F_pos → Q
where F_pos is the positional frustum features and Q is the position query features; the learnable function f_q implements the mapping from F_pos to Q;
the key mapping module f_k is:
f_k: F_image → K
where F_image is the 2D image features and K is the key mapping features; the learnable function f_k implements the mapping from F_image to K;
the value mapping module f_v is:
f_v: F_image → V
where F_image is the 2D image features and V is the value mapping features; the learnable function f_v implements the mapping from F_image to V;
the fusion module is:
F_P = Softmax(QK, dim=D) V
where F_P is the position-aware frustum features, Q, K, and V are respectively the position query features, key mapping features, and value mapping features defined above, dim denotes the depth dimension, and Softmax(·, dim=D) denotes the softmax computed along the depth sampling axis (the dimension of the D sampled depths).
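A minimal sketch of this fusion is shown below, assuming PyTorch and modelling f_q, f_k, and f_v as 1×1 convolutions; the embedding width and the exact form of the learnable mappings are assumptions, and only the softmax over the depth axis follows the formula above.

```python
import torch
import torch.nn as nn

class PositionAwareFusion(nn.Module):
    """Fuse positional frustum features with 2D image features via query-key-value:
    F_P = Softmax(Q * K, dim=D) * V, with the softmax taken along the depth axis."""
    def __init__(self, img_channels, embed_channels, pos_channels=3):
        super().__init__()
        self.f_q = nn.Conv3d(pos_channels, embed_channels, kernel_size=1)  # position query
        self.f_k = nn.Conv2d(img_channels, embed_channels, kernel_size=1)  # key mapping
        self.f_v = nn.Conv2d(img_channels, embed_channels, kernel_size=1)  # value mapping

    def forward(self, f_pos, f_image):
        # f_pos:   (B, 3, D, H, W) positional frustum features
        # f_image: (B, C, H, W)    2D image features
        q = self.f_q(f_pos)                    # (B, E, D, H, W)
        k = self.f_k(f_image).unsqueeze(2)     # (B, E, 1, H, W), broadcast over depth
        v = self.f_v(f_image).unsqueeze(2)     # (B, E, 1, H, W)
        attn = torch.softmax(q * k, dim=2)     # scale along the depth sampling axis D
        return attn * v                        # position-aware frustum features (B, E, D, H, W)

fusion = PositionAwareFusion(img_channels=64, embed_channels=64)
f_p = fusion(torch.randn(1, 3, 64, 24, 80), torch.randn(1, 64, 24, 80))
print(f_p.shape)                               # torch.Size([1, 64, 64, 24, 80])
```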
In step 4), the neural volume rendering features based on the signed distance field specifically comprise four parts: signed distance field features, volume density features, 3D intermediate features, and RGB color features, realized respectively by three 3D convolutional networks with learnable parameters and a Laplace cumulative distribution function with learnable parameters:
after the position-aware frustum features are constructed, they are input into a first 3D convolutional network, one dimension of whose output 3D features is taken as the signed distance field (SDF) features; they are also input into a second 3D convolutional network, whose output features in the remaining dimensions (excluding the signed distance field features) are taken as the 3D intermediate features; the signed distance field features are then converted into the volume density features required for volume rendering through a Laplace cumulative distribution function.
The first 3D convolutional network is:
f_1: F_P → F_sdf
where F_P is the position-aware frustum features and F_sdf is the signed distance field features; the 3D convolutional network f_1 with learnable parameters implements the mapping from F_P to F_sdf;
the second 3D convolutional network is:
f_2: F_P → F_3D
where F_P is the position-aware frustum features and F_3D is the 3D intermediate features; the 3D convolutional network f_2 with learnable parameters implements the mapping from F_P to F_3D;
the Laplace cumulative distribution function is:
αΨ_β: F_sdf → F_density
where Ψ_β is the Laplace cumulative distribution function with zero mean and scale β, α and β are respectively the first and second learnable parameters, F_sdf is the signed distance field features, and F_density is the volume density features; Ψ_β implements the mapping from signed distance to volume density, forming a homogeneous volume density.
The signed distance field (SDF) features consist of a grid of small cells, each cell being an element whose value is a scalar. This encodes the position-aware frustum features into three-dimensional space, yielding more informative three-dimensional features for neural volume rendering.
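The SDF-to-density conversion can be sketched as below, assuming PyTorch and the common VolSDF-style convention that the Laplace CDF is evaluated on the negated signed distance (inside the surface maps to high density); the sign convention and the initial values of α and β are assumptions not stated in the patent.

```python
import torch
import torch.nn as nn

class LaplaceDensity(nn.Module):
    """Convert SDF features to volume density features: F_density = alpha * Psi_beta(-F_sdf),
    where Psi_beta is the CDF of a zero-mean Laplace distribution with scale beta."""
    def __init__(self, init_alpha=1.0, init_beta=0.1):
        super().__init__()
        self.alpha = nn.Parameter(torch.tensor(float(init_alpha)))  # first learnable parameter
        self.beta = nn.Parameter(torch.tensor(float(init_beta)))    # second learnable parameter

    def forward(self, f_sdf):
        beta = self.beta.abs() + 1e-6                    # keep the Laplace scale positive
        laplace = torch.distributions.Laplace(loc=0.0, scale=beta)
        # negative SDF (inside the surface) -> CDF close to 1 -> high density (assumption)
        return self.alpha.abs() * laplace.cdf(-f_sdf)    # volume density features F_density

density_fn = LaplaceDensity()
f_density = density_fn(torch.randn(1, 1, 64, 24, 80))   # SDF features -> density features
```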
In the method, loss constraints are established by combining the volume density features and the 3D intermediate features for supervised training of the whole detection network, specifically:
before training supervision, each RGB image is accompanied by sparse LiDAR points forming a point cloud, which serves as the depth data and the training labels of the RGB image.
The 3D intermediate features are input into a third 3D convolutional network to obtain the corresponding radiance field as the RGB color features; once the volume density features and the RGB color features are obtained, a reconstructed RGB image and a corresponding reconstructed depth map are recovered from 3D space by volume rendering;
the third 3D convolutional network is:
f_3: F_3D → F_RGB
where F_3D is the 3D intermediate features and F_RGB is the RGB color features; the 3D convolutional network f_3 with learnable parameters implements the mapping from F_3D to F_RGB;
an original depth map is obtained by projecting the sparse LiDAR points accompanying the original input RGB image;
a color consistency loss is established between the reconstructed RGB image and the original input RGB image, and a sparse depth map consistency loss is established between the reconstructed depth map and the original depth map obtained by projecting the sparse LiDAR points; in addition, the following loss constraint is established: for each LiDAR point accompanying the original input RGB image, the element of the signed distance field features corresponding to its three-dimensional coordinates is located; a zero level set constraint requiring the values of the located elements to be 0 is established for the elements that are found, and no such constraint is applied where no element is found.
The LiDAR points of the original input RGB image are thus used to supervise the signed distance field so that the SDF element values corresponding to all LiDAR points are 0 (LiDAR points normally lie only on object surfaces).
In step 5), the model is trained in a self-supervised manner; when the loss function converges, model optimization is complete. The loss function is:
arg min λ_rgb L_rgb + λ_depth L_depth + λ_sdf L_sdf
where L_rgb denotes the color consistency loss between the reconstructed RGB image produced by volume rendering and the current original input RGB image, L_depth denotes the consistency loss between the reconstructed depth map produced by volume rendering and the sparse depth map constructed by projecting the LiDAR points onto the camera imaging plane, and L_sdf denotes the zero level set constraint; reconstruction consistency is ensured by L_rgb and L_depth, and the zero level set constraint of the signed distance field is ensured by L_sdf. λ_rgb, λ_depth, and λ_sdf are the adjustable weights of the color consistency loss, the sparse depth map consistency loss, and the zero level set constraint, respectively.
The individual losses are as follows:
the color consistency loss L_rgb between the reconstructed RGB image produced by volume rendering and the current original input RGB image comprises a smooth mean absolute error L_smoothL1 and a structural similarity error L_SSIM:
L_rgb = λ_smoothL1 L_smoothL1 + λ_SSIM L_SSIM
where λ_smoothL1 and λ_SSIM are the adjustable weights of the smooth mean absolute error L_smoothL1 and the structural similarity error L_SSIM, respectively;
the consistency loss L_depth between the reconstructed depth map produced by volume rendering and the sparse depth map constructed by projecting the LiDAR points onto the camera imaging plane is:
L_depth = (1 / N_depth) Σ ‖Ẑ − Z_gt‖_1
where N_depth denotes the number of valid points of the LiDAR point cloud after projection onto the camera imaging plane, Ẑ denotes the rendered reconstructed depth map, Z_gt denotes the sparse depth map obtained by projection, and ‖·‖_1 denotes the 1-norm;
for the zero level set constraint, the signed distance field describes the geometric surfaces in the scene by its set of zero values, and all LiDAR points lie on geometric surfaces, so the constraint L_sdf is:
L_sdf = (1 / N_gt) Σ_(x,y,z) ‖F_sdf(x, y, z)‖_2
where N_gt denotes the number of valid LiDAR points within the 3D object detection range, F_sdf is the signed distance field features, (x, y, z) denotes the three-dimensional coordinates corresponding to the valid LiDAR points, and ‖·‖_2 denotes the 2-norm.
Two reconstruction losses are thus established in the method: one between the reconstructed RGB image and the original input RGB image, and one between the reconstructed depth map and the original depth map.
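A minimal sketch of these three loss terms is given below, assuming PyTorch; the SSIM term is passed in as a callable because its exact window and configuration are not specified in the patent, and the convention that unobserved pixels of the sparse depth map hold 0 is an assumption.

```python
import torch
import torch.nn.functional as F

def rendering_losses(rgb_pred, rgb_gt, depth_pred, depth_gt_sparse, sdf_at_lidar,
                     w_rgb=1.0, w_depth=1.0, w_sdf=1.0,
                     w_smooth_l1=1.0, w_ssim=1.0, ssim_fn=None):
    """Self-supervised losses of step 5): colour, sparse depth, and SDF zero level set."""
    # colour consistency: smooth mean absolute error (+ optional SSIM term)
    l_rgb = w_smooth_l1 * F.smooth_l1_loss(rgb_pred, rgb_gt)
    if ssim_fn is not None:
        l_rgb = l_rgb + w_ssim * (1.0 - ssim_fn(rgb_pred, rgb_gt))

    # sparse depth consistency: 1-norm over the N_depth valid projected LiDAR pixels
    valid = depth_gt_sparse > 0
    n_depth = valid.sum().clamp(min=1)
    l_depth = (depth_pred[valid] - depth_gt_sparse[valid]).abs().sum() / n_depth

    # zero level set: SDF values sampled at the N_gt valid LiDAR coordinates should be 0
    if sdf_at_lidar.numel() > 0:
        l_sdf = sdf_at_lidar.abs().mean()
    else:
        l_sdf = depth_pred.new_zeros(())

    return w_rgb * l_rgb + w_depth * l_depth + w_sdf * l_sdf

loss = rendering_losses(torch.rand(1, 3, 24, 80), torch.rand(1, 3, 24, 80),
                        torch.rand(1, 24, 80), torch.rand(1, 24, 80),
                        torch.randn(100))
```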
The numerical integration formulas of the volume rendering used in step 5) are:
Ĉ(r) = Σ_{i=1..D} T_i · (1 − exp(−σ(z_i) · δ_i)) · c(z_i)
D̂(r) = Σ_{i=1..D} T_i · (1 − exp(−σ(z_i) · δ_i)) · z_i
T_i = exp(−Σ_{j=1..i−1} σ(z_j) · δ_j), δ_j = z_{j+1} − z_j
where r denotes a ray emitted from the camera optical center, Ĉ(r) denotes the RGB color value of ray r on the image, D̂(r) denotes the depth value of ray r on the image, T_i denotes the accumulated transmittance of ray r from depth plane 1 to depth plane i, D is the number of sampled depth planes, z_i denotes the depth value corresponding to depth plane i, σ(z_i) denotes the volume density sampled from the volume density features F_density at z_i, c(z_i) denotes the RGB color sampled from the RGB color features F_RGB at z_i, and δ_j denotes the depth interval between depth plane j and depth plane j+1.
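The discrete rendering can be sketched as below, assuming PyTorch, per-pixel rays aligned with the depth planes, and the last depth interval repeated for δ_D; these are assumptions about details the formulas above leave open.

```python
import torch

def volume_render(density, rgb, z_vals):
    """Discrete volume rendering along D depth planes.

    density: (B, D, H, W)    sigma(z_i) sampled from F_density
    rgb:     (B, D, H, W, 3) c(z_i) sampled from F_RGB
    z_vals:  (D,)            depth of each depth plane
    Returns the reconstructed RGB image (B, H, W, 3) and depth map (B, H, W).
    """
    # depth intervals delta_j = z_{j+1} - z_j (last interval repeated, an assumption)
    deltas = torch.cat([z_vals[1:] - z_vals[:-1], (z_vals[-1] - z_vals[-2]).view(1)])
    deltas = deltas.view(1, -1, 1, 1)                              # (1, D, 1, 1)

    alpha = 1.0 - torch.exp(-density * deltas)                     # per-plane opacity
    # accumulated transmittance T_i = exp(-sum_{j<i} sigma(z_j) * delta_j)
    trans = torch.cumprod(torch.cat([torch.ones_like(alpha[:, :1]),
                                     1.0 - alpha + 1e-10], dim=1), dim=1)[:, :-1]
    weights = trans * alpha                                        # (B, D, H, W)

    rgb_map = (weights.unsqueeze(-1) * rgb).sum(dim=1)             # reconstructed RGB image
    depth_map = (weights * z_vals.view(1, -1, 1, 1)).sum(dim=1)    # reconstructed depth map
    return rgb_map, depth_map

z = torch.linspace(2.0, 60.0, 64)
rgb_map, depth_map = volume_render(torch.rand(1, 64, 24, 80),
                                   torch.rand(1, 64, 24, 80, 3), z)
```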
In step 6), feature voxels V_f are obtained by sampling the 3D intermediate features at predefined voxel-space coordinates, and density voxels V_density are obtained by sampling the volume density features at the same predefined voxel-space coordinates. The feature voxels V_f are weighted by the density voxels V_density to obtain the 3D detection features that are finally fed into the generic object detection head, and the 3D detection features are input into the 3D detection head to predict the detection result. Specifically:
V_density = G(F_density)
V_f = G(F_3D)
V_3D = V_f · tanh(V_density)
where G(·) denotes the grid sampling operation, tanh(·) denotes the hyperbolic tangent activation function used to scale the density voxels, and V_3D is the 3D detection features finally fed into the generic object detection head.
The neural volume rendering features input to the generic object detection head thus comprise the density voxels V_density and the feature voxels V_f, obtained by sampling the volume density features F_density and the 3D intermediate features F_3D via a predefined 3D grid.
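A minimal sketch of this voxelisation is shown below, assuming PyTorch's F.grid_sample with voxel coordinates already normalised to [−1, 1]; the voxel grid size and coordinate normalisation are assumptions.

```python
import torch
import torch.nn.functional as F

def build_detection_voxels(f_3d, f_density, voxel_coords):
    """Grid-sample the neural volume rendering features into 3D detection features.

    f_3d:         (B, C, D, H, W)  3D intermediate features
    f_density:    (B, 1, D, H, W)  volume density features
    voxel_coords: (B, Z, Y, X, 3)  predefined voxel-space coordinates in [-1, 1]
    Returns V_3D = V_f * tanh(V_density), shape (B, C, Z, Y, X).
    """
    v_f = F.grid_sample(f_3d, voxel_coords, align_corners=False)            # feature voxels V_f
    v_density = F.grid_sample(f_density, voxel_coords, align_corners=False) # density voxels V_density
    return v_f * torch.tanh(v_density)                                      # 3D detection features V_3D

coords = torch.rand(1, 16, 100, 100, 3) * 2 - 1        # hypothetical 16 x 100 x 100 voxel grid
v_3d = build_detection_voxels(torch.randn(1, 63, 64, 24, 80),
                              torch.randn(1, 1, 64, 24, 80), coords)
print(v_3d.shape)                                      # torch.Size([1, 63, 16, 100, 100])
```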
The invention uses 3D features supervised by volume rendering: after volume-density weighting, they are input into a 3D detection head to predict the detection result.
The invention combines the extracted 2D image features with the corresponding normalized positional frustum features to construct position-aware frustum features fused with 3D position information, then uses the position-aware frustum features to create signed distance field (SDF) features and RGB color features, and further extracts weighted voxels for prediction to obtain the detection result.
The invention renders a reconstructed RGB image and depth map from the signed distance field features and the RGB color features by volume rendering, carries out supervised training by establishing losses against the original RGB image and the LiDAR points, and at the same time implicitly models the scene for optimization through the zero level set of the signed distance field.
In particular, the invention models the scene using signed distance functions (SDF), which facilitates the generation of dense neural volume rendering features. These neural volume rendering features are treated as neural radiance fields (NeRF), and classical volume rendering techniques are then used to reconstruct RGB images and depth maps. This design enables the network to infer dense 3D geometry and occupancy (3D occupancy), to effectively perform the monocular image 3D object detection task, and to generate 3D occupancy.
The beneficial effects of the invention are as follows:
the invention proposes, for the first time, a monocular image 3D object detection method based on neural volume rendering, maintains high detection accuracy, and can simultaneously predict 3D occupancy; the neural volume rendering features are optimized in a self-supervised manner during training, requiring no additional manual data annotation.
Drawings
FIG. 1 is a functional block diagram of an example of the present invention.
FIG. 2 is a flowchart of the steps of a monocular image 3D object detection method based on neural volume rendering according to an embodiment of the present invention.
Detailed Description
The invention is further described below with reference to the drawings and the detailed description.
1) For an original input RGB image, firstly extracting 2D image features through a 2D image backbone network;
the embodied 2D image backbone network employs a Resnet34 neural network.
2) Generating a 3D representation three-dimensional matrix of the near neural radiation field NeRF representation using the 2D image features as a positional view cone feature;
the step 2) specifically comprises the following steps: establishing a depth range according to a predefined near depth z_n and a far depth z_f, namely, sampling D depths from the depth range from the near depth z_n to the far depth z_f, wherein the sampling process uses equal depth intervals and random disturbance, each depth establishes a depth plane, and each depth plane is consistent with the 2D image feature resolution (H multiplied by W) of the step 1), namely, each grid pixel of the (H multiplied by W) pixel grid takes the plane coordinate of the grid pixel on the depth plane per se plus the depth of the depth plane as the three-dimensional coordinate of the grid pixel per se; and normalizing the three-dimensional coordinates of all grid pixels of all depth planes to obtain a 3D representation three-dimensional matrix serving as a position view cone feature.
In each depth plane, the view cone 3D position coordinates p= [ u, v, z ] ≡t of each grid pixel are composed of grid pixels superimposed with their own plane coordinates [ u, v ] ≡t on the depth plane with depth [ z ] ≡t.
3) The positional frustum features and the 2D image features are fused to construct the position-aware frustum features;
the query-key-value method specifically comprises a position query module, a key mapping module, a value mapping module, and a fusion module with a softmax function, which together fuse the positional frustum features and the 2D image features to construct the position-aware frustum features;
the position query module is:
f_q: F_pos → Q
where F_pos is the positional frustum features and Q is the position query features; the learnable function f_q implements the mapping from F_pos to Q.
The key mapping module is:
f_k: F_image → K
where F_image is the 2D image features and K is the key mapping features; the learnable function f_k implements the mapping from F_image to K.
The value mapping module is:
f_v: F_image → V
where F_image is the 2D image features and V is the value mapping features; the learnable function f_v implements the mapping from F_image to V.
The fusion module is:
F_P = Softmax(QK, dim=D) V
where F_P is the position-aware frustum features, Q, K, and V are the position query features, key mapping features, and value mapping features defined above, and Softmax(·, dim=D) denotes the softmax computed along the depth sampling axis (the dimension of the D sampled depths).
4) The position-aware frustum features are processed with 3D convolutional networks and a Laplace cumulative distribution function to obtain the corresponding neural volume rendering features, and optimized prediction is performed in combination with volume rendering to obtain the detection result.
The neural volume rendering features based on the signed distance field in step 4) specifically comprise four parts: signed distance field features, volume density features, 3D intermediate features, and RGB color features, realized respectively by three 3D convolutional networks with learnable parameters and one Laplace cumulative distribution function with learnable parameters.
The first 3D convolutional network is:
f_1: F_P → F_sdf
where F_P is the position-aware frustum features and F_sdf is the signed distance field features; the 3D convolutional network f_1 with learnable parameters implements the mapping from F_P to F_sdf.
The second 3D convolutional network is:
f_2: F_P → F_3D
where F_P is the position-aware frustum features and F_3D is the 3D intermediate features; the 3D convolutional network f_2 with learnable parameters implements the mapping from F_P to F_3D.
The third 3D convolutional network is:
f_3: F_3D → F_RGB
where F_3D is the 3D intermediate features and F_RGB is the RGB color features; the 3D convolutional network f_3 with learnable parameters implements the mapping from F_3D to F_RGB.
The Laplace cumulative distribution function is:
αΨ_β: F_sdf → F_density
where α and β are learnable parameters, Ψ_β is the Laplace cumulative distribution function with zero mean and scale β, F_sdf is the signed distance field features, and F_density is the volume density features; Ψ_β implements the mapping from signed distance to volume density, forming a homogeneous volume density.
5) Before 3D object detection is performed, the method further carries out training and learning, including self-supervised training of the volume rendering part and supervised training of the 3D object detection part. The supervised training of the 3D object detection part is a general technique unrelated to the innovation and is not described in detail here. The learning process of the volume rendering part is specifically:
the model can be trained in a self-supervised manner; when the loss function converges, model optimization is complete. The loss function is:
arg min λ_rgb L_rgb + λ_depth L_depth + λ_sdf L_sdf
The individual losses are as follows:
the color consistency loss L_rgb between the reconstructed RGB image produced by volume rendering and the current original input RGB image comprises a smooth mean absolute error L_smoothL1 and a structural similarity error L_SSIM, with λ_smoothL1 and λ_SSIM being their respective adjustable weights:
L_rgb = λ_smoothL1 L_smoothL1 + λ_SSIM L_SSIM
The consistency loss L_depth between the reconstructed depth map produced by volume rendering and the sparse depth map constructed by projecting the LiDAR points onto the camera imaging plane is:
L_depth = (1 / N_depth) Σ ‖Ẑ − Z_gt‖_1
For the zero level set constraint, the signed distance field describes the geometric surfaces in the scene by its set of zero values, and all LiDAR points lie on geometric surfaces, so the constraint L_sdf is:
L_sdf = (1 / N_gt) Σ_(x,y,z) ‖F_sdf(x, y, z)‖_2
6) The neural volume rendering features input to the generic object detection head comprise the density voxels V_density and the feature voxels V_f, obtained by sampling the volume density features F_density and the 3D intermediate features F_3D via a predefined 3D grid:
V_density = G(F_density)
V_f = G(F_3D)
V_3D = V_f · tanh(V_density)
where G(·) denotes the grid sampling operation, tanh(·) denotes the hyperbolic tangent activation function used to scale the density voxels, and V_3D is the 3D detection features finally fed into the generic object detection head.

Claims (7)

1. A monocular image 3D object detection method based on neural volume rendering, characterized in that the method comprises the following steps:
1) For an original input RGB image, firstly extracting 2D image features through a 2D image backbone network;
2) Three-dimensional frustum coordinates corresponding to the input RGB image are obtained by multi-plane image interval sampling and normalized to serve as positional frustum features;
3) The 2D image features and the positional frustum features are fused by a query-key-value mechanism to obtain position-aware frustum features;
4) The position-aware frustum features are processed with a 3D convolutional network to build neural volume rendering features based on a signed distance field;
5) Optimized neural volume rendering features are obtained from the neural volume rendering features together with the reconstruction consistency constraint of volume rendering and the zero level set constraint of the signed distance field;
6) The optimized neural volume rendering features are grid-sampled to obtain 3D detection features, which are fed into a generic object detection head to obtain the 3D object detection result.
2. The monocular image 3D object detection method based on neural volume rendering according to claim 1, characterized in that step 2) specifically comprises: a depth range is established according to a predefined near depth z_n and far depth z_f, and D depths are sampled from the depth range at equal depth intervals with random perturbation; each sampled depth defines a depth plane divided into a number of grid pixels, and each grid pixel takes its own plane coordinates on its depth plane, together with the depth of that plane, as its three-dimensional coordinates; the three-dimensional coordinates of all grid pixels on all depth planes are normalized to obtain a three-dimensional matrix serving as the positional frustum features.
3. The monocular image 3D object detection method based on neural volume rendering according to claim 2, characterized in that in each depth plane, the frustum 3D position coordinates p = [u, v, z]^T of each grid pixel are formed by stacking the pixel's own plane coordinates [u, v]^T on the depth plane with the depth [z]^T of that plane.
4. The monocular image 3D object detection method based on neural volume rendering according to claim 1, characterized in that the query-key-value method of step 3) specifically comprises a position query module, a key mapping module, a value mapping module, and a fusion module containing a softmax function, which together fuse the positional frustum features and the 2D image features to construct the position-aware frustum features;
the positional frustum features are input into the position query module to obtain the position query features Q, and the 2D image features are input into the key mapping module and the value mapping module to obtain the key mapping features K and the value mapping features V, respectively;
the position query features Q and the key mapping features K are then multiplied, the product is scaled along the depth dimension with a softmax function, and the scaled result is multiplied with the value mapping features V to obtain the position-aware frustum features; specifically:
the position query module f_q is:
f_q: F_pos → Q
where F_pos is the positional frustum features and Q is the position query features;
the key mapping module f_k is:
f_k: F_image → K
where F_image is the 2D image features and K is the key mapping features;
the value mapping module f_v is:
f_v: F_image → V
where F_image is the 2D image features and V is the value mapping features;
the fusion module is:
F_P = Softmax(QK, dim=D) V
where F_P is the position-aware frustum features, Q, K, and V are respectively the position query features, key mapping features, and value mapping features defined above, and dim denotes the depth dimension.
5. The monocular image 3D object detection method based on neural volume rendering according to claim 1, characterized in that:
in step 4), the neural volume rendering features based on the signed distance field are specifically:
the position-aware frustum features are input into a first 3D convolutional network, one dimension of whose output 3D features is taken as the signed distance field features; the position-aware frustum features are also input into a second 3D convolutional network, whose output features in the remaining dimensions (excluding the signed distance field features) are taken as the 3D intermediate features; the signed distance field features are converted into the volume density features required for volume rendering through a Laplace cumulative distribution function; specifically:
the first 3D convolutional network is:
f_1: F_P → F_sdf
where F_P is the position-aware frustum features and F_sdf is the signed distance field features;
the second 3D convolutional network is:
f_2: F_P → F_3D
where F_P is the position-aware frustum features and F_3D is the 3D intermediate features;
the Laplace cumulative distribution function is:
αΨ_β: F_sdf → F_density
where Ψ_β is the Laplace cumulative distribution function with zero mean and scale β, α and β are respectively the first and second learnable parameters, F_sdf is the signed distance field features, and F_density is the volume density features.
6. The monocular image 3D object detection method based on neural volume rendering according to claim 1 or 5, characterized in that:
in the method, loss constraints are established by combining the volume density features and the 3D intermediate features for supervised training, specifically:
the 3D intermediate features are input into a third 3D convolutional network to obtain the RGB color features; once the volume density features and the RGB color features are obtained, a reconstructed RGB image and a corresponding reconstructed depth map are recovered from 3D space by volume rendering;
the third 3D convolutional network is:
f_3: F_3D → F_RGB
where F_3D is the 3D intermediate features and F_RGB is the RGB color features;
an original depth map is obtained by projecting the LiDAR points accompanying the original input RGB image;
a color consistency loss is established between the reconstructed RGB image and the original input RGB image, a sparse depth map consistency loss is established between the reconstructed depth map and the original depth map obtained by LiDAR point projection, and the following loss constraint is also established: for each LiDAR point accompanying the original input RGB image, the element of the signed distance field features corresponding to its three-dimensional coordinates is located, and a zero level set constraint requiring the values of the located elements to be 0 is established.
7. The monocular image 3D object detection method based on neural volume rendering according to claim 6, characterized in that:
in step 6), feature voxels V_f are obtained by sampling the 3D intermediate features at predefined voxel-space coordinates, and density voxels V_density are obtained by sampling the volume density features at predefined voxel-space coordinates; the feature voxels V_f are weighted by the density voxels V_density to obtain the 3D detection features finally fed into the generic object detection head, and the 3D detection features are input into the 3D detection head to predict the detection result; specifically:
V_density = G(F_density)
V_f = G(F_3D)
V_3D = V_f · tanh(V_density)
where G(·) denotes the grid sampling operation, tanh(·) denotes the hyperbolic tangent activation function used to scale the density voxels, and V_3D is the 3D detection features finally fed into the generic object detection head.
CN202310432912.5A 2023-04-21 2023-04-21 Monocular image 3D object detection method based on neural volume rendering Pending CN116630953A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310432912.5A CN116630953A (en) 2023-04-21 2023-04-21 Monocular image 3D object detection method based on neural volume rendering

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310432912.5A CN116630953A (en) 2023-04-21 2023-04-21 Monocular image 3D object detection method based on neural volume rendering

Publications (1)

Publication Number Publication Date
CN116630953A true CN116630953A (en) 2023-08-22

Family

ID=87608973

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310432912.5A Pending CN116630953A (en) 2023-04-21 2023-04-21 Monocular image 3D target detection method based on nerve volume rendering

Country Status (1)

Country Link
CN (1) CN116630953A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118071999A (en) * 2024-04-17 2024-05-24 厦门大学 Multi-view 3D target detection method based on sampling self-adaption continuous NeRF


Similar Documents

Publication Publication Date Title
WO2021233029A1 (en) Simultaneous localization and mapping method, device, system and storage medium
CN109791697B (en) Predicting depth from image data using statistical models
CN110675418B (en) Target track optimization method based on DS evidence theory
CN111563415B (en) Binocular vision-based three-dimensional target detection system and method
CN110009674B (en) Monocular image depth of field real-time calculation method based on unsupervised depth learning
CN113052835B (en) Medicine box detection method and system based on three-dimensional point cloud and image data fusion
CN110689562A (en) Trajectory loop detection optimization method based on generation of countermeasure network
JP7166388B2 (en) License plate recognition method, license plate recognition model training method and apparatus
CN111899328B (en) Point cloud three-dimensional reconstruction method based on RGB data and generation countermeasure network
CN108876814B (en) Method for generating attitude flow image
CN111476242B (en) Laser point cloud semantic segmentation method and device
CN110910437B (en) Depth prediction method for complex indoor scene
CN113628348A (en) Method and equipment for determining viewpoint path in three-dimensional scene
CN112862736B (en) Real-time three-dimensional reconstruction and optimization method based on points
Rist et al. Scssnet: Learning spatially-conditioned scene segmentation on lidar point clouds
CN116612468A (en) Three-dimensional target detection method based on multi-mode fusion and depth attention mechanism
CN116630953A (en) Monocular image 3D object detection method based on neural volume rendering
CN114332796A (en) Multi-sensor fusion voxel characteristic map generation method and system
CN115909255B (en) Image generation and image segmentation methods, devices, equipment, vehicle-mounted terminal and medium
CN116958434A (en) Multi-view three-dimensional reconstruction method, measurement method and system
KR20230098058A (en) Three-dimensional data augmentation method, model training and detection method, device, and autonomous vehicle
CN113920270B (en) Layout reconstruction method and system based on multi-view panorama
CN115330935A (en) Three-dimensional reconstruction method and system based on deep learning
Chai et al. Deep depth fusion for black, transparent, reflective and texture-less objects
Jiang et al. Ffpa-net: Efficient feature fusion with projection awareness for 3d object detection

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination