CN116630953A - Monocular image 3D object detection method based on neural volume rendering - Google Patents

Monocular image 3D object detection method based on neural volume rendering

Info

Publication number
CN116630953A
CN116630953A (application CN202310432912.5A)
Authority
CN
China
Prior art keywords
features
feature
image
depth
volume rendering
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310432912.5A
Other languages
Chinese (zh)
Inventor
徐骏凯
彭亮
程浩然
钱炜
杨政
何晓飞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Fabu Technology Co Ltd
Original Assignee
Hangzhou Fabu Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Fabu Technology Co Ltd filed Critical Hangzhou Fabu Technology Co Ltd
Priority to CN202310432912.5A priority Critical patent/CN116630953A/en
Publication of CN116630953A publication Critical patent/CN116630953A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/60Type of objects
    • G06V20/64Three-dimensional objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/09Supervised learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Multimedia (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses a monocular image 3D object detection method based on neural volume rendering. An original input RGB image is first passed through a 2D image backbone network to extract 2D image features; positional frustum features are obtained by multi-plane image interval sampling; the 2D image features and the positional frustum features are fused to obtain position-aware frustum features; the position-aware frustum features are processed with a 3D convolutional network to build neural volume rendering features based on a signed distance field; optimized neural volume rendering features are obtained from the neural volume rendering features together with the reconstruction consistency constraint of neural volume rendering and the zero level set constraint of the signed distance field; finally, 3D detection features are obtained by grid sampling and fed into a generic object detection head to obtain the 3D object detection result. The invention proposes, for the first time, a monocular image 3D object detection method based on neural volume rendering, which can effectively perform the monocular image 3D object detection task while simultaneously predicting 3D occupancy, and in particular achieves high detection accuracy.

Description

Monocular image 3D object detection method based on neural volume rendering
Technical Field
The invention relates to an image processing method in computer vision and autonomous driving, and in particular to a monocular 3D object detection method based on neural volume rendering.
Background
Monocular three-dimensional object detection is one of the most important and challenging problems in computer vision. Given a single image, the task is to detect objects of the categories of interest and to output their position and size in three-dimensional space. An effective monocular three-dimensional object detector can be widely applied in vision-based perception, such as autonomous vehicles and security robots, and can also provide data support for training other downstream computer vision tasks.
In recent years, monocular three-dimensional object detection has made remarkable progress. However, existing monocular methods typically rely on depth estimation to establish the relationship between the 2D image and 3D space.
The 3D representations currently obtained by converting explicit depth estimates have several limitations.
First, 3D features obtained from depth estimation or pseudo-LiDAR are non-uniformly distributed in 3D space: they are dense at close range, but the density decreases as the distance increases.
Second, the final 3D object detection performance depends heavily on the accuracy of the depth estimation, which remains a significant challenge.
Therefore, this kind of representation cannot provide dense and reasonable 3D features for the monocular image 3D object detection task.
In monocular three-dimensional object detection, some approaches enhance the detector with geometric cues of the scene. However, many existing methods exploit these cues explicitly, for example by estimating a depth map and back-projecting it into 3D space. Such explicit methods lead to a sparse 3D representation, because the dimensionality increases from 2D to 3D, and they lose a large amount of information, especially for distant and occluded objects.
Disclosure of Invention
To solve and alleviate the problems described in the background, the invention provides a monocular image 3D object detection method based on neural volume rendering, which can effectively perform the monocular image 3D object detection task while simultaneously generating 3D occupancy.
The invention re-expresses the intermediate 3D representation in monocular image 3D object detection as a 3D representation similar to a neural radiance field (NeRF), thereby producing dense and reasonable 3D geometry and occupancy information, and performs 3D object detection based on this information.
The technical scheme adopted by the invention is as follows:
1) For an original input RGB image, 2D image features are first extracted by a 2D image backbone network;
the 2D image backbone network adopts a ResNet-34 neural network.
2) Three-dimensional frustum coordinates corresponding to the input RGB image are obtained by multi-plane image interval sampling and normalized to serve as positional frustum features;
3) The 2D image features and the positional frustum features are fused by a query-key-value mechanism to obtain position-aware frustum features;
4) The position-aware frustum features are processed with a 3D convolutional network to build neural volume rendering features based on a signed distance field;
5) Optimized neural volume rendering features are obtained from the neural volume rendering features together with the reconstruction consistency constraint of volume rendering and the zero level set constraint of the signed distance field;
6) The optimized neural volume rendering features are grid-sampled to obtain 3D detection features, which are fed into a generic object detection head to obtain the 3D object detection result.
All the above steps together constitute a complete monocular image 3D object detection network.
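To make the data flow of these steps concrete, the following PyTorch-style skeleton sketches one possible composition of the pipeline. It is only an illustration under assumed module interfaces (the patent does not publish reference code); every sub-module is injected rather than prescribed, and all names are hypothetical.

```python
import torch.nn as nn

class MonoNVRDetector(nn.Module):
    """Illustrative composition of steps 1)-6); all sub-modules are assumptions."""
    def __init__(self, backbone, frustum, fusion, sdf_net, feat_net,
                 density_fn, voxelizer, det_head):
        super().__init__()
        self.backbone = backbone      # step 1: 2D image backbone (e.g. ResNet-34)
        self.frustum = frustum        # step 2: multi-plane positional frustum features
        self.fusion = fusion          # step 3: query-key-value fusion
        self.sdf_net = sdf_net        # step 4: first 3D conv net  -> SDF features
        self.feat_net = feat_net      # step 4: second 3D conv net -> 3D intermediate features
        self.density_fn = density_fn  # step 4: Laplace CDF, SDF -> volume density
        self.voxelizer = voxelizer    # step 6: grid sampling + density weighting
        self.det_head = det_head      # step 6: generic 3D object detection head

    def forward(self, image):
        f_img = self.backbone(image)            # 2D image features
        f_pos = self.frustum(f_img)             # positional frustum features
        f_p = self.fusion(f_pos, f_img)         # position-aware frustum features
        f_sdf = self.sdf_net(f_p)               # signed distance field features
        f_3d = self.feat_net(f_p)               # 3D intermediate features
        f_density = self.density_fn(f_sdf)      # volume density features
        v_3d = self.voxelizer(f_3d, f_density)  # 3D detection features
        return self.det_head(v_3d)              # 3D boxes (and 3D occupancy)
```

The rendering losses of step 5) act only at training time, so they do not appear in this inference-time forward pass.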
Step 2) is specifically as follows: a depth range is established according to a predefined near depth z_n and far depth z_f, i.e. from the near depth z_n to the far depth z_f, and D depths are sampled from this range at equal depth intervals with random perturbation. Each sampled depth defines a depth plane, and each depth plane is divided into a number of grid pixels at the same resolution (H×W) as the 2D image features of step 1), i.e. it has H×W pixel grids. Each grid pixel takes its own plane coordinates on its depth plane, together with the depth of that plane, as its three-dimensional coordinates. The three-dimensional coordinates of all grid pixels on all depth planes are normalized to obtain a three-dimensional matrix forming the 3D representation, which serves as the positional frustum features.
In each depth plane, the frustum 3D position coordinates p = [u, v, z]^T of each grid pixel are formed by stacking the pixel's own plane coordinates [u, v]^T on the depth plane with the depth [z]^T of that plane.
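A minimal sketch of this multi-plane sampling is given below, assuming PyTorch, equal depth intervals with a uniform random perturbation inside each interval, and normalization of u, v, z to [0, 1]; the concrete depth range, resolution, and normalization scheme are illustrative assumptions.

```python
import torch

def build_position_frustum(height, width, z_near=2.0, z_far=60.0,
                           num_depths=64, perturb=True):
    """Build normalized positional frustum features of shape (3, D, H, W).

    Each of the D depth planes shares the (H, W) resolution of the 2D image
    features; every grid pixel gets [u, v, z] as its 3D frustum coordinates.
    """
    # D depths at equal intervals between z_near and z_far, with random perturbation
    step = (z_far - z_near) / num_depths
    z = z_near + step * (torch.arange(num_depths, dtype=torch.float32) + 0.5)
    if perturb:
        z = z + (torch.rand(num_depths) - 0.5) * step

    # plane coordinates of the H x W pixel grid, shared by all depth planes
    v, u = torch.meshgrid(torch.arange(height, dtype=torch.float32),
                          torch.arange(width, dtype=torch.float32), indexing="ij")

    u = u[None].expand(num_depths, -1, -1) / max(width - 1, 1)     # u in [0, 1]
    v = v[None].expand(num_depths, -1, -1) / max(height - 1, 1)    # v in [0, 1]
    z = (z[:, None, None].expand(-1, height, width) - z_near) / (z_far - z_near)

    return torch.stack([u, v, z], dim=0)       # positional frustum features (3, D, H, W)

f_pos = build_position_frustum(height=24, width=80)   # e.g. a 1/16-resolution feature map
print(f_pos.shape)                                     # torch.Size([3, 64, 24, 80])
```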
The query-key-value method of step 3) specifically comprises a position query module, a key mapping module, a value mapping module, and a fusion module containing a softmax function, which together fuse the positional frustum features and the 2D image features to construct the position-aware frustum features;
the positional frustum features are input into the position query module to obtain the position query features Q, and the 2D image features are input into the key mapping module and the value mapping module to obtain the key mapping features K and the value mapping features V, respectively;
the position query features Q and the key mapping features K are then multiplied (element-wise at corresponding positions), the product is scaled along the depth dimension with a softmax function, and the scaled result is multiplied with the value mapping features V to obtain the position-aware frustum features.
The position query module f_q is:
f_q: F_pos → Q
where F_pos is the positional frustum features and Q is the position query features; the learnable function f_q implements the mapping from F_pos to Q;
the key mapping module f_k is:
f_k: F_image → K
where F_image is the 2D image features and K is the key mapping features; the learnable function f_k implements the mapping from F_image to K;
the value mapping module f_v is:
f_v: F_image → V
where F_image is the 2D image features and V is the value mapping features; the learnable function f_v implements the mapping from F_image to V;
the fusion module is:
F_P = Softmax(QK, dim=D) V
where F_P is the position-aware frustum features, Q, K, and V are respectively the position query features, key mapping features, and value mapping features defined above, dim denotes the depth dimension, and Softmax(·, dim=D) denotes the softmax computed along the depth sampling axis (the dimension of the D sampled depths).
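A minimal sketch of this fusion is shown below, assuming PyTorch and modelling f_q, f_k, and f_v as 1×1 convolutions; the embedding width and the exact form of the learnable mappings are assumptions, and only the softmax over the depth axis follows the formula above.

```python
import torch
import torch.nn as nn

class PositionAwareFusion(nn.Module):
    """Fuse positional frustum features with 2D image features via query-key-value:
    F_P = Softmax(Q * K, dim=D) * V, with the softmax taken along the depth axis."""
    def __init__(self, img_channels, embed_channels, pos_channels=3):
        super().__init__()
        self.f_q = nn.Conv3d(pos_channels, embed_channels, kernel_size=1)  # position query
        self.f_k = nn.Conv2d(img_channels, embed_channels, kernel_size=1)  # key mapping
        self.f_v = nn.Conv2d(img_channels, embed_channels, kernel_size=1)  # value mapping

    def forward(self, f_pos, f_image):
        # f_pos:   (B, 3, D, H, W) positional frustum features
        # f_image: (B, C, H, W)    2D image features
        q = self.f_q(f_pos)                    # (B, E, D, H, W)
        k = self.f_k(f_image).unsqueeze(2)     # (B, E, 1, H, W), broadcast over depth
        v = self.f_v(f_image).unsqueeze(2)     # (B, E, 1, H, W)
        attn = torch.softmax(q * k, dim=2)     # scale along the depth sampling axis D
        return attn * v                        # position-aware frustum features (B, E, D, H, W)

fusion = PositionAwareFusion(img_channels=64, embed_channels=64)
f_p = fusion(torch.randn(1, 3, 64, 24, 80), torch.randn(1, 64, 24, 80))
print(f_p.shape)                               # torch.Size([1, 64, 64, 24, 80])
```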
In step 4), the neural volume rendering features based on the signed distance field specifically comprise four parts: signed distance field features, volume density features, 3D intermediate features, and RGB color features, realized respectively by three 3D convolutional networks with learnable parameters and a Laplace cumulative distribution function with learnable parameters:
after the position-aware frustum features are constructed, they are input into a first 3D convolutional network, one dimension of whose output 3D features is taken as the signed distance field (SDF) features; they are also input into a second 3D convolutional network, whose output features in the remaining dimensions (excluding the signed distance field features) are taken as the 3D intermediate features; the signed distance field features are then converted into the volume density features required for volume rendering through a Laplace cumulative distribution function.
The first 3D convolutional network is:
f_1: F_P → F_sdf
where F_P is the position-aware frustum features and F_sdf is the signed distance field features; the 3D convolutional network f_1 with learnable parameters implements the mapping from F_P to F_sdf;
the second 3D convolutional network is:
f_2: F_P → F_3D
where F_P is the position-aware frustum features and F_3D is the 3D intermediate features; the 3D convolutional network f_2 with learnable parameters implements the mapping from F_P to F_3D;
the Laplace cumulative distribution function is:
αΨ_β: F_sdf → F_density
where Ψ_β is the Laplace cumulative distribution function with zero mean and scale β, α and β are respectively the first and second learnable parameters, F_sdf is the signed distance field features, and F_density is the volume density features; Ψ_β implements the mapping from signed distance to volume density, forming a homogeneous volume density.
The signed distance field (SDF) features consist of a grid of small cells, each cell being an element whose value is a scalar. This encodes the position-aware frustum features into three-dimensional space, yielding more informative three-dimensional features for neural volume rendering.
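The SDF-to-density conversion can be sketched as below, assuming PyTorch and the common VolSDF-style convention that the Laplace CDF is evaluated on the negated signed distance (inside the surface maps to high density); the sign convention and the initial values of α and β are assumptions not stated in the patent.

```python
import torch
import torch.nn as nn

class LaplaceDensity(nn.Module):
    """Convert SDF features to volume density features: F_density = alpha * Psi_beta(-F_sdf),
    where Psi_beta is the CDF of a zero-mean Laplace distribution with scale beta."""
    def __init__(self, init_alpha=1.0, init_beta=0.1):
        super().__init__()
        self.alpha = nn.Parameter(torch.tensor(float(init_alpha)))  # first learnable parameter
        self.beta = nn.Parameter(torch.tensor(float(init_beta)))    # second learnable parameter

    def forward(self, f_sdf):
        beta = self.beta.abs() + 1e-6                    # keep the Laplace scale positive
        laplace = torch.distributions.Laplace(loc=0.0, scale=beta)
        # negative SDF (inside the surface) -> CDF close to 1 -> high density (assumption)
        return self.alpha.abs() * laplace.cdf(-f_sdf)    # volume density features F_density

density_fn = LaplaceDensity()
f_density = density_fn(torch.randn(1, 1, 64, 24, 80))   # SDF features -> density features
```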
In the method, loss constraints are established by combining the volume density features and the 3D intermediate features for supervised training of the whole detection network, specifically:
before training supervision, each RGB image is accompanied by sparse LiDAR points forming a point cloud, which serves as the depth data and the training labels of the RGB image.
The 3D intermediate features are input into a third 3D convolutional network to obtain the corresponding radiance field as the RGB color features; once the volume density features and the RGB color features are obtained, a reconstructed RGB image and a corresponding reconstructed depth map are recovered from 3D space by volume rendering;
the third 3D convolutional network is:
f_3: F_3D → F_RGB
where F_3D is the 3D intermediate features and F_RGB is the RGB color features; the 3D convolutional network f_3 with learnable parameters implements the mapping from F_3D to F_RGB;
an original depth map is obtained by projecting the sparse LiDAR points accompanying the original input RGB image;
a color consistency loss is established between the reconstructed RGB image and the original input RGB image, and a sparse depth map consistency loss is established between the reconstructed depth map and the original depth map obtained by projecting the sparse LiDAR points; in addition, the following loss constraint is established: for each LiDAR point accompanying the original input RGB image, the element of the signed distance field features corresponding to its three-dimensional coordinates is located; a zero level set constraint requiring the values of the located elements to be 0 is established for the elements that are found, and no such constraint is applied where no element is found.
The LiDAR points of the original input RGB image are thus used to supervise the signed distance field so that the SDF element values corresponding to all LiDAR points are 0 (LiDAR points normally lie only on object surfaces).
In step 5), the model is trained in a self-supervised manner; when the loss function converges, model optimization is complete. The loss function is:
arg min λ_rgb L_rgb + λ_depth L_depth + λ_sdf L_sdf
where L_rgb denotes the color consistency loss between the reconstructed RGB image produced by volume rendering and the current original input RGB image, L_depth denotes the consistency loss between the reconstructed depth map produced by volume rendering and the sparse depth map constructed by projecting the LiDAR points onto the camera imaging plane, and L_sdf denotes the zero level set constraint; reconstruction consistency is ensured by L_rgb and L_depth, and the zero level set constraint of the signed distance field is ensured by L_sdf. λ_rgb, λ_depth, and λ_sdf are the adjustable weights of the color consistency loss, the sparse depth map consistency loss, and the zero level set constraint, respectively.
The individual losses are as follows:
the color consistency loss L_rgb between the reconstructed RGB image produced by volume rendering and the current original input RGB image comprises a smooth mean absolute error L_smoothL1 and a structural similarity error L_SSIM:
L_rgb = λ_smoothL1 L_smoothL1 + λ_SSIM L_SSIM
where λ_smoothL1 and λ_SSIM are the adjustable weights of the smooth mean absolute error L_smoothL1 and the structural similarity error L_SSIM, respectively;
the consistency loss L_depth between the reconstructed depth map produced by volume rendering and the sparse depth map constructed by projecting the LiDAR points onto the camera imaging plane is:
L_depth = (1 / N_depth) Σ ‖Ẑ − Z_gt‖_1
where N_depth denotes the number of valid points of the LiDAR point cloud after projection onto the camera imaging plane, Ẑ denotes the rendered reconstructed depth map, Z_gt denotes the sparse depth map obtained by projection, and ‖·‖_1 denotes the 1-norm;
for the zero level set constraint, the signed distance field describes the geometric surfaces in the scene by its set of zero values, and all LiDAR points lie on geometric surfaces, so the constraint L_sdf is:
L_sdf = (1 / N_gt) Σ_(x,y,z) ‖F_sdf(x, y, z)‖_2
where N_gt denotes the number of valid LiDAR points within the 3D object detection range, F_sdf is the signed distance field features, (x, y, z) denotes the three-dimensional coordinates corresponding to the valid LiDAR points, and ‖·‖_2 denotes the 2-norm.
Two reconstruction losses are thus established in the method: one between the reconstructed RGB image and the original input RGB image, and one between the reconstructed depth map and the original depth map.
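A minimal sketch of these three loss terms is given below, assuming PyTorch; the SSIM term is passed in as a callable because its exact window and configuration are not specified in the patent, and the convention that unobserved pixels of the sparse depth map hold 0 is an assumption.

```python
import torch
import torch.nn.functional as F

def rendering_losses(rgb_pred, rgb_gt, depth_pred, depth_gt_sparse, sdf_at_lidar,
                     w_rgb=1.0, w_depth=1.0, w_sdf=1.0,
                     w_smooth_l1=1.0, w_ssim=1.0, ssim_fn=None):
    """Self-supervised losses of step 5): colour, sparse depth, and SDF zero level set."""
    # colour consistency: smooth mean absolute error (+ optional SSIM term)
    l_rgb = w_smooth_l1 * F.smooth_l1_loss(rgb_pred, rgb_gt)
    if ssim_fn is not None:
        l_rgb = l_rgb + w_ssim * (1.0 - ssim_fn(rgb_pred, rgb_gt))

    # sparse depth consistency: 1-norm over the N_depth valid projected LiDAR pixels
    valid = depth_gt_sparse > 0
    n_depth = valid.sum().clamp(min=1)
    l_depth = (depth_pred[valid] - depth_gt_sparse[valid]).abs().sum() / n_depth

    # zero level set: SDF values sampled at the N_gt valid LiDAR coordinates should be 0
    if sdf_at_lidar.numel() > 0:
        l_sdf = sdf_at_lidar.abs().mean()
    else:
        l_sdf = depth_pred.new_zeros(())

    return w_rgb * l_rgb + w_depth * l_depth + w_sdf * l_sdf

loss = rendering_losses(torch.rand(1, 3, 24, 80), torch.rand(1, 3, 24, 80),
                        torch.rand(1, 24, 80), torch.rand(1, 24, 80),
                        torch.randn(100))
```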
The numerical integration formulas of the volume rendering used in step 5) are:
Ĉ(r) = Σ_{i=1..D} T_i · (1 − exp(−σ(z_i) · δ_i)) · c(z_i)
D̂(r) = Σ_{i=1..D} T_i · (1 − exp(−σ(z_i) · δ_i)) · z_i
T_i = exp(−Σ_{j=1..i−1} σ(z_j) · δ_j), δ_j = z_{j+1} − z_j
where r denotes a ray emitted from the camera optical center, Ĉ(r) denotes the RGB color value of ray r on the image, D̂(r) denotes the depth value of ray r on the image, T_i denotes the accumulated transmittance of ray r from depth plane 1 to depth plane i, D is the number of sampled depth planes, z_i denotes the depth value corresponding to depth plane i, σ(z_i) denotes the volume density sampled from the volume density features F_density at z_i, c(z_i) denotes the RGB color sampled from the RGB color features F_RGB at z_i, and δ_j denotes the depth interval between depth plane j and depth plane j+1.
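The discrete rendering can be sketched as below, assuming PyTorch, per-pixel rays aligned with the depth planes, and the last depth interval repeated for δ_D; these are assumptions about details the formulas above leave open.

```python
import torch

def volume_render(density, rgb, z_vals):
    """Discrete volume rendering along D depth planes.

    density: (B, D, H, W)    sigma(z_i) sampled from F_density
    rgb:     (B, D, H, W, 3) c(z_i) sampled from F_RGB
    z_vals:  (D,)            depth of each depth plane
    Returns the reconstructed RGB image (B, H, W, 3) and depth map (B, H, W).
    """
    # depth intervals delta_j = z_{j+1} - z_j (last interval repeated, an assumption)
    deltas = torch.cat([z_vals[1:] - z_vals[:-1], (z_vals[-1] - z_vals[-2]).view(1)])
    deltas = deltas.view(1, -1, 1, 1)                              # (1, D, 1, 1)

    alpha = 1.0 - torch.exp(-density * deltas)                     # per-plane opacity
    # accumulated transmittance T_i = exp(-sum_{j<i} sigma(z_j) * delta_j)
    trans = torch.cumprod(torch.cat([torch.ones_like(alpha[:, :1]),
                                     1.0 - alpha + 1e-10], dim=1), dim=1)[:, :-1]
    weights = trans * alpha                                        # (B, D, H, W)

    rgb_map = (weights.unsqueeze(-1) * rgb).sum(dim=1)             # reconstructed RGB image
    depth_map = (weights * z_vals.view(1, -1, 1, 1)).sum(dim=1)    # reconstructed depth map
    return rgb_map, depth_map

z = torch.linspace(2.0, 60.0, 64)
rgb_map, depth_map = volume_render(torch.rand(1, 64, 24, 80),
                                   torch.rand(1, 64, 24, 80, 3), z)
```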
In step 6), feature voxels V_f are obtained by sampling the 3D intermediate features at predefined voxel-space coordinates, and density voxels V_density are obtained by sampling the volume density features at the same predefined voxel-space coordinates. The feature voxels V_f are weighted by the density voxels V_density to obtain the 3D detection features that are finally fed into the generic object detection head, and the 3D detection features are input into the 3D detection head to predict the detection result. Specifically:
V_density = G(F_density)
V_f = G(F_3D)
V_3D = V_f · tanh(V_density)
where G(·) denotes the grid sampling operation, tanh(·) denotes the hyperbolic tangent activation function used to scale the density voxels, and V_3D is the 3D detection features finally fed into the generic object detection head.
The neural volume rendering features input to the generic object detection head thus comprise the density voxels V_density and the feature voxels V_f, obtained by sampling the volume density features F_density and the 3D intermediate features F_3D via a predefined 3D grid.
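A minimal sketch of this voxelisation is shown below, assuming PyTorch's F.grid_sample with voxel coordinates already normalised to [−1, 1]; the voxel grid size and coordinate normalisation are assumptions.

```python
import torch
import torch.nn.functional as F

def build_detection_voxels(f_3d, f_density, voxel_coords):
    """Grid-sample the neural volume rendering features into 3D detection features.

    f_3d:         (B, C, D, H, W)  3D intermediate features
    f_density:    (B, 1, D, H, W)  volume density features
    voxel_coords: (B, Z, Y, X, 3)  predefined voxel-space coordinates in [-1, 1]
    Returns V_3D = V_f * tanh(V_density), shape (B, C, Z, Y, X).
    """
    v_f = F.grid_sample(f_3d, voxel_coords, align_corners=False)            # feature voxels V_f
    v_density = F.grid_sample(f_density, voxel_coords, align_corners=False) # density voxels V_density
    return v_f * torch.tanh(v_density)                                      # 3D detection features V_3D

coords = torch.rand(1, 16, 100, 100, 3) * 2 - 1        # hypothetical 16 x 100 x 100 voxel grid
v_3d = build_detection_voxels(torch.randn(1, 63, 64, 24, 80),
                              torch.randn(1, 1, 64, 24, 80), coords)
print(v_3d.shape)                                      # torch.Size([1, 63, 16, 100, 100])
```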
The invention uses 3D features supervised by volume rendering: after volume-density weighting, they are input into a 3D detection head to predict the detection result.
The invention combines the extracted 2D image features with the corresponding normalized positional frustum features to construct position-aware frustum features fused with 3D position information, then uses the position-aware frustum features to create signed distance field (SDF) features and RGB color features, and further extracts weighted voxels for prediction to obtain the detection result.
The invention renders a reconstructed RGB image and depth map from the signed distance field features and the RGB color features by volume rendering, carries out supervised training by establishing losses against the original RGB image and the LiDAR points, and at the same time implicitly models the scene for optimization through the zero level set of the signed distance field.
In particular, the invention models the scene using signed distance functions (SDF), which facilitates the generation of dense neural volume rendering features. These neural volume rendering features are treated as neural radiance fields (NeRF), and classical volume rendering techniques are then used to reconstruct RGB images and depth maps. This design enables the network to infer dense 3D geometry and occupancy (3D occupancy), to effectively perform the monocular image 3D object detection task, and to generate 3D occupancy.
The beneficial effects of the invention are as follows:
the invention proposes, for the first time, a monocular image 3D object detection method based on neural volume rendering, maintains high detection accuracy, and can simultaneously predict 3D occupancy; the neural volume rendering features are optimized in a self-supervised manner during training, requiring no additional manual data annotation.
Drawings
FIG. 1 is a functional block diagram of an example of the present invention.
FIG. 2 is a flowchart of the steps of a monocular image 3D object detection method based on neural volume rendering according to an embodiment of the present invention.
Detailed Description
The invention is further described below with reference to the drawings and the detailed description.
1) For an original input RGB image, firstly extracting 2D image features through a 2D image backbone network;
the embodied 2D image backbone network employs a Resnet34 neural network.
2) Generating a 3D representation three-dimensional matrix of the near neural radiation field NeRF representation using the 2D image features as a positional view cone feature;
the step 2) specifically comprises the following steps: establishing a depth range according to a predefined near depth z_n and a far depth z_f, namely, sampling D depths from the depth range from the near depth z_n to the far depth z_f, wherein the sampling process uses equal depth intervals and random disturbance, each depth establishes a depth plane, and each depth plane is consistent with the 2D image feature resolution (H multiplied by W) of the step 1), namely, each grid pixel of the (H multiplied by W) pixel grid takes the plane coordinate of the grid pixel on the depth plane per se plus the depth of the depth plane as the three-dimensional coordinate of the grid pixel per se; and normalizing the three-dimensional coordinates of all grid pixels of all depth planes to obtain a 3D representation three-dimensional matrix serving as a position view cone feature.
In each depth plane, the view cone 3D position coordinates p= [ u, v, z ] ≡t of each grid pixel are composed of grid pixels superimposed with their own plane coordinates [ u, v ] ≡t on the depth plane with depth [ z ] ≡t.
3) The positional frustum features and the 2D image features are fused to construct the position-aware frustum features;
the query-key-value method specifically comprises a position query module, a key mapping module, a value mapping module, and a fusion module with a softmax function, which together fuse the positional frustum features and the 2D image features to construct the position-aware frustum features;
the position query module is:
f_q: F_pos → Q
where F_pos is the positional frustum features and Q is the position query features; the learnable function f_q implements the mapping from F_pos to Q.
The key mapping module is:
f_k: F_image → K
where F_image is the 2D image features and K is the key mapping features; the learnable function f_k implements the mapping from F_image to K.
The value mapping module is:
f_v: F_image → V
where F_image is the 2D image features and V is the value mapping features; the learnable function f_v implements the mapping from F_image to V.
The fusion module is:
F_P = Softmax(QK, dim=D) V
where F_P is the position-aware frustum features, Q, K, and V are the position query features, key mapping features, and value mapping features defined above, and Softmax(·, dim=D) denotes the softmax computed along the depth sampling axis (the dimension of the D sampled depths).
4) The position-aware frustum features are processed with 3D convolutional networks and a Laplace cumulative distribution function to obtain the corresponding neural volume rendering features, and optimized prediction is performed in combination with volume rendering to obtain the detection result.
The neural volume rendering features based on the signed distance field in step 4) specifically comprise four parts: signed distance field features, volume density features, 3D intermediate features, and RGB color features, realized respectively by three 3D convolutional networks with learnable parameters and one Laplace cumulative distribution function with learnable parameters.
The first 3D convolutional network is:
f_1: F_P → F_sdf
where F_P is the position-aware frustum features and F_sdf is the signed distance field features; the 3D convolutional network f_1 with learnable parameters implements the mapping from F_P to F_sdf.
The second 3D convolutional network is:
f_2: F_P → F_3D
where F_P is the position-aware frustum features and F_3D is the 3D intermediate features; the 3D convolutional network f_2 with learnable parameters implements the mapping from F_P to F_3D.
The third 3D convolutional network is:
f_3: F_3D → F_RGB
where F_3D is the 3D intermediate features and F_RGB is the RGB color features; the 3D convolutional network f_3 with learnable parameters implements the mapping from F_3D to F_RGB.
The Laplace cumulative distribution function is:
αΨ_β: F_sdf → F_density
where α and β are learnable parameters, Ψ_β is the Laplace cumulative distribution function with zero mean and scale β, F_sdf is the signed distance field features, and F_density is the volume density features; Ψ_β implements the mapping from signed distance to volume density, forming a homogeneous volume density.
5) Before 3D object detection is performed, the method further carries out training and learning, including self-supervised training of the volume rendering part and supervised training of the 3D object detection part. The supervised training of the 3D object detection part is a general technique unrelated to the innovation and is not described in detail here. The learning process of the volume rendering part is specifically:
the model can be trained in a self-supervised manner; when the loss function converges, model optimization is complete. The loss function is:
arg min λ_rgb L_rgb + λ_depth L_depth + λ_sdf L_sdf
The individual losses are as follows:
the color consistency loss L_rgb between the reconstructed RGB image produced by volume rendering and the current original input RGB image comprises a smooth mean absolute error L_smoothL1 and a structural similarity error L_SSIM, with λ_smoothL1 and λ_SSIM being their respective adjustable weights:
L_rgb = λ_smoothL1 L_smoothL1 + λ_SSIM L_SSIM
The consistency loss L_depth between the reconstructed depth map produced by volume rendering and the sparse depth map constructed by projecting the LiDAR points onto the camera imaging plane is:
L_depth = (1 / N_depth) Σ ‖Ẑ − Z_gt‖_1
For the zero level set constraint, the signed distance field describes the geometric surfaces in the scene by its set of zero values, and all LiDAR points lie on geometric surfaces, so the constraint L_sdf is:
L_sdf = (1 / N_gt) Σ_(x,y,z) ‖F_sdf(x, y, z)‖_2
6) The neural volume rendering features input to the generic object detection head comprise the density voxels V_density and the feature voxels V_f, obtained by sampling the volume density features F_density and the 3D intermediate features F_3D via a predefined 3D grid:
V_density = G(F_density)
V_f = G(F_3D)
V_3D = V_f · tanh(V_density)
where G(·) denotes the grid sampling operation, tanh(·) denotes the hyperbolic tangent activation function used to scale the density voxels, and V_3D is the 3D detection features finally fed into the generic object detection head.

Claims (7)

1. A monocular image 3D object detection method based on neural volume rendering, characterized in that the method comprises the following steps:
1) For an original input RGB image, firstly extracting 2D image features through a 2D image backbone network;
2) Three-dimensional frustum coordinates corresponding to the input RGB image are obtained by multi-plane image interval sampling and normalized to serve as positional frustum features;
3) The 2D image features and the positional frustum features are fused by a query-key-value mechanism to obtain position-aware frustum features;
4) The position-aware frustum features are processed with a 3D convolutional network to build neural volume rendering features based on a signed distance field;
5) Optimized neural volume rendering features are obtained from the neural volume rendering features together with the reconstruction consistency constraint of volume rendering and the zero level set constraint of the signed distance field;
6) The optimized neural volume rendering features are grid-sampled to obtain 3D detection features, which are fed into a generic object detection head to obtain the 3D object detection result.
2. The monocular image 3D object detection method based on neural volume rendering according to claim 1, characterized in that step 2) specifically comprises: a depth range is established according to a predefined near depth z_n and far depth z_f, and D depths are sampled from the depth range at equal depth intervals with random perturbation; each sampled depth defines a depth plane divided into a number of grid pixels, and each grid pixel takes its own plane coordinates on its depth plane, together with the depth of that plane, as its three-dimensional coordinates; the three-dimensional coordinates of all grid pixels on all depth planes are normalized to obtain a three-dimensional matrix serving as the positional frustum features.
3. The monocular image 3D object detection method based on neural volume rendering according to claim 2, characterized in that in each depth plane, the frustum 3D position coordinates p = [u, v, z]^T of each grid pixel are formed by stacking the pixel's own plane coordinates [u, v]^T on the depth plane with the depth [z]^T of that plane.
4. The monocular image 3D object detection method based on neural volume rendering according to claim 1, characterized in that the query-key-value method of step 3) specifically comprises a position query module, a key mapping module, a value mapping module, and a fusion module containing a softmax function, which together fuse the positional frustum features and the 2D image features to construct the position-aware frustum features;
the positional frustum features are input into the position query module to obtain the position query features Q, and the 2D image features are input into the key mapping module and the value mapping module to obtain the key mapping features K and the value mapping features V, respectively;
the position query features Q and the key mapping features K are then multiplied, the product is scaled along the depth dimension with a softmax function, and the scaled result is multiplied with the value mapping features V to obtain the position-aware frustum features; specifically:
the position query module f_q is:
f_q: F_pos → Q
where F_pos is the positional frustum features and Q is the position query features;
the key mapping module f_k is:
f_k: F_image → K
where F_image is the 2D image features and K is the key mapping features;
the value mapping module f_v is:
f_v: F_image → V
where F_image is the 2D image features and V is the value mapping features;
the fusion module is:
F_P = Softmax(QK, dim=D) V
where F_P is the position-aware frustum features, Q, K, and V are respectively the position query features, key mapping features, and value mapping features defined above, and dim denotes the depth dimension.
5. The monocular image 3D object detection method based on neural volume rendering according to claim 1, characterized in that:
in step 4), the neural volume rendering features based on the signed distance field are specifically:
the position-aware frustum features are input into a first 3D convolutional network, one dimension of whose output 3D features is taken as the signed distance field features; the position-aware frustum features are also input into a second 3D convolutional network, whose output features in the remaining dimensions (excluding the signed distance field features) are taken as the 3D intermediate features; the signed distance field features are converted into the volume density features required for volume rendering through a Laplace cumulative distribution function; specifically:
the first 3D convolutional network is:
f_1: F_P → F_sdf
where F_P is the position-aware frustum features and F_sdf is the signed distance field features;
the second 3D convolutional network is:
f_2: F_P → F_3D
where F_P is the position-aware frustum features and F_3D is the 3D intermediate features;
the Laplace cumulative distribution function is:
αΨ_β: F_sdf → F_density
where Ψ_β is the Laplace cumulative distribution function with zero mean and scale β, α and β are respectively the first and second learnable parameters, F_sdf is the signed distance field features, and F_density is the volume density features.
6. The monocular image 3D object detection method based on neural volume rendering according to claim 1 or 5, characterized in that:
in the method, loss constraints are established by combining the volume density features and the 3D intermediate features for supervised training, specifically:
the 3D intermediate features are input into a third 3D convolutional network to obtain the RGB color features; once the volume density features and the RGB color features are obtained, a reconstructed RGB image and a corresponding reconstructed depth map are recovered from 3D space by volume rendering;
the third 3D convolutional network is:
f_3: F_3D → F_RGB
where F_3D is the 3D intermediate features and F_RGB is the RGB color features;
an original depth map is obtained by projecting the LiDAR points accompanying the original input RGB image;
a color consistency loss is established between the reconstructed RGB image and the original input RGB image, a sparse depth map consistency loss is established between the reconstructed depth map and the original depth map obtained by LiDAR point projection, and the following loss constraint is also established: for each LiDAR point accompanying the original input RGB image, the element of the signed distance field features corresponding to its three-dimensional coordinates is located, and a zero level set constraint requiring the values of the located elements to be 0 is established.
7. The monocular image 3D object detection method based on neural volume rendering according to claim 6, characterized in that:
in step 6), feature voxels V_f are obtained by sampling the 3D intermediate features at predefined voxel-space coordinates, and density voxels V_density are obtained by sampling the volume density features at predefined voxel-space coordinates; the feature voxels V_f are weighted by the density voxels V_density to obtain the 3D detection features finally fed into the generic object detection head, and the 3D detection features are input into the 3D detection head to predict the detection result; specifically:
V_density = G(F_density)
V_f = G(F_3D)
V_3D = V_f · tanh(V_density)
where G(·) denotes the grid sampling operation, tanh(·) denotes the hyperbolic tangent activation function used to scale the density voxels, and V_3D is the 3D detection features finally fed into the generic object detection head.
CN202310432912.5A 2023-04-21 2023-04-21 Monocular image 3D object detection method based on neural volume rendering Pending CN116630953A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310432912.5A CN116630953A (en) 2023-04-21 2023-04-21 Monocular image 3D object detection method based on neural volume rendering

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310432912.5A CN116630953A (en) 2023-04-21 2023-04-21 Monocular image 3D object detection method based on neural volume rendering

Publications (1)

Publication Number Publication Date
CN116630953A true CN116630953A (en) 2023-08-22

Family

ID=87608973

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310432912.5A Pending CN116630953A (en) 2023-04-21 2023-04-21 Monocular image 3D target detection method based on nerve volume rendering

Country Status (1)

Country Link
CN (1) CN116630953A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118071999A (en) * 2024-04-17 2024-05-24 厦门大学 Multi-view 3D target detection method based on sampling self-adaption continuous NeRF


Similar Documents

Publication Publication Date Title
WO2021233029A1 (en) Simultaneous localization and mapping method, device, system and storage medium
CN109791697B (en) Predicting depth from image data using statistical models
CN110675418B (en) Target track optimization method based on DS evidence theory
CN111563415B (en) Binocular vision-based three-dimensional target detection system and method
CN110009674B (en) Monocular image depth of field real-time calculation method based on unsupervised depth learning
CN113052835B (en) Medicine box detection method and system based on three-dimensional point cloud and image data fusion
CN110689562A (en) Trajectory loop detection optimization method based on generation of countermeasure network
JP7166388B2 (en) License plate recognition method, license plate recognition model training method and apparatus
CN111899328B (en) Point cloud three-dimensional reconstruction method based on RGB data and generation countermeasure network
CN108876814B (en) Method for generating attitude flow image
CN111476242B (en) Laser point cloud semantic segmentation method and device
CN110910437B (en) Depth prediction method for complex indoor scene
CN113628348A (en) Method and equipment for determining viewpoint path in three-dimensional scene
CN112862736B (en) Real-time three-dimensional reconstruction and optimization method based on points
Rist et al. Scssnet: Learning spatially-conditioned scene segmentation on lidar point clouds
CN116612468A (en) Three-dimensional target detection method based on multi-mode fusion and depth attention mechanism
CN116630953A (en) Monocular image 3D object detection method based on neural volume rendering
CN114332796A (en) Multi-sensor fusion voxel characteristic map generation method and system
CN115909255B (en) Image generation and image segmentation methods, devices, equipment, vehicle-mounted terminal and medium
CN116958434A (en) Multi-view three-dimensional reconstruction method, measurement method and system
KR20230098058A (en) Three-dimensional data augmentation method, model training and detection method, device, and autonomous vehicle
CN113920270B (en) Layout reconstruction method and system based on multi-view panorama
CN115330935A (en) Three-dimensional reconstruction method and system based on deep learning
Chai et al. Deep depth fusion for black, transparent, reflective and texture-less objects
Jiang et al. Ffpa-net: Efficient feature fusion with projection awareness for 3d object detection

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination