CN117422884A - Three-dimensional target detection method, system, electronic equipment and storage medium - Google Patents

Three-dimensional target detection method, system, electronic equipment and storage medium

Info

Publication number
CN117422884A
CN117422884A (application CN202311490553.5A)
Authority
CN
China
Prior art keywords
point cloud
voxel
depth
data
image data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311490553.5A
Other languages
Chinese (zh)
Inventor
张惠敏
马辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Telecom Corp Ltd
Original Assignee
China Telecom Corp Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Telecom Corp Ltd filed Critical China Telecom Corp Ltd
Priority to CN202311490553.5A priority Critical patent/CN117422884A/en
Publication of CN117422884A publication Critical patent/CN117422884A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/048 Activation functions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/25 Determination of region of interest [ROI] or a volume of interest [VOI]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/52 Scale-space analysis, e.g. wavelet analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07 Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Image Analysis (AREA)

Abstract

The embodiments of the present application provide a three-dimensional target detection method, system, electronic device, and storage medium, belonging to the technical field of artificial intelligence. The method includes: obtaining visual image data and sparse point cloud data to be detected; inputting the visual image data into a depth prediction model constructed based on a convolutional neural network to obtain depth image data; performing back projection processing according to the depth image data and the visual image data to obtain pseudo point cloud data; fusing and supplementing the pseudo point cloud data into the sparse point cloud data to perfect the geometric information of the point cloud data and obtain dense point cloud data; and inputting the dense point cloud data into a three-dimensional target detection model based on point cloud voxelization to obtain an accurate three-dimensional target detection result. By converting the visual image data into pseudo point cloud data and using the pseudo point cloud data to complement the original sparse point cloud data, more geometric detail information of the target is obtained, and the subsequent target detection based on point cloud voxelization improves the efficiency and accuracy of three-dimensional target detection.

Description

Three-dimensional target detection method, system, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of artificial intelligence technologies, and in particular, to a three-dimensional target detection method, system, electronic device, and storage medium.
Background
Three-dimensional object detection is a key technology in the field of computer vision; it aims to accurately detect and locate objects in three-dimensional space from the data of three-dimensional sensors (such as lidar and depth cameras). Compared with traditional two-dimensional target detection, three-dimensional target detection requires not only obtaining the planar position and class of a specific target from the sensing data, but also detecting its three-dimensional bounding box, which contains information such as the three-dimensional center position, size, and orientation; the detection difficulty is therefore far higher than that of two-dimensional target detection. According to the modality of the sensing data, current three-dimensional target detection technologies are generally either image-based detection technologies or point-cloud-based detection technologies.
Image-based detection technologies use only the image data modality and estimate depth information from the inherent geometric attributes of the image to realize three-dimensional target detection, where the image data modality includes monocular images, binocular images, and depth images. However, images lack depth information, and distant objects are deformed to some extent during processing. In addition, for long-distance and occluded targets, even with multi-frame fitting, geometric constraints, instance segmentation, and other means, the three-dimensional detection accuracy of image-based three-dimensional target detection methods is generally low.
Point-cloud-based detection technologies usually adopt either a raw point cloud or a grid as the data representation. Methods based on the raw point cloud extract features directly from it and perform three-dimensional target detection; they process the sparse point cloud directly and can make full use of the point cloud information, but they cause repeated computation and therefore a huge waste of computing resources. Grid-based methods convert the irregular point cloud into a regular grid representation and perform three-dimensional target detection by learning grid features; the amount of computation is reduced, but information contained in the point cloud is lost during voxelization, and the point cloud of the same target object may be cut apart across instances. In particular, distant point clouds are sparse, and the point cloud features of small targets are insufficient and easily affected by the cutting manner, so that the three-dimensional detection accuracy is reduced.
Disclosure of Invention
The embodiment of the application mainly aims to provide a three-dimensional target detection method, a three-dimensional target detection system, electronic equipment and a storage medium, and aims to improve the accuracy of three-dimensional target detection.
To achieve the above object, an aspect of an embodiment of the present application provides a three-dimensional object detection method, including:
acquiring visual image data to be detected and sparse point cloud data;
Inputting the visual image data into a depth prediction model constructed based on a convolutional neural network to obtain depth image data;
performing back projection processing according to the depth image data and the visual image data to obtain pseudo point cloud data;
fusing the sparse point cloud data and the pseudo point cloud data to obtain dense point cloud data;
and inputting the dense point cloud data into a three-dimensional target detection model based on point cloud voxelization to obtain a three-dimensional target detection result.
In some embodiments, the inputting the visual image data into a depth prediction model constructed based on a convolutional neural network to obtain depth image data includes the following steps:
performing feature extraction on the visual image data through a convolution layer of a depth prediction model to obtain a first depth feature;
normalizing the first depth features through a batch normalization layer of the depth prediction model to obtain second depth features;
and mapping the second depth feature through a nonlinear activation function to obtain depth image data.
In some embodiments, the depth prediction model is trained by:
acquiring a training data set and initializing a depth prediction model, wherein the training data set comprises a plurality of training samples, and the training samples comprise visual image data marked with real depth data;
Inputting the training data set into the depth prediction model for forward prediction to obtain predicted depth data;
calculating a model loss value according to the predicted depth data and the corresponding real depth data;
and updating parameters of the depth prediction model according to the model loss value to obtain a trained depth prediction model.
In some embodiments, the performing back projection processing according to the depth image data and the visual image data to obtain pseudo point cloud data includes the following steps:
acquiring camera parameters;
converting pixel coordinates in the visual image data into normalized coordinates by the camera parameters;
and determining pseudo point cloud data according to the normalized coordinates, the depth values in the depth image data and the camera parameters.
In some embodiments, the inputting the dense point cloud data into a three-dimensional target detection model based on point cloud voxelization to obtain a three-dimensional target detection result includes the following steps:
performing voxelization processing on the dense point cloud data to obtain a plurality of voxel units in a three-dimensional voxel space;
extracting features of a plurality of voxel units through a three-dimensional convolutional neural network to obtain multi-scale voxel features;
And determining a three-dimensional target detection result according to the multi-scale voxel characteristics.
In some embodiments, the determining a three-dimensional object detection result according to the multi-scale voxel feature comprises the steps of:
selecting the highest scale feature in the multi-scale voxel features, and compressing the highest scale feature by using a bird's eye view perspective to obtain the voxel feature to be predicted;
and carrying out target detection on the voxel characteristics to be predicted through a regional suggestion network to obtain a three-dimensional target detection result.
In some embodiments, the performing voxelization processing on the dense point cloud data to obtain a plurality of voxel units in a three-dimensional voxel space includes the following steps:
acquiring voxel parameters of a voxel unit, wherein the voxel parameters comprise a voxel size and a number of point clouds that can be accommodated;
calculating voxel indexes where the point clouds are located according to coordinate information of each point cloud in the dense point cloud data;
and determining a voxel unit according to the voxel index, judging whether the voxel unit reaches an accommodation upper limit according to voxel parameters of the voxel unit, and mapping the point cloud into the voxel unit when the voxel unit does not reach the accommodation upper limit.
To achieve the above object, another aspect of the embodiments of the present application proposes a three-dimensional object detection system, including:
The first module is used for acquiring visual image data to be detected and sparse point cloud data;
the second module is used for inputting the visual image data into a depth prediction model constructed based on a convolutional neural network to obtain depth image data;
the third module is used for carrying out back projection processing according to the depth image data and the visual image data to obtain pseudo point cloud data;
a fourth module, configured to fuse the sparse point cloud data and the pseudo point cloud data to obtain dense point cloud data;
and a fifth module, configured to input the dense point cloud data into a three-dimensional target detection model based on point cloud voxelization, and obtain a three-dimensional target detection result.
To achieve the above object, another aspect of the embodiments of the present application proposes an electronic device including a memory, a processor, a program stored on the memory and executable on the processor, and a data bus for implementing connection communication between the processor and the memory, the program implementing the three-dimensional object detection method described in the above embodiments when executed by the processor.
In order to achieve the above object, a fourth aspect of the embodiments of the present application proposes a storage medium, which is a computer-readable storage medium, for computer-readable storage, the storage medium storing one or more programs executable by one or more processors to implement the three-dimensional object detection method described in the above embodiments.
According to the three-dimensional target detection method, system, electronic device, and storage medium of the embodiments, visual image data and sparse point cloud data to be detected are acquired simultaneously; the visual image data is input into a depth prediction model constructed based on a convolutional neural network to obtain depth image data; back projection processing is then performed according to the depth image data and the visual image data to obtain pseudo point cloud data; the pseudo point cloud data is fused and supplemented into the sparse point cloud data to perfect the geometric information of the point cloud data and obtain dense point cloud data; and the dense point cloud data is then input into a three-dimensional target detection model based on point cloud voxelization to obtain an accurate three-dimensional target detection result. By converting the visual image data into pseudo point cloud data and complementing the original sparse point cloud data with it, more geometric detail information of the target can be obtained, and three-dimensional target detection based on point cloud voxelization improves the accuracy of three-dimensional target detection while maintaining a certain computational efficiency.
Drawings
FIG. 1 is a flow chart of a three-dimensional object detection method provided in an embodiment of the present application;
fig. 2 is a flowchart of step S102 in fig. 1;
FIG. 3 is a flow chart of the depth prediction model training process of step S102 in FIG. 1;
fig. 4 is a flowchart of step S103 in fig. 1;
fig. 5 is a flowchart of step S105 in fig. 1;
fig. 6 is a flowchart of step S501 in fig. 5;
fig. 7 is a flowchart of step S503 in fig. 5;
FIG. 8 is a schematic structural diagram of a three-dimensional object detection system provided in an embodiment of the present application;
fig. 9 is a schematic hardware structure of an electronic device according to an embodiment of the present application;
fig. 10 is a schematic diagram of a three-dimensional object detection process according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application.
It should be noted that although a functional module division is shown in the system block diagrams and a logical order is shown in the flowcharts, in some cases the steps shown or described may be performed with a module division different from that in the system diagrams or in an order different from that in the flowcharts. The terms "first", "second", and the like in the description, the claims, and the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of the present application only and is not intended to be limiting of the present application.
First, several terms referred to in this application are explained:
artificial intelligence (artificial intelligence, AI): is a new technical science for researching and developing theories, methods, technologies and application systems for simulating, extending and expanding the intelligence of people; artificial intelligence is a branch of computer science that attempts to understand the nature of intelligence and to produce a new intelligent machine that can react in a manner similar to human intelligence, research in this field including robotics, language recognition, image recognition, natural language processing, and expert systems. Artificial intelligence can simulate the information process of consciousness and thinking of people. Artificial intelligence is also a theory, method, technique, and application system that utilizes a digital computer or digital computer-controlled machine to simulate, extend, and expand human intelligence, sense the environment, acquire knowledge, and use knowledge to obtain optimal results.
Convolutional neural network (Convolutional Neural Networks, CNN): a feedforward neural network that includes convolution computation and has a deep structure; it can perform supervised learning on labeled training data to complete tasks such as visual image recognition and target detection.
The three-dimensional object detection model is a model that detects an object in a three-dimensional space. Three-dimensional object detection is a traditional task in the field of computer vision, and unlike image recognition, three-dimensional object detection not only needs to recognize an object existing on an image and give a corresponding category, but also needs to give the position of the object in a mode of a minimum bounding box. According to the different output results of target detection, information such as object types, length, width, height, rotation angle and the like in a three-dimensional space is generally output.
RGB-D (RGB-Depth), which is a multi-modal sensor data representation, combines RGB image and Depth information. Such data representation methods are commonly used in the computer vision and robotics fields to enhance the understanding and perception of environments and objects.
ResNet50 (Residual Network): a deeper version of ResNet that builds the network from 50 convolutional and pooling layers together with multiple residual blocks, aiming to ease the optimization of deep neural networks.
Voxel-RCNN (Voxelization Region-Based Convolutional Neural Networks, a convolutional neural network based on voxelized regions): a computer vision model specifically used for three-dimensional (3D) object detection in point cloud data.
BEV (Bird's-Eye View): a view of a display scene or region at a top view angle, similar to the perspective of birds looking down on the ground. In the bird's eye view, the observer is located high above or away from the scene, presenting objects and structures on the ground in a global, comprehensive manner.
RPN (Region Proposal Network): a network for generating candidate target regions (also referred to as region proposals or candidate boxes); it is typically used in conjunction with convolutional neural networks to improve the efficiency and accuracy of target detection.
The AP (Average Precision, average accuracy) is a performance evaluation index commonly used in target detection tasks to measure the accuracy of the algorithm under different categories or different thresholds.
The point cloud refers to information such as the azimuth and distance of the laser reflected when a laser beam irradiates the surface of an object. When a laser beam is scanned along a certain track, the reflected laser point information is recorded during scanning; because the scanning is extremely fine, a large number of laser points can be obtained, thus forming a laser point cloud. Common point cloud file formats include *.pcd, *.txt, etc.
A depth image, also called a range image, refers to an image in which the distance (depth) value from an image collector to each point in the scene is taken as a pixel value.
The embodiment of the application provides a three-dimensional target detection method, a three-dimensional target detection system, electronic equipment and a storage medium, and aims to improve the accuracy of three-dimensional target detection.
The three-dimensional object detection method, the system, the electronic device and the storage medium provided in the embodiments of the present application are specifically described through the following embodiments, and the three-dimensional object detection method in the embodiments of the present application is first described.
The embodiment of the application can acquire and process the related data based on the artificial intelligence technology. Among these, artificial intelligence (Artificial Intelligence, AI) is the theory, method, technique and application system that uses a digital computer or a digital computer-controlled machine to simulate, extend and extend human intelligence, sense the environment, acquire knowledge and use knowledge to obtain optimal results.
Artificial intelligence infrastructure technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a robot technology, a biological recognition technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and other directions.
The embodiment of the application provides a three-dimensional target detection method, and relates to the technical field of artificial intelligence. The three-dimensional target detection method provided by the embodiment of the application can be applied to a terminal, a server side and software running in the terminal or the server side. In some embodiments, the terminal may be a smart phone, tablet, notebook, desktop, etc.; the server side can be configured as an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, and a cloud server for providing cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDNs, basic cloud computing services such as big data and artificial intelligent platforms and the like; the software may be an application or the like for realizing the three-dimensional object detection method, but is not limited to the above form.
The subject application is operational with numerous general purpose or special purpose computer system environments or configurations. For example: personal computers, server computers, hand-held or portable devices, tablet devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like. The application may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The application may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
Fig. 1 is an optional flowchart of a three-dimensional object detection method provided in an embodiment of the present application, where the method in fig. 1 may include, but is not limited to, steps S101 to S105.
Step S101, visual image data to be detected and sparse point cloud data are obtained;
step S102, inputting visual image data into a depth prediction model constructed based on a convolutional neural network to obtain depth image data;
step S103, performing back projection processing according to the depth image data and the visual image data to obtain pseudo point cloud data;
step S104, fusing sparse point cloud data and pseudo point cloud data to obtain dense point cloud data;
step S105, inputting the dense point cloud data into a three-dimensional target detection model based on point cloud voxelization to obtain a three-dimensional target detection result.
In steps S101 to S105 illustrated in this embodiment of the present application, visual image data and sparse point cloud data to be detected are acquired simultaneously; the visual image data is input into a depth prediction model constructed based on a convolutional neural network to obtain depth image data; back projection processing is then performed according to the depth image data and the visual image data to obtain pseudo point cloud data; the pseudo point cloud data is fused and supplemented into the sparse point cloud data to perfect the geometric information of the point cloud data and obtain dense point cloud data; and the dense point cloud data is then input into a three-dimensional target detection model based on point cloud voxelization to obtain an accurate three-dimensional target detection result. In this embodiment, the visual image data is converted into pseudo point cloud data and used to complement the original sparse point cloud data, so that more geometric detail information of the target can be obtained, and three-dimensional target detection based on point cloud voxelization improves the accuracy of three-dimensional target detection while maintaining a certain computational efficiency.
In step S101 of some embodiments, the point cloud data refers to a set of vectors in a three-dimensional coordinate system; each point contains three-dimensional coordinates and may also contain color information or reflection intensity information. Point cloud data is inherently sparse and unevenly distributed, and therefore cannot by itself provide sufficiently rich feature expression and information description. The visual image data is data captured by a camera that represents content through color information, i.e., an RGB image.
Referring to fig. 2, in some embodiments, in step S102, the step of inputting the visual image data into a depth prediction model constructed based on a convolutional neural network to obtain depth image data may include, but is not limited to, steps S201 to S203:
step S201, extracting features of the visual image data through a convolution layer of a depth prediction model to obtain a first depth feature;
step S202, normalizing the first depth features through a batch normalization layer of the depth prediction model to obtain second depth features;
and step S203, mapping the second depth feature through a nonlinear activation function to obtain depth image data.
In this embodiment, the network structure of the depth prediction model is a convolutional neural network, which may be a ResNet network or a ResNet50 network. As the number of layers of a convolutional neural network increases, the network's ability to learn image features gradually deepens; however, after the network is deepened to a certain degree, the effect decreases instead of improving. Based on this, the embodiment of the application combines a residual network and adopts the deep residual network ResNet50, which has more network layers yet a relatively small amount of computation, as the backbone feature extraction network, thereby learning richer image features and improving the depth prediction accuracy. To further alleviate the gradient vanishing or gradient explosion problems caused by an overly deep convolutional neural network, the residual structure of this embodiment comprises two layers, whose expressions are shown in formulas (1) and (2), respectively.
F(x) = W_2 σ(W_1 x);  (1)
h(x) = F(x, {W_i}) + x;  (2)
where x is the input vector, h(x) is the output vector, σ is the ReLU activation function, and W_i denotes the linear transformation of the i-th layer.
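For illustration only, a minimal PyTorch sketch of the two-layer residual structure in formulas (1) and (2) is given below; the 3x3 kernel size and equal input/output channel width are assumptions rather than details taken from the embodiment:

```python
import torch
import torch.nn as nn

class TwoLayerResidualBlock(nn.Module):
    """Two-layer residual structure: h(x) = F(x, {W_i}) + x, with F(x) = W_2 * sigma(W_1 * x)."""

    def __init__(self, channels: int):
        super().__init__()
        # W_1 and W_2 are realized here as 3x3 convolutions (kernel size is an assumption)
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        f = self.conv2(self.relu(self.conv1(x)))  # F(x) = W_2 * sigma(W_1 * x), formula (1)
        return f + x                               # h(x) = F(x) + x, formula (2)
```

In ResNet-style backbones another ReLU is commonly applied after the addition; whether the embodiment does so is not stated, so the sketch keeps only what formulas (1) and (2) define.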
Specifically, the depth prediction model comprises a plurality of convolution layers; the convolution layers are connected through batch normalization layers, and a nonlinear activation function is applied for the prediction mapping when the features of the depth prediction model are output. Feature extraction is performed by the convolution layers, batch normalization layers, and nonlinear activation functions, and the depth image data is generated by deconvolution layers and upsampling layers.
The convolution layer is the main component of a convolutional neural network structure. After the feature map of the previous layer is input, each convolution kernel performs a convolution operation with the feature map; the kernel then slides over the feature map with a certain stride, performing a convolution operation at each sliding position. The convolution layer extracts features from the input image, and because of the local connections between the feature maps of adjacent layers, the number of parameters is greatly reduced compared with a fully connected neural network structure. The calculation formula of the convolution layer is shown in formula (3).
x_j^l = f( Σ_{i∈K_j} x_i^(l-1) * w_ij + b );  (3)
where l denotes the current layer; x_j^l denotes the j-th feature map in layer l; x_i^(l-1) denotes the i-th feature map in layer l-1; f(·) denotes the activation function; * denotes the convolution operation; w_ij denotes the convolution kernel; K_j denotes the receptive field of the input layer; and b is the bias term of the output.
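Purely as an illustration of formula (3), the accumulation over the input feature maps can be written out explicitly; the "valid" correlation mode and the use of ReLU as f(·) are assumptions:

```python
import numpy as np
from scipy.signal import correlate2d

def conv_layer_output(prev_maps, kernels, bias, activation=lambda v: np.maximum(v, 0.0)):
    """Compute one output feature map x_j^l = f(sum_i x_i^{l-1} * w_ij + b) as in formula (3).

    prev_maps: list of 2-D arrays, the feature maps x_i^{l-1} of layer l-1
    kernels:   list of 2-D arrays, one convolution kernel w_ij per input map
    bias:      scalar bias term b
    """
    acc = None
    for x_i, w_ij in zip(prev_maps, kernels):
        # 2-D cross-correlation, which is how convolution layers are usually implemented
        r = correlate2d(x_i, w_ij, mode="valid")
        acc = r if acc is None else acc + r
    return activation(acc + bias)
```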
Batch normalization layer: during neural network training, the distribution of each layer's inputs changes as the parameters of the previous layer change, which makes training more complex and would otherwise require carefully adjusted initial parameters and a reduced training speed. Batch normalization has the advantage of making normalization part of the model architecture and normalizing each training mini-batch: the mean and variance of the data in each batch are computed and then used to normalize the data into an effective range, so that the values of each layer are passed to the next layer within that range. The batch normalization algorithm proceeds as follows:
Input mini-batch data: B = {x_1, ..., x_m};
compute the mini-batch mean: μ_B = (1/m) Σ_{i=1}^{m} x_i;
compute the mini-batch variance: σ_B² = (1/m) Σ_{i=1}^{m} (x_i - μ_B)²;
normalize the convolution layer output using the mean and variance: x̂_i = (x_i - μ_B) / √(σ_B² + ε), and then scale and shift to obtain y_i = γ·x̂_i + β = BN_{γ,β}(x_i), where the parameters γ and β are obtained through training.
Batch normalization can effectively prevent gradient explosion and vanishing during neural network training, accelerate network convergence, and improve model accuracy.
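A minimal sketch of the batch normalization computation listed above, assuming a small constant ε for numerical stability and leaving out the running statistics used at inference time:

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize a mini-batch B = {x_1, ..., x_m} and apply the learned scale and shift.

    x:     array of shape (m, C) holding the mini-batch
    gamma: per-channel scale, learned during training
    beta:  per-channel shift, learned during training
    """
    mu = x.mean(axis=0)                      # mini-batch mean
    var = x.var(axis=0)                      # mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)    # normalized values
    return gamma * x_hat + beta              # y_i = gamma * x_hat_i + beta = BN_{gamma,beta}(x_i)
```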
The nonlinear activation function, reLu, is a commonly used activation function in convolutional neural networks, and has the advantages of high convergence rate and simple solution gradient. The ReLu activation function is shown in equation (4).
Wherein x represents the normalized second depth feature.
Before step S102 of some embodiments, a depth prediction model needs to be trained in advance, where the input of the depth prediction model is visual image data, and the model estimates a depth value of each pixel point by analyzing features such as texture, edges, and the like in an image, and learns and acquires a mapping relationship between the input image and a depth map. In training a depth prediction network, a set of RGB images with a true depth map is input, and the network is trained by minimizing the difference between the predicted depth map and the true depth map.
Referring to fig. 3, in some embodiments, the depth prediction model is trained by:
Step S301, a training data set is obtained and a depth prediction model is initialized, wherein the training data set comprises a plurality of training samples, and the training samples comprise visual image data marked with real depth data;
step S302, inputting a training data set into a depth prediction model for forward prediction to obtain predicted depth data;
step S303, calculating a model loss value according to the predicted depth data and the corresponding real depth data;
and step S304, updating parameters of the depth prediction model according to the model loss value to obtain a trained depth prediction model.
Specifically, after the training data set is obtained through random sampling, it can be input into the initialized depth prediction model for training. After the data in the training data set is fed into the initialized depth prediction model, the output of the model, i.e., the predicted depth data, is obtained, and the prediction accuracy of the model can be evaluated from the predicted depth data and the real depth data so as to update the parameters of the model. For the depth prediction model, the accuracy of the prediction result can be measured by a loss function, which is defined on a single piece of training data and measures the prediction error for that piece of data; specifically, the loss value of a training sample is determined from the real result of the single training sample and the model's prediction for that sample. In actual training, the training data set contains many training samples, so a cost function is generally used to measure the overall error on the training data set; the cost function is defined on the whole training data set and computes the average prediction error over all training samples, which better measures the prediction performance of the model. For a general machine learning model, a training objective function can be formed from the cost function plus a regularization term measuring the model complexity, and the loss value over the whole training data set can be obtained from this objective function. There are many common loss functions, such as the 0-1 loss, squared loss, absolute loss, logarithmic loss, and cross-entropy loss, all of which can serve as the loss function of a machine learning model and are not described in detail here. In the embodiments of the present application, a suitable loss function may be selected to determine the model loss value during training. Based on the model loss value, the parameters of the model are updated using the back-propagation algorithm, and after several rounds of iteration the trained depth prediction model is obtained. The specific number of iteration rounds may be preset, or training may be deemed complete when the test set meets the accuracy requirement.
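Under the assumption that an L1 loss between the predicted and real depth maps is chosen (the embodiment leaves the specific loss function open), the training flow of steps S301 to S304 might be sketched as follows; the optimizer, learning rate, epoch count, and data-loader format are illustrative assumptions:

```python
import torch
import torch.nn as nn

def train_depth_model(model, loader, num_epochs=20, lr=1e-4, device="cuda"):
    """Steps S301-S304: forward prediction, loss computation, back-propagation update."""
    model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.L1Loss()  # loss choice is an assumption; the embodiment leaves it open
    for epoch in range(num_epochs):
        for rgb, gt_depth in loader:                 # visual images annotated with real depth
            rgb, gt_depth = rgb.to(device), gt_depth.to(device)
            pred_depth = model(rgb)                  # forward prediction (step S302)
            loss = criterion(pred_depth, gt_depth)   # model loss value (step S303)
            optimizer.zero_grad()
            loss.backward()                          # back-propagation
            optimizer.step()                         # parameter update (step S304)
    return model
```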
Referring to fig. 4, in some embodiments, in step S103, the step of performing back projection processing according to the depth image data and the visual image data to obtain pseudo point cloud data may include, but is not limited to, steps S401 to S403:
step S401, obtaining camera parameters;
step S402, converting pixel coordinates in visual image data into normalized coordinates through camera parameters;
step S403, determining pseudo point cloud data according to the normalized coordinates, the depth values in the depth image data and the camera parameters.
In this embodiment, after the depth image data is predicted, the predicted depth image data may be combined with the original visual image data, and a back projection process may be used to generate the point cloud. The back projection process back projects the two-dimensional image into a three-dimensional point cloud, and specifically comprises the following steps:
camera parameters are acquired, including internal parameters and external parameters of the camera. The internal parameters include information of focal length, principal point position, distortion, etc., typically given in the form of a camera matrix K. The external parameters include the pose of the camera, i.e. the shooting angle of view, including the rotation matrix R and the translation vector t.
Image coordinate conversion: the pixel coordinates in the two-dimensional image are converted into normalized coordinates by subtracting the principal point offset from the pixel coordinates and dividing the result by the focal length, yielding the normalized coordinates (u, v).
Back projection calculation: for each normalized coordinate (u, v), the back projection is computed using formula (5):
X=inv(K)*[u,v,1]*d; (5)
where inv(K) is the inverse of the camera intrinsic matrix K, [u, v, 1] is the homogeneous representation of the normalized coordinates, and d represents the depth or distance value of the pixel.
Point cloud reconstruction: the three-dimensional coordinate points (x, y, z) obtained by back-projecting each pixel are assembled into a point cloud data structure, yielding the pseudo point cloud data.
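A hedged numpy sketch of steps S401 to S403 and formula (5), assuming a pinhole camera model with intrinsic matrix K and ignoring lens distortion and the extrinsic transform into the lidar frame:

```python
import numpy as np

def backproject_to_pseudo_cloud(depth, K):
    """Back-project a depth image into a pseudo point cloud in the camera frame.

    depth: (H, W) array of per-pixel depth values d
    K:     (3, 3) camera intrinsic matrix
    Returns an (N, 3) array of points X = inv(K) @ [u, v, 1]^T * d, per formula (5).
    """
    h, w = depth.shape
    v, u = np.mgrid[0:h, 0:w]                                 # pixel row/column coordinates
    pixels = np.stack([u, v, np.ones_like(u)], axis=-1)
    pixels = pixels.reshape(-1, 3).astype(np.float64)         # homogeneous pixel coordinates
    rays = pixels @ np.linalg.inv(K).T                        # normalized coordinates per pixel
    points = rays * depth.reshape(-1, 1)                      # scale each ray by its depth value
    valid = depth.reshape(-1) > 0                             # drop pixels with no valid depth
    return points[valid]
```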
In step S104 of some embodiments, the mapped pseudo point cloud data is fused with the original sparse point cloud data, complementing the original sparse point cloud into dense point cloud data. In general, data fusion can take place at an early, middle, or late stage. Early fusion makes full use of the raw information in the data and has low computational requirements, but it is not flexible enough because multiple data modalities are processed jointly, and if the input data is extended the network structure needs to be retrained. Late fusion combines the decision outputs of the network structures of different data modalities and offers greater flexibility and modularity: when a new sensing modality is introduced, only a single structure needs to be trained without affecting the other networks; however, the computational cost is higher and many intermediate features may be lost. Middle fusion is a compromise between early and late fusion, fusing features in intermediate layers so that the network can learn different feature representations. Therefore, this embodiment can adopt a middle-fusion approach and add the features of the original point cloud data and the pseudo point cloud data to obtain dense point cloud data.
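As one possible reading of the fusion step, the pseudo point cloud can be concatenated with the original sparse cloud to form the dense cloud; the extra source-flag channel below is an illustrative assumption rather than part of the embodiment:

```python
import numpy as np

def fuse_point_clouds(sparse_points, pseudo_points):
    """Complement the sparse lidar cloud with pseudo points mapped from the depth image.

    sparse_points: (N, 3) original sparse point cloud
    pseudo_points: (M, 3) pseudo point cloud generated by back projection
    Returns a denser (N + M, 4) cloud; the last channel marks the point source
    (0 = original lidar point, 1 = pseudo point) and is an illustrative assumption.
    """
    src = np.concatenate([np.zeros(len(sparse_points)), np.ones(len(pseudo_points))])
    dense = np.concatenate([sparse_points, pseudo_points], axis=0)
    return np.concatenate([dense, src[:, None]], axis=1)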
Referring to fig. 5, in some embodiments, in step S105, the step of inputting the dense point cloud data into the three-dimensional object detection model based on the point cloud voxel to obtain the three-dimensional object detection result may further include, but is not limited to, steps S501 to S503:
step S501, voxelization is carried out on the dense point cloud data to obtain a plurality of voxel units in a three-dimensional voxel space;
step S502, extracting characteristics of a plurality of voxel units through a three-dimensional convolutional neural network to obtain multi-scale voxel characteristics;
step S503, determining a three-dimensional target detection result according to the multi-scale voxel characteristics.
In some embodiments of steps S501 to S503, the three-dimensional object detection model may adopt a Voxel-RCNN network structure. Voxel-RCNN is a two-stage object detection network, and the network performs three processes: voxelization, BEV feature extraction, and region proposal network prediction. The point cloud data is first voxelized, features are then extracted with a 3D sparse convolution network, and the features are finally compressed to a bird's eye view to detect the target. In this embodiment, convolution operations are performed only on non-empty voxels, which improves the efficiency of 3D convolution feature extraction.
Referring to fig. 6, in some embodiments, step S501 may include, but is not limited to, steps S601 to S603:
Step S601, acquiring voxel parameters of a voxel unit, wherein the voxel parameters comprise voxel sizes and the quantity of accommodating point clouds;
step S602, calculating voxel indexes of the point clouds according to coordinate information of each point cloud in the dense point cloud data;
and step S603, determining a voxel unit according to the voxel index, judging whether the voxel unit reaches the accommodation upper limit according to the voxel parameters of the voxel unit, and mapping the point cloud into the voxel unit when the voxel unit does not reach the accommodation upper limit.
Specifically, voxelization divides the continuous three-dimensional space into regular voxel units (a voxel grid), converting the point cloud data into a discrete voxel representation. Voxelization first requires the voxel parameters, including the voxel size and the number of point clouds that each voxel unit can accommodate; for example, the voxel size may be (0.4 × 0.2 × 0.2), where 0.4 is the length of the voxel, 0.2 is the height of the voxel, and 0.2 is the width of the voxel. The index of the voxel containing each point is then calculated in turn from the coordinate information of the point cloud, and each point is mapped into the corresponding voxel unit according to its voxel index. Whether the voxel unit has reached the set maximum is then judged: if the maximum has been reached, the point is discarded; if not, the point is retained.
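A simplified sketch of the voxelization in steps S601 to S603, assuming the point cloud has already been shifted into a non-negative detection range; the capacity of 32 points per voxel and the dict-based grid are assumptions for illustration:

```python
import numpy as np

def voxelize(points, voxel_size=(0.4, 0.2, 0.2), max_points_per_voxel=32):
    """Map each point to its voxel index and keep at most max_points_per_voxel points per voxel.

    points:     (N, 3) dense point cloud, assumed shifted so all coordinates are >= 0
    voxel_size: size of one voxel unit; the values follow the (0.4 x 0.2 x 0.2) example above
    """
    voxels = {}
    indices = np.floor(points[:, :3] / np.asarray(voxel_size)).astype(np.int64)
    for point, idx in zip(points, map(tuple, indices)):
        bucket = voxels.setdefault(idx, [])
        if len(bucket) < max_points_per_voxel:   # discard the point once the voxel is full
            bucket.append(point)
    return voxels  # dict: voxel index -> list of points mapped into that voxel unit
```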
In step S502 of some embodiments, after the plurality of voxel units in the three-dimensional voxel space is obtained, feature extraction is performed on them through a three-dimensional convolutional neural network to obtain multi-scale voxel features f_v. The feature extraction process of the three-dimensional convolutional neural network specifically includes:
Taking a grid center point g_i as input, the local neighborhood graph of the point cloud is defined as G = (V, E).
Let G = (G_1, G_2, ..., G_i), where i is the number of voxel units in the 3D backbone network.
Taking the grid center point g_i as the center of the graph, the voxels within the spherical neighborhood of radius r around g_i form the vertex set V, i.e., the voxels v_i in the voxel feature set f_v; the edge set E consists of the edges from each v_i to g_i, as shown in formula (6);
where v_{i,n} denotes a voxel within the local neighborhood of the i-th grid center point and n is the index of the voxel.
In each layer of voxel feature extraction, local neighborhood graphs with different spherical radii r are constructed; by adjusting r, the model can learn graph features under different receptive fields.
The three-dimensional convolutional neural network extracts features of the voxels in the three-dimensional voxel space using four layers of sparse convolution, with 2× downsampling operations performed sequentially, so that voxel features of different scales are obtained; the multi-scale voxel features can be expressed as f_v^{n×}, where the superscript n× denotes the downsampling multiple.
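A hedged sketch of the multi-scale feature extraction; dense nn.Conv3d layers stand in here for the sparse convolution used in the embodiment, and the channel widths, exact downsampling schedule, and the way the voxel dict is densified into a tensor are all assumptions:

```python
import torch
import torch.nn as nn

class Voxel3DBackbone(nn.Module):
    """Four 3-D convolution stages with stride-2 downsampling, yielding multi-scale voxel features f_v."""

    def __init__(self, in_channels=4, widths=(16, 32, 64, 128)):
        super().__init__()
        stages, c_in = [], in_channels
        for c_out in widths:
            stages.append(nn.Sequential(
                nn.Conv3d(c_in, c_out, kernel_size=3, stride=2, padding=1),  # 2x downsampling
                nn.BatchNorm3d(c_out),
                nn.ReLU(inplace=True),
            ))
            c_in = c_out
        self.stages = nn.ModuleList(stages)

    def forward(self, voxel_grid: torch.Tensor):
        # voxel_grid: (B, C, D, H, W) dense tensor built from the voxel units
        feats = []                       # multi-scale voxel features at increasing downsampling
        x = voxel_grid
        for stage in self.stages:
            x = stage(x)
            feats.append(x)
        return feats
```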
Referring to fig. 7, in some embodiments, step S503 may include, but is not limited to, steps S701 to S702:
step S701, selecting the highest-scale feature among the multi-scale voxel features, and compressing the highest-scale feature using the bird's eye view perspective to obtain the voxel feature to be predicted;
and step S702, carrying out target detection on the voxel characteristics to be predicted through the regional suggestion network to obtain a three-dimensional target detection result.
In the present embodiment, after the multi-scale voxel features f_v are obtained, the voxel feature of the largest scale is compressed along the z axis to the BEV perspective to obtain the voxel feature f_bev to be predicted under the BEV perspective, and f_bev is then input into the RPN network. The region proposal network is a fully convolutional network: the RPN takes an image feature of arbitrary size as input and outputs a set of rectangular target proposal boxes, each with an objectness score.
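One common way to realize the compression to the BEV perspective is to fold the z extent into the channel dimension; whether the embodiment folds or pools along z is not specified, so the sketch below is an assumption:

```python
import torch

def to_bev(voxel_feature: torch.Tensor) -> torch.Tensor:
    """Fold the vertical (z) dimension of the highest-scale voxel feature into channels.

    voxel_feature: tensor of shape (B, C, D, H, W), D being the extent along the z axis
    Returns a (B, C*D, H, W) BEV feature map f_bev that can be fed to the RPN.
    """
    b, c, d, h, w = voxel_feature.shape
    return voxel_feature.reshape(b, c * d, h, w)
```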
A three-dimensional object detection process according to an embodiment of the present invention is described below with reference to fig. 10. First, the originally acquired sparse point cloud data and visual image data are input; the visual image data is input into a depth prediction model with a ResNet50 convolutional neural network structure to obtain depth image data, which is then mapped into pseudo point cloud data. The pseudo point cloud data and the original sparse point cloud data undergo feature addition to obtain dense point cloud data; the dense point cloud data is voxelized and its features are extracted with a three-dimensional convolution network to obtain multi-scale voxel features; the multi-scale voxel features are reduced in dimension to obtain BEV features; and finally the BEV features are input into the region proposal network to obtain the three-dimensional target detection result.
According to some embodiments of the present invention, the effects of the three-dimensional object detection method according to the embodiments of the present invention are described as follows:
in the KITTI data set, easy, modernd, hard is defined in terms of whether the annotation box is occluded, the degree of occlusion, and the height of the box. In the easy mode, the minimum bounding box height is 40 pixels, the maximum occlusion level is fully visible, and the maximum cutoff is 15%. In the mode of mode, the minimum bounding box height is 25 pixels, the maximum occlusion level is partial occlusion, and the maximum truncation is 30%. In hard mode, the minimum bounding box height is 25 pixels, the maximum occlusion level is hard to see, and the maximum cutoff is 50%. In general, a mode is often used, taking a compromise between simple and complex images. For detecting the type of driving vehicle, car (Detection) refers to Detection under 2D, car (BEV) refers to Detection under a bird's eye view, car (3D Detection) refers to Detection under 3D, and Car (Orientation) refers to average accuracy of different rotation angles of an object under a camera coordinate system.
On the KITTI test dataset, 40 epochs were iterated using four 3090 graphics cards with total_batch_size set to 4. In this embodiment, the AP is used as an evaluation index, and the IOU threshold is set to 0.7, and the detection accuracy is not lower than 0.7, and the determination is qualified. In the mode of the model, in the image depth prediction and target Detection network model training process based on ResNet50+Voxel-RCNN of the embodiment of the invention, the AP of Car (Detection) is 95.72%, the AP of Car (Orientation) is 95.55%, the AP of Car (3D Detection) is 81.55%, the AP of Car (BEV) is 91.02%, and the accuracy in 4 states is greater than 0.7, and the accuracy of target Detection of the three-dimensional target Detection method of the embodiment of the application is respectively improved by 1.6%, 1.4%, 0.8% and 0.4% compared with that of the traditional network model of Voxel-RCNN.
In the embodiment of the invention, the appearance information provided by the image helps to enrich the point cloud data, and the accurate geometric structure provided by the point cloud in turn improves the mapped point cloud converted from the image. The experimental results of this embodiment of the application show that point cloud fusion target detection based on depth prediction can significantly improve the performance of three-dimensional target detection.
Referring to fig. 8, an embodiment of the present application further provides a three-dimensional object detection system, including:
the first module is used for acquiring visual image data to be detected and sparse point cloud data;
the second module is used for inputting the visual image data into a depth prediction model constructed based on a convolutional neural network to obtain depth image data;
the third module is used for carrying out back projection processing according to the depth image data and the visual image data to obtain pseudo point cloud data;
a fourth module, configured to fuse the sparse point cloud data and the pseudo point cloud data to obtain dense point cloud data;
and a fifth module, configured to input the dense point cloud data into a three-dimensional target detection model based on point cloud voxelization, and obtain a three-dimensional target detection result.
It can be understood that the content in the above-mentioned three-dimensional object detection method embodiment is applicable to the system embodiment, and the functions specifically implemented by the system embodiment are the same as those of the above-mentioned three-dimensional object detection method embodiment, and the beneficial effects achieved by the system embodiment are the same as those achieved by the above-mentioned three-dimensional object detection method embodiment.
The embodiment of the application also provides electronic equipment, which comprises: the three-dimensional object detection method comprises a memory, a processor, a program stored on the memory and capable of running on the processor, and a data bus for realizing connection communication between the processor and the memory, wherein the program is executed by the processor to realize the three-dimensional object detection method. The electronic equipment can be any intelligent terminal including a tablet personal computer, a vehicle-mounted computer and the like.
Referring to fig. 9, fig. 9 illustrates a hardware structure of an electronic device according to another embodiment, the electronic device includes:
the processor 901 may be implemented by a general-purpose CPU (central processing unit), a microprocessor, an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), or one or more integrated circuits, and is configured to execute related programs to implement the technical solutions provided by the embodiments of the present application;
the memory 902 may be implemented in the form of read-only memory (Read-Only Memory, ROM), static storage, dynamic storage, or random access memory (Random Access Memory, RAM). The memory 902 may store an operating system and other application programs; when the technical solutions provided in the embodiments of the present application are implemented by software or firmware, the relevant program code is stored in the memory 902 and invoked by the processor 901 to execute the three-dimensional object detection method of the embodiments of the present application;
An input/output interface 903 for inputting and outputting information;
the communication interface 904 is configured to implement communication interaction between the device and other devices, and may implement communication in a wired manner (e.g. USB, network cable, etc.), or may implement communication in a wireless manner (e.g. mobile network, WIFI, bluetooth, etc.);
a bus 905 that transfers information between the various components of the device (e.g., the processor 901, the memory 902, the input/output interface 903, and the communication interface 904);
wherein the processor 901, the memory 902, the input/output interface 903 and the communication interface 904 are communicatively coupled to each other within the device via a bus 905.
The embodiment of the application also provides a storage medium, which is a computer readable storage medium and is used for computer readable storage, the storage medium stores one or more programs, and the one or more programs can be executed by one or more processors to realize the three-dimensional target detection method.
The memory, as a non-transitory computer readable storage medium, may be used to store non-transitory software programs as well as non-transitory computer executable programs. In addition, the memory may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory optionally includes memory remotely located relative to the processor, the remote memory being connectable to the processor through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The embodiments described herein are intended to describe the technical solutions of the embodiments of the present application more clearly and do not constitute a limitation on those technical solutions; as those skilled in the art will appreciate, with the evolution of technology and the appearance of new application scenarios, the technical solutions provided by the embodiments of the present application are equally applicable to similar technical problems.
It will be appreciated by those skilled in the art that the solutions shown in fig. 1-7 do not limit the embodiments of the present application, and may include more or fewer steps than shown, or certain steps may be combined, or different steps may be included.
The system embodiments described above are merely illustrative, in that the units illustrated as separate components may or may not be physically separate, i.e., may be located in one place, or may be distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
Those of ordinary skill in the art will appreciate that all or some of the steps of the methods, systems, functional modules/units in the devices disclosed above may be implemented as software, firmware, hardware, and suitable combinations thereof.
The terms "first," "second," "third," "fourth," and the like in the description of the present application and in the above-described figures, if any, are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that embodiments of the present application described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
It should be understood that in this application, "at least one" means one or more, and "a plurality" means two or more. The term "and/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, "A and/or B" may mean: only A is present, only B is present, or both A and B are present, where A and B may be singular or plural. The character "/" generally indicates an "or" relationship between the associated objects. "At least one of the following items" or similar expressions refers to any combination of those items, including any combination of single items or plural items. For example, "at least one of a, b or c" may mean: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", where a, b and c may be singular or plural.
In the several embodiments provided in this application, it should be understood that the disclosed systems and methods may be implemented in other ways. For example, the system embodiments described above are merely illustrative; the division of the above units is merely a logical functional division, and there may be other divisions in actual implementation, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the coupling or direct coupling or communication connection shown or discussed between components may be an indirect coupling or communication connection through some interfaces, systems or units, and may be in electrical, mechanical or other forms.
The units described above as separate components may or may not be physically separate, and components shown as units may or may not be physical units, may be located in one place, or may be distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in each embodiment of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium, which includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the methods of the various embodiments of the present application. The aforementioned storage medium includes various media capable of storing a program, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
Preferred embodiments of the present application have been described above with reference to the accompanying drawings; this does not thereby limit the scope of the claims of the embodiments of the present application. Any modifications, equivalent substitutions and improvements made by those skilled in the art without departing from the scope and spirit of the embodiments of the present application shall fall within the scope of the claims of the embodiments of the present application.

Claims (10)

1. A three-dimensional object detection method, characterized by comprising the steps of:
acquiring visual image data to be detected and sparse point cloud data;
inputting the visual image data into a depth prediction model constructed based on a convolutional neural network to obtain depth image data;
performing back projection processing according to the depth image data and the visual image data to obtain pseudo point cloud data;
fusing the sparse point cloud data and the pseudo point cloud data to obtain dense point cloud data;
and inputting the dense point cloud data into a three-dimensional target detection model based on point cloud voxelization to obtain a three-dimensional target detection result.
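By way of illustration only, the overall pipeline recited in claim 1 may be sketched in Python as follows. The callables predict_depth, back_project and detect_3d are hypothetical stand-ins for the depth prediction model, the back-projection step and the voxel-based detector; they are not names used in the application.

    import numpy as np

    def detect_objects(image, sparse_points, predict_depth, back_project, detect_3d):
        """Pipeline of claim 1 (helper callables are hypothetical).

        image:         H x W x 3 visual image data to be detected
        sparse_points: N x 3 sparse point cloud data
        """
        # Step 1: depth prediction from the visual image
        depth_map = predict_depth(image)                # H x W depth image data

        # Step 2: back-project pixels into a 3D pseudo point cloud
        pseudo_points = back_project(depth_map)         # (H*W) x 3

        # Step 3: fuse sparse and pseudo point clouds into a dense point cloud
        dense_points = np.concatenate([sparse_points, pseudo_points], axis=0)

        # Step 4: voxelization-based 3D target detection on the dense cloud
        return detect_3d(dense_points)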
2. The method for three-dimensional object detection according to claim 1, wherein the step of inputting the visual image data into a depth prediction model constructed based on a convolutional neural network to obtain depth image data comprises the steps of:
performing feature extraction on the visual image data through a convolution layer of a depth prediction model to obtain a first depth feature;
normalizing the first depth features through a batch normalization layer of the depth prediction model to obtain second depth features;
and mapping the second depth feature through a nonlinear activation function to obtain depth image data.
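A minimal sketch of the convolution, batch normalization and nonlinear activation structure of claim 2, assuming PyTorch as the framework; the channel counts, kernel size and the choice of ReLU are illustrative assumptions rather than details taken from the application.

    import torch
    import torch.nn as nn

    class DepthPredictionBlock(nn.Module):
        """One block of a depth prediction model, sketched per claim 2."""
        def __init__(self, in_channels=3, out_channels=64):
            super().__init__()
            self.conv = nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1)
            self.bn = nn.BatchNorm2d(out_channels)   # normalizes the first depth feature
            self.act = nn.ReLU(inplace=True)         # nonlinear activation function

        def forward(self, x):
            first = self.conv(x)      # feature extraction -> first depth feature
            second = self.bn(first)   # normalization -> second depth feature
            return self.act(second)   # nonlinear mapping -> depth feature map

    # Example: a batch of two RGB images produces a 64-channel feature map
    features = DepthPredictionBlock()(torch.randn(2, 3, 128, 416))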
3. The three-dimensional object detection method according to claim 1, wherein the depth prediction model is trained by:
acquiring a training data set and initializing a depth prediction model, wherein the training data set comprises a plurality of training samples, and the training samples comprise visual image data marked with real depth data;
inputting the training data set into the depth prediction model for forward prediction to obtain predicted depth data;
calculating a model loss value according to the predicted depth data and the corresponding real depth data;
and updating parameters of the depth prediction model according to the model loss value to obtain a trained depth prediction model.
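The training procedure of claim 3 can be sketched as a generic supervised loop. The Adam optimizer, the L1 regression loss and the data loader interface are assumptions introduced for illustration, not details from the application.

    import torch

    def train_depth_model(model, data_loader, epochs=10, lr=1e-4):
        """Forward prediction, loss computation and parameter update (claim 3)."""
        optimizer = torch.optim.Adam(model.parameters(), lr=lr)
        criterion = torch.nn.L1Loss()                     # assumed depth regression loss
        for _ in range(epochs):
            for images, true_depth in data_loader:        # samples labelled with real depth
                pred_depth = model(images)                # forward prediction
                loss = criterion(pred_depth, true_depth)  # model loss value
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()                          # update model parameters
        return model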
4. The method for detecting a three-dimensional object according to claim 1, wherein the performing back projection processing according to the depth image data and the visual image data to obtain pseudo point cloud data comprises the following steps:
acquiring camera parameters;
converting pixel coordinates in the visual image data into normalized coordinates by the camera parameters;
and determining pseudo point cloud data according to the normalized coordinates, the depth values in the depth image data and the camera parameters.
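A sketch of the back projection of claim 4 under a standard pinhole camera model, assuming intrinsics given by focal lengths fx, fy and principal point (cx, cy); the actual camera parameters and any distortion handling used in the application may differ.

    import numpy as np

    def back_project(depth_map, fx, fy, cx, cy):
        """Convert a depth image into pseudo point cloud data (claim 4)."""
        h, w = depth_map.shape
        u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel coordinates

        # pixel coordinates -> normalized image-plane coordinates
        x_norm = (u - cx) / fx
        y_norm = (v - cy) / fy

        # scale by depth to obtain camera-frame 3D points
        z = depth_map
        x = x_norm * z
        y = y_norm * z
        return np.stack([x, y, z], axis=-1).reshape(-1, 3)  # (H*W) x 3 pseudo points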
5. The method for three-dimensional object detection according to claim 1, wherein the step of inputting the dense point cloud data into a three-dimensional object detection model based on point cloud voxelization to obtain a three-dimensional object detection result comprises the steps of:
performing voxelization processing on the dense point cloud data to obtain a plurality of voxel units in a three-dimensional voxel space;
extracting features from the plurality of voxel units through a three-dimensional convolutional neural network to obtain multi-scale voxel features;
and determining a three-dimensional target detection result according to the multi-scale voxel characteristics.
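An illustrative sketch of the multi-scale voxel feature extraction of claim 5, using dense 3D convolutions for brevity; practical detectors often use sparse 3D convolutions instead, and the channel and stride choices here are assumptions.

    import torch
    import torch.nn as nn

    class VoxelBackbone(nn.Module):
        """3D CNN producing voxel features at several scales (claim 5)."""
        def __init__(self, in_channels=4):
            super().__init__()
            self.stage1 = nn.Sequential(nn.Conv3d(in_channels, 16, 3, stride=1, padding=1), nn.ReLU())
            self.stage2 = nn.Sequential(nn.Conv3d(16, 32, 3, stride=2, padding=1), nn.ReLU())
            self.stage3 = nn.Sequential(nn.Conv3d(32, 64, 3, stride=2, padding=1), nn.ReLU())

        def forward(self, voxel_grid):       # N x C x D x H x W voxel grid
            f1 = self.stage1(voxel_grid)
            f2 = self.stage2(f1)
            f3 = self.stage3(f2)             # highest-scale feature
            return [f1, f2, f3]              # multi-scale voxel features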
6. The method of claim 5, wherein determining the three-dimensional object detection result from the multi-scale voxel features comprises the steps of:
selecting the highest scale feature in the multi-scale voxel features, and compressing the highest scale feature by using a bird's eye view perspective to obtain the voxel feature to be predicted;
and carrying out target detection on the voxel characteristics to be predicted through a regional suggestion network to obtain a three-dimensional target detection result.
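A sketch of the bird's-eye-view compression of claim 6: the height dimension of the highest-scale voxel feature is folded into the channel dimension, and a 2D convolutional head stands in for the region proposal network; all shapes are illustrative assumptions.

    import torch

    def to_bev(voxel_feature):
        """Compress a voxel feature (N, C, D, H, W) into a BEV map (N, C*D, H, W)."""
        n, c, d, h, w = voxel_feature.shape
        return voxel_feature.reshape(n, c * d, h, w)

    # Example: highest-scale feature -> BEV map -> per-location box regression
    highest = torch.randn(1, 64, 5, 100, 100)
    bev = to_bev(highest)                               # 1 x 320 x 100 x 100
    rpn_head = torch.nn.Conv2d(320, 7, kernel_size=1)   # e.g. (x, y, z, l, w, h, yaw)
    boxes = rpn_head(bev)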
7. The method for detecting a three-dimensional object according to claim 5, wherein the voxelization processing of the dense point cloud data to obtain a plurality of voxel units in a three-dimensional voxel space comprises the following steps:
acquiring voxel parameters of a voxel unit, wherein the voxel parameters comprise a voxel size and the number of point clouds that can be accommodated;
calculating voxel indexes where the point clouds are located according to coordinate information of each point cloud in the dense point cloud data;
and determining a voxel unit according to the voxel index, judging whether the voxel unit reaches an accommodation upper limit according to voxel parameters of the voxel unit, and mapping the point cloud into the voxel unit when the voxel unit does not reach the accommodation upper limit.
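The voxel assignment of claim 7 can be sketched as follows: each point's voxel index is computed from its coordinates and the voxel size, and the point is mapped into the voxel unit only while the accommodation upper limit has not been reached. The voxel size and capacity values are illustrative assumptions.

    import numpy as np

    def voxelize(points, voxel_size=(0.2, 0.2, 0.4), max_points_per_voxel=32):
        """Map dense point cloud data into voxel units with a capacity limit (claim 7)."""
        voxels = {}                                   # voxel index -> list of points
        indices = np.floor(points[:, :3] / np.array(voxel_size)).astype(np.int64)
        for point, idx in zip(points, map(tuple, indices)):
            bucket = voxels.setdefault(idx, [])
            if len(bucket) < max_points_per_voxel:    # accommodation upper limit check
                bucket.append(point)
        return voxels

    # Example: 1000 random points in a 40 m x 40 m x 4 m volume
    cloud = np.random.rand(1000, 3) * np.array([40.0, 40.0, 4.0])
    voxel_units = voxelize(cloud)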
8. A three-dimensional object detection system, comprising:
the first module is used for acquiring visual image data to be detected and sparse point cloud data;
the second module is used for inputting the visual image data into a depth prediction model constructed based on a convolutional neural network to obtain depth image data;
the third module is used for carrying out back projection processing according to the depth image data and the visual image data to obtain pseudo point cloud data;
a fourth module, configured to fuse the sparse point cloud data and the pseudo point cloud data to obtain dense point cloud data;
and a fifth module, configured to input the dense point cloud data into a three-dimensional target detection model based on point cloud voxelization, to obtain a three-dimensional target detection result.
9. An electronic device comprising a memory, a processor, a program stored on the memory and executable on the processor, and a data bus for enabling a connection communication between the processor and the memory, the program when executed by the processor implementing the steps of the three-dimensional object detection method according to any one of claims 1 to 7.
10. A storage medium, which is a computer-readable storage medium, for computer-readable storage, characterized in that the storage medium stores one or more programs executable by one or more processors to implement the steps of the three-dimensional object detection method of any one of claims 1 to 7.
CN202311490553.5A 2023-11-09 2023-11-09 Three-dimensional target detection method, system, electronic equipment and storage medium Pending CN117422884A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311490553.5A CN117422884A (en) 2023-11-09 2023-11-09 Three-dimensional target detection method, system, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311490553.5A CN117422884A (en) 2023-11-09 2023-11-09 Three-dimensional target detection method, system, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN117422884A true CN117422884A (en) 2024-01-19

Family

ID=89524722

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311490553.5A Pending CN117422884A (en) 2023-11-09 2023-11-09 Three-dimensional target detection method, system, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN117422884A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118097123A (en) * 2024-04-26 2024-05-28 烟台大学 Three-dimensional target detection method, system, equipment and medium based on point cloud and image
CN118212368A (en) * 2024-05-22 2024-06-18 堆知(北京)科技集团有限公司 Method and system for reconstructing three-dimensional image based on incomplete contour point cloud data


Similar Documents

Publication Publication Date Title
JP6745328B2 (en) Method and apparatus for recovering point cloud data
CN111798475B (en) Indoor environment 3D semantic map construction method based on point cloud deep learning
US11941831B2 (en) Depth estimation
CN110009674B (en) Monocular image depth of field real-time calculation method based on unsupervised depth learning
CN112991413A (en) Self-supervision depth estimation method and system
CN117422884A (en) Three-dimensional target detection method, system, electronic equipment and storage medium
CN104240297A (en) Rescue robot three-dimensional environment map real-time construction method
US20200057778A1 (en) Depth image pose search with a bootstrapped-created database
CN114724120A (en) Vehicle target detection method and system based on radar vision semantic segmentation adaptive fusion
CN114519710B (en) Parallax map generation method and device, electronic equipment and storage medium
WO2020157138A1 (en) Object detection apparatus, system and method
EP4053734A1 (en) Hand gesture estimation method and apparatus, device, and computer storage medium
CN111696196A (en) Three-dimensional face model reconstruction method and device
CN114372523A (en) Binocular matching uncertainty estimation method based on evidence deep learning
CN114463736A (en) Multi-target detection method and device based on multi-mode information fusion
CN116563493A (en) Model training method based on three-dimensional reconstruction, three-dimensional reconstruction method and device
CN114677479A (en) Natural landscape multi-view three-dimensional reconstruction method based on deep learning
CN116958434A (en) Multi-view three-dimensional reconstruction method, measurement method and system
CN113255779A (en) Multi-source perception data fusion identification method and system and computer readable storage medium
CN110514140B (en) Three-dimensional imaging method, device, equipment and storage medium
US10861174B2 (en) Selective 3D registration
CN116129234A (en) Attention-based 4D millimeter wave radar and vision fusion method
Ocegueda-Hernandez et al. A lightweight convolutional neural network for pose estimation of a planar model
CN115409949A (en) Model training method, visual angle image generation method, device, equipment and medium
CN115239559A (en) Depth map super-resolution method and system for fusion view synthesis

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination