CN112464905B - 3D target detection method and device - Google Patents

3D target detection method and device

Info

Publication number
CN112464905B
CN112464905B (application CN202011494753.4A)
Authority
CN
China
Prior art keywords
target
point cloud
shape
module
feature
Prior art date
Legal status
Active
Application number
CN202011494753.4A
Other languages
Chinese (zh)
Other versions
CN112464905A (en)
Inventor
刘彩苹
易子越
李智勇
Current Assignee
Hunan University
Original Assignee
Hunan University
Priority date
Filing date
Publication date
Application filed by Hunan University
Priority to CN202011494753.4A
Publication of CN112464905A
Application granted
Publication of CN112464905B
Legal status: Active


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/50 - Context or environment of the image
    • G06V20/56 - Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G06F18/243 - Classification techniques relating to the number of classes
    • G06F18/2431 - Multiple classes
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/25 - Fusion techniques
    • G06F18/253 - Fusion techniques of extracted features
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00 - Geometric image transformations in the plane of the image
    • G06T3/40 - Scaling of whole images or parts thereof, e.g. expanding or contracting
    • G06T3/4038 - Image mosaicing, e.g. composing plane images from plane sub-images
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/20 - Image preprocessing
    • G06V10/26 - Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V10/267 - Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/40 - Extraction of image or video features
    • G06V10/56 - Extraction of image or video features relating to colour
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00 - Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07 - Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a 3D target detection method comprising: acquiring an original RGB image; performing 2D target detection on the RGB image to obtain a 2D bounding box and a target category; segmenting and resampling the point cloud with the 2D bounding box to obtain frustum point cloud data containing the target; cropping the RGB image with the 2D bounding box to obtain a target RGB image; feeding the target RGB image into a feature extraction network to obtain RGB depth features; feeding the frustum point cloud data and the RGB depth features into a segmentation network to obtain a segmentation mask and converting the mask into a target point cloud; and resampling the target point cloud and feeding it into a 3D box prediction network to obtain the final target 3D bounding box. The invention also provides a device for implementing the 3D target detection method. By fusing RGB depth features with frustum point cloud data, the method achieves higher reliability and better accuracy.

Description

3D target detection method and device
Technical Field
The invention belongs to the field of image processing, and particularly relates to a 3D target detection method and device.
Background
With the development of economic technology and the wide application of intelligent technology, autonomous driving has become a research hotspot.
Multi-modal perception fusion is an important component of an autonomous driving system: such a system often needs to fuse the data of several sensors and detect targets in three-dimensional space, so as to provide the planning module with a truthful, reliable and reasonable representation of the vehicle's surroundings.
A frustum point cloud is formed by using the target's 2D bounding box on the image plane, together with the mapping between the LiDAR coordinate system and the camera coordinate system, to select the laser points lying inside the spatial viewing frustum of the 2D target.
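As an illustration of this frustum construction (not part of the patent text), a minimal NumPy sketch is given below; the KITTI-style calibration inputs, a 3 × 4 camera projection matrix and a homogeneous LiDAR-to-camera transform, are assumptions.

```python
import numpy as np

def frustum_points(points_lidar, box2d, P, Tr_velo_to_cam):
    """Keep the LiDAR points whose image projection falls inside the 2D box.

    points_lidar:   (N, 4) array of [x, y, z, reflectance] in LiDAR coordinates.
    box2d:          (xmin, ymin, xmax, ymax) in pixel coordinates.
    P:              (3, 4) camera projection matrix (assumed KITTI-style calibration).
    Tr_velo_to_cam: (4, 4) homogeneous LiDAR-to-camera transform (assumed).
    """
    n = points_lidar.shape[0]
    xyz1 = np.hstack([points_lidar[:, :3], np.ones((n, 1))])      # homogeneous LiDAR coordinates
    cam = (Tr_velo_to_cam @ xyz1.T).T                             # points in the camera frame
    uvw = (P @ np.hstack([cam[:, :3], np.ones((n, 1))]).T).T      # project onto the image plane
    u = uvw[:, 0] / uvw[:, 2]
    v = uvw[:, 1] / uvw[:, 2]
    xmin, ymin, xmax, ymax = box2d
    in_box = (u >= xmin) & (u <= xmax) & (v >= ymin) & (v <= ymax)
    in_front = cam[:, 2] > 0                                      # keep only points in front of the camera
    return points_lidar[in_box & in_front]
```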
Currently there are many 3D target detection methods based on frustum point clouds:
Method a: detect 2D target boxes on the RGB image; feed the segmented frustum point cloud into a segmentation network for binary target/non-target classification, output a segmentation mask and segment out the target point cloud; feed the target point cloud into a 3D bounding box prediction network that regresses the target center coordinates and classifies and regresses the size and heading angle, finally outputting a target 3D bounding box expressed as a vector (x, y, z, w, l, h, θ);
Method b: on the basis of method a, a Mask R-CNN is introduced to directly output a 2D mask of the target on the image plane, and this 2D mask is used to segment the original point cloud into the frustum point cloud, instead of segmenting it in three-dimensional coordinates as in method a;
Method c: on the basis of method a, an attention mechanism is introduced into the target/non-target classification to find the spatial points and feature channels in the point cloud data that deserve attention, so as to effectively strengthen the target information, and Focal Loss is used to address the class imbalance between target and background points in the point cloud data.
However, existing 3D target detection methods still suffer from poor accuracy and low reliability, which limits the application of multi-modal perception fusion.
Disclosure of Invention
The first objective of the invention is to provide a 3D target detection method with high reliability and good accuracy.
The second objective of the invention is to provide a device for implementing this 3D target detection method.
The 3D target detection method provided by the invention comprises the following steps (a minimal end-to-end sketch in code follows the list):
S1, acquiring an original RGB image;
S2, performing 2D target detection on the RGB image acquired in step S1 to obtain a 2D bounding box and a target category;
S3, segmenting and resampling the point cloud with the 2D bounding box obtained in step S2 to obtain frustum point cloud data containing the target;
S4, cropping the RGB image acquired in step S1 with the 2D bounding box obtained in step S2 to obtain a target RGB image;
S5, feeding the target RGB image obtained in step S4 into a feature extraction network to obtain RGB depth features;
S6, feeding the frustum point cloud data obtained in step S3 and the RGB depth features obtained in step S5 into a segmentation network to obtain a segmentation mask, and converting the segmentation mask into a target point cloud;
S7, resampling the target point cloud obtained in step S6 and feeding it into a 3D box prediction network to obtain the final target 3D bounding box.
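To make the data flow of steps S1-S7 concrete, the sketch below wires the stages together; every callable passed in is a hypothetical stand-in for a component detailed in the following paragraphs, not an interface defined by the patent.

```python
def detect_3d(rgb_image, lidar_points, calib,
              detect_2d, extract_frustum, crop_target, rgb_features,
              segment_points, predict_box):
    """End-to-end flow of steps S1-S7; all callables are illustrative stand-ins."""
    boxes2d, classes = detect_2d(rgb_image)                      # S2: 2D boxes and categories
    results = []
    for box2d, cls in zip(boxes2d, classes):
        frustum = extract_frustum(lidar_points, box2d, calib)    # S3: frustum segmentation + resampling
        target_rgb = crop_target(rgb_image, box2d)               # S4: crop, pad, resize
        gamma = rgb_features(target_rgb)                         # S5: ResNet-50 + 1x1 convolution
        obj_points = segment_points(frustum, gamma, cls)         # S6: mask -> target point cloud
        results.append(predict_box(obj_points, gamma, cls))      # S7: final 3D bounding box
    return results
```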
In step S3, the 2D bounding box obtained in step S2 is used for segmentation and resampling to obtain the frustum point cloud data containing the target; specifically, the point cloud is segmented with the 2D bounding box obtained in step S2 and resampled to 1024 points, yielding the frustum point cloud data containing the target.
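A minimal sketch of the 1024-point resampling; sampling with replacement when the frustum holds fewer points is an assumption, as the text only fixes the output count.

```python
import numpy as np

def resample_points(points, n=1024, rng=None):
    """Resample a point set to exactly n points along the first dimension."""
    rng = np.random.default_rng() if rng is None else rng
    idx = rng.choice(len(points), size=n, replace=len(points) < n)
    return points[idx]
```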
In step S4, the RGB image acquired in step S1 is cropped with the 2D bounding box obtained in step S2 to obtain the target RGB image; specifically, the copyMakeBorder function of the OpenCV library is used to pad the edges with the gray value (128, 128, 128), producing a square image with an aspect ratio of 1:1, which is then resized to a fixed [224 × 224].
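The cropping and padding can be sketched with OpenCV as follows; centering the padding is an assumption, since the text only specifies the gray fill value and the final square size.

```python
import cv2

def crop_and_square(image, box2d, size=224):
    """Crop the 2D box, pad to a square with gray (128, 128, 128), resize to size x size."""
    xmin, ymin, xmax, ymax = [int(round(v)) for v in box2d]
    patch = image[ymin:ymax, xmin:xmax]
    h, w = patch.shape[:2]
    diff = abs(h - w)
    top = bottom = left = right = 0
    if h > w:
        left, right = diff // 2, diff - diff // 2     # pad the narrow side to reach 1:1
    else:
        top, bottom = diff // 2, diff - diff // 2
    patch = cv2.copyMakeBorder(patch, top, bottom, left, right,
                               cv2.BORDER_CONSTANT, value=(128, 128, 128))
    return cv2.resize(patch, (size, size))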
In step S5, the target RGB image obtained in step S4 is fed into the feature extraction network to obtain the RGB depth features; specifically, the target RGB image is input to a ResNet-50 network, which outputs a feature of shape [1 × 1 × 2048], and a [1 × 1, 128] convolution reduces its dimensionality, yielding the RGB depth feature γ of shape [1 × 1 × 128].
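A sketch of this feature extractor in PyTorch; the batch dimension and the exact grouping of the backbone layers are assumptions beyond what the text states.

```python
import torch.nn as nn
import torchvision

class RGBFeatureExtractor(nn.Module):
    """ResNet-50 backbone followed by a [1 x 1, 128] convolution producing gamma."""
    def __init__(self):
        super().__init__()
        backbone = torchvision.models.resnet50()
        self.backbone = nn.Sequential(*list(backbone.children())[:-1])  # drop the FC head -> [B, 2048, 1, 1]
        self.reduce = nn.Conv2d(2048, 128, kernel_size=1)               # dimensionality reduction -> [B, 128, 1, 1]

    def forward(self, rgb):                       # rgb: [B, 3, 224, 224]
        return self.reduce(self.backbone(rgb))    # gamma: [B, 128, 1, 1]
```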
In step S6, the frustum point cloud data obtained in step S3 and the RGB depth features obtained in step S5 are fed into the segmentation network to obtain a segmentation mask, which is then converted into the target point cloud; specifically, the target point cloud is obtained by the following steps (a sketch of these steps in code follows the list):
A. expanding the dimensionality of the frustum point cloud tensor of shape [1024 × 4] obtained in step S3 and passing it through 3 convolution layers of [1 × 1, 64] to obtain the point-wise feature α of shape [1024 × 1 × 64];
B. passing the point-wise feature α obtained in step A through two convolution layers, [1 × 1, 128] and [1 × 1, 1024], to obtain a feature of shape [1024 × 1 × 1024], and max-pooling it over the first dimension to obtain the global feature β of shape [1 × 1 × 1024];
C. since there are three target categories, expressing the target category obtained in step S2 as a tensor of shape [3] and expanding its dimensions to obtain the category feature δ of shape [1 × 1 × 3];
D. concatenating the global feature β obtained in step B, the RGB depth feature γ obtained in step S5 and the category feature δ obtained in step C along the third dimension, and replicating along the first dimension to obtain the feature ε of shape [1024 × 1 × 1155];
E. concatenating the point-wise feature α obtained in step A and the feature ε obtained in step D along the third dimension to obtain a feature of shape [1024 × 1 × 1219];
F. passing the [1024 × 1 × 1219] feature obtained in step E through one [1 × 1, 512] convolution layer, one [1 × 1, 256] convolution layer, two [1 × 1, 128] convolution layers and one [1 × 1, 2] convolution layer in turn, then deleting the second dimension to obtain the segmentation mask of shape [1024 × 2];
G. the segmentation mask of shape [1024 × 2] obtained in step F gives the two-class score of each of the 1024 points in the input frustum point cloud; the target points are segmented out accordingly, resampled along the first dimension, and output as the target point cloud of shape [1024 × 3].
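A PyTorch sketch of steps A-G is given below. It uses channels-first layout ([batch, channels, points, 1]) whereas the shapes in the text are channels-last, and the BatchNorm/ReLU inside each convolution block as well as the fallback behavior in step G are assumptions; the channel widths follow the text.

```python
import torch
import torch.nn as nn

def conv_block(c_in, c_out):
    # [1 x 1, c_out] convolution with an assumed BatchNorm + ReLU
    return nn.Sequential(nn.Conv2d(c_in, c_out, 1), nn.BatchNorm2d(c_out), nn.ReLU())

class FrustumSegmentationNet(nn.Module):
    """Per-point binary segmentation over the 1024-point frustum cloud (steps A-F)."""
    def __init__(self, num_classes=3):
        super().__init__()
        self.point_feat = nn.Sequential(conv_block(4, 64), conv_block(64, 64), conv_block(64, 64))  # step A
        self.global_feat = nn.Sequential(conv_block(64, 128), conv_block(128, 1024))                 # step B
        self.head = nn.Sequential(conv_block(64 + 1024 + 128 + num_classes, 512),                    # step F
                                  conv_block(512, 256), conv_block(256, 128), conv_block(128, 128),
                                  nn.Conv2d(128, 2, 1))

    def forward(self, pts, gamma, onehot):
        # pts: [B, 4, 1024, 1]; gamma: [B, 128, 1, 1]; onehot: [B, num_classes]
        alpha = self.point_feat(pts)                                     # [B, 64, 1024, 1]
        beta = self.global_feat(alpha).max(dim=2, keepdim=True).values   # [B, 1024, 1, 1]
        delta = onehot[:, :, None, None]                                 # [B, num_classes, 1, 1] (step C)
        eps = torch.cat([beta, gamma, delta], dim=1)                     # [B, 1155, 1, 1] (step D)
        eps = eps.expand(-1, -1, alpha.shape[2], -1)                     # replicate over the 1024 points
        logits = self.head(torch.cat([alpha, eps], dim=1))               # [B, 2, 1024, 1] (steps E-F)
        return logits.squeeze(3)                                         # segmentation logits [B, 2, 1024]

def mask_to_target_cloud(pts_xyz, logits, n=1024):
    """Step G: keep the points predicted as target and resample back to n points."""
    keep = logits.argmax(dim=1) == 1                 # [B, 1024] boolean foreground mask
    clouds = []
    for b in range(pts_xyz.shape[0]):                # pts_xyz: [B, 1024, 3]
        fg = pts_xyz[b, keep[b]]                     # [m, 3] foreground points
        if fg.shape[0] == 0:                         # assumed fallback when nothing is kept
            fg = pts_xyz[b]
        idx = torch.randint(fg.shape[0], (n,))       # resample along the first dimension
        clouds.append(fg[idx])
    return torch.stack(clouds)                       # target point cloud [B, n, 3]
```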
In step S7, the target point cloud obtained in step S6 is resampled and fed into the 3D box prediction network to obtain the final target 3D bounding box; specifically, the final target 3D bounding box is obtained by the following steps (a sketch of this network in code follows the list):
a. resampling the input target point cloud of shape [1024 × 3] to [512 × 3] and expanding one dimension at the second dimension to obtain a tensor of shape [512 × 1 × 3];
b. passing the [512 × 1 × 3] tensor obtained in step a through two [1 × 1, 128] convolution layers, one [1 × 1, 256] convolution layer and one [1 × 1, 512] convolution layer in turn to obtain a feature of shape [512 × 1 × 512];
c. max-pooling the [512 × 1 × 512] feature obtained in step b over the first dimension to obtain the feature ξ of shape [1 × 1 × 512];
d. reducing the RGB depth feature γ obtained in step S5 to [1 × 1 × 64] with a [1 × 1, 64] convolution, concatenating it with the [1 × 1 × 512] feature ξ obtained in step c along the third dimension, and deleting the first two dimensions to obtain a tensor of shape [576];
e. concatenating the [576] tensor obtained in step d with the target category tensor of shape [3] to obtain a feature of shape [579];
f. feeding the [579] feature obtained in step e through three fully connected layers of widths 512, 256 and 59 in sequence, finally outputting a vector of length 59;
g. in the length-59 vector obtained in step f, the items have the following meaning: the first three items are the target center x, y and z coordinates; the next 24 items are [angle score, corresponding residual] for 12 predefined angles; the next 32 items are [size score, height residual, width residual, length residual] for 8 predefined sizes;
h. restoring the final target 3D bounding box according to the definitions of step g.
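A PyTorch sketch of steps a-f follows, with the same channels-first caveat and the same assumed conv_block helper as in the segmentation sketch.

```python
import torch
import torch.nn as nn

def conv_block(c_in, c_out):
    # same [1 x 1] convolution helper as in the segmentation sketch (BatchNorm + ReLU assumed)
    return nn.Sequential(nn.Conv2d(c_in, c_out, 1), nn.BatchNorm2d(c_out), nn.ReLU())

NUM_HEADING_BINS, NUM_SIZE_CLUSTERS = 12, 8

class BoxPredictionNet(nn.Module):
    """Regresses the 59-dimensional box vector of step g from 512 resampled
    target points, the reduced RGB depth feature and the category one-hot vector."""
    def __init__(self, num_classes=3):
        super().__init__()
        self.point_feat = nn.Sequential(conv_block(3, 128), conv_block(128, 128),    # step b
                                        conv_block(128, 256), conv_block(256, 512))
        self.reduce_gamma = nn.Conv2d(128, 64, kernel_size=1)                         # step d
        out_dim = 3 + 2 * NUM_HEADING_BINS + 4 * NUM_SIZE_CLUSTERS                    # = 59 (step g)
        self.mlp = nn.Sequential(nn.Linear(512 + 64 + num_classes, 512), nn.ReLU(),   # step f
                                 nn.Linear(512, 256), nn.ReLU(),
                                 nn.Linear(256, out_dim))

    def forward(self, obj_pts, gamma, onehot):
        # obj_pts: [B, 3, 512, 1]; gamma: [B, 128, 1, 1]; onehot: [B, num_classes]
        xi = self.point_feat(obj_pts).max(dim=2).values.flatten(1)   # step c: [B, 512]
        g = self.reduce_gamma(gamma).flatten(1)                      # step d: [B, 64]
        feat = torch.cat([xi, g, onehot], dim=1)                     # step e: [B, 579]
        return self.mlp(feat)                                        # [B, 59]
```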
The invention also provides a device for implementing the above 3D target detection method, which comprises an image acquisition module, a 2D target detection module, a segmentation-and-resampling module, a cropping module, a feature extraction module, a segmentation network module and a 3D box prediction network module, connected in series in that order. The image acquisition module is used for acquiring the original RGB image; the 2D target detection module is used for performing 2D target detection on the acquired RGB image to obtain a 2D bounding box and a target category; the segmentation-and-resampling module is used for segmenting and resampling the point cloud with the obtained 2D bounding box to obtain the frustum point cloud data containing the target; the cropping module is used for cropping the RGB image with the obtained 2D bounding box to obtain the target RGB image; the feature extraction module is used for extracting features from the target RGB image to obtain the RGB depth features; the segmentation network module is used for segmenting the frustum point cloud data together with the RGB depth features to obtain a segmentation mask and converting it into the target point cloud; and the 3D box prediction network module is used for resampling the target point cloud and feeding it into the 3D box prediction network to obtain the final target 3D bounding box.
The 3D target detection method provided by the invention fuses the RGB depth features with the frustum point cloud data during detection and computation to obtain the final target 3D bounding box; because both modalities are fused, the method achieves higher reliability and better accuracy.
Drawings
FIG. 1 is a schematic flow chart of the method of the present invention.
FIG. 2 is a functional block diagram of the apparatus of the present invention.
Detailed Description
FIG. 1 shows the schematic flow chart of the method of the invention. The 3D target detection method provided by the invention comprises the following steps:
S1, acquiring an original RGB image;
S2, performing 2D target detection on the RGB image acquired in step S1 to obtain a 2D bounding box and a target category;
S3, segmenting and resampling the point cloud with the 2D bounding box obtained in step S2 to obtain the frustum point cloud data containing the target; specifically, the point cloud is segmented with the 2D bounding box obtained in step S2 and resampled to 1024 points, yielding the frustum point cloud data containing the target;
S4, cropping the RGB image acquired in step S1 with the 2D bounding box obtained in step S2 to obtain the target RGB image; specifically, the copyMakeBorder function of the OpenCV library is used to pad the edges with the gray value (128, 128, 128), producing a square image with an aspect ratio of 1:1, which is then resized to a fixed [224 × 224];
S5, feeding the target RGB image obtained in step S4 into the feature extraction network to obtain the RGB depth features; specifically, the target RGB image is input to a ResNet-50 network, which outputs a feature of shape [1 × 1 × 2048], and a [1 × 1, 128] convolution reduces its dimensionality, yielding the RGB depth feature γ of shape [1 × 1 × 128];
S6, feeding the frustum point cloud data obtained in step S3 and the RGB depth features obtained in step S5 into the segmentation network to obtain a segmentation mask, and converting the segmentation mask into the target point cloud; specifically, the target point cloud is obtained by the following steps:
A. expanding the dimensionality of the frustum point cloud tensor of shape [1024 × 4] obtained in step S3 and passing it through 3 convolution layers of [1 × 1, 64] to obtain the point-wise feature α of shape [1024 × 1 × 64];
B. passing the point-wise feature α obtained in step A through two convolution layers, [1 × 1, 128] and [1 × 1, 1024], to obtain a feature of shape [1024 × 1 × 1024], and max-pooling it over the first dimension to obtain the global feature β of shape [1 × 1 × 1024];
C. since there are three target categories, expressing the target category obtained in step S2 as a tensor of shape [3] and expanding its dimensions to obtain the category feature δ of shape [1 × 1 × 3];
D. concatenating the global feature β obtained in step B, the RGB depth feature γ obtained in step S5 and the category feature δ obtained in step C along the third dimension, and replicating along the first dimension to obtain the feature ε of shape [1024 × 1 × 1155];
E. concatenating the point-wise feature α obtained in step A and the feature ε obtained in step D along the third dimension to obtain a feature of shape [1024 × 1 × 1219];
F. passing the [1024 × 1 × 1219] feature obtained in step E through one [1 × 1, 512] convolution layer, one [1 × 1, 256] convolution layer, two [1 × 1, 128] convolution layers and one [1 × 1, 2] convolution layer in turn, then deleting the second dimension to obtain the segmentation mask of shape [1024 × 2];
G. the segmentation mask of shape [1024 × 2] obtained in step F gives the two-class score of each of the 1024 points in the input frustum point cloud, and the target point cloud of shape [1024 × 3] is obtained by segmentation;
S7, resampling the target point cloud obtained in step S6 and feeding it into the 3D box prediction network to obtain the final target 3D bounding box; specifically, the final target 3D bounding box is obtained by the following steps (a decoding sketch follows the list):
a. resampling the input target point cloud of shape [1024 × 3] to [512 × 3] and expanding one dimension at the second dimension to obtain a tensor of shape [512 × 1 × 3];
b. passing the [512 × 1 × 3] tensor obtained in step a through two [1 × 1, 128] convolution layers, one [1 × 1, 256] convolution layer and one [1 × 1, 512] convolution layer in turn to obtain a feature of shape [512 × 1 × 512];
c. max-pooling the [512 × 1 × 512] feature obtained in step b over the first dimension to obtain the feature ξ of shape [1 × 1 × 512];
d. reducing the RGB depth feature γ obtained in step S5 to [1 × 1 × 64] with a [1 × 1, 64] convolution, concatenating it with the [1 × 1 × 512] feature ξ obtained in step c along the third dimension, and deleting the first two dimensions to obtain a tensor of shape [576];
e. concatenating the [576] tensor obtained in step d with the target category tensor of shape [3] to obtain a feature of shape [579];
f. feeding the [579] feature obtained in step e through three fully connected layers of widths 512, 256 and 59 in sequence, finally outputting a vector of length 59;
g. in the length-59 vector obtained in step f, the items have the following meaning: the first three items are the target center x, y and z coordinates; the next 24 items are [angle score, corresponding residual] for 12 predefined angles; the next 32 items are [size score, height residual, width residual, length residual] for 8 predefined sizes;
h. restoring the final target 3D bounding box according to the definitions of step g.
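As an illustration of the decoding in step h, a NumPy sketch is given below; the pairwise ordering of scores and residuals within each group, the even spacing of the 12 angle bins over 2π, and the placeholder anchor sizes are assumptions not fixed by the text.

```python
import numpy as np

NUM_HEADING_BINS, NUM_SIZE_CLUSTERS = 12, 8
# Hypothetical per-cluster anchor sizes (h, w, l); the patent does not list the 8 predefined sizes.
SIZE_ANCHORS = np.ones((NUM_SIZE_CLUSTERS, 3))

def decode_box(pred):
    """Turn the length-59 vector of step f into (x, y, z, w, l, h, theta)."""
    x, y, z = pred[0:3]                                                       # target center
    heading = pred[3:3 + 2 * NUM_HEADING_BINS].reshape(NUM_HEADING_BINS, 2)   # [score, residual] per angle bin
    sizes = pred[3 + 2 * NUM_HEADING_BINS:].reshape(NUM_SIZE_CLUSTERS, 4)     # [score, dh, dw, dl] per size
    k = int(np.argmax(heading[:, 0]))
    theta = k * (2.0 * np.pi / NUM_HEADING_BINS) + heading[k, 1]              # chosen bin center + residual
    m = int(np.argmax(sizes[:, 0]))
    h, w, l = SIZE_ANCHORS[m] + sizes[m, 1:4]                                 # chosen anchor size + residuals
    return x, y, z, w, l, h, theta
```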
On the val split of the KITTI dataset, the method of the invention was tested with the same set of 2D detection results as method a of the background art; the resulting comparison of average precision (AP) is shown in Table 1 below:
Table 1. Comparison of average precision (AP) results (the table is reproduced as an image in the original patent).
As can be seen from Table 1, the method of the invention achieves better precision, higher reliability and better accuracy.
FIG. 2 shows the functional modules of the device of the invention. The device for implementing the above 3D target detection method comprises an image acquisition module, a 2D target detection module, a segmentation-and-resampling module, a cropping module, a feature extraction module, a segmentation network module and a 3D box prediction network module, connected in series in that order. The image acquisition module is used for acquiring the original RGB image; the 2D target detection module is used for performing 2D target detection on the acquired RGB image to obtain a 2D bounding box and a target category; the segmentation-and-resampling module is used for segmenting and resampling the point cloud with the obtained 2D bounding box to obtain the frustum point cloud data containing the target; the cropping module is used for cropping the RGB image with the obtained 2D bounding box to obtain the target RGB image; the feature extraction module is used for extracting features from the target RGB image to obtain the RGB depth features; the segmentation network module is used for segmenting the frustum point cloud data together with the RGB depth features to obtain a segmentation mask and converting it into the target point cloud; and the 3D box prediction network module is used for resampling the target point cloud and feeding it into the 3D box prediction network to obtain the final target 3D bounding box.

Claims (5)

1. A 3D target detection method, comprising the following steps:
S1, acquiring an original RGB image;
S2, performing 2D target detection on the RGB image acquired in step S1 to obtain a 2D bounding box and a target category;
S3, segmenting and resampling the point cloud with the 2D bounding box obtained in step S2 to obtain frustum point cloud data containing a target;
S4, cropping the RGB image acquired in step S1 with the 2D bounding box obtained in step S2 to obtain a target RGB image;
S5, feeding the target RGB image obtained in step S4 into a feature extraction network to obtain RGB depth features;
S6, feeding the frustum point cloud data obtained in step S3 and the RGB depth features obtained in step S5 into a segmentation network to obtain a segmentation mask, and converting the segmentation mask into a target point cloud; specifically, the target point cloud is obtained by the following steps:
A. expanding the dimensionality of the frustum point cloud tensor of shape [1024 × 4] obtained in step S3 and passing it through 3 convolution layers of [1 × 1, 64] to obtain a point-wise feature α of shape [1024 × 1 × 64];
B. passing the point-wise feature α obtained in step A through two convolution layers, [1 × 1, 128] and [1 × 1, 1024], to obtain a feature of shape [1024 × 1 × 1024], and max-pooling it over the first dimension to obtain a global feature β of shape [1 × 1 × 1024];
C. since there are three target categories, expressing the target category obtained in step S2 as a tensor of shape [3] and expanding its dimensions to obtain a category feature δ of shape [1 × 1 × 3];
D. concatenating the global feature β obtained in step B, the RGB depth feature γ obtained in step S5 and the category feature δ obtained in step C along the third dimension, and replicating along the first dimension to obtain a feature ε of shape [1024 × 1 × 1155];
E. concatenating the point-wise feature α obtained in step A and the feature ε obtained in step D along the third dimension to obtain a feature of shape [1024 × 1 × 1219];
F. passing the [1024 × 1 × 1219] feature obtained in step E through one [1 × 1, 512] convolution layer, one [1 × 1, 256] convolution layer, two [1 × 1, 128] convolution layers and one [1 × 1, 2] convolution layer in turn, then deleting the second dimension to obtain a segmentation mask of shape [1024 × 2];
G. the segmentation mask of shape [1024 × 2] obtained in step F giving the two-class score of each of the 1024 points in the input frustum point cloud, the target point cloud of shape [1024 × 3] being obtained by segmentation;
S7, resampling the target point cloud obtained in step S6 and feeding it into a 3D box prediction network to obtain a final target 3D bounding box; specifically, the final target 3D bounding box is obtained by the following steps:
a. resampling the input target point cloud of shape [1024 × 3] to [512 × 3] and expanding one dimension at the second dimension to obtain a tensor of shape [512 × 1 × 3];
b. passing the [512 × 1 × 3] tensor obtained in step a through two [1 × 1, 128] convolution layers, one [1 × 1, 256] convolution layer and one [1 × 1, 512] convolution layer in turn to obtain a feature of shape [512 × 1 × 512];
c. max-pooling the [512 × 1 × 512] feature obtained in step b over the first dimension to obtain a feature ξ of shape [1 × 1 × 512];
d. reducing the RGB depth feature γ obtained in step S5 to [1 × 1 × 64] with a [1 × 1, 64] convolution, concatenating it with the [1 × 1 × 512] feature ξ obtained in step c along the third dimension, and deleting the first two dimensions to obtain a tensor of shape [576];
e. concatenating the [576] tensor obtained in step d with the target category tensor of shape [3] to obtain a feature of shape [579];
f. feeding the [579] feature obtained in step e through three fully connected layers of widths 512, 256 and 59 in sequence, finally outputting a vector of length 59;
g. in the length-59 vector obtained in step f, the items having the following meaning: the first three items are the target center x, y and z coordinates; the next 24 items are [angle score, corresponding residual] for 12 predefined angles; the next 32 items are [size score, height residual, width residual, length residual] for 8 predefined sizes;
h. restoring the final target 3D bounding box according to the definitions of step g.
2. The 3D target detection method according to claim 1, wherein in step S3, the 2D bounding box obtained in step S2 is used for segmentation and resampling to obtain the frustum point cloud data containing the target; specifically, the point cloud is segmented with the 2D bounding box obtained in step S2 and resampled to 1024 points, yielding the frustum point cloud data containing the target.
3. The 3D target detection method according to claim 2, wherein in step S4, the RGB image acquired in step S1 is cropped with the 2D bounding box obtained in step S2 to obtain the target RGB image; specifically, the copyMakeBorder function of the OpenCV library is used to pad the edges with the gray value (128, 128, 128), producing a square image with an aspect ratio of 1:1, which is then resized to a fixed [224 × 224].
4. The 3D target detection method according to claim 3, wherein in step S5, the target RGB image obtained in step S4 is fed into the feature extraction network to obtain the RGB depth features; specifically, the target RGB image is input to a ResNet-50 network, which outputs a feature of shape [1 × 1 × 2048], and a [1 × 1, 128] convolution reduces its dimensionality, yielding the RGB depth feature γ of shape [1 × 1 × 128].
5. A device for implementing the 3D target detection method of claim 1, characterized by comprising an image acquisition module, a 2D target detection module, a segmentation-and-resampling module, a cropping module, a feature extraction module, a segmentation network module and a 3D box prediction network module, connected in series in that order; the image acquisition module is used for acquiring an original RGB image; the 2D target detection module is used for performing 2D target detection on the acquired RGB image to obtain a 2D bounding box and a target category; the segmentation-and-resampling module is used for segmenting and resampling the point cloud with the obtained 2D bounding box to obtain the frustum point cloud data containing the target; the cropping module is used for cropping the RGB image with the obtained 2D bounding box to obtain the target RGB image; the feature extraction module is used for extracting features from the target RGB image to obtain the RGB depth features; the segmentation network module is used for segmenting the frustum point cloud data together with the RGB depth features to obtain a segmentation mask and converting it into the target point cloud; and the 3D box prediction network module is used for resampling the target point cloud and feeding it into the 3D box prediction network to obtain the final target 3D bounding box.
CN202011494753.4A 2020-12-17 2020-12-17 3D target detection method and device Active CN112464905B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011494753.4A CN112464905B (en) 2020-12-17 2020-12-17 3D target detection method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011494753.4A CN112464905B (en) 2020-12-17 2020-12-17 3D target detection method and device

Publications (2)

Publication Number Publication Date
CN112464905A CN112464905A (en) 2021-03-09
CN112464905B (en) 2022-07-26

Family

ID=74803668

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011494753.4A Active CN112464905B (en) 2020-12-17 2020-12-17 3D target detection method and device

Country Status (1)

Country Link
CN (1) CN112464905B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115496923B (en) * 2022-09-14 2023-10-20 北京化工大学 Multi-mode fusion target detection method and device based on uncertainty perception

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10824862B2 (en) * 2017-11-14 2020-11-03 Nuro, Inc. Three-dimensional object detection for autonomous robotic systems using image proposals

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110390302A (en) * 2019-07-24 2019-10-29 厦门大学 A kind of objective detection method
CN110472534A (en) * 2019-07-31 2019-11-19 厦门理工学院 3D object detection method, device, equipment and storage medium based on RGB-D data
CN110879994A (en) * 2019-12-02 2020-03-13 中国科学院自动化研究所 Three-dimensional visual inspection detection method, system and device based on shape attention mechanism
CN111145174A (en) * 2020-01-02 2020-05-12 南京邮电大学 3D target detection method for point cloud screening based on image semantic features
CN111723721A (en) * 2020-06-15 2020-09-29 中国传媒大学 Three-dimensional target detection method, system and device based on RGB-D

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Charles R. Qi et al. "Frustum PointNets for 3D Object Detection from RGB-D Data." 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018. *

Also Published As

Publication number Publication date
CN112464905A (en) 2021-03-09

Similar Documents

Publication Publication Date Title
CN111027401B (en) End-to-end target detection method with integration of camera and laser radar
CN110264416B (en) Sparse point cloud segmentation method and device
CN114708585B (en) Attention mechanism-based millimeter wave radar and vision fusion three-dimensional target detection method
CN110674829B (en) Three-dimensional target detection method based on graph convolution attention network
CN111145174B (en) 3D target detection method for point cloud screening based on image semantic features
CN111563415B (en) Binocular vision-based three-dimensional target detection system and method
CN110298884B (en) Pose estimation method suitable for monocular vision camera in dynamic environment
CN111080693A (en) Robot autonomous classification grabbing method based on YOLOv3
CN112613378B (en) 3D target detection method, system, medium and terminal
JP2007527569A (en) Imminent collision detection based on stereoscopic vision
CN111797836B (en) Depth learning-based obstacle segmentation method for extraterrestrial celestial body inspection device
CN115116049B (en) Target detection method and device, electronic equipment and storage medium
CN111768415A (en) Image instance segmentation method without quantization pooling
CN113657409A (en) Vehicle loss detection method, device, electronic device and storage medium
TWI745204B (en) High-efficiency LiDAR object detection method based on deep learning
CN115035296B (en) Flying car 3D semantic segmentation method and system based on aerial view projection
Gluhaković et al. Vehicle detection in the autonomous vehicle environment for potential collision warning
CN114792416A (en) Target detection method and device
CN115641322A (en) Robot grabbing method and system based on 6D pose estimation
CN112464905B (en) 3D target detection method and device
CN117058646A (en) Complex road target detection method based on multi-mode fusion aerial view
CN115115917A (en) 3D point cloud target detection method based on attention mechanism and image feature fusion
CN113221957A (en) Radar information fusion characteristic enhancement method based on Centernet
EP3905107A1 (en) Computer-implemented method for 3d localization of an object based on image data and depth data
CN115035492B (en) Vehicle identification method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant