CN115063539A - Image dimension increasing method and three-dimensional target detection method - Google Patents

Image dimension increasing method and three-dimensional target detection method

Info

Publication number
CN115063539A
CN115063539A (Application No. CN202210847218.5A)
Authority
CN
China
Prior art keywords
image
dimensional
voxel
features
point cloud
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210847218.5A
Other languages
Chinese (zh)
Other versions
CN115063539B (en)
Inventor
李怡康
石博天
李鑫
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai AI Innovation Center
Original Assignee
Shanghai AI Innovation Center
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai AI Innovation Center filed Critical Shanghai AI Innovation Center
Priority to CN202210847218.5A priority Critical patent/CN115063539B/en
Publication of CN115063539A publication Critical patent/CN115063539A/en
Application granted granted Critical
Publication of CN115063539B publication Critical patent/CN115063539B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T17/00 Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G06T3/00 Geometric image transformations in the plane of the image
    • G06T3/40 Scaling of whole images or parts thereof, e.g. expanding or contracting
    • G06T3/4007 Scaling of whole images or parts thereof based on interpolation, e.g. bilinear interpolation
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/25 Determination of region of interest [ROI] or a volume of interest [VOI]
    • G06V10/40 Extraction of image or video features
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Graphics (AREA)
  • Geometry (AREA)
  • Image Processing (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a three-dimensional target detection method, which extracts point cloud voxel features from a point cloud, lifts two-dimensional image features into a three-dimensional homogeneous image voxel space to obtain image voxel features, fuses the point cloud voxel features with the image voxel features to obtain fused features, and identifies and classifies targets based on the fused features, thereby effectively reducing the information loss of feature-level fusion.

Description

Image dimension increasing method and three-dimensional target detection method
Technical Field
The invention relates to the technical field of automatic driving, in particular to an image dimension increasing method and a three-dimensional target detection method.
Background
Three-dimensional object detection is an important direction in the field of automatic driving, and aims to accurately locate and classify each object in a three-dimensional space, so that a vehicle can comprehensively perceive and understand the surrounding environment.
Existing three-dimensional target detection is based on lidar and/or images. The point cloud obtained from the lidar contains accurate spatial information, while the image contains richer semantic information; multi-modal three-dimensional target detection can therefore exploit the complementary information of the image and the point cloud, and has become a major development direction for three-dimensional target detection methods.
Multi-modal three-dimensional target detection falls into two categories: decision-level fusion and feature-level fusion. Decision-level fusion merges the objects detected by the detection modules of different modalities through a combination strategy, and its performance is limited by the performance of each single-modality detector. Feature-level fusion first fuses the features of the different modalities and then performs object detection on the fused features. Two feature-level fusion approaches are common. The first generates a region of interest and then crops the sub-features of the corresponding region from the features of each modality for fusion; because it requires projecting three-dimensional points onto a two-dimensional plane before feature fusion, it causes significant loss of three-dimensional information. The second transforms the points in the point cloud into voxels and then fuses them with the image, which allows finer-grained fusion, but the large gap between the feature spaces of the image and the point cloud can cause severe mismatching and information loss.
Disclosure of Invention
In order to solve some or all of the problems in the prior art, the present invention provides an image dimension-increasing method for lifting two-dimensional image features into a three-dimensional space, the method comprising:
extracting two-dimensional image features of the image;
generating view frustum features based on the two-dimensional image features; and
mapping the view frustum features to a three-dimensional space.
Further, the generating of the view frustum features comprises:
performing an outer product operation between the feature vector corresponding to each pixel and a depth bin, wherein the depth bins are arranged along the ray direction of the perspective projection of the image view frustum.
Further, the view frustum features are mapped to the three-dimensional space by trilinear interpolation, comprising:
traversing the three-dimensional space and projecting each point of the three-dimensional space based on a calibration matrix.
The invention provides a three-dimensional target detection method based on homogeneous multi-mode feature fusion, which comprises the following steps:
extracting point cloud voxel characteristics from the point cloud, and extracting two-dimensional image characteristics from the image;
lifting the two-dimensional image features into a three-dimensional homogeneous image voxel space by using the above image dimension-increasing method to obtain image voxel features;
fusing the point cloud voxel characteristics with the image voxel characteristics to obtain fused characteristics; and
identifying and classifying the target based on the fused features.
Further, the point cloud voxel characteristics are extracted from the point cloud by adopting a point cloud coding network.
Further, a query fusion mechanism is adopted to fuse the point cloud voxel characteristics and the image voxel characteristics.
Further, adopting a self-attention layer as the query fusion mechanism comprises the following steps:
using the point cloud voxel features as queries and the image voxel features as keys and values;
applying three learnable linear transformations to the queries, keys and values for each attention head;
mapping the plurality of attention heads to the homogeneous three-dimensional space to obtain multi-head attention; and
concatenating the multi-head attention with the point cloud voxel features to obtain the fused features.
Further, the construction of the queries comprises:
extracting specified non-empty voxels from the point cloud voxel features to construct the queries.
Further, the construction of the keys and values comprises:
extracting, by three-dimensional max pooling, the portions of the image voxel features whose information content is higher than a preset value to form the keys and values.
Further, the method further comprises:
scattering the fused features into the point cloud voxel space to serve as the basis for target identification and classification.
Further, the method further comprises applying a target-level similarity constraint to the point cloud voxel features and the image voxel features, including:
cropping N sets of region-of-interest features from the point cloud voxel features and the image voxel features by voxel-level region-of-interest pooling;
converting the N sets of region-of-interest features into a metric space to obtain N pairs of metric features; and
minimizing the cosine similarity distance of each pair of metric features.
Further, the N sets of region-of-interest features are converted into the metric space through an encoder and a predictor, and a gradient cut-off (stop-gradient) strategy is adopted for the encoder and the predictor.
Further, the predictor comprises a multi-layer perceptron.
Compared with methods that project the three-dimensional point cloud onto the two-dimensional image, the three-dimensional target detection method of the invention avoids compressing the image and the point cloud during feature transformation as far as possible and retains most of the original information. The method also introduces a Query Fusion Mechanism (QFM), which adaptively fuses the point cloud and image features based on self-attention, so that each point cloud voxel can adaptively perceive useful information from the global three-dimensional image features and the two homogeneous representations can be fused effectively. In addition, in order to improve the consistency of the point cloud and the image, the method provides an object-level voxel feature interaction method, which effectively improves the semantic consistency of the two homogeneous features and enhances the ability of the model to perform cross-modal feature fusion.
Drawings
To further clarify the above and other advantages and features of embodiments of the present invention, a more particular description of embodiments of the present invention will be rendered by reference to the appended drawings. It is appreciated that these drawings depict only typical embodiments of the invention and are therefore not to be considered limiting of its scope. In the drawings, the same or corresponding parts will be denoted by the same or similar reference numerals for clarity.
FIG. 1 is a process diagram of an image upscaling method according to an embodiment of the present invention;
FIG. 2 is a schematic flow chart of a three-dimensional target detection method according to an embodiment of the invention; and
FIG. 3 illustrates a process diagram for applying a target-level similarity constraint, according to an embodiment of the invention.
Detailed Description
In the following description, the present invention is described with reference to examples. One skilled in the relevant art will recognize, however, that the embodiments can be practiced without one or more of the specific details, or with other alternative and/or additional methods. In other instances, well-known operations have not been shown or described in detail to avoid obscuring aspects of the invention. Similarly, for purposes of explanation, specific configurations are set forth in order to provide a thorough understanding of the embodiments of the invention. However, the invention is not limited to these specific details.
Reference in the specification to "one embodiment" or "the embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the invention. The appearances of the phrase "in one embodiment" in various places in the specification are not necessarily all referring to the same embodiment.
In the feature-level fusion methods of three-dimensional object detection algorithms, if multi-modal features are fused within a Region of Interest (RoI), the RoI is usually projected onto a two-dimensional bird's-eye view or front view for alignment and feature extraction, which causes serious information loss, and the lost three-dimensional information plays a key role in localizing objects in three-dimensional space. If the points in the point cloud are transformed into voxels and then fused with the image, large mismatching and information loss can also arise from the projection parallax between the dense two-dimensional image pixels and the sparse three-dimensional lidar points. To address these problems, the present invention provides a unified fusion scheme: the Homogeneous Multi-modal Feature Fusion and Interaction method (HMFI). This method first lifts the two-dimensional image into three dimensions to form a dense voxel structure, and then performs feature-level fusion in a three-dimensional Homogeneous Space. Since the data undergo no dimension-reducing compression, the method reduces the information loss caused by projection.
In addition, a cross-modal feature interaction mechanism spanning the point cloud and the image is designed on top of the homogeneous three-dimensional structure to enhance semantic information transfer and interaction when the two kinds of data are fused. Specifically, the invention designs an Image-Voxel Lifting Module (IVLM) to reorganize the two-dimensional image features in three-dimensional space, thereby constructing an image voxel structure that is homogeneous with the point cloud voxels, and performs multi-modal feature fusion with the help of the depth information provided by the point cloud. Fusion in this three-dimensional space does not suffer the information loss caused by dimension-reducing projection. Furthermore, to allow better fusion interaction between the homogeneous voxel representations of the cross-modal data, the invention introduces a Query Fusion Mechanism (QFM), which uses a self-attention-based operation to adaptively combine point cloud and image features: each point cloud voxel can query the information at the corresponding positions in all image voxels for fusion, and the results are combined into a camera-lidar feature. The QFM enables each point cloud voxel to adaptively perceive useful information from the global three-dimensional image features and to effectively fuse the two homogeneous representations.
Furthermore, although the point cloud and image representations are different, the object level semantic information should be similar in homogenous structure. Therefore, in order to enhance the abstract representation of the point cloud and the image in the shared three-dimensional space and utilize the similarity of the same object attributes in the two modes, the invention also provides a Voxel Feature Interaction Module (VFIM) at an object level to improve the consistency of the point cloud and the image. In particular, a Voxel-based region of interest Pooling (Voxel RoI-Pooling) module is used to extract features from two homogeneous representations based on predicted bounding box candidates and RoI features, and then the cosine similarity loss between each set of corresponding image Voxel features and point cloud Voxel features is computed to enforce object-level consistency in the point cloud and image. In addition, in the VFIM module, the interaction between the paired homogeneous region characteristics is modeled, so that the semantic consistency of the two homogeneous characteristics can be improved, and the capability of the model for performing cross-modal characteristic fusion is enhanced.
The solution of the invention is further described below with reference to the accompanying drawings of embodiments.
FIG. 1 is a process diagram of an image dimension-increasing method according to an embodiment of the present invention. As shown in FIG. 1, an image dimension-increasing method includes:
first, two-dimensional image features of an image are extracted. The image is fed into a network such as a ResNet-50 network to extract two-dimensional image features F ∈ R^(W_F×H_F×C_F), where W and H are the width and height of the image, and W_F, H_F and C_F respectively denote the width, height and number of channels of the two-dimensional image features;
next, the two-dimensional image features are converted into view frustum features G ∈ R^(W_F×H_F×R×C_F), in which depth information is encoded into the image features. In one embodiment of the invention, the feature vector F_(m,n) ∈ R^(C_F) of each pixel (m, n) in the two-dimensional image features F is spread into three-dimensional space along the ray direction of the perspective projection of the image view frustum by a depth bin D_(m,n). The depth bins are obtained by discretizing a depth map with the linear-interpolating depth discretization (LID) algorithm into W_F×H_F one-hot depth vectors of dimension R. To associate the two-dimensional image features with the discretized depth information, in one embodiment of the invention an outer product operation is performed between the two-dimensional image feature F_(m,n) of each pixel (m, n) and its depth bin D_(m,n) to obtain the view frustum feature corresponding to that pixel:
G_(m,n) = F_(m,n) ⊗ D_(m,n).
A minimal code sketch of this lifting step is given below.
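In the sketch, the depth source (a per-pixel depth map, e.g. from projected lidar points), the bin range, and the helper names lid_bin_index and build_frustum_features are assumptions made for illustration only; num_bins plays the role of R.

```python
# Illustrative sketch of the frustum-feature construction; not the patent's
# reference implementation.
import torch

def lid_bin_index(depth, d_min=2.0, d_max=46.8, num_bins=80):
    """Linear-increasing depth discretization (LID): bin width grows linearly
    with depth, so near depths are sampled more finely than far ones."""
    depth = depth.clamp(min=d_min, max=d_max)
    # bin edges at d_min + delta * i * (i + 1) / 2 tile [d_min, d_max] exactly
    delta = 2.0 * (d_max - d_min) / (num_bins * (num_bins + 1))
    idx = (-0.5 + 0.5 * torch.sqrt(1.0 + 8.0 * (depth - d_min) / delta)).floor()
    return idx.clamp(0, num_bins - 1).long()

def build_frustum_features(feat_2d, depth_map, num_bins=80):
    """feat_2d:   (C_F, H_F, W_F) two-dimensional image features
       depth_map: (H_F, W_F) per-pixel depth, e.g. from projected lidar points
       returns G: (num_bins, C_F, H_F, W_F) view frustum features."""
    bins = lid_bin_index(depth_map, num_bins=num_bins)             # (H_F, W_F)
    one_hot = torch.nn.functional.one_hot(bins, num_bins).float()  # (H_F, W_F, R)
    one_hot = one_hot.permute(2, 0, 1)                             # (R, H_F, W_F)
    # outer product per pixel: G[r, c, m, n] = D[r, m, n] * F[c, m, n]
    return one_hot[:, None, :, :] * feat_2d[None, :, :, :]

# Example: 256-channel features on a 94 x 311 grid with 80 LID depth bins.
G = build_frustum_features(torch.randn(256, 94, 311), torch.rand(94, 311) * 40 + 2)
print(G.shape)  # torch.Size([80, 256, 94, 311])
```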
finally, the view frustum features are mapped to the three-dimensional space to obtain the image voxel features I ∈ R^(X_I×Y_I×Z_I×C_F), where (X_I, Y_I, Z_I) is the size of the image voxel grid. In one embodiment of the invention, the view frustum features G are mapped into the three-dimensional space I by trilinear interpolation. In one embodiment of the invention, to obtain the image voxel feature I_i at the i-th location, the centroid of the view frustum features G is first sampled based on a calibration matrix (Calibration Matrix) CM:
v'_i = CM(v_i)
where v_i and v'_i represent the three-dimensional coordinate values of the i-th voxel in I and in G, respectively; trilinear interpolation is then performed over the neighborhood of v'_i in G to obtain I_i. Each location i is traversed with the above operations to obtain the image voxel features I. A code sketch of this mapping step is given below.
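Here the calibration matrix CM is assumed to be a KITTI-style 3x4 camera projection matrix, the depth axis of G is normalized linearly (a simplification of the LID spacing), and torch.nn.functional.grid_sample stands in for the trilinear interpolation; none of these choices are prescribed by the description above.

```python
# Minimal sketch of mapping view frustum features G into the image voxel grid I.
import torch
import torch.nn.functional as F

def frustum_to_voxels(G, calib, pc_range, voxel_grid, d_min=2.0, d_max=46.8):
    """G:          (R, C_F, H_F, W_F) view frustum features
       calib:      (3, 4) camera projection matrix mapping lidar xyz to pixels
       pc_range:   (x_min, y_min, z_min, x_max, y_max, z_max) of the voxel space
       voxel_grid: (X_I, Y_I, Z_I) number of voxels per axis
       returns I:  (C_F, X_I, Y_I, Z_I) image voxel features."""
    R, C, H_F, W_F = G.shape
    X, Y, Z = voxel_grid
    x_min, y_min, z_min, x_max, y_max, z_max = pc_range
    # centroids v_i of every voxel of the three-dimensional space
    xs = torch.linspace(x_min, x_max, X + 1)[:-1] + (x_max - x_min) / (2 * X)
    ys = torch.linspace(y_min, y_max, Y + 1)[:-1] + (y_max - y_min) / (2 * Y)
    zs = torch.linspace(z_min, z_max, Z + 1)[:-1] + (z_max - z_min) / (2 * Z)
    vx, vy, vz = torch.meshgrid(xs, ys, zs, indexing="ij")
    pts = torch.stack([vx, vy, vz, torch.ones_like(vx)], dim=-1)   # (X, Y, Z, 4)
    # v'_i = CM(v_i): project the centroids into the frustum (pixel, depth) space
    cam = pts.reshape(-1, 4) @ calib.T                             # (X*Y*Z, 3)
    depth = cam[:, 2].clamp(min=1e-3)
    u, v = cam[:, 0] / depth, cam[:, 1] / depth                    # pixel coords
    # normalize (u, v, depth) to [-1, 1] for grid_sample; x->W_F, y->H_F, z->R
    gx = 2 * u / (W_F - 1) - 1
    gy = 2 * v / (H_F - 1) - 1
    gz = 2 * (depth - d_min) / (d_max - d_min) - 1
    grid = torch.stack([gx, gy, gz], dim=-1).reshape(1, X, Y, Z, 3)
    # trilinear interpolation over the neighborhood of v'_i in G
    sampled = F.grid_sample(G.permute(1, 0, 2, 3)[None],  # (1, C_F, R, H_F, W_F)
                            grid, mode="bilinear", align_corners=True)
    return sampled[0]                                      # (C_F, X_I, Y_I, Z_I)
```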
Based on the above image dimension-increasing method, FIG. 2 shows a flow chart of a three-dimensional target detection method according to an embodiment of the present invention. As shown in FIG. 2, a three-dimensional target detection method includes:
first, in step 201, features are extracted. Point cloud voxel features P ∈ R^(X_P×Y_P×Z_P×C_F) are extracted from the point cloud, and two-dimensional image features F ∈ R^(W_F×H_F×C_F) are extracted from the image, where C_F is the number of channels of the point cloud voxel features, (X_P, Y_P, Z_P) is the size of the point cloud voxel grid, and W_F, H_F and C_F respectively denote the width, height and number of channels of the two-dimensional image features. In one embodiment of the invention, a point cloud encoding network is used to extract features from the point cloud. In yet another embodiment of the invention, the two-dimensional image features are extracted by a ResNet-50 network. For orientation, a toy voxelization sketch is given below.
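The sketch below is a naive voxelization that simply averages raw point features per grid cell; the point cloud encoding network of the invention (typically a sparse three-dimensional backbone) would replace it, and the grid layout, range, and feature choice are assumptions.

```python
# For orientation only: mean-pool raw lidar points into a dense voxel grid.
import torch

def voxelize_points(points, pc_range, grid=(200, 176, 5), channels=4):
    """points:    (K, channels) lidar points, e.g. (x, y, z, intensity)
       pc_range:  (x_min, y_min, z_min, x_max, y_max, z_max)
       returns P: (X_P, Y_P, Z_P, channels) mean point feature per voxel."""
    lo = torch.tensor(pc_range[:3])
    hi = torch.tensor(pc_range[3:])
    size = (hi - lo) / torch.tensor(grid, dtype=torch.float)
    idx = ((points[:, :3] - lo) / size).floor().long()             # voxel index
    keep = ((idx >= 0) & (idx < torch.tensor(grid))).all(dim=1)    # inside range
    idx, pts = idx[keep], points[keep]
    flat = (idx[:, 0] * grid[1] + idx[:, 1]) * grid[2] + idx[:, 2]
    feats = torch.zeros(grid[0] * grid[1] * grid[2], channels)
    count = torch.zeros(grid[0] * grid[1] * grid[2], 1)
    feats.index_add_(0, flat, pts)                                  # sum per cell
    count.index_add_(0, flat, torch.ones(len(pts), 1))
    return (feats / count.clamp(min=1)).reshape(*grid, channels)
```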
next, at step 202, the image is lifted. In one embodiment of the invention, an Image-Voxel Lifting Module (IVLM) lifts the image features F into the three-dimensional homogeneous image voxel space to obtain the image voxel features I ∈ R^(X_I×Y_I×Z_I×C_F), where C_F is the number of channels of the image voxel features and (X_I, Y_I, Z_I) is the size of the image voxel grid. The specific dimension-increasing method is as described above;
next, in step 203, the features are fused. In one embodiment of the invention, a Query Fusion Mechanism (QFM) is used to fuse the homogeneous point cloud voxel features and image voxel features, so as to mine the complementary information between the point cloud and the image and allow each point cloud voxel feature to perceive the whole image and selectively fuse image voxel features from it. In one embodiment of the invention, a self-attention module treats each voxel feature vector of the image and of the point cloud as a homogeneous token, i.e., the point cloud voxel features F_P serve as the queries and the image voxel features F_I serve as the keys and values, and the two features are finally transformed into a fused voxel feature by the self-attention mechanism, which specifically includes:
first, for each attention head j, j = 1, 2, …, r, applying three learnable linear transformations W_j^Q, W_j^K and W_j^V to the queries, keys and values to obtain Q_j, K_j and V_j:
Q_j = F_P·W_j^Q, K_j = F_I·W_j^K, V_j = F_I·W_j^V;
next, mapping the r attention heads (head_1, head_2, …, head_r) to the homogeneous three-dimensional space to obtain the multi-head attention:
A_M = Concat(head_1, head_2, …, head_r)·W^O
where W^O is the linear transformation matrix that maps the r attention heads to the homogeneous three-dimensional space, and head_j = Attention(Q_j, K_j, V_j) denotes standard scaled dot-product attention; and
finally, concatenating the multi-head attention A_M with the point cloud voxel features F_P to obtain the fused features. A condensed code sketch of this fusion step follows.
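The sketch below uses torch.nn.MultiheadAttention as shorthand for the per-head transformations W_j^Q, W_j^K, W_j^V and the output projection W^O; the channel size, head count, and concatenation layout are assumptions for illustration.

```python
# Condensed sketch of query fusion: point cloud voxels attend over image voxels.
import torch
import torch.nn as nn

class QueryFusion(nn.Module):
    def __init__(self, channels=64, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)

    def forward(self, f_p, f_i):
        """f_p: (M, C_F) non-empty point cloud voxel features (queries)
           f_i: (N, C_F) flattened image voxel features (keys and values)
           returns: (M, 2 * C_F) fused voxel features."""
        a_m, _ = self.attn(f_p[None], f_i[None], f_i[None])  # multi-head A_M
        return torch.cat([a_m[0], f_p], dim=-1)              # concat with F_P

# Example with 5000 non-empty point cloud voxels and 2048 pooled image voxels.
qfm = QueryFusion(channels=64, num_heads=4)
fused = qfm(torch.randn(5000, 64), torch.randn(2048, 64))
print(fused.shape)  # torch.Size([5000, 128])
```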
Since most lidar point cloud voxels are empty, in one embodiment of the invention the queries F_P ∈ R^(M×C_F) are constructed by extracting the M non-empty voxels from the point cloud voxel features P. The image voxels, by contrast, are obtained by lifting the two-dimensional image features into three-dimensional space, and methods such as trilinear interpolation are used in their construction, so the image voxels are denser than the point cloud voxels, i.e., they contain more non-empty voxels. Therefore, in order to reduce the computational overhead, in one embodiment of the invention three-dimensional max pooling (3D Max-Pooling) is used to extract from the image voxel features I the portion I* whose information content is higher than a preset value, where λ is a preset ratio, and I* is then flattened along its first three dimensions to obtain F_I. In one embodiment of the invention, the fused features produced for the M non-empty voxels are further scattered back into the homogeneous voxel space to obtain P*, which serves as the input of the final downstream three-dimensional target detection module (a sketch of this bookkeeping is given after this paragraph); and
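In the sketch below, the tensor layouts, the non-empty test, and the parameter lam, standing in for the preset ratio λ, are assumptions; the QueryFusion module from the previous sketch is reused.

```python
# Gather non-empty point cloud voxels, compress image voxels, fuse, scatter back.
import torch
import torch.nn.functional as F

def fuse_voxel_grids(P, I, qfm, lam=4):
    """P:   (X_P, Y_P, Z_P, C_F) point cloud voxel features (mostly empty)
       I:   (X_I, Y_I, Z_I, C_F) image voxel features
       qfm: a QueryFusion module as sketched above
       returns P_star: (X_P, Y_P, Z_P, 2 * C_F) fused voxel grid."""
    # queries: the M non-empty point cloud voxels
    mask = P.abs().sum(dim=-1) > 0                               # (X_P, Y_P, Z_P)
    f_p = P[mask]                                                # (M, C_F)
    # keys/values: keep the most informative image voxels via 3D max pooling,
    # then flatten the first three (spatial) dimensions
    i_star = F.max_pool3d(I.permute(3, 0, 1, 2)[None], kernel_size=lam)
    f_i = i_star[0].permute(1, 2, 3, 0).reshape(-1, I.shape[-1]) # (N, C_F)
    # query fusion, then scatter the fused features back to the non-empty cells
    fused = qfm(f_p, f_i)                                        # (M, 2 * C_F)
    P_star = torch.zeros(*P.shape[:3], fused.shape[-1], dtype=P.dtype)
    P_star[mask] = fused
    return P_star
```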
finally, at step 204, the target is identified. A detection module generates bounding boxes and object classification results for the targets from P*.
Since the lidar and the camera are different modalities, the same object has different data representations in each of them, but their high-level features should be very similar because they describe the same object. Based on this, in order to further improve the detection performance of the model, in an embodiment of the present invention a Voxel Feature Interaction Module (VFIM) is designed to associate the features of the two modalities at the target level based on the detection results, so as to improve the semantic consistency of the two cross-modal homogeneous features. Specifically, in one embodiment of the present invention, cross-modal feature interaction is realized according to the consistency of target-level attributes in the point cloud and the image, that is, a target-level similarity constraint is applied to the homogeneous point cloud voxel features P and image voxel features I, so as to achieve better cross-modal feature fusion performance.
FIG. 3 illustrates a process diagram for applying a target-level similarity constraint according to an embodiment of the invention. As shown in FIG. 3, applying a target-level similarity constraint to the point cloud voxel features and the image voxel features includes:
first, sampling N three-dimensional detection boxes B = {B_1, B_2, …, B_N} from the three-dimensional detection head;
next, using voxel-level region-of-interest pooling (Voxel RoI-Pooling) to crop the region-of-interest (RoI) features of each modality from the point cloud voxel features and the image voxel features, respectively;
then, converting the region-of-interest features into a metric space to obtain N pairs of metric features (p, e). In order to enhance the interaction between each pair of region-of-interest features, in one embodiment of the invention the region-of-interest features are fed into an encoder Ω and a predictor Ψ composed of a multi-layer perceptron to convert them into the metric space; and
finally, minimizing the cosine similarity distance of each pair of metric features. In one embodiment of the present invention, a gradient cut-off (stop-gradient) strategy is further used on the encoder and the predictor to better model the similarity constraint. A hedged sketch of this constraint is given below.
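The patent text only specifies the encoder Ω, the multi-layer-perceptron predictor Ψ, the cosine similarity distance, and the gradient cut-off; the symmetric SimSiam-style form of the loss, the layer sizes, and the module name in the sketch are assumptions.

```python
# Hedged sketch of the object-level similarity constraint with stop-gradient.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VoxelFeatureInteraction(nn.Module):
    def __init__(self, roi_dim=256, metric_dim=128):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(roi_dim, metric_dim), nn.ReLU(),
                                     nn.Linear(metric_dim, metric_dim))    # Ω
        self.predictor = nn.Sequential(nn.Linear(metric_dim, metric_dim), nn.ReLU(),
                                       nn.Linear(metric_dim, metric_dim))  # Ψ

    @staticmethod
    def neg_cosine(p, e):
        # cosine similarity distance; the target branch e is detached (stop-gradient)
        return -F.cosine_similarity(p, e.detach(), dim=-1).mean()

    def forward(self, roi_pc, roi_img):
        """roi_pc, roi_img: (N, roi_dim) RoI features pooled from the point cloud
           and image voxel grids for the same N predicted boxes."""
        z_p, z_i = self.encoder(roi_pc), self.encoder(roi_img)
        p_p, p_i = self.predictor(z_p), self.predictor(z_i)
        # symmetric loss with stop-gradient on the opposite branch
        return 0.5 * self.neg_cosine(p_p, z_i) + 0.5 * self.neg_cosine(p_i, z_p)

# Example: 128 sampled boxes with 256-dimensional RoI features from each modality.
vfim = VoxelFeatureInteraction()
loss = vfim(torch.randn(128, 256), torch.randn(128, 256))
```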
the three-dimensional target detection method projects the image to the three-dimensional space, and then is fused with the point cloud in the three-dimensional space, and compared with other methods of projecting the three-dimensional point cloud to the two-dimensional image and the like, the method has the advantage that the information loss in the fusion process is smaller. A large number of experiments prove that the method achieves the best performance in target detection data sets such as KITTI and the like.
While various embodiments of the present invention have been described above, it should be understood that they have been presented by way of example only, and not limitation. It will be apparent to persons skilled in the relevant art that various combinations, modifications, and changes can be made thereto without departing from the spirit and scope of the invention. Thus, the breadth and scope of the present invention should not be limited by any of the above-described exemplary embodiments, but should be defined only in accordance with the following claims and their equivalents.

Claims (12)

1. An image upscaling method, characterized by comprising the steps of:
extracting two-dimensional image features of the image;
generating view frustum features based on the two-dimensional image features; and
mapping the view frustum features to a three-dimensional space.
2. An image upscaling method according to claim 1, characterized in that an outer product operation is performed between the two-dimensional image feature F_(m,n) ∈ R^(C_F) of each pixel (m, n) and its depth bin D_(m,n) to obtain the view frustum feature corresponding to each pixel:
G_(m,n) = F_(m,n) ⊗ D_(m,n)
wherein the depth bins are arranged along the ray direction of the perspective projection of the image view frustum and are formed by W_F×H_F one-hot discretized depth vectors of dimension R, and wherein W_F, H_F and C_F respectively denote the width, height and number of channels of the two-dimensional image features.
3. The image upscaling method of claim 1, wherein mapping the view frustum features G to the three-dimensional space I comprises traversing each position i in the three-dimensional space and performing the following operations:
sampling the centroid of the view frustum features based on a calibration matrix CM:
v'_i = CM(v_i)
wherein v_i and v'_i represent the three-dimensional coordinate values of the i-th voxel in I and in G, respectively; and
performing trilinear interpolation over the neighborhood of v'_i in G to form the image voxel feature I_i.
4. A three-dimensional target detection method, characterized by comprising the following steps:
extracting point cloud voxel features P ∈ R^(X_P×Y_P×Z_P×C_F) from a point cloud, and extracting two-dimensional image features F ∈ R^(W_F×H_F×C_F) from an image, wherein C_F is the number of channels of the point cloud voxel features, (X_P, Y_P, Z_P) is the size of the point cloud voxel grid, and W_F, H_F and C_F respectively denote the width, height and number of channels of the two-dimensional image features;
using the image upscaling method according to any one of claims 1 to 3, upscaling the two-dimensional image features into a three-dimensional homogeneous image voxel space to obtain image voxel features I ∈ R^(X_I×Y_I×Z_I×C_F), wherein C_F is the number of channels of the image voxel features and (X_I, Y_I, Z_I) is the size of the image voxel grid;
fusing the point cloud voxel features with the image voxel features to obtain fused features; and
identifying and classifying a target based on the fused features.
5. The method of claim 4, wherein a query fusion mechanism is used to fuse the point cloud voxel features with the image voxel features.
6. The three-dimensional object detection method of claim 5, wherein a self-attention layer is adopted as the query fusion mechanism, comprising the steps of:
using the point cloud voxel features F_P as the queries and the image voxel features F_I as the keys and values;
for each attention head j, applying three learnable linear transformations W_j^Q, W_j^K and W_j^V to the queries, keys and values to obtain Q_j, K_j and V_j:
Q_j = F_P·W_j^Q, K_j = F_I·W_j^K, V_j = F_I·W_j^V;
mapping the r attention heads (head_1, head_2, …, head_r) to a homogeneous three-dimensional space to obtain multi-head attention:
A_M = Concat(head_1, head_2, …, head_r)·W^O
wherein W^O is a linear transformation matrix and head_j = Attention(Q_j, K_j, V_j); and
concatenating the multi-head attention A_M with the point cloud voxel features F_P to obtain the fused features.
7. The three-dimensional object detection method of claim 6, wherein the construction of the queries comprises:
extracting the M non-empty voxels from the point cloud voxel features P to construct the queries F_P ∈ R^(M×C_F).
8. The three-dimensional object detection method of claim 6, wherein the construction of the keys and values comprises:
extracting, by three-dimensional max pooling, the portion I* of the image voxel features I whose information content is higher than a preset value, wherein λ is a preset ratio; and
flattening I* along its first three dimensions to obtain F_I.
9. The three-dimensional object detection method of claim 6, further comprising:
scattering the fused features into the point cloud voxel space to obtain P*, which serves as the basis for target recognition and classification.
10. The three-dimensional object detection method of claim 4, further comprising applying a target-level similarity constraint to the point cloud voxel features and the image voxel features, including:
cropping N sets of region-of-interest features from the point cloud voxel features and the image voxel features using voxel-level region-of-interest pooling;
converting the N sets of region-of-interest features into a metric space to obtain N pairs of metric features (p, e); and
minimizing the cosine similarity distance of each pair of metric features.
11. The method of claim 10, wherein the N sets of region-of-interest features are transformed into the metric space by an encoder Ω and a predictor Ψ, and a gradient cut-off (stop-gradient) strategy is adopted for the encoder and the predictor.
12. the three-dimensional object detection method of claim 11, wherein the predictor comprises a multi-layered perceptron.
CN202210847218.5A 2022-07-19 2022-07-19 Image dimension-increasing method and three-dimensional target detection method Active CN115063539B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210847218.5A CN115063539B (en) 2022-07-19 2022-07-19 Image dimension-increasing method and three-dimensional target detection method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210847218.5A CN115063539B (en) 2022-07-19 2022-07-19 Image dimension-increasing method and three-dimensional target detection method

Publications (2)

Publication Number Publication Date
CN115063539A (en) 2022-09-16
CN115063539B CN115063539B (en) 2024-07-02

Family

ID=83205824

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210847218.5A Active CN115063539B (en) 2022-07-19 2022-07-19 Image dimension-increasing method and three-dimensional target detection method

Country Status (1)

Country Link
CN (1) CN115063539B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115591240A (en) * 2022-12-01 2023-01-13 腾讯科技(深圳)有限公司(Cn) Feature extraction method, device and equipment for three-dimensional game scene and storage medium
CN116665189A (en) * 2023-07-31 2023-08-29 合肥海普微电子有限公司 Multi-mode-based automatic driving task processing method and system
CN116740668A (en) * 2023-08-16 2023-09-12 之江实验室 Three-dimensional object detection method, three-dimensional object detection device, computer equipment and storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070036437A1 (en) * 2003-02-27 2007-02-15 Arizona Board Of Regents/Arizona State University Comparative and analytic apparatus method for converting two-dimentional bit map data into three-dimensional data
CN113793255A (en) * 2021-09-09 2021-12-14 百度在线网络技术(北京)有限公司 Method, apparatus, device, storage medium and program product for image processing

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070036437A1 (en) * 2003-02-27 2007-02-15 Arizona Board Of Regents/Arizona State University Comparative and analytic apparatus method for converting two-dimentional bit map data into three-dimensional data
CN113793255A (en) * 2021-09-09 2021-12-14 百度在线网络技术(北京)有限公司 Method, apparatus, device, storage medium and program product for image processing

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115591240A (en) * 2022-12-01 2023-01-13 腾讯科技(深圳)有限公司(Cn) Feature extraction method, device and equipment for three-dimensional game scene and storage medium
WO2024114152A1 (en) * 2022-12-01 2024-06-06 腾讯科技(深圳)有限公司 Feature extraction method and apparatus for three-dimensional scene, and device and storage medium
CN116665189A (en) * 2023-07-31 2023-08-29 合肥海普微电子有限公司 Multi-mode-based automatic driving task processing method and system
CN116665189B (en) * 2023-07-31 2023-10-31 合肥海普微电子有限公司 Multi-mode-based automatic driving task processing method and system
CN116740668A (en) * 2023-08-16 2023-09-12 之江实验室 Three-dimensional object detection method, three-dimensional object detection device, computer equipment and storage medium
CN116740668B (en) * 2023-08-16 2023-11-14 之江实验室 Three-dimensional object detection method, three-dimensional object detection device, computer equipment and storage medium

Also Published As

Publication number Publication date
CN115063539B (en) 2024-07-02

Similar Documents

Publication Publication Date Title
Zamanakos et al. A comprehensive survey of LIDAR-based 3D object detection methods with deep learning for autonomous driving
Qi et al. Frustum pointnets for 3d object detection from rgb-d data
CN115063539B (en) Image dimension-increasing method and three-dimensional target detection method
Xu et al. Zoomnet: Part-aware adaptive zooming neural network for 3d object detection
CN109655019B (en) Cargo volume measurement method based on deep learning and three-dimensional reconstruction
Königshof et al. Realtime 3d object detection for automated driving using stereo vision and semantic information
KR102267562B1 (en) Device and method for recognition of obstacles and parking slots for unmanned autonomous parking
Hirschmuller Stereo processing by semiglobal matching and mutual information
Hoppe et al. Incremental Surface Extraction from Sparse Structure-from-Motion Point Clouds.
CN108269266A (en) Segmentation image is generated using Markov random field optimization
KR101907883B1 (en) Object detection and classification method
Gählert et al. Single-shot 3d detection of vehicles from monocular rgb images via geometrically constrained keypoints in real-time
Parmehr et al. Automatic registration of optical imagery with 3d lidar data using local combined mutual information
Gigli et al. Road segmentation on low resolution lidar point clouds for autonomous vehicles
Tang et al. Content-based 3-D mosaics for representing videos of dynamic urban scenes
Gählert et al. Single-shot 3d detection of vehicles from monocular rgb images via geometry constrained keypoints in real-time
CN113409242A (en) Intelligent monitoring method for point cloud of rail intersection bow net
CN116704307A (en) Target detection method and system based on fusion of image virtual point cloud and laser point cloud
Kozonek et al. On the fusion of camera and lidar for 3D object detection and classification
CN116843829A (en) Concrete structure crack three-dimensional reconstruction and length quantization method based on binocular video
He et al. Planar constraints for an improved uav-image-based dense point cloud generation
Li et al. Automatic Keyline Recognition and 3D Reconstruction For Quasi‐Planar Façades in Close‐range Images
Kitt et al. Trinocular optical flow estimation for intelligent vehicle applications
Zhang et al. A Novel Point Cloud Compression Algorithm for Vehicle Recognition Using Boundary Extraction
Zhang et al. PMVC: Promoting Multi-View Consistency for 3D Scene Reconstruction

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant