CN115063539A - Image dimension increasing method and three-dimensional target detection method - Google Patents

Image dimension increasing method and three-dimensional target detection method

Info

Publication number
CN115063539A
CN115063539A (Application No. CN202210847218.5A)
Authority
CN
China
Prior art keywords
image
dimensional
voxel
features
point cloud
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210847218.5A
Other languages
Chinese (zh)
Other versions
CN115063539B (en)
Inventor
李怡康
石博天
李鑫
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai AI Innovation Center
Original Assignee
Shanghai AI Innovation Center
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai AI Innovation Center filed Critical Shanghai AI Innovation Center
Priority to CN202210847218.5A priority Critical patent/CN115063539B/en
Publication of CN115063539A publication Critical patent/CN115063539A/en
Application granted granted Critical
Publication of CN115063539B publication Critical patent/CN115063539B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T17/00 Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G06T3/00 Geometric image transformations in the plane of the image
    • G06T3/40 Scaling of whole images or parts thereof, e.g. expanding or contracting
    • G06T3/4007 Scaling of whole images or parts thereof based on interpolation, e.g. bilinear interpolation
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/25 Determination of region of interest [ROI] or a volume of interest [VOI]
    • G06V10/40 Extraction of image or video features
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Graphics (AREA)
  • Geometry (AREA)
  • Image Processing (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a three-dimensional target detection method, which extracts point cloud voxel features from a point cloud, lifts two-dimensional image features into a three-dimensional homogeneous image voxel space to obtain image voxel features, fuses the point cloud voxel features with the image voxel features to obtain fused features, and identifies and classifies targets based on the fused features, thereby effectively reducing the information loss of feature-level fusion.

Description

Image dimension increasing method and three-dimensional target detection method
Technical Field
The invention relates to the technical field of automatic driving, in particular to an image dimension increasing method and a three-dimensional target detection method.
Background
Three-dimensional object detection is an important direction in the field of automatic driving, and aims to accurately locate and classify each object in a three-dimensional space, so that a vehicle can comprehensively perceive and understand the surrounding environment.
Existing three-dimensional target detection is based on lidar and/or images. The point cloud obtained from the lidar contains accurate spatial information, while the image contains richer semantic information; multi-modal three-dimensional target detection can therefore exploit the complementary information of the image and the point cloud, and has become a major development direction for three-dimensional target detection methods.
Multi-modal three-dimensional target detection falls into two categories: decision-level fusion and feature-level fusion. Decision-level fusion merges the objects detected by the detection modules of different modalities through a combination strategy, and its performance is limited by the performance of each single-modality detector. Feature-level fusion first fuses the features of the different modalities and then performs object detection on the fused features. Two feature-level fusion approaches are common. The first generates a region of interest and then crops the sub-features of the corresponding region from the features of each modality for fusion; because it requires projecting three-dimensional points onto a two-dimensional plane before feature fusion, it causes significant loss of three-dimensional information. The second transforms the points in the point cloud into voxels and then fuses them with the image, which allows finer-grained fusion, but the large gap between the feature spaces of the image and the point cloud can cause severe mismatching and information loss.
Disclosure of Invention
In order to solve some or all of the problems in the prior art, the present invention provides an image dimension-increasing method for lifting two-dimensional image features into a three-dimensional space, the method comprising:
extracting two-dimensional image features of the image;
generating view frustum features based on the two-dimensional image features; and
mapping the view frustum features to a three-dimensional space.
Further, the generating of the view frustum features comprises:
performing an outer product operation between the feature vector corresponding to each pixel and a depth bin, wherein the depth bins are arranged along the ray direction of the perspective projection of the image view frustum.
Further, the view frustum features are mapped to the three-dimensional space by trilinear interpolation, comprising:
traversing the three-dimensional space and projecting each point of the three-dimensional space based on a calibration matrix.
The invention provides a three-dimensional target detection method based on homogeneous multi-mode feature fusion, which comprises the following steps:
extracting point cloud voxel characteristics from the point cloud, and extracting two-dimensional image characteristics from the image;
lifting the two-dimensional image features into a three-dimensional homogeneous image voxel space by using the above image dimension-increasing method to obtain image voxel features;
fusing the point cloud voxel characteristics with the image voxel characteristics to obtain fused characteristics; and
identifying and classifying the target based on the fused features.
Further, the point cloud voxel characteristics are extracted from the point cloud by adopting a point cloud coding network.
Further, a query fusion mechanism is adopted to fuse the point cloud voxel characteristics and the image voxel characteristics.
Further, adopting a self-attention layer as the query fusion mechanism comprises the following steps:
using the point cloud voxel features as queries and the image voxel features as keys and values;
applying three learnable linear transformations to the queries, keys and values for each attention head;
mapping the plurality of attention heads to the homogeneous three-dimensional space to obtain multi-head attention; and
concatenating the multi-head attention with the point cloud voxel features to obtain the fused features.
Further, the construction of the queries comprises:
extracting specified non-empty voxels from the point cloud voxel features to construct the queries.
Further, the construction of the keys and values comprises:
extracting, by three-dimensional max pooling, the portions of the image voxel features whose information content is higher than a preset value to form the keys and values.
Further, the method further comprises:
scattering the fused features into the point cloud voxel space to serve as the basis for target identification and classification.
Further, the method further comprises applying a target-level similarity constraint to the point cloud voxel features and the image voxel features, including:
cropping N sets of region-of-interest features from the point cloud voxel features and the image voxel features by voxel-level region-of-interest pooling;
converting the N sets of region-of-interest features into a metric space to obtain N pairs of metric features; and
minimizing the cosine similarity distance of each pair of metric features.
Further, the N sets of region-of-interest features are converted into the metric space through an encoder and a predictor, and a gradient cut-off (stop-gradient) strategy is adopted for the encoder and the predictor.
Further, the predictor comprises a multi-layer perceptron.
Compared with methods that project the three-dimensional point cloud onto the two-dimensional image, the three-dimensional target detection method of the invention avoids compressing the image and the point cloud during feature transformation as far as possible and retains most of the original information. The method also introduces a Query Fusion Mechanism (QFM), which adaptively fuses the point cloud and image features based on self-attention, so that each point cloud voxel can adaptively perceive useful information from the global three-dimensional image features and the two homogeneous representations can be fused effectively. In addition, in order to improve the consistency of the point cloud and the image, the method provides an object-level voxel feature interaction method, which effectively improves the semantic consistency of the two homogeneous features and enhances the ability of the model to perform cross-modal feature fusion.
Drawings
To further clarify the above and other advantages and features of embodiments of the present invention, a more particular description of embodiments of the present invention will be rendered by reference to the appended drawings. It is appreciated that these drawings depict only typical embodiments of the invention and are therefore not to be considered limiting of its scope. In the drawings, the same or corresponding parts will be denoted by the same or similar reference numerals for clarity.
FIG. 1 is a process diagram of an image upscaling method according to an embodiment of the present invention;
FIG. 2 is a schematic flow chart of a three-dimensional target detection method according to an embodiment of the invention; and
FIG. 3 illustrates a process diagram for applying a target-level similarity constraint, according to an embodiment of the invention.
Detailed Description
In the following description, the present invention is described with reference to examples. One skilled in the relevant art will recognize, however, that the embodiments can be practiced without one or more of the specific details, or with other alternative and/or additional methods. In other instances, well-known operations have not been shown or described in detail to avoid obscuring aspects of the invention. Similarly, for purposes of explanation, specific configurations are set forth in order to provide a thorough understanding of the embodiments of the invention. However, the invention is not limited to these specific details.
Reference in the specification to "one embodiment" or "the embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the invention. The appearances of the phrase "in one embodiment" in various places in the specification are not necessarily all referring to the same embodiment.
In the feature-level fusion methods of three-dimensional object detection algorithms, if multi-modal features are fused within a Region of Interest (RoI), the RoI is usually projected onto a two-dimensional bird's-eye view or front view for alignment and feature extraction, which causes serious information loss, and the lost three-dimensional information plays a key role in localizing objects in three-dimensional space. If the points in the point cloud are transformed into voxels and then fused with the image, large mismatching and information loss can also arise from the projection parallax between the dense two-dimensional image pixels and the sparse three-dimensional lidar points. To address these problems, the present invention provides a unified fusion scheme: the Homogeneous Multi-modal Feature Fusion and Interaction method (HMFI). This method first lifts the two-dimensional image into three dimensions to form a dense voxel structure, and then performs feature-level fusion in a three-dimensional Homogeneous Space. Since the data undergo no dimension-reducing compression, the method reduces the information loss caused by projection.
In addition, a cross-modal feature interaction mechanism spanning the point cloud and the image is designed on top of the homogeneous three-dimensional structure to enhance semantic information transfer and interaction when the two kinds of data are fused. Specifically, the invention designs an Image-Voxel Lifting Module (IVLM) to reorganize the two-dimensional image features in three-dimensional space, thereby constructing an image voxel structure that is homogeneous with the point cloud voxels, and performs multi-modal feature fusion with the help of the depth information provided by the point cloud. Fusion in this three-dimensional space does not suffer the information loss caused by dimension-reducing projection. Furthermore, to allow better fusion interaction between the homogeneous voxel representations of the cross-modal data, the invention introduces a Query Fusion Mechanism (QFM), which uses a self-attention-based operation to adaptively combine point cloud and image features: each point cloud voxel can query the information at the corresponding positions in all image voxels for fusion, and the results are combined into a camera-lidar feature. The QFM enables each point cloud voxel to adaptively perceive useful information from the global three-dimensional image features and to effectively fuse the two homogeneous representations.
Furthermore, although the point cloud and image representations are different, the object level semantic information should be similar in homogenous structure. Therefore, in order to enhance the abstract representation of the point cloud and the image in the shared three-dimensional space and utilize the similarity of the same object attributes in the two modes, the invention also provides a Voxel Feature Interaction Module (VFIM) at an object level to improve the consistency of the point cloud and the image. In particular, a Voxel-based region of interest Pooling (Voxel RoI-Pooling) module is used to extract features from two homogeneous representations based on predicted bounding box candidates and RoI features, and then the cosine similarity loss between each set of corresponding image Voxel features and point cloud Voxel features is computed to enforce object-level consistency in the point cloud and image. In addition, in the VFIM module, the interaction between the paired homogeneous region characteristics is modeled, so that the semantic consistency of the two homogeneous characteristics can be improved, and the capability of the model for performing cross-modal characteristic fusion is enhanced.
The solution of the invention is further described below with reference to the accompanying drawings of embodiments.
FIG. 1 is a process diagram of an image dimension-increasing method according to an embodiment of the present invention. As shown in FIG. 1, an image dimension-increasing method includes:
first, two-dimensional image features of an image are extracted. The image is fed into a network such as a ResNet-50 network to extract two-dimensional image features F ∈ R^(W_F×H_F×C_F), where W and H are the width and height of the image, and W_F, H_F and C_F respectively denote the width, height and number of channels of the two-dimensional image features;
next, the two-dimensional image features are converted into view frustum features G ∈ R^(W_F×H_F×R×C_F), in which depth information is encoded into the image features. In one embodiment of the invention, the feature vector F_(m,n) ∈ R^(C_F) of each pixel (m, n) in the two-dimensional image features F is spread into three-dimensional space along the ray direction of the perspective projection of the image view frustum by a depth bin D_(m,n). The depth bins are obtained by discretizing a depth map with the linear-interpolating depth discretization (LID) algorithm into W_F×H_F one-hot depth vectors of dimension R. To associate the two-dimensional image features with the discretized depth information, in one embodiment of the invention an outer product operation is performed between the two-dimensional image feature F_(m,n) of each pixel (m, n) and its depth bin D_(m,n) to obtain the view frustum feature corresponding to that pixel:
G_(m,n) = F_(m,n) ⊗ D_(m,n).
A minimal code sketch of this lifting step is given below.
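In the sketch, the depth source (a per-pixel depth map, e.g. from projected lidar points), the bin range, and the helper names lid_bin_index and build_frustum_features are assumptions made for illustration only; num_bins plays the role of R.

```python
# Illustrative sketch of the frustum-feature construction; not the patent's
# reference implementation.
import torch

def lid_bin_index(depth, d_min=2.0, d_max=46.8, num_bins=80):
    """Linear-increasing depth discretization (LID): bin width grows linearly
    with depth, so near depths are sampled more finely than far ones."""
    depth = depth.clamp(min=d_min, max=d_max)
    # bin edges at d_min + delta * i * (i + 1) / 2 tile [d_min, d_max] exactly
    delta = 2.0 * (d_max - d_min) / (num_bins * (num_bins + 1))
    idx = (-0.5 + 0.5 * torch.sqrt(1.0 + 8.0 * (depth - d_min) / delta)).floor()
    return idx.clamp(0, num_bins - 1).long()

def build_frustum_features(feat_2d, depth_map, num_bins=80):
    """feat_2d:   (C_F, H_F, W_F) two-dimensional image features
       depth_map: (H_F, W_F) per-pixel depth, e.g. from projected lidar points
       returns G: (num_bins, C_F, H_F, W_F) view frustum features."""
    bins = lid_bin_index(depth_map, num_bins=num_bins)             # (H_F, W_F)
    one_hot = torch.nn.functional.one_hot(bins, num_bins).float()  # (H_F, W_F, R)
    one_hot = one_hot.permute(2, 0, 1)                             # (R, H_F, W_F)
    # outer product per pixel: G[r, c, m, n] = D[r, m, n] * F[c, m, n]
    return one_hot[:, None, :, :] * feat_2d[None, :, :, :]

# Example: 256-channel features on a 94 x 311 grid with 80 LID depth bins.
G = build_frustum_features(torch.randn(256, 94, 311), torch.rand(94, 311) * 40 + 2)
print(G.shape)  # torch.Size([80, 256, 94, 311])
```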
finally, the view frustum features are mapped to the three-dimensional space to obtain the image voxel features I ∈ R^(X_I×Y_I×Z_I×C_F), where (X_I, Y_I, Z_I) is the size of the image voxel grid. In one embodiment of the invention, the view frustum features G are mapped into the three-dimensional space I by trilinear interpolation. In one embodiment of the invention, to obtain the image voxel feature I_i at the i-th location, the centroid of the view frustum features G is first sampled based on a calibration matrix (Calibration Matrix) CM:
v'_i = CM(v_i)
where v_i and v'_i represent the three-dimensional coordinate values of the i-th voxel in I and in G, respectively; trilinear interpolation is then performed over the neighborhood of v'_i in G to obtain I_i. Each location i is traversed with the above operations to obtain the image voxel features I. A code sketch of this mapping step is given below.
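Here the calibration matrix CM is assumed to be a KITTI-style 3x4 camera projection matrix, the depth axis of G is normalized linearly (a simplification of the LID spacing), and torch.nn.functional.grid_sample stands in for the trilinear interpolation; none of these choices are prescribed by the description above.

```python
# Minimal sketch of mapping view frustum features G into the image voxel grid I.
import torch
import torch.nn.functional as F

def frustum_to_voxels(G, calib, pc_range, voxel_grid, d_min=2.0, d_max=46.8):
    """G:          (R, C_F, H_F, W_F) view frustum features
       calib:      (3, 4) camera projection matrix mapping lidar xyz to pixels
       pc_range:   (x_min, y_min, z_min, x_max, y_max, z_max) of the voxel space
       voxel_grid: (X_I, Y_I, Z_I) number of voxels per axis
       returns I:  (C_F, X_I, Y_I, Z_I) image voxel features."""
    R, C, H_F, W_F = G.shape
    X, Y, Z = voxel_grid
    x_min, y_min, z_min, x_max, y_max, z_max = pc_range
    # centroids v_i of every voxel of the three-dimensional space
    xs = torch.linspace(x_min, x_max, X + 1)[:-1] + (x_max - x_min) / (2 * X)
    ys = torch.linspace(y_min, y_max, Y + 1)[:-1] + (y_max - y_min) / (2 * Y)
    zs = torch.linspace(z_min, z_max, Z + 1)[:-1] + (z_max - z_min) / (2 * Z)
    vx, vy, vz = torch.meshgrid(xs, ys, zs, indexing="ij")
    pts = torch.stack([vx, vy, vz, torch.ones_like(vx)], dim=-1)   # (X, Y, Z, 4)
    # v'_i = CM(v_i): project the centroids into the frustum (pixel, depth) space
    cam = pts.reshape(-1, 4) @ calib.T                             # (X*Y*Z, 3)
    depth = cam[:, 2].clamp(min=1e-3)
    u, v = cam[:, 0] / depth, cam[:, 1] / depth                    # pixel coords
    # normalize (u, v, depth) to [-1, 1] for grid_sample; x->W_F, y->H_F, z->R
    gx = 2 * u / (W_F - 1) - 1
    gy = 2 * v / (H_F - 1) - 1
    gz = 2 * (depth - d_min) / (d_max - d_min) - 1
    grid = torch.stack([gx, gy, gz], dim=-1).reshape(1, X, Y, Z, 3)
    # trilinear interpolation over the neighborhood of v'_i in G
    sampled = F.grid_sample(G.permute(1, 0, 2, 3)[None],  # (1, C_F, R, H_F, W_F)
                            grid, mode="bilinear", align_corners=True)
    return sampled[0]                                      # (C_F, X_I, Y_I, Z_I)
```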
Based on the above image dimension-increasing method, FIG. 2 shows a flow chart of a three-dimensional target detection method according to an embodiment of the present invention. As shown in FIG. 2, a three-dimensional target detection method includes:
first, in step 201, features are extracted. Point cloud voxel features P ∈ R^(X_P×Y_P×Z_P×C_F) are extracted from the point cloud, and two-dimensional image features F ∈ R^(W_F×H_F×C_F) are extracted from the image, where C_F is the number of channels of the point cloud voxel features, (X_P, Y_P, Z_P) is the size of the point cloud voxel grid, and W_F, H_F and C_F respectively denote the width, height and number of channels of the two-dimensional image features. In one embodiment of the invention, a point cloud encoding network is used to extract features from the point cloud. In yet another embodiment of the invention, the two-dimensional image features are extracted by a ResNet-50 network. For orientation, a toy voxelization sketch is given below.
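The sketch below is a naive voxelization that simply averages raw point features per grid cell; the point cloud encoding network of the invention (typically a sparse three-dimensional backbone) would replace it, and the grid layout, range, and feature choice are assumptions.

```python
# For orientation only: mean-pool raw lidar points into a dense voxel grid.
import torch

def voxelize_points(points, pc_range, grid=(200, 176, 5), channels=4):
    """points:    (K, channels) lidar points, e.g. (x, y, z, intensity)
       pc_range:  (x_min, y_min, z_min, x_max, y_max, z_max)
       returns P: (X_P, Y_P, Z_P, channels) mean point feature per voxel."""
    lo = torch.tensor(pc_range[:3])
    hi = torch.tensor(pc_range[3:])
    size = (hi - lo) / torch.tensor(grid, dtype=torch.float)
    idx = ((points[:, :3] - lo) / size).floor().long()             # voxel index
    keep = ((idx >= 0) & (idx < torch.tensor(grid))).all(dim=1)    # inside range
    idx, pts = idx[keep], points[keep]
    flat = (idx[:, 0] * grid[1] + idx[:, 1]) * grid[2] + idx[:, 2]
    feats = torch.zeros(grid[0] * grid[1] * grid[2], channels)
    count = torch.zeros(grid[0] * grid[1] * grid[2], 1)
    feats.index_add_(0, flat, pts)                                  # sum per cell
    count.index_add_(0, flat, torch.ones(len(pts), 1))
    return (feats / count.clamp(min=1)).reshape(*grid, channels)
```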
next, at step 202, the image is lifted. In one embodiment of the invention, an Image-Voxel Lifting Module (IVLM) lifts the image features F into the three-dimensional homogeneous image voxel space to obtain the image voxel features I ∈ R^(X_I×Y_I×Z_I×C_F), where C_F is the number of channels of the image voxel features and (X_I, Y_I, Z_I) is the size of the image voxel grid. The specific dimension-increasing method is as described above;
next, in step 203, the features are fused. In one embodiment of the invention, a Query Fusion Mechanism (QFM) is used to fuse the homogeneous point cloud voxel features and image voxel features, so as to mine the complementary information between the point cloud and the image and allow each point cloud voxel feature to perceive the whole image and selectively fuse image voxel features from it. In one embodiment of the invention, a self-attention module treats each voxel feature vector of the image and of the point cloud as a homogeneous token, i.e., the point cloud voxel features F_P serve as the queries and the image voxel features F_I serve as the keys and values, and the two features are finally transformed into a fused voxel feature by the self-attention mechanism, which specifically includes:
first, for each attention head j, j = 1, 2, …, r, applying three learnable linear transformations W_j^Q, W_j^K and W_j^V to the queries, keys and values to obtain Q_j, K_j and V_j:
Q_j = F_P·W_j^Q, K_j = F_I·W_j^K, V_j = F_I·W_j^V;
next, mapping the r attention heads (head_1, head_2, …, head_r) to the homogeneous three-dimensional space to obtain the multi-head attention:
A_M = Concat(head_1, head_2, …, head_r)·W^O
where W^O is the linear transformation matrix that maps the r attention heads to the homogeneous three-dimensional space, and head_j = Attention(Q_j, K_j, V_j) denotes standard scaled dot-product attention; and
finally, concatenating the multi-head attention A_M with the point cloud voxel features F_P to obtain the fused features. A condensed code sketch of this fusion step follows.
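The sketch below uses torch.nn.MultiheadAttention as shorthand for the per-head transformations W_j^Q, W_j^K, W_j^V and the output projection W^O; the channel size, head count, and concatenation layout are assumptions for illustration.

```python
# Condensed sketch of query fusion: point cloud voxels attend over image voxels.
import torch
import torch.nn as nn

class QueryFusion(nn.Module):
    def __init__(self, channels=64, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)

    def forward(self, f_p, f_i):
        """f_p: (M, C_F) non-empty point cloud voxel features (queries)
           f_i: (N, C_F) flattened image voxel features (keys and values)
           returns: (M, 2 * C_F) fused voxel features."""
        a_m, _ = self.attn(f_p[None], f_i[None], f_i[None])  # multi-head A_M
        return torch.cat([a_m[0], f_p], dim=-1)              # concat with F_P

# Example with 5000 non-empty point cloud voxels and 2048 pooled image voxels.
qfm = QueryFusion(channels=64, num_heads=4)
fused = qfm(torch.randn(5000, 64), torch.randn(2048, 64))
print(fused.shape)  # torch.Size([5000, 128])
```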
Since most lidar point cloud voxels are empty, in one embodiment of the invention the queries F_P ∈ R^(M×C_F) are constructed by extracting the M non-empty voxels from the point cloud voxel features P. The image voxels, by contrast, are obtained by lifting the two-dimensional image features into three-dimensional space, and methods such as trilinear interpolation are used in their construction, so the image voxels are denser than the point cloud voxels, i.e., they contain more non-empty voxels. Therefore, in order to reduce the computational overhead, in one embodiment of the invention three-dimensional max pooling (3D Max-Pooling) is used to extract from the image voxel features I the portion I* whose information content is higher than a preset value, where λ is a preset ratio, and I* is then flattened along its first three dimensions to obtain F_I. In one embodiment of the invention, the fused features produced for the M non-empty voxels are further scattered back into the homogeneous voxel space to obtain P*, which serves as the input of the final downstream three-dimensional target detection module (a sketch of this bookkeeping is given after this paragraph); and
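In the sketch below, the tensor layouts, the non-empty test, and the parameter lam, standing in for the preset ratio λ, are assumptions; the QueryFusion module from the previous sketch is reused.

```python
# Gather non-empty point cloud voxels, compress image voxels, fuse, scatter back.
import torch
import torch.nn.functional as F

def fuse_voxel_grids(P, I, qfm, lam=4):
    """P:   (X_P, Y_P, Z_P, C_F) point cloud voxel features (mostly empty)
       I:   (X_I, Y_I, Z_I, C_F) image voxel features
       qfm: a QueryFusion module as sketched above
       returns P_star: (X_P, Y_P, Z_P, 2 * C_F) fused voxel grid."""
    # queries: the M non-empty point cloud voxels
    mask = P.abs().sum(dim=-1) > 0                               # (X_P, Y_P, Z_P)
    f_p = P[mask]                                                # (M, C_F)
    # keys/values: keep the most informative image voxels via 3D max pooling,
    # then flatten the first three (spatial) dimensions
    i_star = F.max_pool3d(I.permute(3, 0, 1, 2)[None], kernel_size=lam)
    f_i = i_star[0].permute(1, 2, 3, 0).reshape(-1, I.shape[-1]) # (N, C_F)
    # query fusion, then scatter the fused features back to the non-empty cells
    fused = qfm(f_p, f_i)                                        # (M, 2 * C_F)
    P_star = torch.zeros(*P.shape[:3], fused.shape[-1], dtype=P.dtype)
    P_star[mask] = fused
    return P_star
```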
finally, at step 204, the target is identified. A detection module generates bounding boxes and object classification results for the targets from P*.
Since the lidar and the camera are different modalities, the same object has different data representations in each of them, but their high-level features should be very similar because they describe the same object. Based on this, in order to further improve the detection performance of the model, in an embodiment of the present invention a Voxel Feature Interaction Module (VFIM) is designed to associate the features of the two modalities at the target level based on the detection results, so as to improve the semantic consistency of the two cross-modal homogeneous features. Specifically, in one embodiment of the present invention, cross-modal feature interaction is realized according to the consistency of target-level attributes in the point cloud and the image, that is, a target-level similarity constraint is applied to the homogeneous point cloud voxel features P and image voxel features I, so as to achieve better cross-modal feature fusion performance.
FIG. 3 illustrates a process diagram for applying a target-level similarity constraint according to an embodiment of the invention. As shown in FIG. 3, applying a target-level similarity constraint to the point cloud voxel features and the image voxel features includes:
first, sampling N three-dimensional detection boxes B = {B_1, B_2, …, B_N} from the three-dimensional detection head;
next, using voxel-level region-of-interest pooling (Voxel RoI-Pooling) to crop the region-of-interest (RoI) features of each modality from the point cloud voxel features and the image voxel features, respectively;
then, converting the region-of-interest features into a metric space to obtain N pairs of metric features (p, e). In order to enhance the interaction between each pair of region-of-interest features, in one embodiment of the invention the region-of-interest features are fed into an encoder Ω and a predictor Ψ composed of a multi-layer perceptron to convert them into the metric space; and
finally, minimizing the cosine similarity distance of each pair of metric features. In one embodiment of the present invention, a gradient cut-off (stop-gradient) strategy is further used on the encoder and the predictor to better model the similarity constraint. A hedged sketch of this constraint is given below.
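The patent text only specifies the encoder Ω, the multi-layer-perceptron predictor Ψ, the cosine similarity distance, and the gradient cut-off; the symmetric SimSiam-style form of the loss, the layer sizes, and the module name in the sketch are assumptions.

```python
# Hedged sketch of the object-level similarity constraint with stop-gradient.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VoxelFeatureInteraction(nn.Module):
    def __init__(self, roi_dim=256, metric_dim=128):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(roi_dim, metric_dim), nn.ReLU(),
                                     nn.Linear(metric_dim, metric_dim))    # Ω
        self.predictor = nn.Sequential(nn.Linear(metric_dim, metric_dim), nn.ReLU(),
                                       nn.Linear(metric_dim, metric_dim))  # Ψ

    @staticmethod
    def neg_cosine(p, e):
        # cosine similarity distance; the target branch e is detached (stop-gradient)
        return -F.cosine_similarity(p, e.detach(), dim=-1).mean()

    def forward(self, roi_pc, roi_img):
        """roi_pc, roi_img: (N, roi_dim) RoI features pooled from the point cloud
           and image voxel grids for the same N predicted boxes."""
        z_p, z_i = self.encoder(roi_pc), self.encoder(roi_img)
        p_p, p_i = self.predictor(z_p), self.predictor(z_i)
        # symmetric loss with stop-gradient on the opposite branch
        return 0.5 * self.neg_cosine(p_p, z_i) + 0.5 * self.neg_cosine(p_i, z_p)

# Example: 128 sampled boxes with 256-dimensional RoI features from each modality.
vfim = VoxelFeatureInteraction()
loss = vfim(torch.randn(128, 256), torch.randn(128, 256))
```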
the three-dimensional target detection method projects the image to the three-dimensional space, and then is fused with the point cloud in the three-dimensional space, and compared with other methods of projecting the three-dimensional point cloud to the two-dimensional image and the like, the method has the advantage that the information loss in the fusion process is smaller. A large number of experiments prove that the method achieves the best performance in target detection data sets such as KITTI and the like.
While various embodiments of the present invention have been described above, it should be understood that they have been presented by way of example only, and not limitation. It will be apparent to persons skilled in the relevant art that various combinations, modifications, and changes can be made thereto without departing from the spirit and scope of the invention. Thus, the breadth and scope of the present invention should not be limited by any of the above-described exemplary embodiments, but should be defined only in accordance with the following claims and their equivalents.

Claims (12)

1. An image upscaling method, characterized by comprising the steps of:
extracting two-dimensional image features of the image;
generating view frustum features based on the two-dimensional image features; and
mapping the view frustum features to a three-dimensional space.
2. An image upscaling method according to claim 1, characterized in that an outer product operation is performed between the two-dimensional image feature F_(m,n) ∈ R^(C_F) of each pixel (m, n) and its depth bin D_(m,n) to obtain the view frustum feature corresponding to each pixel:
G_(m,n) = F_(m,n) ⊗ D_(m,n)
wherein the depth bins are arranged along the ray direction of the perspective projection of the image view frustum and are formed by W_F×H_F one-hot discretized depth vectors of dimension R, and wherein W_F, H_F and C_F respectively denote the width, height and number of channels of the two-dimensional image features.
3. The image upscaling method of claim 1, wherein mapping the view frustum features G to the three-dimensional space I comprises traversing each position i in the three-dimensional space and performing the following operations:
sampling the centroid of the view frustum features based on a calibration matrix CM:
v'_i = CM(v_i)
wherein v_i and v'_i represent the three-dimensional coordinate values of the i-th voxel in I and in G, respectively; and
performing trilinear interpolation over the neighborhood of v'_i in G to form the image voxel feature I_i.
4. A three-dimensional target detection method, characterized by comprising the following steps:
extracting point cloud voxel features P ∈ R^(X_P×Y_P×Z_P×C_F) from a point cloud, and extracting two-dimensional image features F ∈ R^(W_F×H_F×C_F) from an image, wherein C_F is the number of channels of the point cloud voxel features, (X_P, Y_P, Z_P) is the size of the point cloud voxel grid, and W_F, H_F and C_F respectively denote the width, height and number of channels of the two-dimensional image features;
using the image upscaling method according to any one of claims 1 to 3, upscaling the two-dimensional image features into a three-dimensional homogeneous image voxel space to obtain image voxel features I ∈ R^(X_I×Y_I×Z_I×C_F), wherein C_F is the number of channels of the image voxel features and (X_I, Y_I, Z_I) is the size of the image voxel grid;
fusing the point cloud voxel features with the image voxel features to obtain fused features; and
identifying and classifying a target based on the fused features.
5. The method of claim 4, wherein a query fusion mechanism is used to fuse the point cloud voxel features with the image voxel features.
6. The three-dimensional object detection method of claim 5, wherein a self-attention layer is adopted as the query fusion mechanism, comprising the steps of:
using the point cloud voxel features F_P as the queries and the image voxel features F_I as the keys and values;
for each attention head j, applying three learnable linear transformations W_j^Q, W_j^K and W_j^V to the queries, keys and values to obtain Q_j, K_j and V_j:
Q_j = F_P·W_j^Q, K_j = F_I·W_j^K, V_j = F_I·W_j^V;
mapping the r attention heads (head_1, head_2, …, head_r) to a homogeneous three-dimensional space to obtain multi-head attention:
A_M = Concat(head_1, head_2, …, head_r)·W^O
wherein W^O is a linear transformation matrix and head_j = Attention(Q_j, K_j, V_j); and
concatenating the multi-head attention A_M with the point cloud voxel features F_P to obtain the fused features.
7. The three-dimensional object detection method of claim 6, wherein the construction of the queries comprises:
extracting the M non-empty voxels from the point cloud voxel features P to construct the queries F_P ∈ R^(M×C_F).
8. The three-dimensional object detection method of claim 6, wherein the construction of the keys and values comprises:
extracting, by three-dimensional max pooling, the portion I* of the image voxel features I whose information content is higher than a preset value, wherein λ is a preset ratio; and
flattening I* along its first three dimensions to obtain F_I.
9. The three-dimensional object detection method of claim 6, further comprising:
scattering the fused features into the point cloud voxel space to obtain P*, which serves as the basis for target recognition and classification.
10. The three-dimensional object detection method of claim 4, further comprising applying a target-level similarity constraint to the point cloud voxel features and the image voxel features, including:
cropping N sets of region-of-interest features from the point cloud voxel features and the image voxel features using voxel-level region-of-interest pooling;
converting the N sets of region-of-interest features into a metric space to obtain N pairs of metric features (p, e); and
minimizing the cosine similarity distance of each pair of metric features.
11. The method of claim 10, wherein the N sets of region-of-interest features are transformed into the metric space by an encoder Ω and a predictor Ψ, and a gradient cut-off (stop-gradient) strategy is adopted for the encoder and the predictor.
12. the three-dimensional object detection method of claim 11, wherein the predictor comprises a multi-layered perceptron.
CN202210847218.5A 2022-07-19 2022-07-19 Image dimension-increasing method and three-dimensional target detection method Active CN115063539B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210847218.5A CN115063539B (en) 2022-07-19 2022-07-19 Image dimension-increasing method and three-dimensional target detection method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210847218.5A CN115063539B (en) 2022-07-19 2022-07-19 Image dimension-increasing method and three-dimensional target detection method

Publications (2)

Publication Number Publication Date
CN115063539A (en) 2022-09-16
CN115063539B CN115063539B (en) 2024-07-02

Family

ID=83205824

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210847218.5A Active CN115063539B (en) 2022-07-19 2022-07-19 Image dimension-increasing method and three-dimensional target detection method

Country Status (1)

Country Link
CN (1) CN115063539B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115591240A (en) * 2022-12-01 2023-01-13 腾讯科技(深圳)有限公司(Cn) Feature extraction method, device and equipment for three-dimensional game scene and storage medium
CN116665189A (en) * 2023-07-31 2023-08-29 合肥海普微电子有限公司 Multi-mode-based automatic driving task processing method and system
CN116740668A (en) * 2023-08-16 2023-09-12 之江实验室 Three-dimensional object detection method, three-dimensional object detection device, computer equipment and storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070036437A1 (en) * 2003-02-27 2007-02-15 Arizona Board Of Regents/Arizona State University Comparative and analytic apparatus method for converting two-dimentional bit map data into three-dimensional data
CN113793255A (en) * 2021-09-09 2021-12-14 百度在线网络技术(北京)有限公司 Method, apparatus, device, storage medium and program product for image processing

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070036437A1 (en) * 2003-02-27 2007-02-15 Arizona Board Of Regents/Arizona State University Comparative and analytic apparatus method for converting two-dimentional bit map data into three-dimensional data
CN113793255A (en) * 2021-09-09 2021-12-14 百度在线网络技术(北京)有限公司 Method, apparatus, device, storage medium and program product for image processing

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115591240A (en) * 2022-12-01 2023-01-13 腾讯科技(深圳)有限公司(Cn) Feature extraction method, device and equipment for three-dimensional game scene and storage medium
WO2024114152A1 (en) * 2022-12-01 2024-06-06 腾讯科技(深圳)有限公司 Feature extraction method and apparatus for three-dimensional scene, and device and storage medium
CN116665189A (en) * 2023-07-31 2023-08-29 合肥海普微电子有限公司 Multi-mode-based automatic driving task processing method and system
CN116665189B (en) * 2023-07-31 2023-10-31 合肥海普微电子有限公司 Multi-mode-based automatic driving task processing method and system
CN116740668A (en) * 2023-08-16 2023-09-12 之江实验室 Three-dimensional object detection method, three-dimensional object detection device, computer equipment and storage medium
CN116740668B (en) * 2023-08-16 2023-11-14 之江实验室 Three-dimensional object detection method, three-dimensional object detection device, computer equipment and storage medium

Also Published As

Publication number Publication date
CN115063539B (en) 2024-07-02

Similar Documents

Publication Publication Date Title
Zamanakos et al. A comprehensive survey of LIDAR-based 3D object detection methods with deep learning for autonomous driving
Qi et al. Frustum pointnets for 3d object detection from rgb-d data
CN115063539B (en) Image dimension-increasing method and three-dimensional target detection method
Xu et al. Zoomnet: Part-aware adaptive zooming neural network for 3d object detection
CN109655019B (en) Cargo volume measurement method based on deep learning and three-dimensional reconstruction
Königshof et al. Realtime 3d object detection for automated driving using stereo vision and semantic information
KR102267562B1 (en) Device and method for recognition of obstacles and parking slots for unmanned autonomous parking
Hirschmuller Stereo processing by semiglobal matching and mutual information
Hoppe et al. Incremental Surface Extraction from Sparse Structure-from-Motion Point Clouds.
CN108269266A (en) Segmentation image is generated using Markov random field optimization
KR101907883B1 (en) Object detection and classification method
Gählert et al. Single-shot 3d detection of vehicles from monocular rgb images via geometrically constrained keypoints in real-time
Parmehr et al. Automatic registration of optical imagery with 3d lidar data using local combined mutual information
Gigli et al. Road segmentation on low resolution lidar point clouds for autonomous vehicles
Tang et al. Content-based 3-D mosaics for representing videos of dynamic urban scenes
Gählert et al. Single-shot 3d detection of vehicles from monocular rgb images via geometry constrained keypoints in real-time
CN113409242A (en) Intelligent monitoring method for point cloud of rail intersection bow net
CN116704307A (en) Target detection method and system based on fusion of image virtual point cloud and laser point cloud
Kozonek et al. On the fusion of camera and lidar for 3D object detection and classification
CN116843829A (en) Concrete structure crack three-dimensional reconstruction and length quantization method based on binocular video
He et al. Planar constraints for an improved uav-image-based dense point cloud generation
Li et al. Automatic Keyline Recognition and 3D Reconstruction For Quasi‐Planar Façades in Close‐range Images
Kitt et al. Trinocular optical flow estimation for intelligent vehicle applications
Zhang et al. A Novel Point Cloud Compression Algorithm for Vehicle Recognition Using Boundary Extraction
Zhang et al. PMVC: Promoting Multi-View Consistency for 3D Scene Reconstruction

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant