CN115063768A - Three-dimensional target detection method, encoder and decoder - Google Patents

Three-dimensional target detection method, encoder and decoder

Info

Publication number
CN115063768A
CN115063768A
Authority
CN
China
Prior art keywords
point cloud
image
feature map
feature
enhanced
Prior art date
Legal status
Pending
Application number
CN202210810402.2A
Other languages
Chinese (zh)
Inventor
苗振伟
杨泽宇
陈家棋
占新
卿泉
张力
Current Assignee
Alibaba Damo Institute Hangzhou Technology Co Ltd
Original Assignee
Alibaba Damo Institute Hangzhou Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Alibaba Damo Institute Hangzhou Technology Co Ltd filed Critical Alibaba Damo Institute Hangzhou Technology Co Ltd
Priority to CN202210810402.2A
Publication of CN115063768A

Classifications

    • G06V 20/56: Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
    • G06V 10/806: Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level, of extracted features
    • G06V 10/82: Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G06V 20/64: Three-dimensional objects
    • G06V 2201/07: Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The embodiments of the application provide a three-dimensional target detection method, an electronic device and a computer storage medium. The three-dimensional target detection method includes: acquiring an image and a three-dimensional point cloud of a three-dimensional environment to be detected, together with an image feature map corresponding to the image and a point cloud feature map corresponding to the three-dimensional point cloud; performing feature interaction between the image feature map and the point cloud feature map, and obtaining, from the feature interaction result, an enhanced image feature map fused with point cloud features and an enhanced point cloud feature map fused with image features; and detecting a three-dimensional target in the three-dimensional environment based on the enhanced image feature map and the enhanced point cloud feature map. With the embodiments of the application, a high-quality three-dimensional target detection result can be obtained.

Description

Three-dimensional target detection method, encoder and decoder
Technical Field
The embodiments of the application relate to the technical field of target detection, and in particular to a three-dimensional target detection method, an electronic device and a computer storage medium.
Background
In fields such as autonomous driving and robotics, the ability to perceive the environment is essential to guaranteeing that such systems operate properly. To perceive the environment correctly, three-dimensional target detection is a basic function of environment perception. Through three-dimensional target detection, the position and category of an object in a three-dimensional scene can be predicted, and its size and orientation can be determined. On this basis, job tasks such as trajectory prediction and path planning can be carried out.
To perceive the surrounding environment, vehicles or robots with an autonomous driving function are often equipped with multiple sensors, such as a lidar and surround-view cameras, so that the data of different modalities acquired by the sensors can complement one another. Existing multi-sensor three-dimensional target detection methods run detection independently on the images acquired by the surround-view cameras and on the laser point cloud acquired by the lidar, and only merge the two at the level of the detection results. Although existing single-modality detection algorithms can be used directly in this way and the fusion difficulty is low, each modality independently produces its initial detection result, so the complementary information of the two modalities cannot be fully exploited, and a high-quality three-dimensional detection result cannot be obtained.
Disclosure of Invention
In view of the above, embodiments of the present application provide a three-dimensional object detection scheme to at least partially solve the above problems.
According to a first aspect of the embodiments of the present application, there is provided a three-dimensional target detection method, including: acquiring an image and a three-dimensional point cloud of a three-dimensional environment to be detected, together with an image feature map corresponding to the image and a point cloud feature map corresponding to the three-dimensional point cloud; performing feature interaction between the image feature map and the point cloud feature map, and obtaining, from the feature interaction result, an enhanced image feature map fused with point cloud features and an enhanced point cloud feature map fused with image features; and detecting a three-dimensional target in the three-dimensional environment based on the enhanced image feature map and the enhanced point cloud feature map.
According to a second aspect of the embodiments of the present application, there is provided an encoder, wherein the encoder comprises a plurality of encoding layers, and each encoding layer includes: a feature input part, configured to receive the image feature map and the point cloud feature map output by the preceding layer; a feature interaction part, configured to perform, for the image feature map, a first feature interaction from image features to image features and a second feature interaction from point cloud features to image features, and to perform, for the point cloud feature map, a third feature interaction from point cloud features to point cloud features and a fourth feature interaction from image features to point cloud features; and a feature fusion part, configured to perform feature fusion based on the first and second feature interaction results to obtain an enhanced image feature map, and to perform feature fusion based on the third and fourth feature interaction results to obtain an enhanced point cloud feature map.
According to a third aspect of the embodiments of the present application, there is provided a decoder, wherein the decoder comprises a plurality of decoding layers, and adjacent decoding layers decode different types of feature maps, the different types including an enhanced point cloud feature map and an enhanced image feature map. The enhanced point cloud feature map and the enhanced image feature map are obtained by performing feature interaction between a point cloud feature map and an image feature map corresponding to the three-dimensional environment to be detected, and deriving, from the feature interaction results, the enhanced point cloud feature map fused with image features and the enhanced image feature map fused with point cloud features; the point cloud feature map and the image feature map are obtained by extracting features from the image and the three-dimensional point cloud corresponding to the three-dimensional environment, respectively.
According to a fourth aspect of the embodiments of the present application, there is provided an electronic device, including: a processor, a memory, a communication interface and a communication bus, wherein the processor, the memory and the communication interface communicate with one another through the communication bus; the memory is configured to store at least one executable instruction, and the executable instruction causes the processor to perform the operations corresponding to the method according to the first aspect.
According to a fifth aspect of embodiments herein, there is provided a computer storage medium having stored thereon a computer program which, when executed by a processor, implements the method according to the first aspect.
According to the solutions provided by the embodiments of the present application, three-dimensional target detection is performed based on two modalities of the three-dimensional environment to be detected: the image and the three-dimensional point cloud. During detection, feature interaction is performed between the image feature map corresponding to the image and the point cloud feature map corresponding to the three-dimensional point cloud, so that the information of each modality is effectively fused into the other and the two modalities complement each other. This realizes feature enhancement through the fused modal information and produces the corresponding enhanced image feature map and enhanced point cloud feature map. Performing three-dimensional target detection on the basis of the enhanced image feature map and the enhanced point cloud feature map draws on information that is more comprehensive, richer and more distinctive, so the detection is more accurate and more efficient, and a high-quality three-dimensional target detection result can be obtained.
Drawings
To describe the technical solutions in the embodiments of the present application or in the prior art more clearly, the drawings needed for the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description show only some of the embodiments described in the embodiments of the present application, and those skilled in the art can obtain other drawings from them.
FIG. 1A is a schematic diagram of an exemplary system suitable for use with embodiments of the present application;
FIG. 1B is a diagram illustrating an exemplary three-dimensional object detection model architecture suitable for use with embodiments of the present application;
FIG. 2A is a flowchart illustrating steps of a method for detecting a three-dimensional object according to an embodiment of the present disclosure;
FIG. 2B is a diagram illustrating a structure of an encoding layer in the embodiment shown in FIG. 2A;
FIG. 2C is a schematic diagram of feature interaction from point cloud features to image features based on the encoder shown in FIG. 2B;
FIG. 2D is a schematic illustration of feature mapping from image features to point cloud features based on the encoder shown in FIG. 2B;
FIG. 2E is a block diagram of a decoder according to the embodiment shown in FIG. 2A;
FIG. 2F is a block diagram of a decoding layer in the decoder shown in FIG. 2E;
FIG. 2G is a diagram illustrating an example of a scenario in the embodiment shown in FIG. 2A;
fig. 3 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
To help those skilled in the art better understand the technical solutions in the embodiments of the present application, those solutions are described clearly and completely below with reference to the drawings of the embodiments. Obviously, the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present application shall fall within the protection scope of the embodiments of the present application.
The following further describes specific implementations of embodiments of the present application with reference to the drawings of the embodiments of the present application.
FIG. 1A illustrates an exemplary system to which embodiments of the present application may be applied. As shown in fig. 1A, the system 100 may include a cloud server 102, a communication network 104, and/or one or more user devices 106, illustrated in fig. 1A as a plurality of user devices.
Cloud server 102 may be any suitable device for storing information, data, programs, and/or any other suitable type of content, including but not limited to distributed storage system devices, server clusters, computing cloud server clusters, and the like. In some embodiments, cloud server 102 may perform any suitable functions. For example, in some embodiments, the cloud server 102 may be configured to perform three-dimensional target detection based on an image and a three-dimensional point cloud of a three-dimensional environment to be detected. As an optional example, in some embodiments, the cloud server 102 may be configured to perform feature interaction between an image feature map corresponding to an image of the three-dimensional environment to be detected and a point cloud feature map corresponding to the three-dimensional point cloud, obtain the feature-fused enhanced image feature map and enhanced point cloud feature map from the feature interaction result, and then carry out three-dimensional target detection based on the enhanced image feature map and the enhanced point cloud feature map. As another example, in some embodiments, the cloud server 102 may be configured to send the three-dimensional target detection result to the user equipment, or to send the result of downstream job-task processing performed on the basis of the three-dimensional target detection result to the user equipment.
In some embodiments, the communication network 104 may be any suitable combination of one or more wired and/or wireless networks. For example, the communication network 104 can include, but is not limited to, any one or more of the following: the internet, an intranet, a Wide Area Network (WAN), a Local Area Network (LAN), a wireless network, a Digital Subscriber Line (DSL) network, a frame relay network, an Asynchronous Transfer Mode (ATM) network, a Virtual Private Network (VPN), and/or any other suitable communication network. The user device 106 can be connected to the communication network 104 via one or more communication links (e.g., communication link 112), and the communication network 104 can be linked to the cloud server 102 via one or more communication links (e.g., communication link 114). A communication link may be any link suitable for communicating data between the user device 106 and the cloud server 102, such as a network link, a dial-up link, a wireless link, a hardwired link, any other suitable communication link, or any suitable combination of such links.
The user device 106 may include any one or more user devices suitable for interacting with a user and capable of acquiring images and point cloud data of a three-dimensional environment. In some embodiments, user devices 106 may comprise any suitable type of device. For example, in some embodiments, user device 106 may include a vehicle, aircraft, robot, and/or any other suitable type of user device with autopilot functionality.
In an alternative embodiment, the three-dimensional object detection provided by the embodiment of the present application may be implemented based on an exemplary three-dimensional object detection model structure, as shown in fig. 1B.
As can be seen, the three-dimensional object detection model structure includes a feature extractor, an encoder, and a decoder.
The feature extractor comprises two parts. One part performs feature extraction on the three-dimensional point cloud to obtain the corresponding point cloud feature map. In fig. 1B, a point cloud feature extractor extracts features from the acquired three-dimensional point cloud data and generates a corresponding point cloud bird's-eye-view feature map; those skilled in the art should understand that point cloud feature maps in other, non-bird's-eye-view forms are also applicable to the solution of the embodiments of the present application. The other part performs feature extraction on the image to obtain the corresponding image feature map. Fig. 1B illustrates an image feature extractor extracting features from surround-view images captured by a surround-view camera and generating a corresponding surround-view image feature map; it should likewise be apparent to those skilled in the art that images captured by non-surround-view cameras, or non-surround-view images, are also applicable to the solution of the embodiments of the present application. Both parts can be implemented by any appropriate structure capable of extracting features from the corresponding modal data, including but not limited to an encoder structure, a convolutional network structure, and the like.
The encoder in the three-dimensional target detection model provides the multi-modal feature interaction function: a deep structure is constructed for the feature fusion stage so that the features of the two modalities interact densely, realizing mutual fusion and enhancement of the features of the two modalities. Through the encoder, feature interaction is performed between the image features corresponding to the images of the three-dimensional environment and the point cloud features corresponding to the three-dimensional point cloud, yielding enhanced image features fused with the point cloud features and enhanced point cloud features fused with the image features.
The decoder in the three-dimensional target detection model performs three-dimensional target detection based on the enhanced image feature map and the enhanced point cloud feature map produced by the feature-interaction enhancement of the encoder; its specific structure and functions are described in detail in the corresponding parts below. In one possible approach, the decoder alternately draws on the enhanced feature maps of the two modalities to iteratively refine the prediction boxes and produce a final prediction, and then outputs the prediction result, i.e., the three-dimensional target detection result.
Hereinafter, the three-dimensional object detection scheme provided in the embodiments of the present application will be described by embodiments based on the above description of the system and model structure.
Referring to fig. 2A, a flowchart illustrating steps of a method for detecting a three-dimensional object according to an embodiment of the present application is shown.
The three-dimensional target detection method comprises the following steps:
step S202: the method comprises the steps of obtaining an image and a three-dimensional point cloud of a three-dimensional environment to be detected, and obtaining an image characteristic diagram corresponding to the image and a point cloud characteristic diagram corresponding to the three-dimensional point cloud.
The three-dimensional environment to be detected is generally the physical environment in which certain equipment is located, in particular vehicles, aircraft or robots with an autonomous driving function. Such devices acquire data about the physical environment in which they are located, i.e. the three-dimensional environment, through their various sensors, such as a lidar and a camera (which may be a surround-view or a non-surround-view camera).
After the data of the two modalities, namely the image of the three-dimensional environment and the three-dimensional point cloud, are obtained, feature extraction can be performed on each to obtain the image feature map corresponding to the image and the point cloud feature map corresponding to the three-dimensional point cloud. When the three-dimensional target detection model shown in fig. 1B is used, the image feature map is obtained by the image feature extractor in the feature extractor, and the point cloud feature map is obtained by the point cloud feature extractor. Optionally, the point cloud feature map may be a point cloud bird's-eye-view feature map, which captures the information in the three-dimensional point cloud, especially depth information, more effectively. As mentioned above, however, other ways of performing feature extraction are also applicable to the solution of the embodiments of the present application.
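As a concrete illustration of this step, the following is a minimal sketch (not taken from the patent) of the two feature extractors, assuming PyTorch and torchvision; the ResNet-50 image backbone and the simple scatter-to-BEV point cloud branch are illustrative assumptions, and any of the encoder or convolutional-network structures mentioned above could equally be used.

```python
import torch
import torch.nn as nn
import torchvision

class ImageFeatureExtractor(nn.Module):
    def __init__(self, out_channels=256):
        super().__init__()
        backbone = torchvision.models.resnet50(weights=None)
        self.stem = nn.Sequential(*list(backbone.children())[:-2])   # keep conv stages only
        self.reduce = nn.Conv2d(2048, out_channels, kernel_size=1)

    def forward(self, images):                      # (B, 3, H, W)
        return self.reduce(self.stem(images))       # (B, C, H/32, W/32) image feature map

class PointCloudBEVExtractor(nn.Module):
    """Scatters per-point features onto a BEV grid, then refines them with a small CNN."""
    def __init__(self, in_dim=4, out_channels=256, grid=(200, 200)):
        super().__init__()
        self.grid = grid
        self.point_mlp = nn.Sequential(nn.Linear(in_dim, out_channels), nn.ReLU())
        self.bev_cnn = nn.Sequential(
            nn.Conv2d(out_channels, out_channels, 3, padding=1), nn.ReLU())

    def forward(self, points, bev_indices):          # points: (N, 4); bev_indices: (N,) flat cell ids
        feats = self.point_mlp(points)               # (N, C)
        C = feats.shape[1]
        bev = feats.new_zeros(self.grid[0] * self.grid[1], C)
        bev.index_add_(0, bev_indices, feats)        # sum-pool the point features per BEV cell
        bev = bev.t().reshape(1, C, *self.grid)
        return self.bev_cnn(bev)                     # (1, C, H_bev, W_bev) point cloud BEV feature map
```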
Step S204: and carrying out feature interaction between the image feature map and the point cloud feature map, and obtaining an enhanced image feature map fused with the point cloud features and an enhanced point cloud feature map fused with the image features according to a feature interaction result.
The image and the three-dimensional point cloud describe the three-dimensional environment from different angles, but because of their different acquisition and presentation modes they carry different information. Conventional methods process the two kinds of data separately and only integrate them at the result level; since the two are handled independently throughout processing, the resulting detections are unbalanced, and even integrating them at the result level cannot produce a more accurate result. In the solution provided by the embodiment of the application, the features corresponding to the two kinds of data therefore interact at the feature stage so that they complement each other: the feature maps generated after the interaction carry richer and more comprehensive information, providing high-quality feature data for the subsequent three-dimensional target detection.
Based on this, in one possible way, this step can be implemented as follows. For the image feature map, a first feature interaction from image features to image features and a second feature interaction from point cloud features to image features are performed, and the enhanced image feature map fused with point cloud features is obtained from the results of the first and second feature interactions. For the point cloud feature map, a third feature interaction from point cloud features to point cloud features and a fourth feature interaction from image features to point cloud features are performed, and the enhanced point cloud feature map fused with image features is obtained from the results of the third and fourth feature interactions. In this way the features of the two modalities are effectively fused, and the feature representation of each modality itself is further strengthened.
The first feature interaction (image features to image features) and the third feature interaction (point cloud features to point cloud features) can be obtained by performing feature enhancement processing, for example local attention computation, on the feature data of the modality itself. Specifically, local attention is computed over the image features to realize the first feature interaction and obtain an enhanced first image feature part, and local attention is computed over the point cloud features to realize the third feature interaction and obtain an enhanced first point cloud feature part.
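The local attention computation mentioned here could, for instance, be realized as window-based self-attention over a feature map. The following is a minimal sketch assuming PyTorch; the window size and head count are illustrative assumptions, not values from the patent.

```python
import torch
import torch.nn as nn

class LocalSelfAttention(nn.Module):
    """Self-attention restricted to non-overlapping spatial windows of a feature map."""
    def __init__(self, dim=256, heads=8, window=7):
        super().__init__()
        self.window = window
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                                 # x: (B, C, H, W)
        B, C, H, W = x.shape
        w = self.window
        pad_h, pad_w = (-H) % w, (-W) % w
        x = nn.functional.pad(x, (0, pad_w, 0, pad_h))
        Hp, Wp = x.shape[2], x.shape[3]
        # partition the map into windows: (B * n_windows, w*w, C)
        x = x.reshape(B, C, Hp // w, w, Wp // w, w)
        x = x.permute(0, 2, 4, 3, 5, 1).reshape(-1, w * w, C)
        out, _ = self.attn(x, x, x)                       # attention within each window only
        out = out.reshape(B, Hp // w, Wp // w, w, w, C)
        out = out.permute(0, 5, 1, 3, 2, 4).reshape(B, C, Hp, Wp)
        return out[:, :, :H, :W]                          # drop the padding
```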
For the second feature interaction, from point cloud features to image features, in one feasible manner the point cloud features at the positions corresponding to the points in the image feature map may be transformed by projection into image features. To do so, the three-dimensional point cloud may be projected to obtain a bird's-eye view with depth information; the point cloud features at the corresponding positions in the point cloud feature map are obtained according to the positions of the pixels in the bird's-eye view; and the obtained point cloud features are transformed by projection into image features. In this way, feature alignment and fusion can be performed accurately.
Each pixel in the image feature map is associated with a location (coordinate). Based on that location, the corresponding location in the point cloud feature map can be determined (e.g., by projection or coordinate transformation), which in turn corresponds to a point cloud feature. After this point cloud feature is obtained, it can be fused with the image feature at that pixel location to obtain the fused image feature at that location.
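A minimal sketch of this lookup-and-fuse step, assuming PyTorch: for every image pixel, the BEV location it corresponds to is computed from an assumed per-pixel 3D coordinate map (e.g., derived from a completed depth map and the camera calibration), the point cloud feature there is sampled, and the two features are fused. The BEV range and the 1x1 fusion layer are illustrative assumptions.

```python
import torch
import torch.nn as nn

def fuse_bev_into_image(img_feat, bev_feat, pix_xyz, fuse_conv, bev_range=(-50.0, 50.0)):
    """img_feat: (B, C, H, W); bev_feat: (B, C, Hb, Wb); pix_xyz: (B, H, W, 3) per-pixel
    3D coordinates in the lidar frame; fuse_conv: an nn.Conv2d(2*C, C, 1) fusion layer."""
    lo, hi = bev_range
    # normalize the x/y of every pixel's 3D point into [-1, 1] for grid_sample
    grid = (pix_xyz[..., :2] - lo) / (hi - lo) * 2.0 - 1.0                     # (B, H, W, 2)
    sampled = nn.functional.grid_sample(bev_feat, grid, align_corners=False)   # (B, C, H, W)
    # concatenate the gathered point cloud feature with the image feature at each pixel
    return fuse_conv(torch.cat([img_feat, sampled], dim=1))
```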
For the fourth feature interaction, from image features to point cloud features, in one feasible manner, for a given pixel in the point cloud feature map, several reference points of that pixel in the point cloud feature map are obtained; based on the positions of the pixel and its reference points, the image features at the corresponding positions are gathered from the image feature map; and the point cloud feature corresponding to the pixel is fused with the gathered image features. To gather those image features, the positions of the pixel and its reference points are projected to obtain the corresponding two-dimensional coordinates, and features are collected from the image feature map at those coordinates, yielding the corresponding image features. This avoids the deviation that might arise from collecting the image feature of only a single pixel, making the collected image features more objective and effective.
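The sketch below, assuming PyTorch, illustrates the gathering step for a single pixel of the point cloud feature map: several reference points around it are projected into the image plane and the image feature map is sampled at those positions. The reference points, the camera projection matrix, and the assumption that the projected coordinates are already expressed at feature-map resolution are all illustrative.

```python
import torch
import torch.nn as nn

def sample_image_feats_for_bev_pixel(img_feat, ref_points_3d, proj_mat):
    """img_feat: (C, H, W); ref_points_3d: (K, 3) reference points near one BEV pixel;
    proj_mat: (3, 4) camera projection. Returns (K, C) sampled image features."""
    C, H, W = img_feat.shape
    K = ref_points_3d.shape[0]
    homo = torch.cat([ref_points_3d, ref_points_3d.new_ones(K, 1)], dim=1)     # (K, 4)
    uvw = homo @ proj_mat.t()                                                  # project to the image
    uv = uvw[:, :2] / uvw[:, 2:3].clamp(min=1e-5)
    # uv are assumed to already be expressed in feature-map pixel coordinates
    grid = torch.stack([uv[:, 0] / (W - 1), uv[:, 1] / (H - 1)], dim=-1) * 2 - 1
    grid = grid.view(1, K, 1, 2)
    feats = nn.functional.grid_sample(img_feat.unsqueeze(0), grid, align_corners=True)
    return feats.squeeze(0).squeeze(-1).t()                                    # (K, C)
```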
When the three-dimensional object detection model shown in fig. 1B is used, the first, second, third and fourth feature interactions may be implemented by an encoder of the three-dimensional object detection model.
In one example, the encoder may include a plurality of encoding layers, each encoding layer including a feature input part, a feature interaction part and a feature fusion part. In the embodiments of the present application, unless otherwise specified, "a plurality of" or "several" means two or more.
The feature input part receives the image feature map and the point cloud feature map output by the preceding layer (the previous encoding layer, or the feature extraction layer of the feature extractor before the first encoding layer). The feature interaction part performs, for the image feature map, the first feature interaction from image features to image features and the second feature interaction from point cloud features to image features, and performs, for the point cloud feature map, the third feature interaction from point cloud features to point cloud features and the fourth feature interaction from image features to point cloud features. The feature fusion part performs feature fusion based on the first and second feature interaction results to obtain the enhanced image feature map, and performs feature fusion based on the third and fourth feature interaction results to obtain the enhanced point cloud feature map.
An exemplary encoding layer structure is shown in fig. 2B. The input of the first encoding layer of the encoder is the pair of features h_p and h_c extracted by the feature extractor for the two modalities; every other encoding layer takes as input the point cloud feature map (e.g., a point cloud bird's-eye-view feature map) and the image feature map enhanced by the previous encoding layer. Each layer outputs multi-modal feature maps of the same shape enhanced by the current layer, namely an enhanced point cloud feature map (e.g., an enhanced point cloud bird's-eye-view feature map) h'_p and an enhanced image feature map h'_c.
Specifically, as shown in FIG. 2B, each encoding layer contains four feature interactions, which may be denoted φ_{p→c}, φ_{p→p}, φ_{c→p} and φ_{c→c}. Here φ_{x→y} (x, y ∈ {p, c}) denotes a feature interaction from modality x to modality y: it takes h_x and h_y as input and outputs an enhanced feature φ_{x→y}(h_x, h_y) with the same shape as h_y.
Subsequently, the interaction-enhanced features are further fused by two MLPs (multi-layer perceptrons). This process can be formalized as:

h'_p = MLP_p([φ_{p→p}(h_p, h_p); φ_{c→p}(h_c, h_p)])

h'_c = MLP_c([φ_{c→c}(h_c, h_c); φ_{p→c}(h_p, h_c)])

where [·; ·] denotes combining (e.g., concatenating) the two interaction results.
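The following minimal sketch, assuming PyTorch, instantiates the two-MLP fusion formalized above. The four interaction modules are passed in as callables (their internals are sketched separately below), and the two-layer 1x1-convolution MLPs and concatenation-based fusion are illustrative assumptions rather than the patent's exact choices.

```python
import torch
import torch.nn as nn

class EncodingLayer(nn.Module):
    def __init__(self, dim, phi_pp, phi_cp, phi_cc, phi_pc):
        super().__init__()
        self.phi_pp, self.phi_cp = phi_pp, phi_cp    # interactions toward the point cloud branch
        self.phi_cc, self.phi_pc = phi_cc, phi_pc    # interactions toward the image branch
        self.mlp_p = nn.Sequential(nn.Conv2d(2 * dim, dim, 1), nn.ReLU(), nn.Conv2d(dim, dim, 1))
        self.mlp_c = nn.Sequential(nn.Conv2d(2 * dim, dim, 1), nn.ReLU(), nn.Conv2d(dim, dim, 1))

    def forward(self, h_p, h_c):                     # h_p: BEV feature map, h_c: image feature map
        hp_pp = self.phi_pp(h_p, h_p)                # point cloud -> point cloud
        hp_cp = self.phi_cp(h_c, h_p)                # image       -> point cloud
        hc_cc = self.phi_cc(h_c, h_c)                # image       -> image
        hc_pc = self.phi_pc(h_p, h_c)                # point cloud -> image
        h_p_new = self.mlp_p(torch.cat([hp_pp, hp_cp], dim=1))   # enhanced point cloud feature map
        h_c_new = self.mlp_c(torch.cat([hc_cc, hc_pc], dim=1))   # enhanced image feature map
        return h_p_new, h_c_new
```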
The four feature interactions above are explained as follows:
(1) The feature interactions within the two modalities, φ_{p→p} and φ_{c→c}, perform a local self-attention operation that takes the feature map of the modality itself as input, realizing interaction and enhancement of the modality's own features.
(2) For the feature interaction from point cloud features to image features, φ_{p→c}, the embodiment of the present application first uses a Warp operation (for a bird's-eye-view feature map, the BEVWarp operation) to reorganize h_p into the same form as h_c. Then, taking h_c as the source of the query parameter Q of the self-attention computation and the warped h_p as the keys K and values V, local self-attention is computed, and the self-attention result is taken as the result of this feature interaction.
The Warp operation is a transformation operation that converts data from one form to another through a certain transformation algorithm, such as a Euclidean, similarity, affine or projective transformation. Based on this principle, those skilled in the art can adapt it in practical applications to meet their actual requirements.
In the present embodiment, taking the BEVWarp operation as an example, the way h_p is reorganized into the same form as h_c is illustrated in fig. 2C.
As can be seen from fig. 2C, to reorganize h_p into the same form as h_c, the point cloud feature map h_p is first projected onto the image to obtain a sparse depth map; depth completion is then performed on the sparse depth map to obtain a pixel-wise dense depth map; each pixel in the dense depth map is lifted, according to its depth, to the corresponding unique three-dimensional point in space, forming a pseudo point cloud; the pseudo point cloud is then projected onto the bird's-eye-view (BEV) feature map to obtain the corresponding features; finally, the obtained features are filled into the corresponding positions of the image feature map, forming BEV features arranged in the form of image features. This part of the features is subsequently fused with the other part of the image features at the corresponding positions (i.e., the features after φ_{c→c}), yielding the enhanced image feature map fused with the point cloud features.
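As an illustration of the warp just described, the sketch below (assuming PyTorch) starts from an already-completed dense depth map, lifts every image pixel to a 3D point, and samples the BEV feature map at that point, producing BEV features laid out in the shape of the image feature map. Depth completion itself is omitted, and the camera intrinsics/extrinsics and the BEV range are illustrative assumptions.

```python
import torch
import torch.nn as nn

def bev_warp(bev_feat, dense_depth, cam_K_inv, cam_to_lidar, bev_range=(-50.0, 50.0)):
    """bev_feat: (B, C, Hb, Wb); dense_depth: (B, H, W); cam_K_inv: (3, 3) inverse
    intrinsics; cam_to_lidar: (4, 4) extrinsics. Returns BEV features shaped (B, C, H, W)."""
    B, H, W = dense_depth.shape
    device = dense_depth.device
    ys, xs = torch.meshgrid(torch.arange(H, device=device),
                            torch.arange(W, device=device), indexing="ij")
    ones = torch.ones_like(xs, dtype=torch.float32)
    pix = torch.stack([xs.float(), ys.float(), ones], dim=-1)            # (H, W, 3)
    # back-project every pixel to a 3D camera-frame point, then move it to the lidar frame
    cam_pts = (pix @ cam_K_inv.t()) * dense_depth.unsqueeze(-1)          # (B, H, W, 3)
    homo = torch.cat([cam_pts, torch.ones_like(cam_pts[..., :1])], dim=-1)
    lidar_pts = homo @ cam_to_lidar.t()                                  # (B, H, W, 4)
    lo, hi = bev_range
    grid = (lidar_pts[..., :2] - lo) / (hi - lo) * 2.0 - 1.0             # (B, H, W, 2) in [-1, 1]
    # sample the BEV feature at each pseudo point, filling an image-shaped map
    return nn.functional.grid_sample(bev_feat, grid, align_corners=False)
```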
(3) For the feature interaction from image features to point cloud features, φ_{c→p}, a correspondence is first established between each pixel in the point cloud feature map (e.g., the bird's-eye-view BEV feature map) and a set of features in the image feature map. For example, as shown in fig. 2D, for each pixel in the BEV feature map, a preset number of feature points (e.g., 20 points randomly selected within the spatial range of the pixel) are selected and projected onto the image feature map, which is then sampled to obtain the 20 corresponding image features. The preset number can be set flexibly by a person skilled in the art according to actual requirements.
On this basis, the multi-head attention computation of φ_{c→p} is carried out with h_p as the source of the query parameter Q: for each pixel of the BEV feature map acting as Q, the keys K and values V are the image features determined in the manner above, local attention is computed, and the attention result is taken as the result of this feature interaction. This part of the features is subsequently fused with the other part of the point cloud features at the corresponding positions (i.e., the features after φ_{p→p}), yielding the enhanced point cloud feature map fused with the image features.
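A minimal sketch, assuming PyTorch, of this multi-head attention: each BEV pixel acts as the query, and the image features sampled at its reference points (for example via the earlier sampling sketch) act as keys and values. The head count and the layout of the sampled features are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ImageToBEVAttention(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, bev_feat, sampled_img_feats):
        """bev_feat: (B, C, Hb, Wb); sampled_img_feats: (B, Hb*Wb, K, C) image features
        gathered at the K reference points of every BEV pixel."""
        B, C, Hb, Wb = bev_feat.shape
        q = bev_feat.flatten(2).permute(0, 2, 1).reshape(B * Hb * Wb, 1, C)   # one query per BEV pixel
        kv = sampled_img_feats.reshape(B * Hb * Wb, -1, C)                    # its K reference features
        out, _ = self.attn(q, kv, kv)                                         # (B*Hb*Wb, 1, C)
        return out.reshape(B, Hb, Wb, C).permute(0, 3, 1, 2)                  # back to (B, C, Hb, Wb)
```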
Through the process, interaction and fusion among different modal characteristics are realized at the characteristic processing stage of the model, and the obtained enhanced image characteristic diagram and enhanced point cloud characteristic diagram provide richer and more comprehensive information and more accurate data basis for subsequent three-dimensional target detection.
It should be noted that the encoder may be implemented not only by a code manner, but also by an encoder chip or other hardware form, such as an FPGA, which implements the functions of the encoder through a logic circuit.
Step S206: and detecting the three-dimensional target in the three-dimensional environment based on the enhanced image characteristic diagram and the enhanced point cloud characteristic diagram.
Three-dimensional target detection could be carried out in a conventional manner based on the enhanced image feature map and the enhanced point cloud feature map. However, to obtain better detection accuracy, the embodiment of the application instead extracts features from the enhanced image feature map and the enhanced point cloud feature map alternately, in sequence, and detects three-dimensional targets in the three-dimensional environment based on the extracted features. Interacting alternately with the information expressed by the features of the different modalities allows that information to be used more effectively, so a more accurate detection result is obtained.
In one feasible manner, before features are extracted alternately from the enhanced image feature map and the enhanced point cloud feature map, three-dimensional target position prediction can first be performed on the point cloud feature map before feature interaction and on the enhanced point cloud feature map after feature interaction, yielding a corresponding first prediction result and second prediction result; a preset number of candidate prediction boxes, whose probability of belonging to some target class exceeds a preset probability, are then obtained from the first and second prediction results. The preset number and the preset probability can be set appropriately by those skilled in the art according to actual requirements, and the embodiment of the application places no limit on them. On this basis, alternately extracting features from the enhanced image feature map and the enhanced point cloud feature map can be realized as: alternately extracting features from the enhanced image feature map and the enhanced point cloud feature map based on the information of the candidate prediction boxes, and detecting three-dimensional targets in the three-dimensional environment based on the extracted features. Performing three-dimensional target detection from these candidate prediction boxes speeds up the convergence of the three-dimensional target detection model. The first and second prediction results are a first and a second prediction heatmap, respectively, whose k-th channel represents the probability that a target of the k-th class exists at the current position. The information of a candidate prediction box includes the point cloud features corresponding to the box and the target category vector corresponding to the box, which provides an effective initial reference for the subsequent detection and improves detection efficiency.
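A minimal sketch, assuming PyTorch, of selecting the candidate positions from the first and second prediction heatmaps described above; the preset number m and the preset probability threshold are illustrative values, not the patent's.

```python
import torch

def select_candidates(heatmap_pre, heatmap_post, m=200, threshold=0.1):
    """heatmaps: (B, K, H, W) class-probability maps predicted from the point cloud feature
    map before and after feature interaction. Returns, per sample, the flat BEV positions,
    class ids and scores of the m strongest candidates, plus a mask of those above threshold."""
    heat = heatmap_pre + heatmap_post                       # combine the two predictions
    B, K, H, W = heat.shape
    scores, idx = heat.view(B, -1).topk(m, dim=1)           # strongest m class/position pairs
    cls = torch.div(idx, H * W, rounding_mode="floor")      # which class channel fired
    pos = idx % (H * W)                                     # which BEV cell (flat index)
    keep = scores > threshold                               # drop candidates below the preset probability
    return pos, cls, scores, keep
```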
Based on the point cloud candidate prediction boxes, in the subsequent detection process: when features are extracted from the enhanced image feature map, the point cloud candidate prediction boxes corresponding to the preceding enhanced point cloud feature map are obtained and projected onto the enhanced image feature map to obtain the corresponding image candidate prediction boxes; when features are extracted from the enhanced point cloud feature map, the image candidate prediction boxes corresponding to the preceding enhanced image feature map are obtained and converted onto the enhanced point cloud feature map to obtain the corresponding point cloud candidate prediction boxes.
When a point cloud candidate prediction box is projected onto the enhanced image feature map to obtain the corresponding image candidate prediction box, the following steps may be adopted: the point cloud candidate prediction box is enlarged by a preset multiple and projected onto the enhanced image feature map; the region of the enhanced image feature map corresponding to the projection result is obtained; and features are extracted from that region, and the corresponding image candidate prediction box is obtained from the feature extraction result. In this way, the prediction deviation that projection might cause is avoided and a more accurate image candidate prediction box is obtained.
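A minimal sketch, assuming PyTorch and torchvision, of turning a scaled-up 3D candidate box into an image-plane region and extracting a fixed-size feature from the enhanced image feature map. The corner ordering, the scale factor, the feature-map stride and the use of torchvision's roi_align are illustrative assumptions.

```python
import torch
from torchvision.ops import roi_align

def box3d_to_image_roi(img_feat, corners_3d, proj_mat, scale=2.0, out_size=7, spatial_scale=1.0 / 8):
    """img_feat: (1, C, H, W) image feature map; corners_3d: (8, 3) box corners in the lidar
    frame; proj_mat: (3, 4) camera projection. Returns a (1, C, out_size, out_size) RoI feature."""
    center = corners_3d.mean(dim=0, keepdim=True)
    corners = (corners_3d - center) * scale + center                    # enlarge the box
    homo = torch.cat([corners, corners.new_ones(8, 1)], dim=1)          # (8, 4)
    uvw = homo @ proj_mat.t()                                           # project to the image plane
    uv = uvw[:, :2] / uvw[:, 2:3].clamp(min=1e-5)
    x1, y1 = uv[:, 0].min(), uv[:, 1].min()                             # bounding rectangle of the
    x2, y2 = uv[:, 0].max(), uv[:, 1].max()                             # projected corners
    rois = torch.stack([torch.zeros_like(x1), x1, y1, x2, y2]).view(1, 5)
    return roi_align(img_feat, rois, output_size=out_size, spatial_scale=spatial_scale)
```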
Furthermore, to further improve detection efficiency and exclude data that is invalid for detection, in one feasible manner, before features are extracted from an enhanced image feature map, multi-head self-attention is computed over the features of the point cloud candidate prediction boxes corresponding to the preceding enhanced point cloud feature map, so as to obtain the relative positional relationships between the candidate targets of those boxes and/or to eliminate duplicate point cloud candidate prediction boxes; or, before features are extracted from the enhanced point cloud feature map, multi-head self-attention is computed over the features of the image candidate prediction boxes corresponding to the preceding enhanced image feature map, so as to obtain the relative positional relationships between the candidate targets of those boxes and/or to eliminate duplicate image candidate prediction boxes.
When the three-dimensional object detection model as shown in fig. 1B is employed, the above process may be implemented by a decoder of the three-dimensional object detection model.
In one example, the decoder may include a plurality of decoding layers, with adjacent decoding layers decoding different types of feature maps, the different types including the enhanced point cloud feature map and the enhanced image feature map. The enhanced point cloud feature map and the enhanced image feature map are the feature maps obtained in the manner described above.
Illustratively, the decoder is configured as shown in fig. 2E, with the decoding layers denoted θ^(0), θ^(1), ..., θ^(t). The first layer of the decoder, θ^(0), uses the structure of a decoding layer in a standard Transformer decoder. To focus better on the local regions where targets may exist, the cross-attention computation in this decoding layer uses as K and V each pixel of the point cloud feature map before the feature interaction performed by the aforementioned encoder, i.e., before feature enhancement, such as the bird's-eye-view BEV feature map h_p.
To speed up convergence, the embodiments of the present application also use an input-dependent query initialization strategy. Specifically, the point cloud feature maps before and after enhancement, h_p and h'_p, are used to predict two heatmaps, where the k-th channel at each position represents the probability that the center of a k-th-class object lies at that position. The two heatmaps are added, and the m largest positions are taken as the coordinates of the initial queries Q. Each query is encoded as the sum of the h_p feature at that position and the class vector, shown schematically in the figure as q_init, where m is an integer greater than or equal to 2.
The subsequent 2×N decoding layers of the decoder can be seen as N concatenated elementary units, each of which contains two decoding layers for query (Q)-feature dynamic interaction (the θ^(i) in fig. 2E); the decoding layers in this example are therefore also referred to as query-feature dynamic interaction layers. They extract features in turn from h'_p and h'_c to improve the expression of the Q vectors, and each layer i generates a prediction box b_i.
The structure of each query-feature dynamic interaction layer (i.e., θ^(i)) is shown in fig. 2F, where h' is the input feature-enhanced feature map: h' is h'_c if the layer is used for decoding the image feature map, and h' is h'_p if the layer is used for decoding the point cloud feature map. b_{i-1} denotes the prediction boxes output by the previous layer, and q_{i-1} the query (Q) vectors output by the previous layer.
Based on this, the specific feature interaction process of each query-feature dynamic interaction layer is as follows:
the first step is as follows: firstly, before each inquiry-characteristic dynamic interaction layer carries out characteristic interaction, a group of inquiry vectors, namely q, outputted by the previous layer are firstly carried out i-1 Performs a multi-headed self-attention calculation, illustrated in FIG. 2F as MHSA, to infer the prediction blocks b i-1 And eliminating the repeated prediction frame according to the relative position relation between the corresponding targets. Output of MHSA and q i-1 Added and normalized using the LayerNorm method (schematically shown as Add after MHSA)&Norm) to obtain an updated query vector q i-1
Second step: for the updated query vector q_{i-1}, the prediction boxes b_{i-1} of the three-dimensional targets decoded by the previous layer are projected onto the feature map h' of this layer's modality to obtain two-dimensional prediction boxes on the feature map, and RoIAlign (a method that maps a generated prediction box onto the feature map at a fixed size) is used to extract the 7×7 RoI (region of interest) feature R_i from the feature map. Specifically, the bounding rectangle of the two-dimensional convex polygon obtained by projecting the three-dimensional prediction box is used as the RoI. In a scene such as autonomous driving, targets generally have a small scale on the point cloud feature map, e.g., the BEV feature map, so when the point cloud features are processed the three-dimensional prediction box is enlarged by a factor of two before being projected.
Third step: the q_{i-1} updated in the first step and the R_i obtained in the second step are processed with DynConv (dynamic convolution). Specifically, the updated q_{i-1} is mapped to two sets of 1×1 convolution kernels, which are applied successively to the RoI feature R_i. The convolved RoI feature R_i is flattened and reduced in dimension to obtain an output with the same shape as q_{i-1}. This output is then added to the q_{i-1} updated in the first step and normalized with LayerNorm (shown as the Add&Norm after DynConv in the figure) to form the query vector q_{i-1} fused with the RoI feature.
Fourth step: a two-layer feed-forward neural network is used to update the query vector q_{i-1} fused with the RoI feature. The updated q_{i-1} is added to the input of the feed-forward neural network (denoted FFN in fig. 2F) and normalized with LayerNorm (the Add&Norm shown at the bottom right of the figure) to obtain the q_i output by this query-feature dynamic interaction layer.
Fifth step: at the end of each query-feature dynamic interaction layer, a two-layer feed-forward network (shown in fig. 2F as CLS&REG) independently decodes the output q_i to obtain the refined prediction boxes b_i of this layer.
The prediction boxes decoded by the feed-forward network after the last query-feature dynamic interaction layer are output as the final detection boxes of the three-dimensional targets, i.e., the three-dimensional target detection result.
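The sketch below, assuming PyTorch, puts the five steps together in one query-feature dynamic interaction layer. RoI feature extraction is performed outside the layer (e.g., via the projection sketch earlier), the dynamic convolution is realized as two batched matrix multiplications (equivalent to two successive 1×1 convolutions), and all dimensions, head counts and head structures are illustrative assumptions.

```python
import torch
import torch.nn as nn

class QueryFeatureInteractionLayer(nn.Module):
    def __init__(self, dim=256, heads=8, roi=7, hidden=64, num_classes=10, box_dim=9):
        super().__init__()
        self.mhsa = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1, self.norm2, self.norm3 = nn.LayerNorm(dim), nn.LayerNorm(dim), nn.LayerNorm(dim)
        # the query is mapped to two sets of 1x1 kernels applied successively to the RoI
        self.kernel_gen = nn.Linear(dim, dim * hidden + hidden * dim)
        self.flatten_proj = nn.Linear(roi * roi * dim, dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))
        self.cls_head = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, num_classes))
        self.reg_head = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, box_dim))
        self.dim, self.hidden = dim, hidden

    def forward(self, q, roi_feats):
        """q: (B, M, dim) queries from the previous layer; roi_feats: (B, M, dim, roi, roi)
        features cut out of this layer's (image or BEV) feature map by the previous boxes."""
        B, M, D = q.shape
        # step 1: self-attention between queries, relating targets / suppressing duplicates
        q = self.norm1(q + self.mhsa(q, q, q)[0])
        # steps 2-3: dynamic convolution of the RoI features with query-generated kernels
        k = self.kernel_gen(q)
        k1 = k[..., :D * self.hidden].view(B, M, D, self.hidden)
        k2 = k[..., D * self.hidden:].view(B, M, self.hidden, D)
        r = roi_feats.flatten(3).transpose(2, 3)            # (B, M, roi*roi, D)
        r = torch.relu(r @ k1) @ k2                          # two successive 1x1 "convolutions"
        r = self.flatten_proj(r.flatten(2))                  # flatten and reduce to query shape
        q = self.norm2(q + r)
        # step 4: feed-forward update of the fused query
        q = self.norm3(q + self.ffn(q))
        # step 5: decode this layer's refined class scores and boxes
        return q, self.cls_head(q), self.reg_head(q)
```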
It should be noted that the decoder may be implemented not only by a code manner, but also as a decoder chip or other hardware form, such as an FPGA, which implements the functions of the decoder through a logic circuit. Moreover, the above encoder-decoder architecture is also similar, and can be implemented not only by means of codes, but also in the form of encoder chip-decoder chip or other hardware forms such as FPGA.
Through the encoder-decoder framework, the scheme of the embodiment of the application can perform multi-modal feature fusion, and can perform accurate detection on the three-dimensional target by using the fused features.
Furthermore, during training, for each decoding layer of the decoder, the Hungarian loss computed between the ground-truth box set and the box set predicted by the model can be optimized. In addition, the heatmap prediction loss (e.g., using a Gaussian focal loss) can also be optimized to supervise query initialization. The final loss is the sum of the Hungarian loss and the heatmap loss. The whole three-dimensional target detection model can be optimized with an Adam optimizer using a one-cycle learning-rate schedule.
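A minimal sketch, assuming PyTorch and SciPy, of the training objective described above: a Hungarian-matched set-prediction loss per decoding layer plus a Gaussian-focal heatmap loss that supervises query initialization. The matching cost, loss terms and weights are illustrative assumptions, not the patent's exact choices.

```python
import torch
import torch.nn.functional as F
from scipy.optimize import linear_sum_assignment

def hungarian_loss(pred_boxes, pred_logits, gt_boxes, gt_labels):
    """pred_boxes: (M, box_dim); pred_logits: (M, K); gt_boxes: (N, box_dim); gt_labels: (N,)."""
    cls_cost = -pred_logits.softmax(-1)[:, gt_labels]                  # (M, N) classification cost
    box_cost = torch.cdist(pred_boxes, gt_boxes, p=1)                  # (M, N) L1 box cost
    cost = (cls_cost + box_cost).detach().cpu().numpy()
    row, col = linear_sum_assignment(cost)                             # optimal one-to-one matching
    row, col = torch.as_tensor(row), torch.as_tensor(col)
    cls_loss = F.cross_entropy(pred_logits[row], gt_labels[col])
    box_loss = F.l1_loss(pred_boxes[row], gt_boxes[col])
    return cls_loss + box_loss

def gaussian_focal_heatmap_loss(pred, target, alpha=2.0, beta=4.0, eps=1e-6):
    """Focal-style loss between a predicted and a Gaussian-smoothed target heatmap, both (K, H, W)."""
    pred = pred.sigmoid().clamp(eps, 1 - eps)
    pos = target.eq(1.0)                                               # object-center positions
    pos_loss = -(((1 - pred) ** alpha) * pred.log())[pos].sum()
    neg_loss = -(((1 - target) ** beta) * (pred ** alpha) * (1 - pred).log())[~pos].sum()
    return (pos_loss + neg_loss) / pos.sum().clamp(min=1)
```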
In the following, the above process is exemplarily described by taking a specific scenario as an example, as shown in fig. 2G.
In fig. 2G, taking an automatic driving scene as an example, it is assumed that a laser radar and a camera are installed in an automatic driving vehicle, the laser radar acquires a three-dimensional point cloud of a three-dimensional environment where the automatic driving vehicle is located, and the camera acquires an image of the environment where the automatic driving vehicle is located.
The collected three-dimensional point cloud and images are input into the three-dimensional target detection model shown in fig. 1B. The feature extractor for the three-dimensional point cloud extracts point cloud features and generates a point cloud feature map, and the feature extractor for the image extracts image features and generates an image feature map. The point cloud feature map and the image feature map are then fed into the encoder, which performs feature interaction to obtain the enhanced image feature map and the enhanced point cloud feature map. The enhanced image feature map and the enhanced point cloud feature map output by the encoder are then fed into the decoder, which decodes them alternately and finally outputs the detection result for each three-dimensional target, i.e., the detection box of each three-dimensional target together with the corresponding target information, such as its category, position, size and orientation. A driving plan can further be made for the autonomous vehicle based on the three-dimensional target detection results.
Through the above embodiment, when a three-dimensional target is detected, the detection is based on two modalities of the three-dimensional environment to be detected, namely the image and the three-dimensional point cloud. During detection, feature interaction is performed between the image feature map corresponding to the image and the point cloud feature map corresponding to the three-dimensional point cloud, so that the information of each modality is effectively fused into the other and the two modalities complement each other; this achieves feature enhancement through the fused modal information and yields the corresponding enhanced image feature map and enhanced point cloud feature map. Carrying out three-dimensional target detection on the basis of the enhanced image feature map and the enhanced point cloud feature map relies on information that is more comprehensive, richer and more distinctive, so the detection is more accurate and more efficient, and a high-quality three-dimensional target detection result can be obtained.
Referring to fig. 3, a schematic structural diagram of an electronic device according to an embodiment of the present application is shown, and the specific embodiment of the present application does not limit a specific implementation of the electronic device.
As shown in fig. 3, the electronic device may include: a processor 302, a communication interface 304, a memory 306, and a communication bus 308.
Wherein:
the processor 302, communication interface 304, and memory 306 communicate with each other via a communication bus 308.
A communication interface 304 for communicating with other electronic devices or servers.
The processor 302 is configured to execute the program 310, and may specifically perform relevant steps in the above-described three-dimensional object detection method embodiment.
In particular, program 310 may include program code comprising computer operating instructions.
The processor 302 may be a CPU, an Application-Specific Integrated Circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present application. The intelligent device includes one or more processors, which may be of the same type, such as one or more CPUs, or of different types, such as one or more CPUs and one or more ASICs.
And a memory 306 for storing a program 310. Memory 306 may comprise high-speed RAM memory and may also include non-volatile memory (non-volatile memory), such as at least one disk memory.
The program 310 may be specifically configured to enable the processor 302 to execute operations corresponding to the three-dimensional object detection method described in the foregoing method embodiment.
For specific implementation of each step in the program 310, reference may be made to corresponding steps and corresponding descriptions in units in the foregoing method embodiments, and corresponding beneficial effects are provided, which are not described herein again. It can be clearly understood by those skilled in the art that, for convenience and simplicity of description, the specific working processes of the above-described devices and modules may refer to the corresponding process descriptions in the foregoing method embodiments, and are not described herein again.
The embodiment of the present application further provides a computer program product, which includes a computer instruction, where the computer instruction instructs a computing device to execute an operation corresponding to the three-dimensional object detection method in the foregoing method embodiment.
It should be noted that, according to implementation needs, each component/step described in the embodiment of the present application may be divided into more components/steps, and two or more components/steps or partial operations of the components/steps may also be combined into a new component/step to achieve the purpose of the embodiment of the present application.
The above-described methods according to the embodiments of the present application may be implemented in hardware or firmware, or as software or computer code that can be stored in a recording medium such as a CD-ROM, a RAM, a floppy disk, a hard disk, or a magneto-optical disk, or as computer code originally stored in a remote recording medium or a non-transitory machine-readable medium and downloaded over a network to be stored in a local recording medium, so that the methods described herein can be processed by such software stored on a recording medium using a general-purpose computer, a dedicated processor, or programmable or dedicated hardware such as an ASIC or an FPGA. It will be appreciated that the computer, processor, microprocessor controller, or programmable hardware includes memory components (e.g., RAM, ROM, flash memory, etc.) that can store or receive software or computer code which, when accessed and executed by the computer, processor, or hardware, implements the methods described herein. Further, when a general-purpose computer accesses code for implementing the methods illustrated herein, execution of the code transforms the general-purpose computer into a special-purpose computer for performing those methods.
Those of ordinary skill in the art will appreciate that the various illustrative elements and method steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the embodiments of the present application.
The above embodiments are only used for illustrating the embodiments of the present application, and not for limiting the embodiments of the present application, and those skilled in the relevant art can make various changes and modifications without departing from the spirit and scope of the embodiments of the present application, so that all equivalent technical solutions also belong to the scope of the embodiments of the present application, and the scope of patent protection of the embodiments of the present application should be defined by the claims.

Claims (14)

1. A three-dimensional object detection method, comprising:
acquiring an image and a three-dimensional point cloud of a three-dimensional environment to be detected, as well as an image feature map corresponding to the image and a point cloud feature map corresponding to the three-dimensional point cloud;
carrying out feature interaction between the image feature map and the point cloud feature map, and obtaining an enhanced image feature map fused with point cloud features and an enhanced point cloud feature map fused with image features according to a feature interaction result;
and detecting a three-dimensional target in the three-dimensional environment based on the enhanced image feature map and the enhanced point cloud feature map.
2. The method according to claim 1, wherein the performing feature interaction between the image feature map and the point cloud feature map, and obtaining, according to a result of the feature interaction, an enhanced image feature map fused with point cloud features and an enhanced point cloud feature map fused with image features comprises:
for the image feature map, performing first feature interaction from image features to image features and second feature interaction from point cloud features to image features, and obtaining an enhanced image feature map fused with point cloud features according to a result of the first feature interaction and a result of the second feature interaction;
and,
for the point cloud feature map, performing third feature interaction from point cloud features to point cloud features and fourth feature interaction from image features to point cloud features, and obtaining an enhanced point cloud feature map fused with image features according to a result of the third feature interaction and a result of the fourth feature interaction.
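For illustration only, the four interactions of claim 2 can be sketched with standard multi-head attention, where the query modality determines which feature map is being enhanced; the tensor shapes and layer names below are assumptions for the sketch, not the disclosed implementation.

import torch
import torch.nn as nn

C = 256
img = torch.randn(1, 4000, C)   # flattened image feature map (tokens x channels)
pc  = torch.randn(1, 2500, C)   # flattened BEV point cloud feature map

img_self  = nn.MultiheadAttention(C, 8, batch_first=True)  # image -> image (first)
pc_to_img = nn.MultiheadAttention(C, 8, batch_first=True)  # point cloud -> image (second)
pc_self   = nn.MultiheadAttention(C, 8, batch_first=True)  # point cloud -> point cloud (third)
img_to_pc = nn.MultiheadAttention(C, 8, batch_first=True)  # image -> point cloud (fourth)

a, _ = img_self(img, img, img)
b, _ = pc_to_img(img, pc, pc)
enhanced_img = img + a + b      # fusion of the first and second interaction results

c, _ = pc_self(pc, pc, pc)
d, _ = img_to_pc(pc, img, img)
enhanced_pc = pc + c + d        # fusion of the third and fourth interaction results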
3. The method of claim 2, wherein the second feature interaction from point cloud features to image features comprises:
and performing projection transformation on the point cloud characteristics of the positions corresponding to the position points in the image characteristic diagram, and converting the point cloud characteristics into image characteristics.
4. The method of claim 3, wherein the projective transformation of the point cloud features at locations corresponding to location points in the image feature map into image features comprises:
projecting and transforming the three-dimensional point cloud into a bird's-eye view with depth information;
acquiring, according to the positions of the pixel points in the bird's-eye view, the point cloud features of the pixel points at corresponding positions in the point cloud feature map;
and performing projection transformation on the acquired point cloud features, and converting the point cloud features into image features.
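For illustration only, the projection transformation of claim 4 can be sketched as follows: each lidar point is mapped to a bird's-eye-view grid cell to gather its point cloud feature, and the same point is projected into the image plane with a calibration matrix so that the gathered feature can be fused at the corresponding image position. The helper name, grid convention, and lidar2img matrix are assumptions for the sketch.

import torch

def pc_features_for_image(points, bev_feat, lidar2img, bev_range):
    # points: (N, 3) lidar xyz; bev_feat: (C, Hb, Wb) point cloud feature map;
    # lidar2img: (4, 4) assumed known projection matrix from calibration.
    x_min, y_min, x_max, y_max = bev_range
    C, Hb, Wb = bev_feat.shape
    # Bird's-eye-view grid index of every point.
    gx = ((points[:, 0] - x_min) / (x_max - x_min) * Wb).long().clamp(0, Wb - 1)
    gy = ((points[:, 1] - y_min) / (y_max - y_min) * Hb).long().clamp(0, Hb - 1)
    pc_feats = bev_feat[:, gy, gx].t()              # (N, C) gathered point cloud features
    # Project the same 3D points into the image to get pixel coordinates.
    homo = torch.cat([points, torch.ones(len(points), 1)], dim=1)
    cam = (lidar2img @ homo.t()).t()
    uv = cam[:, :2] / cam[:, 2:3].clamp(min=1e-5)   # (N, 2) image-plane positions
    return uv, pc_feats                             # fuse pc_feats at positions uv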
5. The method of claim 2, wherein the fourth feature interaction from image features to point cloud features comprises:
for a pixel point in the point cloud feature map, acquiring a plurality of reference pixel points of the pixel point in the point cloud feature map;
acquiring, based on the positions of the pixel point and the plurality of reference pixel points, a plurality of image features at corresponding positions from the image feature map;
and fusing the point cloud feature corresponding to the pixel point with the acquired plurality of image features.
6. The method of claim 5, wherein obtaining a plurality of image features corresponding to the positions from the image feature map based on the positions of the pixel and the plurality of reference pixels comprises:
performing projection transformation based on the positions of the pixel point and the plurality of reference pixel points to obtain corresponding two-dimensional coordinates;
and performing feature acquisition of corresponding positions on the image feature map based on the two-dimensional coordinates to obtain a plurality of corresponding image features.
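For illustration only, the sampling in claims 5 and 6 can be sketched with bilinear sampling: the 3D positions of a pixel point and its reference pixel points are projected to two-dimensional image coordinates, and image features are read from the image feature map at those coordinates. The function name and argument layout are assumptions for the sketch.

import torch
import torch.nn.functional as F

def sample_image_features(img_feat, ref_points_3d, lidar2img, img_hw):
    # img_feat: (1, C, H, W) image feature map; ref_points_3d: (K, 3) positions of
    # the pixel point and its reference pixel points; lidar2img: (4, 4); img_hw: (H, W).
    K = ref_points_3d.shape[0]
    homo = torch.cat([ref_points_3d, torch.ones(K, 1)], dim=1)
    cam = (lidar2img @ homo.t()).t()
    uv = cam[:, :2] / cam[:, 2:3].clamp(min=1e-5)           # (K, 2) two-dimensional coordinates
    H_img, W_img = img_hw
    grid = torch.stack([uv[:, 0] / (W_img - 1) * 2 - 1,     # normalize to [-1, 1] for grid_sample
                        uv[:, 1] / (H_img - 1) * 2 - 1], dim=-1).view(1, 1, K, 2)
    feats = F.grid_sample(img_feat, grid, align_corners=True)   # (1, C, 1, K)
    return feats.squeeze(2).squeeze(0).t()                  # (K, C) sampled image features

The sampled features would then be fused with the point cloud feature of the pixel point, for example by summation or by concatenation followed by a linear layer.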
7. The method of claim 1, wherein the performing three-dimensional target detection in the three-dimensional environment based on the enhanced image feature map and the enhanced point cloud feature map comprises:
and sequentially and alternately extracting features of the enhanced image feature map and the enhanced point cloud feature map, and detecting the three-dimensional target in the three-dimensional environment based on the extracted features.
8. The method of claim 7, wherein,
before the sequentially and alternately performing feature extraction on the enhanced image feature map and the enhanced point cloud feature map, the method further comprises: performing three-dimensional target position prediction based on the point cloud feature map and the enhanced point cloud feature map, respectively, to obtain a corresponding first prediction result and a corresponding second prediction result; and obtaining a preset number of candidate prediction boxes according to the first prediction result and the second prediction result, wherein the probability that each candidate prediction box belongs to a certain class of target is greater than a preset probability;
the sequentially and alternately performing feature extraction on the enhanced image feature map and the enhanced point cloud feature map, and detecting the three-dimensional target in the three-dimensional environment based on the extracted features comprises: sequentially and alternately performing feature extraction on the enhanced image feature map and the enhanced point cloud feature map based on information of the candidate prediction boxes, and detecting the three-dimensional target in the three-dimensional environment based on the extracted features.
9. The method of claim 8, wherein the first prediction result and the second prediction result are a first prediction heat map and a second prediction heat map, respectively; a kth channel in the first prediction heat map and the second prediction heat map represents a probability that a kth class of targets exists at a current location;
the information of the candidate prediction box comprises point cloud features corresponding to the candidate prediction box and information of a target category vector corresponding to the candidate prediction box.
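For illustration only, the candidate selection of claims 8 and 9 can be sketched as a top-k search over the two prediction heat maps, keeping a preset number of locations whose class probability exceeds a preset threshold; the tensor layout and default values are assumptions for the sketch.

import torch

def select_candidates(heatmap_a, heatmap_b, num_candidates=200, min_prob=0.1):
    # heatmap_a / heatmap_b: (K, Hb, Wb) first and second prediction heat maps;
    # channel k holds the probability of a class-k target at each location.
    heat = torch.maximum(heatmap_a, heatmap_b)      # keep the stronger response
    K, Hb, Wb = heat.shape
    scores, idx = heat.flatten().topk(num_candidates)
    keep = scores > min_prob                        # preset probability threshold
    idx, scores = idx[keep], scores[keep]
    cls = idx // (Hb * Wb)                          # target class of each candidate
    cell = idx % (Hb * Wb)
    ys, xs = cell // Wb, cell % Wb                  # location of each candidate box
    return cls, ys, xs, scores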
10. The method of claim 8 or 9, wherein the sequentially and alternately performing feature extraction on the enhanced image feature map and the enhanced point cloud feature map, and detecting the three-dimensional target in the three-dimensional environment based on the extracted features comprises:
when performing feature extraction on the enhanced image feature map, acquiring a point cloud candidate prediction box corresponding to the enhanced point cloud feature map preceding the enhanced image feature map, and projecting the point cloud candidate prediction box onto the enhanced image feature map to obtain a corresponding image candidate prediction box;
when performing feature extraction on the enhanced point cloud feature map, acquiring an image candidate prediction box corresponding to the enhanced image feature map preceding the enhanced point cloud feature map, and converting the image candidate prediction box to the enhanced point cloud feature map to obtain a corresponding point cloud candidate prediction box.
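For illustration only, the alternation of claim 10 can be sketched as a loop over decoding layers that take turns on the enhanced image and enhanced point cloud feature maps, converting the candidate prediction boxes between the two coordinate systems in between; the helper callables are assumptions for the sketch.

def alternate_decode(layers, enhanced_img, enhanced_pc, initial_pc_boxes,
                     project_to_image, convert_to_bev):
    boxes = initial_pc_boxes                        # start from point cloud candidate boxes
    for i, layer in enumerate(layers):
        if i % 2 == 0:                              # layer working on the enhanced image map
            boxes = layer(enhanced_img, project_to_image(boxes))
        else:                                       # layer working on the enhanced point cloud map
            boxes = layer(enhanced_pc, convert_to_bev(boxes))
    return boxes                                    # refined candidate boxes / detections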
11. The method of claim 10, wherein,
before feature extraction is performed on an enhanced image feature map, performing multi-head self-attention calculation on features corresponding to the point cloud candidate prediction boxes corresponding to the preceding enhanced point cloud feature map, so as to obtain relative position relations between candidate targets corresponding to the point cloud candidate prediction boxes and/or to eliminate repeated point cloud candidate prediction boxes;
or,
before feature extraction is performed on an enhanced point cloud feature map, performing multi-head self-attention calculation on features corresponding to the image candidate prediction boxes corresponding to the preceding enhanced image feature map, so as to obtain relative position relations between candidate targets corresponding to the image candidate prediction boxes and/or to eliminate repeated image candidate prediction boxes.
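For illustration only, the multi-head self-attention of claim 11 can be sketched over the per-candidate features, so that candidate boxes exchange relative-position information and duplicated candidates can be down-weighted; the shapes below are assumptions for the sketch.

import torch
import torch.nn as nn

num_boxes, C = 200, 256
box_feats = torch.randn(1, num_boxes, C)            # one feature vector per candidate box

self_attn = nn.MultiheadAttention(C, num_heads=8, batch_first=True)
refined, attn_weights = self_attn(box_feats, box_feats, box_feats)
box_feats = box_feats + refined                      # updated features before the next decoding layer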
12. The method of claim 10, wherein said projecting the point cloud candidate prediction box onto the enhanced image feature map to obtain a corresponding image candidate prediction box comprises:
enlarging the point cloud candidate prediction box by a preset multiple and projecting the enlarged point cloud candidate prediction box onto the enhanced image feature map;
acquiring a region corresponding to the projection result on the enhanced image feature map;
and performing feature extraction on the region, and obtaining a corresponding image candidate prediction box according to a result of the feature extraction.
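For illustration only, claim 12 can be sketched as enlarging the 3D candidate box, projecting its corners into the image, and pooling features from the resulting region of the enhanced image feature map; the corner-based projection and the use of roi_align are assumptions for the sketch.

import torch
import torchvision

def project_box_to_image(corners_3d, lidar2img, enhanced_img, scale=1.2):
    # corners_3d: (8, 3) corners of the point cloud candidate box; lidar2img: (4, 4);
    # enhanced_img: (1, C, H, W) enhanced image feature map; scale: preset multiple.
    center = corners_3d.mean(dim=0, keepdim=True)
    corners = center + (corners_3d - center) * scale        # enlarge by the preset multiple
    homo = torch.cat([corners, torch.ones(8, 1)], dim=1)
    cam = (lidar2img @ homo.t()).t()
    uv = cam[:, :2] / cam[:, 2:3].clamp(min=1e-5)            # (8, 2) projected corners
    x1, y1 = uv.min(dim=0).values                            # axis-aligned region on the image map
    x2, y2 = uv.max(dim=0).values
    rois = torch.stack([torch.zeros(()), x1, y1, x2, y2]).unsqueeze(0)
    region = torchvision.ops.roi_align(enhanced_img, rois, output_size=7)
    return region                                            # (1, C, 7, 7) region features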
13. An encoder, wherein the encoder comprises a plurality of encoding layers;
each encoding layer includes:
a feature input part, configured to receive the image feature map and the point cloud feature map output by a preceding layer;
a feature interaction part, configured to perform, for the image feature map, first feature interaction from image features to image features and second feature interaction from point cloud features to image features, and to perform, for the point cloud feature map, third feature interaction from point cloud features to point cloud features and fourth feature interaction from image features to point cloud features;
and a feature fusion part, configured to perform feature fusion based on a result of the first feature interaction and a result of the second feature interaction to obtain an enhanced image feature map, and to perform feature fusion based on a result of the third feature interaction and a result of the fourth feature interaction to obtain an enhanced point cloud feature map.
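For illustration only, the stacking of encoding layers in claim 13 can be sketched as follows, assuming an interaction layer (such as the one sketched after claim 2) that performs the feature input, feature interaction, and feature fusion steps; the class name and layer_factory argument are assumptions for the sketch.

import torch.nn as nn

class CrossModalEncoder(nn.Module):
    def __init__(self, layer_factory, num_layers=3):
        super().__init__()
        self.layers = nn.ModuleList(layer_factory() for _ in range(num_layers))

    def forward(self, img_feat, pc_feat):
        for layer in self.layers:                    # each layer outputs enhanced maps
            img_feat, pc_feat = layer(img_feat, pc_feat)
        return img_feat, pc_feat                     # feature maps passed to the next layer / decoder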
14. A decoder, wherein the decoder comprises a plurality of decoding layers;
adjacent decoding layers are used for decoding feature maps of different types, wherein the feature maps of different types comprise an enhanced point cloud feature map and an enhanced image feature map;
the enhanced point cloud feature map and the enhanced image feature map are obtained by: performing feature interaction between a point cloud feature map and an image feature map corresponding to a three-dimensional environment to be detected, and obtaining, according to a result of the feature interaction, a corresponding enhanced point cloud feature map fused with image features and a corresponding enhanced image feature map fused with point cloud features; wherein the point cloud feature map and the image feature map are obtained by performing feature extraction on the three-dimensional point cloud and the image corresponding to the three-dimensional environment, respectively.
CN202210810402.2A 2022-07-11 2022-07-11 Three-dimensional target detection method, encoder and decoder Pending CN115063768A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210810402.2A CN115063768A (en) 2022-07-11 2022-07-11 Three-dimensional target detection method, encoder and decoder

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210810402.2A CN115063768A (en) 2022-07-11 2022-07-11 Three-dimensional target detection method, encoder and decoder

Publications (1)

Publication Number Publication Date
CN115063768A true CN115063768A (en) 2022-09-16

Family

ID=83206996

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210810402.2A Pending CN115063768A (en) 2022-07-11 2022-07-11 Three-dimensional target detection method, encoder and decoder

Country Status (1)

Country Link
CN (1) CN115063768A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115861601A (en) * 2022-12-20 2023-03-28 清华大学 Multi-sensor fusion sensing method and device
CN115861601B (en) * 2022-12-20 2023-12-29 清华大学 Multi-sensor fusion sensing method and device
CN116012805A (en) * 2023-03-24 2023-04-25 深圳佑驾创新科技有限公司 Object perception method, apparatus, computer device, storage medium, and program product
CN116012805B (en) * 2023-03-24 2023-08-29 深圳佑驾创新科技有限公司 Target perception method, device, computer equipment and storage medium
CN117058646A (en) * 2023-10-11 2023-11-14 南京工业大学 Complex road target detection method based on multi-mode fusion aerial view
CN117058646B (en) * 2023-10-11 2024-02-27 南京工业大学 Complex road target detection method based on multi-mode fusion aerial view

Similar Documents

Publication Publication Date Title
Alonso et al. 3d-mininet: Learning a 2d representation from point clouds for fast and efficient 3d lidar semantic segmentation
Truong et al. Learning accurate dense correspondences and when to trust them
Li et al. DeepI2P: Image-to-point cloud registration via deep classification
US11232286B2 (en) Method and apparatus for generating face rotation image
CN115063768A (en) Three-dimensional target detection method, encoder and decoder
Yin et al. Scale recovery for monocular visual odometry using depth estimated with deep convolutional neural fields
US20210183083A1 (en) Self-supervised depth estimation method and system
CN113673425B (en) Multi-view target detection method and system based on Transformer
CN110070564B (en) Feature point matching method, device, equipment and storage medium
CN111079545A (en) Three-dimensional target detection method and system based on image restoration
WO2022178952A1 (en) Target pose estimation method and system based on attention mechanism and hough voting
Sun et al. Efficient spatial-temporal information fusion for lidar-based 3d moving object segmentation
CN111127522B (en) Depth optical flow prediction method, device, equipment and medium based on monocular camera
CN106845338B (en) Pedestrian detection method and system in video stream
KR20230070253A (en) Efficient 3D object detection from point clouds
CN113312973B (en) Gesture recognition key point feature extraction method and system
Ruf et al. Real-time on-board obstacle avoidance for UAVs based on embedded stereo vision
WO2022052782A1 (en) Image processing method and related device
Wang et al. MCF3D: Multi-stage complementary fusion for multi-sensor 3D object detection
Zhao et al. Palletizing robot positioning bolt detection based on improved YOLO-V3
WO2024083006A1 (en) Three-dimensional imaging method and apparatus, device, and storage medium
CN114663686A (en) Object feature point matching method and device, and training method and device
Yao et al. Depthssc: Depth-spatial alignment and dynamic voxel resolution for monocular 3d semantic scene completion
CN116105721B (en) Loop optimization method, device and equipment for map construction and storage medium
KR102333768B1 (en) Hand recognition augmented reality-intraction apparatus and method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination