CN112733672B - Three-dimensional target detection method and device based on monocular camera, and computer equipment


Info

Publication number
CN112733672B
CN112733672B
Authority
CN
China
Prior art keywords
target object
image
feature map
dimensional
dimensional frame
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011631597.1A
Other languages
Chinese (zh)
Other versions
CN112733672A (en)
Inventor
刘明
廖毅雄
马福龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Yiqing Innovation Technology Co ltd
Original Assignee
Shenzhen Yiqing Innovation Technology Co ltd
Application filed by Shenzhen Yiqing Innovation Technology Co ltd
Priority to CN202011631597.1A
Publication of CN112733672A
Application granted
Publication of CN112733672B

Links

Classifications

    • G06V 20/58: Recognition of moving objects or obstacles, e.g. vehicles or pedestrians; recognition of traffic objects, e.g. traffic signs, traffic lights or roads
    • G06T 5/00: Image enhancement or restoration
    • G06T 5/50: Image enhancement or restoration using two or more images, e.g. averaging or subtraction
    • G06T 7/80: Analysis of captured images to determine intrinsic or extrinsic camera parameters, i.e. camera calibration
    • G06V 10/40: Extraction of image or video features
    • G06T 2207/20221: Image fusion; image merging
    • G06V 2201/07: Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The present application relates to a three-dimensional target detection method, apparatus, computer device, and storage medium based on a monocular camera. The method comprises the following steps: acquiring an image of an autonomous driving scene captured by a monocular camera; inputting the image into a split-attention residual network in a trained target object detection model, upsampling the image using deformable convolution, and fusing each feature map obtained after upsampling with the feature map of the preceding lower layer; performing feature enhancement on the fused feature map; regressing, based on the enhanced feature map, a three-dimensional frame for identifying a target object, an orientation of the target object, and an offset of a center point of the target object; and adjusting the position of the three-dimensional frame according to the offset to obtain a target detection result of the target object. By adopting this method, the accuracy of three-dimensional target detection can be improved.

Description

Three-dimensional target detection method and device based on monocular camera, and computer equipment
Technical Field
The present application relates to the field of computer technologies, and in particular to a three-dimensional target detection method and apparatus based on a monocular camera, and a computer device.
Background
With the development of computer technology, autonomous driving has become a research hotspot. In an autonomous driving scene, accurately detecting surrounding objects is essential. To save cost, production vehicles mainly use cameras to capture images of surrounding obstacles and detect surrounding objects from the captured images.
However, in conventional methods, features are extracted from the captured image and objects are detected from the output feature map. The receptive field of a feature map obtained by direct feature extraction is small, and the image captured by a camera is distorted relative to the real scene, so objects around the vehicle cannot be detected accurately from a feature map with such a small receptive field.
Disclosure of Invention
In view of the foregoing, it is desirable to provide a monocular camera-based three-dimensional target detection method, apparatus, computer device, and storage medium that can improve detection accuracy.
A method of three-dimensional object detection based on a monocular camera, the method comprising:
acquiring an image of an autonomous driving scene captured by a monocular camera;
inputting the image into a split-attention residual network in a trained target object detection model, upsampling the image using deformable convolution, and fusing each feature map obtained after upsampling with the feature map of the preceding lower layer;
performing feature enhancement on the fused feature map;
regressing, based on the enhanced feature map, a three-dimensional frame for identifying a target object, an orientation of the target object, and an offset of a center point of the target object;
and adjusting the position of the three-dimensional frame according to the offset to obtain a target detection result of the target object.
In one embodiment, the step of inputting the image into a split-attention residual network in a trained target object detection model and upsampling the image using deformable convolution further comprises:
inputting the image into a split-attention residual network in a trained target object detection model;
in the process of upsampling the image multiple times using deformable convolution, for each upsampling, deforming a convolution kernel according to the geometric shape of a target object in the image to obtain a convolution kernel adapted to the geometric shape;
and upsampling the image based on the convolution kernel adapted to the geometric shape to obtain a feature map whose receptive field matches the size of the target object.
In one embodiment, the step of upsampling the image using deformable convolution and fusing each feature map obtained after upsampling with the feature map of the preceding lower layer further comprises:
upsampling the image multiple times using deformable convolution, and fusing each feature map obtained after each upsampling with the feature map of the preceding lower layer;
the feature map of the preceding lower layer is the unfused feature map output by the sampling layer immediately preceding the current upsampling.
In one embodiment, the step of regressing, based on the enhanced feature map, a three-dimensional frame for identifying a target object, an orientation of the target object, and an offset of a center point of the target object further comprises:
adding prior boxes to the image based on the enhanced feature map, and regressing the length and width of the three-dimensional frame according to the prior boxes;
regressing the orientation of the target object, the offset of the center point of the target object, and the height of the three-dimensional frame according to the distance between the center point of the target object and the monocular camera;
and obtaining the three-dimensional frame for identifying the target object in the image according to the length, width, and height of the three-dimensional frame.
In one embodiment, the method further comprises:
labeling the sample image with label information according to point cloud coordinates; the point cloud coordinates are formed from the samples collected by a lidar;
converting the point cloud coordinates into coordinates in the camera coordinate system using the camera extrinsic parameters in the label information;
converting the coordinates in the camera coordinate system into the pixel coordinate system using the camera intrinsic parameters in the label information;
and training the target object detection model according to the sample image converted into the pixel coordinate system.
In one embodiment, the method further comprises:
in the process of using the trained target object detection model, using an inference optimizer in the target object detection model to quantize the 32-bit or 16-bit data to be calculated in the model into 8-bit integer data by minimizing the KL divergence;
and performing the calculation on the data quantized into 8-bit integer form according to the target object detection model.
A monocular camera-based three-dimensional object detection device, the device comprising:
the image acquisition module is used for acquiring an image of the autonomous driving scene captured by the monocular camera;
the feature extraction module is used for inputting the image into a split-attention residual network in a trained target object detection model, upsampling the image using deformable convolution, and fusing each feature map obtained after upsampling with the feature map of the preceding lower layer;
the enhancement module is used for performing feature enhancement on the fused feature map;
the regression module is used for regressing, based on the enhanced feature map, a three-dimensional frame for identifying a target object, an orientation of the target object, and an offset of a center point of the target object;
and the detection module is used for adjusting the position of the three-dimensional frame according to the offset to obtain a target detection result of the target object.
In one embodiment, the feature extraction module is further configured to input the image into a split-attention residual network in a trained target object detection model; in the process of upsampling the image multiple times using deformable convolution, deform, for each upsampling, a convolution kernel according to the geometric shape of a target object in the image to obtain a convolution kernel adapted to the geometric shape; and upsample the image based on the convolution kernel adapted to the geometric shape to obtain a feature map whose receptive field matches the size of the target object.
A computer device comprising a memory storing a computer program and a processor which when executing the computer program performs the steps of:
acquiring an image of an autonomous driving scene captured by a monocular camera;
inputting the image into a split-attention residual network in a trained target object detection model, upsampling the image using deformable convolution, and fusing each feature map obtained after upsampling with the feature map of the preceding lower layer;
performing feature enhancement on the fused feature map;
regressing, based on the enhanced feature map, a three-dimensional frame for identifying a target object, an orientation of the target object, and an offset of a center point of the target object;
and adjusting the position of the three-dimensional frame according to the offset to obtain a target detection result of the target object.
A computer readable storage medium having stored thereon a computer program which when executed by a processor performs the steps of:
acquiring an image of an autonomous driving scene captured by a monocular camera;
inputting the image into a split-attention residual network in a trained target object detection model, upsampling the image using deformable convolution, and fusing each feature map obtained after upsampling with the feature map of the preceding lower layer;
performing feature enhancement on the fused feature map;
regressing, based on the enhanced feature map, a three-dimensional frame for identifying a target object, an orientation of the target object, and an offset of a center point of the target object;
and adjusting the position of the three-dimensional frame according to the offset to obtain a target detection result of the target object.
According to the monocular camera-based three-dimensional target detection method, apparatus, computer device, and storage medium described above, an image of an autonomous driving scene captured by a monocular camera is acquired, which saves half the cost compared with a binocular camera. The acquired image is input into a trained target object detection model that uses a split-attention residual network as the backbone feature-extraction network; compared with a plain residual network, the added split-attention module allows the network to focus more on important features. In the split-attention residual network, the image is upsampled multiple times using deformable convolution, each feature map obtained after upsampling is fused with the feature map of the preceding lower layer, and the fused features are further enhanced, so the output feature map has a large receptive field. The three-dimensional frame and the offset of the target object's center point are regressed on this feature map with the large receptive field, and the three-dimensional frame is adjusted according to the offset, which avoids inaccurate framing of the target object caused by truncation of the three-dimensional frame. In summary, from using a residual network combined with a split-attention module, to enhancing the feature map obtained by multiple rounds of upsampling and fusion, to detecting the target object with the adjusted three-dimensional frame, the accuracy of target object detection is improved as a whole.
Drawings
FIG. 1 is an application environment diagram of a monocular camera-based three-dimensional object detection method in one embodiment;
FIG. 2 is a flow chart of a method for three-dimensional object detection based on a monocular camera in one embodiment;
FIG. 3 is a schematic diagram of the results of a monocular camera-based three-dimensional object detection method in one embodiment;
FIG. 4 is a block diagram of a three-dimensional object detection device based on a monocular camera in one embodiment;
FIG. 5 is a block diagram of a three-dimensional object detection device based on a monocular camera in another embodiment;
FIG. 6 is an internal structure diagram of a computer device in one embodiment.
Detailed Description
The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
The three-dimensional target detection method based on the monocular camera provided by the present application can be applied to the application environment shown in FIG. 1, in which the monocular camera 102 is connected to the vehicle 104 via a network. The vehicle 104 is provided with a target detection device. The monocular camera 102 passes captured images of the autonomous driving scene to the target detection device of the vehicle 104 for detection.
In one embodiment, as shown in FIG. 2, a three-dimensional target detection method based on a monocular camera is provided. Taking the application of the method to the target detection device in FIG. 1 as an example, the method includes the following steps:
Step 202, acquiring an image of an autonomous driving scene captured by a monocular camera.
Here, a monocular camera is a camera that uses a single camera unit. By contrast, a binocular or multi-view camera uses two or more camera units. An autonomous driving scene is a scene in which a vehicle drives automatically.
In one embodiment, there may or may not be a driver in the vehicle in the autonomous driving scene.
In one embodiment, images of different autonomous driving scenes may be captured by the monocular camera, including images taken in daytime, backlit, nighttime, rainy, and foggy conditions.
Specifically, in the autonomous driving scene, a monocular camera is disposed on the vehicle and captures the scene to acquire an image of it.
Step 204, inputting the image into a split-attention residual network in a trained target object detection model, upsampling the image using deformable convolution, and fusing each feature map obtained after upsampling with the feature map of the preceding lower layer.
The target object detection model is a model for detecting a target object in an image. It comprises a split-attention residual network, a detection head, and a task head. The split-attention residual network is a residual network with an added split-attention module; it extracts the features of the target object from the image and may also be called a feature extractor. A deformable convolution is a convolution whose kernel can change shape. A feature map is a map carrying the features of the target object. The target object is the object to be detected.
Specifically, the target detection device feeds the image acquired by the monocular camera into the trained target object detection model. In the split-attention residual network of the model, the target detection device downsamples the image according to a preset stride and outputs a feature map. It then upsamples the downsampled feature map using deformable convolution, outputs the upsampled feature maps through different numbers of channels, and superimposes them. Finally, the target detection device fuses the upsampled feature map with the feature map of the preceding lower layer in the split-attention residual network.
In one embodiment, the downsampling step size may be 32.
Step 206, performing feature enhancement on the fused feature map.
Here, the detection head is the module that enhances the features of the target object.
Specifically, the target detection device convolves the fused feature map with the detection head in the target object detection model to enhance the features of the fused feature map.
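By way of illustration, such a detection head can be sketched as a small convolutional block applied to the fused feature map. The channel count and layer layout below are assumptions for the sketch, not values given in the patent:

```python
import torch
import torch.nn as nn

class DetectionHead(nn.Module):
    """Minimal sketch of the feature-enhancement head: a 3x3 convolution
    followed by BatchNorm and ReLU applied to the fused feature map.
    The channel size is an illustrative assumption."""
    def __init__(self, channels: int = 64):
        super().__init__()
        self.enhance = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, fused: torch.Tensor) -> torch.Tensor:
        # returns an enhanced feature map of the same spatial size
        return self.enhance(fused)
```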
Step 208, regressing, based on the enhanced feature map, a three-dimensional frame for identifying the target object, an orientation of the target object, and an offset of a center point of the target object.
Here, the three-dimensional frame is a solid geometric frame containing three-dimensional information.
Specifically, through the task head in the target object detection model, the target detection device regresses, based on the enhanced feature map, the length, width, and height of the three-dimensional frame for identifying the target object in the image, the orientation of the target object, and the offset of the identified target object's center point. The task head is the module that regresses the three-dimensional frame and the offset of the target object's center point.
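By way of illustration, a task head of this kind can be sketched as parallel convolutional branches, one per regression target, in the spirit of center-point detectors. The branch layout, channel counts, and the (sin, cos) orientation encoding are assumptions, not details taken from the patent:

```python
import torch
import torch.nn as nn

class TaskHead(nn.Module):
    """Sketch of a task head with one small conv branch per regression
    target: 3D frame dimensions, orientation, and center-point offset.
    Branch layout and channel counts are illustrative assumptions."""
    def __init__(self, in_ch: int = 64):
        super().__init__()
        def branch(out_ch: int) -> nn.Sequential:
            return nn.Sequential(
                nn.Conv2d(in_ch, in_ch, 3, padding=1), nn.ReLU(inplace=True),
                nn.Conv2d(in_ch, out_ch, 1),
            )
        self.dims = branch(3)    # length, width, height of the 3D frame
        self.orient = branch(2)  # orientation encoded as (sin, cos), an assumption
        self.offset = branch(2)  # sub-pixel offset of the center point

    def forward(self, feat: torch.Tensor) -> dict[str, torch.Tensor]:
        return {"dims": self.dims(feat),
                "orient": self.orient(feat),
                "offset": self.offset(feat)}
```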
Step 210, adjusting the position of the three-dimensional frame according to the offset to obtain a target detection result of the target object.
The target detection result comprises the type, orientation, position, length, width, and height of the target object.
Specifically, during model training, a user can manually adjust the camera parameters, thereby adjusting the offset of the camera coordinates relative to the world coordinates, and store the offset as a parameter of the trained target object detection model. Through the task head in the target object detection model, the target detection device adjusts the position of the three-dimensional frame according to the stored offset, so as to detect the type, orientation, position, length, width, and height of the three-dimensional target object.
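The adjustment itself reduces to shifting a coarse center location by the regressed sub-pixel offset and scaling back to image coordinates. A minimal sketch, where the output stride is an assumed value:

```python
import torch

def adjust_box_center(peaks: torch.Tensor, offsets: torch.Tensor,
                      stride: int = 4) -> torch.Tensor:
    """Shift coarse center-point locations by the regressed offset.

    peaks:   (N, 2) integer (x, y) cells on the output feature map.
    offsets: (N, 2) regressed sub-pixel offsets for those cells.
    Returns refined centers in input-image pixel coordinates.
    The stride value is an illustrative assumption."""
    return (peaks.float() + offsets) * stride
```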
According to the above monocular camera-based three-dimensional target detection method, an image of an autonomous driving scene captured by a monocular camera is acquired, which saves half the cost compared with a binocular camera. The acquired image is input into a trained target object detection model that uses a split-attention residual network as the backbone feature-extraction network; compared with a plain residual network, the added split-attention module allows the network to focus more on important features. In the split-attention residual network, the image is upsampled multiple times using deformable convolution, each feature map obtained after upsampling is fused with the feature map of the preceding lower layer, and the fused features are further enhanced, so the output feature map has a large receptive field. The three-dimensional frame and the offset of the target object's center point are regressed on this feature map with the large receptive field, and the three-dimensional frame is adjusted according to the offset, which avoids inaccurate framing of the target object caused by truncation of the three-dimensional frame. In summary, from using a residual network combined with a split-attention module, to enhancing the feature map obtained by multiple rounds of upsampling and fusion, to detecting the target object with the adjusted three-dimensional frame, the accuracy of target object detection is improved as a whole.
In one embodiment, the step of inputting the image into a split-attention residual network in a trained target object detection model and upsampling the image using deformable convolution further comprises: inputting the image into a split-attention residual network in a trained target object detection model; in the process of upsampling the image multiple times using deformable convolution, for each upsampling, deforming a convolution kernel according to the geometric shape of a target object in the image to obtain a convolution kernel adapted to the geometric shape; and upsampling the image based on the convolution kernel adapted to the geometric shape to obtain a feature map whose receptive field matches the size of the target object.
A convolution kernel is a matrix that is convolved with the pixel matrix corresponding to the image. The receptive field describes how much of the scene the feature map "sees" when identifying the target object: if the target object can be identified easily from the feature map, the receptive field is large; otherwise it is small.
Specifically, the target detection device upsamples the downsampled feature map of the image multiple times using deformable convolution in the split-attention residual network of the trained target object detection model. In each upsampling, a convolution kernel deformed according to the geometric shape of the target object is convolved with the pixel matrix corresponding to the feature map, yielding a feature map whose receptive field matches the size of the target object.
In this embodiment, up-sampling is performed by using deformable convolution, so that the adaptability of the target object detection model to the geometric change of the target object in the image can be enhanced.
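As a concrete sketch of this embodiment, the block below predicts per-position kernel offsets with an ordinary convolution and feeds them to a deformable convolution (torchvision's DeformConv2d) before upsampling. The layer sizes and the bilinear upsampling factor are illustrative assumptions:

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DeformableUpsampleBlock(nn.Module):
    """One deformable-convolution upsampling step: a plain conv predicts
    per-position offsets so the 3x3 kernel deforms to follow the target's
    geometry; the result is then upsampled by a factor of 2."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        # 2 offsets (dx, dy) per tap of the 3x3 kernel -> 18 channels
        self.offset_pred = nn.Conv2d(in_ch, 18, kernel_size=3, padding=1)
        self.deform_conv = DeformConv2d(in_ch, out_ch, kernel_size=3, padding=1)
        self.upsample = nn.Upsample(scale_factor=2, mode="bilinear",
                                    align_corners=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        offset = self.offset_pred(x)      # kernel deformation field
        x = self.deform_conv(x, offset)   # sample with the deformed kernel
        return self.upsample(x)
```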
In one embodiment, the step of upsampling the image using deformable convolution and fusing each feature map obtained after upsampling with the feature map of the preceding lower layer further comprises: fusing the feature map obtained after each upsampling with the feature map of the preceding lower layer; the feature map of the preceding lower layer is the unfused feature map output by the sampling layer immediately preceding the current upsampling.
Here, each layer of the convolutional neural network corresponds to a single neuron in the split-attention residual network. It can be understood that a neuron corresponds to one network layer, and each layer convolves the image input to it. The input to a layer may be the original image or the feature map produced by the convolution of the preceding layer.
Specifically, using the split-attention residual network, the target detection device fuses each feature map obtained by upsampling in a given layer of the convolutional neural network with the unfused feature map produced by the preceding lower layer corresponding to that upsampling.
In one embodiment, the number of upsamplings may be three. In the split-attention residual network of the trained target object detection model, the target detection device performs the first upsampling on the feature map output after downsampling, outputs and superimposes the feature maps through 256 channels, and obtains the unfused feature map of the first upsampling. This unfused feature map is fused with the feature map output after downsampling to obtain the fused feature map of the first upsampling. The target detection device then performs the second upsampling on the first fused feature map via deformable convolution, outputs and superimposes the feature maps through 128 channels, and obtains the unfused feature map of the second upsampling, which is fused with the first upsampled feature map to obtain the fused feature map of the second upsampling. Finally, the target detection device performs the third upsampling on the second fused feature map, outputs and superimposes the feature maps through 64 channels to obtain the unfused feature map of the third upsampling, which is fused with the second fused feature map to obtain the fused feature map of the third upsampling.
In this embodiment, the downsampled feature map has a small receptive field, which is unfavorable for target detection; upsampling the output feature map improves its feature-expression capability. The deformable convolution adapts to changes in the target's geometric shape and improves the generalization of the convolutional neural network.
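To make the three-stage structure concrete, here is a minimal sketch. The 256/128/64 output channels follow the embodiment; a transposed convolution stands in for the deformable upsampling step sketched earlier, and the 1x1 lateral convolutions and element-wise addition as the fusion operation are assumptions. A 512-channel input assumes a stride-32 backbone output:

```python
import torch
import torch.nn as nn

class FusionNeck(nn.Module):
    """Sketch of the three upsampling/fusion stages: each stage doubles the
    spatial resolution (256 -> 128 -> 64 channels) and fuses the result with
    the unfused feature map of the preceding lower layer."""
    def __init__(self, in_ch: int = 512):
        super().__init__()
        self.ups = nn.ModuleList()
        self.laterals = nn.ModuleList()
        c = in_ch
        for out_c in (256, 128, 64):
            self.ups.append(nn.ConvTranspose2d(c, out_c, kernel_size=4,
                                               stride=2, padding=1))
            self.laterals.append(nn.Conv2d(out_c, out_c, kernel_size=1))
            c = out_c

    def forward(self, deepest: torch.Tensor,
                skips: list[torch.Tensor]) -> torch.Tensor:
        # skips[i]: unfused feature map of the preceding lower layer for
        # stage i, ordered deep-to-shallow and matching the upsampled size.
        x = deepest
        for up, lateral, skip in zip(self.ups, self.laterals, skips):
            x = up(x) + lateral(skip)  # fuse upsampled map with lower-layer map
        return x
```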
In one embodiment, the step of regressing, based on the enhanced feature map, the three-dimensional frame for identifying the target object, the orientation of the target object, and the offset of the center point of the target object further comprises: adding prior boxes to the image based on the enhanced feature map, and regressing the length and width of the three-dimensional frame according to the prior boxes; regressing the orientation of the target object, the offset of the center point of the target object, and the height of the three-dimensional frame according to the distance between the center point of the target object and the monocular camera; and obtaining the three-dimensional frame for identifying the target object in the image from the length, width, and height of the three-dimensional frame.
Here, a prior box is an initial two-dimensional geometric box; a two-dimensional geometric box has only length and width. There may be multiple prior boxes, and an accurate detection box is finally obtained through continual regression.
Specifically, based on the feature map enhanced by the detection head, the target detection device adds multiple prior boxes to the image through the task head in the target object detection model, and continually compares the score computed for each prior box with a reference threshold in the trained model to obtain the best prior box, i.e., to regress the length and width of the three-dimensional frame. According to the distance between the center point of the target object and the monocular camera, the target detection device regresses, through the task head, the orientation of the target object, the offset of the target object's center point, and the height of the three-dimensional frame, and obtains the three-dimensional frame for identifying the target object in the image from the frame's length, width, and height.
In one embodiment, as shown in FIG. 3, a monocular camera-based three-dimensional target detection device on a vehicle can identify a target object in the image captured by the monocular camera with a three-dimensional frame.
In this embodiment, based on the enhanced feature map, the offset of the target object's center point is regressed by the task head in the target object detection model to adjust the three-dimensional frame that identifies the target object in the image. This reduces the cases where the three-dimensional frame exceeds the image range, i.e., where the frame is truncated, thereby avoiding truncation of the identified target object and improving the accuracy of the three-dimensional frame.
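The patent does not name the score that is compared against the reference threshold; one common reading is an overlap (IoU) score between each prior box and the regressed box. A hedged sketch under that assumption:

```python
import torch

def select_best_prior(priors: torch.Tensor, pred_box: torch.Tensor) -> int:
    """Pick the prior box that best matches a predicted 2D box by IoU.
    Reading the compared score as IoU is an assumption; the patent does
    not name the metric. Boxes are (x1, y1, x2, y2)."""
    x1 = torch.maximum(priors[:, 0], pred_box[0])
    y1 = torch.maximum(priors[:, 1], pred_box[1])
    x2 = torch.minimum(priors[:, 2], pred_box[2])
    y2 = torch.minimum(priors[:, 3], pred_box[3])
    inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)
    area_p = (priors[:, 2] - priors[:, 0]) * (priors[:, 3] - priors[:, 1])
    area_b = (pred_box[2] - pred_box[0]) * (pred_box[3] - pred_box[1])
    iou = inter / (area_p + area_b - inter)
    return int(iou.argmax())
```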
In one embodiment, the method further comprises: labeling the sample image with label information according to point cloud coordinates; the point cloud coordinates are formed from the samples collected by a lidar; converting the point cloud coordinates into coordinates in the camera coordinate system using the camera extrinsic parameters in the label information; converting the coordinates in the camera coordinate system into the pixel coordinate system using the camera intrinsic parameters in the label information; and training the target object detection model according to the sample image converted into the pixel coordinate system.
Here, a sample image is an image used as a sample for detection. Point cloud coordinates are the coordinates of the many points forming a point cloud. The camera coordinate system is a coordinate system defined from the viewpoint of the camera that captured the image. The camera extrinsic parameters comprise the rotation and translation parameters of the three camera coordinate axes. The camera intrinsic parameters include the camera's radial and tangential distortion coefficients. The sample image converted into the pixel coordinate system is a pixel matrix.
Specifically, the user annotates the sample image acquired by the monocular camera. In labeling software, the user can load the sample image, adjust the point cloud coordinates that the software attaches to the target object, enter the values corresponding to those coordinates, and add label information to the annotated target object. The label information includes: picture identifier, image category, camera intrinsic parameters, two-dimensional frame, whether the object is truncated, degree of occlusion, orientation, and three-dimensional dimensions. Truncation may be encoded with "0" for not truncated and "1" for truncated. Sample images of the autonomous driving scene are acquired by the monocular camera, and the user adds point cloud coordinates to the sample target object in the sample image and annotates the label information. The target detection device converts the point cloud coordinates into coordinates in the camera coordinate system using the camera extrinsic parameters in the label information, and then converts those coordinates into the pixel coordinate system using the camera intrinsic parameters in the label information. The user trains the target object detection model with the sample image converted into the pixel coordinate system.
In one embodiment, fine-tuning the camera extrinsic parameters adjusts the three rotational degrees of freedom of the camera coordinates, i.e., pitch, yaw, and roll, and translates the coordinate axes of the camera coordinates, i.e., the x-, y-, and z-axes. The target detection device can convert the point cloud coordinates into coordinates in the camera coordinate system using the offset obtained from this fine-tuning in the label information. It then converts the coordinates in the camera coordinate system into the pixel coordinate system using the camera's radial and tangential distortion coefficients and the pixel scale.
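Ignoring lens distortion for brevity, the two conversions above amount to a rigid transform with the extrinsics followed by a pinhole projection with the intrinsic matrix. A minimal sketch, with (R, t, K) assumed to be in the usual matrix form:

```python
import numpy as np

def lidar_to_pixel(points_lidar: np.ndarray, R: np.ndarray, t: np.ndarray,
                   K: np.ndarray) -> np.ndarray:
    """Project (N, 3) lidar point coordinates into the pixel coordinate system.

    R (3x3) and t (3,) are the camera extrinsic parameters (rotation and
    translation); K (3x3) is the camera intrinsic matrix. Lens distortion is
    omitted here, although the embodiment also applies the radial and
    tangential distortion coefficients."""
    pts_cam = points_lidar @ R.T + t   # point cloud -> camera coordinate system
    uvw = pts_cam @ K.T                # camera coordinates -> homogeneous pixels
    return uvw[:, :2] / uvw[:, 2:3]    # perspective divide -> (u, v) pixel coords
```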
In this embodiment, labeling sample images from different autonomous driving scenes and feeding them into the target object detection model for training improves the detection accuracy of the model.
In one embodiment, the method further comprises: in the process of using the trained target object detection model, the inference optimizer in the target object detection model is used for quantizing 32-bit or 16-bit data to be calculated in the target object detection model into data in the form of 8-bit integers in a mode of minimizing KL divergence (Kullback-Leibler divergence); data quantized to 8-bit integer form is calculated according to the target object detection model.
Specifically, the target detection device uses an inference optimizer in the target object detection model to quantize the 32-bit or 16-bit data to be calculated in the target object detection model into data in the form of an 8-bit integer by minimizing the KL divergence, and then calculates the data.
In one embodiment, for example, the 32-bit or 16-bit data corresponding to the feature map to be convolved may be quantized into 8-bit integer form before upsampling, and the convolution is then performed.
In one embodiment, 1/10 of the training data can be taken as a calibration data set. In the target object detection model, inference is run on the images with 32-bit data, histograms of the activation values of each layer are then collected, saturated quantization distributions under different thresholds are computed, and finally the threshold that minimizes the KL divergence is found.
In one embodiment, the inference optimizer may be TensorRT, a deep learning inference framework that performs only forward propagation.
In this embodiment, quantizing the 32-bit or 16-bit data to be calculated into 8-bit integer data increases the operation speed of the CPU or GPU of the target detection device, and thereby increases the speed at which the target object detection model detects the target object; specifically, the speed can be increased by at least 1.5 times compared with that before acceleration.
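The threshold search itself can be sketched as follows. This is a simplified reading of the entropy-calibration procedure that TensorRT's INT8 mode is based on; the bin count and the search step are assumptions:

```python
import numpy as np

def best_int8_threshold(activations: np.ndarray, n_bins: int = 2048) -> float:
    """Simplified sketch of KL-divergence threshold search for INT8
    calibration. For each candidate cut-off bin i, the clipped FP32
    histogram P is compared with a version Q re-quantized to 128 levels,
    and the threshold with minimal KL(P || Q) is returned."""
    hist, edges = np.histogram(np.abs(activations), bins=n_bins)
    hist = hist.astype(np.float64)
    best_edge, best_kl = float(edges[-1]), np.inf
    for i in range(128, n_bins + 1, 128):
        p = hist[:i].copy()
        p[-1] += hist[i:].sum()          # fold the clipped tail into the last bin
        q = np.zeros(i)
        for chunk in np.array_split(np.arange(i), 128):
            nonzero = p[chunk] > 0
            if nonzero.any():            # spread each level's mass over its bins
                q[chunk[nonzero]] = p[chunk].sum() / nonzero.sum()
        p_n, q_n = p / p.sum(), q / q.sum()
        mask = (p_n > 0) & (q_n > 0)
        kl = float(np.sum(p_n[mask] * np.log(p_n[mask] / q_n[mask])))
        if kl < best_kl:
            best_kl, best_edge = kl, float(edges[i])
    return best_edge
```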
It should be understood that although the steps in the flowchart of FIG. 2 are shown in the order indicated by the arrows, they are not necessarily executed in that order. Unless explicitly stated herein, the execution order of these steps is not strictly limited, and they may be executed in other orders. Moreover, at least some of the steps in FIG. 2 may comprise multiple sub-steps or stages, which need not be completed at the same time but may be executed at different moments, and their execution order need not be sequential; they may be executed in turn or alternately with at least part of the sub-steps or stages of other steps.
In one embodiment, as shown in fig. 4, there is provided a monocular camera-based three-dimensional object detection apparatus 400, comprising: an image acquisition module 402, a feature extraction module 404, an enhancement module 406, a regression module 408, and a detection module 410, wherein:
The image acquisition module 402 is configured to acquire an image of the autonomous driving scene captured by the monocular camera.
The feature extraction module 404 is configured to input the image into a split-attention residual network in the trained target object detection model, upsample the image using deformable convolution, and fuse each feature map obtained after upsampling with the feature map of the preceding lower layer.
And the enhancement module 406 is configured to perform feature enhancement on the fused feature map.
A regression module 408, configured to regress, based on the enhanced feature map, the three-dimensional frame for identifying the target object, the orientation of the target object, and the offset of the center point of the target object.
The detection module 410 is configured to adjust a position of the three-dimensional frame according to the offset, and obtain a target detection result of the target object.
In one embodiment, the feature extraction module 404 is further configured to input the image into a split-attention residual network in the trained target object detection model; in the process of upsampling the image multiple times using deformable convolution, deform, for each upsampling, a convolution kernel according to the geometric shape of a target object in the image to obtain a convolution kernel adapted to the geometric shape; and upsample the image based on the convolution kernel adapted to the geometric shape to obtain a feature map whose receptive field matches the size of the target object.
In one embodiment, the feature extraction module 404 is further configured to upsample the image multiple times using deformable convolution and fuse each feature map obtained after each upsampling with the feature map of the preceding lower layer; the feature map of the preceding lower layer is the unfused feature map output by the sampling layer immediately preceding the current upsampling.
In one embodiment, the regression module 408 is further configured to add prior boxes to the image based on the enhanced feature map and regress the length and width of the three-dimensional frame according to the prior boxes; regress the orientation of the target object, the offset of the center point of the target object, and the height of the three-dimensional frame according to the distance between the center point of the target object and the monocular camera; and obtain the three-dimensional frame for identifying the target object in the image from the length, width, and height of the three-dimensional frame.
In one embodiment, the apparatus further comprises:
The training module 401 is configured to label the sample image with label information according to point cloud coordinates, where the point cloud coordinates are formed from the samples collected by a lidar; convert the point cloud coordinates into coordinates in the camera coordinate system using the camera extrinsic parameters in the label information; convert the coordinates in the camera coordinate system into the pixel coordinate system using the camera intrinsic parameters in the label information; and train the target object detection model according to the sample image converted into the pixel coordinate system.
As shown in fig. 5, in one embodiment, the apparatus further comprises: a training module 401 and an acceleration module 412;
The acceleration module 412 is configured to, in the process of using the trained target object detection model, use an inference optimizer in the target object detection model to quantize the 32-bit or 16-bit data to be calculated in the model into 8-bit integer data by minimizing the KL divergence, and to perform the calculation on the data quantized into 8-bit integer form according to the target object detection model.
For specific limitations of the monocular camera-based three-dimensional target detection device, reference may be made to the limitations of the monocular camera-based three-dimensional target detection method above, which are not repeated here. The modules in the above device may be implemented in whole or in part by software, hardware, or a combination thereof. The modules may be embedded in hardware form in, or independent of, a processor in the computer device, or stored in software form in a memory of the computer device, so that the processor can invoke and execute the operations corresponding to the modules.
In one embodiment, a computer device is provided, which may be a target detection device on a vehicle in an autonomous driving scene; its internal structure may be as shown in FIG. 6. The computer device includes a processor, a memory, a communication interface, a display screen, and an input device connected by a system bus. The processor of the computer device provides computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for running the operating system and the computer program in the non-volatile storage medium. The communication interface of the computer device is used for wired or wireless communication with an external target detection device; wireless communication can be realized through WIFI, an operator network, NFC (near field communication), or other technologies. The computer program, when executed by the processor, implements a monocular camera-based three-dimensional target detection method.
It will be appreciated by those skilled in the art that the structure shown in FIG. 6 is merely a block diagram of part of the structure related to the solution of the present application and does not limit the computer device to which the solution is applied; a particular computer device may include more or fewer components than shown, combine certain components, or have a different arrangement of components.
In one embodiment, a computer device is provided, comprising a memory and a processor, the memory storing a computer program, and the processor, when executing the computer program, performing the following steps: acquiring an image of an autonomous driving scene captured by a monocular camera; inputting the image into a split-attention residual network in a trained target object detection model, upsampling the image using deformable convolution, and fusing each feature map obtained after upsampling with the feature map of the preceding lower layer; performing feature enhancement on the fused feature map; regressing, based on the enhanced feature map, a three-dimensional frame for identifying a target object, an orientation of the target object, and an offset of a center point of the target object; and adjusting the position of the three-dimensional frame according to the offset to obtain a target detection result of the target object.
In one embodiment, the processor, when executing the computer program, further performs the following steps: inputting the image into a split-attention residual network in a trained target object detection model; in the process of upsampling the image multiple times using deformable convolution, for each upsampling, deforming a convolution kernel according to the geometric shape of a target object in the image to obtain a convolution kernel adapted to the geometric shape; and upsampling the image based on the convolution kernel adapted to the geometric shape to obtain a feature map whose receptive field matches the size of the target object.
In one embodiment, the processor, when executing the computer program, further performs the following steps: upsampling the image multiple times using deformable convolution, and fusing each feature map obtained after each upsampling with the feature map of the preceding lower layer; the feature map of the preceding lower layer is the unfused feature map output by the sampling layer immediately preceding the current upsampling.
In one embodiment, the processor, when executing the computer program, further performs the following steps: adding prior boxes to the image based on the enhanced feature map, and regressing the length and width of the three-dimensional frame according to the prior boxes; regressing the orientation of the target object, the offset of the center point of the target object, and the height of the three-dimensional frame according to the distance between the center point of the target object and the monocular camera; and obtaining the three-dimensional frame for identifying the target object in the image according to the length, width, and height of the three-dimensional frame.
In one embodiment, the processor, when executing the computer program, further performs the following steps: labeling the sample image with label information according to point cloud coordinates; the point cloud coordinates are formed from the samples collected by a lidar; converting the point cloud coordinates into coordinates in the camera coordinate system using the camera extrinsic parameters in the label information; converting the coordinates in the camera coordinate system into the pixel coordinate system using the camera intrinsic parameters in the label information; and training the target object detection model according to the sample image converted into the pixel coordinate system.
In one embodiment, the processor, when executing the computer program, further performs the following steps: in the process of using the trained target object detection model, using an inference optimizer in the target object detection model to quantize the 32-bit or 16-bit data to be calculated in the model into 8-bit integer data by minimizing the KL divergence; and performing the calculation on the data quantized into 8-bit integer form according to the target object detection model.
In one embodiment, a computer readable storage medium is provided having a computer program stored thereon, which when executed by a processor, performs the steps of:
acquiring an image of an autonomous driving scene captured by a monocular camera; inputting the image into a split-attention residual network in a trained target object detection model, upsampling the image using deformable convolution, and fusing each feature map obtained after upsampling with the feature map of the preceding lower layer; performing feature enhancement on the fused feature map; regressing, based on the enhanced feature map, a three-dimensional frame for identifying a target object, an orientation of the target object, and an offset of a center point of the target object; and adjusting the position of the three-dimensional frame according to the offset to obtain a target detection result of the target object.
In one embodiment, the computer program, when executed by the processor, further performs the following steps: inputting the image into a split-attention residual network in a trained target object detection model; in the process of upsampling the image multiple times using deformable convolution, for each upsampling, deforming a convolution kernel according to the geometric shape of a target object in the image to obtain a convolution kernel adapted to the geometric shape; and upsampling the image based on the convolution kernel adapted to the geometric shape to obtain a feature map whose receptive field matches the size of the target object.
In one embodiment, the computer program, when executed by the processor, further performs the following steps: upsampling the image multiple times using deformable convolution, and fusing the feature map obtained after each upsampling with the feature map of the preceding lower layer; the feature map of the preceding lower layer is the unfused feature map output by the sampling layer immediately preceding the current upsampling.
In one embodiment, the computer program, when executed by the processor, further performs the following steps: adding prior boxes to the image based on the enhanced feature map, and regressing the length and width of the three-dimensional frame according to the prior boxes; regressing the orientation of the target object, the offset of the center point of the target object, and the height of the three-dimensional frame according to the distance between the center point of the target object and the monocular camera; and obtaining the three-dimensional frame for identifying the target object in the image according to the length, width, and height of the three-dimensional frame.
In one embodiment, the computer program, when executed by the processor, further performs the following steps: labeling the sample image with label information according to point cloud coordinates; the point cloud coordinates are formed from the samples collected by a lidar; converting the point cloud coordinates into coordinates in the camera coordinate system using the camera extrinsic parameters in the label information; converting the coordinates in the camera coordinate system into the pixel coordinate system using the camera intrinsic parameters in the label information; and training the target object detection model according to the sample image converted into the pixel coordinate system.
In one embodiment, the computer program, when executed by the processor, further performs the following steps: in the process of using the trained target object detection model, using an inference optimizer in the target object detection model to quantize the 32-bit or 16-bit data to be calculated in the model into 8-bit integer data by minimizing the KL divergence; and performing the calculation on the data quantized into 8-bit integer form according to the target object detection model.
Those skilled in the art will appreciate that implementing all or part of the above described methods may be accomplished by way of a computer program stored on a non-transitory computer readable storage medium, which when executed, may comprise the steps of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in embodiments provided herein may include at least one of non-volatile and volatile memory. The nonvolatile Memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash Memory, optical Memory, or the like. Volatile memory can include random access memory (Random Access Memory, RAM) or external cache memory. By way of illustration, and not limitation, RAM can be in various forms such as static random access memory (Static Random Access Memory, SRAM) or dynamic random access memory (Dynamic Random Access Memory, DRAM), etc.
The technical features of the above embodiments may be arbitrarily combined, and all possible combinations of the technical features in the above embodiments are not described for brevity of description, however, as long as there is no contradiction between the combinations of the technical features, they should be considered as the scope of the description.
The above embodiments illustrate only several implementations of the application; they are described specifically and in detail, but should not therefore be construed as limiting the scope of the application. It should be noted that those skilled in the art can make several variations and modifications without departing from the concept of the application, all of which fall within the protection scope of the application. Accordingly, the scope of protection of the present application shall be subject to the appended claims.

Claims (10)

1. A method for three-dimensional object detection based on a monocular camera, the method comprising:
acquiring an image of an autonomous driving scene captured by a monocular camera;
inputting the image into a split-attention residual network in a trained target object detection model;
downsampling the image according to a preset stride, and outputting a downsampled feature map;
upsampling the downsampled feature map multiple times using deformable convolution;
fusing the feature map obtained after each upsampling with the feature map of the preceding lower layer; the feature map of the preceding lower layer is the unfused feature map output by the sampling layer immediately preceding each upsampling;
performing feature enhancement on the fused feature map;
regressing, based on the enhanced feature map, a three-dimensional frame for identifying a target object, an orientation of the target object, and an offset of a center point of the target object;
and adjusting the position of the three-dimensional frame according to the offset to obtain a target detection result of the target object.
2. The method of claim 1, wherein upsampling the downsampled feature map multiple times using deformable convolution further comprises:
during the multiple deformable-convolution upsamplings of the image, for each upsampling, deforming the convolution kernel according to the geometry of the target object in the image to obtain a kernel adapted to that geometry; and
upsampling the image with the geometry-adapted convolution kernel to obtain a feature map whose receptive field matches the size of the target object.
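By way of illustration and not limitation, one deformable-convolution upsampling stage of the kind recited in claim 2 might be sketched with torchvision's DeformConv2d, where a small convolution predicts per-sample-point offsets that deform the kernel toward the object's geometry. The class name and layer sizes are assumptions for the example.

```python
# Hedged sketch of a single deformable-convolution upsampling stage.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.ops import DeformConv2d

class DeformableUpsample(nn.Module):
    def __init__(self, in_ch, out_ch, kernel_size=3):
        super().__init__()
        # Two offsets (dx, dy) per kernel sample point deform the kernel.
        self.offset_pred = nn.Conv2d(in_ch, 2 * kernel_size * kernel_size,
                                     kernel_size, padding=kernel_size // 2)
        self.deform = DeformConv2d(in_ch, out_ch, kernel_size,
                                   padding=kernel_size // 2)

    def forward(self, x):
        offsets = self.offset_pred(x)   # learned kernel deformation field
        y = self.deform(x, offsets)     # geometry-adapted convolution
        return F.interpolate(y, scale_factor=2, mode="bilinear",
                             align_corners=False)

# e.g. DeformableUpsample(256, 128)(torch.randn(1, 256, 40, 40)).shape
# -> torch.Size([1, 128, 80, 80])
```

Because the offsets are predicted from the input itself, the effective receptive field stretches or shrinks with the object, which is the property the claim relies on.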
3. The method of claim 1, wherein regressing, based on the enhanced feature map, the three-dimensional box for identifying the target object, the orientation of the target object, and the offset of the target object's center point further comprises:
adding a prior box to the image based on the enhanced feature map, and regressing the length and width of the three-dimensional box from the prior box;
regressing the orientation of the target object, the offset of the target object's center point, and the height of the three-dimensional box from the distance between the target object's center point and the monocular camera; and
obtaining the three-dimensional box identifying the target object in the image from the length and width of the three-dimensional box and the height of the three-dimensional box.
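By way of illustration and not limitation, the following sketch decodes regression outputs of the kind recited in claim 3 into a three-dimensional box. The exponential residual parameterization over a prior box and the sine/cosine orientation encoding are common conventions assumed here, not details taken from this disclosure.

```python
# Hypothetical decoding of prior-box and distance-conditioned regressions.
import numpy as np

def decode_box3d(prior_lw, reg_lw, depth, reg_h, reg_sin, reg_cos,
                 reg_offset, center_2d):
    """Assemble a 3D box from regression outputs (all names illustrative)."""
    length = prior_lw[0] * np.exp(reg_lw[0])   # length/width refined from
    width = prior_lw[1] * np.exp(reg_lw[1])    # the prior box
    height = depth * reg_h                      # height conditioned on the
                                                # center-to-camera distance
    yaw = np.arctan2(reg_sin, reg_cos)          # orientation of the target
    center = np.asarray(center_2d, dtype=float) + np.asarray(reg_offset)
    return {"center": center, "size": (length, width, height), "yaw": yaw}

# e.g. decode_box3d((3.9, 1.6), (0.1, -0.05), 22.0, 0.07, 0.5, 0.87,
#                   (0.3, -0.2), (640.0, 360.0))
```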
4. The method according to claim 1, wherein the target object detection model is obtained by a model training step comprising:
annotating a sample image with label information according to point cloud coordinates, the point cloud coordinates being derived from samples acquired by a lidar;
converting the point cloud coordinates into coordinates in a camera coordinate system using the camera extrinsic parameters in the label information;
converting the coordinates in the camera coordinate system into a pixel coordinate system using the camera intrinsic parameters in the label information; and
training the target object detection model on the sample image converted to the pixel coordinate system.
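By way of illustration and not limitation, the coordinate chain recited in claim 4 (lidar point cloud to camera coordinate system via the extrinsics, then to the pixel coordinate system via the intrinsics) can be sketched as follows, assuming a standard pinhole projection:

```python
# Sketch of the lidar -> camera -> pixel coordinate chain (pinhole model).
import numpy as np

def lidar_to_pixel(points_lidar, R, t, K):
    """points_lidar: (N, 3) lidar-frame points; R (3, 3) and t (3,) are the
    camera extrinsics; K (3, 3) is the camera intrinsic matrix."""
    pts_cam = points_lidar @ R.T + t          # camera coordinate system
    pts_img = pts_cam @ K.T                   # apply the intrinsics
    uv = pts_img[:, :2] / pts_img[:, 2:3]     # perspective divide -> pixels
    return uv, pts_cam[:, 2]                  # pixel coords and depth
    # Assumes points in front of the camera (positive depth); real pipelines
    # would filter pts_cam[:, 2] <= 0 before dividing.
```

The projected pixel coordinates let the lidar-derived labels supervise a model that, at inference time, sees only the monocular image.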
5. The method according to any one of claims 1 to 4, further comprising:
in the process of using the trained target object detection model, quantizing, by an inference optimizer in the target object detection model, the 32-bit or 16-bit data to be computed in the model into 8-bit integer data in a manner that minimizes the KL divergence; and
performing the target object detection model's computations on the data quantized to 8-bit integer form.
6. A monocular camera-based three-dimensional object detection apparatus, the apparatus comprising:
an image acquisition module, configured to acquire an image of an autonomous driving scene captured by the monocular camera;
a feature extraction module, configured to input the image into a split-attention residual network in a trained target object detection model, downsample the image with a preset stride and output a downsampled feature map, upsample the downsampled feature map multiple times using deformable convolution, and fuse the feature map obtained after each upsampling with the preceding lower-layer feature map, the preceding lower-layer feature map being the unfused feature map output by the sampling layer immediately preceding that upsampling;
an enhancement module, configured to perform feature enhancement on the fused feature maps;
a regression module, configured to regress, based on the enhanced feature map, a three-dimensional box for identifying the target object, the orientation of the target object, and the offset of the target object's center point; and
a detection module, configured to adjust the position of the three-dimensional box according to the offset and obtain a target detection result for the target object.
7. The apparatus of claim 6, wherein the feature extraction module is further configured to: during the multiple deformable-convolution upsamplings of the image, for each upsampling, deform the convolution kernel according to the geometry of the target object in the image to obtain a kernel adapted to that geometry; and upsample the image with the geometry-adapted convolution kernel to obtain a feature map whose receptive field matches the size of the target object.
8. The apparatus of claim 6, wherein the regression module is further configured to: add a prior box to the image based on the enhanced feature map, and regress the length and width of the three-dimensional box from the prior box; regress the orientation of the target object, the offset of the target object's center point, and the height of the three-dimensional box from the distance between the target object's center point and the monocular camera; and obtain the three-dimensional box identifying the target object in the image from the length and width of the three-dimensional box and the height of the three-dimensional box.
9. A computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor, when executing the computer program, implements the steps of the method of any one of claims 1 to 5.
10. A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the method of any one of claims 1 to 5.
CN202011631597.1A 2020-12-31 2020-12-31 Three-dimensional target detection method and device based on monocular camera and computer equipment Active CN112733672B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011631597.1A CN112733672B (en) 2020-12-31 2020-12-31 Three-dimensional target detection method and device based on monocular camera and computer equipment

Publications (2)

Publication Number Publication Date
CN112733672A 2021-04-30
CN112733672B 2024-06-18

Family

ID=75609926

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011631597.1A Active CN112733672B (en) 2020-12-31 2020-12-31 Three-dimensional target detection method and device based on monocular camera and computer equipment

Country Status (1)

Country Link
CN (1) CN112733672B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113837977B (en) * 2021-09-22 2024-05-10 马上消费金融股份有限公司 Object tracking method, multi-target tracking model training method and related equipment
CN114119991A (en) * 2021-09-30 2022-03-01 深圳市商汤科技有限公司 Target detection method and device, electronic equipment and storage medium
CN114550163B (en) * 2022-02-25 2023-02-03 清华大学 Imaging millimeter wave three-dimensional target detection method based on deformable attention mechanism
CN117336459B (en) * 2023-10-10 2024-04-30 雄安雄创数字技术有限公司 Three-dimensional video fusion method and device, electronic equipment and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110674866A (en) * 2019-09-23 2020-01-10 兰州理工大学 Method for detecting X-ray breast lesion images by using transfer learning characteristic pyramid network
CN111369617A (en) * 2019-12-31 2020-07-03 浙江大学 3D target detection method of monocular view based on convolutional neural network
CN111695448A (en) * 2020-05-27 2020-09-22 东南大学 Roadside vehicle identification method based on visual sensor

Family Cites Families (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109492514A (en) * 2018-08-28 2019-03-19 初速度(苏州)科技有限公司 Method and system for acquiring human eye gaze direction with a single camera
US11494937B2 (en) * 2018-11-16 2022-11-08 Uatc, Llc Multi-task multi-sensor fusion for three-dimensional object detection
GB2582833B (en) * 2019-04-30 2021-04-07 Huawei Tech Co Ltd Facial localisation in images
CN112101373A (en) * 2019-06-18 2020-12-18 富士通株式会社 Object detection method and device based on deep learning network and electronic equipment
CN110427867B (en) * 2019-07-30 2021-11-19 华中科技大学 Facial expression recognition method and system based on residual attention mechanism
CN115063369A (en) * 2019-09-30 2022-09-16 上海联影智能医疗科技有限公司 Brain image detection method, computer device, and storage medium
CN111126385A (en) * 2019-12-13 2020-05-08 哈尔滨工程大学 Deep learning intelligent identification method for deformable living body small target
CN111382677B (en) * 2020-02-25 2023-06-20 华南理工大学 Human behavior recognition method and system based on 3D attention residual error model
CN111340864B (en) * 2020-02-26 2023-12-12 浙江大华技术股份有限公司 Three-dimensional scene fusion method and device based on monocular estimation
CN111047516B (en) * 2020-03-12 2020-07-03 腾讯科技(深圳)有限公司 Image processing method, image processing device, computer equipment and storage medium
CN111415342B (en) * 2020-03-18 2023-12-26 北京工业大学 Automatic detection method for pulmonary nodule images of three-dimensional convolutional neural network by fusing attention mechanisms
CN111539484B (en) * 2020-04-29 2024-05-21 北京市商汤科技开发有限公司 Method and device for training neural network
CN111626159B (en) * 2020-05-15 2022-07-26 南京邮电大学 Human body key point detection method based on attention residual error module and branch fusion
CN111932550B (en) * 2020-07-01 2021-04-30 浙江大学 3D ventricle nuclear magnetic resonance video segmentation system based on deep learning

Also Published As

Publication number Publication date
CN112733672A 2021-04-30

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant