CN112883790A - 3D object detection method based on monocular camera - Google Patents

3D object detection method based on monocular camera

Info

Publication number
CN112883790A
Authority
CN
China
Prior art keywords
image data
depth
original image
point cloud
object detection
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110056909.9A
Other languages
Chinese (zh)
Inventor
黄梓航
伍小军
周航
刘妮妮
董萌
陈炫翰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huizhou Desay SV Automotive Co Ltd
Original Assignee
Huizhou Desay SV Automotive Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huizhou Desay SV Automotive Co Ltd filed Critical Huizhou Desay SV Automotive Co Ltd
Priority to CN202110056909.9A priority Critical patent/CN112883790A/en
Publication of CN112883790A publication Critical patent/CN112883790A/en
Priority to PCT/CN2021/102534 priority patent/WO2022151664A1/en
Pending legal-status Critical Current


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00Image enhancement or restoration
    • G06T5/70Denoising; Smoothing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/50Depth or shape recovery
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/50Context or environment of the image
    • G06V20/56Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/60Type of objects
    • G06V20/64Three-dimensional objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20024Filtering details
    • G06T2207/20032Median filtering

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention relates to a 3D object detection method based on a monocular camera, comprising the following steps: establishing a depth estimation model, which is used to obtain a predicted depth map matching the original image data; acquiring original image data through a vehicle-mounted camera; obtaining a predicted depth map matching the original image data with the depth estimation model; detecting target objects in the original image data; and projecting each target object into the corresponding predicted depth map to generate an anchoring area, then performing 3D reconstruction on the anchoring area to obtain the three-dimensional coordinates of the target object in the world coordinate system. With this method, three-dimensional coordinate information of an object can be obtained with only a monocular camera and without relying on the assumption that the road surface is perfectly flat; the method is low in cost and high in detection accuracy, provides accurate reference data for the driver, helps improve driving safety, and therefore has significant practical value.

Description

3D object detection method based on monocular camera
Technical Field
The invention relates to the technical field of 3D object detection, in particular to a 3D object detection method based on a monocular camera.
Background
In computer vision, detecting objects of interest and inferring their three-dimensional characteristics is a core problem with wide application. In the last decade in particular, with the rapid development of autonomous driving and mobile robotics, object detection has come to play an extremely important role in perception systems; an accurate and efficient perception system can effectively ensure the safety of the robot and of other moving objects around it. Although two-dimensional object detection has advanced rapidly in unmanned systems in recent years, converting a detected object from the image plane into a real-world pose still needs improvement. Conventional three-dimensional object detection typically relies heavily on depth sensors such as laser radar or millimeter-wave radar, which makes it computationally expensive and costly.
Since more and more vehicles are already equipped with high-definition cameras, performing 3D object detection with a monocular camera to reduce cost is becoming an industry trend. However, existing monocular 3D object detection algorithms fall far short of methods using other sensors (such as laser radar) in both real-time performance and accuracy. This is because existing monocular-camera-based 3D object detection algorithms all rely on the assumption that the ground is flat. Under this assumption, three-dimensional information can be modeled from a two-dimensional information source: because the ground is assumed flat, conventional methods further assume that the bottom of the two-dimensional bounding box of a detected object lies on the ground plane, so once an object is detected, a simple geometric calculation yields the distance between the obstacle and the host vehicle.
However, real road surfaces are rarely perfectly flat, and these conventional methods suffer when the road is curved or uneven. When the ground is assumed flat but is not, curvature of the driving surface leads to inaccurate predictions, and the estimated distance to obstacles may be too high or too low. In either case, inaccurate distance estimates directly and negatively affect vehicle operation, potentially compromising lateral and longitudinal control as well as driving safety and reliability. For example, an underestimated distance may cause the Adaptive Cruise Control (ACC) function and, more critically, the Automatic Emergency Braking (AEB) function to fail to prevent a potential traffic accident. Conversely, an overestimated distance may trigger the ACC or AEB functions when they are not needed, causing discomfort or even injury to occupants and reducing their confidence in the vehicle's ability to operate safely.
Disclosure of Invention
In order to overcome the defects, the invention provides a 3D object detection method based on a monocular camera, which comprises the following steps:
establishing a depth estimation model, wherein the depth estimation model is used for acquiring a predicted depth map matched with original image data;
acquiring original image data through a vehicle-mounted camera;
acquiring a predicted depth map matched with original image data by using a depth estimation model;
detecting a target object in the original image data;
and projecting the target object into the corresponding prediction depth map to generate an anchoring area, and performing 3D reconstruction on the anchoring area to obtain a three-dimensional coordinate value of the target object in a world coordinate system.
Further, the step of establishing the depth estimation model includes:
acquiring a plurality of frames of original image data and depth image data matched with the original image data, and establishing a training set, wherein each frame of original image data and the depth image data corresponding to each frame of original image data in the training set form a sample;
and taking each sample in the training set as a training factor and training the depth estimation model with a Scale-Invariant Error loss function.
Further, after the step of establishing the depth estimation model and before the step of detecting the target object in the raw image data, the method further comprises the step of establishing an object detection model:
training an object detection model with a Focal Loss function, using the deep-learning backbone Darknet53 as the feature extraction framework and each original image in the training set as a training factor, wherein the object detection model is used to detect target objects in the original image data.
Further, the Focal Loss function is as follows:
FL(p_t) = -α(1 - p_t)^γ · log(p_t)
where p_t is the detection probability, α is the inter-class weighting parameter, (1 - p_t)^γ is the easy/hard-sample modulation factor, and α = 0.5, γ = 2.
Further, the step of acquiring a plurality of frames of original image data and depth image data matched with each original image data and establishing a training set includes:
simultaneously acquiring a plurality of frames of original image data and laser radar data matched with the original image data;
performing time synchronization processing on each laser radar data and each original image data to form a one-to-one corresponding relation;
projecting the three-dimensional point cloud in the laser radar data into an image plane to form a point cloud picture;
respectively carrying out depth expansion processing on the point cloud images to obtain depth image data matched with the original image data;
and establishing a training set by using a plurality of frames of original image data and depth image data matched with each original image data.
Further, the step of projecting the three-dimensional point cloud in the lidar data into an image plane to form a point cloud picture includes:
acquiring an internal reference matrix of the vehicle-mounted camera;
calculating a rotation translation matrix between the vehicle-mounted camera and the vehicle-mounted laser radar by a combined calibration method;
and converting the three-dimensional point cloud in the laser radar data into a two-dimensional point cloud picture according to the internal reference matrix and the rotational translation matrix.
Further, the step of performing depth expansion processing on the point cloud image to obtain depth image data matched with the original image data includes:
inverting the point cloud image;
performing a first kernel dilation on the inverted point cloud image to close small holes;
performing a first motion-blur outlier removal on the point cloud image after the first kernel dilation, using a median filter;
performing a second kernel dilation on the point cloud image after the first outlier removal to fill the gaps between holes;
performing a third kernel dilation on the point cloud image after the second kernel dilation to close large holes;
performing a second motion-blur outlier removal on the point cloud image after the third kernel dilation, using a median filter;
and removing outliers from the point cloud image after the second outlier removal with a bilateral filter while preserving local boundary features, then performing a second inversion to obtain depth image data matched with the original image data.
Further, the step of obtaining a predicted depth map matched with the original image data by using the depth estimation model includes:
extracting feature parameters from the original image data using DenseNet-121 as the coding layer;
decoding the coding layer into three branches, extracting relative local structural features at different scales through the three branches, connecting the outputs of the three branches in series, and unifying their size to the input image size to obtain a series layer;
and performing a convolution calculation on the series layer and analyzing the local structure to obtain the predicted depth map corresponding to the original image data.
Further, the step of decoding the coding layer into three branches, extracting relative local structural features at different scales through the three branches, connecting the outputs of the three branches in series, and unifying their size to the input image size to obtain a series layer includes:
reducing the dimension of the coded features to H/8, extracting context structure information through a space pyramid pooling layer, connecting the extracted structure information to a local plane guiding layer, and analyzing local geometric structure information of the local plane guiding layer, so as to generate estimated depth features of a first branch;
reducing the dimension of the coded features to H/4, connecting the coded features in series with the depth features generated by the first branch, and connecting the coded features to a local plane guide layer to analyze the local geometric structure information of the coded features, so that estimated depth features of a second branch are generated;
reducing the dimension of the coded features to H/2, connecting the depth features generated by the second branch in series, and connecting the depth features to a local plane guide layer to analyze the local geometric structure information of the depth features, so as to generate estimated depth features of a third branch;
and connecting the estimated depth features generated by the first branch, the second branch and the third branch in series, and unifying the size of the estimated depth features into the size of the input image to obtain a series layer.
Further, the characteristic parameters include image texture, color and spatial structure.
Compared with the prior art, the invention has the following beneficial effects:
the invention discloses a 3D object detection method based on a monocular camera, which can accurately detect a 3D object only by means of the monocular camera, does not depend on the assumed basis that the road surface is completely flat in the whole calculation process, and has the advantages of obviously improved detection precision, capability of providing more accurate reference data for a driver, contribution to improving driving safety, obvious reduction of the detection cost of the 3D object and very important use value compared with the traditional detection scheme of executing a 3D target by means of the monocular camera.
Drawings
Fig. 1 is a schematic flow chart of a 3D object detection method based on a monocular camera in embodiment 1.
Fig. 2 is a schematic diagram of a training set establishment process in embodiment 1.
Fig. 3 is a schematic diagram of a 3D object detection method based on a monocular camera in embodiment 1.
Fig. 4 is a schematic flowchart of a specific process of obtaining a predicted depth map by using a depth estimation model in embodiment 1.
Fig. 5 is a diagram illustrating original image data and annotation information in embodiment 1.
Fig. 6 is a diagram illustrating a predicted depth map and an anchor region in embodiment 1.
The drawings are for illustrative purposes only and are not to be construed as limiting the patent; for purposes of better illustrating the embodiments, certain features of the drawings may be omitted, enlarged or reduced and do not represent actual dimensions; it will be understood by those skilled in the art that certain well-known structures in the drawings and descriptions thereof may be omitted; the same or similar reference numerals correspond to the same or similar parts; the terms describing positional relationships in the drawings are for illustrative purposes only and are not to be construed as limiting the patent.
Detailed Description
The following detailed description of the preferred embodiments of the present invention is provided to enable those skilled in the art to more clearly understand the advantages and features of the present invention and to clearly define the scope of the present invention.
Example 1
This embodiment provides a 3D object detection method based on a monocular camera, implemented mainly with a vehicle-mounted camera and a vehicle-mounted laser radar. One or more vehicle-mounted cameras and laser radars may be used; the number is not limited. The vehicle-mounted camera and the vehicle-mounted laser radar are mounted on the same side of the test vehicle, as close to each other as possible, so that the original image data and the laser radar data are captured from the same viewing angle.
As shown in fig. 1 to 6, a 3D object detection method based on a monocular camera includes the following steps:
101. and establishing a depth estimation model, wherein the depth estimation model is used for obtaining a prediction depth map matched with the original image data.
The depth estimation model is established mainly so that a predicted depth map matching the original image data can be obtained quickly. To establish the depth estimation model, several frames of original image data and the depth image data matched to them are first acquired, and a training set is built; each frame of original image data and its corresponding depth image data form one sample. Each sample in the training set is then used as a training factor, and the depth estimation model is trained with a Scale-Invariant Error loss function.
The depth estimation model is used to obtain a predicted depth map that matches the original image data. Simply put, a prediction model is trained on the training set with the loss function to obtain the final depth estimation model: its input is the original image data, and it directly outputs the corresponding predicted depth map from the feature parameters of that image. The pixel values in the predicted depth map represent the distance between objects and the vehicle, so the depth estimation model is in effect a distance-measuring model. In this technical scheme, the Scale-Invariant Error loss function is as follows:
Loss = (1/n)·Σ_i d_i^2 − (λ/n^2)·(Σ_i d_i)^2
where Loss is the loss function, n is the number of valid pixels, d_i = log(ŷ_i) − log(y_i) is the log-depth difference at pixel i, ŷ_i and y_i are the predicted depth value and the ground-truth depth value at pixel i, respectively, and λ = 0.5 gives the best results.
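As an illustration, a minimal PyTorch sketch of this scale-invariant loss is given below. It assumes the standard formulation above and a validity mask over pixels that actually have a ground-truth depth; the function name and tensor shapes are illustrative rather than taken from the patent.

```python
import torch

def scale_invariant_loss(pred_depth, gt_depth, lam=0.5, eps=1e-6):
    """Scale-invariant error over the valid (non-zero) ground-truth pixels.

    pred_depth, gt_depth: tensors of shape (B, 1, H, W) holding depth in metres;
    lam is the lambda weight (0.5 in the text).
    """
    mask = gt_depth > eps                                   # pixels that have a LiDAR-derived depth
    d = torch.log(pred_depth[mask] + eps) - torch.log(gt_depth[mask] + eps)
    n = d.numel()
    return (d ** 2).sum() / n - lam * (d.sum() ** 2) / (n ** 2)
```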
In this technical scheme, the training set is established by acquiring several frames of original image data together with depth image data matched to each frame. In general, each original image and the laser radar data matched to it are acquired simultaneously through the vehicle-mounted camera and the vehicle-mounted laser radar; here, "matched" means that the laser radar data and the original image data share the same shooting angle and capture the same scene. Time synchronization is then performed on the laser radar data and the original image data to form a one-to-one correspondence, ensuring good simultaneity, i.e., that the capture times of the laser radar data and of the original image data are consistent. Next, the three-dimensional point cloud in the laser radar data is projected onto the image plane to form a point cloud image. Finally, depth expansion processing is applied to the point cloud image to obtain depth image data matched to the original image data. The training set is then formed from the multiple frames of original image data and their matched depth image data.
Generally, when the original image data and the laser radar data are captured by the vehicle-mounted camera and the vehicle-mounted laser radar, each device records a timestamp for every frame. During time synchronization, time-matched pairs of laser radar data and original image data are obtained simply by finding, for the timestamp of each laser radar frame, the original image frame with the nearest timestamp; this nearest-timestamp matching is sketched below.
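The following is a minimal sketch of the nearest-timestamp matching, assuming the timestamps are available as numeric arrays in seconds; the function name and array layout are illustrative rather than taken from the patent.

```python
import numpy as np

def match_lidar_to_images(lidar_stamps, image_stamps):
    """For each LiDAR frame, find the index of the image whose timestamp is closest."""
    image_stamps = np.asarray(image_stamps, dtype=np.float64)
    pairs = []
    for i, t in enumerate(lidar_stamps):
        j = int(np.argmin(np.abs(image_stamps - t)))   # nearest image timestamp
        pairs.append((i, j))                           # (LiDAR frame index, image frame index)
    return pairs
```

In practice a maximum allowed time offset would typically also be enforced, so that badly desynchronized frame pairs are discarded rather than matched.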
In this embodiment, to project the three-dimensional point cloud in the laser radar data onto the image plane and form the point cloud image, the internal reference (intrinsic) matrix of the vehicle-mounted camera is obtained first (the camera's internal reference matrix is fixed and can usually be obtained directly from the manufacturer), and the rotation-translation matrix between the vehicle-mounted camera and the vehicle-mounted laser radar is calculated by a joint calibration method. Using the internal reference matrix and the rotation-translation matrix, the three-dimensional point cloud in the laser radar data is projected onto the image plane, converting it into a two-dimensional point cloud image in which each pixel value is the depth of the corresponding laser radar point. A sketch of this projection is given below.
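This is a minimal sketch of the projection, assuming the internal reference (intrinsic) matrix K and the jointly calibrated rotation R and translation t from the LiDAR frame to the camera frame; the function and variable names are illustrative rather than taken from the patent.

```python
import numpy as np

def project_lidar_to_image(points_xyz, K, R, t, image_shape):
    """Project LiDAR points (N, 3) into the image plane and rasterise a sparse depth map.

    K: 3x3 camera internal reference (intrinsic) matrix; R (3x3) and t (3,) map
    LiDAR coordinates into camera coordinates (the jointly calibrated extrinsics).
    """
    h, w = image_shape
    cam = points_xyz @ R.T + t                     # LiDAR frame -> camera frame
    cam = cam[cam[:, 2] > 0]                       # keep points in front of the camera
    uvw = cam @ K.T                                # pinhole projection
    u = (uvw[:, 0] / uvw[:, 2]).astype(int)
    v = (uvw[:, 1] / uvw[:, 2]).astype(int)
    z = cam[:, 2]
    valid = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    depth = np.zeros((h, w), dtype=np.float32)
    depth[v[valid], u[valid]] = z[valid]           # pixel value = depth of the LiDAR point
    return depth
```

Where several points fall on the same pixel, a real implementation would normally keep the nearest depth; the sketch simply lets later points overwrite earlier ones.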
Preferably, to perform the depth expansion processing that turns the point cloud image into depth image data matched to the original image data, the point cloud image is usually inverted first. During inversion, the farthest distance is typically set to 100 meters, so D_inv = 100 − D_gt, where D_gt denotes the true depth; for example, if D_gt is 16 m, then D_inv is 84 m. The inverted point cloud image is then given a first kernel dilation using a 5x5 kernel of ones to close small holes. Next, a median filter (kernel size 5) performs a first motion-blur outlier removal on the dilated point cloud image. A second kernel dilation with a 7x7 kernel of ones fills the gaps between holes, and a third kernel dilation with a 15x15 kernel of ones closes large holes. The median filter (kernel size 5) is then applied again for a second motion-blur outlier removal. Finally, a bilateral filter removes the remaining outliers from the point cloud image while preserving local boundary features; when removing outliers with the bilateral filter, the diameter may be set to 5, with the color sigma θ = 0.5 and the spatial sigma θ' = 2. The depth image obtained after bilateral filtering is inverted a second time to yield the depth image data (i.e., the original depth map) matched to the original image data, where the depth is D = 100 − D_inv. This pipeline is sketched below.
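A minimal OpenCV sketch of this depth-expansion pipeline, following the inversion, kernel sizes and filter parameters quoted above, is given below; the function and variable names are illustrative, and details such as the exact kernel shape and validity threshold are assumptions.

```python
import cv2
import numpy as np

def densify_depth(sparse_depth, max_depth=100.0):
    """Depth-expansion sketch for a sparse LiDAR depth map (0 where no return)."""
    d = sparse_depth.astype(np.float32)
    valid = d > 0.1
    d[valid] = max_depth - d[valid]                  # inversion: D_inv = 100 - D_gt

    d = cv2.dilate(d, np.ones((5, 5), np.uint8))     # first kernel dilation: close small holes
    d = cv2.medianBlur(d, 5)                         # first motion-blur outlier removal
    d = cv2.dilate(d, np.ones((7, 7), np.uint8))     # second dilation: fill gaps between holes
    d = cv2.dilate(d, np.ones((15, 15), np.uint8))   # third dilation: close large holes
    d = cv2.medianBlur(d, 5)                         # second motion-blur outlier removal
    d = cv2.bilateralFilter(d, 5, 0.5, 2.0)          # remove outliers, keep local boundaries

    valid = d > 0.1
    d[valid] = max_depth - d[valid]                  # second inversion back to metric depth
    return d
```

Because the map is inverted before dilation, the morphological maximum propagates the nearer (larger inverted) depth into empty pixels, which is the usual intent when densifying foreground objects.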
For ease of understanding, a concrete example of building the training set follows. A single acquisition run might collect 20,000 images and 10,000 point-cloud frames; the data are then cleaned, time-synchronized, the laser radar data converted into point cloud images, the point cloud images depth-expanded, and so on. Assuming 5,000 valid original images and 5,000 corresponding depth images remain after cleaning, each original image and its depth image form one sample pair, and the data are split at a ratio of 8:1:1, giving 4,000 training samples, 500 validation samples, and 500 test samples. A split along these lines is sketched below.
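A small sketch of such an 8:1:1 split is shown below, assuming the samples are already cleaned (image, depth) pairs; the shuffling seed and function name are illustrative assumptions.

```python
import random

def split_samples(samples, ratios=(0.8, 0.1, 0.1), seed=0):
    """Split (image, depth) sample pairs into train/validation/test at the 8:1:1 ratio."""
    samples = list(samples)
    random.Random(seed).shuffle(samples)            # deterministic shuffle before splitting
    n = len(samples)
    n_train = int(ratios[0] * n)
    n_val = int(ratios[1] * n)
    return samples[:n_train], samples[n_train:n_train + n_val], samples[n_train + n_val:]
```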
102. And acquiring original image data through the vehicle-mounted camera.
After the depth estimation model is built, the corresponding predicted depth map can be directly obtained according to the original image data acquired by the vehicle-mounted camera. At this time, the original image data which really needs to be analyzed can be obtained through the vehicle-mounted camera so as to detect the 3D information of the object in the original image data.
103. And acquiring a predicted depth map matched with the original image data by using the depth estimation model.
When the depth estimation model is used to obtain the predicted depth map matched with the original image data, the basic working principle is as follows: DenseNet-121 is adopted as the coding layer to extract the feature parameters from the original image data. The coding layer is then decoded into three branches with different decoding sizes. Relative local structural features at different scales are extracted through the three branches, the outputs of the three branches are connected in series and unified to the input image size, and a series layer is obtained. Finally, a convolution is computed over the series layer and the local structure is analyzed to obtain the predicted depth map corresponding to the original image data, in which the value of each pixel is a depth value.
In this embodiment, to obtain the series layer from the three branches, the encoded dense features are first reduced to H/8; contextual structure information is extracted through a spatial pyramid pooling layer (with dilation rates 3, 6, 12, 18 and 24) and fed to a local plane guidance layer (8x8), whose local geometric structure information is analyzed to generate the estimated depth features of the first branch. The encoded dense features are then reduced to H/4, concatenated with the depth features generated by the first branch, and fed to a local plane guidance layer (4x4) to analyze local geometric structure information, generating the estimated depth features of the second branch. Finally, the encoded dense features are reduced to H/2, concatenated with the depth features generated by the second branch, and fed to a local plane guidance layer (2x2), generating the estimated depth features of the third branch. The estimated depth features of the three branches are connected in series and unified to the input image size to obtain the series layer, which is followed by convolutional layers that finally generate the predicted depth map corresponding to the original image data. A simplified sketch of this three-branch decoder is given below.
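Purely as an illustration, a heavily simplified PyTorch sketch of this three-branch decoder follows. It assumes a BTS-style design: the local plane guidance layers are approximated by plain convolution blocks, the spatial pyramid pooling is omitted, and all channel widths, module names and the choice of DenseNet-121 features are illustrative rather than taken from the patent.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision

class ThreeBranchDepthDecoder(nn.Module):
    """Simplified three-branch monocular depth network (sketch only)."""

    def __init__(self, max_depth=100.0):
        super().__init__()
        # DenseNet-121 feature extractor as the coding layer (no pretrained weights here).
        self.encoder = torchvision.models.densenet121(weights=None).features
        c = 1024
        self.reduce8 = nn.Conv2d(c, 128, 1)   # features brought to H/8
        self.reduce4 = nn.Conv2d(c, 128, 1)   # features brought to H/4
        self.reduce2 = nn.Conv2d(c, 128, 1)   # features brought to H/2
        self.branch8 = nn.Sequential(nn.Conv2d(128, 1, 3, padding=1), nn.Sigmoid())
        self.branch4 = nn.Sequential(nn.Conv2d(129, 1, 3, padding=1), nn.Sigmoid())
        self.branch2 = nn.Sequential(nn.Conv2d(129, 1, 3, padding=1), nn.Sigmoid())
        self.head = nn.Conv2d(3, 1, 3, padding=1)   # convolution over the series layer
        self.max_depth = max_depth

    def forward(self, x):
        h, w = x.shape[-2:]
        feat = self.encoder(x)                                   # (B, 1024, ~H/32, ~W/32)

        f8 = F.interpolate(self.reduce8(feat), size=(h // 8, w // 8), mode='bilinear', align_corners=False)
        d8 = self.branch8(f8)                                    # first-branch depth feature at H/8

        f4 = F.interpolate(self.reduce4(feat), size=(h // 4, w // 4), mode='bilinear', align_corners=False)
        d8_up = F.interpolate(d8, size=(h // 4, w // 4), mode='bilinear', align_corners=False)
        d4 = self.branch4(torch.cat([f4, d8_up], dim=1))         # second branch, conditioned on the first

        f2 = F.interpolate(self.reduce2(feat), size=(h // 2, w // 2), mode='bilinear', align_corners=False)
        d4_up = F.interpolate(d4, size=(h // 2, w // 2), mode='bilinear', align_corners=False)
        d2 = self.branch2(torch.cat([f2, d4_up], dim=1))         # third branch, conditioned on the second

        # Series layer: the three estimated depth features unified to the input size.
        ups = [F.interpolate(d, size=(h, w), mode='bilinear', align_corners=False) for d in (d8, d4, d2)]
        return self.max_depth * torch.sigmoid(self.head(torch.cat(ups, dim=1)))
```

The essential structure, in which each branch reuses the previous branch's depth estimate at a finer resolution before the three outputs are concatenated and reduced to a single depth map, mirrors the description above.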
The characteristic parameters in the present invention include, but are not limited to, image texture, color, and spatial structure.
104. A target object is detected in the raw image data.
In a specific implementation process, the original image data may be labeled by using the two-dimensional bounding box, so that all target objects in the original image data are detected and labeled. The target object referred to herein includes at least cars, trucks, vans, pedestrians, riders, and the like.
Preferably, after the step of establishing the depth estimation model and before the step of detecting target objects in the raw image data, an object detection model can optionally be established and used to detect and label the target objects in the original image data. Specifically, the object detection model is also built on the established training set: an object-detection training set is formed from the detection objects in each raw image of the training set, and the object detection model is trained with a Focal Loss function, using the deep-learning backbone Darknet53 as the feature extraction framework and the detection objects of each original image as training factors. The input of the object detection model is raw image data and the output is the target objects, such as persons, cars or trucks. The object detection model and the depth estimation model are independent of each other and run separately.
The Focal Loss function in this technical scheme is as follows:
FL(p_t) = -α(1 - p_t)^γ · log(p_t);
where p_t is the detection probability, α is the inter-class weighting parameter, and (1 - p_t)^γ is the easy/hard-sample modulation factor; the best results are obtained with α = 0.5 and γ = 2. A sketch of this loss is given below.
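A minimal PyTorch sketch of this focal loss for a binary objectness score is given below, with α = 0.5 and γ = 2 as stated; the function name and the binary formulation are illustrative assumptions, since the detector's actual classification head is not specified here.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.5, gamma=2.0):
    """Focal loss FL(p_t) = -alpha * (1 - p_t)^gamma * log(p_t) for 0/1 targets."""
    p = torch.sigmoid(logits)
    p_t = torch.where(targets > 0.5, p, 1.0 - p)     # probability assigned to the true class
    ce = F.binary_cross_entropy_with_logits(          # equals -log(p_t) per element
        logits, targets.float(), reduction='none')
    return (alpha * (1.0 - p_t) ** gamma * ce).mean()
```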
105. And projecting the target object into the corresponding prediction depth map to generate an anchoring area, and performing 3D reconstruction on the anchoring area to obtain a three-dimensional coordinate value of the target object in a world coordinate system.
In practice, projecting the exact outline of a detected object into the predicted depth map is computationally demanding. Instead, the anchoring area can be generated simply by projecting the two-dimensional bounding boxes produced by the object detection in step 104 into the corresponding predicted depth maps one by one; in other words, the two-dimensional bounding box surrounding the detected object is projected into the predicted depth map in place of the object's contour, which reduces the computational difficulty. 3D reconstruction is then performed on the anchoring area to obtain the three-dimensional coordinates of the detected object in the world coordinate system, as sketched below.
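A minimal sketch of this last step is shown below: the anchoring area (the projected 2-D box) is summarised by a robust depth from the predicted depth map and back-projected through the camera internal reference matrix. The median-depth summary, the output in camera coordinates (the further camera-to-world transform is omitted) and all names are illustrative assumptions.

```python
import numpy as np

def anchor_region_to_3d(pred_depth, box, K):
    """Back-project an anchoring area in the predicted depth map to camera coordinates.

    pred_depth: (H, W) predicted depth map in metres; box: (u1, v1, u2, v2) pixel bounds;
    K: 3x3 internal reference matrix. Returns the 3-D centre of the region.
    """
    u1, v1, u2, v2 = [int(c) for c in box]
    patch = pred_depth[v1:v2, u1:u2]
    z = float(np.median(patch[patch > 0]))         # robust depth of the anchored object
    u_c, v_c = (u1 + u2) / 2.0, (v1 + v2) / 2.0    # box centre in pixels
    x = (u_c - K[0, 2]) * z / K[0, 0]              # X = (u - cx) * Z / fx
    y = (v_c - K[1, 2]) * z / K[1, 1]              # Y = (v - cy) * Z / fy
    return np.array([x, y, z])
```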
Repeated verification shows that the monocular-camera-based 3D object detection method disclosed in this embodiment obtains three-dimensional coordinate information for objects within 100 meters, achieving higher accuracy at lower cost while also clearly improving computational efficiency.
This embodiment provides a 3D object detection method based on a monocular camera that, with the help of the monocular camera and the vehicle-mounted laser radar, accurately detects 3D objects; the entire calculation does not rely on the assumption that the road surface is perfectly flat. Compared with conventional monocular 3D detection schemes, the detection accuracy is clearly improved, relatively accurate reference data can be provided to the driver, driving safety is improved, and the detection cost for 3D objects is significantly reduced, which gives the method great practical value.
It should be understood that the above-described embodiments of the present invention are merely examples given to illustrate the invention clearly and are not intended to limit its embodiments. Other variations and modifications will be apparent to persons skilled in the art in light of the above description; it is neither necessary nor possible to enumerate all embodiments exhaustively. Any modification, equivalent replacement or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the claims of the present invention.

Claims (10)

1. A3D object detection method based on a monocular camera is characterized by comprising the following steps:
establishing a depth estimation model, wherein the depth estimation model is used for acquiring a predicted depth map matched with original image data;
acquiring original image data through a vehicle-mounted camera;
acquiring a predicted depth map matched with original image data by using a depth estimation model;
detecting a target object in the original image data;
and projecting the target object into the corresponding prediction depth map to generate an anchoring area, and performing 3D reconstruction on the anchoring area to obtain a three-dimensional coordinate value of the target object in a world coordinate system.
2. The monocular camera-based 3D object detection method of claim 1, wherein the establishing a depth estimation model step comprises:
acquiring a plurality of frames of original image data and depth image data matched with the original image data, and establishing a training set, wherein each frame of original image data and the depth image data corresponding to each frame of original image data in the training set form a sample;
and taking each sample in the training set as a training factor and training the depth estimation model with a Scale-Invariant Error loss function.
3. The monocular camera-based 3D object detecting method according to claim 2, wherein after the step of establishing the depth estimation model, before the step of detecting the target object in the raw image data, further comprising the step of establishing an object detection model:
training an object detection model with a Focal Loss function, using the deep-learning backbone Darknet53 as the feature extraction framework and each original image in the training set as a training factor, wherein the object detection model is used to detect target objects in the original image data.
4. The monocular camera-based 3D object detection method of claim 3, wherein the Focal Loss function is as follows:
FL(p_t) = -α(1 - p_t)^γ · log(p_t)
wherein p_t is the detection probability, α is the inter-class weighting parameter, (1 - p_t)^γ is the easy/hard-sample modulation factor, and α = 0.5, γ = 2.
5. The 3D object detection method based on a monocular camera according to claim 2, wherein the step of acquiring a plurality of frames of original image data and depth image data matched with each of the original image data and establishing a training set comprises:
simultaneously acquiring a plurality of frames of original image data and laser radar data matched with the original image data; performing time synchronization processing on each laser radar data and each original image data to form a one-to-one corresponding relation;
projecting the three-dimensional point cloud in the laser radar data into an image plane to form a point cloud picture;
respectively carrying out depth expansion processing on the point cloud images to obtain depth image data matched with the original image data;
and establishing a training set by using a plurality of frames of original image data and depth image data matched with each original image data.
6. The monocular camera-based 3D object detection method of claim 5, wherein the projecting the three-dimensional point cloud in the lidar data into the image plane to form a point cloud map comprises: acquiring an internal reference matrix of the vehicle-mounted camera;
calculating a rotation translation matrix between the vehicle-mounted camera and the vehicle-mounted laser radar by a combined calibration method;
and converting the three-dimensional point cloud in the laser radar data into a two-dimensional point cloud picture according to the internal reference matrix and the rotational translation matrix.
7. The 3D object detection method based on the monocular camera as claimed in claim 5, wherein the step of performing the depth expansion processing on the point cloud image to obtain the depth image data matched with the original image data comprises:
inverting the point cloud image;
performing a first kernel dilation on the inverted point cloud image to close small holes;
performing a first motion-blur outlier removal on the point cloud image after the first kernel dilation, using a median filter;
performing a second kernel dilation on the point cloud image after the first outlier removal to fill the gaps between holes;
performing a third kernel dilation on the point cloud image after the second kernel dilation to close large holes; performing a second motion-blur outlier removal on the point cloud image after the third kernel dilation, using a median filter;
and removing outliers from the point cloud image after the second outlier removal with a bilateral filter while preserving local boundary features, then performing a second inversion to obtain depth image data matched with the original image data.
8. The monocular camera-based 3D object detection method of claim 1, wherein the step of obtaining the predicted depth map matching the original image data using the depth estimation model comprises:
extracting feature parameters from the original image data using DenseNet-121 as a coding layer;
decoding the coding layer into three branches, extracting relative local structural features at different scales through the three branches, connecting the outputs of the three branches in series, and unifying their size to the input image size to obtain a series layer; and performing a convolution calculation on the series layer and analyzing the local structure to obtain the predicted depth map corresponding to the original image data.
9. The monocular camera-based 3D object detection method according to claim 8, wherein the step of decoding the coding layer into three branches, extracting relative local structural features at different scales through the three branches, connecting the outputs of the three branches in series, and unifying their size to the input image size to obtain a series layer includes:
reducing the dimension of the coded features to H/8, extracting context structure information through a space pyramid pooling layer, connecting the extracted structure information to a local plane guiding layer, and analyzing local geometric structure information of the local plane guiding layer, so as to generate estimated depth features of a first branch;
reducing the dimension of the coded features to H/4, connecting the coded features in series with the depth features generated by the first branch, and connecting the coded features to a local plane guide layer to analyze the local geometric structure information of the coded features, so that estimated depth features of a second branch are generated;
reducing the dimension of the coded features to H/2, connecting the depth features generated by the second branch in series, and connecting the depth features to a local plane guide layer to analyze the local geometric structure information of the depth features, so as to generate estimated depth features of a third branch;
and connecting the estimated depth features generated by the first branch, the second branch and the third branch in series, and unifying the size of the estimated depth features into the size of the input image to obtain a series layer.
10. The monocular camera-based 3D object detection method of claim 8, wherein the characteristic parameters include image texture, color, and spatial structure.
CN202110056909.9A 2021-01-15 2021-01-15 3D object detection method based on monocular camera Pending CN112883790A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202110056909.9A CN112883790A (en) 2021-01-15 2021-01-15 3D object detection method based on monocular camera
PCT/CN2021/102534 WO2022151664A1 (en) 2021-01-15 2021-06-25 3d object detection method based on monocular camera

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110056909.9A CN112883790A (en) 2021-01-15 2021-01-15 3D object detection method based on monocular camera

Publications (1)

Publication Number Publication Date
CN112883790A true CN112883790A (en) 2021-06-01

Family

ID=76048445

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110056909.9A Pending CN112883790A (en) 2021-01-15 2021-01-15 3D object detection method based on monocular camera

Country Status (2)

Country Link
CN (1) CN112883790A (en)
WO (1) WO2022151664A1 (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113705432A (en) * 2021-08-26 2021-11-26 京东鲲鹏(江苏)科技有限公司 Model training and three-dimensional target detection method, device, equipment and medium
WO2022151664A1 (en) * 2021-01-15 2022-07-21 惠州市德赛西威汽车电子股份有限公司 3d object detection method based on monocular camera
CN114842287A (en) * 2022-03-25 2022-08-02 中国科学院自动化研究所 Monocular three-dimensional target detection model training method and device of depth-guided deformer

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115937325B (en) * 2022-09-27 2023-06-23 上海几何伙伴智能驾驶有限公司 Vehicle-end camera calibration method combined with millimeter wave radar information
CN115546216B (en) * 2022-12-02 2023-03-31 深圳海星智驾科技有限公司 Tray detection method, device, equipment and storage medium
CN115622571B (en) * 2022-12-16 2023-03-10 电子科技大学 Radar target identification method based on data processing

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108229366A (en) * 2017-12-28 2018-06-29 北京航空航天大学 Deep learning vehicle-installed obstacle detection method based on radar and fusing image data
EP3525131A1 (en) * 2018-02-09 2019-08-14 Bayerische Motoren Werke Aktiengesellschaft Methods and apparatuses for object detection in a scene represented by depth data of a range detection sensor and image data of a camera
CN112001958A (en) * 2020-10-28 2020-11-27 浙江浙能技术研究院有限公司 Virtual point cloud three-dimensional target detection method based on supervised monocular depth estimation

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112883790A (en) * 2021-01-15 2021-06-01 惠州市德赛西威汽车电子股份有限公司 3D object detection method based on monocular camera

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108229366A (en) * 2017-12-28 2018-06-29 北京航空航天大学 Deep learning vehicle-installed obstacle detection method based on radar and fusing image data
EP3525131A1 (en) * 2018-02-09 2019-08-14 Bayerische Motoren Werke Aktiengesellschaft Methods and apparatuses for object detection in a scene represented by depth data of a range detection sensor and image data of a camera
CN112001958A (en) * 2020-10-28 2020-11-27 浙江浙能技术研究院有限公司 Virtual point cloud three-dimensional target detection method based on supervised monocular depth estimation

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022151664A1 (en) * 2021-01-15 2022-07-21 惠州市德赛西威汽车电子股份有限公司 3d object detection method based on monocular camera
CN113705432A (en) * 2021-08-26 2021-11-26 京东鲲鹏(江苏)科技有限公司 Model training and three-dimensional target detection method, device, equipment and medium
CN114842287A (en) * 2022-03-25 2022-08-02 中国科学院自动化研究所 Monocular three-dimensional target detection model training method and device of depth-guided deformer
CN114842287B (en) * 2022-03-25 2022-12-06 中国科学院自动化研究所 Monocular three-dimensional target detection model training method and device of depth-guided deformer

Also Published As

Publication number Publication date
WO2022151664A1 (en) 2022-07-21

Similar Documents

Publication Publication Date Title
WO2022151664A1 (en) 3d object detection method based on monocular camera
US10885398B2 (en) Joint 3D object detection and orientation estimation via multimodal fusion
US11615709B2 (en) Image generating apparatus, image generating method, and recording medium
US11527077B2 (en) Advanced driver assist system, method of calibrating the same, and method of detecting object in the same
US11120280B2 (en) Geometry-aware instance segmentation in stereo image capture processes
EP3686775B1 (en) Method for detecting pseudo-3d bounding box based on cnn capable of converting modes according to poses of objects using instance segmentation
CN111369617B (en) 3D target detection method of monocular view based on convolutional neural network
JP7135665B2 (en) VEHICLE CONTROL SYSTEM, VEHICLE CONTROL METHOD AND COMPUTER PROGRAM
CN110969064B (en) Image detection method and device based on monocular vision and storage equipment
CN111627001B (en) Image detection method and device
US20230005278A1 (en) Lane extraction method using projection transformation of three-dimensional point cloud map
KR101483742B1 (en) Lane Detection method for Advanced Vehicle
CN114118252A (en) Vehicle detection method and detection device based on sensor multivariate information fusion
CN114495064A (en) Monocular depth estimation-based vehicle surrounding obstacle early warning method
CN112654998B (en) Lane line detection method and device
CN107220632B (en) Road surface image segmentation method based on normal characteristic
CN116310673A (en) Three-dimensional target detection method based on fusion of point cloud and image features
KR20200040187A (en) Learning method and testing method for monitoring blind spot of vehicle, and learning device and testing device using the same
CN114118247A (en) Anchor-frame-free 3D target detection method based on multi-sensor fusion
CN116189150B (en) Monocular 3D target detection method, device, equipment and medium based on fusion output
US20230109473A1 (en) Vehicle, electronic apparatus, and control method thereof
CN114648639B (en) Target vehicle detection method, system and device
CN116403186A (en) Automatic driving three-dimensional target detection method based on FPN Swin Transformer and Pointernet++
CN116052120A (en) Excavator night object detection method based on image enhancement and multi-sensor fusion
Bharadhwaj et al. Deep learning-based 3D object detection using LiDAR and image data fusion

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination