CN112883790A - 3D object detection method based on monocular camera - Google Patents

3D object detection method based on monocular camera

Info

Publication number
CN112883790A
Authority
CN
China
Prior art keywords
image data
depth
original image
point cloud
object detection
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110056909.9A
Other languages
Chinese (zh)
Inventor
黄梓航
伍小军
周航
刘妮妮
董萌
陈炫翰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huizhou Desay SV Automotive Co Ltd
Original Assignee
Huizhou Desay SV Automotive Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huizhou Desay SV Automotive Co Ltd filed Critical Huizhou Desay SV Automotive Co Ltd
Priority to CN202110056909.9A priority Critical patent/CN112883790A/en
Publication of CN112883790A publication Critical patent/CN112883790A/en
Priority to PCT/CN2021/102534 priority patent/WO2022151664A1/en
Pending legal-status Critical Current


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00Image enhancement or restoration
    • G06T5/70Denoising; Smoothing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/50Depth or shape recovery
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/50Context or environment of the image
    • G06V20/56Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/60Type of objects
    • G06V20/64Three-dimensional objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20024Filtering details
    • G06T2207/20032Median filtering

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention relates to a 3D object detection method based on a monocular camera, comprising the following steps: establishing a depth estimation model, which is used to obtain a predicted depth map matching the original image data; acquiring original image data through a vehicle-mounted camera; obtaining a predicted depth map matching the original image data with the depth estimation model; detecting target objects in the original image data; and projecting each target object into the corresponding predicted depth map to generate an anchoring area, then performing 3D reconstruction on the anchoring area to obtain the three-dimensional coordinates of the target object in the world coordinate system. With this method, three-dimensional coordinate information of an object can be obtained with only a monocular camera and without relying on the assumption that the road surface is perfectly flat; the method is low in cost and high in detection accuracy, provides accurate reference data for the driver, helps improve driving safety, and therefore has significant practical value.

Description

3D object detection method based on monocular camera
Technical Field
The invention relates to the technical field of 3D object detection, in particular to a 3D object detection method based on a monocular camera.
Background
In computer vision, detecting objects of interest and inferring their three-dimensional characteristics is a core problem with wide application. In the last decade in particular, with the rapid development of autonomous driving and mobile robotics, object detection has come to play an extremely important role in perception systems; an accurate and efficient perception system can effectively ensure the safety of the robot and of other moving objects around it. Although two-dimensional object detection has advanced rapidly in unmanned systems in recent years, converting a detected object from the image plane into a real-world pose still needs improvement. Conventional three-dimensional object detection typically relies heavily on depth sensors such as laser radar or millimeter-wave radar, which makes it computationally expensive and costly.
Since more and more vehicles are already equipped with high-definition cameras, performing 3D object detection with a monocular camera to reduce cost is becoming an industry trend. However, existing monocular 3D object detection algorithms fall far short of methods using other sensors (such as laser radar) in both real-time performance and accuracy. This is because existing monocular-camera-based 3D object detection algorithms all rely on the assumption that the ground is flat. Under this assumption, three-dimensional information can be modeled from a two-dimensional information source: because the ground is assumed flat, conventional methods further assume that the bottom of the two-dimensional bounding box of a detected object lies on the ground plane, so once an object is detected, a simple geometric calculation yields the distance between the obstacle and the host vehicle.
However, real road surfaces are rarely perfectly flat, and these conventional methods suffer when the road is curved or uneven. When the ground is assumed flat but is not, curvature of the driving surface leads to inaccurate predictions, and the estimated distance to obstacles may be too high or too low. In either case, inaccurate distance estimates directly and negatively affect vehicle operation, potentially compromising lateral and longitudinal control as well as driving safety and reliability. For example, an underestimated distance may cause the Adaptive Cruise Control (ACC) function and, more critically, the Automatic Emergency Braking (AEB) function to fail to prevent a potential traffic accident. Conversely, an overestimated distance may trigger the ACC or AEB functions when they are not needed, causing discomfort or even injury to occupants and reducing their confidence in the vehicle's ability to operate safely.
Disclosure of Invention
In order to overcome the defects, the invention provides a 3D object detection method based on a monocular camera, which comprises the following steps:
establishing a depth estimation model, wherein the depth estimation model is used for acquiring a predicted depth map matched with original image data;
acquiring original image data through a vehicle-mounted camera;
acquiring a predicted depth map matched with original image data by using a depth estimation model;
detecting a target object in the original image data;
and projecting the target object into the corresponding prediction depth map to generate an anchoring area, and performing 3D reconstruction on the anchoring area to obtain a three-dimensional coordinate value of the target object in a world coordinate system.
Further, the step of establishing the depth estimation model includes:
acquiring a plurality of frames of original image data and depth image data matched with the original image data, and establishing a training set, wherein each frame of original image data and the depth image data corresponding to each frame of original image data in the training set form a sample;
and taking each sample in the training set as a training factor and training the depth estimation model with a Scale-Invariant Error loss function.
Further, after the step of establishing the depth estimation model and before the step of detecting the target object in the raw image data, the method further comprises the step of establishing an object detection model:
training an object detection model with a Focal Loss function, using the deep-learning backbone Darknet53 as the feature extraction framework and each original image in the training set as a training factor, wherein the object detection model is used to detect target objects in the original image data.
Further, the Focal Loss function is as follows:
FL(p_t) = -α(1 - p_t)^γ · log(p_t)
where p_t is the detection probability, α is the inter-class weighting parameter, (1 - p_t)^γ is the easy/hard-sample modulation factor, and α = 0.5, γ = 2.
Further, the step of acquiring a plurality of frames of original image data and depth image data matched with each original image data and establishing a training set includes:
simultaneously acquiring a plurality of frames of original image data and laser radar data matched with the original image data;
performing time synchronization processing on each laser radar data and each original image data to form a one-to-one corresponding relation;
projecting the three-dimensional point cloud in the laser radar data into an image plane to form a point cloud picture;
respectively carrying out depth expansion processing on the point cloud images to obtain depth image data matched with the original image data;
and establishing a training set by using a plurality of frames of original image data and depth image data matched with each original image data.
Further, the step of projecting the three-dimensional point cloud in the lidar data into an image plane to form a point cloud picture includes:
acquiring an internal reference matrix of the vehicle-mounted camera;
calculating a rotation translation matrix between the vehicle-mounted camera and the vehicle-mounted laser radar by a combined calibration method;
and converting the three-dimensional point cloud in the laser radar data into a two-dimensional point cloud picture according to the internal reference matrix and the rotational translation matrix.
Further, the step of performing depth expansion processing on the point cloud image to obtain depth image data matched with the original image data includes:
inverting the point cloud image;
performing a first kernel dilation on the inverted point cloud image to close small holes;
performing a first motion-blur outlier removal on the point cloud image after the first kernel dilation, using a median filter;
performing a second kernel dilation on the point cloud image after the first outlier removal to fill the gaps between holes;
performing a third kernel dilation on the point cloud image after the second kernel dilation to close large holes;
performing a second motion-blur outlier removal on the point cloud image after the third kernel dilation, using a median filter;
and removing outliers from the point cloud image after the second outlier removal with a bilateral filter while preserving local boundary features, then performing a second inversion to obtain depth image data matched with the original image data.
Further, the step of obtaining a predicted depth map matched with the original image data by using the depth estimation model includes:
extracting feature parameters from the original image data using DenseNet-121 as the coding layer;
decoding the coding layer into three branches, extracting relative local structural features at different scales through the three branches, connecting the outputs of the three branches in series, and unifying their size to the input image size to obtain a series layer;
and performing a convolution calculation on the series layer and analyzing the local structure to obtain the predicted depth map corresponding to the original image data.
Further, the step of decoding the coding layer into three branches, extracting relative local structural features at different scales through the three branches, connecting the outputs of the three branches in series, and unifying their size to the input image size to obtain a series layer includes:
reducing the dimension of the coded features to H/8, extracting context structure information through a space pyramid pooling layer, connecting the extracted structure information to a local plane guiding layer, and analyzing local geometric structure information of the local plane guiding layer, so as to generate estimated depth features of a first branch;
reducing the dimension of the coded features to H/4, connecting the coded features in series with the depth features generated by the first branch, and connecting the coded features to a local plane guide layer to analyze the local geometric structure information of the coded features, so that estimated depth features of a second branch are generated;
reducing the dimension of the coded features to H/2, connecting the depth features generated by the second branch in series, and connecting the depth features to a local plane guide layer to analyze the local geometric structure information of the depth features, so as to generate estimated depth features of a third branch;
and connecting the estimated depth features generated by the first branch, the second branch and the third branch in series, and unifying the size of the estimated depth features into the size of the input image to obtain a series layer.
Further, the characteristic parameters include image texture, color and spatial structure.
Compared with the prior art, the invention has the following beneficial effects:
the invention discloses a 3D object detection method based on a monocular camera, which can accurately detect a 3D object only by means of the monocular camera, does not depend on the assumed basis that the road surface is completely flat in the whole calculation process, and has the advantages of obviously improved detection precision, capability of providing more accurate reference data for a driver, contribution to improving driving safety, obvious reduction of the detection cost of the 3D object and very important use value compared with the traditional detection scheme of executing a 3D target by means of the monocular camera.
Drawings
Fig. 1 is a schematic flow chart of a 3D object detection method based on a monocular camera in embodiment 1.
Fig. 2 is a schematic diagram of a training set establishment process in embodiment 1.
Fig. 3 is a schematic diagram of a 3D object detection method based on a monocular camera in embodiment 1.
Fig. 4 is a schematic flowchart of a specific process of obtaining a predicted depth map by using a depth estimation model in embodiment 1.
Fig. 5 is a diagram illustrating original image data and annotation information in embodiment 1.
Fig. 6 is a diagram illustrating a predicted depth map and an anchor region in embodiment 1.
The drawings are for illustrative purposes only and are not to be construed as limiting the patent; for purposes of better illustrating the embodiments, certain features of the drawings may be omitted, enlarged or reduced and do not represent actual dimensions; it will be understood by those skilled in the art that certain well-known structures in the drawings and descriptions thereof may be omitted; the same or similar reference numerals correspond to the same or similar parts; the terms describing positional relationships in the drawings are for illustrative purposes only and are not to be construed as limiting the patent.
Detailed Description
The following detailed description of the preferred embodiments of the present invention is provided to enable those skilled in the art to more clearly understand the advantages and features of the present invention and to clearly define the scope of the present invention.
Example 1
This embodiment provides a 3D object detection method based on a monocular camera, implemented mainly with a vehicle-mounted camera and a vehicle-mounted laser radar. One or more vehicle-mounted cameras and laser radars may be used; the number is not limited. The vehicle-mounted camera and the vehicle-mounted laser radar are mounted on the same side of the test vehicle, as close to each other as possible, so that the original image data and the laser radar data are captured from the same viewing angle.
As shown in fig. 1 to 6, a 3D object detection method based on a monocular camera includes the following steps:
101. and establishing a depth estimation model, wherein the depth estimation model is used for obtaining a prediction depth map matched with the original image data.
The depth estimation model is established mainly so that a predicted depth map matching the original image data can be obtained quickly. To establish the depth estimation model, several frames of original image data and the depth image data matched to them are first acquired, and a training set is built; each frame of original image data and its corresponding depth image data form one sample. Each sample in the training set is then used as a training factor, and the depth estimation model is trained with a Scale-Invariant Error loss function.
The depth estimation model is used to obtain a predicted depth map that matches the original image data. Simply put, a prediction model is trained on the training set with the loss function to obtain the final depth estimation model: its input is the original image data, and it directly outputs the corresponding predicted depth map from the feature parameters of that image. The pixel values in the predicted depth map represent the distance between objects and the vehicle, so the depth estimation model is in effect a distance-measuring model. In this technical scheme, the Scale-Invariant Error loss function is as follows:
Loss = (1/n)·Σ_i d_i^2 − (λ/n^2)·(Σ_i d_i)^2
where Loss is the loss function, n is the number of valid pixels, d_i = log(ŷ_i) − log(y_i) is the log-depth difference at pixel i, ŷ_i and y_i are the predicted depth value and the ground-truth depth value at pixel i, respectively, and λ = 0.5 gives the best results.
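As an illustration, a minimal PyTorch sketch of this scale-invariant loss is given below. It assumes the standard formulation above and a validity mask over pixels that actually have a ground-truth depth; the function name and tensor shapes are illustrative rather than taken from the patent.

```python
import torch

def scale_invariant_loss(pred_depth, gt_depth, lam=0.5, eps=1e-6):
    """Scale-invariant error over the valid (non-zero) ground-truth pixels.

    pred_depth, gt_depth: tensors of shape (B, 1, H, W) holding depth in metres;
    lam is the lambda weight (0.5 in the text).
    """
    mask = gt_depth > eps                                   # pixels that have a LiDAR-derived depth
    d = torch.log(pred_depth[mask] + eps) - torch.log(gt_depth[mask] + eps)
    n = d.numel()
    return (d ** 2).sum() / n - lam * (d.sum() ** 2) / (n ** 2)
```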
In this technical scheme, the training set is established by acquiring several frames of original image data together with depth image data matched to each frame. In general, each original image and the laser radar data matched to it are acquired simultaneously through the vehicle-mounted camera and the vehicle-mounted laser radar; here, "matched" means that the laser radar data and the original image data share the same shooting angle and capture the same scene. Time synchronization is then performed on the laser radar data and the original image data to form a one-to-one correspondence, ensuring good simultaneity, i.e., that the capture times of the laser radar data and of the original image data are consistent. Next, the three-dimensional point cloud in the laser radar data is projected onto the image plane to form a point cloud image. Finally, depth expansion processing is applied to the point cloud image to obtain depth image data matched to the original image data. The training set is then formed from the multiple frames of original image data and their matched depth image data.
Generally, when the original image data and the laser radar data are captured by the vehicle-mounted camera and the vehicle-mounted laser radar, each device records a timestamp for every frame. During time synchronization, time-matched pairs of laser radar data and original image data are obtained simply by finding, for the timestamp of each laser radar frame, the original image frame with the nearest timestamp; this nearest-timestamp matching is sketched below.
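The following is a minimal sketch of the nearest-timestamp matching, assuming the timestamps are available as numeric arrays in seconds; the function name and array layout are illustrative rather than taken from the patent.

```python
import numpy as np

def match_lidar_to_images(lidar_stamps, image_stamps):
    """For each LiDAR frame, find the index of the image whose timestamp is closest."""
    image_stamps = np.asarray(image_stamps, dtype=np.float64)
    pairs = []
    for i, t in enumerate(lidar_stamps):
        j = int(np.argmin(np.abs(image_stamps - t)))   # nearest image timestamp
        pairs.append((i, j))                           # (LiDAR frame index, image frame index)
    return pairs
```

In practice a maximum allowed time offset would typically also be enforced, so that badly desynchronized frame pairs are discarded rather than matched.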
In this embodiment, to project the three-dimensional point cloud in the laser radar data onto the image plane and form the point cloud image, the internal reference (intrinsic) matrix of the vehicle-mounted camera is obtained first (the camera's internal reference matrix is fixed and can usually be obtained directly from the manufacturer), and the rotation-translation matrix between the vehicle-mounted camera and the vehicle-mounted laser radar is calculated by a joint calibration method. Using the internal reference matrix and the rotation-translation matrix, the three-dimensional point cloud in the laser radar data is projected onto the image plane, converting it into a two-dimensional point cloud image in which each pixel value is the depth of the corresponding laser radar point. A sketch of this projection is given below.
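This is a minimal sketch of the projection, assuming the internal reference (intrinsic) matrix K and the jointly calibrated rotation R and translation t from the LiDAR frame to the camera frame; the function and variable names are illustrative rather than taken from the patent.

```python
import numpy as np

def project_lidar_to_image(points_xyz, K, R, t, image_shape):
    """Project LiDAR points (N, 3) into the image plane and rasterise a sparse depth map.

    K: 3x3 camera internal reference (intrinsic) matrix; R (3x3) and t (3,) map
    LiDAR coordinates into camera coordinates (the jointly calibrated extrinsics).
    """
    h, w = image_shape
    cam = points_xyz @ R.T + t                     # LiDAR frame -> camera frame
    cam = cam[cam[:, 2] > 0]                       # keep points in front of the camera
    uvw = cam @ K.T                                # pinhole projection
    u = (uvw[:, 0] / uvw[:, 2]).astype(int)
    v = (uvw[:, 1] / uvw[:, 2]).astype(int)
    z = cam[:, 2]
    valid = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    depth = np.zeros((h, w), dtype=np.float32)
    depth[v[valid], u[valid]] = z[valid]           # pixel value = depth of the LiDAR point
    return depth
```

Where several points fall on the same pixel, a real implementation would normally keep the nearest depth; the sketch simply lets later points overwrite earlier ones.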
Preferably, to perform the depth expansion processing that turns the point cloud image into depth image data matched to the original image data, the point cloud image is usually inverted first. During inversion, the farthest distance is typically set to 100 meters, so D_inv = 100 − D_gt, where D_gt denotes the true depth; for example, if D_gt is 16 m, then D_inv is 84 m. The inverted point cloud image is then given a first kernel dilation using a 5x5 kernel of ones to close small holes. Next, a median filter (kernel size 5) performs a first motion-blur outlier removal on the dilated point cloud image. A second kernel dilation with a 7x7 kernel of ones fills the gaps between holes, and a third kernel dilation with a 15x15 kernel of ones closes large holes. The median filter (kernel size 5) is then applied again for a second motion-blur outlier removal. Finally, a bilateral filter removes the remaining outliers from the point cloud image while preserving local boundary features; when removing outliers with the bilateral filter, the diameter may be set to 5, with the color sigma θ = 0.5 and the spatial sigma θ' = 2. The depth image obtained after bilateral filtering is inverted a second time to yield the depth image data (i.e., the original depth map) matched to the original image data, where the depth is D = 100 − D_inv. This pipeline is sketched below.
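A minimal OpenCV sketch of this depth-expansion pipeline, following the inversion, kernel sizes and filter parameters quoted above, is given below; the function and variable names are illustrative, and details such as the exact kernel shape and validity threshold are assumptions.

```python
import cv2
import numpy as np

def densify_depth(sparse_depth, max_depth=100.0):
    """Depth-expansion sketch for a sparse LiDAR depth map (0 where no return)."""
    d = sparse_depth.astype(np.float32)
    valid = d > 0.1
    d[valid] = max_depth - d[valid]                  # inversion: D_inv = 100 - D_gt

    d = cv2.dilate(d, np.ones((5, 5), np.uint8))     # first kernel dilation: close small holes
    d = cv2.medianBlur(d, 5)                         # first motion-blur outlier removal
    d = cv2.dilate(d, np.ones((7, 7), np.uint8))     # second dilation: fill gaps between holes
    d = cv2.dilate(d, np.ones((15, 15), np.uint8))   # third dilation: close large holes
    d = cv2.medianBlur(d, 5)                         # second motion-blur outlier removal
    d = cv2.bilateralFilter(d, 5, 0.5, 2.0)          # remove outliers, keep local boundaries

    valid = d > 0.1
    d[valid] = max_depth - d[valid]                  # second inversion back to metric depth
    return d
```

Because the map is inverted before dilation, the morphological maximum propagates the nearer (larger inverted) depth into empty pixels, which is the usual intent when densifying foreground objects.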
For ease of understanding, a concrete example of building the training set follows. A single acquisition run might collect 20,000 images and 10,000 point-cloud frames; the data are then cleaned, time-synchronized, the laser radar data converted into point cloud images, the point cloud images depth-expanded, and so on. Assuming 5,000 valid original images and 5,000 corresponding depth images remain after cleaning, each original image and its depth image form one sample pair, and the data are split at a ratio of 8:1:1, giving 4,000 training samples, 500 validation samples, and 500 test samples. A split along these lines is sketched below.
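A small sketch of such an 8:1:1 split is shown below, assuming the samples are already cleaned (image, depth) pairs; the shuffling seed and function name are illustrative assumptions.

```python
import random

def split_samples(samples, ratios=(0.8, 0.1, 0.1), seed=0):
    """Split (image, depth) sample pairs into train/validation/test at the 8:1:1 ratio."""
    samples = list(samples)
    random.Random(seed).shuffle(samples)            # deterministic shuffle before splitting
    n = len(samples)
    n_train = int(ratios[0] * n)
    n_val = int(ratios[1] * n)
    return samples[:n_train], samples[n_train:n_train + n_val], samples[n_train + n_val:]
```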
102. And acquiring original image data through the vehicle-mounted camera.
After the depth estimation model is built, the corresponding predicted depth map can be directly obtained according to the original image data acquired by the vehicle-mounted camera. At this time, the original image data which really needs to be analyzed can be obtained through the vehicle-mounted camera so as to detect the 3D information of the object in the original image data.
103. And acquiring a predicted depth map matched with the original image data by using the depth estimation model.
When the depth estimation model is used to obtain the predicted depth map matched with the original image data, the basic working principle is as follows: DenseNet-121 is adopted as the coding layer to extract the feature parameters from the original image data. The coding layer is then decoded into three branches with different decoding sizes. Relative local structural features at different scales are extracted through the three branches, the outputs of the three branches are connected in series and unified to the input image size, and a series layer is obtained. Finally, a convolution is computed over the series layer and the local structure is analyzed to obtain the predicted depth map corresponding to the original image data, in which the value of each pixel is a depth value.
In this embodiment, to obtain the series layer from the three branches, the encoded dense features are first reduced to H/8; contextual structure information is extracted through a spatial pyramid pooling layer (with dilation rates 3, 6, 12, 18 and 24) and fed to a local plane guidance layer (8x8), whose local geometric structure information is analyzed to generate the estimated depth features of the first branch. The encoded dense features are then reduced to H/4, concatenated with the depth features generated by the first branch, and fed to a local plane guidance layer (4x4) to analyze local geometric structure information, generating the estimated depth features of the second branch. Finally, the encoded dense features are reduced to H/2, concatenated with the depth features generated by the second branch, and fed to a local plane guidance layer (2x2), generating the estimated depth features of the third branch. The estimated depth features of the three branches are connected in series and unified to the input image size to obtain the series layer, which is followed by convolutional layers that finally generate the predicted depth map corresponding to the original image data. A simplified sketch of this three-branch decoder is given below.
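Purely as an illustration, a heavily simplified PyTorch sketch of this three-branch decoder follows. It assumes a BTS-style design: the local plane guidance layers are approximated by plain convolution blocks, the spatial pyramid pooling is omitted, and all channel widths, module names and the choice of DenseNet-121 features are illustrative rather than taken from the patent.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision

class ThreeBranchDepthDecoder(nn.Module):
    """Simplified three-branch monocular depth network (sketch only)."""

    def __init__(self, max_depth=100.0):
        super().__init__()
        # DenseNet-121 feature extractor as the coding layer (no pretrained weights here).
        self.encoder = torchvision.models.densenet121(weights=None).features
        c = 1024
        self.reduce8 = nn.Conv2d(c, 128, 1)   # features brought to H/8
        self.reduce4 = nn.Conv2d(c, 128, 1)   # features brought to H/4
        self.reduce2 = nn.Conv2d(c, 128, 1)   # features brought to H/2
        self.branch8 = nn.Sequential(nn.Conv2d(128, 1, 3, padding=1), nn.Sigmoid())
        self.branch4 = nn.Sequential(nn.Conv2d(129, 1, 3, padding=1), nn.Sigmoid())
        self.branch2 = nn.Sequential(nn.Conv2d(129, 1, 3, padding=1), nn.Sigmoid())
        self.head = nn.Conv2d(3, 1, 3, padding=1)   # convolution over the series layer
        self.max_depth = max_depth

    def forward(self, x):
        h, w = x.shape[-2:]
        feat = self.encoder(x)                                   # (B, 1024, ~H/32, ~W/32)

        f8 = F.interpolate(self.reduce8(feat), size=(h // 8, w // 8), mode='bilinear', align_corners=False)
        d8 = self.branch8(f8)                                    # first-branch depth feature at H/8

        f4 = F.interpolate(self.reduce4(feat), size=(h // 4, w // 4), mode='bilinear', align_corners=False)
        d8_up = F.interpolate(d8, size=(h // 4, w // 4), mode='bilinear', align_corners=False)
        d4 = self.branch4(torch.cat([f4, d8_up], dim=1))         # second branch, conditioned on the first

        f2 = F.interpolate(self.reduce2(feat), size=(h // 2, w // 2), mode='bilinear', align_corners=False)
        d4_up = F.interpolate(d4, size=(h // 2, w // 2), mode='bilinear', align_corners=False)
        d2 = self.branch2(torch.cat([f2, d4_up], dim=1))         # third branch, conditioned on the second

        # Series layer: the three estimated depth features unified to the input size.
        ups = [F.interpolate(d, size=(h, w), mode='bilinear', align_corners=False) for d in (d8, d4, d2)]
        return self.max_depth * torch.sigmoid(self.head(torch.cat(ups, dim=1)))
```

The essential structure, in which each branch reuses the previous branch's depth estimate at a finer resolution before the three outputs are concatenated and reduced to a single depth map, mirrors the description above.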
The characteristic parameters in the present invention include, but are not limited to, image texture, color, and spatial structure.
104. A target object is detected in the raw image data.
In a specific implementation process, the original image data may be labeled by using the two-dimensional bounding box, so that all target objects in the original image data are detected and labeled. The target object referred to herein includes at least cars, trucks, vans, pedestrians, riders, and the like.
Preferably, after the step of establishing the depth estimation model and before the step of detecting target objects in the raw image data, an object detection model can optionally be established and used to detect and label the target objects in the original image data. Specifically, the object detection model is also built on the established training set: an object-detection training set is formed from the detection objects in each raw image of the training set, and the object detection model is trained with a Focal Loss function, using the deep-learning backbone Darknet53 as the feature extraction framework and the detection objects of each original image as training factors. The input of the object detection model is raw image data and the output is the target objects, such as persons, cars or trucks. The object detection model and the depth estimation model are independent of each other and run separately.
The Focal Loss function in this technical scheme is as follows:
FL(p_t) = -α(1 - p_t)^γ · log(p_t);
where p_t is the detection probability, α is the inter-class weighting parameter, and (1 - p_t)^γ is the easy/hard-sample modulation factor; the best results are obtained with α = 0.5 and γ = 2. A sketch of this loss is given below.
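A minimal PyTorch sketch of this focal loss for a binary objectness score is given below, with α = 0.5 and γ = 2 as stated; the function name and the binary formulation are illustrative assumptions, since the detector's actual classification head is not specified here.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.5, gamma=2.0):
    """Focal loss FL(p_t) = -alpha * (1 - p_t)^gamma * log(p_t) for 0/1 targets."""
    p = torch.sigmoid(logits)
    p_t = torch.where(targets > 0.5, p, 1.0 - p)     # probability assigned to the true class
    ce = F.binary_cross_entropy_with_logits(          # equals -log(p_t) per element
        logits, targets.float(), reduction='none')
    return (alpha * (1.0 - p_t) ** gamma * ce).mean()
```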
105. And projecting the target object into the corresponding prediction depth map to generate an anchoring area, and performing 3D reconstruction on the anchoring area to obtain a three-dimensional coordinate value of the target object in a world coordinate system.
In practice, projecting the exact outline of a detected object into the predicted depth map is computationally demanding. Instead, the anchoring area can be generated simply by projecting the two-dimensional bounding boxes produced by the object detection in step 104 into the corresponding predicted depth maps one by one; in other words, the two-dimensional bounding box surrounding the detected object is projected into the predicted depth map in place of the object's contour, which reduces the computational difficulty. 3D reconstruction is then performed on the anchoring area to obtain the three-dimensional coordinates of the detected object in the world coordinate system, as sketched below.
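A minimal sketch of this last step is shown below: the anchoring area (the projected 2-D box) is summarised by a robust depth from the predicted depth map and back-projected through the camera internal reference matrix. The median-depth summary, the output in camera coordinates (the further camera-to-world transform is omitted) and all names are illustrative assumptions.

```python
import numpy as np

def anchor_region_to_3d(pred_depth, box, K):
    """Back-project an anchoring area in the predicted depth map to camera coordinates.

    pred_depth: (H, W) predicted depth map in metres; box: (u1, v1, u2, v2) pixel bounds;
    K: 3x3 internal reference matrix. Returns the 3-D centre of the region.
    """
    u1, v1, u2, v2 = [int(c) for c in box]
    patch = pred_depth[v1:v2, u1:u2]
    z = float(np.median(patch[patch > 0]))         # robust depth of the anchored object
    u_c, v_c = (u1 + u2) / 2.0, (v1 + v2) / 2.0    # box centre in pixels
    x = (u_c - K[0, 2]) * z / K[0, 0]              # X = (u - cx) * Z / fx
    y = (v_c - K[1, 2]) * z / K[1, 1]              # Y = (v - cy) * Z / fy
    return np.array([x, y, z])
```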
Repeated verification shows that the monocular-camera-based 3D object detection method disclosed in this embodiment obtains three-dimensional coordinate information for objects within 100 meters, achieving higher accuracy at lower cost while also clearly improving computational efficiency.
This embodiment provides a 3D object detection method based on a monocular camera that, with the help of the monocular camera and the vehicle-mounted laser radar, accurately detects 3D objects; the entire calculation does not rely on the assumption that the road surface is perfectly flat. Compared with conventional monocular 3D detection schemes, the detection accuracy is clearly improved, relatively accurate reference data can be provided to the driver, driving safety is improved, and the detection cost for 3D objects is significantly reduced, which gives the method great practical value.
It should be understood that the above-described embodiments of the present invention are merely examples given to illustrate the invention clearly and are not intended to limit its embodiments. Other variations and modifications will be apparent to persons skilled in the art in light of the above description; it is neither necessary nor possible to enumerate all embodiments exhaustively. Any modification, equivalent replacement or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the claims of the present invention.

Claims (10)

1. A3D object detection method based on a monocular camera is characterized by comprising the following steps:
establishing a depth estimation model, wherein the depth estimation model is used for acquiring a predicted depth map matched with original image data;
acquiring original image data through a vehicle-mounted camera;
acquiring a predicted depth map matched with original image data by using a depth estimation model;
detecting a target object in the original image data;
and projecting the target object into the corresponding prediction depth map to generate an anchoring area, and performing 3D reconstruction on the anchoring area to obtain a three-dimensional coordinate value of the target object in a world coordinate system.
2. The monocular camera-based 3D object detection method of claim 1, wherein the establishing a depth estimation model step comprises:
acquiring a plurality of frames of original image data and depth image data matched with the original image data, and establishing a training set, wherein each frame of original image data and the depth image data corresponding to each frame of original image data in the training set form a sample;
and taking each sample in the training set as a training factor and training the depth estimation model with a Scale-Invariant Error loss function.
3. The monocular camera-based 3D object detecting method according to claim 2, wherein after the step of establishing the depth estimation model, before the step of detecting the target object in the raw image data, further comprising the step of establishing an object detection model:
training an object detection model with a Focal Loss function, using the deep-learning backbone Darknet53 as the feature extraction framework and each original image in the training set as a training factor, wherein the object detection model is used to detect target objects in the original image data.
4. The monocular camera-based 3D object detection method of claim 3, wherein the Focal Loss function is as follows:
FL(p_t) = -α(1 - p_t)^γ · log(p_t)
wherein p_t is the detection probability, α is the inter-class weighting parameter, (1 - p_t)^γ is the easy/hard-sample modulation factor, and α = 0.5, γ = 2.
5. The 3D object detection method based on a monocular camera according to claim 2, wherein the step of acquiring a plurality of frames of original image data and depth image data matched with each of the original image data and establishing a training set comprises:
simultaneously acquiring a plurality of frames of original image data and laser radar data matched with the original image data; performing time synchronization processing on each laser radar data and each original image data to form a one-to-one corresponding relation;
projecting the three-dimensional point cloud in the laser radar data into an image plane to form a point cloud picture;
respectively carrying out depth expansion processing on the point cloud images to obtain depth image data matched with the original image data;
and establishing a training set by using a plurality of frames of original image data and depth image data matched with each original image data.
6. The monocular camera-based 3D object detection method of claim 5, wherein the projecting the three-dimensional point cloud in the lidar data into the image plane to form a point cloud map comprises: acquiring an internal reference matrix of the vehicle-mounted camera;
calculating a rotation translation matrix between the vehicle-mounted camera and the vehicle-mounted laser radar by a combined calibration method;
and converting the three-dimensional point cloud in the laser radar data into a two-dimensional point cloud picture according to the internal reference matrix and the rotational translation matrix.
7. The 3D object detection method based on the monocular camera as claimed in claim 5, wherein the step of performing the depth expansion processing on the point cloud image to obtain the depth image data matched with the original image data comprises:
inverting the point cloud image;
performing a first kernel dilation on the inverted point cloud image to close small holes;
performing a first motion-blur outlier removal on the point cloud image after the first kernel dilation, using a median filter;
performing a second kernel dilation on the point cloud image after the first outlier removal to fill the gaps between holes;
performing a third kernel dilation on the point cloud image after the second kernel dilation to close large holes; performing a second motion-blur outlier removal on the point cloud image after the third kernel dilation, using a median filter;
and removing outliers from the point cloud image after the second outlier removal with a bilateral filter while preserving local boundary features, then performing a second inversion to obtain depth image data matched with the original image data.
8. The monocular camera-based 3D object detection method of claim 1, wherein the step of obtaining the predicted depth map matching the original image data using the depth estimation model comprises:
extracting feature parameters from the original image data using DenseNet-121 as a coding layer;
decoding the coding layer into three branches, extracting relative local structural features at different scales through the three branches, connecting the outputs of the three branches in series, and unifying their size to the input image size to obtain a series layer; and performing a convolution calculation on the series layer and analyzing the local structure to obtain the predicted depth map corresponding to the original image data.
9. The monocular camera-based 3D object detection method according to claim 8, wherein the step of decoding the coding layer into three branches, extracting relative local structural features at different scales through the three branches, connecting the outputs of the three branches in series, and unifying their size to the input image size to obtain a series layer includes:
reducing the dimension of the coded features to H/8, extracting context structure information through a space pyramid pooling layer, connecting the extracted structure information to a local plane guiding layer, and analyzing local geometric structure information of the local plane guiding layer, so as to generate estimated depth features of a first branch;
reducing the dimension of the coded features to H/4, connecting the coded features in series with the depth features generated by the first branch, and connecting the coded features to a local plane guide layer to analyze the local geometric structure information of the coded features, so that estimated depth features of a second branch are generated;
reducing the dimension of the coded features to H/2, connecting the depth features generated by the second branch in series, and connecting the depth features to a local plane guide layer to analyze the local geometric structure information of the depth features, so as to generate estimated depth features of a third branch;
and connecting the estimated depth features generated by the first branch, the second branch and the third branch in series, and unifying the size of the estimated depth features into the size of the input image to obtain a series layer.
10. The monocular camera-based 3D object detection method of claim 8, wherein the characteristic parameters include image texture, color, and spatial structure.
CN202110056909.9A 2021-01-15 2021-01-15 3D object detection method based on monocular camera Pending CN112883790A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202110056909.9A CN112883790A (en) 2021-01-15 2021-01-15 3D object detection method based on monocular camera
PCT/CN2021/102534 WO2022151664A1 (en) 2021-01-15 2021-06-25 3d object detection method based on monocular camera

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110056909.9A CN112883790A (en) 2021-01-15 2021-01-15 3D object detection method based on monocular camera

Publications (1)

Publication Number Publication Date
CN112883790A true CN112883790A (en) 2021-06-01

Family

ID=76048445

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110056909.9A Pending CN112883790A (en) 2021-01-15 2021-01-15 3D object detection method based on monocular camera

Country Status (2)

Country Link
CN (1) CN112883790A (en)
WO (1) WO2022151664A1 (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113705432A (en) * 2021-08-26 2021-11-26 京东鲲鹏(江苏)科技有限公司 Model training and three-dimensional target detection method, device, equipment and medium
WO2022151664A1 (en) * 2021-01-15 2022-07-21 惠州市德赛西威汽车电子股份有限公司 3d object detection method based on monocular camera
CN114842287A (en) * 2022-03-25 2022-08-02 中国科学院自动化研究所 Monocular three-dimensional target detection model training method and device of depth-guided deformer

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115937325B (en) * 2022-09-27 2023-06-23 上海几何伙伴智能驾驶有限公司 Vehicle-end camera calibration method combined with millimeter wave radar information
CN115546216B (en) * 2022-12-02 2023-03-31 深圳海星智驾科技有限公司 Tray detection method, device, equipment and storage medium
CN115622571B (en) * 2022-12-16 2023-03-10 电子科技大学 Radar target identification method based on data processing

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108229366A (en) * 2017-12-28 2018-06-29 北京航空航天大学 Deep learning vehicle-installed obstacle detection method based on radar and fusing image data
EP3525131A1 (en) * 2018-02-09 2019-08-14 Bayerische Motoren Werke Aktiengesellschaft Methods and apparatuses for object detection in a scene represented by depth data of a range detection sensor and image data of a camera
CN112001958A (en) * 2020-10-28 2020-11-27 浙江浙能技术研究院有限公司 Virtual point cloud three-dimensional target detection method based on supervised monocular depth estimation

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112883790A (en) * 2021-01-15 2021-06-01 惠州市德赛西威汽车电子股份有限公司 3D object detection method based on monocular camera

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108229366A (en) * 2017-12-28 2018-06-29 北京航空航天大学 Deep learning vehicle-installed obstacle detection method based on radar and fusing image data
EP3525131A1 (en) * 2018-02-09 2019-08-14 Bayerische Motoren Werke Aktiengesellschaft Methods and apparatuses for object detection in a scene represented by depth data of a range detection sensor and image data of a camera
CN112001958A (en) * 2020-10-28 2020-11-27 浙江浙能技术研究院有限公司 Virtual point cloud three-dimensional target detection method based on supervised monocular depth estimation

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022151664A1 (en) * 2021-01-15 2022-07-21 惠州市德赛西威汽车电子股份有限公司 3d object detection method based on monocular camera
CN113705432A (en) * 2021-08-26 2021-11-26 京东鲲鹏(江苏)科技有限公司 Model training and three-dimensional target detection method, device, equipment and medium
CN114842287A (en) * 2022-03-25 2022-08-02 中国科学院自动化研究所 Monocular three-dimensional target detection model training method and device of depth-guided deformer
CN114842287B (en) * 2022-03-25 2022-12-06 中国科学院自动化研究所 Monocular three-dimensional target detection model training method and device of depth-guided deformer

Also Published As

Publication number Publication date
WO2022151664A1 (en) 2022-07-21

Similar Documents

Publication Publication Date Title
WO2022151664A1 (en) 3d object detection method based on monocular camera
US10885398B2 (en) Joint 3D object detection and orientation estimation via multimodal fusion
US11615709B2 (en) Image generating apparatus, image generating method, and recording medium
US11527077B2 (en) Advanced driver assist system, method of calibrating the same, and method of detecting object in the same
US11120280B2 (en) Geometry-aware instance segmentation in stereo image capture processes
EP3686775B1 (en) Method for detecting pseudo-3d bounding box based on cnn capable of converting modes according to poses of objects using instance segmentation
CN111369617B (en) 3D target detection method of monocular view based on convolutional neural network
JP7135665B2 (en) VEHICLE CONTROL SYSTEM, VEHICLE CONTROL METHOD AND COMPUTER PROGRAM
CN110969064B (en) Image detection method and device based on monocular vision and storage equipment
CN111627001B (en) Image detection method and device
US20230005278A1 (en) Lane extraction method using projection transformation of three-dimensional point cloud map
KR101483742B1 (en) Lane Detection method for Advanced Vehicle
CN114118252A (en) Vehicle detection method and detection device based on sensor multivariate information fusion
CN114495064A (en) Monocular depth estimation-based vehicle surrounding obstacle early warning method
CN112654998B (en) Lane line detection method and device
CN107220632B (en) Road surface image segmentation method based on normal characteristic
CN116310673A (en) Three-dimensional target detection method based on fusion of point cloud and image features
KR20200040187A (en) Learning method and testing method for monitoring blind spot of vehicle, and learning device and testing device using the same
CN114118247A (en) Anchor-frame-free 3D target detection method based on multi-sensor fusion
CN116189150B (en) Monocular 3D target detection method, device, equipment and medium based on fusion output
US20230109473A1 (en) Vehicle, electronic apparatus, and control method thereof
CN114648639B (en) Target vehicle detection method, system and device
CN116403186A (en) Automatic driving three-dimensional target detection method based on FPN Swin Transformer and Pointernet++
CN116052120A (en) Excavator night object detection method based on image enhancement and multi-sensor fusion
Bharadhwaj et al. Deep learning-based 3D object detection using LiDAR and image data fusion

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination