CN113139602A - 3D target detection method and system based on monocular camera and laser radar fusion - Google Patents
- Publication number: CN113139602A
- Application number: CN202110447403.0A
- Authority: CN (China)
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis)
Classifications
- G06F18/25 — Pattern recognition; fusion techniques
- G01S17/66 — Tracking systems using electromagnetic waves other than radio waves
- G01S17/86 — Combinations of lidar systems with systems other than lidar, radar or sonar, e.g. with direction finders
- G01S17/931 — Lidar systems specially adapted for anti-collision purposes of land vehicles
- G06F18/2415 — Classification techniques based on parametric or probabilistic models, e.g. based on likelihood ratio
- G06T7/136 — Segmentation; edge detection involving thresholding
- G06T7/181 — Segmentation; edge detection involving edge growing or edge linking
- G06T7/194 — Segmentation involving foreground-background segmentation
- G06T7/66 — Analysis of geometric attributes of image moments or centre of gravity
- G06T7/80 — Analysis of captured images to determine intrinsic or extrinsic camera parameters (camera calibration)
- G06T2207/10028 — Range image; depth image; 3D point clouds
- G06T2207/10032 — Satellite or aerial image; remote sensing
- G06T2207/10044 — Radar image
- G06T2207/20081 — Training; learning
- G06T2207/20084 — Artificial neural networks [ANN]
- G06V2201/07 — Target detection
Abstract
The invention relates to a 3D target detection method and system based on the fusion of a monocular camera and a laser radar. The method comprises: acquiring an image captured by the monocular camera; computing an instance segmentation score for each pixel of the image with an instance segmentation network; acquiring 3D point cloud data from the laser radar; fusing the instance segmentation scores with the 3D point cloud data to obtain fused 3D point cloud data; and performing 3D target detection on the fused 3D point cloud data with a point cloud deep-learning model to obtain the 3D bounding box of each detected object. Through this data fusion process, the invention effectively resolves the mismatch between the viewing angle of the monocular camera and that of the laser radar, and achieves higher fusion efficiency than the prior art.
Description
Technical Field
The invention relates to the technical field of fusion of laser radars and cameras, in particular to a 3D target detection method and system based on fusion of a monocular camera and a laser radar.
Background
With the rapid development of artificial intelligence and big data, autonomous driving technology has advanced greatly, placing higher demands on the environment perception capability of autonomous vehicles. Multi-sensor fusion can overcome the inherent shortcomings of any single sensor and improve the stability and safety of an autonomous driving system.
An image sensor offers high resolution but poor depth estimation accuracy, stability, and robustness. A laser radar has low resolution, but its point cloud ranging accuracy is very high and its resistance to outdoor interference is strong. Combining the sparse depth data of the laser radar with dense image data therefore yields effective complementary advantages, and this combination is currently the mainstream focus of sensor fusion research in autonomous driving.
However, current fusion methods for image sensors and laser radar do not effectively resolve the differences in viewing angle and data characteristics between the sensors. Their fusion efficiency is low, their computational cost is much higher than that of detection with a single laser radar sensor, and the resulting improvement in detection performance is unsatisfactory.
Disclosure of Invention
The invention aims to provide a 3D target detection method and system based on the fusion of a monocular camera and a laser radar, which resolve the viewing-angle and data-characteristic differences that arise when fusing image and laser radar data, improve detection efficiency and accuracy, and enable fast, accurate 3D target detection.
In order to achieve the purpose, the invention provides the following scheme:
A 3D target detection method based on monocular camera and laser radar fusion comprises the following steps:
acquiring an image acquired by a monocular camera;
calculating an instance segmentation score of each pixel point in the image based on an instance segmentation network;
acquiring 3D point cloud data of a laser radar;
fusing the instance segmentation scores and the 3D point cloud data to obtain fused 3D point cloud data;
and performing 3D target detection on the fused 3D point cloud data with a point cloud deep-learning model to obtain the 3D bounding box of the detected object.
Optionally, the output of the instance segmentation network comprises a classification branch and a mask branch, wherein the classification branch predicts the semantic category of an object and its probability value, and the mask branch computes the instance mask of the object.
Optionally, calculating the instance segmentation score of each pixel in the image with the instance segmentation network specifically comprises:
obtaining the prediction probability values of the classification branch;
judging whether each prediction probability value is larger than a first threshold;
if so, obtaining the position index corresponding to that prediction probability value;
splitting the mask branch into an X direction and a Y direction;
calculating masks from the position indices and the X- and Y-direction branches;
retaining the masks that are larger than a second threshold;
performing a local maximum search on the retained masks to obtain the mask maxima;
and scaling the mask maxima to the original image size to obtain the instance segmentation scores.
Optionally, fusing the instance segmentation scores with the 3D point cloud data to obtain fused 3D point cloud data specifically comprises:
acquiring the external parameters between the monocular camera and the laser radar, the external parameters comprising a rotation matrix and a translation matrix;
projecting the 3D point cloud data of the laser radar into the three-dimensional coordinate system of the monocular camera according to the external parameters;
acquiring the internal parameters of the monocular camera, the internal parameters comprising an intrinsic matrix and a distortion parameter matrix;
projecting the points in the monocular camera's three-dimensional coordinate system onto the imaging plane according to the internal parameters, thereby obtaining the correspondence between the laser radar's 3D point cloud data and the image pixels;
and adding the instance segmentation score of each pixel in the image to the 3D point cloud data according to that correspondence to obtain the fused 3D point cloud data.
Optionally, performing 3D target detection on the fused 3D point cloud data with the point cloud deep-learning model to obtain the 3D bounding box of the detected object specifically comprises:
segmenting the fused 3D point cloud data by learning point-wise features to obtain segmented foreground points;
generating 3D proposals from the segmented foreground points;
pooling the fused 3D point cloud data and the corresponding point features according to the 3D proposals;
and generating the 3D bounding box of the detected object from the pooled 3D point cloud data and its corresponding point features.
Optionally, the first threshold is 0.1 and the second threshold is 0.5.
A 3D target detection system based on monocular camera and laser radar fusion comprises:
an image acquisition module for acquiring an image captured by the monocular camera;
an instance segmentation score calculation module for calculating the instance segmentation score of each pixel in the image with an instance segmentation network;
a 3D point cloud data acquisition module for acquiring the 3D point cloud data of the laser radar;
a data fusion module for fusing the instance segmentation scores with the 3D point cloud data to obtain fused 3D point cloud data;
and a target detection module for performing 3D target detection on the fused 3D point cloud data with a point cloud deep-learning model to obtain the 3D bounding box of the detected object.
Optionally, the instance segmentation score calculation module specifically comprises:
a classification branch unit for obtaining the prediction probability values of the classification branch;
a first judging unit for judging whether a prediction probability value is larger than a first threshold;
a position index unit for obtaining the position index corresponding to a prediction probability value when that value is larger than the first threshold;
a mask branching unit for splitting the mask branch into an X direction and a Y direction;
a mask calculation unit for calculating masks from the position indices and the X- and Y-direction branches;
a second judging unit for retaining the masks that are larger than a second threshold;
a local search unit for performing a local maximum search on the retained masks to obtain the mask maxima;
and an instance segmentation score calculation unit for scaling the mask maxima to the original image size to obtain the instance segmentation scores.
Optionally, the data fusion module specifically comprises:
an external parameter acquisition unit for acquiring the external parameters between the monocular camera and the laser radar, the external parameters comprising a rotation matrix and a translation matrix;
a first projection unit for projecting the 3D point cloud data of the laser radar into the three-dimensional coordinate system of the monocular camera according to the external parameters;
an internal parameter acquisition unit for acquiring the internal parameters of the monocular camera, the internal parameters comprising an intrinsic matrix and a distortion parameter matrix;
a second projection unit for projecting the points in the monocular camera's three-dimensional coordinate system onto the imaging plane according to the internal parameters, obtaining the correspondence between the laser radar's 3D point cloud data and the image pixels;
and a data fusion unit for adding the instance segmentation score of each pixel in the image to the 3D point cloud data according to that correspondence, obtaining the fused 3D point cloud data.
Optionally, the target detection module specifically comprises:
a foreground extraction unit for segmenting the fused 3D point cloud data by learning point-wise features to obtain segmented foreground points;
a 3D proposal generation unit for generating 3D proposals from the segmented foreground points;
a point cloud data pooling unit for pooling the fused 3D point cloud data and the corresponding point features according to the 3D proposals;
and a 3D bounding box refinement unit for generating the 3D bounding box of the detected object from the pooled 3D point cloud data and its corresponding point features.
According to the specific embodiment provided by the invention, the invention discloses the following technical effects:
according to the 3D target detection method based on the fusion of the monocular camera and the laser radar, the problem that the visual angle of the monocular camera is inconsistent with the visual angle of the laser radar in the fusion process can be effectively solved through the data fusion process, and the fusion efficiency is higher compared with the prior art.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed in the embodiments are briefly described below. It is obvious that the drawings described below show only some embodiments of the present invention, and other drawings can be obtained from them by those skilled in the art without inventive effort.
FIG. 1 is a flow chart of a 3D target detection method based on monocular camera and laser radar fusion according to the present invention;
FIG. 2 is a block diagram of a 3D target detection system based on the fusion of a monocular camera and a laser radar.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
The invention aims to provide a 3D target detection method and system based on the fusion of a monocular camera and a laser radar, which resolve the viewing-angle and data-characteristic differences that arise when fusing image and laser radar data, improve detection efficiency and accuracy, and enable fast, accurate 3D target detection.
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
Fig. 1 is a flowchart of a 3D target detection method based on monocular camera and lidar fusion according to the present invention, and as shown in fig. 1, a 3D target detection method based on monocular camera and lidar fusion includes:
step 101: acquiring an image acquired by a monocular camera;
step 102: calculating an instance segmentation score of each pixel point in the image based on an instance segmentation network;
Here, the instance segmentation network divides the picture into an n × n grid. If the centroid of an object falls in a certain grid cell, that cell has two tasks: (1) predict the semantic category of the object; (2) generate the instance mask of the object.
These two tasks are realized by the classification branch and the mask branch of the instance segmentation network. At the same time, a feature pyramid network assigns objects of different sizes to feature maps at different levels, which serve as the objects' size categories.
It should be noted that the instance segmentation network is designed as follows. The network output is split into two branches: a classification branch and a mask branch. The picture is divided equally into an S × S grid; the classification branch has size S × S × C, where C is the number of categories; the mask branch has size H × W × S², where S² is the predicted maximum number of instances, corresponding to original-image positions from top to bottom and left to right. When the center of a target object falls into grid cell (i, j), the corresponding position of the classification branch and the corresponding channel of the mask branch are responsible for predicting that object.
The specific steps are as follows: (1) for each grid cell, the classification branch predicts a C-dimensional output representing the probabilities of the semantic classes; all predicted probability values of the classification branch are extracted and filtered with a first threshold (for example, 0.1), retaining the probability values larger than that threshold; (2) the position indices i, j corresponding to the classifications remaining after filtering are obtained; (3) the mask branch is split into X and Y directions, and the mask of a category is obtained by element-wise multiplication of the i-th channel of the X branch and the j-th channel of the Y branch, establishing a one-to-one correspondence between semantic categories and masks; (4) the masks are screened with a second threshold (for example, 0.5), retaining the masks larger than that threshold; (5) a local maximum search is performed on all retained masks; (6) the final masks obtained after the local maximum search are scaled to the original image size, giving the instance segmentation score of each pixel.
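The six steps above can be sketched in NumPy. This is a minimal illustration, not the patented implementation: the tensor shapes follow a decoupled-SOLO-style layout, the local maximum search of step (5) is simplified to a per-pixel maximum across instances, and the rescaling uses nearest-neighbour sampling.

```python
import numpy as np

def instance_scores(cls_pred, mask_x, mask_y, img_hw, t1=0.1, t2=0.5):
    """Per-pixel instance-score extraction (steps (1)-(6) above, simplified).

    cls_pred : (S, S, C) classification probabilities per grid cell
    mask_x   : (H, W, S) X-direction mask maps
    mask_y   : (H, W, S) Y-direction mask maps
    Returns an img_hw score map (0 where no instance was found)."""
    scores = np.zeros(img_hw, dtype=np.float32)
    # (1)-(2): keep grid cells whose class probability exceeds the first threshold
    i_idx, j_idx, _ = np.nonzero(cls_pred > t1)
    for i, j in zip(i_idx, j_idx):
        # (3): element-wise product of the i-th X map and j-th Y map gives the mask
        mask = mask_x[:, :, i] * mask_y[:, :, j]
        # (4): the second threshold screens out weak mask pixels
        mask = np.where(mask > t2, mask, 0.0)
        if mask.max() == 0.0:
            continue
        # (5)-(6): rescale to the original image size (nearest neighbour),
        # keeping the per-pixel maximum across all retained instances
        ys = (np.arange(img_hw[0]) * mask.shape[0] // img_hw[0]).clip(0, mask.shape[0] - 1)
        xs = (np.arange(img_hw[1]) * mask.shape[1] // img_hw[1]).clip(0, mask.shape[1] - 1)
        scores = np.maximum(scores, mask[np.ix_(ys, xs)])
    return scores
```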
Step 103: acquiring 3D point cloud data of a laser radar;
step 104: fusing the instance segmentation scores and the 3D point cloud data to obtain fused 3D point cloud data;
the data fusion process is mainly characterized by 1) joint calibration: finding a space conversion relation from the laser radar to the camera, and projecting laser radar points to an image through a conversion matrix; 2) information fusion: and adding the example segmentation score obtained by each pixel point to the laser radar point, and realizing the two steps.
The method comprises the following specific steps: (1) acquiring external parameters (a rotation matrix and a translation matrix) of the monocular camera and the laser radar, and projecting points under a three-dimensional coordinate system of the laser radar point cloud to the three-dimensional coordinate system of the monocular camera; (2) obtaining monocular camera internal parameters (an internal parameter matrix and a distortion parameter matrix) through monocular camera calibration, and projecting points under a monocular camera three-dimensional coordinate system to an imaging plane so as to establish a corresponding relation between laser radar point cloud and image pixels; (3) and adding the instance segmentation score obtained by each pixel point to the laser radar point according to the corresponding relation.
It should be noted that if there are overlapping fields of view of multiple monocular cameras, there may be a case where the lidar points are projected on multiple images simultaneously, and at this time, the instance division scores in one image are randomly selected.
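Joint calibration and information fusion (steps (1)-(3)) can be sketched as follows. This is a hedged illustration: lens distortion is ignored, and the shapes assumed for `R`, `t`, `K`, and `score_map` are conventions chosen for the sketch rather than specifics from the text.

```python
import numpy as np

def fuse_scores(points, R, t, K, score_map):
    """Attach per-pixel instance scores to lidar points.

    points    : (N, 3) lidar points in the lidar frame
    R, t      : extrinsics (rotation matrix, translation vector), lidar -> camera
    K         : (3, 3) camera intrinsic matrix (distortion ignored here)
    score_map : (H, W) per-pixel instance segmentation scores
    Returns (N, 4): the points with a fused score appended (0 if not visible)."""
    H, W = score_map.shape
    cam = points @ R.T + t                      # (1) lidar frame -> camera frame
    uvw = cam @ K.T                             # (2) camera frame -> image plane
    scores = np.zeros(len(points))
    z = uvw[:, 2]
    valid = z > 1e-6                            # keep points in front of the camera
    u = np.round(uvw[valid, 0] / z[valid]).astype(int)
    v = np.round(uvw[valid, 1] / z[valid]).astype(int)
    inside = (u >= 0) & (u < W) & (v >= 0) & (v < H)
    idx = np.nonzero(valid)[0][inside]
    scores[idx] = score_map[v[inside], u[inside]]  # (3) look up the pixel score
    return np.hstack([points, scores[:, None]])
```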
Step 105: and performing 3D target detection on the fused 3D point cloud data by adopting a point cloud depth model algorithm to obtain a 3D boundary frame of the detected object.
The specific steps are as follows: 1) 3D proposal generation: learn point-wise features of the laser radar points fused with the image instance segmentation scores, segment the original point cloud, and generate 3D proposals from the segmented foreground points; 2) point cloud region pooling: pool each point and its features according to the position of each 3D proposal; 3) 3D bounding box optimization: transform the pooled points of each 3D proposal into canonical coordinates and learn local spatial features and global semantic features, which are used for 3D bounding box refinement and confidence prediction.
The depth model in the embodiment of the invention mainly comprises three modules: a 3D proposal generation module, a point cloud region pooling module, and a canonical 3D bounding box refinement module. The 3D proposal generation module segments the original point cloud (that is, the laser radar point cloud data fused with the image instance segmentation scores) by learning point-wise features, and generates 3D proposals from the segmented foreground points. The point cloud region pooling module pools the 3D points from the previous stage and their corresponding point features according to each 3D proposal, with the aim of learning more specific local features of each proposal; that is, the pooled object is the fused 3D point cloud data, and the pooling serves to retain key features. The canonical 3D bounding box refinement module receives the pooled points of each 3D proposal and their associated features, fine-tunes the position of the 3D box and the confidence of the foreground object, and generates the refined 3D bounding box of the detected object.
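The point cloud region pooling step can be illustrated with a small NumPy sketch. The box layout (center, size, yaw about the vertical axis) and the enlargement margin are assumptions made for illustration; the patent does not fix these conventions.

```python
import numpy as np

def pool_points(points, feats, box, margin=0.2):
    """Pool points (and their features) that fall inside one 3D proposal.

    box = (cx, cy, cz, h, w, l, theta); axes assume a camera-style frame
    (x right, y down, z forward) with yaw theta about the vertical (Y) axis.
    `margin` slightly enlarges the box to keep context points (an assumption)."""
    cx, cy, cz, h, w, l, theta = box
    local = points[:, :3] - np.array([cx, cy, cz])
    c, s = np.cos(-theta), np.sin(-theta)
    # rotate into the box frame so the proposal becomes axis-aligned
    x = local[:, 0] * c - local[:, 2] * s
    zr = local[:, 0] * s + local[:, 2] * c
    y = local[:, 1]
    inside = (np.abs(x) <= l / 2 + margin) & \
             (np.abs(zr) <= w / 2 + margin) & \
             (np.abs(y) <= h / 2 + margin)
    return points[inside], feats[inside]
```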
Specifically, the 3D proposal generation module comprises the following parts:
(1) learning a point cloud representation: in order to learn discriminative point-wise features describing the original point cloud, the embodiment of the invention uses PointNet++ with multi-scale grouping as the backbone network;
(2) foreground point segmentation: foreground points provide sufficient information for predicting the locations and orientations of the objects they belong to. To learn to segment foreground points, the point cloud network must capture context in order to make accurate point-wise predictions. The 3D proposal generation method of the embodiment generates 3D box proposals directly from foreground points, i.e., foreground segmentation and 3D box proposal generation are performed simultaneously. Given the point-wise features encoded by the backbone network, the embodiment appends a segmentation head for estimating the foreground mask and a bounding box regression head for generating the 3D proposals. For large-scale outdoor scenes, the number of foreground points is much smaller than the number of background points, so the embodiment uses the focal loss to handle this class imbalance:
FL(p_t) = −α_t · (1 − p_t)^γ · log(p_t)
During training of the point cloud segmentation, α_t = 0.25 and γ = 2.
(3) Bin-based three-dimensional bounding box generation: a 3D bounding box is represented in the lidar coordinate system as (x, y, z, h, w, l, θ), where (x, y, z) is the target center position, (h, w, l) is the size of the target, and θ is the target orientation in the bird's eye view. To constrain the generated 3D box proposals, the embodiment of the present invention estimates the 3D bounding box of an object with a bin-based regression loss function. To estimate the center position of an object, the embodiment splits the region around each foreground point into a series of discrete bins along the X and Z axes. Specifically, a search range S is set along each of the X and Z axes of the current foreground point, and each one-dimensional search range is divided into bins of equal length δ to represent the candidate centers (x, z) of the different objects in the X-Z plane. The localization loss for the X and Z axes consists of two parts: bin classification along each axis, and residual regression within the classified bin. The center position y along the Y axis is regressed directly with the smooth L1 loss. The localization targets are computed as follows:
bin_x^(p) = ⌊(x^p − x^(p) + S) / δ⌋,  bin_z^(p) = ⌊(z^p − z^(p) + S) / δ⌋
res_u^(p) = (1/ι) · (u^p − u^(p) + S − (bin_u^(p) · δ + δ/2)),  u ∈ {x, z}
where (x^(p), y^(p), z^(p)) are the coordinates of the foreground point of interest, (x^p, y^p, z^p) are the center coordinates of its corresponding object, bin_x^(p) and bin_z^(p) are the ground-truth bin assignments along the X and Z axes, res_x^(p) and res_z^(p) are the ground-truth residuals for further localization refinement within the assigned bins, and ι is the normalized bin length.
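The bin and residual targets above can be sketched as follows; this is a minimal illustration, and the values of S and δ are assumptions, since the patent does not fix them:

```python
import numpy as np

def center_bin_targets(fg_point, obj_center, S=3.0, delta=0.5):
    """Bin classification and residual regression targets for the object
    center along the X and Z axes, relative to one foreground point.
    S (per-axis search range) and delta (bin length) are assumed values."""
    targets = {}
    for axis, fp, c in (("x", fg_point[0], obj_center[0]),
                        ("z", fg_point[2], obj_center[2])):
        offset = c - fp + S                       # shift into [0, 2S)
        bin_idx = int(np.floor(offset / delta))   # bin classification target
        targets[f"bin_{axis}"] = bin_idx
        # residual to the bin center, normalized by the bin length
        targets[f"res_{axis}"] = (offset - (bin_idx * delta + delta / 2)) / delta
    # the Y offset is regressed directly with a smooth-L1 loss, no binning
    targets["res_y"] = obj_center[1] - fg_point[1]
    return targets

# foreground point at (1.0, 0.5, 2.0), object center at (1.6, 0.7, 2.0)
t = center_bin_targets(np.array([1.0, 0.5, 2.0]), np.array([1.6, 0.7, 2.0]))
```

Classification over coarse bins plus a small normalized residual is easier to learn than regressing a large unbounded offset directly.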
(4) Setting object orientation and size targets: the embodiment of the invention divides the orientation range 2π into n bins and computes the bin classification target bin_θ^(p) and the residual regression target res_θ^(p) in the same manner as for the x and z directions. The size (h, w, l) of the object is regressed directly with respect to the average object size of each class computed over the entire training set.
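The orientation targets can be computed analogously; a minimal sketch (the number of bins n is an assumption here):

```python
import numpy as np

def orientation_bin_target(theta, n_bins=12):
    """Split the full orientation range 2*pi into n_bins and return the
    (bin index, normalized residual) targets for a ground-truth heading
    theta given in radians."""
    bin_size = 2 * np.pi / n_bins
    theta = theta % (2 * np.pi)        # wrap the angle into [0, 2*pi)
    bin_idx = int(theta // bin_size)
    # residual to the bin center, normalized to roughly [-0.5, 0.5)
    residual = (theta - (bin_idx * bin_size + bin_size / 2)) / bin_size
    return bin_idx, residual

b, r = orientation_bin_target(1.0)
```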
(5) Parameter recovery during inference: in the inference stage, for the bin-based prediction parameters x, z and θ, the bin center with the highest prediction confidence is selected first, and the predicted residual is added to obtain the refined parameter. For the directly regressed parameters y, h, w and l, the predicted residuals are added to their initial values.
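The decoding step for one bin-based parameter can be sketched as follows (an illustration matching the target computation above; S and δ values are assumptions):

```python
import numpy as np

def decode_center_x(fg_x, bin_logits, bin_residuals, S=3.0, delta=0.5):
    """Recover the refined X coordinate from per-bin confidences and
    per-bin predicted residuals: pick the highest-confidence bin, then
    add that bin's regressed residual (normalized by the bin length)."""
    best = int(np.argmax(bin_logits))
    # bin center in the shifted [0, 2S) range, plus the regressed residual
    offset = best * delta + delta / 2 + bin_residuals[best] * delta
    return fg_x + offset - S

logits = np.zeros(12); logits[7] = 5.0       # bin 7 is most confident
res = np.zeros(12); res[7] = -0.3            # its predicted residual
x_refined = decode_center_x(1.0, logits, res)
```

With the targets from the previous sketch, this round-trips: bin 7 with residual -0.3 decodes back to the object center x = 1.6.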
(6) Training loss function: the regression loss L_reg of the entire 3D bounding box over the different training loss terms is expressed as follows:
L_reg = (1/N_pos) · Σ_p [ Σ_{u ∈ {x, z, θ}} ( F_cls(bin̂_u^(p), bin_u^(p)) + F_reg(reŝ_u^(p), res_u^(p)) ) + Σ_{v ∈ {y, h, w, l}} F_reg(reŝ_v^(p), res_v^(p)) ]
where N_pos is the number of foreground points, bin̂_u^(p) and reŝ_u^(p) are the predicted bin assignment and residual of foreground point p, bin_u^(p) and res_u^(p) are the ground-truth targets computed as above, F_cls is the cross-entropy classification loss, and F_reg is the smooth L1 loss.
(7) Non-maximum suppression during training and inference: to remove redundant proposals, non-maximum suppression based on the oriented IoU in the bird's eye view is used to keep a small set of high-quality proposals (no specific number is required). During training, the bird's-eye-view IoU threshold is 0.85, and non-maximum suppression retains the top 300 proposals for training the subsequent sub-network. During inference, the bird's-eye-view IoU threshold is set to 0.8, and non-maximum suppression retains the top 100 proposals for the subsequent refinement sub-network.
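The greedy suppression procedure can be sketched as follows. Note this simplification uses axis-aligned bird's-eye-view boxes, whereas the patent uses oriented BEV IoU that also accounts for the heading θ:

```python
import numpy as np

def bev_nms(boxes, scores, iou_thresh, top_k):
    """Greedy NMS on axis-aligned bird's-eye-view boxes.
    boxes: (N, 4) rows of [x1, z1, x2, z2]; scores: (N,) confidences."""
    order = np.argsort(scores)[::-1]          # highest confidence first
    keep = []
    while order.size and len(keep) < top_k:
        i = int(order[0])
        keep.append(i)
        # intersection of box i with all remaining boxes
        x1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        z1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        x2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        z2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.maximum(0, x2 - x1) * np.maximum(0, z2 - z1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[order[1:], 2] - boxes[order[1:], 0]) * \
                (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + areas - inter)
        # drop boxes overlapping box i above the threshold
        order = order[1:][iou <= iou_thresh]
    return keep

boxes = np.array([[0, 0, 2, 2], [0, 0, 2, 2.1], [5, 5, 7, 7]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
kept = bev_nms(boxes, scores, iou_thresh=0.85, top_k=300)
```

Here the second box overlaps the first with IoU ≈ 0.95 > 0.85 and is suppressed, while the distant third box survives.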
The point cloud area pooling module specifically comprises the following parts:
(1) Expanding the 3D proposal box: each 3D proposal box bi=(xi,yi,zi,hi,wi,li,θi) is appropriately enlarged to create a new 3D box bi^e=(xi,yi,zi,hi+η,wi+η,li+η,θi) so as to encode additional information from its surroundings, where η is a fixed value used to enlarge the size of the box.
(2) Determining whether a point is inside the enlarged box: for each point p = (x^(p), y^(p), z^(p)), an inside/outside test is performed to determine whether the point lies in the enlarged proposal box bi^e. If so, the point and its features are retained for refining bi. The features associated with an interior point p include: its 3D coordinates (x^(p), y^(p), z^(p)) ∈ R^3, its laser reflection intensity r^(p) ∈ R, its predicted segmentation mask m^(p) ∈ {0, 1} from the previous stage, and its C-dimensional learned point feature representation f^(p) ∈ R^C from the previous stage. Including the segmentation mask m^(p) distinguishes foreground from background points within the enlarged box, while the learned point feature f^(p) encodes valuable information acquired through learning for segmentation and proposal generation.
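The enlargement and the inside/outside test can be sketched as follows; the value of η is an assumption, and the test works in the box's local frame by translating to the box center and rotating by the heading:

```python
import numpy as np

def points_in_enlarged_box(points, box, eta=0.2):
    """Mask of points inside a proposal enlarged by eta on each size axis.
    points: (N, 3) lidar coordinates; box: (x, y, z, h, w, l, theta).
    eta is the fixed enlargement value (assumed here)."""
    x, y, z, h, w, l, theta = box
    h, w, l = h + eta, w + eta, l + eta          # the enlarged box b_i^e
    p = points - np.array([x, y, z])             # translate to box center
    c, s = np.cos(theta), np.sin(theta)
    # rotate around the vertical (Y) axis into the box's local frame
    px = c * p[:, 0] - s * p[:, 2]
    pz = s * p[:, 0] + c * p[:, 2]
    # compare against the enlarged half-sizes on each local axis
    return (np.abs(px) <= l / 2) & (np.abs(p[:, 1]) <= h / 2) & (np.abs(pz) <= w / 2)

mask = points_in_enlarged_box(
    np.array([[0.2, 0.0, 0.0], [3.0, 0.0, 0.0]]),
    (0.0, 0.0, 0.0, 1.0, 1.0, 2.0, 0.0))
```

Only points with a True mask, together with their features [r^(p), m^(p), f^(p)], are forwarded to the refinement stage.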
The 3D bounding box refinement module specifically comprises the following parts:
(1) Canonical transformation: to take advantage of the high-recall proposal boxes generated by the 3D proposal generation module and to estimate only the residuals of the proposal box parameters, the embodiment of the present invention transforms the pooled points belonging to each proposal into the canonical coordinate system of the corresponding 3D proposal. The canonical coordinate system of one 3D proposal is defined as follows: the origin is at the center of the proposal box; the local X and Z axes are approximately parallel to the ground plane, with X pointing in the heading direction of the proposal and Z perpendicular to X; the Y axis remains the same as that of the lidar coordinate system.
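The canonical transformation described above amounts to a translation to the box center followed by a rotation about the vertical axis; a minimal sketch:

```python
import numpy as np

def to_canonical(points, proposal):
    """Transform pooled points into the canonical frame of one proposal:
    origin at the box center, local X along the proposal heading,
    Y kept aligned with the lidar Y axis, Z perpendicular to X.
    points: (N, 3); proposal: (x, y, z, h, w, l, theta)."""
    x, y, z, theta = proposal[0], proposal[1], proposal[2], proposal[6]
    p = points - np.array([x, y, z])          # move origin to box center
    c, s = np.cos(theta), np.sin(theta)
    out = p.copy()
    out[:, 0] = c * p[:, 0] - s * p[:, 2]     # undo the heading rotation
    out[:, 2] = s * p[:, 0] + c * p[:, 2]
    return out

# a point exactly at the proposal center maps to the canonical origin
canon = to_canonical(np.array([[1.0, 2.0, 3.0]]),
                     (1.0, 2.0, 3.0, 1.5, 1.6, 3.9, 0.7))
```

In this frame every proposal looks alike regardless of where it sits in the scene, which is what makes the local spatial feature learning robust.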
(2) Composition of the refinement sub-network: the refinement sub-network combines the transformed local spatial point features with their global semantic features f^(p) from the 3D proposal generation module, and refines the box parameters and the confidence.
(3) Deficiency of the canonical transformation and its remedy: while the canonical transformation enables robust learning of local spatial features, it inevitably loses the depth information of each object. For example, due to the fixed angular scanning resolution of the lidar sensor, distant objects usually contain far fewer points than nearby ones. To compensate for the lost depth information, the embodiment of the present invention adds the distance to the sensor, d^(p) = sqrt((x^(p))^2 + (y^(p))^2 + (z^(p))^2), to the features of point p.
(4) After all features of a proposal are obtained, for each proposal the local spatial features of its associated points are first concatenated with the extra features [r^(p), m^(p), d^(p)] and passed through several fully connected layers (the specific number is determined as appropriate), encoding these local features to the same dimension as the global features f^(p). The local and global features are then concatenated and fed into the network to obtain a discriminative feature vector for confidence classification and box refinement.
(5) Loss of box proposal refinement: the proposal refinement adopts the bin-based regression loss. A ground-truth box is assigned to a 3D box proposal for learning box refinement if the IoU between them is greater than 0.55. The overall loss function of the entire module is as follows:
L_refine = (1/‖β‖) · Σ_{i ∈ β} F_cls(prob_i, label_i) + (1/‖β_pos‖) · Σ_{i ∈ β_pos} (L̃_bin^(i) + L̃_res^(i))
where β is the set of 3D proposals from the 3D proposal generation module, β_pos is the subset of proposals kept as positives for regression, prob_i is the estimated confidence of b̃_i, and label_i is its corresponding label. Finally, oriented non-maximum suppression with a bird's-eye-view IoU threshold of 0.01 is applied to remove the overlapping bounding boxes, generating the 3D bounding boxes of the detected objects.
In addition, corresponding to the above method, the present invention further provides a 3D target detection system based on monocular camera and lidar fusion, as shown in fig. 2, specifically including:
an image acquisition module 201, configured to acquire an image acquired by a monocular camera;
an example segmentation score calculation module 202, configured to calculate an example segmentation score of each pixel point in the image based on an example segmentation network;
a 3D point cloud data obtaining module 203, configured to obtain 3D point cloud data of the laser radar;
a data fusion module 204, configured to fuse the instance segmentation scores with the 3D point cloud data to obtain fused 3D point cloud data;
and the target detection module 205 is configured to perform 3D target detection on the fused 3D point cloud data by using a point cloud depth model algorithm to obtain a 3D bounding box of the detected object.
Due to the adoption of the technical scheme, the invention has the following advantages:
the 3D target detection method based on the fusion of the monocular camera and the lidar can effectively solve the problem that the viewing angle of the monocular camera is inconsistent with that of the lidar during the fusion process, and achieves higher fusion efficiency than the prior art.
Compared with the prior art, the 3D target detection method based on the fusion of the monocular camera and the lidar can improve the detection accuracy of small objects by adding the detail information provided by instance segmentation.
The 3D target detection method based on the fusion of the monocular camera and the lidar adopts instance segmentation as the fusion means, and its output can serve not only 3D target detection but also other tasks in autonomous driving such as depth estimation and multi-target tracking.
The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. For the system disclosed by the embodiment, the description is relatively simple because the system corresponds to the method disclosed by the embodiment, and the relevant points can be referred to the method part for description.
The principles and embodiments of the present invention have been described herein using specific examples, which are provided only to help understand the method and the core concept of the present invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, the specific embodiments and the application range may be changed. In view of the above, the present disclosure should not be construed as limiting the invention.
Claims (10)
1. A3D target detection method based on monocular camera and laser radar fusion is characterized by comprising the following steps:
acquiring an image acquired by a monocular camera;
calculating an instance segmentation score of each pixel point in the image based on an instance segmentation network;
acquiring 3D point cloud data of a laser radar;
fusing the instance segmentation scores and the 3D point cloud data to obtain fused 3D point cloud data;
and performing 3D target detection on the fused 3D point cloud data by adopting a point cloud depth model algorithm to obtain a 3D bounding box of the detected object.
2. The monocular camera and lidar fusion based 3D target detection method of claim 1, wherein the output of the instance segmentation network comprises a classification branch for predicting semantic categories of objects and deriving corresponding probability values, and a mask branch for calculating instance masks of objects.
3. The monocular camera and lidar fusion based 3D target detection method of claim 2, wherein the calculating an instance segmentation score for each pixel point in the image based on an instance segmentation network specifically comprises:
obtaining a prediction probability value of the classification branch;
judging whether the prediction probability value is larger than a first threshold value or not;
if so, acquiring a position index corresponding to the prediction probability value;
dividing the mask branches into an X direction and a Y direction;
calculating a mask according to the position index, the X direction and the Y direction;
acquiring a mask which is larger than a second threshold value in the masks;
performing local maximum search on the mask which is larger than the second threshold value to obtain the maximum value of the mask;
and carrying out size scaling on the mask maximum value according to the size of the original image to obtain an example segmentation score.
4. The monocular camera and lidar fusion based 3D target detection method of claim 1, wherein fusing the instance segmentation score with the 3D point cloud data to obtain fused 3D point cloud data specifically comprises:
acquiring external parameters of a monocular camera and a laser radar, wherein the external parameters comprise a rotation matrix and a translation matrix;
projecting the 3D point cloud data of the laser radar to a monocular camera three-dimensional coordinate system according to the external parameters;
acquiring internal parameters of a monocular camera, wherein the internal parameters comprise an internal parameter matrix and a distortion parameter matrix;
projecting points under the three-dimensional coordinate system of the monocular camera to an imaging plane according to the internal reference to obtain the corresponding relation between the 3D point cloud data and the image pixels of the laser radar;
and adding the instance segmentation score of each pixel point in the image to the 3D point cloud data according to the corresponding relation between the 3D point cloud data of the laser radar and the image pixels to obtain fused 3D point cloud data.
5. The monocular camera and lidar fusion based 3D target detection method of claim 1, wherein the performing 3D target detection on the fused 3D point cloud data by adopting a point cloud depth model algorithm to obtain a 3D bounding box of the detected object specifically comprises:
segmenting the fused 3D point cloud data through learning features point by point to obtain segmented foreground points;
generating a 3D proposal according to the segmented foreground points;
pooling the fused 3D point cloud data and the corresponding point features according to the 3D proposal;
and generating a 3D bounding box of the detected object according to the pooled 3D point cloud data and the point features corresponding to the point cloud data.
6. The monocular camera and lidar fusion based 3D target detection method of claim 1, wherein the first threshold is 0.1 and the second threshold is 0.5.
7. A3D target detection system based on monocular camera and lidar fusion, characterized by comprising:
the image acquisition module is used for acquiring an image acquired by the monocular camera;
the example segmentation score calculation module is used for calculating the example segmentation score of each pixel point in the image based on an example segmentation network;
the 3D point cloud data acquisition module is used for acquiring 3D point cloud data of the laser radar;
the data fusion module is used for fusing the instance segmentation scores with the 3D point cloud data to obtain fused 3D point cloud data;
and the target detection module is used for performing 3D target detection on the fused 3D point cloud data by adopting a point cloud depth model algorithm to obtain a 3D bounding box of the detected object.
8. The monocular camera and lidar fusion based 3D target detection system of claim 7, wherein the instance segmentation score calculation module specifically comprises:
the classification branch unit is used for acquiring the prediction probability value of the classification branch;
the first judging unit is used for judging whether the prediction probability value is larger than a first threshold value or not;
the position index unit is used for acquiring a position index corresponding to the prediction probability value when the prediction probability value is larger than a first threshold value;
a mask branching unit for dividing the mask branch into an X direction and a Y direction;
a mask calculation unit for calculating a mask according to the position index, the X direction and the Y direction;
a second judging unit, configured to obtain a mask that is greater than a second threshold value from among the masks;
the local search unit is used for performing local maximum search on the mask which is greater than the second threshold value to obtain the maximum value of the mask;
and the example division score calculating unit is used for carrying out size scaling on the mask maximum value according to the original image size to obtain the example division score.
9. The monocular camera and lidar fusion based 3D target detection system of claim 7, wherein the data fusion module specifically comprises:
the external parameter acquisition unit is used for acquiring external parameters of the monocular camera and the laser radar, and the external parameters comprise a rotation matrix and a translation matrix;
the first projection unit is used for projecting the 3D point cloud data of the laser radar to a monocular camera three-dimensional coordinate system according to the external parameters;
the internal reference acquisition unit is used for acquiring internal reference of the monocular camera, and the internal reference comprises an internal reference matrix and a distortion parameter matrix;
the second projection unit is used for projecting points under the three-dimensional coordinate system of the monocular camera to an imaging plane according to the internal reference to obtain the corresponding relation between the 3D point cloud data and the image pixels of the laser radar;
and the data fusion unit is used for adding the instance segmentation scores of each pixel point in the image to the 3D point cloud data according to the corresponding relation between the 3D point cloud data of the laser radar and the image pixels to obtain fused 3D point cloud data.
10. The monocular camera and lidar fusion based 3D target detection system of claim 7, wherein the target detection module specifically comprises:
the foreground extraction unit is used for segmenting the fused 3D point cloud data through learning features point by point to obtain segmented foreground points;
a 3D proposal generating unit, which is used for generating a 3D proposal according to the segmented foreground points;
the point cloud data pooling unit is used for pooling the fused 3D point cloud data and corresponding point characteristics according to the 3D proposal;
and the 3D bounding box refining unit is used for generating a 3D bounding box of the detected object according to the 3D point cloud data after the pooling and the point characteristics corresponding to the 3D point cloud data.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110447403.0A CN113139602A (en) | 2021-04-25 | 2021-04-25 | 3D target detection method and system based on monocular camera and laser radar fusion |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113139602A true CN113139602A (en) | 2021-07-20 |
Family
ID=76811961
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110447403.0A Pending CN113139602A (en) | 2021-04-25 | 2021-04-25 | 3D target detection method and system based on monocular camera and laser radar fusion |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113139602A (en) |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110472534A (en) * | 2019-07-31 | 2019-11-19 | 厦门理工学院 | 3D object detection method, device, equipment and storage medium based on RGB-D data |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113985445A (en) * | 2021-08-24 | 2022-01-28 | 中国北方车辆研究所 | 3D target detection algorithm based on data fusion of camera and laser radar |
CN116265862A (en) * | 2021-12-16 | 2023-06-20 | 动态Ad有限责任公司 | Vehicle, system and method for a vehicle, and storage medium |
CN114359181A (en) * | 2021-12-17 | 2022-04-15 | 上海应用技术大学 | Intelligent traffic target fusion detection method and system based on image and point cloud |
CN114359181B (en) * | 2021-12-17 | 2024-01-26 | 上海应用技术大学 | Intelligent traffic target fusion detection method and system based on image and point cloud |
CN114913209A (en) * | 2022-07-14 | 2022-08-16 | 南京后摩智能科技有限公司 | Multi-target tracking network construction method and device based on overlook projection |
CN118015411A (en) * | 2024-02-27 | 2024-05-10 | 北京化工大学 | Automatic driving-oriented large vision language model increment learning method and device |
CN117994504A (en) * | 2024-04-03 | 2024-05-07 | 国网江苏省电力有限公司常州供电分公司 | Target detection method and target detection device |
CN117994504B (en) * | 2024-04-03 | 2024-07-02 | 国网江苏省电力有限公司常州供电分公司 | Target detection method and target detection device |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111797716B (en) | Single target tracking method based on Siamese network | |
CN110675418B (en) | Target track optimization method based on DS evidence theory | |
CN110929692B (en) | Three-dimensional target detection method and device based on multi-sensor information fusion | |
CN113139602A (en) | 3D target detection method and system based on monocular camera and laser radar fusion | |
CN113159151B (en) | Multi-sensor depth fusion 3D target detection method for automatic driving | |
CN109903331B (en) | Convolutional neural network target detection method based on RGB-D camera | |
CN111830502B (en) | Data set establishing method, vehicle and storage medium | |
CN110689562A (en) | Trajectory loop detection optimization method based on generation of countermeasure network | |
CN111563415A (en) | Binocular vision-based three-dimensional target detection system and method | |
CN114022830A (en) | Target determination method and target determination device | |
CN109919026B (en) | Surface unmanned ship local path planning method | |
CN110941996A (en) | Target and track augmented reality method and system based on generation of countermeasure network | |
CN113643345A (en) | Multi-view road intelligent identification method based on double-light fusion | |
CN114114312A (en) | Three-dimensional target detection method based on fusion of multi-focal-length camera and laser radar | |
CN111292369A (en) | Pseudo-point cloud data generation method for laser radar | |
CN114399734A (en) | Forest fire early warning method based on visual information | |
CN111260687A (en) | Aerial video target tracking method based on semantic perception network and related filtering | |
CN115115690A (en) | Video residual decoding device and associated method | |
CN112529917A (en) | Three-dimensional target segmentation method, device, equipment and storage medium | |
CN116958927A (en) | Method and device for identifying short column based on BEV (binary image) graph | |
CN116664851A (en) | Automatic driving data extraction method based on artificial intelligence | |
CN115861709A (en) | Intelligent visual detection equipment based on convolutional neural network and method thereof | |
CN115249269A (en) | Object detection method, computer program product, storage medium, and electronic device | |
CN114758087A (en) | Method and device for constructing city information model | |
CN113569803A (en) | Multi-mode data fusion lane target detection method and system based on multi-scale convolution |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||