CN113139602A - 3D target detection method and system based on monocular camera and laser radar fusion - Google Patents
- Publication number: CN113139602A
- Application number: CN202110447403.0A
- Authority: CN (China)
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis)
Classifications
- G06F18/25 — Pattern recognition; fusion techniques
- G01S17/66 — Tracking systems using electromagnetic waves other than radio waves
- G01S17/86 — Combinations of lidar systems with systems other than lidar, radar or sonar, e.g. with direction finders
- G01S17/931 — Lidar systems specially adapted for anti-collision purposes of land vehicles
- G06F18/2415 — Classification techniques based on parametric or probabilistic models, e.g. based on likelihood ratio
- G06T7/136 — Segmentation; edge detection involving thresholding
- G06T7/181 — Segmentation; edge detection involving edge growing or edge linking
- G06T7/194 — Segmentation involving foreground-background segmentation
- G06T7/66 — Analysis of geometric attributes of image moments or centre of gravity
- G06T7/80 — Analysis of captured images to determine intrinsic or extrinsic camera parameters (camera calibration)
- G06T2207/10028 — Range image; depth image; 3D point clouds
- G06T2207/10032 — Satellite or aerial image; remote sensing
- G06T2207/10044 — Radar image
- G06T2207/20081 — Training; learning
- G06T2207/20084 — Artificial neural networks [ANN]
- G06V2201/07 — Target detection
Abstract
The invention relates to a 3D target detection method and system based on the fusion of a monocular camera and a laser radar. The method comprises: acquiring an image captured by the monocular camera; computing an instance segmentation score for each pixel of the image with an instance segmentation network; acquiring 3D point cloud data from the laser radar; fusing the instance segmentation scores with the 3D point cloud data to obtain fused 3D point cloud data; and performing 3D target detection on the fused 3D point cloud data with a point cloud deep-learning model to obtain the 3D bounding box of each detected object. Through this data fusion process, the invention effectively resolves the mismatch between the viewing angle of the monocular camera and that of the laser radar, and achieves higher fusion efficiency than the prior art.
Description
Technical Field
The invention relates to the technical field of fusion of laser radars and cameras, in particular to a 3D target detection method and system based on fusion of a monocular camera and a laser radar.
Background
With the rapid development of artificial intelligence and big data, autonomous driving technology has advanced greatly, placing higher demands on the environment perception capability of autonomous vehicles. Multi-sensor fusion can overcome the inherent shortcomings of any single sensor and improve the stability and safety of an autonomous driving system.
An image sensor offers high resolution but poor depth estimation accuracy, stability, and robustness. A laser radar has low resolution, but its point cloud ranging accuracy is very high and its resistance to outdoor interference is strong. Combining the sparse depth data of the laser radar with dense image data therefore yields effective complementary advantages, and this combination is currently the mainstream focus of sensor fusion research in autonomous driving.
However, current fusion methods for image sensors and laser radar do not effectively resolve the differences in viewing angle and data characteristics between the sensors. Their fusion efficiency is low, their computational cost is much higher than that of detection with a single laser radar sensor, and the resulting improvement in detection performance is unsatisfactory.
Disclosure of Invention
The invention aims to provide a 3D target detection method and system based on the fusion of a monocular camera and a laser radar, which resolve the viewing-angle and data-characteristic differences that arise when fusing image and laser radar data, improve detection efficiency and accuracy, and enable fast, accurate 3D target detection.
In order to achieve the purpose, the invention provides the following scheme:
A 3D target detection method based on monocular camera and laser radar fusion comprises the following steps:
acquiring an image acquired by a monocular camera;
calculating an instance segmentation score of each pixel point in the image based on an instance segmentation network;
acquiring 3D point cloud data of a laser radar;
fusing the instance segmentation scores and the 3D point cloud data to obtain fused 3D point cloud data;
and performing 3D target detection on the fused 3D point cloud data with a point cloud deep-learning model to obtain the 3D bounding box of the detected object.
Optionally, the output of the instance segmentation network comprises a classification branch and a mask branch, wherein the classification branch predicts the semantic category of an object and its probability value, and the mask branch computes the instance mask of the object.
Optionally, calculating the instance segmentation score of each pixel in the image with the instance segmentation network specifically comprises:
obtaining the prediction probability values of the classification branch;
judging whether each prediction probability value is larger than a first threshold;
if so, obtaining the position index corresponding to that prediction probability value;
splitting the mask branch into an X direction and a Y direction;
calculating masks from the position indices and the X- and Y-direction branches;
retaining the masks that are larger than a second threshold;
performing a local maximum search on the retained masks to obtain the mask maxima;
and scaling the mask maxima to the original image size to obtain the instance segmentation scores.
Optionally, fusing the instance segmentation scores with the 3D point cloud data to obtain fused 3D point cloud data specifically comprises:
acquiring the external parameters between the monocular camera and the laser radar, the external parameters comprising a rotation matrix and a translation matrix;
projecting the 3D point cloud data of the laser radar into the three-dimensional coordinate system of the monocular camera according to the external parameters;
acquiring the internal parameters of the monocular camera, the internal parameters comprising an intrinsic matrix and a distortion parameter matrix;
projecting the points in the monocular camera's three-dimensional coordinate system onto the imaging plane according to the internal parameters, thereby obtaining the correspondence between the laser radar's 3D point cloud data and the image pixels;
and adding the instance segmentation score of each pixel in the image to the 3D point cloud data according to that correspondence to obtain the fused 3D point cloud data.
Optionally, performing 3D target detection on the fused 3D point cloud data with the point cloud deep-learning model to obtain the 3D bounding box of the detected object specifically comprises:
segmenting the fused 3D point cloud data by learning point-wise features to obtain segmented foreground points;
generating 3D proposals from the segmented foreground points;
pooling the fused 3D point cloud data and the corresponding point features according to the 3D proposals;
and generating the 3D bounding box of the detected object from the pooled 3D point cloud data and its corresponding point features.
Optionally, the first threshold is 0.1 and the second threshold is 0.5.
A 3D target detection system based on monocular camera and laser radar fusion comprises:
an image acquisition module for acquiring an image captured by the monocular camera;
an instance segmentation score calculation module for calculating the instance segmentation score of each pixel in the image with an instance segmentation network;
a 3D point cloud data acquisition module for acquiring the 3D point cloud data of the laser radar;
a data fusion module for fusing the instance segmentation scores with the 3D point cloud data to obtain fused 3D point cloud data;
and a target detection module for performing 3D target detection on the fused 3D point cloud data with a point cloud deep-learning model to obtain the 3D bounding box of the detected object.
Optionally, the instance segmentation score calculation module specifically comprises:
a classification branch unit for obtaining the prediction probability values of the classification branch;
a first judging unit for judging whether a prediction probability value is larger than a first threshold;
a position index unit for obtaining the position index corresponding to a prediction probability value when that value is larger than the first threshold;
a mask branching unit for splitting the mask branch into an X direction and a Y direction;
a mask calculation unit for calculating masks from the position indices and the X- and Y-direction branches;
a second judging unit for retaining the masks that are larger than a second threshold;
a local search unit for performing a local maximum search on the retained masks to obtain the mask maxima;
and an instance segmentation score calculation unit for scaling the mask maxima to the original image size to obtain the instance segmentation scores.
Optionally, the data fusion module specifically comprises:
an external parameter acquisition unit for acquiring the external parameters between the monocular camera and the laser radar, the external parameters comprising a rotation matrix and a translation matrix;
a first projection unit for projecting the 3D point cloud data of the laser radar into the three-dimensional coordinate system of the monocular camera according to the external parameters;
an internal parameter acquisition unit for acquiring the internal parameters of the monocular camera, the internal parameters comprising an intrinsic matrix and a distortion parameter matrix;
a second projection unit for projecting the points in the monocular camera's three-dimensional coordinate system onto the imaging plane according to the internal parameters, obtaining the correspondence between the laser radar's 3D point cloud data and the image pixels;
and a data fusion unit for adding the instance segmentation score of each pixel in the image to the 3D point cloud data according to that correspondence, obtaining the fused 3D point cloud data.
Optionally, the target detection module specifically comprises:
a foreground extraction unit for segmenting the fused 3D point cloud data by learning point-wise features to obtain segmented foreground points;
a 3D proposal generation unit for generating 3D proposals from the segmented foreground points;
a point cloud data pooling unit for pooling the fused 3D point cloud data and the corresponding point features according to the 3D proposals;
and a 3D bounding box refinement unit for generating the 3D bounding box of the detected object from the pooled 3D point cloud data and its corresponding point features.
According to the specific embodiment provided by the invention, the invention discloses the following technical effects:
according to the 3D target detection method based on the fusion of the monocular camera and the laser radar, the problem that the visual angle of the monocular camera is inconsistent with the visual angle of the laser radar in the fusion process can be effectively solved through the data fusion process, and the fusion efficiency is higher compared with the prior art.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed in the embodiments are briefly described below. It is obvious that the drawings described below show only some embodiments of the present invention, and other drawings can be obtained from them by those skilled in the art without inventive effort.
FIG. 1 is a flow chart of a 3D target detection method based on monocular camera and laser radar fusion according to the present invention;
FIG. 2 is a block diagram of a 3D target detection system based on the fusion of a monocular camera and a laser radar.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
The invention aims to provide a 3D target detection method and system based on the fusion of a monocular camera and a laser radar, which resolve the viewing-angle and data-characteristic differences that arise when fusing image and laser radar data, improve detection efficiency and accuracy, and enable fast, accurate 3D target detection.
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
Fig. 1 is a flowchart of a 3D target detection method based on monocular camera and lidar fusion according to the present invention, and as shown in fig. 1, a 3D target detection method based on monocular camera and lidar fusion includes:
step 101: acquiring an image acquired by a monocular camera;
step 102: calculating an instance segmentation score of each pixel point in the image based on an instance segmentation network;
Here, the instance segmentation network divides the picture into an n × n grid. If the centroid of an object falls in a certain grid cell, that cell has two tasks: (1) predict the semantic category of the object; (2) generate the instance mask of the object.
These two tasks are realized by the classification branch and the mask branch of the instance segmentation network. At the same time, a feature pyramid network assigns objects of different sizes to feature maps at different levels, which serve as the objects' size categories.
It should be noted that the instance segmentation network is designed as follows. The network output is split into two branches: a classification branch and a mask branch. The picture is divided equally into an S × S grid; the classification branch has size S × S × C, where C is the number of categories; the mask branch has size H × W × S², where S² is the predicted maximum number of instances, corresponding to original-image positions from top to bottom and left to right. When the center of a target object falls into grid cell (i, j), the corresponding position of the classification branch and the corresponding channel of the mask branch are responsible for predicting that object.
The specific steps are as follows: (1) for each grid cell, the classification branch predicts a C-dimensional output representing the probabilities of the semantic classes; all predicted probability values of the classification branch are extracted and filtered with a first threshold (for example, 0.1), retaining the probability values larger than that threshold; (2) the position indices i, j corresponding to the classifications remaining after filtering are obtained; (3) the mask branch is split into X and Y directions, and the mask of a category is obtained by element-wise multiplication of the i-th channel of the X branch and the j-th channel of the Y branch, establishing a one-to-one correspondence between semantic categories and masks; (4) the masks are screened with a second threshold (for example, 0.5), retaining the masks larger than that threshold; (5) a local maximum search is performed on all retained masks; (6) the final masks obtained after the local maximum search are scaled to the original image size, giving the instance segmentation score of each pixel.
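The six steps above can be sketched in NumPy. This is a minimal illustration, not the patented implementation: the tensor shapes follow a decoupled-SOLO-style layout, the local maximum search of step (5) is simplified to a per-pixel maximum across instances, and the rescaling uses nearest-neighbour sampling.

```python
import numpy as np

def instance_scores(cls_pred, mask_x, mask_y, img_hw, t1=0.1, t2=0.5):
    """Per-pixel instance-score extraction (steps (1)-(6) above, simplified).

    cls_pred : (S, S, C) classification probabilities per grid cell
    mask_x   : (H, W, S) X-direction mask maps
    mask_y   : (H, W, S) Y-direction mask maps
    Returns an img_hw score map (0 where no instance was found)."""
    scores = np.zeros(img_hw, dtype=np.float32)
    # (1)-(2): keep grid cells whose class probability exceeds the first threshold
    i_idx, j_idx, _ = np.nonzero(cls_pred > t1)
    for i, j in zip(i_idx, j_idx):
        # (3): element-wise product of the i-th X map and j-th Y map gives the mask
        mask = mask_x[:, :, i] * mask_y[:, :, j]
        # (4): the second threshold screens out weak mask pixels
        mask = np.where(mask > t2, mask, 0.0)
        if mask.max() == 0.0:
            continue
        # (5)-(6): rescale to the original image size (nearest neighbour),
        # keeping the per-pixel maximum across all retained instances
        ys = (np.arange(img_hw[0]) * mask.shape[0] // img_hw[0]).clip(0, mask.shape[0] - 1)
        xs = (np.arange(img_hw[1]) * mask.shape[1] // img_hw[1]).clip(0, mask.shape[1] - 1)
        scores = np.maximum(scores, mask[np.ix_(ys, xs)])
    return scores
```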
Step 103: acquiring 3D point cloud data of a laser radar;
step 104: fusing the instance segmentation scores and the 3D point cloud data to obtain fused 3D point cloud data;
the data fusion process is mainly characterized by 1) joint calibration: finding a space conversion relation from the laser radar to the camera, and projecting laser radar points to an image through a conversion matrix; 2) information fusion: and adding the example segmentation score obtained by each pixel point to the laser radar point, and realizing the two steps.
The method comprises the following specific steps: (1) acquiring external parameters (a rotation matrix and a translation matrix) of the monocular camera and the laser radar, and projecting points under a three-dimensional coordinate system of the laser radar point cloud to the three-dimensional coordinate system of the monocular camera; (2) obtaining monocular camera internal parameters (an internal parameter matrix and a distortion parameter matrix) through monocular camera calibration, and projecting points under a monocular camera three-dimensional coordinate system to an imaging plane so as to establish a corresponding relation between laser radar point cloud and image pixels; (3) and adding the instance segmentation score obtained by each pixel point to the laser radar point according to the corresponding relation.
It should be noted that if there are overlapping fields of view of multiple monocular cameras, there may be a case where the lidar points are projected on multiple images simultaneously, and at this time, the instance division scores in one image are randomly selected.
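Joint calibration and information fusion (steps (1)-(3)) can be sketched as follows. This is a hedged illustration: lens distortion is ignored, and the shapes assumed for `R`, `t`, `K`, and `score_map` are conventions chosen for the sketch rather than specifics from the text.

```python
import numpy as np

def fuse_scores(points, R, t, K, score_map):
    """Attach per-pixel instance scores to lidar points.

    points    : (N, 3) lidar points in the lidar frame
    R, t      : extrinsics (rotation matrix, translation vector), lidar -> camera
    K         : (3, 3) camera intrinsic matrix (distortion ignored here)
    score_map : (H, W) per-pixel instance segmentation scores
    Returns (N, 4): the points with a fused score appended (0 if not visible)."""
    H, W = score_map.shape
    cam = points @ R.T + t                      # (1) lidar frame -> camera frame
    uvw = cam @ K.T                             # (2) camera frame -> image plane
    scores = np.zeros(len(points))
    z = uvw[:, 2]
    valid = z > 1e-6                            # keep points in front of the camera
    u = np.round(uvw[valid, 0] / z[valid]).astype(int)
    v = np.round(uvw[valid, 1] / z[valid]).astype(int)
    inside = (u >= 0) & (u < W) & (v >= 0) & (v < H)
    idx = np.nonzero(valid)[0][inside]
    scores[idx] = score_map[v[inside], u[inside]]  # (3) look up the pixel score
    return np.hstack([points, scores[:, None]])
```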
Step 105: and performing 3D target detection on the fused 3D point cloud data by adopting a point cloud depth model algorithm to obtain a 3D boundary frame of the detected object.
The specific steps are as follows: 1) 3D proposal generation: learn point-wise features of the laser radar points fused with the image instance segmentation scores, segment the original point cloud, and generate 3D proposals from the segmented foreground points; 2) point cloud region pooling: pool each point and its features according to the position of each 3D proposal; 3) 3D bounding box optimization: transform the pooled points of each 3D proposal into canonical coordinates and learn local spatial features and global semantic features, which are used for 3D bounding box refinement and confidence prediction.
The depth model in the embodiment of the invention mainly comprises three modules: a 3D proposal generation module, a point cloud region pooling module, and a canonical 3D bounding box refinement module. The 3D proposal generation module segments the original point cloud (that is, the laser radar point cloud data fused with the image instance segmentation scores) by learning point-wise features, and generates 3D proposals from the segmented foreground points. The point cloud region pooling module pools the 3D points from the previous stage and their corresponding point features according to each 3D proposal, with the aim of learning more specific local features of each proposal; that is, the pooled object is the fused 3D point cloud data, and the pooling serves to retain key features. The canonical 3D bounding box refinement module receives the pooled points of each 3D proposal and their associated features, fine-tunes the position of the 3D box and the confidence of the foreground object, and generates the refined 3D bounding box of the detected object.
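The point cloud region pooling step can be illustrated with a small NumPy sketch. The box layout (center, size, yaw about the vertical axis) and the enlargement margin are assumptions made for illustration; the patent does not fix these conventions.

```python
import numpy as np

def pool_points(points, feats, box, margin=0.2):
    """Pool points (and their features) that fall inside one 3D proposal.

    box = (cx, cy, cz, h, w, l, theta); axes assume a camera-style frame
    (x right, y down, z forward) with yaw theta about the vertical (Y) axis.
    `margin` slightly enlarges the box to keep context points (an assumption)."""
    cx, cy, cz, h, w, l, theta = box
    local = points[:, :3] - np.array([cx, cy, cz])
    c, s = np.cos(-theta), np.sin(-theta)
    # rotate into the box frame so the proposal becomes axis-aligned
    x = local[:, 0] * c - local[:, 2] * s
    zr = local[:, 0] * s + local[:, 2] * c
    y = local[:, 1]
    inside = (np.abs(x) <= l / 2 + margin) & \
             (np.abs(zr) <= w / 2 + margin) & \
             (np.abs(y) <= h / 2 + margin)
    return points[inside], feats[inside]
```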
Specifically, the 3D proposal generation module comprises the following parts:
(1) learning a point cloud representation: in order to learn discriminative point-wise features describing the original point cloud, the embodiment of the invention uses PointNet++ with multi-scale grouping as the backbone network;
(2) foreground point segmentation: foreground points provide sufficient information for predicting the locations and orientations of the objects they belong to. To learn to segment foreground points, the point cloud network must capture context in order to make accurate point-wise predictions. The 3D proposal generation method of the embodiment generates 3D box proposals directly from foreground points, i.e., foreground segmentation and 3D box proposal generation are performed simultaneously. Given the point-wise features encoded by the backbone network, the embodiment appends a segmentation head for estimating the foreground mask and a bounding box regression head for generating the 3D proposals. For large-scale outdoor scenes, the number of foreground points is much smaller than the number of background points, so the embodiment uses the focal loss to handle this class imbalance:
FL(p_t) = −α_t · (1 − p_t)^γ · log(p_t)
During training of the point cloud segmentation, α_t = 0.25 and γ = 2.
(3) Bin-based three-dimensional bounding box generation: a 3D bounding box is represented in the lidar coordinate system as (x, y, z, h, w, l, θ), where (x, y, z) is the target center position, (h, w, l) is the size of the target, and θ is the target orientation in the bird's eye view. To constrain the generated 3D box proposals, the embodiment of the present invention estimates the 3D bounding box of an object with a bin-based regression loss function. To estimate the center position of an object, the embodiment splits the region around each foreground point into a series of discrete bins along the X and Z axes. Specifically, a search range S is set along each of the X and Z axes of the current foreground point, and each one-dimensional search range is divided into bins of equal length δ to represent the candidate centers (x, z) of the different objects in the X-Z plane. The localization loss for the X and Z axes consists of two parts: bin classification along each axis, and residual regression within the classified bin. The center position y along the Y axis is regressed directly with the smooth L1 loss. The localization targets are computed as follows:
bin_x^(p) = ⌊(x^p − x^(p) + S) / δ⌋,  bin_z^(p) = ⌊(z^p − z^(p) + S) / δ⌋
res_u^(p) = (1/ι) · (u^p − u^(p) + S − (bin_u^(p) · δ + δ/2)),  u ∈ {x, z}
where (x^(p), y^(p), z^(p)) are the coordinates of the foreground point of interest, (x^p, y^p, z^p) are the center coordinates of its corresponding object, bin_x^(p) and bin_z^(p) are the ground-truth bin assignments along the X and Z axes, res_x^(p) and res_z^(p) are the ground-truth residuals for further localization refinement within the assigned bins, and ι is the normalized bin length.
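The bin and residual targets above can be sketched as follows; this is a minimal illustration, and the values of S and δ are assumptions, since the patent does not fix them:

```python
import numpy as np

def center_bin_targets(fg_point, obj_center, S=3.0, delta=0.5):
    """Bin classification and residual regression targets for the object
    center along the X and Z axes, relative to one foreground point.
    S (per-axis search range) and delta (bin length) are assumed values."""
    targets = {}
    for axis, fp, c in (("x", fg_point[0], obj_center[0]),
                        ("z", fg_point[2], obj_center[2])):
        offset = c - fp + S                       # shift into [0, 2S)
        bin_idx = int(np.floor(offset / delta))   # bin classification target
        targets[f"bin_{axis}"] = bin_idx
        # residual to the bin center, normalized by the bin length
        targets[f"res_{axis}"] = (offset - (bin_idx * delta + delta / 2)) / delta
    # the Y offset is regressed directly with a smooth-L1 loss, no binning
    targets["res_y"] = obj_center[1] - fg_point[1]
    return targets

# foreground point at (1.0, 0.5, 2.0), object center at (1.6, 0.7, 2.0)
t = center_bin_targets(np.array([1.0, 0.5, 2.0]), np.array([1.6, 0.7, 2.0]))
```

Classification over coarse bins plus a small normalized residual is easier to learn than regressing a large unbounded offset directly.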
(4) Setting object orientation and size targets: the embodiment of the invention divides the orientation range 2π into n bins and computes the bin classification target bin_θ^(p) and the residual regression target res_θ^(p) in the same manner as for the x and z directions. The size (h, w, l) of the object is regressed directly with respect to the average object size of each class computed over the entire training set.
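The orientation targets can be computed analogously; a minimal sketch (the number of bins n is an assumption here):

```python
import numpy as np

def orientation_bin_target(theta, n_bins=12):
    """Split the full orientation range 2*pi into n_bins and return the
    (bin index, normalized residual) targets for a ground-truth heading
    theta given in radians."""
    bin_size = 2 * np.pi / n_bins
    theta = theta % (2 * np.pi)        # wrap the angle into [0, 2*pi)
    bin_idx = int(theta // bin_size)
    # residual to the bin center, normalized to roughly [-0.5, 0.5)
    residual = (theta - (bin_idx * bin_size + bin_size / 2)) / bin_size
    return bin_idx, residual

b, r = orientation_bin_target(1.0)
```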
(5) Parameter recovery during inference: in the inference stage, for the bin-based prediction parameters x, z and θ, the bin center with the highest prediction confidence is selected first, and the predicted residual is added to obtain the refined parameter. For the directly regressed parameters y, h, w and l, the predicted residuals are added to their initial values.
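The decoding step for one bin-based parameter can be sketched as follows (an illustration matching the target computation above; S and δ values are assumptions):

```python
import numpy as np

def decode_center_x(fg_x, bin_logits, bin_residuals, S=3.0, delta=0.5):
    """Recover the refined X coordinate from per-bin confidences and
    per-bin predicted residuals: pick the highest-confidence bin, then
    add that bin's regressed residual (normalized by the bin length)."""
    best = int(np.argmax(bin_logits))
    # bin center in the shifted [0, 2S) range, plus the regressed residual
    offset = best * delta + delta / 2 + bin_residuals[best] * delta
    return fg_x + offset - S

logits = np.zeros(12); logits[7] = 5.0       # bin 7 is most confident
res = np.zeros(12); res[7] = -0.3            # its predicted residual
x_refined = decode_center_x(1.0, logits, res)
```

With the targets from the previous sketch, this round-trips: bin 7 with residual -0.3 decodes back to the object center x = 1.6.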
(6) Training loss function: the regression loss L_reg of the entire 3D bounding box over the different training loss terms is expressed as follows:
L_reg = (1/N_pos) · Σ_p [ Σ_{u ∈ {x, z, θ}} ( F_cls(bin̂_u^(p), bin_u^(p)) + F_reg(reŝ_u^(p), res_u^(p)) ) + Σ_{v ∈ {y, h, w, l}} F_reg(reŝ_v^(p), res_v^(p)) ]
where N_pos is the number of foreground points, bin̂_u^(p) and reŝ_u^(p) are the predicted bin assignment and residual of foreground point p, bin_u^(p) and res_u^(p) are the ground-truth targets computed as above, F_cls is the cross-entropy classification loss, and F_reg is the smooth L1 loss.
(7) Non-maximum suppression during training and inference: to remove redundant proposals, non-maximum suppression based on the oriented IoU in the bird's eye view is used to keep a small set of high-quality proposals (no specific number is required). During training, the bird's-eye-view IoU threshold is 0.85, and non-maximum suppression retains the top 300 proposals for training the subsequent sub-network. During inference, the bird's-eye-view IoU threshold is set to 0.8, and non-maximum suppression retains the top 100 proposals for the subsequent refinement sub-network.
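The greedy suppression procedure can be sketched as follows. Note this simplification uses axis-aligned bird's-eye-view boxes, whereas the patent uses oriented BEV IoU that also accounts for the heading θ:

```python
import numpy as np

def bev_nms(boxes, scores, iou_thresh, top_k):
    """Greedy NMS on axis-aligned bird's-eye-view boxes.
    boxes: (N, 4) rows of [x1, z1, x2, z2]; scores: (N,) confidences."""
    order = np.argsort(scores)[::-1]          # highest confidence first
    keep = []
    while order.size and len(keep) < top_k:
        i = int(order[0])
        keep.append(i)
        # intersection of box i with all remaining boxes
        x1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        z1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        x2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        z2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.maximum(0, x2 - x1) * np.maximum(0, z2 - z1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[order[1:], 2] - boxes[order[1:], 0]) * \
                (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + areas - inter)
        # drop boxes overlapping box i above the threshold
        order = order[1:][iou <= iou_thresh]
    return keep

boxes = np.array([[0, 0, 2, 2], [0, 0, 2, 2.1], [5, 5, 7, 7]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
kept = bev_nms(boxes, scores, iou_thresh=0.85, top_k=300)
```

Here the second box overlaps the first with IoU ≈ 0.95 > 0.85 and is suppressed, while the distant third box survives.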
The point cloud area pooling module specifically comprises the following parts:
(1) Expanding the 3D proposal box: each 3D proposal box bi=(xi,yi,zi,hi,wi,li,θi) is appropriately enlarged to create a new 3D box bi^e=(xi,yi,zi,hi+η,wi+η,li+η,θi) so as to encode additional information from its surroundings, where η is a fixed value used to enlarge the size of the box.
(2) Determining whether a point is inside the enlarged box: for each point p = (x^(p), y^(p), z^(p)), an inside/outside test is performed to determine whether the point lies in the enlarged proposal box bi^e. If so, the point and its features are retained for refining bi. The features associated with an interior point p include: its 3D coordinates (x^(p), y^(p), z^(p)) ∈ R^3, its laser reflection intensity r^(p) ∈ R, its predicted segmentation mask m^(p) ∈ {0, 1} from the previous stage, and its C-dimensional learned point feature representation f^(p) ∈ R^C from the previous stage. Including the segmentation mask m^(p) distinguishes foreground from background points within the enlarged box, while the learned point feature f^(p) encodes valuable information acquired through learning for segmentation and proposal generation.
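The enlargement and the inside/outside test can be sketched as follows; the value of η is an assumption, and the test works in the box's local frame by translating to the box center and rotating by the heading:

```python
import numpy as np

def points_in_enlarged_box(points, box, eta=0.2):
    """Mask of points inside a proposal enlarged by eta on each size axis.
    points: (N, 3) lidar coordinates; box: (x, y, z, h, w, l, theta).
    eta is the fixed enlargement value (assumed here)."""
    x, y, z, h, w, l, theta = box
    h, w, l = h + eta, w + eta, l + eta          # the enlarged box b_i^e
    p = points - np.array([x, y, z])             # translate to box center
    c, s = np.cos(theta), np.sin(theta)
    # rotate around the vertical (Y) axis into the box's local frame
    px = c * p[:, 0] - s * p[:, 2]
    pz = s * p[:, 0] + c * p[:, 2]
    # compare against the enlarged half-sizes on each local axis
    return (np.abs(px) <= l / 2) & (np.abs(p[:, 1]) <= h / 2) & (np.abs(pz) <= w / 2)

mask = points_in_enlarged_box(
    np.array([[0.2, 0.0, 0.0], [3.0, 0.0, 0.0]]),
    (0.0, 0.0, 0.0, 1.0, 1.0, 2.0, 0.0))
```

Only points with a True mask, together with their features [r^(p), m^(p), f^(p)], are forwarded to the refinement stage.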
The 3D bounding box refinement module specifically comprises the following parts:
(1) Canonical transformation: to take advantage of the high-recall proposal boxes generated by the 3D proposal generation module and to estimate only the residuals of the proposal box parameters, the embodiment of the present invention transforms the pooled points belonging to each proposal into the canonical coordinate system of the corresponding 3D proposal. The canonical coordinate system of one 3D proposal is defined as follows: the origin is at the center of the proposal box; the local X and Z axes are approximately parallel to the ground plane, with X pointing in the heading direction of the proposal and Z perpendicular to X; the Y axis remains the same as that of the lidar coordinate system.
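The canonical transformation described above amounts to a translation to the box center followed by a rotation about the vertical axis; a minimal sketch:

```python
import numpy as np

def to_canonical(points, proposal):
    """Transform pooled points into the canonical frame of one proposal:
    origin at the box center, local X along the proposal heading,
    Y kept aligned with the lidar Y axis, Z perpendicular to X.
    points: (N, 3); proposal: (x, y, z, h, w, l, theta)."""
    x, y, z, theta = proposal[0], proposal[1], proposal[2], proposal[6]
    p = points - np.array([x, y, z])          # move origin to box center
    c, s = np.cos(theta), np.sin(theta)
    out = p.copy()
    out[:, 0] = c * p[:, 0] - s * p[:, 2]     # undo the heading rotation
    out[:, 2] = s * p[:, 0] + c * p[:, 2]
    return out

# a point exactly at the proposal center maps to the canonical origin
canon = to_canonical(np.array([[1.0, 2.0, 3.0]]),
                     (1.0, 2.0, 3.0, 1.5, 1.6, 3.9, 0.7))
```

In this frame every proposal looks alike regardless of where it sits in the scene, which is what makes the local spatial feature learning robust.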
(2) Composition of the refinement sub-network: the refinement sub-network combines the transformed local spatial point features with their global semantic features f^(p) from the 3D proposal generation module, and refines the box parameters and the confidence.
(3) Deficiency of the canonical transformation and its remedy: while the canonical transformation enables robust learning of local spatial features, it inevitably loses the depth information of each object. For example, due to the fixed angular scanning resolution of the lidar sensor, distant objects usually contain far fewer points than nearby ones. To compensate for the lost depth information, the embodiment of the present invention adds the distance to the sensor, d^(p) = sqrt((x^(p))^2 + (y^(p))^2 + (z^(p))^2), to the features of point p.
(4) After all features of a proposal are obtained, for each proposal the local spatial features of its associated points are first concatenated with the extra features [r^(p), m^(p), d^(p)] and passed through several fully connected layers (the specific number is determined as appropriate), encoding these local features to the same dimension as the global features f^(p). The local and global features are then concatenated and fed into the network to obtain a discriminative feature vector for confidence classification and box refinement.
(5) Loss of box proposal refinement: the proposal refinement adopts the bin-based regression loss. A ground-truth box is assigned to a 3D box proposal for learning box refinement if the IoU between them is greater than 0.55. The overall loss function of the entire module is as follows:
L_refine = (1/‖β‖) · Σ_{i ∈ β} F_cls(prob_i, label_i) + (1/‖β_pos‖) · Σ_{i ∈ β_pos} (L̃_bin^(i) + L̃_res^(i))
where β is the set of 3D proposals from the 3D proposal generation module, β_pos is the subset of proposals kept as positives for regression, prob_i is the estimated confidence of b̃_i, and label_i is its corresponding label. Finally, oriented non-maximum suppression with a bird's-eye-view IoU threshold of 0.01 is applied to remove the overlapping bounding boxes, generating the 3D bounding boxes of the detected objects.
In addition, corresponding to the above method, the present invention further provides a 3D target detection system based on monocular camera and lidar fusion, as shown in fig. 2, specifically including:
an image acquisition module 201, configured to acquire an image acquired by a monocular camera;
an example segmentation score calculation module 202, configured to calculate an example segmentation score of each pixel point in the image based on an example segmentation network;
a 3D point cloud data obtaining module 203, configured to obtain 3D point cloud data of the laser radar;
a data fusion module 204, configured to fuse the instance segmentation scores with the 3D point cloud data to obtain fused 3D point cloud data;
and the target detection module 205 is configured to perform 3D target detection on the fused 3D point cloud data by using a point cloud depth model algorithm to obtain a 3D bounding box of the detected object.
Due to the adoption of the technical scheme, the invention has the following advantages:
the 3D target detection method based on the fusion of the monocular camera and the lidar can effectively solve the problem that the viewing angle of the monocular camera is inconsistent with that of the lidar during the fusion process, and achieves higher fusion efficiency than the prior art.
Compared with the prior art, the 3D target detection method based on the fusion of the monocular camera and the lidar can improve the detection accuracy of small objects by adding the detail information provided by instance segmentation.
The 3D target detection method based on the fusion of the monocular camera and the lidar adopts instance segmentation as the fusion means, and its output can serve not only 3D target detection but also other tasks in autonomous driving such as depth estimation and multi-target tracking.
The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. For the system disclosed by the embodiment, the description is relatively simple because the system corresponds to the method disclosed by the embodiment, and the relevant points can be referred to the method part for description.
The principles and embodiments of the present invention have been described herein using specific examples, which are provided only to help understand the method and the core concept of the present invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, the specific embodiments and the application range may be changed. In view of the above, the present disclosure should not be construed as limiting the invention.
Claims (10)
1. A3D target detection method based on monocular camera and laser radar fusion is characterized by comprising the following steps:
acquiring an image acquired by a monocular camera;
calculating an instance segmentation score of each pixel point in the image based on an instance segmentation network;
acquiring 3D point cloud data of a laser radar;
fusing the instance segmentation scores and the 3D point cloud data to obtain fused 3D point cloud data;
and performing 3D target detection on the fused 3D point cloud data by adopting a point cloud depth model algorithm to obtain a 3D bounding box of the detected object.
2. The monocular camera and lidar fusion based 3D target detection method of claim 1, wherein the output of the instance segmentation network comprises a classification branch for predicting semantic categories of objects and deriving corresponding probability values, and a mask branch for calculating instance masks of objects.
3. The monocular camera and lidar fusion based 3D target detection method of claim 2, wherein the calculating an instance segmentation score for each pixel point in the image based on an instance segmentation network specifically comprises:
obtaining a prediction probability value of the classification branch;
judging whether the prediction probability value is larger than a first threshold value or not;
if so, acquiring a position index corresponding to the prediction probability value;
dividing the mask branches into an X direction and a Y direction;
calculating a mask according to the position index, the X direction and the Y direction;
acquiring a mask which is larger than a second threshold value in the masks;
performing local maximum search on the mask which is larger than the second threshold value to obtain the maximum value of the mask;
and carrying out size scaling on the mask maximum value according to the size of the original image to obtain an example segmentation score.
4. The monocular camera and lidar fusion based 3D target detection method of claim 1, wherein fusing the instance segmentation score with the 3D point cloud data to obtain fused 3D point cloud data specifically comprises:
acquiring external parameters of a monocular camera and a laser radar, wherein the external parameters comprise a rotation matrix and a translation matrix;
projecting the 3D point cloud data of the laser radar to a monocular camera three-dimensional coordinate system according to the external parameters;
acquiring internal parameters of a monocular camera, wherein the internal parameters comprise an internal parameter matrix and a distortion parameter matrix;
projecting points under the three-dimensional coordinate system of the monocular camera to an imaging plane according to the internal reference to obtain the corresponding relation between the 3D point cloud data and the image pixels of the laser radar;
and adding the instance segmentation score of each pixel point in the image to the 3D point cloud data according to the corresponding relation between the 3D point cloud data of the laser radar and the image pixels to obtain fused 3D point cloud data.
5. The monocular camera and lidar fusion based 3D target detection method of claim 1, wherein the performing 3D target detection on the fused 3D point cloud data by adopting a point cloud depth model algorithm to obtain a 3D bounding box of the detected object specifically comprises:
segmenting the fused 3D point cloud data through learning features point by point to obtain segmented foreground points;
generating a 3D proposal according to the segmented foreground points;
pooling the fused 3D point cloud data and the corresponding point features according to the 3D proposal;
and generating a 3D bounding box of the detected object according to the pooled 3D point cloud data and the point features corresponding to the point cloud data.
6. The monocular camera and lidar fusion based 3D target detection method of claim 1, wherein the first threshold is 0.1 and the second threshold is 0.5.
7. A3D target detection system based on monocular camera and lidar fusion, characterized by comprising:
the image acquisition module is used for acquiring an image acquired by the monocular camera;
the example segmentation score calculation module is used for calculating the example segmentation score of each pixel point in the image based on an example segmentation network;
the 3D point cloud data acquisition module is used for acquiring 3D point cloud data of the laser radar;
the data fusion module is used for fusing the instance segmentation scores with the 3D point cloud data to obtain fused 3D point cloud data;
and the target detection module is used for performing 3D target detection on the fused 3D point cloud data by adopting a point cloud depth model algorithm to obtain a 3D bounding box of the detected object.
8. The monocular camera and lidar fusion based 3D target detection system of claim 7, wherein the instance segmentation score calculation module specifically comprises:
the classification branch unit is used for acquiring the prediction probability value of the classification branch;
the first judging unit is used for judging whether the prediction probability value is larger than a first threshold value or not;
the position index unit is used for acquiring a position index corresponding to the prediction probability value when the prediction probability value is larger than a first threshold value;
a mask branching unit for dividing the mask branch into an X direction and a Y direction;
a mask calculation unit for calculating a mask according to the position index, the X direction and the Y direction;
a second judging unit, configured to obtain a mask that is greater than a second threshold value from among the masks;
the local search unit is used for performing local maximum search on the mask which is greater than the second threshold value to obtain the maximum value of the mask;
and the example division score calculating unit is used for carrying out size scaling on the mask maximum value according to the original image size to obtain the example division score.
9. The monocular camera and lidar fusion based 3D target detection system of claim 7, wherein the data fusion module specifically comprises:
the external parameter acquisition unit is used for acquiring external parameters of the monocular camera and the laser radar, and the external parameters comprise a rotation matrix and a translation matrix;
the first projection unit is used for projecting the 3D point cloud data of the laser radar to a monocular camera three-dimensional coordinate system according to the external parameters;
the internal reference acquisition unit is used for acquiring internal reference of the monocular camera, and the internal reference comprises an internal reference matrix and a distortion parameter matrix;
the second projection unit is used for projecting points under the three-dimensional coordinate system of the monocular camera to an imaging plane according to the internal reference to obtain the corresponding relation between the 3D point cloud data and the image pixels of the laser radar;
and the data fusion unit is used for adding the instance segmentation scores of each pixel point in the image to the 3D point cloud data according to the corresponding relation between the 3D point cloud data of the laser radar and the image pixels to obtain fused 3D point cloud data.
10. The monocular camera and lidar fusion based 3D target detection system of claim 7, wherein the target detection module specifically comprises:
the foreground extraction unit is used for segmenting the fused 3D point cloud data through learning features point by point to obtain segmented foreground points;
a 3D proposal generating unit, which is used for generating a 3D proposal according to the segmented foreground points;
the point cloud data pooling unit is used for pooling the fused 3D point cloud data and corresponding point characteristics according to the 3D proposal;
and the 3D bounding box refining unit is used for generating a 3D bounding box of the detected object according to the 3D point cloud data after the pooling and the point characteristics corresponding to the 3D point cloud data.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110447403.0A CN113139602A (en) | 2021-04-25 | 2021-04-25 | 3D target detection method and system based on monocular camera and laser radar fusion |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113139602A true CN113139602A (en) | 2021-07-20 |
Family
ID=76811961
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110447403.0A Pending CN113139602A (en) | 2021-04-25 | 2021-04-25 | 3D target detection method and system based on monocular camera and laser radar fusion |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113139602A (en) |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110472534A (en) * | 2019-07-31 | 2019-11-19 | 厦门理工学院 | 3D object detection method, device, equipment and storage medium based on RGB-D data |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113985445A (en) * | 2021-08-24 | 2022-01-28 | 中国北方车辆研究所 | 3D target detection algorithm based on data fusion of camera and laser radar |
CN116265862A (en) * | 2021-12-16 | 2023-06-20 | 动态Ad有限责任公司 | Vehicle, system and method for a vehicle, and storage medium |
CN114359181A (en) * | 2021-12-17 | 2022-04-15 | 上海应用技术大学 | Intelligent traffic target fusion detection method and system based on image and point cloud |
CN114359181B (en) * | 2021-12-17 | 2024-01-26 | 上海应用技术大学 | Intelligent traffic target fusion detection method and system based on image and point cloud |
CN114913209A (en) * | 2022-07-14 | 2022-08-16 | 南京后摩智能科技有限公司 | Multi-target tracking network construction method and device based on overlook projection |
CN118015411A (en) * | 2024-02-27 | 2024-05-10 | 北京化工大学 | Automatic driving-oriented large vision language model increment learning method and device |
CN117994504A (en) * | 2024-04-03 | 2024-05-07 | 国网江苏省电力有限公司常州供电分公司 | Target detection method and target detection device |
CN117994504B (en) * | 2024-04-03 | 2024-07-02 | 国网江苏省电力有限公司常州供电分公司 | Target detection method and target detection device |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111797716B (en) | Single target tracking method based on Siamese network | |
CN110675418B (en) | Target track optimization method based on DS evidence theory | |
CN110929692B (en) | Three-dimensional target detection method and device based on multi-sensor information fusion | |
CN113139602A (en) | 3D target detection method and system based on monocular camera and laser radar fusion | |
CN113159151B (en) | Multi-sensor depth fusion 3D target detection method for automatic driving | |
CN109903331B (en) | Convolutional neural network target detection method based on RGB-D camera | |
CN111830502B (en) | Data set establishing method, vehicle and storage medium | |
CN110689562A (en) | Trajectory loop detection optimization method based on generation of countermeasure network | |
CN111563415A (en) | Binocular vision-based three-dimensional target detection system and method | |
CN114022830A (en) | Target determination method and target determination device | |
CN109919026B (en) | Surface unmanned ship local path planning method | |
CN110941996A (en) | Target and track augmented reality method and system based on generation of countermeasure network | |
CN113643345A (en) | Multi-view road intelligent identification method based on double-light fusion | |
CN114114312A (en) | Three-dimensional target detection method based on fusion of multi-focal-length camera and laser radar | |
CN111292369A (en) | Pseudo-point cloud data generation method for laser radar | |
CN114399734A (en) | Forest fire early warning method based on visual information | |
CN111260687A (en) | Aerial video target tracking method based on semantic perception network and related filtering | |
CN115115690A (en) | Video residual decoding device and associated method | |
CN112529917A (en) | Three-dimensional target segmentation method, device, equipment and storage medium | |
CN116958927A (en) | Method and device for identifying short column based on BEV (binary image) graph | |
CN116664851A (en) | Automatic driving data extraction method based on artificial intelligence | |
CN115861709A (en) | Intelligent visual detection equipment based on convolutional neural network and method thereof | |
CN115249269A (en) | Object detection method, computer program product, storage medium, and electronic device | |
CN114758087A (en) | Method and device for constructing city information model | |
CN113569803A (en) | Multi-mode data fusion lane target detection method and system based on multi-scale convolution |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||