CN117789160A - Multi-modal fusion target detection method and system based on cluster optimization - Google Patents

Multi-modal fusion target detection method and system based on cluster optimization

Info

Publication number: CN117789160A
Application number: CN202311569090.1A
Authority: CN (China)
Prior art keywords: target, detection, detection frame, clustering
Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis)
Other languages: Chinese (zh)
Inventors: 肖进胜, 周剑, 谢红刚, 宋成芳, 章红平
Assignees: NATION ENGINEERING RESEARCH CENTER FOR SATELLITE POSITIONING SYSTEM; Wuhan University WHU (the listed assignees may be inaccurate; Google has not performed a legal analysis)
Filing date: 2023-11-21
Publication date: 2024-03-29

Landscapes

  • Image Analysis (AREA)
Abstract

The invention provides a multi-modal fusion target detection method and system based on cluster optimization. The method comprises the following steps: acquiring a 2D image and a 3D point cloud containing the targets to be detected; inputting the 2D image into a two-dimensional target detection network with a CBAM attention module to obtain a 2D detection frame for each target in the 2D image; coordinate-mapping the 2D detection frame of each target onto the 3D point cloud to generate a conical region of interest for each target; inputting the conical region of interest of each target into a three-dimensional target detection network to obtain a 3D detection frame for each target; and, based on the 2D and 3D detection frames of each target, obtaining the final 3D detection frame of each target through a point-cloud clustering method that fuses prior information. By adopting a 3D target detection network architecture based on cluster optimization, the invention greatly improves the average detection precision for smaller targets such as pedestrians and cyclists.

Description

Multi-modal fusion target detection method and system based on cluster optimization
Technical Field
The invention relates to the field of target detection, and in particular to a multi-modal fusion target detection method and system based on cluster optimization.
Background
In road target detection, small targets suffer from point-cloud sparsity: when the point clouds of the KITTI data set are projected onto the corresponding RGB images, only about 3% of pixels have a corresponding point, and roughly half of the Moderate- and Hard-difficulty targets in the data set contain fewer than 60 points. As a result, the structure and semantic information of small 3D targets is incomplete, and the targets are easily confused with the background. In particular, a distant pedestrian and a straight pole may exhibit almost identical point-cloud geometry, leading to false positives.
The main prior technique for this task is the image/point-cloud detection algorithm based on F-PointNet, which still has several shortcomings in the current application environment.
The first category stems from the imbalance between large-target and small-target data in the KITTI data set:
1. Data acquisition bias: the KITTI data set is mostly collected by vehicle-mounted sensors (e.g., lidar and cameras), which are more likely to capture large targets such as cars and trucks. Large-target data is therefore relatively abundant, while small-target data (such as pedestrians and bicycles) is limited. This imbalance can bias a deep learning model toward learning large targets, so small-target detection performs poorly.
2. Detection model bias: many traditional target detection models, such as Faster R-CNN and YOLO, are designed with large targets in mind and perform poorly on small targets. These models typically incur large loss values on large targets during training, which crowds out the learning of small targets. Model bias is therefore another sub-problem contributing to the large/small-target imbalance.
The second category is that distant small-target point clouds are sparse and easily confused with the background:
1. Sparse point clouds: a distant small target yields very sparse point-cloud data from lidar or other sensors. A large fraction of data points is missing, making it difficult to accurately capture the shape and characteristics of the target. How to process such sparse point clouds so that small targets can be accurately detected and identified is an important issue.
2. Background confusion: in complex urban environments, the point clouds of small targets may blend with the surrounding background, since they can share characteristics with environmental elements such as buildings, roads, and trees. The target then becomes indistinguishable from the background, increasing the risk of false detection. Solving this requires algorithms that accurately separate small targets from the background and exploit contextual information to improve recognition accuracy.
3. Distant target detection: detecting distant targets is more challenging than detecting nearby ones, because at long range the point clouds of small targets may be degraded by wide-range scattering, illumination variation, and sensor limitations. Developing algorithms that efficiently detect and identify distant small targets is therefore another important issue.
Disclosure of Invention
Aiming at the above technical problems in the prior art, the invention provides a multi-modal fusion target detection method and system based on cluster optimization.
According to a first aspect of the present invention, there is provided a multi-modal fusion target detection method based on cluster optimization, including:
acquiring a 2D image and a 3D point cloud containing the targets to be detected;
inputting the 2D image into a two-dimensional target detection network with a CBAM attention module to obtain a 2D detection frame for each target in the 2D image;
coordinate-mapping the 2D detection frame of each target onto the 3D point cloud to generate a conical region of interest for each target;
inputting the conical region of interest of each target into a three-dimensional target detection network to obtain a 3D detection frame for each target;
based on the 2D detection frame and the 3D detection frame of each target, obtaining the final 3D detection frame of each target through a point-cloud clustering method that fuses prior information.
On the basis of the technical scheme, the invention can also make the following improvements.
Optionally, the two-dimensional target detection network is a Yolov5 network enhanced with a CBAM attention mechanism, wherein training the two-dimensional target detection network includes:
acquiring an original training data set comprising a plurality of 2D images, each 2D image containing large targets and/or small targets, a large target being one whose size exceeds a preset size and a small target being one whose size is below the preset size;
compressing the large targets in some of the 2D images in equal proportion, then pasting copies of them into 2D images that contain no targets of that class, to obtain a training data set with an expanded set of small targets;
training the Yolov5 network on the expanded training data set to obtain the trained two-dimensional target detection network;
the Yolov5 network comprises a CBAM attention module, the CBAM attention module comprising a spatial attention sub-module and a channel attention sub-module that respectively enhance the spatial attention and channel attention of the Yolov5 network on the target attention area.
Optionally, obtaining the final 3D detection frame of each target based on its 2D and 3D detection frames through a point-cloud clustering method that fuses prior information includes:
for a target whose 2D detection frame height is greater than a preset pixel height, directly taking the 3D detection frame of the target as its final 3D detection frame;
for a target with a 2D detection frame but no 3D detection frame, when the confidence of the 2D detection frame is greater than a first preset confidence, obtaining the final 3D detection frame from the 2D detection frame and the conical region of interest of the target through the point-cloud clustering method that fuses prior information;
and for a target with both a 2D detection frame and a 3D detection frame, when the confidence of the 3D detection frame is smaller than a second preset confidence, obtaining the final 3D detection frame from the 2D detection frame and the conical region of interest of the target through the point-cloud clustering method that fuses prior information.
Optionally, obtaining the final 3D detection frame of each target from its 2D detection frame and conical region of interest through the point-cloud clustering method that fuses prior information includes:
clustering the point cloud of the conical region of interest of the target based on Euclidean distance to obtain at least one clustering result;
when there are multiple clustering results, selecting the clustering result closest to the center of the target's 2D detection frame as the final 3D detection frame of the target.
Optionally, when multiple clustering results are equally closest to the center of the target's 2D detection frame, the clustering result with the largest number of points is selected as the final 3D detection frame of the target.
Optionally, the method further comprises repairing the position and the size of each small target:
calculating the average size of each class of small targets according to the size of each small target output by the three-dimensional target detection network, wherein the average size comprises average length, width and height;
the size of the final 3D detection frame of the small object of each category is adjusted to the average size.
Optionally, the original training data set records the correspondence between the height of the 2D detection frame of each class of target and its distance from the lidar;
when screening the clustering results, the plausible position range of the center of the target's 3D detection frame is determined from the height of the target's 2D detection frame and this recorded correspondence;
if the center of a clustering result lies within this range, the clustering result is retained; otherwise it is discarded.
Optionally, the original training data set records the rotation angle of the 2D detection frame of each class of target;
when screening the clustering results, the rotation angle of each clustering result is determined from the angle, relative to the x axis, of the line connecting the horizontally farthest and nearest points of the cluster; if the difference between this rotation angle and the rotation angle of the 2D detection frame of targets of the same class is smaller than a preset difference, the clustering result is retained, otherwise it is discarded.
According to a second aspect of the present invention, there is provided a cluster-optimization-based multi-modal fusion target detection system, comprising:
the first acquisition module is used for acquiring a 2D image containing an object to be detected and a 3D point cloud;
the second acquisition module is used for inputting the 2D image into a two-dimensional target detection network with a CBAM attention module to obtain a 2D detection frame of each target in the 2D image;
the mapping module is used for carrying out coordinate mapping on the 2D detection frame and the 3D point cloud of each target to generate a conical region of interest of each target;
the third acquisition module is used for inputting the conical region of interest of each target into the three-dimensional target detection network to acquire a 3D detection frame of each target;
and the fusion module is used for obtaining the final 3D detection frame of each target, based on the 2D detection frame and 3D detection frame of each target, through a point-cloud clustering method that fuses prior information.
According to a third aspect of the present invention, there is provided an electronic device comprising a memory and a processor, the processor implementing the steps of the cluster-optimization-based multi-modal fusion target detection method when executing a computer program stored in the memory.
According to a fourth aspect of the present invention, there is provided a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the cluster-optimization-based multi-modal fusion target detection method.
The multi-modal fusion target detection method and system based on cluster optimization adopt a 3D target detection network architecture based on cluster optimization, which greatly improves the average detection precision for smaller targets such as pedestrians and cyclists.
Drawings
FIG. 1 is a flow chart of the multi-modal fusion target detection method based on cluster optimization provided by the invention;
FIG. 2 is an effect diagram of large objects scaled down and pasted into other images;
FIG. 3 is a schematic diagram of the structure of a CBAM attention module;
FIG. 4 is a schematic view of a point cloud clustering flow;
FIG. 5 is a schematic diagram of cluster frame complementation;
FIG. 6 is a schematic clustering diagram in occlusion cases;
FIG. 7-1 is a schematic diagram of a 2D frame height of a pedestrian and its distance from a lidar;
FIG. 7-2 is a graphical representation of the 2D frame height of a cyclist versus distance from the lidar;
FIG. 8 is a schematic view of rotational misalignment;
FIG. 9 is a schematic diagram of the overall architecture of the cluster-optimization-based multi-modal fusion target detection method;
FIG. 10 is a schematic structural diagram of the cluster-optimization-based multi-modal fusion target detection system;
fig. 11 is a schematic hardware structure of a possible computer readable storage medium according to the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention. In addition, the technical features of each embodiment or the single embodiment provided by the invention can be combined with each other at will to form a feasible technical scheme, and the combination is not limited by the sequence of steps and/or the structural composition mode, but is necessarily based on the fact that a person of ordinary skill in the art can realize the combination, and when the technical scheme is contradictory or can not realize, the combination of the technical scheme is not considered to exist and is not within the protection scope of the invention claimed.
Fig. 1 is a flowchart of the multi-modal fusion target detection method based on cluster optimization. As shown in fig. 1, the method includes:
step 1, acquiring a 2D image containing an object to be detected and a 3D point cloud.
It is understood that when detecting a target, a 2D image and a 3D point cloud containing the detected target are acquired, and then the target is subjected to 3D detection based on the 2D image and the 3D point cloud.
And 2, inputting the 2D image into a two-dimensional target detection network with a CBAM attention module, and acquiring a 2D detection frame of each target in the 2D image.
It can be understood that when performing object detection, the 2D image is input into a two-dimensional object detection network, and a 2D detection frame of each object in the 2D image is acquired, wherein the 2D detection frame is a rectangular detection frame.
The two-dimensional target detection network is a Yolov5 network enhanced with a CBAM attention mechanism, wherein training the two-dimensional target detection network comprises:
acquiring an original training set, wherein the original training data set comprises a plurality of 2D images, each 2D image comprises a large target and/or a small target, the large target is a target with a size exceeding a preset size, and the small target is a target with a size smaller than the preset size;
compressing the large targets in some of the 2D images in equal proportion, then pasting copies of them into 2D images that contain no targets of that class, to obtain a training data set with an expanded set of small targets;
training the Yolov5 network based on the expanded training data set to obtain a trained two-dimensional target detection network;
the Yolov5 network comprises a CBAM attention module, the CBAM attention module comprising a spatial attention sub-module and a channel attention sub-module that respectively enhance the spatial attention and channel attention of the Yolov5 network on the target attention area.
It can be appreciated that there is a serious imbalance between large and small targets in the KITTI data set, i.e., the number of large targets far exceeds the number of small targets. To improve the detection precision of Yolov5 for small targets, the image part of the KITTI data set is augmented with an improved copy-paste module, improving the network's detection performance on small targets. Instead of copying and pasting an object directly back into its original image, objects taller than 25 pixels in the Pedestrian and Cyclist categories are scaled down in equal proportion and then randomly pasted into images that contain no instances of that category, thereby increasing the number of small-object instances. After this data enhancement, the number of Cyclist instances is doubled and the number of Pedestrian instances is increased by 30% compared with no enhancement.
The Yolov5 network is then trained on the augmented data set to obtain the trained two-dimensional target detection network. FIG. 2 shows the augmentation effect of the copy-paste module on the KITTI data set; the augmented annotated objects are marked by the boxes in FIG. 2. During augmentation, the newly pasted targets are guaranteed not to overlap the original targets. In this way, the contribution of small targets to the training loss is increased, preventing the network from converging only toward larger targets.
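A minimal sketch of this scaled copy-paste augmentation is given below; the function names, the fixed 0.5 scale factor, and the rejection-sampling placement loop are illustrative assumptions rather than details from the patent:

    import random
    import cv2
    import numpy as np

    def iou(a, b):
        """Intersection-over-union of two (x1, y1, x2, y2) pixel boxes."""
        ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
        iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
        inter = ix * iy
        union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
        return inter / union if union > 0 else 0.0

    def paste_scaled_instances(src_img, src_boxes, dst_img, dst_boxes,
                               min_height=25, scale=0.5, max_tries=20):
        """Shrink instances taller than min_height pixels by an equal ratio and
        paste them into dst_img at positions overlapping no existing box."""
        new_boxes = []
        H, W = dst_img.shape[:2]
        for (x1, y1, x2, y2) in src_boxes:
            if y2 - y1 <= min_height:
                continue                      # only augment from sufficiently large instances
            pw, ph = max(1, int((x2 - x1) * scale)), max(1, int((y2 - y1) * scale))
            if pw >= W or ph >= H:
                continue
            patch = cv2.resize(src_img[y1:y2, x1:x2], (pw, ph))
            for _ in range(max_tries):        # rejection-sample a non-overlapping location
                px, py = random.randint(0, W - pw), random.randint(0, H - ph)
                cand = (px, py, px + pw, py + ph)
                if all(iou(cand, b) == 0.0 for b in list(dst_boxes) + new_boxes):
                    dst_img[py:py + ph, px:px + pw] = patch
                    new_boxes.append(cand)
                    break
        return dst_img, list(dst_boxes) + new_boxes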
The two-dimensional target detection network is a Yolov5 network improved with a CBAM attention module. CBAM (Convolutional Block Attention Module) is an attention mechanism for convolutional neural networks; its structure, shown in fig. 3, combines a spatial attention module SAM (Spatial Attention Module) with a channel attention module CAM (Channel Attention Module). In spatial attention, the network learns a weight for each pixel, so that feature vectors at different positions are weighted and the model focuses on the regions important to the current task. The channel attention module learns the importance of each channel of the feature map and adjusts the channel weights accordingly to enhance the overall performance of the model.
Adding a CBAM attention module to the 2D detector effectively improves the network's ability to extract target features; therefore, to improve Yolov5's detection of small targets, a CBAM attention module is added to Yolov5.
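The structure in fig. 3 corresponds to the standard CBAM design, which can be sketched as follows (a PyTorch rendition of the published CBAM module with the usual default reduction ratio and kernel size; the exact insertion point inside Yolov5 is not specified here):

    import torch
    import torch.nn as nn

    class ChannelAttention(nn.Module):
        def __init__(self, channels, reduction=16):
            super().__init__()
            self.mlp = nn.Sequential(  # shared MLP over the pooled descriptors
                nn.Conv2d(channels, channels // reduction, 1, bias=False),
                nn.ReLU(inplace=True),
                nn.Conv2d(channels // reduction, channels, 1, bias=False))

        def forward(self, x):
            avg = self.mlp(torch.mean(x, dim=(2, 3), keepdim=True))
            mx = self.mlp(torch.amax(x, dim=(2, 3), keepdim=True))
            return torch.sigmoid(avg + mx)    # per-channel weights

    class SpatialAttention(nn.Module):
        def __init__(self, kernel_size=7):
            super().__init__()
            self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)

        def forward(self, x):
            avg = torch.mean(x, dim=1, keepdim=True)
            mx, _ = torch.max(x, dim=1, keepdim=True)
            return torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))  # per-pixel weights

    class CBAM(nn.Module):
        """Channel attention followed by spatial attention."""
        def __init__(self, channels, reduction=16, kernel_size=7):
            super().__init__()
            self.ca = ChannelAttention(channels, reduction)
            self.sa = SpatialAttention(kernel_size)

        def forward(self, x):
            x = x * self.ca(x)                # reweight channels
            return x * self.sa(x)             # reweight spatial positions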
And 3, carrying out coordinate mapping on the 2D detection frame and the 3D point cloud of each target, and generating a conical region of interest of each target.
And 4, inputting the conical region of interest of each target into a three-dimensional target detection network, and acquiring a 3D detection frame of each target.
It can be understood that the 2D detection frame of each target detected by the two-dimensional target detection network is coordinate-mapped with the 3D point cloud to generate a conical region of interest for each target. The conical region of interest of each target is then input into the three-dimensional target detection network to obtain the 3D detection frame of each target.
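For a KITTI-style sensor setup, this mapping amounts to keeping the lidar points whose image projection falls inside the 2D box; a sketch under that assumption (the calibration matrix names follow the KITTI convention, which the patent does not spell out):

    import numpy as np

    def frustum_points(points_xyz, box2d, P, Tr_velo_to_cam, R0):
        """Keep lidar points whose image projection falls inside the 2D box,
        yielding the conical (frustum) region of interest for one target."""
        n = points_xyz.shape[0]
        pts_h = np.hstack([points_xyz, np.ones((n, 1))])      # homogeneous lidar coords
        cam = R0 @ (Tr_velo_to_cam @ pts_h.T)                 # rectified camera frame, 3 x n
        front = cam[2, :] > 0.1                               # drop points behind the camera
        img = P @ np.vstack([cam, np.ones((1, n))])           # project onto the image plane
        u = img[0, :] / img[2, :]
        v = img[1, :] / img[2, :]
        x1, y1, x2, y2 = box2d
        mask = front & (u >= x1) & (u <= x2) & (v >= y1) & (v <= y2)
        return points_xyz[mask]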
And 5, obtaining the final 3D detection frame of each target through a point-cloud clustering method that fuses prior information, based on the 2D detection frame and the 3D detection frame of each target.
It can be understood that the final 3D detection frame of each target is determined according to the state of its 2D and 3D detection frames, with the results of the 2D and 3D detection frames fused by the clustering optimization module.
For a target whose 2D detection frame height exceeds the preset pixel height, the 3D detection frame of the target is directly taken as its final 3D detection frame. For example, for targets of the car category and for targets whose 2D detection frame is taller than 30 pixels, the 3D detection frame is output directly, without using the clustering module to optimize the 3D detection result; that is, for large targets the clustering optimization module is not used.
For a target with a 2D detection frame but no 3D detection frame, when the confidence of the 2D detection frame is greater than a first preset confidence, the final 3D detection frame is obtained from the 2D detection frame and the conical region of interest of the target through the point-cloud clustering method that fuses prior information.
It can be understood that a target detected by the 2D part but missed by the 3D part has only a 2D detection frame; the clustering optimization module is used to supplement the 2D detection result and produce a 3D detection frame only when the confidence of the 2D detection frame is greater than 0.2. When the confidence of the 2D detection frame is smaller than 0.2, the 2D detection frame is not trusted and the target is treated as a false or missed detection.
For a target with both a 2D detection frame and a 3D detection frame, when the confidence of the 3D detection frame is smaller than a second preset confidence, the final 3D detection frame is obtained from the 2D detection frame and the conical region of interest of the target through the point-cloud clustering method that fuses prior information.
It can be appreciated that, when both the 2D part and the 3D part detect the target, the clustering module is used to optimize the detection result and obtain the 3D detection frame whenever the confidence of the 3D detection frame is less than 0.5; when the confidence of the 3D detection frame is greater than 0.5, the 3D detection frame output by the three-dimensional target detection network is used directly as the final result.
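These decision rules can be summarised in a short sketch; the detection containers and the cluster_fn callback standing in for the prior-guided clustering are assumed interfaces, while the 0.2, 0.5, and 30-pixel thresholds are the values quoted above:

    from dataclasses import dataclass
    from typing import Any, Callable, Optional

    @dataclass
    class Det2D:
        box: Any        # (x1, y1, x2, y2) in pixels
        score: float
        height: float   # 2D box height in pixels

    @dataclass
    class Det3D:
        box: Any        # e.g. (cx, cy, cz, l, w, h, yaw)
        score: float

    def fuse_detections(det2d: Det2D, det3d: Optional[Det3D], frustum_pts,
                        cluster_fn: Callable, conf2d_min=0.2, conf3d_min=0.5,
                        large_height=30):
        """Trust the 3D network for large or high-confidence targets; otherwise
        fall back to the prior-guided clustering supplied as cluster_fn."""
        if det3d is not None and det2d.height > large_height:
            return det3d.box                      # large target: use the 3D result directly
        if det3d is None:                         # 2D hit, 3D miss
            if det2d.score > conf2d_min:
                return cluster_fn(det2d, frustum_pts)
            return None                           # untrusted 2D box: treat as a miss
        if det3d.score < conf3d_min:              # both hit, but the 3D confidence is weak
            return cluster_fn(det2d, frustum_pts)
        return det3d.box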
As an embodiment, obtaining the final 3D detection frame of each target from its 2D detection frame and conical region of interest through the point-cloud clustering method that fuses prior information includes: clustering the point cloud of the conical region of interest based on Euclidean distance to obtain at least one clustering result; and, when there are multiple clustering results, selecting the clustering result closest to the center of the target's 2D detection frame as the final 3D detection frame.
It is understood that clustering is an unsupervised machine learning technique that groups data on the principle of similarity. One feature of the data is selected and a feature threshold is set; the two classes whose features are closest and below the threshold are merged repeatedly, and clustering is complete once the feature distances between all remaining classes exceed the threshold. The clustering module in the embodiment of the invention is based on Euclidean clustering of point clouds. Euclidean clustering is a clustering algorithm based on the Euclidean distance metric: the neighbouring points of each point are found through an established topological relation, their Euclidean distances are computed, and clustering is completed according to those distances; the specific flow is shown in fig. 4.
Point-cloud clustering generally produces several clustering results, and if there is more than one, the results are screened and optimized. Since a target usually lies at the center of its 2D detection frame, the cluster closest to the center of the conical region of interest generated from the 2D detection frame is most likely the target; the clustering result closest to the center of the target's 2D detection frame is therefore selected as its final 3D detection frame. When several clustering results are equally close to the center, the one with the largest number of points is selected.
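A compact sketch of this step follows; the greedy O(n^2) neighbour scan (a KD-tree would normally replace it) and the use of the frustum's central ray as the 3D counterpart of the 2D box centre are implementation assumptions:

    import numpy as np

    def euclidean_cluster(pts, tol=0.5, min_size=5):
        """Greedy Euclidean clustering: grow each cluster by absorbing points
        within tol metres of any member."""
        unvisited = set(range(len(pts)))
        clusters = []
        while unvisited:
            seed = unvisited.pop()
            queue, members = [seed], [seed]
            while queue:
                i = queue.pop()
                near = [j for j in unvisited
                        if np.linalg.norm(pts[i] - pts[j]) < tol]
                for j in near:
                    unvisited.discard(j)
                queue.extend(near)
                members.extend(near)
            if len(members) >= min_size:
                clusters.append(pts[members])
        return clusters

    def pick_cluster(clusters, frustum_axis):
        """Choose the cluster whose centroid is closest to the frustum's central
        ray (the 3D counterpart of the 2D box centre); break ties by point count."""
        axis = np.asarray(frustum_axis, dtype=float)
        def key(c):
            centroid = c.mean(axis=0)
            dist = np.linalg.norm(np.cross(axis, centroid)) / np.linalg.norm(axis)
            return (dist, -len(c))
        return min(clusters, key=key)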
Implementing point-cloud clustering is not difficult; the difficulty is guaranteeing the accuracy of the clustering results. Repeated experiments show that even after some screening of the raw clustering results, many erroneous results remain. The inaccuracy of the clustering results stems mainly from the following three problems, whose causes and solutions are described in turn:
Firstly, a deep-learning-based 3D target detection algorithm acquires prior information about the samples through training, so when predicting a 3D target it can produce a box close to the actual target size even if the point-cloud shape is incomplete, and the evaluation script judges the predicted box as a positive sample when computing the intersection-over-union. Clustering, however, is unsupervised: the boundary of a clustered prediction box is limited by the outermost points, so the target size cannot be completed automatically. As shown in fig. 5, the solid box is the actual target box and the dotted box is the box generated by clustering. Although the cluster box correctly frames the target point cloud, the target size is not completed, so the intersection-over-union between the prediction box and the actual box falls below the threshold and the evaluation script judges the prediction to be wrong.
To solve the above problems, the present invention further includes repairing the position and size of each small object: calculating the average size of each class of small targets according to the size of each small target output by the three-dimensional target detection network, wherein the average size comprises average length, width and height; the size of the final 3D detection frame of the small object of each category is adjusted to the average size.
Specifically, the clustered prediction box is optimized using prior information about the targets. For the Pedestrian and Cyclist categories, all samples in the KITTI data set are counted and the average sizes of the two categories are calculated: the average length, width, and height of the Pedestrian class are 0.83 m, 0.64 m, and 1.77 m respectively, and those of the Cyclist class are 1.77 m, 0.58 m, and 1.73 m respectively. After clustering, the clustered prediction box is completed from the position of the cluster center and the prior size information, as shown by the dotted box in fig. 5, so that the clustering result better matches the actual target box.
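This completion step reduces to centring a prior-sized box on the cluster centroid; a sketch using the averages quoted above (the tuple layout of the returned box is an assumed convention):

    import numpy as np

    # Average sizes quoted above (length, width, height in metres).
    CLASS_SIZE_PRIOR = {
        "Pedestrian": (0.83, 0.64, 1.77),
        "Cyclist":    (1.77, 0.58, 1.73),
    }

    def complete_box(cluster_pts, cls):
        """Centre a prior-sized box on the cluster centroid so the prediction
        is no longer clipped to the outermost clustered points."""
        cx, cy, cz = np.asarray(cluster_pts).mean(axis=0)
        l, w, h = CLASS_SIZE_PRIOR[cls]
        return (cx, cy, cz, l, w, h)          # centre plus prior dimensions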
Secondly, clustering yields multiple results. Although the cluster with the most points near the center of the front view of the point-cloud region of interest can be preferred as the final result, occlusion can still make this choice wrong. As shown in fig. 6, the 2D image contains four targets in total, two of which are severely occluded; in this case the clustering method produces a result at the wrong location.
To solve this problem, the original training data set records the correspondence between the height of the 2D detection frame of each class of target and its distance from the lidar; when screening the clustering results, the plausible position range of the center of the target's 3D detection frame is determined from the height of the target's 2D detection frame and this correspondence; if the center of a clustering result lies within the range, the result is retained, otherwise it is discarded.
It can be understood that, to address this second problem, the embodiment of the invention counts, over the KITTI training set, the correspondence between the height of a target's 2D detection frame and its distance from the lidar. Fig. 7-1 shows this correspondence for the Pedestrian class and fig. 7-2 for the Cyclist class, with the horizontal axis giving the 2D detection frame height in pixels and the vertical axis the distance between the 3D target and the lidar in meters. When screening the clustering results, a reasonable range for the 3D center of the target is determined from the height of the 2D detection frame, allowing some erroneous results to be discarded.
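A sketch of this screening check follows; the fitted height-to-distance curve is passed in as a callable (its functional form is not given in the patent; a reciprocal fit dist ≈ k/h is a natural choice under a pinhole camera model), and the tolerance margin and the constant in the example fit are illustrative placeholders:

    import numpy as np

    def center_in_plausible_range(cluster_center, box2d_height, height_to_dist,
                                  margin=5.0):
        """Keep a cluster only if its range from the lidar is consistent with
        the range implied by the 2D box height."""
        expected = height_to_dist(box2d_height)                  # metres, from the fitted curve
        actual = float(np.linalg.norm(np.asarray(cluster_center)[:2]))
        return abs(actual - expected) <= margin

    # e.g. a reciprocal fit consistent with a pinhole camera model:
    pedestrian_dist = lambda h_px: 700.0 / h_px   # 700 is a placeholder constant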
Thirdly, after the cluster center and the length, width, and height of the 3D box have been determined, the rotation angle of the target is still needed to generate a better prediction box. As shown in fig. 8, the solid line is the actual target box and the broken line the clustered target box; for the Cyclist class, a wrong rotation angle greatly degrades the intersection-over-union.
To solve this problem, the original training data set records the rotation angle of the 2D detection frame of each class of target; when screening the clustering results, the rotation angle of each clustering result is determined from the angle, relative to the x axis, of the line connecting the horizontally farthest and nearest points of the cluster; if the difference between this angle and the rotation angle of the 2D detection frame of targets of the same class is smaller than a preset difference, the clustering result is retained, otherwise it is discarded.
For this third problem, the rotation angle of the prediction box is determined directly from the angle, relative to the x axis, of the line connecting the horizontally farthest and nearest points of the cluster. This angle is then compared with the rotation angle of the 2D detection frame: if the difference is too large, the target prediction is deemed inaccurate and is discarded.
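A sketch of the angle computation and consistency check (the π/6 tolerance is an illustrative stand-in for the preset difference):

    import numpy as np

    def cluster_yaw(cluster_pts):
        """Angle, relative to the x axis, of the line joining the horizontally
        farthest and nearest points of the cluster."""
        r = np.linalg.norm(cluster_pts[:, :2], axis=1)   # horizontal range of each point
        far, near = cluster_pts[np.argmax(r)], cluster_pts[np.argmin(r)]
        return float(np.arctan2(far[1] - near[1], far[0] - near[0]))

    def yaw_consistent(cluster_pts, ref_yaw, max_diff=np.pi / 6):
        """Retain the cluster only if its angle is close to the reference angle
        recorded for the class."""
        d = abs(cluster_yaw(cluster_pts) - ref_yaw) % (2 * np.pi)
        return min(d, 2 * np.pi - d) <= max_diff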
Referring to fig. 9, which shows the overall flow of the multi-modal fusion target detection method based on cluster optimization: a Yolov5 enhanced with the CBAM attention mechanism serves as the two-dimensional target detection network, the 2D detection results generate conical regions of interest, the point-cloud data in each region of interest is sent simultaneously to the three-dimensional target detection network and the clustering module, and the outputs of the two are then integrated to determine the final result. When the target output by the two-dimensional target detection network is a large target, the point cloud is not clustered and the result of the three-dimensional target detection network is used directly. If the target is a small target, a comprehensive judgment is made from the height of the 2D detection frame, the score output by the three-dimensional target detection network, and the clustering result, and the final detection frame is output.
The invention adopts a 3D target detection network architecture based on cluster optimization, which greatly improves the average detection precision for smaller targets such as pedestrians and cyclists.
In terms of performance, tests on the KITTI data set show that, compared with the classical VoxelNet network, the proposed algorithm improves under all three difficulty levels. Compared with F-PointNet, the AP (average precision) of the three categories increases by 1.22%, 5.23%, and 5.07% respectively at Moderate difficulty, and by 2.6%, 6.76%, and 8.48% at Hard difficulty; compared with PV-RCNN, the Cyclist class at Hard difficulty improves by 3.1%.
From the generalization perspective, the data enhancement algorithm provided by the method can be migrated to other data sets, and has a wide application value. The algorithm improves the generalization capability of the model and can help the model to adapt to different scenes and data distribution better. The model is exposed to a number of variations during training and is more likely to handle different data situations during testing.
In terms of efficiency, the adopted feature-level fusion of point cloud and image requires significantly less computation than point-level fusion. Feature-level fusion typically fuses high-level feature maps rather than performing a fusion operation at every point of the raw input data. This markedly reduces computational complexity and improves the training and inference efficiency of the model, especially for large-scale input data.
Referring to fig. 10, the cluster-optimization-based multi-modal fusion target detection system of the invention comprises a first acquisition module 1001, a second acquisition module 1002, a mapping module 1003, a third acquisition module 1004, and a fusion module 1005, wherein:
a first obtaining module 1001, configured to obtain a 2D image and a 3D point cloud that include an object to be detected;
a second obtaining module 1002, configured to input the 2D image into a two-dimensional target detection network with a CBAM attention module to obtain a 2D detection frame of each target in the 2D image;
the mapping module 1003 is configured to coordinate-map the 2D detection frame and the 3D point cloud of each target, and generate a cone-shaped region of interest of each target;
a third obtaining module 1004, configured to input the tapered region of interest of each target into a three-dimensional target detection network, and obtain a 3D detection frame of each target;
the fusion module 1005 is configured to obtain the final 3D detection frame of each target, based on the 2D detection frame and 3D detection frame of each target, through a point-cloud clustering method that fuses prior information.
It can be understood that the cluster-optimization-based multi-modal fusion target detection system provided by the invention corresponds to the cluster-optimization-based multi-modal fusion target detection method of the foregoing embodiments; for its relevant technical features, reference may be made to those of the method, which are not repeated here.
Referring to fig. 11, fig. 11 is a schematic diagram of a computer readable storage medium according to an embodiment of the invention. As shown in fig. 11, the present embodiment provides a computer-readable storage medium 1100 on which a computer program 1111 is stored, which when executed by a processor, implements the steps of a cluster-optimization-based multi-modal fusion target detection method.
According to the multi-modal fusion target detection method and system based on cluster optimization provided by the embodiments of the invention, the improved copy-paste module addresses the imbalance between large-target and small-target data: large targets are reduced in equal proportion and pasted into other images to increase the number of small targets, thereby increasing the contribution of small targets to the training loss. The point-cloud Euclidean clustering module optimizes and supplements the detection results output by the three-dimensional target detection network, greatly improving the detection precision of the whole network for small 3D targets, and the CBAM attention module added to the two-dimensional Yolov5 detection network makes the network focus more on the important regions of the image, improving Yolov5's detection precision for small targets so that the network generates better conical point-cloud regions of interest from the 2D detection results.
In the foregoing embodiments, the descriptions of the embodiments are focused on, and for those portions of one embodiment that are not described in detail, reference may be made to the related descriptions of other embodiments.
It will be appreciated by those skilled in the art that embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following claims be interpreted as including the preferred embodiments and all such alterations and modifications as fall within the scope of the invention.
It will be apparent to those skilled in the art that various modifications and variations can be made to the present invention without departing from the spirit or scope of the invention. Thus, it is intended that the present invention also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.

Claims (10)

1. A multi-modal fusion target detection method based on cluster optimization, characterized by comprising the following steps:
acquiring a 2D image and a 3D point cloud containing an object to be detected;
inputting the 2D image into a two-dimensional target detection network with a CBAM attention module, and acquiring a 2D detection frame of each target in the 2D image;
coordinate mapping is carried out on the 2D detection frame and the 3D point cloud of each target, and a conical region of interest of each target is generated;
inputting the conical region of interest of each target into a three-dimensional target detection network, and acquiring a 3D detection frame of each target;
based on the 2D detection frame and the 3D detection frame of each target, obtaining the final 3D detection frame of each target through a point-cloud clustering method that fuses prior information.
2. The multi-modal fusion target detection method of claim 1, wherein the two-dimensional target detection network is a Yolov5 network enhanced with a CBAM attention mechanism, and wherein training the two-dimensional target detection network comprises:
acquiring an original training set, wherein the original training data set comprises a plurality of 2D images, each 2D image comprises a large target and/or a small target, the large target is a target with a size exceeding a preset size, and the small target is a target with a size smaller than the preset size;
compressing the large targets in some of the 2D images in equal proportion, then pasting copies of them into 2D images that contain no targets of that class, to obtain a training data set with an expanded set of small targets;
training the Yolov5 network based on the expanded training data set to obtain a trained two-dimensional target detection network;
the Yolov5 network comprises a CBAM attention module, the CBAM attention module comprising a spatial attention sub-module and a channel attention sub-module that respectively enhance the spatial attention and channel attention of the Yolov5 network on the target attention area.
3. The multi-modal fusion target detection method according to claim 1, wherein obtaining the final 3D detection frame of each target based on its 2D and 3D detection frames through the point-cloud clustering method that fuses prior information comprises:
for a target whose 2D detection frame height is greater than a preset pixel height, directly taking the 3D detection frame of the target as its final 3D detection frame;
for a target with a 2D detection frame but no 3D detection frame, when the confidence of the 2D detection frame is greater than a first preset confidence, obtaining the final 3D detection frame from the 2D detection frame and the conical region of interest of the target through the point-cloud clustering method that fuses prior information;
and for a target with both a 2D detection frame and a 3D detection frame, when the confidence of the 3D detection frame is smaller than a second preset confidence, obtaining the final 3D detection frame from the 2D detection frame and the conical region of interest of the target through the point-cloud clustering method that fuses prior information.
4. The multi-modal fusion target detection method according to claim 3, wherein obtaining the final 3D detection frame of each target from its 2D detection frame and conical region of interest through the point-cloud clustering method that fuses prior information comprises:
clustering the point cloud of the conical region of interest of the target based on Euclidean distance to obtain at least one clustering result;
when a plurality of clustering results exist, the clustering result closest to the center of the 2D detection frame of the target is selected as the final 3D detection frame of the target.
5. The method according to claim 4, wherein when multiple clustering results are equally closest to the center of the target's 2D detection frame, the clustering result with the largest number of points is selected as the final 3D detection frame of the target.
6. The method of claim 4, further comprising repairing the position and size of each small object:
calculating the average size of each class of small targets according to the size of each small target output by the three-dimensional target detection network, wherein the average size comprises average length, width and height;
the size of the final 3D detection frame of the small object of each category is adjusted to the average size.
7. The multi-modal fusion target detection method of claim 4, wherein the raw training dataset records a correspondence of the height of the 2D detection frame of each class of targets to its distance relative to the lidar;
when screening the clustering results, the plausible position range of the center of the target's 3D detection frame is determined from the height of the target's 2D detection frame and the recorded correspondence between 2D detection frame height and distance from the lidar;
if the center of a clustering result lies within this range, the clustering result is retained; otherwise it is discarded.
8. The multi-modal fusion target detection method of claim 4, wherein the raw training dataset records rotational angles of 2D detection frames for each class of targets;
when screening the clustering results, the rotation angle of each clustering result is determined from the angle, relative to the x axis, of the line connecting the horizontally farthest and nearest points of the cluster; if the difference between this rotation angle and the rotation angle of the 2D detection frame of targets of the same class is smaller than a preset difference, the clustering result is retained, otherwise it is discarded.
9. A cluster-optimization-based multi-modal fusion target detection system, characterized by comprising:
the first acquisition module is used for acquiring a 2D image containing an object to be detected and a 3D point cloud;
the second acquisition module is used for inputting the 2D image into a two-dimensional target detection network with a CBAM attention module to obtain a 2D detection frame of each target in the 2D image;
the mapping module is used for carrying out coordinate mapping on the 2D detection frame and the 3D point cloud of each target to generate a conical region of interest of each target;
the third acquisition module is used for inputting the conical region of interest of each target into the three-dimensional target detection network to acquire a 3D detection frame of each target;
and the fusion module is used for obtaining the final 3D detection frame of each target, based on the 2D detection frame and 3D detection frame of each target, through a point-cloud clustering method that fuses prior information.
10. A computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the cluster-optimization-based multi-modal fusion target detection method of any of claims 1-8.

Legal Events

  • PB01: Publication
  • SE01: Entry into force of request for substantive examination