CN117746359A - Target detection method, target detection device, electronic equipment and readable storage medium

Publication number: CN117746359A
Application number: CN202311587759.XA
Authority: CN (China)
Original language: Chinese (zh)
Prior art keywords: feature, target, fusion, output, convolution
Legal status: Pending (assumed; not a legal conclusion)
Inventors: 韦沁言, 裴雨听, 熊伟, 徐娜
Current Assignee / Original Assignee: China Telecom Corp Ltd
Application filed by China Telecom Corp Ltd; priority to CN202311587759.XA

Classification: Image Analysis

Abstract

The embodiment of the invention provides a target detection method, a target detection device, electronic equipment and a readable storage medium. The method includes: performing feature extraction on target point cloud data to obtain target features; performing feature coding on the target features based on a two-dimensional encoder in a target detection model and determining a target output result, where the target output result comprises a first feature vector; acquiring a second feature vector corresponding to the first feature vector based on a semantic fusion model in the target detection model; performing multi-scale fusion processing on the target output result and the second feature vector based on a scale fusion model in the target detection model to obtain a target feature map; and determining target detection information corresponding to the target object based on a target detection frame in the target detection model and the target feature map. In this way, the comprehensiveness of the feature information expressed by the target feature map and the information expression capability of the target feature map are improved, and the detection precision of target detection is further improved.

Description

Target detection method, target detection device, electronic equipment and readable storage medium
Technical Field
The present invention relates to the field of target detection technologies, and in particular, to a target detection method, a target detection device, an electronic device, and a readable storage medium.
Background
With the acceleration of urbanization and the increase in the number of vehicles, target detection in traffic scenes has always been one of the important directions in the development of technologies such as automatic driving and assisted driving.
In the related art, a two-dimensional image is usually acquired by a vehicle and target detection is performed on the two-dimensional image to obtain information about a target. However, the spatial positioning error of targets detected from a two-dimensional visual image is large, so the detection precision of target detection methods based on two-dimensional images is low.
Disclosure of Invention
In order to overcome the problems in the related art, the invention provides a target detection method, a target detection device, an electronic device and a readable storage medium.
In a first aspect, the present invention provides a target detection method, the method comprising:
performing feature extraction on target point cloud data to obtain target features; the target point cloud data comprises a target object;
performing feature coding on the target features based on a two-dimensional encoder in a target detection model, and determining a target output result; the target output result comprises a first feature vector;
acquiring a second feature vector corresponding to the first feature vector based on a semantic fusion model in the target detection model; the second feature vector is used for representing multistage semantic features corresponding to the target point cloud data;
performing multi-scale fusion processing on the target output result and the second feature vector based on a scale fusion model in the target detection model to obtain a target feature map; the target feature map is used for representing multi-scale semantic features corresponding to the target point cloud data;
and determining target detection information corresponding to the target object based on a target detection frame in the target detection model and the target feature map.
Optionally, the feature extraction of the target point cloud data to obtain the target feature includes:
acquiring target point cloud data;
dividing the target point cloud data into a plurality of columnar voxels according to a target segmentation rule;
and encoding the plurality of columnar voxels to obtain the target feature.
Optionally, the two-dimensional encoder includes a first convolution network, a second convolution network, a third convolution network and a fourth convolution network, and the target output result further includes a first output feature, a second output feature and a third output feature; the performing feature coding on the target features based on the two-dimensional encoder in the target detection model and determining the target output result includes:
performing feature extraction on the target features based on the first convolution network to obtain the first output feature;
performing feature extraction on the first output feature based on the second convolution network to obtain the second output feature;
performing feature extraction on the second output feature based on the third convolution network to obtain the third output feature;
and carrying out feature extraction on the third output feature based on the fourth convolution network to obtain the first feature vector.
Optionally, the obtaining, based on a semantic fusion model in the target detection model, a second feature vector corresponding to the first feature vector includes:
inputting the first feature vector into the semantic fusion model; the semantic fusion model comprises a dense convolution network and an inverse convolution network;
performing convolution processing on the first feature vector based on the dense convolution network to obtain a first convolution result;
deconvolution processing is carried out on the first convolution result based on the deconvolution network, so that a second convolution result is obtained;
and performing fusion processing on the first feature vector and the second convolution result to obtain the second feature vector.
Optionally, the performing multi-scale fusion processing on the target output result and the second feature vector based on the scale fusion model in the target detection model to obtain a target feature map includes:
determining a first fusion feature according to the second output result and the first feature based on the scale fusion model;
determining a second fusion feature based on the first fusion feature, the second feature and the third output result;
the second feature is obtained by fusing the first feature vector after up-sampling processing and the third output result, and the first feature is obtained by up-sampling processing on the basis of the second feature;
determining a third fusion feature based on the second fusion feature and the second feature vector;
and fusing the first fusion feature, the second fusion feature and the third fusion feature to obtain the target feature map.
Optionally, the determining a first fusion feature according to the second output result and the first feature includes:
upsampling the first feature vector and fusing the first feature vector with the third output feature to obtain the second feature;
performing up-sampling processing on the second feature to obtain the first feature;
and fusing the first feature and the second output result according to different weights to obtain the first fusion feature.
Optionally, the determining a second fusion feature based on the first fusion feature, the second feature and the third output result includes:
and fusing the first fusion feature, the second feature and the third output result according to different weights to obtain the second fusion feature.
In a second aspect, the present invention provides a target detection apparatus, the apparatus comprising:
the first extraction module is used for performing feature extraction on target point cloud data to obtain target features; the target point cloud data comprises a target object;
the first coding module is used for carrying out feature coding on the target features based on a two-dimensional coder in the target detection model, and determining a target output result; the target output result comprises a first feature vector;
the first acquisition module is used for acquiring a second feature vector corresponding to the first feature vector based on a semantic fusion model in the target detection model; the second feature vector is used for representing multistage semantic features corresponding to the target point cloud data;
the first fusion module is used for carrying out multi-scale fusion processing on the target output result and the second feature vector based on a scale fusion model in the target detection model to obtain a target feature map; the target feature map is used for representing multi-scale semantic features corresponding to the target point cloud data;
And the first determining module is used for determining target detection information corresponding to the target object based on a target detection frame in the target detection model and the target feature map.
In a third aspect, the present invention provides an electronic device, comprising: a processor, a memory and a computer program stored on the memory and executable on the processor, wherein the processor implements the target detection method according to any one of the above first aspects when executing the program.
In a fourth aspect, the present invention provides a readable storage medium; when instructions in the storage medium are executed by a processor of an electronic device, the electronic device is enabled to perform the steps of the target detection method according to any one of the embodiments of the first aspect.
In the embodiment of the invention, the target features are obtained by performing feature extraction on the target point cloud data, so that the feature information corresponding to the target point cloud data can be initially obtained. Further, the target features are feature-coded by the two-dimensional encoder to determine the target output result, so that deep feature information can be extracted in a layered manner and the information expression capability of the first feature vector is improved. The second feature vector corresponding to the first feature vector is obtained based on the semantic fusion model, so that semantic information of different grades corresponding to the first feature vector can be better learned, enriching the information expression capability of the second feature vector. Further, the target output result and the second feature vector are subjected to multi-scale fusion processing based on the scale fusion model to obtain the target feature map; the scale fusion model gradually transfers the bottom-layer features to the high-level features to obtain features of different resolutions, so that the obtained target feature map can attend to details of objects and global context information at the same time, which improves the comprehensiveness of the feature information expressed by the target feature map and further improves its information expression capability. Therefore, since the information expression capability of the target feature map is stronger and the expressed features are more comprehensive, the target detection information corresponding to the target object determined based on the target detection frame and the target feature map is more accurate, further improving the detection precision of target detection.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions of the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of steps of a target detection method according to an embodiment of the present invention;
FIG. 2 is a schematic flow chart of target detection based on a target detection model according to an embodiment of the present invention;
FIG. 3 is a flowchart illustrating specific steps of a target detection method according to an embodiment of the present invention;
FIG. 4 is a block diagram of a target detection apparatus according to an embodiment of the present invention;
FIG. 5 is a block diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Fig. 1 is a flowchart of steps of a target detection method according to an embodiment of the present invention, where, as shown in fig. 1, the method may include:
Step 101, performing feature extraction on target point cloud data to obtain target features; the target point cloud data comprises a target object.
In the embodiment of the invention, in the running process of the vehicle, target point cloud data acquired by the vehicle are acquired, wherein the target point cloud data are data sets of points under a certain coordinate system, and the target point cloud data can be acquired by a laser radar deployed on the vehicle. For example, in the case that target detection is required to perform auxiliary judgment in the vehicle driving process, target point cloud data acquired by the laser radar of the vehicle at the current moment is acquired. Similarly, the vehicle can also continuously acquire target point cloud data through the laser radar, continuously acquire a plurality of target point cloud data, and perform target detection operation based on each target point cloud data. The target point cloud data can comprise target objects, the types of the target objects can comprise living bodies and non-living bodies, and the target objects can also comprise moving targets, fixed targets and the like. By way of example, the target object may be a pedestrian, an automobile, a bus, or the like.
And extracting the characteristics of the cloud data of the target point to obtain the target characteristics. The feature extraction method can be a column method or a voxel method, and the target feature can be a feature obtained by processing target point cloud data by adopting the column method or the voxel method. For example, the process of processing the target point cloud data using the pillar method may process the point cloud data using the PointNet network architecture to create a vertical pillar structure. Specifically, the encoder divides the point cloud data into vertically oriented columnar voxels, each voxel containing a certain number of points. PointNet is then used to encode each voxel in columns, producing a fixed length feature vector, i.e., the target feature.
Step 102, performing feature coding on the target features based on a two-dimensional encoder in the target detection model, and determining a target output result; the target output result includes a first feature vector.
In the embodiment of the invention, the target feature is used as the input of the target detection model, and the target feature is processed by the two-dimensional encoder to obtain the first feature vector output by the two-dimensional encoder. Based on the architecture of the two-dimensional encoder, a target output result may be determined. The target detection model can comprise a two-dimensional encoder, a semantic fusion model, a scale fusion model and a target detection framework. The two-dimensional encoder may refer to a two-dimensional sparse convolutional encoder, and specifically may include a multi-layer sparse convolutional network and a pooling layer, each layer of sparse convolutional network may include a sparse convolutional layer, a batch normalization layer (Batch Normalization Layer) and a ReLU activation function, the sparse convolutional layers in each layer of sparse convolutional network may include convolution kernels with different side lengths, which are respectively used for extracting feature information of different layers, a batch normalization layer and a ReLU activation function may be disposed after the sparse convolutional layer, which are used for accelerating the convergence speed of the model and improving the stability of the model, and a maximum pooling layer may be disposed after the last sparse convolutional network, which is used for further reducing the dimension of the information extracted by the sparse convolutional network, so as to reduce the calculation amount and strengthen the invariance of the image features, and increase the robustness in terms of image offset, rotation and the like. The target output result may include a result output by each layer of convolution network in the two-dimensional encoder, where the first feature vector is a result output by a last layer of sparse convolution network in the two-dimensional encoder, that is, the first feature vector is a final result obtained by processing the target feature by each convolution network in the two-dimensional encoder. For example, in the case where the number of sparse convolutional networks is 4, the target output result may include a first output feature of the first sparse convolutional network output, a second output feature of the second sparse convolutional network output, a third output feature of the third sparse convolutional network output, and a first feature vector of the fourth sparse convolutional network output.
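To make the structure described above concrete, the following is a minimal sketch of such a four-stage encoder in PyTorch. It is illustrative only: dense nn.Conv2d layers stand in for the sparse convolutions the embodiment describes, and the class names, channel counts, kernel sizes and strides are assumptions rather than values taken from the patent.

```python
# Illustrative four-stage 2D encoder; dense Conv2d is used as a stand-in for
# the sparse convolution described in the embodiment.
import torch
import torch.nn as nn

class ConvStage(nn.Module):
    """One stage: convolution + batch normalization + ReLU."""
    def __init__(self, c_in, c_out, kernel_size, stride):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(c_in, c_out, kernel_size, stride=stride,
                      padding=kernel_size // 2),
            nn.BatchNorm2d(c_out),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.block(x)

class Encoder2D(nn.Module):
    """Four stages that progressively downsample the pillar feature map,
    with a max-pooling layer after the last stage."""
    def __init__(self, c_in=64):
        super().__init__()
        self.stage1 = ConvStage(c_in, 64, kernel_size=7, stride=1)   # T1, 1x
        self.stage2 = ConvStage(64, 128, kernel_size=5, stride=2)    # T2, 2x
        self.stage3 = ConvStage(128, 256, kernel_size=3, stride=2)   # T3, 4x
        self.stage4 = ConvStage(256, 256, kernel_size=3, stride=2)   # T4, 8x
        self.pool = nn.MaxPool2d(kernel_size=3, stride=1, padding=1)

    def forward(self, x):
        t1 = self.stage1(x)
        t2 = self.stage2(t1)
        t3 = self.stage3(t2)
        t4 = self.stage4(t3)
        return t1, t2, t3, self.pool(t4)  # target output result
```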
Step 103, acquiring a second feature vector corresponding to the first feature vector based on a semantic fusion model in the target detection model; the second feature vector is used for representing multistage semantic features corresponding to the target point cloud data.
In the embodiment of the invention, after the first feature vector is obtained, in order to more fully fuse the feature information under multi-level semantics, the semantic information corresponding to the first feature vector can be further obtained through a semantic fusion model. The semantic fusion model may include a dense convolution network and an inverse convolution network, where the dense convolution network may include multiple layers of dense convolution layers, a batch normalization layer, and a ReLU activation function, and the dense convolution network is configured to obtain high-level semantics in the first feature vector local structure. Deconvolution networks, which are used to upsample and restore feature vectors of dense convolutional networks to the original input dimensions, may include deconvolution layers and ReLU functions to recover lost detail information, which helps preserve and recover spatial structure and fine-grained information of the input data (i.e., the first feature vector). Further, residual connection is carried out on the intermediate result obtained through the multilayer dense convolution network and the inverse convolution network processing and the first characteristic vector, so that a second characteristic vector is obtained. In the process of acquiring the second feature vector, the intermediate result retaining the high-level semantic features is fused with the first feature vector retaining the low-level semantic features, so that the second feature vector can represent the multi-level semantic features corresponding to the cloud data of the target point.
Step 104, performing multi-scale fusion processing on the target output result and the second feature vector based on a scale fusion model in the target detection model to obtain a target feature map; the target feature map is used for representing multi-scale semantic features corresponding to the target point cloud data.
In the embodiment of the invention, in order to more fully fuse the characteristic information under multiple scales, the target characteristic diagram can be obtained by performing multiple-scale fusion processing on the basis of the target output result corresponding to the two-dimensional encoder and the second characteristic vector through a scale fusion model. The target feature map is used for representing feature information of different scales corresponding to target point cloud data, and the target feature map can be a pseudo two-dimensional image.
The structure of the scale fusion model may be similar to a bi-directional feature pyramid network by progressively passing the underlying features from bottom to top to the higher-level features to provide features from different resolutions, enabling the network to focus on the details of the object and the global context at the same time. And then, the target output results corresponding to the two-dimensional encoder are transversely connected to transfer the upper-layer characteristics to the lower-layer characteristics, so that the characteristic representation capability is further enhanced, the information transfer and gradient flow in the training process are facilitated, and the convergence speed of the network is increased. The scale fusion model can comprise three layers of convolution networks, and the three layers of convolution networks are respectively used for fusion based on target output results corresponding to the two-dimensional encoder and second feature vectors corresponding to the semantic fusion model so as to combine semantic information of different scales of different layers and further improve the accuracy of feature fusion.
The scale fusion model obtains a target feature map based on a target output result and a second feature vector, wherein the target output result corresponds to output results of different convolution networks, the second feature vector corresponds to multi-level semantic features, the obtained target feature map is further fused on the basis, the expressed semantic information is more abundant, and the target feature map can be used for representing the multi-scale semantic features corresponding to the target point cloud data. Correspondingly, in the process of detecting the target in the vehicle running process, the characteristic information of the target object can be better expressed through the target characteristic diagram obtained by the target detection model, so that the target detection information obtained based on the target characteristic diagram can be more in line with the actual position, the actual type and the actual boundary of the target object.
And 105, determining target detection information corresponding to the target object based on a target detection frame in the target detection model and the target feature map.
In the embodiment of the present invention, the target detection frame may be a single-shot multibox detector, such as an SSD (Single Shot MultiBox Detector) model or a YOLO model. Classification and regression processing are performed on the target feature map using the target detection frame to perform box prediction and category prediction, thereby obtaining the target detection information corresponding to the target object. The target detection information may include the target type, the target boundary and the target position corresponding to the target object. The target type may include pedestrian, truck, bus, automobile and the like, and the target boundary may be embodied in the form of a detection bounding box used to characterize the boundary information corresponding to the target object. It will be appreciated that the process of identifying the target feature map based on the target detection frame may refer to the prior art, and embodiments of the present invention are not limited in this respect. After the target detection information corresponding to the target object is acquired, automatic driving and assisted driving judgments can be made based on the target detection information; illustratively, driving operations such as accelerating, braking and steering can be decided based on the target detection information.
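As an illustration of how a single-shot head consumes the target feature map, the sketch below applies a classification convolution and a box-regression convolution to the fused feature map. The anchor count, class count and 7-parameter box encoding are assumptions; the embodiment only specifies that a single-shot detector such as SSD or YOLO may be used.

```python
# Illustrative SSD-style detection head applied to the target feature map.
import torch
import torch.nn as nn

class DetectionHead(nn.Module):
    def __init__(self, c_in=256, num_anchors=2, num_classes=3, box_dim=7):
        super().__init__()
        # per-location class scores and 3D box parameters (x, y, z, w, l, h, yaw)
        self.cls_head = nn.Conv2d(c_in, num_anchors * num_classes, kernel_size=1)
        self.box_head = nn.Conv2d(c_in, num_anchors * box_dim, kernel_size=1)

    def forward(self, feature_map):
        return self.cls_head(feature_map), self.box_head(feature_map)

# Usage: scores and boxes are decoded and NMS-filtered to obtain the target
# type, boundary and position for each detected object.
head = DetectionHead()
scores, boxes = head(torch.randn(1, 256, 124, 124))
```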
In one possible implementation, in a case where the target point cloud data includes 3 target objects, the target detection information detected based on the above method may include the target type, target boundary and target position corresponding to each of the three target objects. Driving judgment is then performed based on the obtained target detection information. For example, the vehicle speed can be controlled based on the types, boundaries and positions of the target objects around the current vehicle; specifically, the speed can be reduced when a pedestrian exists within a preset distance around the current vehicle, and when there is a vehicle ahead of the current vehicle, the vehicle speed can be adjusted to maintain a certain distance from the preceding vehicle.
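The driving judgment described above can be illustrated with a toy rule; the Detection structure, distance thresholds and returned commands below are hypothetical and only show how the detected type, boundary and position might feed a speed decision.

```python
# Toy decision rule illustrating how detection results might feed an
# assisted-driving judgment; all names and thresholds are hypothetical.
from dataclasses import dataclass

@dataclass
class Detection:
    target_type: str      # e.g. "pedestrian", "car", "bus"
    distance_m: float     # distance from the ego vehicle
    ahead: bool           # whether the object is in the ego lane ahead

def speed_adjustment(detections, pedestrian_radius_m=15.0, follow_gap_m=30.0):
    """Return a coarse speed command based on surrounding targets."""
    for det in detections:
        if det.target_type == "pedestrian" and det.distance_m < pedestrian_radius_m:
            return "decelerate"          # pedestrian nearby: slow down
        if det.ahead and det.distance_m < follow_gap_m:
            return "hold_distance"       # vehicle ahead: keep a safe gap
    return "maintain_speed"
```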
In summary, in the embodiment of the present invention, the target feature is obtained by performing feature extraction on the target point cloud data, so that feature information corresponding to the target point cloud data can be initially obtained, further, feature encoding is performed on the target feature based on the two-dimensional encoder, the target output result is determined, deep feature information can be further extracted in a layered manner, and the information expression capability of the first feature vector is improved. The second feature vector corresponding to the first feature vector is obtained based on the semantic fusion model, so that semantic information of different grades corresponding to the first feature vector can be better learned, and the information expression capacity of the second feature vector is enriched. Further, based on the scale fusion module, the target output result and the second feature vector are subjected to multi-scale fusion processing to obtain a target feature map, and the bottom layer features are gradually transferred to the high-level features through the scale fusion module to obtain features from different resolutions, so that the obtained target feature map can pay attention to details of objects and global context information at the same time, the comprehensiveness of feature information expressed by the target feature map is improved, and the information expression capability of the target feature map is further improved. Therefore, as the information expression capability of the target feature map is stronger and the expressed features are more comprehensive, the target detection information corresponding to the target object determined based on the target detection frame and the target feature map can be more accurate, and the detection precision of target detection is further improved.
Furthermore, compared with a traditional dense convolution network, feature extraction based on the two-dimensional sparse convolution network has a lower computation amount and higher operation efficiency, so the overall efficiency and speed are not affected by adding the scale fusion module on the basis of the semantic fusion module, and the speed of target detection is ensured to a certain extent.
Optionally, step 101 may include the steps of:
and step 1011, acquiring target point cloud data.
In the embodiment of the invention, the target point cloud data can be acquired by a laser radar deployed in a vehicle.
Step 1012, dividing the target point cloud data into a plurality of columnar voxels according to a target segmentation rule.
In the embodiment of the invention, a target segmentation rule may be preset, where the target segmentation rule may define the size and shape of the "columns" used when segmenting the target point cloud data. The target segmentation rule may be determined based on the size and shape of the target objects in the target point cloud data; for example, larger "columns" may be used for larger target objects and smaller "columns" for smaller target objects. This dynamic adjustment of the column size and shape according to the shape and size of the target object is beneficial to improving the accuracy and robustness of detection.
The target point cloud data is divided into a plurality of columnar voxels according to the target segmentation rule, i.e., the points are segmented into individual "columns" along the x-axis and the y-axis, where each "column" represents a voxel of three-dimensional space. Since the points in the point cloud data are not guaranteed to be uniformly distributed among the "columns" but are sparsely distributed, assume that the maximum number of points in a "column" is N, the number of channels of each point is C, and there are P "columns" in the whole scene. The process of dividing the target point cloud data into a plurality of columnar voxels according to the target segmentation rule may then be as follows: first, the dimension of the grid (H×W) on the top-view plane is set; then each point in the column corresponding to each grid cell is represented by 9 dimensions (x, y, z, r, x_c, y_c, z_c, x_p, y_p), where x, y, z are the true position coordinates of the point, r is the reflectivity, x_c, y_c, z_c represent the deviation of the point from the column center, and x_p, y_p represent the deviation of the point from the grid-cell center. If a "column" contains more than N points, N points are sampled; if it contains fewer than N points, it is padded with 0. In this way a tensor of shape (D, N, P) is formed, where D = 9, N is the maximum number of points (a set value), and P = H×W.
Step 1013, encoding the plurality of columnar voxels to obtain the target feature.
In the embodiment of the invention, the plurality of columnar voxels are encoded to obtain the target feature corresponding to the target point cloud data. The target feature may be the point cloud information stored in the columns expressed in a two-dimensional feature form. Illustratively, after the plurality of columnar voxels ("columns") are obtained, a simplified PointNet network can be used to learn C channels from the D-dimensional point features, turning the tensor into (C, N, P); a max operation over the N dimension then gives (C, P), and since P = H×W, the result can be laid out on the H×W grid to obtain the target feature.
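The tensor flow described in steps 1012 and 1013 can be sketched as follows, assuming a PyTorch implementation; the grid size, channel count and the use of a single shared linear layer as the "simplified PointNet" are illustrative assumptions.

```python
# Sketch of the pillar encoding flow: decorated points of shape (D, N, P) pass
# through a shared linear layer, are max-pooled over the N points per pillar,
# and are scattered back onto an H x W grid. Dimensions are assumptions.
import torch
import torch.nn as nn

D, N, P, C = 9, 32, 12000, 64     # point dims, max points per pillar, pillars, channels
H, W = 200, 60                    # placeholder grid size; here P = H * W

pillars = torch.randn(D, N, P)               # (D, N, P) decorated points
pillar_xy = torch.randint(0, H * W, (P,))    # flat grid index of each pillar

pointnet = nn.Sequential(nn.Linear(D, C), nn.BatchNorm1d(C), nn.ReLU())

pts = pillars.permute(2, 1, 0).reshape(P * N, D)   # (P*N, D)
feat = pointnet(pts).reshape(P, N, C)              # (P, N, C) per-point features
feat = feat.max(dim=1).values                      # (P, C) max over the N points

canvas = torch.zeros(C, H * W)                     # scatter to the pseudo-image
canvas[:, pillar_xy] = feat.t()
target_feature = canvas.view(C, H, W)              # (C, H, W) target feature
```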
In the embodiment of the invention, the column method is adopted to process the target point cloud data, so that the point cloud distribution in the target point cloud data can be simplified, the target point cloud data is divided based on the target division rule, and the size and shape of the column can be adjusted to adapt to different types and specifications of target objects, so that the division precision is more attached to the target object, and the detection precision is improved to a certain extent.
Further, compared with directly using the positional relations of the three-dimensional point cloud in the target point cloud data, processing based on the column method can relatively reduce the amount of point cloud data and improve the calculation efficiency. Compared with the voxel method, which directly divides the point cloud data into a series of cube voxels of equal size and then processes and analyzes each voxel, the column method avoids the following problem: if the chosen uniform voxel size is too small, the point cloud data becomes sparse and feature extraction is difficult; if the chosen uniform voxel size is too large, point cloud information is lost and the target detection precision is reduced.
Optionally, the two-dimensional encoder includes a first convolution network, a second convolution network, a third convolution network, and a fourth convolution network, and the target output result further includes a first output feature, a second output feature, and a third output feature.
In the embodiment of the invention, the two-dimensional encoder may include four layers of sparse convolutional networks, which are a first convolutional network, a second convolutional network, a third convolutional network and a fourth convolutional network, respectively, and the first convolutional network, the second convolutional network, the third convolutional network and the fourth convolutional network are sequentially connected, and in order to further improve the convergence rate of the model, a batch of normalization layers and a ReLU activation function may be set after the convolutional layers in each layer of convolutional network. Correspondingly, the target output result may include output results of a four-layer sparse convolution network, which are a first output feature, a second output feature, a third output feature, and a first feature vector, respectively. The first convolution network, the second convolution network, the third convolution network and the fourth convolution network have the same structure and comprise a sparse convolution layer, a batch normalization layer and a ReLU activation function, but the side length of a convolution kernel in the sparse convolution layer is sequentially halved from large to small, so that the output characteristics (a first output characteristic, a second output characteristic, a third output characteristic and a first characteristic vector) of the sparse convolution layer are sequentially reduced correspondingly. It may be appreciated that the side lengths of the convolution kernels in the convolution layers in the first convolution network, the second convolution network, the third convolution network, and the fourth convolution network may be customized according to requirements, which the embodiments of the present invention do not limit.
Accordingly, step 102 may include the steps of:
and 1021, extracting the characteristics of the target characteristics based on the first convolution network to obtain first output characteristics.
In the embodiment of the invention, the target feature is extracted based on a first convolution layer, a first normalization layer and an activation function in the first convolution network, so as to obtain a first output feature output by the first convolution network.
Step 1022, performing feature extraction on the first output feature based on the second convolution network to obtain the second output feature.
In the embodiment of the invention, feature extraction is performed on the first output feature based on a second convolution layer, a second normalization layer and an activation function in the second convolution network, so as to obtain the second output feature output by the second convolution network.
Step 1023, performing feature extraction on the second output feature based on the third convolution network to obtain the third output feature.
In the embodiment of the invention, feature extraction is performed on the second output feature based on a third convolution layer, a third normalization layer and an activation function in the third convolution network, so as to obtain the third output feature output by the third convolution network.
Step 1024, performing feature extraction on the third output feature based on the fourth convolution network to obtain the first feature vector.
In the embodiment of the invention, feature extraction is performed on the third output feature based on a fourth convolution layer, a fourth normalization layer and an activation function in the fourth convolution network, so as to obtain a fourth output feature output by the fourth convolution network. The fourth output feature is then pooled based on a maximum pooling layer (Max Pooling Layer) to obtain the first feature vector.
For example, denote the first output feature of the first convolution network as T_1 (at scale 1×), the second output feature of the second convolution network as T_2, the third output feature of the third convolution network as T_3, and the output of the fourth convolution network as T_4. The corresponding calculation process is:

T_i = spconv(T_{i-1}), i = 1, 2, 3, 4

where T_i represents the output feature of the i-th convolution network; spconv(·) represents the custom sparse convolution operation, including a sparse convolution layer, a normalization layer and a ReLU activation function; T_0 denotes the target feature input to the encoder.

Correspondingly, the scale of the feature output by each layer of the custom convolution network (the first output feature, the second output feature, the third output feature and the fourth output feature) is:

scale(T_i) = 2^{i-1}×, i = 1, 2, 3, 4.
in the embodiment of the invention, the characteristic extraction is performed through the first convolution network, the second convolution network, the third convolution network and the fourth convolution network, and the target characteristic can be gradually downsampled, so that the characteristic information of different layers can be extracted, and the characteristics of higher levels can be gradually learned along with the increase of the layer number. In this way, the two-dimensional sparse convolutional neural network is gradually used from the first stage to the fourth stage to downsample sparse target features, hierarchical feature extraction is achieved, convolutional calculation amount is reduced to a certain extent, sparse convolutional operation is performed on different input scales through multi-scale processing capacity, input data with different sizes and shapes can be better processed, spatial features of the input data (target features) are learned, and detection performance of an algorithm is improved.
Optionally, step 103 may include the steps of:
step 1031, inputting the first feature vector into the semantic fusion model; the semantic fusion model comprises a dense convolution network and an inverse convolution network.
In the embodiment of the invention, the output result of the fourth convolution network, namely the output result of the two-dimensional encoder, is input into the semantic fusion model, and the semantic fusion model is used to further mine the high-level semantics in the first feature vector. The semantic fusion model may include a dense convolution network and a deconvolution network, where the dense convolution network may include 3 convolution layers, a batch normalization layer and a ReLU activation function. Illustratively, the composition of the 3 convolution layers may be: each convolution layer has 256 output channels, a convolution kernel size of 3 and a padding of 1. The deconvolution network may include a deconvolution layer, which may for example use a 2×2 convolution kernel, as well as a ReLU function.
Step 1032, performing convolution processing on the first feature vector based on the dense convolution network to obtain a first convolution result.
In the embodiment of the invention, after the first feature vector is input into the semantic fusion model, convolution processing is performed by the dense convolution network in the semantic fusion model to obtain a first convolution result. The first convolution result is obtained by convolving the first feature vector (a low-level spatial feature), thereby capturing the high-level semantics in its local structure. That is, the first convolution result can preserve the important information of the first feature vector while reducing the feature dimension.
Illustratively, the processing of the dense convolution network may be expressed as:

T_4' = BLK_dense(T_4) = ReLU(BN(BLK_conv(T_4)))

where BLK_dense(·) represents the dense convolution network comprising 3 convolution layers, a batch normalization layer and a ReLU activation function; ReLU(·) represents the ReLU activation function; BN(·) represents batch normalization; BLK_conv represents the stacked convolution layers of the dense convolution network; T_4 represents the first feature vector output by the fourth convolution network; T_4' represents the first convolution result.
Step 1033, performing deconvolution processing on the first convolution result based on the deconvolution network to obtain a second convolution result.
In the embodiment of the invention, deconvolution processing is performed on the first convolution result based on the deconvolution network to obtain the second convolution result. Specifically, the first convolution result may be up-sampled based on a deconvolution layer in the deconvolution network and a ReLU activation function to obtain a second convolution result. In this way, the reduced-dimension feature vector can be restored back to the original input dimension, thereby restoring the lost detail information, which helps preserve and restore the spatial structure and fine-grained information of the first feature vector. Illustratively, if the first convolution result is up-sampled twice based on the deconvolution layer and the ReLU activation function, the second convolution result is obtained with the same scale as the first feature vector.
By way of example, the calculation process corresponding to the deconvolution network may be as follows:

T_4'' = BLK_upsamp(T_4') = ReLU(Deconv(T_4'))

where ReLU(·) represents the ReLU activation function; Deconv(·) represents the deconvolution layer; BLK_upsamp(·) represents the deconvolution network; T_4' represents the first convolution result; T_4'' represents the second convolution result.
Step 1034, performing fusion processing on the first feature vector and the second convolution result to obtain the second feature vector.
In the embodiment of the invention, the first feature vector and the second convolution result are fused to obtain the second feature vector. Thus, the second feature vector fuses the high-level semantic information (the second convolution result) and the low-level semantic information (the first feature vector), improving its information expression capability. For example, the first feature vector may be fused with the second convolution result through a residual connection.
Illustratively, the second feature vector may be obtained as follows:

F = T_4 + T_4''

where T_4'' represents the second convolution result, T_4 represents the first feature vector, and F represents the second feature vector.
In the embodiment of the invention, the second convolution result is obtained through the dense convolution network and the deconvolution network in the semantic fusion model, further capturing deep semantic information. By fusing the feature that preserves the high-level semantics (the second convolution result) with the feature that preserves the low-level semantics (the first feature vector), more detail information can be introduced, thereby reducing the risk of overfitting of the network and improving model performance.
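A minimal sketch of the semantic fusion model read from steps 1031 to 1034 is given below, assuming PyTorch: three 256-channel convolution layers with batch normalization and ReLU, a 2×2 deconvolution for up-sampling, and a residual connection back to the first feature vector. The stride-2 first convolution (so that the deconvolution restores the original resolution) is an assumption, not a value stated in the embodiment.

```python
# Illustrative semantic fusion model: dense convolution block (BLK_dense),
# deconvolution up-sampling (BLK_upsamp), and a residual connection to T4.
import torch
import torch.nn as nn

class SemanticFusion(nn.Module):
    def __init__(self, c_in=256):
        super().__init__()
        layers = []
        for i in range(3):
            layers += [
                nn.Conv2d(c_in if i == 0 else 256, 256, kernel_size=3,
                          padding=1, stride=2 if i == 0 else 1),
                nn.BatchNorm2d(256),
                nn.ReLU(inplace=True),
            ]
        self.dense = nn.Sequential(*layers)                  # BLK_dense
        self.up = nn.Sequential(                             # BLK_upsamp
            nn.ConvTranspose2d(256, c_in, kernel_size=2, stride=2),
            nn.ReLU(inplace=True),
        )

    def forward(self, t4):
        t4_dense = self.dense(t4)   # first convolution result
        t4_up = self.up(t4_dense)   # second convolution result, back at T4's scale
        return t4 + t4_up           # second feature vector (residual fusion)
```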
Optionally, step 104 may include the steps of:
step 201, determining a first fusion feature according to the second output result and the first feature based on the scale fusion model.
In the embodiment of the invention, in order to more fully fuse the feature information at multiple scales, the target output result obtained by the two-dimensional encoder can be subjected to multi-scale fusion; by transferring upper-layer features to lower-layer features, information can be transferred and can flow within the model, improving the expression capability of the model at different scales. The scale fusion model may include three convolution layers, namely a fifth convolution layer, a sixth convolution layer and a seventh convolution layer, and the bottom-layer features are gradually transferred to the high-level features in a bottom-up manner through these three convolution layers. Based on the fifth convolution layer in the scale fusion model, the first fusion feature is obtained according to the second output result, namely the result output by the second convolution network in the two-dimensional encoder, and the first feature. The first feature is obtained by up-sampling the output result of the fourth convolution network in the two-dimensional encoder, fusing it with the output result of the third convolution network, and up-sampling the result again. That is, the first fusion feature is obtained by fusing the output results (the second output result, the third output result and the first feature vector) of the second, third and fourth convolution networks in the two-dimensional encoder: the first feature vector and the third output result are up-sampled to the size of the current layer and then fused with the feature of the current layer (the second output result), so that feature fusion is realized.
Step 202, determining a second fusion feature based on the first fusion feature, the second feature and the third output result; the second feature is obtained by fusing the up-sampled first feature vector with the third output result, and the first feature is obtained by performing up-sampling processing on the second feature.
In the embodiment of the invention, based on the sixth convolution layer in the scale fusion model, the second fusion feature is obtained according to the first fusion feature, the second feature and the third output result. The second feature is obtained by up-sampling the output result of the fourth convolution network in the two-dimensional encoder and then fusing it with the output result of the third convolution network in the two-dimensional encoder; it can be understood that the first feature can be obtained by up-sampling the second feature. Obtaining the second fusion feature is equivalent to up-sampling the first feature vector output by the fourth convolution network to the size of the current layer and, on the basis of the feature information expressed by the first fusion feature, combining the feature information of the second feature and of the third output result of the third convolution network, so the expression capability of the second fusion feature is stronger.
Step 203, determining a third fusion feature based on the second fusion feature and the second feature vector.
In the embodiment of the invention, based on a seventh convolution layer in the scale fusion model, a third fusion feature is determined according to the second fusion feature and the second feature vector. The second feature vector is the result output by the semantic fusion model, and can represent multi-level semantic information, so that the second feature vector and the second fusion feature are further fused, and the expression capability of the third fusion feature is enriched.
Step 204, fusing the first fusion feature, the second fusion feature and the third fusion feature to obtain the target feature map.
In the embodiment of the invention, after the first fusion feature, the second fusion feature and the third fusion feature are obtained, the fusion features output by the three convolution layers are fused and subjected to convolution processing again, so that the target feature map is obtained.
In the embodiment of the invention, the output results of all convolution layers in the two-dimensional encoder are transversely connected through the scale fusion model, so that information can be allowed to be transmitted between different levels, the expression capacity of the features is improved, and the recall rate and the accuracy of model detection are also improved to a certain extent.
FIG. 2 is a schematic flow chart of target detection based on the target detection model. With reference to FIG. 2, the target detection method in the embodiment of the invention is described as follows: original point cloud data (target point cloud data) is obtained and pillarized to obtain column features (target features); the target features are input into the two-dimensional encoder in the target detection model and processed by the first convolution network, the second convolution network, the third convolution network and the fourth convolution network to obtain the target output result (the first output result, the second output result, the third output result and the first feature vector). The first feature vector output by the fourth convolution network is processed by the semantic fusion model as follows: dense convolution processing is performed to obtain the first convolution result, the first convolution result is up-sampled to obtain the second convolution result, and the second convolution result is fused with the first feature vector to obtain the second feature vector. The second output result output by the second convolution network and the first feature are fused based on the fifth convolution layer in the scale fusion model to obtain the first fusion feature. The first fusion feature, the second feature and the third output result are fused based on the sixth convolution layer to obtain the second fusion feature. The second fusion feature and the second feature vector are fused based on the seventh convolution layer to obtain the third fusion feature. Finally, the first fusion feature, the second fusion feature and the third fusion feature are fused and convolved to obtain the target feature map.
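Tying the FIG. 2 flow together, the following sketch shows one possible forward pass. It reuses the illustrative modules sketched earlier and a hypothetical ScaleFusion module; it is a reading of the text rather than reference code.

```python
# End-to-end forward pass following the flow described for FIG. 2; all module
# names are the illustrative ones assumed in the earlier sketches.
def detect(raw_points, pillar_encoder, encoder2d, semantic_fusion,
           scale_fusion, detection_head):
    target_feature = pillar_encoder(raw_points)       # pillarized pseudo-image
    t1, t2, t3, t4 = encoder2d(target_feature)        # target output result
    f = semantic_fusion(t4)                           # second feature vector
    feature_map = scale_fusion(t2, t3, t4, f)         # target feature map
    return detection_head(feature_map)                # class scores + boxes
```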
Optionally, step 201 may include the steps of:
and 2011, upsampling the first feature vector and fusing the first feature vector with the third output feature to obtain the second feature.
In the embodiment of the invention, the first feature vector output by the fourth convolution network in the two-dimensional encoder can be up-sampled, and the first feature vector can be up-sampled to the same scale as the size of the third convolution network by way of example. And then fusing the up-sampled first feature vector with a third output feature output by a third convolution network in the two-dimensional encoder to obtain a second feature.
For example, the second feature may be obtained as follows:

M_3 = Conv(Swish((ε'_1 · T_3 + ε'_2 · Upsample(T_4)) / (ε'_1 + ε'_2 + μ)))

where M_3 represents the second feature; Conv(·) represents convolution; Swish(·) represents the swish activation function; ε'_1 and ε'_2 represent trainable parameters used to adaptively scale the features so as to adapt to targets of different scales, and their specific values may be the optimal parameters obtained during model training, which is not limited by the embodiment of the invention; Upsample(·) represents up-sampling; T_3 represents the third output feature of the third convolution network; T_4 represents the first feature vector output by the fourth convolution network; μ represents a hyperparameter with a fixed value, and its specific value may be the optimal parameter obtained during model training, which is not limited by the embodiment of the invention.
Step 2012, performing upsampling processing on the second feature to obtain the first feature.
In the embodiment of the invention, the second feature may be further subjected to upsampling processing to obtain the first feature. For example, the second feature may be upsampled to the same scale size as the fifth convolutional layer for subsequent feature fusion.
Step 2013, fusing the first feature and the second output result according to different weights to obtain the first fusion feature.
In the embodiment of the invention, in the process of feature fusion, fusion can be carried out in each layer of convolution layers in the scale fusion model based on different weight proportions, and the weights are used for guiding the fusion process of the features. Specifically, based on the weight, the self-adaptive fusion of the characteristics according to the characteristic contribution degrees of different levels can be realized, and the detection capability of the characteristics with different scales is improved. For example, a weight value may be allocated to the first feature and the second output result, and then fusion may be performed based on the first feature and the second output result according to the weight value, and a convolution process may be performed to obtain a first fusion feature.
Illustratively, the first fusion feature may be obtained as follows:

P_2 = Conv(Swish((ε_1 · T_2 + ε_2 · Upsample(M_3)) / (ε_1 + ε_2 + μ)))

where P_2 represents the first fusion feature; M_3 represents the second feature; Upsample(M_3) represents the first feature; Conv(·) represents convolution; Swish(·) represents the swish activation function; ε_1 and ε_2 represent trainable parameters used to adaptively scale the features so as to adapt to targets of different scales, and their specific values may be the optimal parameters obtained during model training, which is not limited by the embodiment of the invention; Upsample(·) represents up-sampling; T_2 represents the second output result of the second convolution network in the two-dimensional encoder; μ represents a hyperparameter with a fixed value, and its specific value may be the optimal parameter obtained during model training, which is not limited by the embodiment of the invention.
In the embodiment of the invention, the first characteristics and the second output results are weighted and fused according to different weights, and the importance of the characteristics can be adaptively adjusted through weight connection, so that the characteristics of different scales can be better fused, and the accuracy and the robustness of target detection are improved to a certain extent.
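The weighted fusion used in steps 2013, 2021 and 2031 can be sketched as a small reusable block, assuming the fast-normalized form suggested by the trainable weights and the fixed constant μ; the channel count and the non-negativity constraint on the weights are assumptions.

```python
# Illustrative weighted-fusion block: trainable weights adaptively scale each
# input, a small fixed constant mu stabilizes the normalization, and the result
# passes through a Swish activation and a convolution.
import torch
import torch.nn as nn

class WeightedFusion(nn.Module):
    def __init__(self, num_inputs, channels, mu=1e-4):
        super().__init__()
        self.weights = nn.Parameter(torch.ones(num_inputs))   # trainable weights
        self.mu = mu
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.act = nn.SiLU()                                   # Swish activation

    def forward(self, inputs):
        w = torch.relu(self.weights)                           # keep weights non-negative
        fused = sum(w[i] * x for i, x in enumerate(inputs)) / (w.sum() + self.mu)
        return self.conv(self.act(fused))

# e.g. first fusion feature: fuse the up-sampled second feature with T2
fuse_level2 = WeightedFusion(num_inputs=2, channels=128)
```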
Optionally, step 202 may include:
and step 2021, fusing the first fused feature, the second feature and the third output result according to different weights to obtain the second fused feature.
In the embodiment of the invention, the first fusion feature, the second feature and the third output result are fused according to different weights based on the sixth convolution layer to obtain the second fusion feature. Specifically, after the first fusion feature is subjected to convolution processing, the first fusion feature is fused with the second feature and the third output result according to different weights, and the second fusion feature is obtained through convolution processing.
Illustratively, the second fusion feature may be obtained as follows:

P_3 = Conv(Swish((ε'_1 · T_3 + ε'_2 · M_3 + ε'_3 · Conv(P_2)) / (ε'_1 + ε'_2 + ε'_3 + μ)))

where P_3 represents the second fusion feature; M_3 represents the second feature; T_3 represents the third output result; P_2 represents the first fusion feature; Conv(·) represents convolution; Swish(·) represents the swish activation function; ε'_1, ε'_2 and ε'_3 represent trainable parameters used to adaptively scale the features so as to adapt to targets of different scales, and their specific values may be the optimal parameters obtained during model training, which is not limited by the embodiment of the invention; μ represents a hyperparameter with a fixed value, and its specific value may be the optimal parameter obtained during model training, which is not limited by the embodiment of the invention.
In the embodiment of the invention, the first fusion feature, the second feature and the third output result are weighted and fused according to different weights. Through these weighted connections, the importance of each feature can be adjusted adaptively, so that features of different scales are fused more effectively, which improves the accuracy and robustness of target detection to a certain extent.
Optionally, step 203 may include the steps of:
Step 2031, fusing the second fusion feature and the second feature vector according to different weights to obtain the third fusion feature.
In the embodiment of the invention, the second fusion feature and the second feature vector are fused according to different weights based on the seventh convolution layer to obtain the third fusion feature. Specifically, the second fusion feature may first be subjected to convolution processing, the result is then fused with the second feature vector according to different weights, and the third fusion feature is obtained through a further convolution processing.
Illustratively, the third fusion feature may be obtained as follows:
F_fuse3 = Conv(Swish((ε″₁ · Conv(F_fuse2) + ε″₂ · V₂) / (ε″₁ + ε″₂ + μ)))

wherein F_fuse3 represents the third fusion feature; F_fuse2 represents the second fusion feature; V₂ represents the second feature vector output by the semantic fusion model; Conv(·) represents convolution; Swish(·) represents the Swish activation function; ε″₁ and ε″₂ represent trainable parameters used to adaptively scale the features so as to adapt to targets of different scales, and their specific values may be the optimal parameters obtained in the model training process, which is not limited in the embodiment of the invention; μ represents a hyperparameter taking a fixed value, and its specific value may be the optimal parameter obtained in the model training process, which is not limited in the embodiment of the invention.
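Illustratively, the three weighted-fusion steps described above may be chained as in the following non-authoritative PyTorch sketch. The weighted_fuse helper, the stride-2 convolutions used to align resolutions, the element-wise addition used to obtain the second feature, the single shared channel width and all tensor names are assumptions introduced only for illustration; the embodiment itself only fixes the weighted fusion of the listed inputs.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def weighted_fuse(inputs, eps, mu=1e-4):
    """Normalized weighted sum of equally sized feature maps, followed by Swish."""
    w = F.relu(eps)
    fused = sum(wi * x for wi, x in zip(w, inputs)) / (w.sum() + mu)
    return F.silu(fused)

class ScaleFusionSketch(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.eps1 = nn.Parameter(torch.ones(2))   # epsilon_1, epsilon_2
        self.eps2 = nn.Parameter(torch.ones(3))   # epsilon'_1, epsilon'_2, epsilon'_3
        self.eps3 = nn.Parameter(torch.ones(2))   # epsilon''_1, epsilon''_2
        self.out1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.out2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.out3 = nn.Conv2d(channels, channels, 3, padding=1)
        self.down1 = nn.Conv2d(channels, channels, 3, stride=2, padding=1)  # first fusion feature -> level of F2
        self.down2 = nn.Conv2d(channels, channels, 3, stride=2, padding=1)  # second fusion feature -> level of V2

    def forward(self, t2, t3, v1, v2):
        f2 = F.interpolate(v1, scale_factor=2) + t3        # second feature (fusion by addition assumed)
        p1 = F.interpolate(f2, scale_factor=2)             # first feature
        fuse1 = self.out1(weighted_fuse([p1, t2], self.eps1))
        fuse2 = self.out2(weighted_fuse([self.down1(fuse1), f2, t3], self.eps2))
        fuse3 = self.out3(weighted_fuse([self.down2(fuse2), v2], self.eps3))
        return fuse1, fuse2, fuse3   # subsequently fused into the target feature map
```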
In one possible implementation, the object detection model may be trained by:
step 301, acquiring a sample set to be trained; the sample set to be trained comprises a plurality of sample point cloud data.
The sample set to be trained may comprise a plurality of sample point cloud data, which may be point cloud data collected in advance in different driving scenes by a vehicle equipped with a lidar. A plurality of sample point cloud data are acquired for training and annotated. The sample point cloud data are used to train the classification, prediction and perception capabilities of the detection model to be trained; in general, the more sample point cloud data are available, the better the performance of the trained detection model.
By way of example, the sample set to be trained may come from the public KITTI 3D data set, a widely used computer vision data set for research on autonomous driving and environment perception. The data set includes a large amount of data from various sensors mounted on a car (e.g., lidar, stereo cameras, GPS, etc.) and is used to evaluate and test the performance of computer vision and autonomous driving algorithms in real environments. The KITTI data set covers various scenarios, such as city streets, rural roads and highways, as well as various tasks, such as object detection, semantic segmentation, 3D object detection and motion estimation. The target classes of the KITTI data set mainly consist of cars, vans, trucks, pedestrians (in various poses), cyclists, trams and the like. In addition, the KITTI data set contains other related data, such as camera calibration data, lidar calibration data and road maps. Road scenes are annotated manually, and a single scene may contain up to 15 cars and 30 pedestrians.
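Illustratively, the annotations of such a data set may be read as in the following sketch, which assumes the plain-text label format published with the KITTI development kit (one object per line; fields: type, truncation, occlusion, observation angle, 2D box, 3D dimensions, 3D location, yaw). It is only an illustration of how the labeling labels mentioned below could be represented.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class KittiObject:
    cls: str                 # e.g. 'Car', 'Pedestrian', 'Cyclist'
    truncated: float
    occluded: int
    alpha: float             # observation angle
    bbox: List[float]        # 2D box: left, top, right, bottom (pixels)
    dimensions: List[float]  # 3D size: height, width, length (metres)
    location: List[float]    # 3D position x, y, z in camera coordinates
    rotation_y: float        # yaw angle around the vertical axis

def load_kitti_labels(path: str) -> List[KittiObject]:
    """Parses one KITTI label file into a list of annotated objects."""
    objects = []
    with open(path) as f:
        for line in f:
            v = line.split()
            if not v or v[0] == 'DontCare':
                continue
            objects.append(KittiObject(
                cls=v[0], truncated=float(v[1]), occluded=int(v[2]),
                alpha=float(v[3]), bbox=[float(x) for x in v[4:8]],
                dimensions=[float(x) for x in v[8:11]],
                location=[float(x) for x in v[11:14]],
                rotation_y=float(v[14])))
    return objects
```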
Step 302, for any sample point cloud data in the plurality of sample point cloud data, taking the sample point cloud data as input of the detection model to be trained, and obtaining a prediction result output by the detection model to be trained; the prediction result is used for representing detection information corresponding to the predicted object contained in the sample point cloud data.
In the embodiment of the invention, for any sample point cloud data among the plurality of sample point cloud data, the sample point cloud data is input into the detection model to be trained, and a prediction result output by the detection model to be trained is obtained. The prediction result may include a category prediction result, a position prediction result and a boundary prediction result corresponding to the predicted object.
Step 303, adjusting parameters of the to-be-trained detection model based on the prediction result corresponding to each predicted object and the labeling label corresponding to each predicted object; the labeling label is used for representing real category information, real position information and real boundary information corresponding to the predicted object in the sample point cloud data.
In the embodiment of the invention, for any predicted object among the plurality of predicted objects, the labeling label corresponding to the predicted object is acquired; the labeling label is obtained by annotating the sample point cloud data in advance. Based on the prediction result corresponding to the predicted object and the labeling label corresponding to the predicted object, parameter adjustment is performed on the detection model to be trained. By continuously adjusting the parameters of the detection model to be trained, the similarity between the classification result output by the detection model to be trained and the labeling label corresponding to the predicted object is made greater than a preset similarity threshold. For example, optimization algorithms such as stochastic gradient descent (SGD) or batch gradient descent (BGD) may be used to adjust the parameters of the detection model to be trained.
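Illustratively, the parameter adjustment described above may follow a conventional optimization loop such as the following sketch. The loss function passed in as loss_fn is a placeholder for whichever combination of classification, localization and boundary losses is actually used, and the batch size, learning rate, momentum and thresholds are illustrative values rather than parameters fixed by the embodiment.

```python
import torch
from torch.utils.data import DataLoader

def train(model, dataset, loss_fn, epochs: int = 80, lr: float = 1e-3, loss_threshold: float = 0.05):
    loader = DataLoader(dataset, batch_size=4, shuffle=True)
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    for epoch in range(epochs):                          # preset round-number threshold
        for sample_points, labels in loader:
            predictions = model(sample_points)           # category / position / boundary predictions
            loss = loss_fn(predictions, labels)          # compare predictions with the labeling labels
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()                             # adjust the parameters of the detection model
        if loss.item() < loss_threshold:                 # loss-value stopping condition (see step 304)
            break
    return model
```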
Step 304, determining the detection model to be trained as the target detection model when the stopping condition is reached.
In the embodiment of the invention, the stopping condition may include, for example, the loss value of the detection model to be trained reaching a preset threshold, or the number of training rounds of the detection model to be trained reaching a preset round-number threshold.
In one possible implementation, the mean average precision (mAP) may also be used to evaluate the detection accuracy of the target detection model during its training. When calculating the mAP, a Precision-Recall curve is generally used to measure the precision and recall of the detection results output by the target detection model at different confidence thresholds, from which the average precision is calculated. Specifically, in the KITTI data set a detection result can be classified into three difficulty levels: easy, moderate and hard, where easy corresponds to large, clearly visible objects, moderate to objects of medium size and distance, and hard to small or distant objects. For each difficulty level, the area under the Precision-Recall curve is calculated to obtain the corresponding AP indicator, and the AP values of the three difficulty levels are averaged to obtain the 3D AP indicator. In this way, the detection accuracy of the target detection model can be measured to a certain extent, which helps to improve the detection accuracy of the target detection model.
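Illustratively, the AP indicator for one class and one difficulty level may be computed from the Precision-Recall curve as in the following sketch (all-point interpolation; the matching of detections to ground-truth boxes by overlap is assumed to have been performed beforehand, and the official KITTI evaluation uses a sampled variant of this integral):

```python
import numpy as np

def average_precision(scores, is_true_positive, num_ground_truth):
    """AP as the area under the precision-recall curve (all-point interpolation)."""
    order = np.argsort(-np.asarray(scores))              # sort detections by confidence
    tp = np.asarray(is_true_positive, dtype=float)[order]
    cum_tp = np.cumsum(tp)
    cum_fp = np.cumsum(1.0 - tp)
    recall = cum_tp / max(num_ground_truth, 1)
    precision = cum_tp / np.maximum(cum_tp + cum_fp, 1e-9)
    # make precision monotonically non-increasing, then integrate over recall
    for i in range(len(precision) - 2, -1, -1):
        precision[i] = max(precision[i], precision[i + 1])
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precision, recall):
        ap += p * (r - prev_recall)
        prev_recall = r
    return ap
```

The mAP is then obtained by averaging such AP values over the object classes, and the 3D AP indicator mentioned above by additionally averaging over the three difficulty levels.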
For example, fig. 3 provides a flowchart of specific steps of a target detection method, as shown in fig. 3, in a vehicle driving process, target point cloud data collected by a vehicle is obtained, and feature extraction is performed on the target point cloud data to obtain target features. And carrying out two-dimensional sparse convolution on the target characteristics based on the two-dimensional encoder to obtain a target output result. And acquiring a second feature vector corresponding to the first feature vector in the target output result based on the semantic fusion model. And based on the scale fusion model, carrying out multi-scale fusion processing on the second feature vector and the target output result to obtain a target feature map. And determining target detection information corresponding to the target object based on the target detection frame and the target feature map.
Fig. 4 is a schematic structural diagram of an object detection device according to an embodiment of the present invention, as shown in fig. 4, the device may specifically include:
the first extraction module 401 is configured to perform feature extraction on the target point cloud data to obtain a target feature; the target point cloud data comprises a target object;
a first encoding module 402, configured to perform feature encoding on the target feature based on a two-dimensional encoder in a target detection model, and determine a target output result; the target output result comprises a first feature vector;
A first obtaining module 403, configured to obtain a second feature vector corresponding to the first feature vector based on a semantic fusion model in the target detection model; the second feature vector is used for representing multistage semantic features corresponding to the target point cloud data;
the first fusion module 404 is configured to perform multi-scale fusion processing on the target output result and the second feature vector based on a scale fusion model in the target detection model, so as to obtain a target feature map; the target feature map is used for representing multi-scale semantic features corresponding to the target point cloud data;
the first determining module 405 is configured to determine target detection information corresponding to the target object based on a target detection framework in the target detection model and the target feature map.
The embodiment of the invention provides a target detection device. Feature extraction is performed on the target point cloud data to obtain the target feature, so that feature information corresponding to the target point cloud data can be obtained initially. Feature encoding is then performed on the target feature based on the two-dimensional encoder to determine the target output result, so that deeper feature information can be extracted layer by layer and the information expression capability of the first feature vector is improved. The second feature vector corresponding to the first feature vector is acquired based on the semantic fusion model, so that semantic information of different levels corresponding to the first feature vector can be learned better and the information expression capability of the second feature vector is enriched. Further, the target output result and the second feature vector are subjected to multi-scale fusion processing based on the scale fusion model to obtain the target feature map; the scale fusion model gradually transfers bottom-level features to high-level features and gathers features of different resolutions, so that the obtained target feature map can attend to object details and global context information at the same time, which improves the comprehensiveness of the feature information expressed by the target feature map and further improves its information expression capability. Since the target feature map has a stronger information expression capability and expresses more comprehensive features, the target detection information corresponding to the target object determined based on the target detection frame and the target feature map is more accurate, which further improves the detection precision of target detection.
Optionally, the first extraction module 401 includes:
the first acquisition sub-module is used for acquiring the target point cloud data;
the first dividing module is used for dividing the target point cloud data into a plurality of columnar voxels according to a target segmentation rule;
and the first coding submodule is used for coding the columnar voxels to obtain the target feature.
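Illustratively, the division of the target point cloud data into columnar voxels according to a target segmentation rule may be sketched as follows. The detection range, pillar size and per-pillar point cap are illustrative values rather than values fixed by the embodiment, and each resulting pillar would subsequently be encoded into the target feature.

```python
import numpy as np

def pointcloud_to_pillars(points: np.ndarray,
                          x_range=(0.0, 69.12), y_range=(-39.68, 39.68),
                          pillar_size=(0.16, 0.16), max_points_per_pillar=32):
    """Groups an (N, 4) point cloud (x, y, z, intensity) into columnar voxels (pillars)."""
    ix = ((points[:, 0] - x_range[0]) / pillar_size[0]).astype(np.int32)
    iy = ((points[:, 1] - y_range[0]) / pillar_size[1]).astype(np.int32)
    nx = int((x_range[1] - x_range[0]) / pillar_size[0])
    ny = int((y_range[1] - y_range[0]) / pillar_size[1])
    valid = (ix >= 0) & (ix < nx) & (iy >= 0) & (iy < ny)
    pillars = {}
    for p, i, j in zip(points[valid], ix[valid], iy[valid]):
        bucket = pillars.setdefault((int(i), int(j)), [])
        if len(bucket) < max_points_per_pillar:   # cap the number of points kept per pillar
            bucket.append(p)
    return pillars  # each pillar is then encoded into the target feature
```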
Optionally, the two-dimensional encoder includes a first convolution network, a second convolution network, a third convolution network, and a fourth convolution network, and the target output result further includes a first output feature, a second output feature, and a third output feature; the first encoding module 402 includes:
the first extraction submodule is used for extracting the characteristics of the target characteristics based on the first convolution network to obtain first output characteristics;
the second extraction submodule is used for extracting the characteristics of the first output characteristics based on the second convolution network to obtain the second output characteristics;
the third extraction submodule is used for extracting the characteristics of the second output characteristics based on the third convolution network to obtain the third output characteristics;
and the fourth extraction submodule is used for extracting the characteristics of the third output characteristics based on the fourth convolution network to obtain the first characteristic vector.
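Illustratively, the two-dimensional encoder formed by the first to fourth convolution networks may be sketched as follows. The channel widths, strides and the use of dense (rather than sparse) 2D convolutions are simplifications assumed for illustration; the embodiment only fixes that each network takes the previous network's output and that the intermediate outputs are retained as part of the target output result.

```python
import torch.nn as nn

def conv_block(in_ch: int, out_ch: int, stride: int) -> nn.Sequential:
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True))

class TwoDimensionalEncoder(nn.Module):
    def __init__(self, in_channels: int = 64):
        super().__init__()
        self.net1 = conv_block(in_channels, 64, stride=1)   # first convolution network
        self.net2 = conv_block(64, 128, stride=2)           # second convolution network
        self.net3 = conv_block(128, 256, stride=2)          # third convolution network
        self.net4 = conv_block(256, 256, stride=2)          # fourth convolution network

    def forward(self, target_feature):
        t1 = self.net1(target_feature)   # first output feature
        t2 = self.net2(t1)               # second output feature
        t3 = self.net3(t2)               # third output feature
        v1 = self.net4(t3)               # first feature vector
        return t1, t2, t3, v1
```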
Optionally, the first acquisition module 403 includes:
the first input module is used for inputting the first feature vector into the semantic fusion model; the semantic fusion model comprises a dense convolution network and an inverse convolution network;
the first processing module is used for carrying out convolution processing on the first feature vector based on the dense convolution network to obtain a first convolution result;
the second processing module is used for carrying out deconvolution processing on the first convolution result based on the deconvolution network to obtain a second convolution result;
and the first fusion submodule is used for carrying out fusion processing on the first characteristic vector and the second convolution result to obtain the second characteristic vector.
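Illustratively, the semantic fusion model built from a dense convolution network and a deconvolution network may be sketched as follows. The layer configuration, the stride-2 down-sampling inside the dense convolution network and the element-wise addition used for the final fusion with the input are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class SemanticFusion(nn.Module):
    def __init__(self, channels: int = 256):
        super().__init__()
        # dense convolution network: stacked 3x3 convolutions (stand-in configuration)
        self.dense = nn.Sequential(
            nn.Conv2d(channels, channels, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True))
        # deconvolution (transposed convolution) network restoring the resolution
        self.deconv = nn.ConvTranspose2d(channels, channels, 4, stride=2, padding=1)

    def forward(self, first_feature_vector: torch.Tensor) -> torch.Tensor:
        first_conv_result = self.dense(first_feature_vector)     # first convolution result
        second_conv_result = self.deconv(first_conv_result)      # second convolution result
        return first_feature_vector + second_conv_result         # second feature vector (fusion with input)
```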
Optionally, the first fusion module 404 includes:
the first determining submodule is used for determining a first fusion characteristic according to the second output result and the first characteristic based on the scale fusion model;
the second determining submodule is used for determining a second fusion feature based on the first fusion feature, the second feature and the third output result;
the second feature is obtained by fusing the first feature vector after up-sampling processing and the third output result, and the first feature is obtained by up-sampling processing on the basis of the second feature;
A third determination submodule configured to determine a third fusion feature based on the second fusion feature and the second feature vector;
and the second fusion sub-module is used for fusing the first fusion feature, the second fusion feature and the third fusion feature to obtain the target feature map.
Optionally, the first determining submodule includes:
the third fusion sub-module is used for up-sampling the first feature vector and carrying out fusion processing on the first feature vector and the third output feature to obtain the second feature;
the first processing sub-module is used for carrying out up-sampling processing on the second characteristic to obtain the first characteristic;
and the fourth fusion submodule is used for fusing the first characteristic and the second output result according to different weights to obtain the first fusion characteristic.
Optionally, the second determining submodule includes:
and a fifth fusion sub-module, configured to fuse the first fusion feature, the second feature and the third output result according to different weights, so as to obtain the second fusion feature.
Optionally, the third determining submodule includes:
and a sixth fusion sub-module, configured to fuse the second fusion feature and the second feature vector according to different weights, so as to obtain the third fusion feature.
Optionally, the apparatus may further include:
the second acquisition module is used for acquiring a sample set to be trained; the sample set to be trained comprises a plurality of sample point cloud data.
The third acquisition module is used for taking the sample point cloud data as the input of the detection model to be trained aiming at any sample point cloud data in the plurality of sample point cloud data to acquire a prediction result output by the detection model to be trained; and the prediction result is used for representing detection information corresponding to the predicted object contained in the sample point cloud data.
The first adjustment module is used for carrying out parameter adjustment on the detection model to be trained based on the prediction results corresponding to the prediction objects and the labeling labels corresponding to the prediction objects; the labeling label is used for representing real category information, real position information and real boundary information corresponding to the predicted object in the sample point cloud data.
And the second determining module is used for determining the detection model to be trained as the target detection model under the condition that the stop condition is reached.
The present invention also provides an electronic device, see fig. 5, comprising: a processor 501, a memory 502, and a computer program 5021 stored on the memory and executable on the processor; when the processor executes the program, the object detection method of the foregoing embodiments is implemented.
The present invention also provides a readable storage medium; when instructions in the storage medium are executed by a processor of an electronic device, the electronic device is enabled to perform the object detection method of the foregoing embodiments.
For the device embodiments, since they are substantially similar to the method embodiments, the description is relatively simple, and reference is made to the description of the method embodiments for relevant points.
The algorithms and displays presented herein are not inherently related to any particular computer, virtual system, or other apparatus. Various general-purpose systems may also be used with the teachings herein. The required structure for a construction of such a system is apparent from the description above. In addition, the present invention is not directed to any particular programming language. It will be appreciated that the teachings of the present invention described herein may be implemented in a variety of programming languages, and the above description of specific languages is provided for disclosure of enablement and best mode of the present invention.
In the description provided herein, numerous specific details are set forth. However, it is understood that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
Similarly, it should be appreciated that in the above description of exemplary embodiments of the invention, various features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects. However, the disclosed method should not be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this invention.
Those skilled in the art will appreciate that the modules in the apparatus of the embodiments may be adaptively changed and disposed in one or more apparatuses different from the embodiments. The modules or units or components of the embodiments may be combined into one module or unit or component and, furthermore, they may be divided into a plurality of sub-modules or sub-units or sub-components. Any combination of all features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or units of any method or apparatus so disclosed, may be used in combination, except insofar as at least some of such features and/or processes or units are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings), may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.
Various component embodiments of the invention may be implemented in hardware, or in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will appreciate that some or all of the functions of some or all of the components in a target detection device according to the present invention may be implemented in practice using a microprocessor or Digital Signal Processor (DSP). The present invention may also be implemented as an apparatus or device program for performing part or all of the methods described herein. Such a program embodying the present invention may be stored on a computer readable medium, or may have the form of one or more signals. Such signals may be downloaded from an internet website, provided on a carrier signal, or provided in any other form.
It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the unit claims enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, third, etc. do not denote any order. These words may be interpreted as names.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, and are not repeated herein.
It should be noted that all actions for obtaining signals, information or data in this application are performed in compliance with the corresponding data protection legislation policy of the country of location and obtaining the authorization granted by the owner of the corresponding device.
The foregoing description of the preferred embodiments of the invention is not intended to be limiting, but rather is intended to cover all modifications, equivalents, and alternatives falling within the spirit and principles of the invention.
The foregoing is merely illustrative of the present invention, and the present invention is not limited thereto, and any person skilled in the art will readily recognize that variations or substitutions are within the scope of the present invention. Therefore, the protection scope of the invention is subject to the protection scope of the claims.

Claims (10)

1. A method of target detection, the method comprising:
performing feature extraction on target point cloud data to obtain a target feature; the target point cloud data comprises a target object;
performing feature coding on the target features based on a two-dimensional encoder in a target detection model, and determining a target output result; the target output result comprises a first feature vector;
acquiring a second feature vector corresponding to the first feature vector based on a semantic fusion model in the target detection model; the second feature vector is used for representing multistage semantic features corresponding to the target point cloud data;
performing multi-scale fusion processing on the target output result and the second feature vector based on a scale fusion model in the target detection model to obtain a target feature map; the target feature map is used for representing multi-scale semantic features corresponding to the target point cloud data;
and determining target detection information corresponding to the target object based on a target detection frame in the target detection model and the target feature map.
2. The method according to claim 1, wherein the feature extraction of the target point cloud data to obtain the target feature includes:
acquiring the target point cloud data;
Dividing the target point cloud data into a plurality of columnar voxels according to a target segmentation rule;
and encoding the plurality of columnar voxels to obtain the target feature.
3. The method of claim 1, wherein the two-dimensional encoder comprises a first convolutional network, a second convolutional network, a third convolutional network, and a fourth convolutional network, and the target output result further comprises a first output feature, a second output feature, and a third output feature; the two-dimensional encoder in the target detection model is used for carrying out feature encoding on the target features, and determining a target output result comprises the following steps:
based on the first convolution network, extracting the characteristics of the target characteristics to obtain first output characteristics;
performing feature extraction on the first output feature based on the second convolution network to obtain the second output feature;
performing feature extraction on the second output feature based on the third convolution network to obtain the third output feature;
and carrying out feature extraction on the third output feature based on the fourth convolution network to obtain the first feature vector.
4. The method according to claim 1, wherein the obtaining a second feature vector corresponding to the first feature vector based on a semantic fusion model in the object detection model includes:
Inputting the first feature vector into the semantic fusion model; the semantic fusion model comprises a dense convolution network and an inverse convolution network;
performing convolution processing on the first feature vector based on the dense convolution network to obtain a first convolution result;
deconvolution processing is carried out on the first convolution result based on the deconvolution network, so that a second convolution result is obtained;
and carrying out fusion processing on the first characteristic vector and the second convolution result to obtain the second characteristic vector.
5. The method of claim 3, wherein the performing a multi-scale fusion process on the target output result and the second feature vector based on a scale fusion model in the target detection model to obtain a target feature map includes:
determining a first fusion feature according to the second output result and the first feature based on the scale fusion model;
determining a second fusion feature based on the first fusion feature, the second feature and the third output result;
the second feature is obtained by fusing the first feature vector after up-sampling processing and the third output result, and the first feature is obtained by up-sampling processing on the basis of the second feature;
Determining a third fusion feature based on the second fusion feature and the second feature vector;
and fusing the first fusion feature, the second fusion feature and the third fusion feature to obtain the target feature map.
6. The method of claim 5, wherein determining a first fusion feature based on the second output result and the first feature comprises:
upsampling the first feature vector and fusing the first feature vector with the third output feature to obtain the second feature;
performing up-sampling processing on the second feature to obtain the first feature;
And fusing the first characteristics and the second output results according to different weights to obtain the first fused characteristics.
7. The method of claim 6, wherein the determining a second fusion feature based on the first fusion feature, the second feature, and the third output result comprises:
and fusing the first fusion feature, the second feature and the third output result according to different weights to obtain the second fusion feature.
8. An object detection device, the device comprising:
The first extraction module is used for performing feature extraction on target point cloud data to obtain a target feature; the target point cloud data comprises a target object;
the first coding module is used for carrying out feature coding on the target features based on a two-dimensional coder in the target detection model, and determining a target output result; the target output result comprises a first feature vector;
the first acquisition module is used for acquiring a second feature vector corresponding to the first feature vector based on a semantic fusion model in the target detection model; the second feature vector is used for representing multistage semantic features corresponding to the target point cloud data;
the first fusion module is used for carrying out multi-scale fusion processing on the target output result and the second feature vector based on a scale fusion model in the target detection model to obtain a target feature map; the target feature map is used for representing multi-scale semantic features corresponding to the target point cloud data;
and the first determining module is used for determining target detection information corresponding to the target object based on a target detection frame in the target detection model and the target feature map.
9. An electronic device, comprising:
A processor, a memory and a computer program stored on the memory and executable on the processor, the processor implementing the object detection method according to any one of claims 1-7 when executing the program.
10. A readable storage medium, characterized in that instructions in the storage medium, when executed by a processor of an electronic device, enable the electronic device to perform the object detection method of one or more of claims 1-7.
CN202311587759.XA 2023-11-24 2023-11-24 Target detection method, target detection device, electronic equipment and readable storage medium Pending CN117746359A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311587759.XA CN117746359A (en) 2023-11-24 2023-11-24 Target detection method, target detection device, electronic equipment and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311587759.XA CN117746359A (en) 2023-11-24 2023-11-24 Target detection method, target detection device, electronic equipment and readable storage medium

Publications (1)

Publication Number Publication Date
CN117746359A true CN117746359A (en) 2024-03-22

Family

ID=90282127

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311587759.XA Pending CN117746359A (en) 2023-11-24 2023-11-24 Target detection method, target detection device, electronic equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN117746359A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118155202A (en) * 2024-05-09 2024-06-07 杭州像素元科技有限公司 Construction method and device of 3D target detection model based on LiDAR point cloud data

Similar Documents

Publication Publication Date Title
CN113128348B (en) Laser radar target detection method and system integrating semantic information
CN109919074B (en) Vehicle sensing method and device based on visual sensing technology
Mahaur et al. Road object detection: a comparative study of deep learning-based algorithms
US20230035475A1 (en) Methods and systems for semantic segmentation of a point cloud
CN111292366B (en) Visual driving ranging algorithm based on deep learning and edge calculation
CN111256693B (en) Pose change calculation method and vehicle-mounted terminal
CN113095152B (en) Regression-based lane line detection method and system
CN117746359A (en) Target detection method, target detection device, electronic equipment and readable storage medium
CN112651441B (en) Fine-grained non-motor vehicle feature detection method, storage medium and computer equipment
CN116783620A (en) Efficient three-dimensional object detection from point clouds
CN116824413A (en) Aerial image target detection method based on multi-scale cavity convolution
CN115205803A (en) Automatic driving environment sensing method, medium and vehicle
CN116665153A (en) Road scene segmentation method based on improved deep bv3+ network model
CN114118247A (en) Anchor-frame-free 3D target detection method based on multi-sensor fusion
CN117475428A (en) Three-dimensional target detection method, system and equipment
CN117115690A (en) Unmanned aerial vehicle traffic target detection method and system based on deep learning and shallow feature enhancement
CN109657556B (en) Method and system for classifying road and surrounding ground objects thereof
CN116797907A (en) Point cloud target detection method based on attention mechanism and multi-scale detection
CN116343194A (en) Double-flow 3D target detection learning method and system based on vertical characterization
CN115424225A (en) Three-dimensional real-time target detection method for automatic driving system
CN115497061A (en) Method and device for identifying road travelable area based on binocular vision
CN113887602A (en) Object detection and classification method and computer-readable storage medium
CN114881096A (en) Multi-label class balancing method and device
CN116863433B (en) Target detection method based on point cloud sampling and weighted fusion and related equipment
CN116541715B (en) Target detection method, training method of model, target detection system and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination