CN113903028A - Target detection method and electronic equipment - Google Patents

Target detection method and electronic equipment

Info

Publication number
CN113903028A
Authority
CN
China
Prior art keywords
dimensional
target
student
teacher
global
Prior art date
Legal status
Pending
Application number
CN202111045058.4A
Other languages
Chinese (zh)
Inventor
高戈
杜能
文凡
余星源
李明
常军
陈怡
Current Assignee
Wuhan University WHU
Original Assignee
Wuhan University WHU
Priority date
Filing date
Publication date
Application filed by Wuhan University WHU filed Critical Wuhan University WHU
Priority to CN202111045058.4A
Publication of CN113903028A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/047Probabilistic or stochastic networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses a target detection method and electronic equipment, relating to the technical field of computer vision. The method comprises the following steps: defining an anchor frame template; constructing and training a target detection neural network, and extracting anchor point parameters and target classes of targets in an input image by using the trained network; obtaining a two-dimensional bounding box, a three-dimensional bounding box and the center position coordinates of the three-dimensional bounding box; back-projecting the center position coordinates by inverting the encoding to obtain camera coordinates; obtaining the projected two-dimensional box produced by projecting the three-dimensional bounding box, computing an L1 loss between the projected two-dimensional box and the two-dimensional bounding box, and adjusting the observation angle until the adjustment step of the observation angle is smaller than a preset termination parameter, to obtain an adjusted three-dimensional bounding box; and outputting the target class, the two-dimensional bounding box, the camera coordinates and the adjusted three-dimensional bounding box of the target. The method effectively eliminates the interference of external noise and improves the accuracy of monocular-image three-dimensional target detection.

Description

Target detection method and electronic equipment
Technical Field
The application relates to the technical field of computer vision, in particular to a target detection method and electronic equipment.
Background
Currently, in practical application scenarios such as automatic driving, information such as a coordinate position, a three-dimensional size, and a deflection angle of a target is essential for a vehicle sensing system. Most three-dimensional target detection technologies are designed based on two-dimensional target detection technologies, and because the design concept of two-dimensional detection does not consider depth information, three-dimensional information of a target object cannot be directly obtained.
In the related art, the monocular image-based three-dimensional target detection technology can obtain the three-dimensional information of a target object by means of methods such as prior fusion information, geometric characteristic information, three-dimensional model matching or a depth estimation model.
However, solving for three-dimensional coordinates using geometric constraints relies on external state-of-the-art sub-networks, which introduces persistent noise and keeps network performance from breaking through the upper bound set by those sub-networks.
Disclosure of Invention
In view of one of the defects in the prior art, the present application aims to provide a target detection method and an electronic device, so as to solve the problem that three-dimensional target detection technologies in the related art excessively depend on an external network for additional depth information, which introduces persistent noise.
A first aspect of the present application provides a target detection method, which includes the steps of:
defining an anchor frame template comprising two-dimensional parameters [w, h]_2D, three-dimensional parameters [w, h, l, θ]_3D and a depth information parameter, where w, h and l respectively represent given values of the width, height and length of the target, and θ represents the observation angle of the camera to the target;
constructing a target detection neural network based on semantic information extraction and annular stripe segmentation depth perception, training, and extracting anchor point parameters and target types of targets in an input image by using the trained target detection neural network;
performing transformation calculation on the anchor point parameters and the corresponding anchor frame template to obtain a two-dimensional boundary frame, a three-dimensional boundary frame and a central position coordinate of the three-dimensional boundary frame;
back-projecting the center position coordinates by inverting the encoding to obtain camera coordinates;
obtaining the projected two-dimensional box produced by projecting the three-dimensional bounding box, computing an L1 loss between the projected two-dimensional box and the two-dimensional bounding box, and adjusting the observation angle until the adjustment step of the observation angle is smaller than a preset termination parameter, to obtain an adjusted three-dimensional bounding box;
and outputting the target category, the two-dimensional bounding box, the camera coordinates and the adjusted three-dimensional bounding box of the target.
In some embodiments, after the above defining the anchor frame template, the method further includes:
and specifying the position of the shared central pixel and coding the depth information parameter.
In some embodiments, the encoding the depth information parameter specifically includes:
projecting the 3D center position [x, y, z]_3D in the camera coordinate system into the image using a given projection matrix P, with the specified shared center pixel position denoted [x, y]_P, to obtain the projection coordinates [x, y, z]_P;
encoding the depth information parameter Z_P based on the projection coordinates to obtain
[x·z, y·z, z]_P^T = P · [x, y, z, 1]_3D^T
Where P is a given projection matrix.
In some embodiments, the extracting anchor point parameters and target classes of a target in an input image by using a trained target detection neural network specifically includes:
obtaining a semantic information feature map of an input image by using a trained target detection neural network, and respectively performing global feature extraction and local feature extraction based on annular stripe segmentation depth perception on the semantic information feature map;
weighting the output parameters extracted by the global features and the output parameters extracted by the local features to obtain the anchor point parameters;
and performing self-knowledge distillation optimization on the prediction categories extracted by the global features and the prediction categories extracted by the local features to obtain the target categories.
In some embodiments, the performing global feature extraction and local feature extraction based on annular band segmentation depth perception on the semantic information feature map respectively includes:
when global feature extraction is performed by conventional convolution, a global feature F_global is introduced in the conventional convolution process: 3 × 3 convolution layers with padding 1 are applied, followed by ReLU non-linear activation, generating 512 feature maps;
the global feature F_global is fed into 12 prediction networks, each a 1 × 1 convolution layer with padding 1, and the global output sequence O_global is extracted after convolution;
When local feature extraction is carried out through depth perception convolution, annular strip segmentation is introduced in the depth perception convolution process, the semantic information feature map is segmented into r different strips, and different convolution kernels act on the different strips;
introducing a local feature F_local; the local feature F_local is produced by depth-aware convolution based on annular stripe segmentation, followed by ReLU non-linear activation, generating 512 feature maps;
the local feature F_local is fed into 12 prediction networks, each consisting of r 1 × 1 convolution layers with padding 1 based on annular stripe segmentation, and the local output sequence O_local is extracted after convolution.
In some embodiments, the self-knowledge distillation optimization of the prediction category of global feature extraction and the prediction category of local feature extraction is performed to obtain the target category, which specifically includes:
inputting the class parameter c_ls of the global output sequence O_global and the class parameter c_loc of the local output sequence O_local into a distillation neural network, and outputting the global teacher class c_ls_teacher and student class c_ls_student, and the local teacher class c_loc_teacher and student class c_loc_student;
introducing weights α_teacher and α_student obtained by neural network learning, to obtain the optimized teacher class c_ls_teacher' and student class c_ls_student' as
c_ls_teacher' = c_ls_teacher × α_teacher + c_loc_teacher × (1 − α_teacher)
c_ls_student' = c_ls_student × α_student + c_loc_student × (1 − α_student);
and taking the student class c_ls_student' as the optimized target class.
In some embodiments, the projection formula for the projected two-dimensional box [x_min, y_min, x_max, y_max] obtained by projecting the three-dimensional bounding box is:
[Formulas defining Υ_0 (the half-length, half-width and half-height extents of the three-dimensional bounding box) and Υ_3D (their positions in 3D space) are rendered as images in the original publication.]
Υ_P = P · Υ_3D,  Υ_2D = Υ_P / Υ_P[φ_z]
x_min = min(Υ_2D[φ_x]), y_min = min(Υ_2D[φ_y])
x_max = max(Υ_2D[φ_x]), y_max = max(Υ_2D[φ_y])
where φ_x, φ_y and φ_z denote the indices of the [x, y, z] axes, [w, h, l, θ]'_3D is the three-dimensional bounding box, [x, y, z]'_P is the center position coordinate of the three-dimensional bounding box, and P is the given projection matrix.
In some embodiments, training the target detection neural network specifically includes:
constructing a multi-task network loss function of target detection to perform back propagation on the training sample to obtain a weight parameter of an optimized neural network;
the network loss function comprises a classification loss function, a distillation loss function, a 2D frame regression loss function and a 3D frame regression loss function.
In some embodiments, the classification loss function is:
L_class = -log( exp(c_τ) / Σ_{i=1..n_c} exp(c_i) )
the above 2D box regression loss function is:
L_2D = -log( IOU(b'_2D, b̂_2D) )
the 3D box regression loss function is:
L_3D = Σ_{b' ∈ {[x, y, z]'_P, [w, h, l, θ]'_3D}} SmoothL1(b' − b̂_3D)
SmoothL1(x) = 0.5·x² if |x| < 1, |x| − 0.5 otherwise
the above distillation loss function is:
L_distill = (c_ls_teacher' − c_ls_student')² + (prob_teacher − prob_student)²
the multitask network loss function is obtained as follows:
L = L_class + λ_1·L_2D + λ_2·L_3D + λ_3·L_distill
where c_τ is the predicted value of the τ-th (true) target class, c_i is the predicted value of the i-th target class, n_c is the total number of target classes, IOU is the intersection-over-union, b'_2D is the two-dimensional bounding box, b̂_2D is the ground-truth two-dimensional box, b'_3D is the three-dimensional bounding box, b̂_3D is the ground-truth three-dimensional box, SmoothL1 is the smoothed L1 function, prob_teacher is the teacher class probability, prob_student is the student class probability, λ_1 is the weight coefficient of the 2D box regression loss, λ_2 is the weight coefficient of the 3D box regression loss, and λ_3 is the weight coefficient of the distillation loss.
A second aspect of the present application provides an electronic device for object detection, comprising a processor and a memory, wherein the processor executes code in the memory to implement the method.
The beneficial effects brought by the technical solution provided by the present application include:
according to the target detection method and the electronic equipment, after the anchor frame template is defined, the target detection neural network based on semantic information extraction and annular strip segmentation depth perception is constructed and trained, so that anchor point parameters and target types of targets in an input image are extracted by using the trained target detection neural network, and the anchor point parameters and the corresponding anchor frame template are subjected to transformation calculation to obtain the central position coordinates of the two-dimensional boundary frame, the three-dimensional boundary frame and the three-dimensional boundary frame; then, the coordinates of the central position are subjected to back projection according to the reverse calculation of the codes to obtain camera coordinates; finally, theAcquiring a projected two-dimensional frame obtained after the three-dimensional bounding box is projected, and making the projected two-dimensional frame and the two-dimensional bounding box into an L shape1A loss function, which is used for adjusting the observation angle of view until the adjustment step length of the observation angle of view is smaller than a preset termination parameter, so as to obtain an adjusted three-dimensional bounding box, namely outputting the target category, the two-dimensional bounding box, the camera coordinate and the adjusted three-dimensional bounding box of the target as the detection result of the target; therefore, the semantic information of the image and the different target resolutions at different positions of the image are fully considered, the external noise interference is effectively eliminated, and the detection precision of the monocular image three-dimensional target is improved.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
FIG. 1 is a flow chart of a target detection method in an embodiment of the present application;
FIG. 2 is a diagram of a target detection neural network in an embodiment of the present application;
FIG. 3 is a DNet network architecture diagram in an embodiment of the present application;
fig. 4 is a schematic diagram of a depth-aware network divided by a ring stripe in an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. In addition, the technical features mentioned in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
The embodiment of the application provides a target detection method and electronic equipment, which can solve the problem that continuous noise is introduced due to the fact that a three-dimensional target detection technology in the related art excessively depends on an external network to obtain additional depth information.
As shown in fig. 1, the target detection method according to the embodiment of the present application includes the following steps:
s1, defining an anchor frame template comprising two-dimensional parameters [ w, h ]]2DThree-dimensional parameters [ w, h, l, theta ]]3DAnd depth information parameters to define anchor point templates in respective dimension spaces to realize simultaneous prediction of the two-dimensional bounding box and the three-dimensional bounding box.
Here w, h and l represent given values of the target's width, height and length, i.e., given values in the detection camera coordinate system, and θ represents the observation angle of the camera to the target. Unlike a 2D target, a 3D target has an orientation, so θ represents the viewing angle of the camera to the target, corresponding to a rotation about the Y-axis of the camera coordinate system.
S2, constructing and training a target detection neural network based on semantic information extraction and annular stripe segmentation depth perception, performing operation prediction on an input image by using the trained target detection neural network, and extracting anchor point parameters and target types of targets in the input image.
And S3, performing transformation calculation on the anchor point parameters and the corresponding anchor frame template to obtain a two-dimensional boundary frame, a three-dimensional boundary frame and a central position coordinate of the three-dimensional boundary frame.
And S4, back-projecting the center position coordinates by inverting the encoding to obtain camera coordinates.
S5, obtaining the projected two-dimensional box produced by projecting the three-dimensional bounding box, computing an L1 loss between the projected two-dimensional box and the two-dimensional bounding box, and adjusting the observation angle until the adjustment step of the observation angle is smaller than the preset termination parameter, thereby obtaining the adjusted three-dimensional bounding box.
And S6, outputting the target category, the two-dimensional boundary frame, the camera coordinate and the adjusted three-dimensional boundary frame of the target.
In the target detection method of this embodiment, after the anchor frame template is defined, a target detection neural network based on semantic information extraction and annular stripe segmentation depth perception is constructed and trained; the trained network is used to extract anchor point parameters and target classes of targets in an input image, and the anchor point parameters and the corresponding anchor frame template are transformed to obtain the two-dimensional bounding box, the three-dimensional bounding box and the center position coordinates of the three-dimensional bounding box. The center position coordinates are then back-projected by inverting the encoding to obtain camera coordinates. Finally, the projected two-dimensional box obtained by projecting the three-dimensional bounding box is acquired, an L1 loss is computed between the projected two-dimensional box and the two-dimensional bounding box, and the observation angle is adjusted until its adjustment step is smaller than the preset termination parameter, yielding the adjusted three-dimensional bounding box; the target class, the two-dimensional bounding box, the camera coordinates and the adjusted three-dimensional bounding box are then output as the detection result of the target. In this way, the semantic information of the image and the different target resolutions at different positions of the image are fully considered, external noise interference is effectively eliminated, and the detection accuracy of monocular-image three-dimensional targets is improved.
Further, in step S1, before defining the anchor frame template, the method further includes: the hyper-parameters H, W and A are predefined based on the size and resolution of the input image.
H and W respectively represent the height and width of an input image, and A represents the number of anchor frame templates to be generated by each anchor point; the number of anchor points in the input image is H × W.
In this embodiment, taking a 3-channel monocular image with a size of 572 × 572 as an example, H may be set to 388, W to 388 and A to 36; that is, 388 × 388 anchor points are selected and each anchor point generates 36 anchor frame templates of different sizes, so that 388 × 388 × 36 anchor frame templates can be obtained from one input image.
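For a sense of scale only, the following sketch (an illustration, not the patented implementation; the function make_anchor_grid and the NumPy layout are assumptions) enumerates the H × W anchor positions with A templates each, matching the 388 × 388 × 36 count above:

```python
import numpy as np

def make_anchor_grid(H=388, W=388, A=36):
    """Enumerate H*W anchor positions, each carrying A anchor frame templates.

    Returns an (H*W*A, 3) array of (row, col, template_index) triples,
    i.e. 388*388*36 = 5,419,584 anchor boxes for the example above.
    """
    rows, cols, tpl = np.meshgrid(np.arange(H), np.arange(W), np.arange(A), indexing="ij")
    return np.stack([rows.ravel(), cols.ravel(), tpl.ravel()], axis=1)

print(make_anchor_grid().shape)  # (5419584, 3)
```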
Preferably, in step S1, after the anchor frame template is defined, the method further includes the following step: specifying the position of the shared center pixel and encoding the depth information parameter.
In this embodiment, to place an anchor and define a complete two-dimensional or three-dimensional bounding box, a shared center pixel position [x, y]_P needs to be specified, where the 2D parameters are expressed in pixel coordinates.
Encoding the depth information parameter Z_P specifically comprises the following steps:
First, the 3D center position [x, y, z]_3D in the camera coordinate system is projected into the image using the given projection matrix P, with the specified shared center pixel position [x, y]_P, to obtain the projection coordinates [x, y, z]_P.
Then, the depth information parameter Z_P is encoded based on the projection coordinates to obtain
[x·z, y·z, z]_P^T = P · [x, y, z, 1]_3D^T
Where P is a given projection matrix.
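A minimal sketch of this projection-based encoding and its inverse (used later for back-projection in step S4) is given below. It assumes a standard 3 × 4 pinhole projection matrix P with an invertible left 3 × 3 block; since the patent's formula is rendered as an image, the code is an illustrative assumption rather than the patented computation:

```python
import numpy as np

def encode_center(center_3d, P):
    """Project the 3D center [x, y, z]_3D (camera coordinates) with the 3x4
    projection matrix P; the third component of the projection is the encoded
    depth Z_P."""
    proj = P @ np.append(np.asarray(center_3d, dtype=float), 1.0)     # homogeneous projection
    return np.array([proj[0] / proj[2], proj[1] / proj[2], proj[2]])  # [x, y, z]_P

def decode_center(center_p, P):
    """Invert the encoding: recover camera coordinates from [x, y, z]_P,
    assuming the left 3x3 block of P is invertible."""
    x_p, y_p, z_p = center_p
    uvw = np.array([x_p * z_p, y_p * z_p, z_p])
    return np.linalg.solve(P[:, :3], uvw - P[:, 3])  # [x, y, z]_3D
```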
In the present embodiment, θ_3D represents the viewing angle, i.e., a rotation about the Y-axis of the camera coordinate system, so θ_3D accounts for the relative orientation of the target with respect to the camera view. The anchor frame mechanism pre-computes, by mean statistics, the depth information parameter Z_P and the three-dimensional parameters [w, h, l, θ]_3D for each anchor point, which then serve as strong prior information to reduce the difficulty of estimating the three-dimensional parameters.
As shown in fig. 2, preferably, in the step S2, extracting anchor point parameters and object types of the object in the input image by using the trained object detection neural network specifically includes the following steps:
firstly, after a semantic information feature map of an input image is obtained by using a trained target detection neural network, global feature extraction and local feature extraction based on annular stripe segmentation depth perception are respectively carried out on the semantic information feature map, namely global convolution and local annular segmentation convolution.
And then, carrying out weighting processing on the output parameters of the global feature extraction and the output parameters of the local feature extraction to obtain the anchor point parameters, thereby realizing the combined feature extraction.
And finally, self-knowledge distillation optimization is performed on the prediction categories extracted by the global features and the prediction categories extracted by the local features to obtain the target category. In this embodiment, the neural network generates the parameters PA-GEN as the "student" and "teacher" parameters in self-knowledge distillation.
In this embodiment, the target detection neural network includes a DNet module, a combined feature extraction module and a self-knowledge distillation module. That is, when extracting semantic information features, DNet (a deeper convolutional neural network) is used as the basic semantic information extractor to obtain an H × W semantic information feature map; in other words, DNet serves as the backbone network for extracting the semantic information feature map from the input monocular image.
As shown in fig. 3, specifically, the DNet comprises coding blocks on the left of the figure, decoding blocks on the right, and intermediate blocks in the middle. An input 3-channel picture is first fed into coding block C1, which comprises two 3 × 3 convolution layers, each followed by a ReLU activation. The output of C1 is duplicated: one copy is down-sampled by a 2 × 2 max-pooling operation into coding block C2, which has a structure similar to C1, and the other copy enters intermediate block M1, which has only one 3 × 3 convolution layer.
Accordingly, the output of C2 is down-sampled into coding block C3 and copied into intermediate block M2; the output of C3 passes through intermediate block M3, and the output of M3 is center-cropped and fed into decoding block D3. The output of D3 is up-sampled by a 2 × 2 convolution layer and concatenated with the output of M2 cropped to the same size, then fed into decoding block D2. The output of D2 is up-sampled by a 2 × 2 convolution layer, concatenated with the output of M1 cropped to the same size, and fed into decoding block D1; the output of D1 passes through a 1 × 1 convolution layer to produce a pixel-level semantic information feature map of size H × W.
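Because the block wiring above is dense, a compressed PyTorch sketch of such an encoder-decoder layout follows. Channel counts, the crop-and-concatenate details and the resulting output size are assumptions for illustration only, not the patented DNet:

```python
import torch
import torch.nn as nn

def double_conv(c_in, c_out):
    # Two 3x3 (unpadded) convolutions, each followed by ReLU, as in blocks C1-C3 / D1-D3
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, 3), nn.ReLU(inplace=True),
        nn.Conv2d(c_out, c_out, 3), nn.ReLU(inplace=True),
    )

def crop_to(t, ref):
    # Center-crop tensor t to the spatial size of ref ("cut to the same size")
    dh = (t.shape[-2] - ref.shape[-2]) // 2
    dw = (t.shape[-1] - ref.shape[-1]) // 2
    return t[..., dh:dh + ref.shape[-2], dw:dw + ref.shape[-1]]

class DNetSketch(nn.Module):
    def __init__(self, ch=(64, 128, 256)):
        super().__init__()
        self.c1, self.c2, self.c3 = double_conv(3, ch[0]), double_conv(ch[0], ch[1]), double_conv(ch[1], ch[2])
        self.pool = nn.MaxPool2d(2)                                   # 2x2 max-pool down-sampling
        self.m1, self.m2, self.m3 = (nn.Conv2d(c, c, 3) for c in ch)  # intermediate blocks: one 3x3 conv each
        self.d3 = double_conv(ch[2], ch[2])                           # decoding block D3
        self.up3 = nn.ConvTranspose2d(ch[2], ch[1], 2, stride=2)      # 2x2 up-sampling before D2
        self.d2 = double_conv(2 * ch[1], ch[1])
        self.up2 = nn.ConvTranspose2d(ch[1], ch[0], 2, stride=2)      # 2x2 up-sampling before D1
        self.d1 = double_conv(2 * ch[0], ch[0])
        self.head = nn.Conv2d(ch[0], ch[0], 1)                        # 1x1 conv -> pixel-level semantic feature map

    def forward(self, x):
        e1 = self.c1(x)                       # coding block C1
        e2 = self.c2(self.pool(e1))           # coding block C2
        e3 = self.c3(self.pool(e2))           # coding block C3
        d3 = self.d3(self.m3(e3))             # M3 -> D3
        u3 = self.up3(d3)
        d2 = self.d2(torch.cat([crop_to(self.m2(e2), u3), u3], dim=1))  # M2 skip, cropped and concatenated
        u2 = self.up2(d2)
        d1 = self.d1(torch.cat([crop_to(self.m1(e1), u2), u2], dim=1))  # M1 skip, cropped and concatenated
        return self.head(d1)

feat = DNetSketch()(torch.randn(1, 3, 572, 572))  # semantic information feature map
```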
Further, when global feature extraction and local feature extraction based on annular stripe segmentation depth perception are respectively carried out on the semantic information feature maps, the obtained semantic information feature maps are respectively sent into two branches, one is global feature extraction, and the other is local feature extraction, and the method specifically comprises the following steps:
A1. When global feature extraction is performed by conventional convolution, a global feature F_global is introduced in the conventional convolution process: 3 × 3 convolution layers with padding 1 are applied, followed by ReLU (Rectified Linear Unit) non-linear activation, generating 512 feature maps. Global feature extraction uses conventional convolution, whose kernels act over the whole space, i.e., global convolution.
A2. The global feature F_global is fed into 12 prediction networks, each a 1 × 1 convolution layer with padding 1, and the global output sequence O_global is extracted after convolution.
The global output sequence O_global comprises: c, [t_x, t_y, t_w, t_h]_2D, [t_x, t_y, t_z]_P and [t_w, t_h, t_l, t_θ]_3D, for a total of 12 outputs, where c denotes the class, [t_x, t_y, t_w, t_h]_2D denotes the 2D anchor-box-related parameters predicted by global feature extraction, [t_x, t_y, t_z]_P denotes the projected center, and [t_w, t_h, t_l, t_θ]_3D denotes the size and orientation.
A3. When local feature extraction is performed by depth-aware convolution, annular stripe segmentation is introduced in the depth-aware convolution process: the semantic information feature map is segmented into r different stripes, and different convolution kernels act on different stripes. Local feature extraction thus uses depth-aware convolution, i.e., local convolution.
Optionally, r is preset according to actual conditions such as the camera resolution and the position where the camera is mounted on the autonomous vehicle.
A4. A local feature F_local is introduced: depth-aware convolution based on annular stripe segmentation is applied, followed by ReLU non-linear activation, generating 512 feature maps.
A5. The local feature F_local is fed into 12 prediction networks, each structured as r 1 × 1 convolution layers with padding 1 based on annular stripe segmentation; after convolution, a local output sequence O_local, analogous to the global output sequence O_global, is extracted. Local features of different image regions are thus processed separately, i.e., using annular stripe segmentation.
As shown in fig. 4, for a semantic information feature map of size H × W, feature extraction is performed using r different convolution layers, yielding r different depth feature maps. Each depth feature map is cut with a different annular stripe segmentation, and the r cut stripe regions are then stitched into a new depth feature map of size H × W.
Preferably, for convenience of operation, this can be realized by directly overlaying the inner stripes onto the outer ones, with the following specific steps:
Let i ∈ (0, r) be the stripe index, where i = 0 corresponds to the innermost stripe and i = r to the outermost stripe, and let E_i denote the feature map extracted by the convolution kernel of index i, whose stripe ring_i has an outer edge of size (i × h/r) × (i × w/r).
Then, the feature region of E_0 is cut along the outer edge of its stripe and overlaid onto E_1; in the same way, each E_i is cut along the outer edge of ring_i and overlaid onto E_{i+1}; finally, E_r contains all the depth feature stripes.
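A small sketch of this inner-over-outer covering operation, assuming the r depth feature maps are tensors of shape (N, C, H, W), is:

```python
import torch

def ring_merge(depth_maps):
    """Merge r depth feature maps by the covering trick described above: the
    central region of E_0 is overlaid onto E_1, the result onto E_2, and so on,
    so the final map keeps stripe i from the i-th convolution's output."""
    r = len(depth_maps)
    _, _, H, W = depth_maps[0].shape
    merged = depth_maps[0]
    for i in range(1, r):
        out = depth_maps[i].clone()
        h_i, w_i = int(i * H / r), int(i * W / r)      # outer edge of the i-th ring, centered
        top, left = (H - h_i) // 2, (W - w_i) // 2
        out[..., top:top + h_i, left:left + w_i] = merged[..., top:top + h_i, left:left + w_i]
        merged = out
    return merged

maps = [torch.randn(1, 512, 96, 96) for _ in range(4)]  # r = 4 depth feature maps
print(ring_merge(maps).shape)                           # torch.Size([1, 512, 96, 96])
```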
In this embodiment, the weighting processing is performed on the output parameter extracted by the global feature and the output parameter extracted by the local feature to obtain the anchor point parameter, and the method specifically includes the following steps:
A weight α learned by the neural network is introduced, exploiting the spatial-invariance advantage of the convolutional neural network, and is applied to the 2nd to 12th outputs; the specific output function is:
O_k = O_global^k · α_k + O_local^k · (1 − α_k)
where O_global^k is the k-th output of the global output sequence O_global, O_local^k is the k-th output of the local output sequence O_local, and α_k is the weight for the k-th output, with k ∈ (2, 12).
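A minimal sketch of this weighted fusion follows; the sigmoid used to keep each learned weight in (0, 1) is an assumption not stated in the text:

```python
import torch

def fuse_outputs(o_global, o_local, alpha):
    """O_k = O_global_k * alpha_k + O_local_k * (1 - alpha_k), with alpha_k learned."""
    alpha = torch.sigmoid(alpha)  # assumed squashing of the learned weight into (0, 1)
    return o_global * alpha + o_local * (1.0 - alpha)

# Fuse the 2nd to 12th outputs; the 1st (class) output is handled by self-distillation instead
o_g = [torch.randn(1, 36, 96, 96) for _ in range(12)]
o_l = [torch.randn(1, 36, 96, 96) for _ in range(12)]
alphas = torch.nn.Parameter(torch.zeros(12))            # one learnable weight per output head
fused = [fuse_outputs(o_g[k], o_l[k], alphas[k]) for k in range(1, 12)]
```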
Further, the self-knowledge distillation optimization is carried out on the prediction category extracted by the global features and the prediction category extracted by the local features to obtain the target category, and the method specifically comprises the following steps:
firstly, the global output sequences O are respectivelyglobalClass parameter c _ ls and local output sequence O in (1)localThe class parameter c _ loc in the teacher class is input into a distillation neural network, and the teacher class c _ ls of the global output is outputteacherAnd student class c _ lsstudenAnd teacher class c _ loc of local outputteacherAnd student class c _ locstudent
At this point, the teacher class and the student class can be processed by the softmax function to generate the teacher class probability prob_teacher and the student class probability prob_student.
Then, weights α_teacher and α_student learned by the distillation neural network are introduced to obtain the optimized teacher class c_ls_teacher' and student class c_ls_student' as
c_ls_teacher' = c_ls_teacher × α_teacher + c_loc_teacher × (1 − α_teacher)
c_ls_student' = c_ls_student × α_student + c_loc_student × (1 − α_student);
Finally, the student class c_ls_student' is taken as the optimized target class c'.
Specifically, the class parameter c_ls of the global output is fed into a 1 × 1 convolution layer to generate the global teacher class c_ls_teacher and student class c_ls_student, and the class parameter c_loc of the local output is fed into a 1 × 1 convolution layer to generate the local teacher class c_loc_teacher and student class c_loc_student.
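A PyTorch sketch of this teacher/student generation and fusion is given below; the channel counts and the way each 1 × 1 convolution is split into teacher and student maps are assumptions for illustration:

```python
import torch
import torch.nn as nn

class SelfDistillHead(nn.Module):
    """Two 1x1 convolutions produce (teacher, student) class maps from the global
    and local class features, mixed with learned weights alpha_teacher / alpha_student."""

    def __init__(self, c_in, n_classes):
        super().__init__()
        self.global_head = nn.Conv2d(c_in, 2 * n_classes, 1)   # -> c_ls_teacher, c_ls_student
        self.local_head = nn.Conv2d(c_in, 2 * n_classes, 1)    # -> c_loc_teacher, c_loc_student
        self.alpha_teacher = nn.Parameter(torch.tensor(0.5))
        self.alpha_student = nn.Parameter(torch.tensor(0.5))
        self.n = n_classes

    def forward(self, c_ls_feat, c_loc_feat):
        ls_t, ls_s = self.global_head(c_ls_feat).split(self.n, dim=1)
        loc_t, loc_s = self.local_head(c_loc_feat).split(self.n, dim=1)
        teacher = ls_t * self.alpha_teacher + loc_t * (1 - self.alpha_teacher)  # c_ls_teacher'
        student = ls_s * self.alpha_student + loc_s * (1 - self.alpha_student)  # c_ls_student'
        prob_t, prob_s = torch.softmax(teacher, dim=1), torch.softmax(student, dim=1)
        return teacher, student, prob_t, prob_s  # the student branch gives the optimized class c'

head = SelfDistillHead(c_in=512, n_classes=4)
outs = head(torch.randn(1, 512, 96, 96), torch.randn(1, 512, 96, 96))
```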
Preferably, the anchor point parameters and the corresponding anchor frame template are transformed to obtain the two-dimensional bounding box, the three-dimensional bounding box and the center position coordinates of the three-dimensional bounding box; that is, the relative coordinates [t_x, t_y, t_w, t_h]'_2D, [t_x, t_y, t_z]'_P and [t_w, t_h, t_l, t_θ]'_3D among the anchor-related parameters are transformed into the absolute coordinates [x, y, w, h]'_2D, [x, y, z]'_P and [w, h, l, θ]'_3D of the detected target's bounding boxes.
In this embodiment, the step S3 specifically includes the following steps:
first, use [ w, h]2DTwo-dimensional transformation is carried out to obtain a two-dimensional boundary frame [ x, y, w, h ]]'2DIs recorded as b'2D
[Two-dimensional transformation formulas, rendered as images in the original publication: the predicted offsets [t_x, t_y, t_w, t_h]_2D are applied to the shared center pixel [x, y]_P and the template size [w, h]_2D to obtain [x, y, w, h]'_2D.]
In this embodiment, the center pixel position [x, y]_P of each box is the spatial center position of that box. Similarly to the two-dimensional transformation, a three-dimensional transformation is performed using [w, h]_2D, Z_P and [w, h, l, θ]_3D to obtain the three-dimensional bounding box [w, h, l, θ]'_3D and its center position coordinates [x, y, z]'_P, denoted b'_3D:
[Three-dimensional transformation formulas, rendered as images in the original publication: the predicted offsets [t_x, t_y, t_z]_P and [t_w, t_h, t_l, t_θ]_3D are applied to [x, y]_P, [w, h]_2D, Z_P and [w, h, l, θ]_3D to obtain [x, y, z]'_P and [w, h, l, θ]'_3D.]
Subsequently, the three-dimensional center is estimated from its projection: the center position coordinates are back-projected from image space by inverting the encoding, giving the camera coordinates [x, y, z]'_3D.
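The decoding of steps S3 and S4 can be sketched as follows. Since the patent's transformation equations are rendered as images, the standard anchor-offset decoding used here (offsets scaled by the anchor's 2D size, exponential size scaling, additive depth and angle offsets) is an assumption, not the patented formula:

```python
import numpy as np

def decode_anchor(anchor, t2d, tp, t3d, P):
    """Decode one anchor into absolute boxes (illustrative sketch).

    anchor: dict with x_p, y_p (shared center pixel), w2d, h2d, z_p, w3d, h3d, l3d, theta
    t2d = [tx, ty, tw, th]_2D, tp = [tx, ty, tz]_P, t3d = [tw, th, tl, ttheta]_3D
    """
    x = anchor["x_p"] + t2d[0] * anchor["w2d"]
    y = anchor["y_p"] + t2d[1] * anchor["h2d"]
    w, h = anchor["w2d"] * np.exp(t2d[2]), anchor["h2d"] * np.exp(t2d[3])
    box2d = [x, y, w, h]                                    # [x, y, w, h]'_2D

    xp = anchor["x_p"] + tp[0] * anchor["w2d"]              # projected 3D center (pixels)
    yp = anchor["y_p"] + tp[1] * anchor["h2d"]
    zp = anchor["z_p"] + tp[2]                              # depth prior plus predicted offset
    box3d = [anchor["w3d"] * np.exp(t3d[0]), anchor["h3d"] * np.exp(t3d[1]),
             anchor["l3d"] * np.exp(t3d[2]), anchor["theta"] + t3d[3]]  # [w, h, l, theta]'_3D

    # Step S4: back-project the center [x, y, z]'_P to camera coordinates by inverting the encoding
    uvw = np.array([xp * zp, yp * zp, zp])
    center_cam = np.linalg.solve(P[:, :3], uvw - P[:, 3])   # [x, y, z]'_3D
    return box2d, box3d, center_cam
```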
Further, in step S5, the adjustment step σ of the observation angle, the decay coefficient γ and the preset termination parameter β are set in advance. The two-dimensional bounding box, the three-dimensional bounding box with its center position coordinates, the adjustment step, the decay coefficient and the preset termination parameter are then taken as inputs to the observation-angle optimization: an L1 loss is computed between the two-dimensional bounding box and the projected two-dimensional box obtained by projecting the three-dimensional bounding box, and the observation angle is adjusted continuously, so that the 3D information is projected onto the 2D information for optimization.
Specifically, in step S5, the three-dimensional bounding box is projected to obtain the projected two-dimensional box [x_min, y_min, x_max, y_max], denoted ρ; the projection formula is:
[Formulas defining Υ_0 (the half-length, half-width and half-height extents of the target three-dimensional bounding box) and Υ_3D (their positions in 3D space) are rendered as images in the original publication.]
Υ_P = P · Υ_3D,  Υ_2D = Υ_P / Υ_P[φ_z]
x_min = min(Υ_2D[φ_x]), y_min = min(Υ_2D[φ_y])
x_max = max(Υ_2D[φ_x]), y_max = max(Υ_2D[φ_y])
where Υ_0, Υ_3D, Υ_P and Υ_2D respectively denote the half-length, half-width and half-height extents of the target three-dimensional bounding box, their positions in 3D space, their positions after projection, and their positions converted to 2D; the whole process converts the three-dimensional bounding box into a projected two-dimensional box. φ_x, φ_y and φ_z denote the indices of the [x, y, z] axes, [w, h, l, θ]'_3D is the three-dimensional bounding box, [x, y, z]'_P is the center position coordinate of the three-dimensional bounding box, and P is the given projection matrix for the camera coordinate system.
In this embodiment, the loss between the projected two-dimensional box [x_min, y_min, x_max, y_max] obtained by projecting the three-dimensional bounding box and the two-dimensional bounding box b'_2D is computed to optimize the observation angle.
When the loss does not improve within the range θ ± σ, the step size σ is reduced by the decay coefficient γ; while σ > β the above operation is repeated, and it terminates once σ < β.
At this point, the 12 parameters c', [x, y, w, h]'_2D, [x, y, z]'_3D and the optimized [w, h, l, θ]'_3D are output, i.e., the target detection result is obtained.
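A sketch of this observation-angle optimization follows; the corner construction used for the projection and the default values of σ, γ and β are placeholders for illustration:

```python
import numpy as np

def project_3d_box(center_cam, w, h, l, theta, P):
    """Project a 3D box to its tight 2D box [x_min, y_min, x_max, y_max]."""
    half = np.array([[sx * w / 2, sy * h / 2, sz * l / 2]
                     for sx in (-1, 1) for sy in (-1, 1) for sz in (-1, 1)]).T  # 3 x 8 corner offsets
    c, s = np.cos(theta), np.sin(theta)
    R_y = np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]])
    pts = R_y @ half + np.asarray(center_cam, dtype=float).reshape(3, 1)
    proj = P @ np.vstack([pts, np.ones((1, 8))])
    uv = proj[:2] / proj[2]
    return np.array([uv[0].min(), uv[1].min(), uv[0].max(), uv[1].max()])

def optimize_theta(theta, box2d_xyxy, center_cam, whl, P, sigma=0.3, gamma=0.5, beta=1e-3):
    """Hill-climb theta: try theta +/- sigma, keep a move that lowers the L1 loss
    against the predicted 2D box, shrink sigma by gamma when neither helps,
    and stop once sigma < beta."""
    w, h, l = whl
    loss = lambda th: np.abs(project_3d_box(center_cam, w, h, l, th, P) - box2d_xyxy).sum()
    best = loss(theta)
    while sigma >= beta:
        improved = False
        for cand in (theta + sigma, theta - sigma):
            if loss(cand) < best:
                theta, best, improved = cand, loss(cand), True
        if not improved:
            sigma *= gamma  # no loss update within theta +/- sigma: decay the step
    return theta
```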
Optionally, in the step S2, the training of the target detection neural network specifically includes the following steps:
and constructing a multi-task network loss function of target detection to perform back propagation on the training sample to obtain a weight parameter of the optimized neural network. In the training stage of the neural network, the weight parameters of the trained target detection neural network are obtained by calculating the loss function and performing back propagation on the neural network.
The network loss function comprises a classification loss function, a distillation loss function, a 2D frame regression loss function and a 3D frame regression loss function.
In this embodiment, for the classification loss, a polynomial logistic loss function based on softmax is adopted; that is, the classification loss function is:
L_class = -log( exp(c_τ) / Σ_{i=1..n_c} exp(c_i) )
For the two-dimensional bounding box regression loss, a negative logistic loss is applied to the IOU between the two-dimensional bounding box b'_2D and the matched ground-truth two-dimensional box b̂_2D; that is, the 2D box regression loss function is:
L_2D = -log( IOU(b'_2D, b̂_2D) )
For the three-dimensional bounding box regression loss, a SmoothL1 regression loss is applied to each parameter of the three-dimensional box transform b'_3D with respect to its ground truth b̂_3D; that is, the 3D box regression loss function is:
L_3D = Σ_{b' ∈ {[x, y, z]'_P, [w, h, l, θ]'_3D}} SmoothL1(b' − b̂_3D)
SmoothL1(x) = 0.5·x² if |x| < 1, |x| − 0.5 otherwise
where SmoothL1 is the smoothed L1 loss function. Its piecewise definition avoids the problem that the L1 loss curve is not smooth at its break point, which would make training unstable.
Since a distillation neural network is introduced, the class prediction of the target can be optimized through the teacher-student network: the distillation network predicts the teacher class, the student class and their probabilities, the student class being the optimized target class, and the distillation loss function is obtained as:
L_distill = (c_ls_teacher' − c_ls_student')² + (prob_teacher − prob_student)²
the overall multitasking network loss function is obtained as:
L = L_class + λ_1·L_2D + λ_2·L_3D + λ_3·L_distill
where c_τ is the predicted value of the τ-th (true) target class, c_i is the predicted value of the i-th class, n_c is the total number of target classes, IOU is the intersection-over-union of the predicted box and the ground-truth box, b'_3D is the three-dimensional bounding box, b̂_3D is the ground-truth three-dimensional box, SmoothL1 is the smoothed L1 function, prob_teacher is the teacher class probability, prob_student is the student class probability, λ_1 is the weight coefficient of the 2D box regression loss, λ_2 is the weight coefficient of the 3D box regression loss, and λ_3 is the weight coefficient of the distillation loss.
In this embodiment, the anchor point parameters and target classes output during training, together with the label parameters of the training data set, are fed into the multi-task network loss function, and the total loss L is back-propagated through the target detection neural network to optimize its weight parameters until training ends. When training ends, the weight parameters of the target detection neural network are saved for application in real scenarios.
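A rough sketch of how the multi-task loss could be assembled and back-propagated, following the textual descriptions above (the λ values and the center-size box convention in the IOU helper are placeholders), is:

```python
import torch
import torch.nn.functional as F

def box_iou_xywh(a, b):
    """Row-wise IOU of center-size boxes [x, y, w, h] (helper for the sketch)."""
    ax1, ay1 = a[:, 0] - a[:, 2] / 2, a[:, 1] - a[:, 3] / 2
    ax2, ay2 = a[:, 0] + a[:, 2] / 2, a[:, 1] + a[:, 3] / 2
    bx1, by1 = b[:, 0] - b[:, 2] / 2, b[:, 1] - b[:, 3] / 2
    bx2, by2 = b[:, 0] + b[:, 2] / 2, b[:, 1] + b[:, 3] / 2
    iw = (torch.min(ax2, bx2) - torch.max(ax1, bx1)).clamp(min=0)
    ih = (torch.min(ay2, by2) - torch.max(ay1, by1)).clamp(min=0)
    inter = iw * ih
    union = a[:, 2] * a[:, 3] + b[:, 2] * b[:, 3] - inter
    return inter / union.clamp(min=1e-6)

def multitask_loss(cls_logits, target, box2d, box2d_gt, box3d, box3d_gt,
                   c_teacher, c_student, prob_t, prob_s, lam=(1.0, 1.0, 0.5)):
    """L = L_class + lam1*L_2D + lam2*L_3D + lam3*L_distill (lambda values are placeholders)."""
    l_class = F.cross_entropy(cls_logits, target)                            # softmax polynomial logistic loss
    l_2d = -torch.log(box_iou_xywh(box2d, box2d_gt).clamp(min=1e-6)).mean()  # negative log of IOU
    l_3d = F.smooth_l1_loss(box3d, box3d_gt)                                 # SmoothL1 over the 3D parameters
    l_distill = ((c_teacher - c_student) ** 2).mean() + ((prob_t - prob_s) ** 2).mean()
    return l_class + lam[0] * l_2d + lam[1] * l_3d + lam[2] * l_distill

# One training step: loss = multitask_loss(...); loss.backward(); optimizer.step(); optimizer.zero_grad()
```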
The method can be used for three-dimensional target detection based on a monocular image. Exploiting the pinhole imaging principle of the monocular camera, it fully considers the different resolutions of targets at different positions and depths of the image, processes the depth information of detected targets region by region, pre-processes the monocular image through the semantic-information-extraction backbone network, and optimizes target classification with self-knowledge distillation, thereby effectively eliminating the interference of external noise and improving the accuracy of monocular-image three-dimensional target detection.
The electronic device for target detection in the embodiment of the application comprises a processor and a memory, wherein the processor executes codes in the memory to realize the following target detection method:
defining an anchor frame template comprising two-dimensional parameters [w, h]_2D, three-dimensional parameters [w, h, l, θ]_3D and a depth information parameter, where w, h and l respectively represent given values of the width, height and length of the target, and θ represents the observation angle of the camera to the target;
constructing a target detection neural network based on semantic information extraction and annular stripe segmentation depth perception, training, and extracting anchor point parameters and target types of targets in an input image by using the trained target detection neural network;
performing transformation calculation on the anchor point parameters and the corresponding anchor frame template to obtain a two-dimensional boundary frame, a three-dimensional boundary frame and a central position coordinate of the three-dimensional boundary frame;
back-projecting the center position coordinates by inverting the encoding to obtain camera coordinates;
obtaining the projected two-dimensional box produced by projecting the three-dimensional bounding box, computing an L1 loss between the projected two-dimensional box and the two-dimensional bounding box, and adjusting the observation angle until the adjustment step of the observation angle is smaller than a preset termination parameter, to obtain an adjusted three-dimensional bounding box;
and outputting the target category, the two-dimensional bounding box, the camera coordinates and the adjusted three-dimensional bounding box of the target.
Optionally, the processor executing the code in the memory may further implement the following target detection method:
and specifying the position of the shared central pixel and coding the depth information parameter.
Optionally, the processor executing the code in the memory may further implement the following target detection method:
obtaining a semantic information feature map of an input image by using a trained target detection neural network, and respectively performing global feature extraction and local feature extraction based on annular stripe segmentation depth perception on the semantic information feature map;
weighting the output parameters extracted by the global features and the output parameters extracted by the local features to obtain the anchor point parameters;
and performing self-knowledge distillation optimization on the prediction categories extracted by the global features and the prediction categories extracted by the local features to obtain the target categories.
Preferably, the processor executing the code in the memory may also implement other steps in the object detection method.
The present invention is not limited to the above-described embodiments, and it will be apparent to those skilled in the art that various modifications and improvements can be made without departing from the principle of the present invention, and such modifications and improvements are also considered to be within the scope of the present invention. Those not described in detail in this specification are within the skill of the art.

Claims (10)

1. A method of target detection, comprising the steps of:
defining an anchor frame template comprising two-dimensional parameters [w, h]_2D, three-dimensional parameters [w, h, l, θ]_3D and a depth information parameter, where w, h and l respectively represent given values of the width, height and length of the target, and θ represents the observation angle of the camera to the target;
constructing a target detection neural network based on semantic information extraction and annular stripe segmentation depth perception, training, and extracting anchor point parameters and target types of targets in an input image by using the trained target detection neural network;
performing transformation calculation on the anchor point parameters and the corresponding anchor frame template to obtain a two-dimensional boundary frame, a three-dimensional boundary frame and a central position coordinate of the three-dimensional boundary frame;
back-projecting the center position coordinates by inverting the encoding to obtain camera coordinates;
acquiring the projected two-dimensional box produced by projecting the three-dimensional bounding box, computing an L1 loss between the projected two-dimensional box and the two-dimensional bounding box, and adjusting the observation angle until the adjustment step of the observation angle is smaller than a preset termination parameter, to obtain an adjusted three-dimensional bounding box;
and outputting the target category, the two-dimensional bounding box, the camera coordinates and the adjusted three-dimensional bounding box of the target.
2. The object detection method of claim 1, wherein after defining the anchor frame template, further comprising:
and specifying the position of the shared central pixel and coding the depth information parameter.
3. The object detection method of claim 2, wherein encoding the depth information parameter specifically comprises:
projecting the 3D center position [x, y, z]_3D in the camera coordinate system into the image using a given projection matrix P, with the specified shared center pixel position denoted [x, y]_P, to obtain the projection coordinates [x, y, z]_P;
encoding the depth information parameter Z_P based on the projection coordinates to obtain
[x·z, y·z, z]_P^T = P · [x, y, z, 1]_3D^T
Where P is a given projection matrix.
4. The target detection method of claim 1, wherein extracting anchor point parameters and target classes of targets in the input image using a trained target detection neural network specifically comprises:
obtaining a semantic information feature map of an input image by using a trained target detection neural network, and respectively performing global feature extraction and local feature extraction based on annular stripe segmentation depth perception on the semantic information feature map;
weighting the output parameters of the global feature extraction and the output parameters of the local feature extraction to obtain the anchor point parameters;
and performing self-knowledge distillation optimization on the prediction category extracted by the global features and the prediction category extracted by the local features to obtain the target category.
5. The method for detecting the target according to claim 4, wherein the global feature extraction and the local feature extraction based on the annular band segmentation depth perception are respectively performed on the semantic information feature map, and specifically comprises the following steps:
when global feature extraction is performed by conventional convolution, a global feature F_global is introduced in the conventional convolution process: 3 × 3 convolution layers with padding 1 are applied, followed by ReLU non-linear activation, generating 512 feature maps;
the global feature F_global is fed into 12 prediction networks, each a 1 × 1 convolution layer with padding 1, and the global output sequence O_global is extracted after convolution;
When local feature extraction is carried out through depth perception convolution, annular strip segmentation is introduced in the depth perception convolution process, the semantic information feature map is segmented into r different strips, and different convolution kernels act on the different strips;
introducing a local feature F_local; the local feature F_local is produced by depth-aware convolution based on annular stripe segmentation, followed by ReLU non-linear activation, generating 512 feature maps;
the local feature F_local is fed into 12 prediction networks, each structured as r 1 × 1 convolution layers with padding 1 based on annular stripe segmentation, and the local output sequence O_local is extracted after convolution.
6. The target detection method of claim 5, wherein the self-knowledge distillation optimization of the prediction classes of global feature extraction and the prediction classes of local feature extraction is performed to obtain the target class, and specifically comprises:
inputting the class parameter c_ls of the global output sequence O_global and the class parameter c_loc of the local output sequence O_local into a distillation neural network, and outputting the global teacher class c_ls_teacher and student class c_ls_student, and the local teacher class c_loc_teacher and student class c_loc_student;
introducing weights α_teacher and α_student obtained by neural network learning, to obtain the optimized teacher class c_ls_teacher' and student class c_ls_student' as
c_ls_teacher' = c_ls_teacher × α_teacher + c_loc_teacher × (1 − α_teacher)
c_ls_student' = c_ls_student × α_student + c_loc_student × (1 − α_student);
and taking the student class c_ls_student' as the optimized target class.
7. The object detection method of claim 6, wherein the projection formula for the projected two-dimensional box [x_min, y_min, x_max, y_max] obtained by projecting the three-dimensional bounding box is:
[Formulas defining Υ_0 (the half-length, half-width and half-height extents of the three-dimensional bounding box) and Υ_3D (their positions in 3D space) are rendered as images in the original publication.]
Υ_P = P · Υ_3D,  Υ_2D = Υ_P / Υ_P[φ_z]
x_min = min(Υ_2D[φ_x]), y_min = min(Υ_2D[φ_y])
x_max = max(Υ_2D[φ_x]), y_max = max(Υ_2D[φ_y])
where φ_x, φ_y and φ_z denote the indices of the [x, y, z] axes, [w, h, l, θ]'_3D is the three-dimensional bounding box, [x, y, z]'_P is the center position coordinate of the three-dimensional bounding box, and P is the given projection matrix.
8. The target detection method of claim 7, wherein training the target detection neural network specifically comprises:
constructing a multi-task network loss function of target detection to perform back propagation on the training sample to obtain a weight parameter of an optimized neural network;
the network loss function includes a classification loss function, a distillation loss function, a 2D box regression loss function, and a 3D box regression loss function.
9. The object detection method of claim 8, wherein the classification loss function is:
L_class = -log( exp(c_τ) / Σ_{i=1..n_c} exp(c_i) )
the 2D box regression loss function is:
L_2D = -log( IOU(b'_2D, b̂_2D) )
the 3D box regression loss function is:
L_3D = Σ_{b' ∈ {[x, y, z]'_P, [w, h, l, θ]'_3D}} SmoothL1(b' − b̂_3D)
SmoothL1(x) = 0.5·x² if |x| < 1, |x| − 0.5 otherwise
the distillation loss function is:
L_distill = (c_ls_teacher' − c_ls_student')² + (prob_teacher − prob_student)²
obtaining the multitask network loss function as follows:
L = L_class + λ_1·L_2D + λ_2·L_3D + λ_3·L_distill
where c_τ is the predicted value of the τ-th (true) target class, c_i is the predicted value of the i-th target class, n_c is the total number of target classes, IOU is the intersection-over-union, b'_2D is the two-dimensional bounding box, b̂_2D is the ground-truth two-dimensional box, b'_3D is the three-dimensional bounding box, b̂_3D is the ground-truth three-dimensional box, SmoothL1 is the smoothed L1 function, prob_teacher is the teacher class probability, prob_student is the student class probability, λ_1 is the weight coefficient of the 2D box regression loss, λ_2 is the weight coefficient of the 3D box regression loss, and λ_3 is the weight coefficient of the distillation loss.
10. An electronic device for object detection, comprising a processor and a memory, wherein execution of code in the memory by the processor implements the method of any of claims 1 to 9.
CN202111045058.4A 2021-09-07 2021-09-07 Target detection method and electronic equipment Pending CN113903028A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111045058.4A CN113903028A (en) 2021-09-07 2021-09-07 Target detection method and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111045058.4A CN113903028A (en) 2021-09-07 2021-09-07 Target detection method and electronic equipment

Publications (1)

Publication Number Publication Date
CN113903028A true CN113903028A (en) 2022-01-07

Family

ID=79188679

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111045058.4A Pending CN113903028A (en) 2021-09-07 2021-09-07 Target detection method and electronic equipment

Country Status (1)

Country Link
CN (1) CN113903028A (en)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114565916A (en) * 2022-02-07 2022-05-31 苏州浪潮智能科技有限公司 Target detection model training method, target detection method and electronic equipment
CN114677565A (en) * 2022-04-08 2022-06-28 北京百度网讯科技有限公司 Training method of feature extraction network and image processing method and device
CN114677568A (en) * 2022-05-30 2022-06-28 山东极视角科技有限公司 Linear target detection method, module and system based on neural network
CN116189150A (en) * 2023-03-02 2023-05-30 吉咖智能机器人有限公司 Monocular 3D target detection method, device, equipment and medium based on fusion output
CN116189150B (en) * 2023-03-02 2024-05-17 吉咖智能机器人有限公司 Monocular 3D target detection method, device, equipment and medium based on fusion output
CN117711609A (en) * 2024-02-04 2024-03-15 广州中大医疗器械有限公司 Nerve transplanting scheme recommendation method and system based on big data
CN117711609B (en) * 2024-02-04 2024-05-03 广州中大医疗器械有限公司 Nerve transplanting scheme recommendation method and system based on big data
CN117893692A (en) * 2024-03-12 2024-04-16 之江实验室 Three-dimensional reconstruction method, device and storage medium based on symmetrical view
CN117893692B (en) * 2024-03-12 2024-05-28 之江实验室 Three-dimensional reconstruction method, device and storage medium based on symmetrical view

Similar Documents

Publication Publication Date Title
CN113903028A (en) Target detection method and electronic equipment
US10353271B2 (en) Depth estimation method for monocular image based on multi-scale CNN and continuous CRF
CN115601549B (en) River and lake remote sensing image segmentation method based on deformable convolution and self-attention model
CN111612807B (en) Small target image segmentation method based on scale and edge information
CN111860695B (en) Data fusion and target detection method, device and equipment
US11315266B2 (en) Self-supervised depth estimation method and system
EP3516624B1 (en) A method and system for creating a virtual 3d model
EP3716198A1 (en) Image reconstruction method and device
CN110622177B (en) Instance partitioning
CN112258618A (en) Semantic mapping and positioning method based on fusion of prior laser point cloud and depth map
US20230080133A1 (en) 6d pose and shape estimation method
CN109003297B (en) Monocular depth estimation method, device, terminal and storage medium
US20230121534A1 (en) Method and electronic device for 3d object detection using neural networks
US20240070972A1 (en) Rendering new images of scenes using geometry-aware neural networks conditioned on latent variables
US11961266B2 (en) Multiview neural human prediction using implicit differentiable renderer for facial expression, body pose shape and clothes performance capture
CN112053441A (en) Full-automatic layout recovery method for indoor fisheye image
CN115238758A (en) Multi-task three-dimensional target detection method based on point cloud feature enhancement
US11887248B2 (en) Systems and methods for reconstructing a scene in three dimensions from a two-dimensional image
CN117542122B (en) Human body pose estimation and three-dimensional reconstruction method, network training method and device
CN116468793A (en) Image processing method, device, electronic equipment and storage medium
WO2022208440A1 (en) Multiview neural human prediction using implicit differentiable renderer for facial expression, body pose shape and clothes performance capture
CN117237623B (en) Semantic segmentation method and system for remote sensing image of unmanned aerial vehicle
CN116630917A (en) Lane line detection method
CN114998630B (en) Ground-to-air image registration method from coarse to fine
CN113902933A (en) Ground segmentation network model training method, device, equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination