CN113903028A - Target detection method and electronic equipment - Google Patents

Target detection method and electronic equipment

Info

Publication number
CN113903028A
Authority
CN
China
Prior art keywords
dimensional
target
student
teacher
global
Prior art date
Legal status
Pending
Application number
CN202111045058.4A
Other languages
Chinese (zh)
Inventor
高戈
杜能
文凡
余星源
李明
常军
陈怡
Current Assignee
Wuhan University WHU
Original Assignee
Wuhan University WHU
Priority date
Filing date
Publication date
Application filed by Wuhan University WHU filed Critical Wuhan University WHU
Priority to CN202111045058.4A
Publication of CN113903028A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/047Probabilistic or stochastic networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses a target detection method and electronic equipment, relating to the technical field of computer vision. The method comprises the following steps: defining an anchor frame template; constructing and training a target detection neural network, and extracting anchor point parameters and target classes of targets in an input image by using the trained network; obtaining a two-dimensional bounding box, a three-dimensional bounding box and the center position coordinates of the three-dimensional bounding box; back-projecting the center position coordinates by inverting the encoding to obtain camera coordinates; obtaining the projected two-dimensional box produced by projecting the three-dimensional bounding box, computing an L1 loss between the projected two-dimensional box and the two-dimensional bounding box, and adjusting the observation angle until the adjustment step of the observation angle is smaller than a preset termination parameter, to obtain an adjusted three-dimensional bounding box; and outputting the target class, the two-dimensional bounding box, the camera coordinates and the adjusted three-dimensional bounding box of the target. The method effectively eliminates the interference of external noise and improves the accuracy of monocular-image three-dimensional target detection.

Description

Target detection method and electronic equipment
Technical Field
The application relates to the technical field of computer vision, in particular to a target detection method and electronic equipment.
Background
Currently, in practical application scenarios such as automatic driving, information such as a coordinate position, a three-dimensional size, and a deflection angle of a target is essential for a vehicle sensing system. Most three-dimensional target detection technologies are designed based on two-dimensional target detection technologies, and because the design concept of two-dimensional detection does not consider depth information, three-dimensional information of a target object cannot be directly obtained.
In the related art, the monocular image-based three-dimensional target detection technology can obtain the three-dimensional information of a target object by means of methods such as prior fusion information, geometric characteristic information, three-dimensional model matching or a depth estimation model.
However, solving for three-dimensional coordinates using geometric constraints relies on external state-of-the-art sub-networks, which introduces persistent noise and keeps network performance from breaking through the upper bound set by those sub-networks.
Disclosure of Invention
In view of one of the defects in the prior art, the present application aims to provide a target detection method and an electronic device, so as to solve the problem that three-dimensional target detection technologies in the related art excessively depend on an external network for additional depth information, which introduces persistent noise.
A first aspect of the present application provides a target detection method, which includes the steps of:
defining an anchor frame template comprising two-dimensional parameters [w, h]_2D, three-dimensional parameters [w, h, l, θ]_3D and a depth information parameter, where w, h and l respectively represent given values of the width, height and length of the target, and θ represents the observation angle of the camera to the target;
constructing a target detection neural network based on semantic information extraction and annular stripe segmentation depth perception, training, and extracting anchor point parameters and target types of targets in an input image by using the trained target detection neural network;
performing transformation calculation on the anchor point parameters and the corresponding anchor frame template to obtain a two-dimensional boundary frame, a three-dimensional boundary frame and a central position coordinate of the three-dimensional boundary frame;
back-projecting the center position coordinates by inverting the encoding to obtain camera coordinates;
obtaining the projected two-dimensional box produced by projecting the three-dimensional bounding box, computing an L1 loss between the projected two-dimensional box and the two-dimensional bounding box, and adjusting the observation angle until the adjustment step of the observation angle is smaller than a preset termination parameter, to obtain an adjusted three-dimensional bounding box;
and outputting the target category, the two-dimensional bounding box, the camera coordinates and the adjusted three-dimensional bounding box of the target.
In some embodiments, after the above defining the anchor frame template, the method further includes:
and specifying the position of the shared central pixel and coding the depth information parameter.
In some embodiments, the encoding the depth information parameter specifically includes:
projecting the 3D center position [x, y, z]_3D in the camera coordinate system into the image using a given projection matrix P, with the specified shared center pixel position denoted [x, y]_P, to obtain the projection coordinates [x, y, z]_P;
encoding the depth information parameter Z_P based on the projection coordinates to obtain
[x·z, y·z, z]_P^T = P · [x, y, z, 1]_3D^T
Where P is a given projection matrix.
In some embodiments, the extracting anchor point parameters and target classes of a target in an input image by using a trained target detection neural network specifically includes:
obtaining a semantic information feature map of an input image by using a trained target detection neural network, and respectively performing global feature extraction and local feature extraction based on annular stripe segmentation depth perception on the semantic information feature map;
weighting the output parameters extracted by the global features and the output parameters extracted by the local features to obtain the anchor point parameters;
and performing self-knowledge distillation optimization on the prediction categories extracted by the global features and the prediction categories extracted by the local features to obtain the target categories.
In some embodiments, the performing global feature extraction and local feature extraction based on annular band segmentation depth perception on the semantic information feature map respectively includes:
when global feature extraction is performed by conventional convolution, a global feature F_global is introduced in the conventional convolution process: 3 × 3 convolution layers with padding 1 are applied, followed by ReLU non-linear activation, generating 512 feature maps;
the global feature F_global is fed into 12 prediction networks, each a 1 × 1 convolution layer with padding 1, and the global output sequence O_global is extracted after convolution;
When local feature extraction is carried out through depth perception convolution, annular strip segmentation is introduced in the depth perception convolution process, the semantic information feature map is segmented into r different strips, and different convolution kernels act on the different strips;
introducing a local feature F_local; the local feature F_local is produced by depth-aware convolution based on annular stripe segmentation, followed by ReLU non-linear activation, generating 512 feature maps;
the local feature F_local is fed into 12 prediction networks, each consisting of r 1 × 1 convolution layers with padding 1 based on annular stripe segmentation, and the local output sequence O_local is extracted after convolution.
In some embodiments, the self-knowledge distillation optimization of the prediction category of global feature extraction and the prediction category of local feature extraction is performed to obtain the target category, which specifically includes:
inputting the class parameter c_ls of the global output sequence O_global and the class parameter c_loc of the local output sequence O_local into a distillation neural network, and outputting the global teacher class c_ls_teacher and student class c_ls_student, and the local teacher class c_loc_teacher and student class c_loc_student;
introducing weights α_teacher and α_student obtained by neural network learning, to obtain the optimized teacher class c_ls_teacher' and student class c_ls_student' as
c_ls_teacher' = c_ls_teacher × α_teacher + c_loc_teacher × (1 − α_teacher)
c_ls_student' = c_ls_student × α_student + c_loc_student × (1 − α_student);
and taking the student class c_ls_student' as the optimized target class.
In some embodiments, the projection formula for the projected two-dimensional box [x_min, y_min, x_max, y_max] obtained by projecting the three-dimensional bounding box is:
[Formulas defining Υ_0 (the half-length, half-width and half-height extents of the three-dimensional bounding box) and Υ_3D (their positions in 3D space) are rendered as images in the original publication.]
Υ_P = P · Υ_3D,  Υ_2D = Υ_P / Υ_P[φ_z]
x_min = min(Υ_2D[φ_x]), y_min = min(Υ_2D[φ_y])
x_max = max(Υ_2D[φ_x]), y_max = max(Υ_2D[φ_y])
where φ_x, φ_y and φ_z denote the indices of the [x, y, z] axes, [w, h, l, θ]'_3D is the three-dimensional bounding box, [x, y, z]'_P is the center position coordinate of the three-dimensional bounding box, and P is the given projection matrix.
In some embodiments, training the target detection neural network specifically includes:
constructing a multi-task network loss function of target detection to perform back propagation on the training sample to obtain a weight parameter of an optimized neural network;
the network loss function comprises a classification loss function, a distillation loss function, a 2D frame regression loss function and a 3D frame regression loss function.
In some embodiments, the classification loss function is:
L_class = -log( exp(c_τ) / Σ_{i=1..n_c} exp(c_i) )
the above 2D box regression loss function is:
L_2D = -log( IOU(b'_2D, b̂_2D) )
the 3D box regression loss function is:
L_3D = Σ_{b' ∈ {[x, y, z]'_P, [w, h, l, θ]'_3D}} SmoothL1(b' − b̂_3D)
SmoothL1(x) = 0.5·x² if |x| < 1, |x| − 0.5 otherwise
the above distillation loss function is:
L_distill = (c_ls_teacher' − c_ls_student')² + (prob_teacher − prob_student)²
the multitask network loss function is obtained as follows:
L = L_class + λ_1·L_2D + λ_2·L_3D + λ_3·L_distill
where c_τ is the predicted value of the τ-th (true) target class, c_i is the predicted value of the i-th target class, n_c is the total number of target classes, IOU is the intersection-over-union, b'_2D is the two-dimensional bounding box, b̂_2D is the ground-truth two-dimensional box, b'_3D is the three-dimensional bounding box, b̂_3D is the ground-truth three-dimensional box, SmoothL1 is the smoothed L1 function, prob_teacher is the teacher class probability, prob_student is the student class probability, λ_1 is the weight coefficient of the 2D box regression loss, λ_2 is the weight coefficient of the 3D box regression loss, and λ_3 is the weight coefficient of the distillation loss.
A second aspect of the present application provides an electronic device for object detection, comprising a processor and a memory, wherein the processor executes code in the memory to implement the method.
The beneficial effects brought by the technical solution provided by the present application include:
according to the target detection method and the electronic equipment, after the anchor frame template is defined, the target detection neural network based on semantic information extraction and annular strip segmentation depth perception is constructed and trained, so that anchor point parameters and target types of targets in an input image are extracted by using the trained target detection neural network, and the anchor point parameters and the corresponding anchor frame template are subjected to transformation calculation to obtain the central position coordinates of the two-dimensional boundary frame, the three-dimensional boundary frame and the three-dimensional boundary frame; then, the coordinates of the central position are subjected to back projection according to the reverse calculation of the codes to obtain camera coordinates; finally, theAcquiring a projected two-dimensional frame obtained after the three-dimensional bounding box is projected, and making the projected two-dimensional frame and the two-dimensional bounding box into an L shape1A loss function, which is used for adjusting the observation angle of view until the adjustment step length of the observation angle of view is smaller than a preset termination parameter, so as to obtain an adjusted three-dimensional bounding box, namely outputting the target category, the two-dimensional bounding box, the camera coordinate and the adjusted three-dimensional bounding box of the target as the detection result of the target; therefore, the semantic information of the image and the different target resolutions at different positions of the image are fully considered, the external noise interference is effectively eliminated, and the detection precision of the monocular image three-dimensional target is improved.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
FIG. 1 is a flow chart of a target detection method in an embodiment of the present application;
FIG. 2 is a diagram of a target detection neural network in an embodiment of the present application;
FIG. 3 is a DNet network architecture diagram in an embodiment of the present application;
fig. 4 is a schematic diagram of a depth-aware network divided by a ring stripe in an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. In addition, the technical features mentioned in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
The embodiment of the application provides a target detection method and electronic equipment, which can solve the problem that continuous noise is introduced due to the fact that a three-dimensional target detection technology in the related art excessively depends on an external network to obtain additional depth information.
As shown in fig. 1, the target detection method according to the embodiment of the present application includes the following steps:
s1, defining an anchor frame template comprising two-dimensional parameters [ w, h ]]2DThree-dimensional parameters [ w, h, l, theta ]]3DAnd depth information parameters to define anchor point templates in respective dimension spaces to realize simultaneous prediction of the two-dimensional bounding box and the three-dimensional bounding box.
Here w, h and l represent given values of the target's width, height and length, i.e., given values in the detection camera coordinate system, and θ represents the observation angle of the camera to the target. Unlike a 2D target, a 3D target has an orientation, so θ represents the viewing angle of the camera to the target, corresponding to a rotation about the Y-axis of the camera coordinate system.
S2, constructing and training a target detection neural network based on semantic information extraction and annular stripe segmentation depth perception, performing operation prediction on an input image by using the trained target detection neural network, and extracting anchor point parameters and target types of targets in the input image.
And S3, performing transformation calculation on the anchor point parameters and the corresponding anchor frame template to obtain a two-dimensional boundary frame, a three-dimensional boundary frame and a central position coordinate of the three-dimensional boundary frame.
And S4, back-projecting the center position coordinates by inverting the encoding to obtain camera coordinates.
S5, obtaining the projected two-dimensional box produced by projecting the three-dimensional bounding box, computing an L1 loss between the projected two-dimensional box and the two-dimensional bounding box, and adjusting the observation angle until the adjustment step of the observation angle is smaller than the preset termination parameter, thereby obtaining the adjusted three-dimensional bounding box.
And S6, outputting the target category, the two-dimensional boundary frame, the camera coordinate and the adjusted three-dimensional boundary frame of the target.
In the target detection method of this embodiment, after the anchor frame template is defined, a target detection neural network based on semantic information extraction and annular stripe segmentation depth perception is constructed and trained; the trained network is used to extract anchor point parameters and target classes of targets in an input image, and the anchor point parameters and the corresponding anchor frame template are transformed to obtain the two-dimensional bounding box, the three-dimensional bounding box and the center position coordinates of the three-dimensional bounding box. The center position coordinates are then back-projected by inverting the encoding to obtain camera coordinates. Finally, the projected two-dimensional box obtained by projecting the three-dimensional bounding box is acquired, an L1 loss is computed between the projected two-dimensional box and the two-dimensional bounding box, and the observation angle is adjusted until its adjustment step is smaller than the preset termination parameter, yielding the adjusted three-dimensional bounding box; the target class, the two-dimensional bounding box, the camera coordinates and the adjusted three-dimensional bounding box are then output as the detection result of the target. In this way, the semantic information of the image and the different target resolutions at different positions of the image are fully considered, external noise interference is effectively eliminated, and the detection accuracy of monocular-image three-dimensional targets is improved.
Further, in step S1, before defining the anchor frame template, the method further includes: the hyper-parameters H, W and A are predefined based on the size and resolution of the input image.
H and W respectively represent the height and width of an input image, and A represents the number of anchor frame templates to be generated by each anchor point; the number of anchor points in the input image is H × W.
In this embodiment, taking a 3-channel monocular image with a size of 572 × 572 as an example, H may be set to 388, W to 388 and A to 36; that is, 388 × 388 anchor points are selected and each anchor point generates 36 anchor frame templates of different sizes, so that 388 × 388 × 36 anchor frame templates can be obtained from one input image.
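For a sense of scale only, the following sketch (an illustration, not the patented implementation; the function make_anchor_grid and the NumPy layout are assumptions) enumerates the H × W anchor positions with A templates each, matching the 388 × 388 × 36 count above:

```python
import numpy as np

def make_anchor_grid(H=388, W=388, A=36):
    """Enumerate H*W anchor positions, each carrying A anchor frame templates.

    Returns an (H*W*A, 3) array of (row, col, template_index) triples,
    i.e. 388*388*36 = 5,419,584 anchor boxes for the example above.
    """
    rows, cols, tpl = np.meshgrid(np.arange(H), np.arange(W), np.arange(A), indexing="ij")
    return np.stack([rows.ravel(), cols.ravel(), tpl.ravel()], axis=1)

print(make_anchor_grid().shape)  # (5419584, 3)
```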
Preferably, in step S1, after the anchor frame template is defined, the method further includes the following step: specifying the position of the shared center pixel and encoding the depth information parameter.
In this embodiment, to place an anchor and define a complete two-dimensional or three-dimensional bounding box, a shared center pixel position [x, y]_P needs to be specified, where the 2D parameters are expressed in pixel coordinates.
Encoding the depth information parameter Z_P specifically comprises the following steps:
First, the 3D center position [x, y, z]_3D in the camera coordinate system is projected into the image using the given projection matrix P, with the specified shared center pixel position [x, y]_P, to obtain the projection coordinates [x, y, z]_P.
Then, the depth information parameter Z_P is encoded based on the projection coordinates to obtain
[x·z, y·z, z]_P^T = P · [x, y, z, 1]_3D^T
Where P is a given projection matrix.
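A minimal sketch of this projection-based encoding and its inverse (used later for back-projection in step S4) is given below. It assumes a standard 3 × 4 pinhole projection matrix P with an invertible left 3 × 3 block; since the patent's formula is rendered as an image, the code is an illustrative assumption rather than the patented computation:

```python
import numpy as np

def encode_center(center_3d, P):
    """Project the 3D center [x, y, z]_3D (camera coordinates) with the 3x4
    projection matrix P; the third component of the projection is the encoded
    depth Z_P."""
    proj = P @ np.append(np.asarray(center_3d, dtype=float), 1.0)     # homogeneous projection
    return np.array([proj[0] / proj[2], proj[1] / proj[2], proj[2]])  # [x, y, z]_P

def decode_center(center_p, P):
    """Invert the encoding: recover camera coordinates from [x, y, z]_P,
    assuming the left 3x3 block of P is invertible."""
    x_p, y_p, z_p = center_p
    uvw = np.array([x_p * z_p, y_p * z_p, z_p])
    return np.linalg.solve(P[:, :3], uvw - P[:, 3])  # [x, y, z]_3D
```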
In the present embodiment, θ_3D represents the viewing angle, i.e., a rotation about the Y-axis of the camera coordinate system, so θ_3D accounts for the relative orientation of the target with respect to the camera view. The anchor frame mechanism pre-computes, by mean statistics, the depth information parameter Z_P and the three-dimensional parameters [w, h, l, θ]_3D for each anchor point, which then serve as strong prior information to reduce the difficulty of estimating the three-dimensional parameters.
As shown in fig. 2, preferably, in the step S2, extracting anchor point parameters and object types of the object in the input image by using the trained object detection neural network specifically includes the following steps:
firstly, after a semantic information feature map of an input image is obtained by using a trained target detection neural network, global feature extraction and local feature extraction based on annular stripe segmentation depth perception are respectively carried out on the semantic information feature map, namely global convolution and local annular segmentation convolution.
And then, carrying out weighting processing on the output parameters of the global feature extraction and the output parameters of the local feature extraction to obtain the anchor point parameters, thereby realizing the combined feature extraction.
And finally, self-knowledge distillation optimization is performed on the prediction categories extracted by the global features and the prediction categories extracted by the local features to obtain the target category. In this embodiment, the neural network generates the parameters PA-GEN as the "student" and "teacher" parameters in self-knowledge distillation.
In this embodiment, the target detection neural network includes a DNet module, a combined feature extraction module and a self-knowledge distillation module. That is, when extracting semantic information features, DNet (a deeper convolutional neural network) is used as the basic semantic information extractor to obtain an H × W semantic information feature map; in other words, DNet serves as the backbone network for extracting the semantic information feature map from the input monocular image.
As shown in fig. 3, specifically, the DNet comprises coding blocks on the left of the figure, decoding blocks on the right, and intermediate blocks in the middle. An input 3-channel picture is first fed into coding block C1, which comprises two 3 × 3 convolution layers, each followed by a ReLU activation. The output of C1 is duplicated: one copy is down-sampled by a 2 × 2 max-pooling operation into coding block C2, which has a structure similar to C1, and the other copy enters intermediate block M1, which has only one 3 × 3 convolution layer.
Accordingly, the output of C2 is down-sampled into coding block C3 and copied into intermediate block M2; the output of C3 passes through intermediate block M3, and the output of M3 is center-cropped and fed into decoding block D3. The output of D3 is up-sampled by a 2 × 2 convolution layer and concatenated with the output of M2 cropped to the same size, then fed into decoding block D2. The output of D2 is up-sampled by a 2 × 2 convolution layer, concatenated with the output of M1 cropped to the same size, and fed into decoding block D1; the output of D1 passes through a 1 × 1 convolution layer to produce a pixel-level semantic information feature map of size H × W.
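Because the block wiring above is dense, a compressed PyTorch sketch of such an encoder-decoder layout follows. Channel counts, the crop-and-concatenate details and the resulting output size are assumptions for illustration only, not the patented DNet:

```python
import torch
import torch.nn as nn

def double_conv(c_in, c_out):
    # Two 3x3 (unpadded) convolutions, each followed by ReLU, as in blocks C1-C3 / D1-D3
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, 3), nn.ReLU(inplace=True),
        nn.Conv2d(c_out, c_out, 3), nn.ReLU(inplace=True),
    )

def crop_to(t, ref):
    # Center-crop tensor t to the spatial size of ref ("cut to the same size")
    dh = (t.shape[-2] - ref.shape[-2]) // 2
    dw = (t.shape[-1] - ref.shape[-1]) // 2
    return t[..., dh:dh + ref.shape[-2], dw:dw + ref.shape[-1]]

class DNetSketch(nn.Module):
    def __init__(self, ch=(64, 128, 256)):
        super().__init__()
        self.c1, self.c2, self.c3 = double_conv(3, ch[0]), double_conv(ch[0], ch[1]), double_conv(ch[1], ch[2])
        self.pool = nn.MaxPool2d(2)                                   # 2x2 max-pool down-sampling
        self.m1, self.m2, self.m3 = (nn.Conv2d(c, c, 3) for c in ch)  # intermediate blocks: one 3x3 conv each
        self.d3 = double_conv(ch[2], ch[2])                           # decoding block D3
        self.up3 = nn.ConvTranspose2d(ch[2], ch[1], 2, stride=2)      # 2x2 up-sampling before D2
        self.d2 = double_conv(2 * ch[1], ch[1])
        self.up2 = nn.ConvTranspose2d(ch[1], ch[0], 2, stride=2)      # 2x2 up-sampling before D1
        self.d1 = double_conv(2 * ch[0], ch[0])
        self.head = nn.Conv2d(ch[0], ch[0], 1)                        # 1x1 conv -> pixel-level semantic feature map

    def forward(self, x):
        e1 = self.c1(x)                       # coding block C1
        e2 = self.c2(self.pool(e1))           # coding block C2
        e3 = self.c3(self.pool(e2))           # coding block C3
        d3 = self.d3(self.m3(e3))             # M3 -> D3
        u3 = self.up3(d3)
        d2 = self.d2(torch.cat([crop_to(self.m2(e2), u3), u3], dim=1))  # M2 skip, cropped and concatenated
        u2 = self.up2(d2)
        d1 = self.d1(torch.cat([crop_to(self.m1(e1), u2), u2], dim=1))  # M1 skip, cropped and concatenated
        return self.head(d1)

feat = DNetSketch()(torch.randn(1, 3, 572, 572))  # semantic information feature map
```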
Further, when global feature extraction and local feature extraction based on annular stripe segmentation depth perception are respectively carried out on the semantic information feature maps, the obtained semantic information feature maps are respectively sent into two branches, one is global feature extraction, and the other is local feature extraction, and the method specifically comprises the following steps:
A1. When global feature extraction is performed by conventional convolution, a global feature F_global is introduced in the conventional convolution process: 3 × 3 convolution layers with padding 1 are applied, followed by ReLU (Rectified Linear Unit) non-linear activation, generating 512 feature maps. Global feature extraction uses conventional convolution, whose kernels act over the whole space, i.e., global convolution.
A2. The global feature F_global is fed into 12 prediction networks, each a 1 × 1 convolution layer with padding 1, and the global output sequence O_global is extracted after convolution.
The global output sequence O_global comprises: c, [t_x, t_y, t_w, t_h]_2D, [t_x, t_y, t_z]_P and [t_w, t_h, t_l, t_θ]_3D, for a total of 12 outputs, where c denotes the class, [t_x, t_y, t_w, t_h]_2D denotes the 2D anchor-box-related parameters predicted by global feature extraction, [t_x, t_y, t_z]_P denotes the projected center, and [t_w, t_h, t_l, t_θ]_3D denotes the size and orientation.
A3. When local feature extraction is performed by depth-aware convolution, annular stripe segmentation is introduced in the depth-aware convolution process: the semantic information feature map is segmented into r different stripes, and different convolution kernels act on different stripes. Local feature extraction thus uses depth-aware convolution, i.e., local convolution.
Optionally, r is preset according to actual conditions such as the camera resolution and the position where the camera is mounted on the autonomous vehicle.
A4. A local feature F_local is introduced: depth-aware convolution based on annular stripe segmentation is applied, followed by ReLU non-linear activation, generating 512 feature maps.
A5. The local feature F_local is fed into 12 prediction networks, each structured as r 1 × 1 convolution layers with padding 1 based on annular stripe segmentation; after convolution, a local output sequence O_local, analogous to the global output sequence O_global, is extracted. Local features of different image regions are thus processed separately, i.e., using annular stripe segmentation.
As shown in fig. 4, for a semantic information feature map of size H × W, feature extraction is performed using r different convolution layers, yielding r different depth feature maps. Each depth feature map is cut with a different annular stripe segmentation, and the r cut stripe regions are then stitched into a new depth feature map of size H × W.
Preferably, for convenience of operation, this can be realized by directly overlaying the inner stripes onto the outer ones, with the following specific steps:
Let i ∈ (0, r) be the stripe index, where i = 0 corresponds to the innermost stripe and i = r to the outermost stripe, and let E_i denote the feature map extracted by the convolution kernel of index i, whose stripe ring_i has an outer edge of size (i × h/r) × (i × w/r).
Then, the feature region of E_0 is cut along the outer edge of its stripe and overlaid onto E_1; in the same way, each E_i is cut along the outer edge of ring_i and overlaid onto E_{i+1}; finally, E_r contains all the depth feature stripes.
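A small sketch of this inner-over-outer covering operation, assuming the r depth feature maps are tensors of shape (N, C, H, W), is:

```python
import torch

def ring_merge(depth_maps):
    """Merge r depth feature maps by the covering trick described above: the
    central region of E_0 is overlaid onto E_1, the result onto E_2, and so on,
    so the final map keeps stripe i from the i-th convolution's output."""
    r = len(depth_maps)
    _, _, H, W = depth_maps[0].shape
    merged = depth_maps[0]
    for i in range(1, r):
        out = depth_maps[i].clone()
        h_i, w_i = int(i * H / r), int(i * W / r)      # outer edge of the i-th ring, centered
        top, left = (H - h_i) // 2, (W - w_i) // 2
        out[..., top:top + h_i, left:left + w_i] = merged[..., top:top + h_i, left:left + w_i]
        merged = out
    return merged

maps = [torch.randn(1, 512, 96, 96) for _ in range(4)]  # r = 4 depth feature maps
print(ring_merge(maps).shape)                           # torch.Size([1, 512, 96, 96])
```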
In this embodiment, the weighting processing is performed on the output parameter extracted by the global feature and the output parameter extracted by the local feature to obtain the anchor point parameter, and the method specifically includes the following steps:
A weight α learned by the neural network is introduced, exploiting the spatial-invariance advantage of the convolutional neural network, and is applied to the 2nd to 12th outputs; the specific output function is:
O_k = O_global^k · α_k + O_local^k · (1 − α_k)
where O_global^k is the k-th output of the global output sequence O_global, O_local^k is the k-th output of the local output sequence O_local, and α_k is the weight for the k-th output, with k ∈ (2, 12).
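A minimal sketch of this weighted fusion follows; the sigmoid used to keep each learned weight in (0, 1) is an assumption not stated in the text:

```python
import torch

def fuse_outputs(o_global, o_local, alpha):
    """O_k = O_global_k * alpha_k + O_local_k * (1 - alpha_k), with alpha_k learned."""
    alpha = torch.sigmoid(alpha)  # assumed squashing of the learned weight into (0, 1)
    return o_global * alpha + o_local * (1.0 - alpha)

# Fuse the 2nd to 12th outputs; the 1st (class) output is handled by self-distillation instead
o_g = [torch.randn(1, 36, 96, 96) for _ in range(12)]
o_l = [torch.randn(1, 36, 96, 96) for _ in range(12)]
alphas = torch.nn.Parameter(torch.zeros(12))            # one learnable weight per output head
fused = [fuse_outputs(o_g[k], o_l[k], alphas[k]) for k in range(1, 12)]
```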
Further, the self-knowledge distillation optimization is carried out on the prediction category extracted by the global features and the prediction category extracted by the local features to obtain the target category, and the method specifically comprises the following steps:
firstly, the global output sequences O are respectivelyglobalClass parameter c _ ls and local output sequence O in (1)localThe class parameter c _ loc in the teacher class is input into a distillation neural network, and the teacher class c _ ls of the global output is outputteacherAnd student class c _ lsstudenAnd teacher class c _ loc of local outputteacherAnd student class c _ locstudent
At this point, the teacher class and the student class can be processed by the softmax function to generate the teacher class probability prob_teacher and the student class probability prob_student.
Then, weights α_teacher and α_student learned by the distillation neural network are introduced to obtain the optimized teacher class c_ls_teacher' and student class c_ls_student' as
c_ls_teacher' = c_ls_teacher × α_teacher + c_loc_teacher × (1 − α_teacher)
c_ls_student' = c_ls_student × α_student + c_loc_student × (1 − α_student);
Finally, the student class c_ls_student' is taken as the optimized target class c'.
Specifically, the class parameter c_ls of the global output is fed into a 1 × 1 convolution layer to generate the global teacher class c_ls_teacher and student class c_ls_student, and the class parameter c_loc of the local output is fed into a 1 × 1 convolution layer to generate the local teacher class c_loc_teacher and student class c_loc_student.
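A PyTorch sketch of this teacher/student generation and fusion is given below; the channel counts and the way each 1 × 1 convolution is split into teacher and student maps are assumptions for illustration:

```python
import torch
import torch.nn as nn

class SelfDistillHead(nn.Module):
    """Two 1x1 convolutions produce (teacher, student) class maps from the global
    and local class features, mixed with learned weights alpha_teacher / alpha_student."""

    def __init__(self, c_in, n_classes):
        super().__init__()
        self.global_head = nn.Conv2d(c_in, 2 * n_classes, 1)   # -> c_ls_teacher, c_ls_student
        self.local_head = nn.Conv2d(c_in, 2 * n_classes, 1)    # -> c_loc_teacher, c_loc_student
        self.alpha_teacher = nn.Parameter(torch.tensor(0.5))
        self.alpha_student = nn.Parameter(torch.tensor(0.5))
        self.n = n_classes

    def forward(self, c_ls_feat, c_loc_feat):
        ls_t, ls_s = self.global_head(c_ls_feat).split(self.n, dim=1)
        loc_t, loc_s = self.local_head(c_loc_feat).split(self.n, dim=1)
        teacher = ls_t * self.alpha_teacher + loc_t * (1 - self.alpha_teacher)  # c_ls_teacher'
        student = ls_s * self.alpha_student + loc_s * (1 - self.alpha_student)  # c_ls_student'
        prob_t, prob_s = torch.softmax(teacher, dim=1), torch.softmax(student, dim=1)
        return teacher, student, prob_t, prob_s  # the student branch gives the optimized class c'

head = SelfDistillHead(c_in=512, n_classes=4)
outs = head(torch.randn(1, 512, 96, 96), torch.randn(1, 512, 96, 96))
```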
Preferably, the anchor point parameters and the corresponding anchor frame template are transformed to obtain the two-dimensional bounding box, the three-dimensional bounding box and the center position coordinates of the three-dimensional bounding box; that is, the relative coordinates [t_x, t_y, t_w, t_h]'_2D, [t_x, t_y, t_z]'_P and [t_w, t_h, t_l, t_θ]'_3D among the anchor-related parameters are transformed into the absolute coordinates [x, y, w, h]'_2D, [x, y, z]'_P and [w, h, l, θ]'_3D of the detected target's bounding boxes.
In this embodiment, the step S3 specifically includes the following steps:
first, use [ w, h]2DTwo-dimensional transformation is carried out to obtain a two-dimensional boundary frame [ x, y, w, h ]]'2DIs recorded as b'2D
[Two-dimensional transformation formulas, rendered as images in the original publication: the predicted offsets [t_x, t_y, t_w, t_h]_2D are applied to the shared center pixel [x, y]_P and the template size [w, h]_2D to obtain [x, y, w, h]'_2D.]
In this embodiment, the center pixel position [x, y]_P of each box is the spatial center position of that box. Similarly to the two-dimensional transformation, a three-dimensional transformation is performed using [w, h]_2D, Z_P and [w, h, l, θ]_3D to obtain the three-dimensional bounding box [w, h, l, θ]'_3D and its center position coordinates [x, y, z]'_P, denoted b'_3D:
[Three-dimensional transformation formulas, rendered as images in the original publication: the predicted offsets [t_x, t_y, t_z]_P and [t_w, t_h, t_l, t_θ]_3D are applied to [x, y]_P, [w, h]_2D, Z_P and [w, h, l, θ]_3D to obtain [x, y, z]'_P and [w, h, l, θ]'_3D.]
Subsequently, the three-dimensional center is estimated from its projection: the center position coordinates are back-projected from image space by inverting the encoding, giving the camera coordinates [x, y, z]'_3D.
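The decoding of steps S3 and S4 can be sketched as follows. Since the patent's transformation equations are rendered as images, the standard anchor-offset decoding used here (offsets scaled by the anchor's 2D size, exponential size scaling, additive depth and angle offsets) is an assumption, not the patented formula:

```python
import numpy as np

def decode_anchor(anchor, t2d, tp, t3d, P):
    """Decode one anchor into absolute boxes (illustrative sketch).

    anchor: dict with x_p, y_p (shared center pixel), w2d, h2d, z_p, w3d, h3d, l3d, theta
    t2d = [tx, ty, tw, th]_2D, tp = [tx, ty, tz]_P, t3d = [tw, th, tl, ttheta]_3D
    """
    x = anchor["x_p"] + t2d[0] * anchor["w2d"]
    y = anchor["y_p"] + t2d[1] * anchor["h2d"]
    w, h = anchor["w2d"] * np.exp(t2d[2]), anchor["h2d"] * np.exp(t2d[3])
    box2d = [x, y, w, h]                                    # [x, y, w, h]'_2D

    xp = anchor["x_p"] + tp[0] * anchor["w2d"]              # projected 3D center (pixels)
    yp = anchor["y_p"] + tp[1] * anchor["h2d"]
    zp = anchor["z_p"] + tp[2]                              # depth prior plus predicted offset
    box3d = [anchor["w3d"] * np.exp(t3d[0]), anchor["h3d"] * np.exp(t3d[1]),
             anchor["l3d"] * np.exp(t3d[2]), anchor["theta"] + t3d[3]]  # [w, h, l, theta]'_3D

    # Step S4: back-project the center [x, y, z]'_P to camera coordinates by inverting the encoding
    uvw = np.array([xp * zp, yp * zp, zp])
    center_cam = np.linalg.solve(P[:, :3], uvw - P[:, 3])   # [x, y, z]'_3D
    return box2d, box3d, center_cam
```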
Further, in step S5, the adjustment step σ of the observation angle, the decay coefficient γ and the preset termination parameter β are set in advance. The two-dimensional bounding box, the three-dimensional bounding box with its center position coordinates, the adjustment step, the decay coefficient and the preset termination parameter are then taken as inputs to the observation-angle optimization: an L1 loss is computed between the two-dimensional bounding box and the projected two-dimensional box obtained by projecting the three-dimensional bounding box, and the observation angle is adjusted continuously, so that the 3D information is projected onto the 2D information for optimization.
Specifically, in step S5, the three-dimensional bounding box is projected to obtain the projected two-dimensional box [x_min, y_min, x_max, y_max], denoted ρ; the projection formula is:
[Formulas defining Υ_0 (the half-length, half-width and half-height extents of the target three-dimensional bounding box) and Υ_3D (their positions in 3D space) are rendered as images in the original publication.]
Υ_P = P · Υ_3D,  Υ_2D = Υ_P / Υ_P[φ_z]
x_min = min(Υ_2D[φ_x]), y_min = min(Υ_2D[φ_y])
x_max = max(Υ_2D[φ_x]), y_max = max(Υ_2D[φ_y])
where Υ_0, Υ_3D, Υ_P and Υ_2D respectively denote the half-length, half-width and half-height extents of the target three-dimensional bounding box, their positions in 3D space, their positions after projection, and their positions converted to 2D; the whole process converts the three-dimensional bounding box into a projected two-dimensional box. φ_x, φ_y and φ_z denote the indices of the [x, y, z] axes, [w, h, l, θ]'_3D is the three-dimensional bounding box, [x, y, z]'_P is the center position coordinate of the three-dimensional bounding box, and P is the given projection matrix for the camera coordinate system.
In this embodiment, the loss between the projected two-dimensional box [x_min, y_min, x_max, y_max] obtained by projecting the three-dimensional bounding box and the two-dimensional bounding box b'_2D is computed to optimize the observation angle.
When the loss does not improve within the range θ ± σ, the step size σ is reduced by the decay coefficient γ; while σ > β the above operation is repeated, and it terminates once σ < β.
At this point, the 12 parameters c', [x, y, w, h]'_2D, [x, y, z]'_3D and the optimized [w, h, l, θ]'_3D are output, i.e., the target detection result is obtained.
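A sketch of this observation-angle optimization follows; the corner construction used for the projection and the default values of σ, γ and β are placeholders for illustration:

```python
import numpy as np

def project_3d_box(center_cam, w, h, l, theta, P):
    """Project a 3D box to its tight 2D box [x_min, y_min, x_max, y_max]."""
    half = np.array([[sx * w / 2, sy * h / 2, sz * l / 2]
                     for sx in (-1, 1) for sy in (-1, 1) for sz in (-1, 1)]).T  # 3 x 8 corner offsets
    c, s = np.cos(theta), np.sin(theta)
    R_y = np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]])
    pts = R_y @ half + np.asarray(center_cam, dtype=float).reshape(3, 1)
    proj = P @ np.vstack([pts, np.ones((1, 8))])
    uv = proj[:2] / proj[2]
    return np.array([uv[0].min(), uv[1].min(), uv[0].max(), uv[1].max()])

def optimize_theta(theta, box2d_xyxy, center_cam, whl, P, sigma=0.3, gamma=0.5, beta=1e-3):
    """Hill-climb theta: try theta +/- sigma, keep a move that lowers the L1 loss
    against the predicted 2D box, shrink sigma by gamma when neither helps,
    and stop once sigma < beta."""
    w, h, l = whl
    loss = lambda th: np.abs(project_3d_box(center_cam, w, h, l, th, P) - box2d_xyxy).sum()
    best = loss(theta)
    while sigma >= beta:
        improved = False
        for cand in (theta + sigma, theta - sigma):
            if loss(cand) < best:
                theta, best, improved = cand, loss(cand), True
        if not improved:
            sigma *= gamma  # no loss update within theta +/- sigma: decay the step
    return theta
```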
Optionally, in the step S2, the training of the target detection neural network specifically includes the following steps:
and constructing a multi-task network loss function of target detection to perform back propagation on the training sample to obtain a weight parameter of the optimized neural network. In the training stage of the neural network, the weight parameters of the trained target detection neural network are obtained by calculating the loss function and performing back propagation on the neural network.
The network loss function comprises a classification loss function, a distillation loss function, a 2D frame regression loss function and a 3D frame regression loss function.
In this embodiment, for the classification loss, a polynomial logistic loss function based on softmax is adopted; that is, the classification loss function is:
L_class = -log( exp(c_τ) / Σ_{i=1..n_c} exp(c_i) )
For the two-dimensional bounding box regression loss, a negative logistic loss is applied to the IOU between the two-dimensional bounding box b'_2D and the matched ground-truth two-dimensional box b̂_2D; that is, the 2D box regression loss function is:
L_2D = -log( IOU(b'_2D, b̂_2D) )
For the three-dimensional bounding box regression loss, a SmoothL1 regression loss is applied to each parameter of the three-dimensional box transform b'_3D with respect to its ground truth b̂_3D; that is, the 3D box regression loss function is:
L_3D = Σ_{b' ∈ {[x, y, z]'_P, [w, h, l, θ]'_3D}} SmoothL1(b' − b̂_3D)
SmoothL1(x) = 0.5·x² if |x| < 1, |x| − 0.5 otherwise
where SmoothL1 is the smoothed L1 loss function. Its piecewise definition avoids the problem that the L1 loss curve is not smooth at its break point, which would make training unstable.
Since a distillation neural network is introduced, the class prediction of the target can be optimized through the teacher-student network: the distillation network predicts the teacher class, the student class and their probabilities, the student class being the optimized target class, and the distillation loss function is obtained as:
L_distill = (c_ls_teacher' − c_ls_student')² + (prob_teacher − prob_student)²
the overall multitasking network loss function is obtained as:
L = L_class + λ_1·L_2D + λ_2·L_3D + λ_3·L_distill
where c_τ is the predicted value of the τ-th (true) target class, c_i is the predicted value of the i-th class, n_c is the total number of target classes, IOU is the intersection-over-union of the predicted box and the ground-truth box, b'_3D is the three-dimensional bounding box, b̂_3D is the ground-truth three-dimensional box, SmoothL1 is the smoothed L1 function, prob_teacher is the teacher class probability, prob_student is the student class probability, λ_1 is the weight coefficient of the 2D box regression loss, λ_2 is the weight coefficient of the 3D box regression loss, and λ_3 is the weight coefficient of the distillation loss.
In this embodiment, the anchor point parameters and target classes output during training, together with the label parameters of the training data set, are fed into the multi-task network loss function, and the total loss L is back-propagated through the target detection neural network to optimize its weight parameters until training ends. When training ends, the weight parameters of the target detection neural network are saved for application in real scenarios.
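A rough sketch of how the multi-task loss could be assembled and back-propagated, following the textual descriptions above (the λ values and the center-size box convention in the IOU helper are placeholders), is:

```python
import torch
import torch.nn.functional as F

def box_iou_xywh(a, b):
    """Row-wise IOU of center-size boxes [x, y, w, h] (helper for the sketch)."""
    ax1, ay1 = a[:, 0] - a[:, 2] / 2, a[:, 1] - a[:, 3] / 2
    ax2, ay2 = a[:, 0] + a[:, 2] / 2, a[:, 1] + a[:, 3] / 2
    bx1, by1 = b[:, 0] - b[:, 2] / 2, b[:, 1] - b[:, 3] / 2
    bx2, by2 = b[:, 0] + b[:, 2] / 2, b[:, 1] + b[:, 3] / 2
    iw = (torch.min(ax2, bx2) - torch.max(ax1, bx1)).clamp(min=0)
    ih = (torch.min(ay2, by2) - torch.max(ay1, by1)).clamp(min=0)
    inter = iw * ih
    union = a[:, 2] * a[:, 3] + b[:, 2] * b[:, 3] - inter
    return inter / union.clamp(min=1e-6)

def multitask_loss(cls_logits, target, box2d, box2d_gt, box3d, box3d_gt,
                   c_teacher, c_student, prob_t, prob_s, lam=(1.0, 1.0, 0.5)):
    """L = L_class + lam1*L_2D + lam2*L_3D + lam3*L_distill (lambda values are placeholders)."""
    l_class = F.cross_entropy(cls_logits, target)                            # softmax polynomial logistic loss
    l_2d = -torch.log(box_iou_xywh(box2d, box2d_gt).clamp(min=1e-6)).mean()  # negative log of IOU
    l_3d = F.smooth_l1_loss(box3d, box3d_gt)                                 # SmoothL1 over the 3D parameters
    l_distill = ((c_teacher - c_student) ** 2).mean() + ((prob_t - prob_s) ** 2).mean()
    return l_class + lam[0] * l_2d + lam[1] * l_3d + lam[2] * l_distill

# One training step: loss = multitask_loss(...); loss.backward(); optimizer.step(); optimizer.zero_grad()
```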
The method can be used for three-dimensional target detection based on a monocular image. Exploiting the pinhole imaging principle of the monocular camera, it fully considers the different resolutions of targets at different positions and depths of the image, processes the depth information of detected targets region by region, pre-processes the monocular image through the semantic-information-extraction backbone network, and optimizes target classification with self-knowledge distillation, thereby effectively eliminating the interference of external noise and improving the accuracy of monocular-image three-dimensional target detection.
The electronic device for target detection in the embodiment of the application comprises a processor and a memory, wherein the processor executes codes in the memory to realize the following target detection method:
defining an anchor frame template comprising two-dimensional parameters [w, h]_2D, three-dimensional parameters [w, h, l, θ]_3D and a depth information parameter, where w, h and l respectively represent given values of the width, height and length of the target, and θ represents the observation angle of the camera to the target;
constructing a target detection neural network based on semantic information extraction and annular stripe segmentation depth perception, training, and extracting anchor point parameters and target types of targets in an input image by using the trained target detection neural network;
performing transformation calculation on the anchor point parameters and the corresponding anchor frame template to obtain a two-dimensional boundary frame, a three-dimensional boundary frame and a central position coordinate of the three-dimensional boundary frame;
back-projecting the center position coordinates by inverting the encoding to obtain camera coordinates;
obtaining the projected two-dimensional box produced by projecting the three-dimensional bounding box, computing an L1 loss between the projected two-dimensional box and the two-dimensional bounding box, and adjusting the observation angle until the adjustment step of the observation angle is smaller than a preset termination parameter, to obtain an adjusted three-dimensional bounding box;
and outputting the target category, the two-dimensional bounding box, the camera coordinates and the adjusted three-dimensional bounding box of the target.
Optionally, the processor executing the code in the memory may further implement the following target detection method:
and specifying the position of the shared central pixel and coding the depth information parameter.
Optionally, the processor executing the code in the memory may further implement the following target detection method:
obtaining a semantic information feature map of an input image by using a trained target detection neural network, and respectively performing global feature extraction and local feature extraction based on annular stripe segmentation depth perception on the semantic information feature map;
weighting the output parameters extracted by the global features and the output parameters extracted by the local features to obtain the anchor point parameters;
and performing self-knowledge distillation optimization on the prediction categories extracted by the global features and the prediction categories extracted by the local features to obtain the target categories.
Preferably, the processor executing the code in the memory may also implement other steps in the object detection method.
The present invention is not limited to the above-described embodiments, and it will be apparent to those skilled in the art that various modifications and improvements can be made without departing from the principle of the present invention, and such modifications and improvements are also considered to be within the scope of the present invention. Those not described in detail in this specification are within the skill of the art.

Claims (10)

1. A method of target detection, comprising the steps of:
defining an anchor frame template comprising two-dimensional parameters [w, h]_2D, three-dimensional parameters [w, h, l, θ]_3D and a depth information parameter, where w, h and l respectively represent given values of the width, height and length of the target, and θ represents the observation angle of the camera to the target;
constructing a target detection neural network based on semantic information extraction and annular stripe segmentation depth perception, training, and extracting anchor point parameters and target types of targets in an input image by using the trained target detection neural network;
performing transformation calculation on the anchor point parameters and the corresponding anchor frame template to obtain a two-dimensional boundary frame, a three-dimensional boundary frame and a central position coordinate of the three-dimensional boundary frame;
back-projecting the center position coordinates by inverting the encoding to obtain camera coordinates;
acquiring the projected two-dimensional box produced by projecting the three-dimensional bounding box, computing an L1 loss between the projected two-dimensional box and the two-dimensional bounding box, and adjusting the observation angle until the adjustment step of the observation angle is smaller than a preset termination parameter, to obtain an adjusted three-dimensional bounding box;
and outputting the target category, the two-dimensional bounding box, the camera coordinates and the adjusted three-dimensional bounding box of the target.
2. The object detection method of claim 1, wherein after defining the anchor frame template, further comprising:
and specifying the position of the shared central pixel and coding the depth information parameter.
3. The object detection method of claim 2, wherein encoding the depth information parameter specifically comprises:
projecting the 3D center position [x, y, z]_3D in the camera coordinate system into the image using a given projection matrix P, with the specified shared center pixel position denoted [x, y]_P, to obtain the projection coordinates [x, y, z]_P;
encoding the depth information parameter Z_P based on the projection coordinates to obtain
[x·z, y·z, z]_P^T = P · [x, y, z, 1]_3D^T
Where P is a given projection matrix.
4. The target detection method of claim 1, wherein extracting anchor point parameters and target classes of targets in the input image using a trained target detection neural network specifically comprises:
obtaining a semantic information feature map of an input image by using a trained target detection neural network, and respectively performing global feature extraction and local feature extraction based on annular stripe segmentation depth perception on the semantic information feature map;
weighting the output parameters of the global feature extraction and the output parameters of the local feature extraction to obtain the anchor point parameters;
and performing self-knowledge distillation optimization on the prediction category extracted by the global features and the prediction category extracted by the local features to obtain the target category.
5. The method for detecting the target according to claim 4, wherein the global feature extraction and the local feature extraction based on the annular band segmentation depth perception are respectively performed on the semantic information feature map, and specifically comprises the following steps:
when global feature extraction is performed by conventional convolution, a global feature F_global is introduced in the conventional convolution process: 3 × 3 convolution layers with padding 1 are applied, followed by ReLU non-linear activation, generating 512 feature maps;
the global feature F_global is fed into 12 prediction networks, each a 1 × 1 convolution layer with padding 1, and the global output sequence O_global is extracted after convolution;
When local feature extraction is carried out through depth perception convolution, annular strip segmentation is introduced in the depth perception convolution process, the semantic information feature map is segmented into r different strips, and different convolution kernels act on the different strips;
introducing a local feature F_local; the local feature F_local is produced by depth-aware convolution based on annular stripe segmentation, followed by ReLU non-linear activation, generating 512 feature maps;
the local feature F_local is fed into 12 prediction networks, each structured as r 1 × 1 convolution layers with padding 1 based on annular stripe segmentation, and the local output sequence O_local is extracted after convolution.
6. The target detection method of claim 5, wherein the self-knowledge distillation optimization of the prediction classes of global feature extraction and the prediction classes of local feature extraction is performed to obtain the target class, and specifically comprises:
inputting the class parameter c_ls of the global output sequence O_global and the class parameter c_loc of the local output sequence O_local into a distillation neural network, and outputting the global teacher class c_ls_teacher and student class c_ls_student, and the local teacher class c_loc_teacher and student class c_loc_student;
introducing weights α_teacher and α_student obtained by neural network learning, to obtain the optimized teacher class c_ls_teacher' and student class c_ls_student' as
c_ls_teacher' = c_ls_teacher × α_teacher + c_loc_teacher × (1 − α_teacher)
c_ls_student' = c_ls_student × α_student + c_loc_student × (1 − α_student);
and taking the student class c_ls_student' as the optimized target class.
7. The object detection method of claim 6, wherein the projection formula for the projected two-dimensional box [x_min, y_min, x_max, y_max] obtained by projecting the three-dimensional bounding box is:
[Formulas defining Υ_0 (the half-length, half-width and half-height extents of the three-dimensional bounding box) and Υ_3D (their positions in 3D space) are rendered as images in the original publication.]
Υ_P = P · Υ_3D,  Υ_2D = Υ_P / Υ_P[φ_z]
x_min = min(Υ_2D[φ_x]), y_min = min(Υ_2D[φ_y])
x_max = max(Υ_2D[φ_x]), y_max = max(Υ_2D[φ_y])
where φ_x, φ_y and φ_z denote the indices of the [x, y, z] axes, [w, h, l, θ]'_3D is the three-dimensional bounding box, [x, y, z]'_P is the center position coordinate of the three-dimensional bounding box, and P is the given projection matrix.
8. The target detection method of claim 7, wherein training the target detection neural network specifically comprises:
constructing a multi-task network loss function of target detection to perform back propagation on the training sample to obtain a weight parameter of an optimized neural network;
the network loss function includes a classification loss function, a distillation loss function, a 2D box regression loss function, and a 3D box regression loss function.
9. The object detection method of claim 8, wherein the classification loss function is:
L_class = -log( exp(c_τ) / Σ_{i=1..n_c} exp(c_i) )
the 2D box regression loss function is:
L_2D = -log( IOU(b'_2D, b̂_2D) )
the 3D box regression loss function is:
L_3D = Σ_{b' ∈ {[x, y, z]'_P, [w, h, l, θ]'_3D}} SmoothL1(b' − b̂_3D)
SmoothL1(x) = 0.5·x² if |x| < 1, |x| − 0.5 otherwise
the distillation loss function is:
L_distill = (c_ls_teacher' − c_ls_student')² + (prob_teacher − prob_student)²
obtaining the multitask network loss function as follows:
L = L_class + λ_1·L_2D + λ_2·L_3D + λ_3·L_distill
where c_τ is the predicted value of the τ-th (true) target class, c_i is the predicted value of the i-th target class, n_c is the total number of target classes, IOU is the intersection-over-union, b'_2D is the two-dimensional bounding box, b̂_2D is the ground-truth two-dimensional box, b'_3D is the three-dimensional bounding box, b̂_3D is the ground-truth three-dimensional box, SmoothL1 is the smoothed L1 function, prob_teacher is the teacher class probability, prob_student is the student class probability, λ_1 is the weight coefficient of the 2D box regression loss, λ_2 is the weight coefficient of the 3D box regression loss, and λ_3 is the weight coefficient of the distillation loss.
10. An electronic device for object detection, comprising a processor and a memory, wherein execution of code in the memory by the processor implements the method of any of claims 1 to 9.
CN202111045058.4A 2021-09-07 2021-09-07 Target detection method and electronic equipment Pending CN113903028A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111045058.4A CN113903028A (en) 2021-09-07 2021-09-07 Target detection method and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111045058.4A CN113903028A (en) 2021-09-07 2021-09-07 Target detection method and electronic equipment

Publications (1)

Publication Number Publication Date
CN113903028A true CN113903028A (en) 2022-01-07

Family

ID=79188679

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111045058.4A Pending CN113903028A (en) 2021-09-07 2021-09-07 Target detection method and electronic equipment

Country Status (1)

Country Link
CN (1) CN113903028A (en)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114565916A (en) * 2022-02-07 2022-05-31 苏州浪潮智能科技有限公司 Target detection model training method, target detection method and electronic equipment
CN114677565A (en) * 2022-04-08 2022-06-28 北京百度网讯科技有限公司 Training method of feature extraction network and image processing method and device
CN114677568A (en) * 2022-05-30 2022-06-28 山东极视角科技有限公司 Linear target detection method, module and system based on neural network
CN116189150A (en) * 2023-03-02 2023-05-30 吉咖智能机器人有限公司 Monocular 3D target detection method, device, equipment and medium based on fusion output
CN116189150B (en) * 2023-03-02 2024-05-17 吉咖智能机器人有限公司 Monocular 3D target detection method, device, equipment and medium based on fusion output
CN117711609A (en) * 2024-02-04 2024-03-15 广州中大医疗器械有限公司 Nerve transplanting scheme recommendation method and system based on big data
CN117711609B (en) * 2024-02-04 2024-05-03 广州中大医疗器械有限公司 Nerve transplanting scheme recommendation method and system based on big data
CN117893692A (en) * 2024-03-12 2024-04-16 之江实验室 Three-dimensional reconstruction method, device and storage medium based on symmetrical view
CN117893692B (en) * 2024-03-12 2024-05-28 之江实验室 Three-dimensional reconstruction method, device and storage medium based on symmetrical view

Similar Documents

Publication Publication Date Title
CN113903028A (en) Target detection method and electronic equipment
US10353271B2 (en) Depth estimation method for monocular image based on multi-scale CNN and continuous CRF
CN115601549B (en) River and lake remote sensing image segmentation method based on deformable convolution and self-attention model
CN111612807B (en) Small target image segmentation method based on scale and edge information
CN111860695B (en) Data fusion and target detection method, device and equipment
US11315266B2 (en) Self-supervised depth estimation method and system
EP3516624B1 (en) A method and system for creating a virtual 3d model
EP3716198A1 (en) Image reconstruction method and device
CN110622177B (en) Instance partitioning
CN112258618A (en) Semantic mapping and positioning method based on fusion of prior laser point cloud and depth map
US20230080133A1 (en) 6d pose and shape estimation method
CN109003297B (en) Monocular depth estimation method, device, terminal and storage medium
US20230121534A1 (en) Method and electronic device for 3d object detection using neural networks
US20240070972A1 (en) Rendering new images of scenes using geometry-aware neural networks conditioned on latent variables
US11961266B2 (en) Multiview neural human prediction using implicit differentiable renderer for facial expression, body pose shape and clothes performance capture
CN112053441A (en) Full-automatic layout recovery method for indoor fisheye image
CN115238758A (en) Multi-task three-dimensional target detection method based on point cloud feature enhancement
US11887248B2 (en) Systems and methods for reconstructing a scene in three dimensions from a two-dimensional image
CN117542122B (en) Human body pose estimation and three-dimensional reconstruction method, network training method and device
CN116468793A (en) Image processing method, device, electronic equipment and storage medium
WO2022208440A1 (en) Multiview neural human prediction using implicit differentiable renderer for facial expression, body pose shape and clothes performance capture
CN117237623B (en) Semantic segmentation method and system for remote sensing image of unmanned aerial vehicle
CN116630917A (en) Lane line detection method
CN114998630B (en) Ground-to-air image registration method from coarse to fine
CN113902933A (en) Ground segmentation network model training method, device, equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination