CN116994068A - Target detection method and device based on knowledge distillation - Google Patents


Info

Publication number
CN116994068A
CN116994068A (application CN202311210238.2A)
Authority
CN
China
Prior art keywords
mask
loss function
feature map
student
representing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311210238.2A
Other languages
Chinese (zh)
Inventor
王进
刘明朝
王明择
石英
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hubei Changtou Smart Parking Co ltd
Wuhan University of Technology WUT
Original Assignee
Hubei Changtou Smart Parking Co ltd
Wuhan University of Technology WUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hubei Changtou Smart Parking Co ltd, Wuhan University of Technology WUT filed Critical Hubei Changtou Smart Parking Co ltd
Priority to CN202311210238.2A priority Critical patent/CN116994068A/en
Publication of CN116994068A publication Critical patent/CN116994068A/en
Pending legal-status Critical Current


Classifications

    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Image Analysis (AREA)

Abstract

The invention provides a target detection method and device based on knowledge distillation, wherein the method comprises the following steps: acquiring a first feature map output by a teacher network model and a second feature map output by a student network model; obtaining a foreground focal loss function based on the first feature map and the second feature map; calculating a global semantic loss function based on the first feature map and the second feature map; obtaining a target loss function according to the foreground focal loss function and the global semantic loss function; and optimizing parameters of the student network model by using the target loss function to obtain a target detection model, and performing target detection by using the target detection model. By fusing foreground focal knowledge distillation with global semantic knowledge distillation, the invention improves the detection accuracy of the target detection model while also improving the target detection speed.

Description

Target detection method and device based on knowledge distillation
Technical Field
The invention relates to the technical field of computer vision recognition, in particular to a target detection method and device based on knowledge distillation.
Background
Current target detection algorithms based on deep convolutional neural networks place excessive demands on computational resources, and their heavy memory consumption makes target detection models expensive to run, so the models require lightweight processing. Lightweight processing can significantly reduce a model's parameter count and improve its real-time detection performance, but it also degrades detection accuracy.
Knowledge distillation is a lightweight processing method that transfers information from a teacher network model to a student network model, allowing the student network to recover part of the detection accuracy of a target detection model without additional cost. However, current knowledge distillation methods are mainly designed for image classification. Detecting vehicle targets in road scenes requires not only classifying each target but also localizing its bounding box, which calls for convolutional neural networks with more complex structures and more parameters, so knowledge distillation performs poorly in this setting.
Therefore, how to apply lightweight processing to a target detection model without reducing its detection accuracy is an urgent problem for practitioners in the field.
Disclosure of Invention
In view of the foregoing, it is necessary to provide a target detection method and apparatus based on knowledge distillation, so as to apply lightweight processing to the target detection model without reducing its detection accuracy.
In order to achieve the above object, the present invention provides a method for detecting a target based on knowledge distillation, comprising:
respectively inputting a sample image into a teacher network model and a student network model, and obtaining a first feature map output by the teacher network model and a second feature map output by the student network model;
calculating a mask loss function and an attention loss function based on the first feature map and the second feature map;
obtaining a foreground focus loss function based on the mask loss function and the attention loss function;
calculating a global semantic loss function based on the first feature map and the second feature map;
obtaining a target loss function according to the foreground focal loss function and the global semantic loss function;
and optimizing parameters of the student network model by using the target loss function to obtain a target detection model, and performing target detection by using the target detection model.
In one possible implementation, calculating the mask loss function and the attention loss function based on the first feature map and the second feature map includes:
calculating a teacher binary mask based on the first feature map, and calculating a teacher size mask using the teacher binary mask;
calculating a student binary mask based on the second feature map, and calculating a student size mask using the student binary mask;
calculating a teacher spatial attention mask and a teacher channel attention mask based on the first feature map;
calculating a student spatial attention mask and a student channel attention mask based on the second feature map;
calculating a mask loss function from the teacher binary mask, the teacher size mask, the student binary mask, the student size mask, the teacher spatial attention mask, the teacher channel attention mask, the student spatial attention mask, and the student channel attention mask;
an attention loss function is calculated from the teacher spatial attention mask, the teacher channel attention mask, the student spatial attention mask, and the student channel attention mask.
In one possible implementation, the binary mask includes a teacher binary mask and a student binary mask, where the binary mask is calculated according to the formula:
M_{i,j} = 1, if pixel (i, j) lies inside the real frame; M_{i,j} = 0, otherwise;
wherein i represents the abscissa of the pixel point in the feature map, j represents the ordinate of the pixel point in the feature map, and M_{i,j} represents the binary mask; the real frame is a preset real frame, and the feature map is one of the first feature map and the second feature map.
In one possible implementation, the scale mask includes a teacher scale mask and a student scale mask, where the scale mask is calculated as:
S_{i,j} = 1/(H_gt · W_gt), if M_{i,j} = 1; S_{i,j} = 1/N_bg, otherwise, with N_bg = Σ_{i,j} (1 − M_{i,j});
wherein H_gt represents the height of the real frame, W_gt represents the width of the real frame, N_bg represents the number of pixels outside every real frame, and S_{i,j} represents the scale mask.
In one possible implementation, the spatial attention mask includes a teacher spatial attention mask and a student spatial attention mask, the channel attention mask includes a teacher channel attention mask and a student channel attention mask, and the spatial attention mask is calculated by the formula:
A^S(F) = H · W · softmax(G^S(F) / T), with G^S(F) = (1/C) · Σ_c |F_c|;
wherein softmax(·) represents the normalization function, G^S(F) represents the channel-wise absolute average value of the pixels of the feature map at each spatial position, T represents the temperature hyper-parameter in knowledge distillation, and A^S represents the spatial attention mask;
the channel attention mask is calculated as:
A^C(F) = C · softmax(G^C(F) / T), with G^C(F) = (1/(H·W)) · Σ_{i,j} |F_{i,j}|;
wherein C indicates the number of channels, G^C(F) represents the spatial absolute average value of the pixels of the feature map on each channel, and A^C represents the channel attention mask.
In one possible implementation, the calculation formula of the mask loss function is:
L_fea = α · Σ_l Σ_{c=1}^{C} Σ_{i=1}^{H} Σ_{j=1}^{W} M_{i,j} S_{i,j} A^S_{i,j} A^C_c (F^T_{l,c,i,j} − F^S_{l,c,i,j})² + β · Σ_l Σ_{c=1}^{C} Σ_{i=1}^{H} Σ_{j=1}^{W} (1 − M_{i,j}) S_{i,j} A^S_{i,j} A^C_c (F^T_{l,c,i,j} − F^S_{l,c,i,j})²;
wherein α and β represent balance factors balancing the foreground and background of the feature map, F^T represents the output features of the fusion feature module in the teacher network model, F^S represents the output features of the fusion feature module in the student network model, C indicates the number of channels, H represents the height of the real frame, W represents the width of the real frame, l indicates the index of the feature layer, and L_fea represents the mask loss function;
the calculation formula of the attention loss function is:
L_at = γ · ( l1(A^S_T, A^S_S) + l1(A^C_T, A^C_S) );
wherein γ represents the balance factor for balancing the attention mask loss, l1(·,·) represents the L1 loss function, A^S_T represents the teacher spatial attention mask, A^S_S represents the student spatial attention mask, A^C_T represents the teacher channel attention mask, A^C_S represents the student channel attention mask, and L_at represents the attention loss function.
In one possible implementation, the calculation formula of the foreground focal point loss function is:
L_focal = L_fea + L_at;
wherein L_focal represents the foreground focal loss function, L_fea represents the mask loss function, and L_at represents the attention loss function.
In one possible implementation, calculating the global semantic loss function based on the first feature map and the second feature map includes:
determining candidate areas of the first feature map and the second feature map, and mapping the first feature map and the second feature map to obtain pixel-level features;
calculating a target vector according to the candidate region and the pixel level characteristics to obtain candidate region characterization;
weighting the candidate region characterization to obtain a context feature representation;
performing splicing processing and convolution processing on the context feature representation and the input feature to obtain an output feature;
calculating a global semantic loss function based on the output features;
the calculation formula of the global semantic loss function is as follows:
L_global = λ · Σ ( R(F^T) − R(F^S) )²;
wherein L_global represents the global semantic loss function, λ represents the balance factor for balancing the semantic loss, R(·) represents the context characterization module, F^T represents the output features of the fusion feature module in the teacher network model, and F^S represents the output features of the fusion feature module in the student network model.
In one possible implementation, the calculation formula of the objective loss function is:
L_target = L_focal + L_global;
wherein L_target represents the target loss function, L_focal represents the foreground focal loss function, and L_global represents the global semantic loss function.
In order to achieve the above object, the present invention further provides a target detection device based on knowledge distillation, including:
the feature map acquisition module is used for inputting the sample image into the teacher network model and the student network model respectively, to acquire a first feature map output by the teacher network model and a second feature map output by the student network model;
a calculation module for calculating a mask loss function and an attention loss function based on the first feature map and the second feature map;
the foreground focus loss function module is used for obtaining a foreground focus loss function based on the mask loss function and the attention loss function;
the global semantic loss function module is used for calculating a global semantic loss function based on the first feature map and the second feature map;
the target loss function module is used for obtaining a target loss function according to the foreground focus loss function and the global semantic loss function;
and the detection module is used for optimizing parameters of the student network model by utilizing the target loss function to obtain a target detection model, and carrying out target detection by utilizing the target detection model.
The beneficial effects of adopting this embodiment are as follows: firstly, mask-based foreground focal distillation separates and processes the sample image, guiding the student network to attend to key pixels and channels, and a mask-based foreground focal loss function is calculated; secondly, a global semantic distillation method is provided, which further enhances the learning efficiency of the student network by extracting the global pixel relations that foreground focal distillation blocks; finally, the target loss function is obtained by fusing the foreground focal loss function with the global semantic loss function, completing the training of the detection model. By fusing foreground focal knowledge distillation with global semantic knowledge distillation, the invention improves the detection accuracy of the target detection model while also improving the target detection speed.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the description of the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic flow chart of an embodiment of a method for detecting targets based on knowledge distillation according to the present invention;
fig. 2 is a schematic structural diagram of an embodiment of a target detection device based on knowledge distillation according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention. It will be apparent that the described embodiments are only some, but not all, embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Some of the block diagrams shown in the figures are functional entities and do not necessarily correspond to physically or logically separate entities. These functional entities may be implemented in software or in one or more hardware modules or integrated circuits or in different networks and/or processor systems and/or microcontroller systems.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the invention. The appearances of such phrases in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those of skill in the art will explicitly and implicitly appreciate that the embodiments described herein may be combined with other embodiments.
The knowledge used in distillation algorithms can be divided, according to its transfer form, into label knowledge, intermediate-layer knowledge, relational feature knowledge, and structured feature knowledge. When applied to road-scene target detection, label knowledge provides limited, one-dimensional information and cannot cope with complex road scenes; structured knowledge offers stronger knowledge representation but incurs large computational and optimization costs, whereas road scenes demand high real-time performance under limited computing power; relational feature knowledge has a high implementation threshold and insufficient flexibility, making it difficult to apply to unmanned-driving road scenes. Intermediate-layer knowledge distillation overcomes the single-source limitation of label knowledge and can be flexibly applied to road-scene target detection, but it still has adaptability problems in specific unmanned-driving road scenes, so the accuracy improvement it brings is very limited.
The invention adopts mask-based foreground focal knowledge distillation to separate the foreground region and background region of the image and to guide the student network to attend to key pixels and channels; however, this mask-based method also severs the direct connection between foreground and background. The invention therefore further provides a global semantic distillation method based on context characterization, which enhances the learning of the student network model by extracting the global pixel relations blocked by foreground focal distillation. Finally, a new knowledge distillation loss function is provided to complete the training of the model, so that lightweight processing can be applied to the target detection model without reducing its detection accuracy.
Fig. 1 is a schematic flow chart of an embodiment of a target detection method based on knowledge distillation.
Referring to fig. 1, the present invention provides a method for detecting a target based on knowledge distillation, comprising:
s101, respectively inputting a sample image into a teacher network model and a student network model, and acquiring a first characteristic image output by the teacher network model and a second characteristic image output by the student network model;
s102, calculating a mask loss function and an attention loss function based on the first feature map and the second feature map;
s103, obtaining a foreground focus loss function based on the mask loss function and the attention loss function;
s104, calculating a global semantic loss function based on the first feature map and the second feature map;
s105, obtaining a target loss function according to the foreground focus loss function and the global semantic loss function;
and S106, optimizing parameters of the student network model by using the target loss function to obtain a target detection model, and performing target detection by using the target detection model.
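As an illustrative sketch only (not the patent's reference implementation), steps S101 to S106 can be outlined as a single distillation step; the function names and loss callables here are assumptions for illustration:

```python
import numpy as np

def distillation_step(sample_image, teacher, student,
                      focal_loss_fn, global_loss_fn):
    """One hypothetical distillation step, returning the target loss.

    teacher/student: callables mapping an image to a feature map
    (S101); focal_loss_fn covers S102-S103 (mask + attention terms);
    global_loss_fn covers S104; the sum is the target loss of S105,
    which an optimizer would then minimize in S106.
    """
    f_teacher = teacher(sample_image)    # first feature map
    f_student = student(sample_image)    # second feature map
    l_focal = focal_loss_fn(f_teacher, f_student)
    l_global = global_loss_fn(f_teacher, f_student)
    return l_focal + l_global            # target loss function
```

In practice the returned scalar would be backpropagated through the student network model only, leaving the teacher frozen.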
The beneficial effects of adopting the embodiment are as follows: acquiring a first feature map output by a teacher network model and a second feature map output by a student network model; obtaining a foreground focus loss function based on the first feature map and the second feature map; calculating a global semantic loss function based on the first feature map and the second feature map; obtaining a target loss function according to the foreground focal loss function and the global semantic loss function; and optimizing parameters of the student network model by using the target loss function to obtain a target detection model, and performing target detection by using the target detection model. According to the invention, the foreground focal knowledge distillation and the global semantic knowledge distillation are fused, so that the detection accuracy of the target detection model can be improved while the target detection speed is improved.
The sample image may be a sample image preset in advance, or may be an image captured in real time by the image capturing module.
In order to determine the detection target, a real frame needs to be preset in the sample image for training the teacher network model and the student network model.
In one embodiment, step S102 includes:
calculating a teacher binary mask based on the first feature map, and calculating a teacher size mask using the teacher binary mask;
calculating a student binary mask based on the second feature map, and calculating a student size mask using the student binary mask;
calculating a teacher spatial attention mask and a teacher channel attention mask based on the first feature map;
calculating a student spatial attention mask and a student channel attention mask based on the second feature map;
calculating a mask loss function from the teacher binary mask, the teacher size mask, the student binary mask, the student size mask, the teacher spatial attention mask, the teacher channel attention mask, the student spatial attention mask, and the student channel attention mask;
an attention loss function is calculated from the teacher spatial attention mask, the teacher channel attention mask, the student spatial attention mask, and the student channel attention mask.
It should be explained that the mask loss function calculated from the first feature map and the second feature map separates the foreground and background regions of the two feature maps while weighting targets by scale. This not only improves detection efficiency but also prevents small targets from degrading the knowledge distillation effect, and it lets the student network model learn the focal information in the teacher network model from the two feature maps, thereby improving the performance of knowledge distillation.
In one embodiment, the binary masks include a teacher binary mask and a student binary mask, where the binary masks are calculated as:
M_{i,j} = 1, if pixel (i, j) lies inside the real frame; M_{i,j} = 0, otherwise;
wherein i represents the abscissa of the pixel point in the feature map, j represents the ordinate of the pixel point in the feature map, and M_{i,j} represents the binary mask; the real frame is a preset real frame: when the pixel point is located inside the real frame, the value of M_{i,j} is 1, otherwise the value of M_{i,j} is 0; the feature map is one of the first feature map and the second feature map.
Considering the great difference between foreground and background in unmanned-driving road scenes, the invention separates the foreground and background regions of the feature map through the binary mask so as to balance their contributions to the loss. Note that because the teacher network model and the student network model compute their masks in the same way, the calculation formula of the binary mask applies to both models; likewise, the calculation formulas of the scale mask, the spatial attention mask, the channel attention mask, the mask loss function, the foreground focal loss function, and the global semantic loss function apply to both models, and this is not repeated below.
On the basis of balancing the foreground and background regions, the invention further considers the scale of the target: larger targets cover more pixels and therefore occupy a larger share of the loss, which weakens the distillation effect for small targets. By setting a scale mask, the weights of targets of different scales are balanced, further improving the detection precision of the detection model for targets of different scales.
Further, the scale mask includes a teacher scale mask and a student scale mask, and the calculation formula of the scale mask is:
S_{i,j} = 1/(H_gt · W_gt), if M_{i,j} = 1; S_{i,j} = 1/N_bg, otherwise, with N_bg = Σ_{i,j} (1 − M_{i,j});
wherein H_gt represents the height of the real frame, W_gt represents the width of the real frame, and S_{i,j} represents the scale mask. When the pixel point is located inside the real frame, the value of the scale mask is the reciprocal of the area of the real frame; when the pixel point is located outside the real frame, the value of the scale mask is the reciprocal of the total number of pixels outside the real frame. This reduces the weight of large targets and increases the weight of small targets, thereby balancing targets of different scales.
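The binary mask and scale mask described above can be sketched as follows; the real-frame convention used here (inclusive `(top, left, bottom, right)` pixel indices) is an assumption of this sketch, not specified by the patent:

```python
import numpy as np

def binary_and_scale_masks(h, w, boxes):
    """Binary mask M and scale mask S for an h x w feature map.

    boxes: real frames as (top, left, bottom, right) inclusive pixel
    indices. Inside a frame, S is the reciprocal of that frame's area;
    outside every frame, S is the reciprocal of the number of
    background pixels.
    """
    m = np.zeros((h, w))
    s = np.zeros((h, w))
    for (t, l, b, r) in boxes:
        m[t:b + 1, l:r + 1] = 1.0
        s[t:b + 1, l:r + 1] = 1.0 / ((b - t + 1) * (r - l + 1))
    n_bg = np.sum(m == 0)                # pixels outside every real frame
    if n_bg > 0:
        s[m == 0] = 1.0 / n_bg
    return m, s
```

For overlapping frames a tie-breaking rule would be needed; this sketch simply lets the last frame written win.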
On the basis of the binary mask and the scale mask, the invention further introduces an attention mechanism so that the student network model learns the focal information of the teacher network model in a focused way: the channel attention mechanism captures channel dependencies, and the spatial attention mechanism captures pairwise pixel-level relations, thereby optimizing the student network model.
In one embodiment, the spatial attention mask includes a teacher spatial attention mask and a student spatial attention mask, the channel attention mask includes a teacher channel attention mask and a student channel attention mask, and the spatial attention mask is calculated as:
A^S(F) = H · W · softmax(G^S(F) / T), with G^S(F) = (1/C) · Σ_c |F_c|;
wherein softmax(·) represents the normalization function, G^S(F) represents the channel-wise absolute average value of the pixels of the feature map at each spatial position, T represents the temperature hyper-parameter in knowledge distillation, and A^S represents the spatial attention mask.
The channel attention mask is calculated as:
A^C(F) = C · softmax(G^C(F) / T), with G^C(F) = (1/(H·W)) · Σ_{i,j} |F_{i,j}|;
wherein C indicates the number of channels, G^C(F) represents the spatial absolute average value of the pixels of the feature map on each channel, and A^C represents the channel attention mask.
Further, before determining the spatial attention mask and the channel attention mask, the method further comprises: calculating, based on the first feature map and the second feature map, the absolute average value G^S(F) of the pixels over the spatial positions and the absolute average value G^C(F) of the pixels on the channels.
The invention adjusts the spatial pixel and channel distribution in the characteristic diagram through the spatial attention mask and the channel attention mask, thereby enabling the student network model to learn focus information of the teacher network model in a focused manner.
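A minimal sketch of the spatial and channel attention masks above, assuming a (C, H, W) feature map; the default temperature value is illustrative only:

```python
import numpy as np

def softmax(x):
    # numerically stable softmax over a 1-D array
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_masks(feat, temperature=0.5):
    """Spatial mask A^S (H, W) and channel mask A^C (C,) for feat (C, H, W)."""
    c, h, w = feat.shape
    g_spatial = np.abs(feat).mean(axis=0)       # absolute average over channels
    g_channel = np.abs(feat).mean(axis=(1, 2))  # absolute average over space
    a_spatial = h * w * softmax((g_spatial / temperature).ravel()).reshape(h, w)
    a_channel = c * softmax(g_channel / temperature)
    return a_spatial, a_channel
```

Scaling by H·W and C keeps the mask values averaging to one, so they act as relative emphasis weights rather than changing the overall loss magnitude.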
In one embodiment, the mask loss function is calculated as:
L_fea = α · Σ_l Σ_{c=1}^{C} Σ_{i=1}^{H} Σ_{j=1}^{W} M_{i,j} S_{i,j} A^S_{i,j} A^C_c (F^T_{l,c,i,j} − F^S_{l,c,i,j})² + β · Σ_l Σ_{c=1}^{C} Σ_{i=1}^{H} Σ_{j=1}^{W} (1 − M_{i,j}) S_{i,j} A^S_{i,j} A^C_c (F^T_{l,c,i,j} − F^S_{l,c,i,j})²;
wherein α and β represent balance factors balancing the foreground and background of the feature map, F^T represents the output features of the fusion feature module in the teacher network model, F^S represents the output features of the fusion feature module in the student network model, C indicates the number of channels, H represents the height of the real frame, W represents the width of the real frame, l indicates the index of the feature layer, and L_fea represents the mask loss function;
the calculation formula of the attention loss function is:
L_at = γ · ( l1(A^S_T, A^S_S) + l1(A^C_T, A^C_S) );
wherein γ represents the balance factor for balancing the attention mask loss, l1(·,·) represents the L1 loss function, A^S_T represents the teacher spatial attention mask, A^S_S represents the student spatial attention mask, A^C_T represents the teacher channel attention mask, A^C_S represents the student channel attention mask, and L_at represents the attention loss function.
In one embodiment, the foreground focal point loss function is calculated as:
L_focal = L_fea + L_at;
wherein L_focal represents the foreground focal loss function, L_fea represents the mask loss function, and L_at represents the attention loss function.
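The mask loss, attention loss, and their sum (the foreground focal loss) can be sketched as follows; the default balance factors alpha, beta, and gamma are illustrative placeholders, not values from the patent:

```python
import numpy as np

def mask_loss(ft, fs, m, s, a_s, a_c, alpha=1.0, beta=0.5):
    """Weighted squared error between teacher (ft) and student (fs)
    features of shape (C, H, W), with binary mask m, scale mask s,
    spatial attention a_s (all (H, W)) and channel attention a_c (C,)."""
    w = s * a_s * a_c[:, None, None]        # per-channel, per-pixel weight
    sq = (ft - fs) ** 2
    fg = np.sum(m * w * sq)                 # foreground term
    bg = np.sum((1.0 - m) * w * sq)         # background term
    return alpha * fg + beta * bg

def attention_loss(ast, ass, act, acs, gamma=1.0):
    """L1 distance between teacher and student attention masks."""
    return gamma * (np.abs(ast - ass).mean() + np.abs(act - acs).mean())

def focal_loss(l_mask, l_at):
    """Foreground focal loss: sum of mask and attention losses."""
    return l_mask + l_at
```

A multi-level detector would sum `mask_loss` over the feature layers l; this sketch shows a single level for clarity.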
Through the foreground focal knowledge distillation method, this embodiment makes the student network model focus on the key information in the feature map. However, by separating the foreground region from the background region, foreground focal distillation severs the relation between them, causing the student network model to ignore the correlations among different pixels. For road scenes, accurately detecting a foreground target while combining it with the corresponding background features is essential.
In one embodiment, after a sample picture is acquired, it is input into the teacher network model and the student network model respectively, obtaining the first feature map output by the teacher network model and the second feature map output by the student network model. The teacher network model then performs feature fusion and context characterization on the first feature map to obtain its feature information and context characterization relation, and fuses the two to obtain the global semantic information of the first feature map; the student network model likewise performs feature fusion and context characterization on the second feature map and fuses the results to obtain the global semantic information of the second feature map. The global semantic information of the second feature map is trained against that of the first feature map, yielding the global semantic distillation loss function used to train the student network model, and the parameters of the student network model are set through this loss function. Meanwhile, after the feature information of the first and second feature maps is obtained, foreground focusing is performed on each, the spatial attention masks and channel attention masks of the teacher network model and the student network model are calculated, and the foreground focal distillation loss function is determined; the student network model is parameterized through this loss function, and a target detection model based on knowledge distillation is thereby obtained.
In one embodiment, step S104 includes:
determining candidate areas of the first feature map and the second feature map, and mapping the first feature map and the second feature map to obtain pixel-level features;
calculating a target vector according to the candidate region and the pixel level characteristics to obtain candidate region characterization;
weighting the candidate region characterization to obtain a context feature representation;
performing splicing processing and convolution processing on the context feature representation and the input feature to obtain an output feature;
a global semantic loss function is calculated based on the output features.
In one embodiment, the input feature is first adjusted into a group of features whose channel number equals the class number, serving as the candidate regions, and the pixel-level features are mapped in parallel. From the pixel-level features and the candidate regions, k groups of C-dimensional vectors of the feature map are calculated; the features are then weighted and summed according to the correspondence between each pixel in the pixel-level features and the candidate-region features, yielding a context feature representation with incremental information. Finally, the context feature representation is spliced with the input feature, and the two are fused through convolution to obtain the output feature, thereby supplementing the distilled features with global semantic information.
The calculation formula of the global semantic loss function is as follows:

L_{global} = \lambda \sum \left( \mathcal{R}(F^{T}) - \mathcal{R}(F^{S}) \right)^{2}

wherein L_{global} represents the global semantic loss function, \lambda represents a balancing factor for balancing the semantic loss, \mathcal{R}(\cdot) represents the context characterization module, F^{T} represents the output feature of the fusion feature module in the teacher network model, and F^{S} represents the output feature of the fusion feature module in the student network model.
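Assuming the context-characterized outputs of the teacher and student are already available, the global semantic loss reduces to a balanced squared error; `lam` is a stand-in for the balancing factor λ:

```python
import numpy as np

def global_semantic_loss(teacher_out, student_out, lam=0.5):
    """Global semantic distillation loss: balanced sum of squared
    differences between teacher and student context-characterized
    output features (a sketch; lam is an assumed balance factor)."""
    diff = teacher_out - student_out
    return lam * np.sum(diff ** 2)
```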
In one possible implementation, the calculation formula of the target loss function is:

L = L_{focal} + L_{global}

wherein L represents the target loss function, L_{focal} represents the foreground focal loss function, and L_{global} represents the global semantic loss function.
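Under the definitions above, assembling the target loss is a simple sum; equal weighting of the terms is an assumption of this sketch:

```python
def target_loss(mask_loss, attention_loss, global_loss):
    """Assemble the target loss: the foreground focal loss is the sum
    of the mask loss and attention loss, and the target loss adds the
    global semantic loss (equal weighting assumed)."""
    focal_loss = mask_loss + attention_loss
    return focal_loss + global_loss
```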
It should be noted that the knowledge distillation in the invention is computed only on the intermediate-layer features of the deep convolutional network; the knowledge content is rich, and no high computational overhead or optimization cost is needed. The knowledge distillation method is suitable for road-scene target detection, and the intermediate-layer features can be obtained from the feature fusion module of the detector, so the distillation algorithm based on foreground focus and global semantics can easily be applied to different teacher network models and student network models.
FIG. 2 is a schematic diagram of an embodiment of a knowledge-based distillation target detection apparatus.
Referring to fig. 2, the present invention further provides a knowledge distillation-based object detection apparatus, including:
a feature map obtaining module 21, configured to input the sample image into a teacher network model and a student network model respectively, and obtain a first feature map output by the teacher network model and a second feature map output by the student network model;
a calculation module 22 for calculating a mask loss function and an attention loss function based on the first feature map and the second feature map;
a foreground focus loss function module 23 for obtaining a foreground focus loss function based on the mask loss function and the attention loss function;
a global semantic loss function module 24 for calculating a global semantic loss function based on the first feature map and the second feature map;
a target loss function module 25, configured to obtain a target loss function according to the foreground focal loss function and the global semantic loss function;
the detection module 26 is configured to optimize parameters of the student network model by using the objective loss function to obtain an objective detection model, and perform objective detection by using the objective detection model.
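For illustration, the binary and scale masks consumed by the calculation module can be sketched in numpy as follows; the inverse-background-area weighting of background pixels is an assumption drawn from the focal-distillation literature the application cites, not a statement of the patent's exact formula:

```python
import numpy as np

def binary_and_scale_masks(h, w, boxes):
    """Build a binary mask marking pixels inside any ground-truth
    ('real') frame, and a scale mask weighting foreground pixels by
    the inverse box area and background pixels by the inverse
    background pixel count. boxes: list of (y0, x0, y1, x1)."""
    binary = np.zeros((h, w))
    scale = np.zeros((h, w))
    for (y0, x0, y1, x1) in boxes:
        binary[y0:y1, x0:x1] = 1.0
        scale[y0:y1, x0:x1] = 1.0 / ((y1 - y0) * (x1 - x0))
    n_bg = (binary == 0).sum()
    if n_bg:
        scale[binary == 0] = 1.0 / n_bg
    return binary, scale
```

The scale mask keeps small targets from being drowned out by large ones: every box contributes the same total weight regardless of its area.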
The beneficial effects of adopting the embodiment are as follows: the feature map acquisition module 21 acquires a first feature map output by the teacher network model and a second feature map output by the student network model; the calculation module 22 calculates a mask loss function and an attention loss function based on the first feature map and the second feature map; the foreground focus loss function module 23 obtains a foreground focus loss function based on the mask loss function and the attention loss function; the global semantic loss function module 24 calculates a global semantic loss function based on the first feature map and the second feature map; the objective loss function module 25 obtains an objective loss function according to the foreground focus loss function and the global semantic loss function; the detection module 26 optimizes parameters of the student network model using the objective loss function to obtain an objective detection model, and performs objective detection using the objective detection model. According to the invention, the foreground focal knowledge distillation and the global semantic knowledge distillation are fused, so that the detection accuracy of the target detection model can be improved while the target detection speed is improved.
The foregoing embodiments of the knowledge-distillation-based target detection apparatus may implement the technical solutions described in the foregoing embodiments of the knowledge-distillation-based target detection method; the specific implementation principles of the above modules or units can be found in the corresponding content of the method embodiments and are not repeated here.
The above description of the knowledge-distillation-based target detection method and apparatus provided by the invention uses specific examples to illustrate the principle and implementation of the invention; these examples are intended only to help understand the method and its core idea. Meanwhile, since those skilled in the art may vary the specific embodiments and application scope according to the ideas of the invention, this description should not be construed as limiting the invention.

Claims (10)

1. A method for detecting a target based on knowledge distillation, comprising:
respectively inputting a sample image into a teacher network model and a student network model, and acquiring a first characteristic diagram output by the teacher network model and a second characteristic diagram output by the student network model;
calculating a mask loss function and an attention loss function based on the first feature map and the second feature map;
obtaining a foreground focus loss function based on the mask loss function and the attention loss function;
calculating a global semantic loss function based on the first feature map and the second feature map;
obtaining a target loss function according to the foreground focus loss function and the global semantic loss function;
and optimizing parameters of the student network model by using the target loss function to obtain a target detection model, and performing target detection by using the target detection model.
2. The knowledge distillation based target detection method according to claim 1, wherein the calculating a mask loss function and an attention loss function based on the first feature map and the second feature map comprises:
calculating a teacher binary mask based on the first feature map, and calculating a teacher size mask using the teacher binary mask;
calculating a student binary mask based on the second feature map, calculating a student size mask using the student binary mask;
calculating a teacher spatial attention mask and a teacher channel attention mask based on the first feature map;
calculating a student spatial attention mask and a student channel attention mask based on the second feature map;
calculating a mask loss function from the teacher binary mask, the teacher size mask, the student binary mask, the student size mask, the teacher spatial attention mask, the teacher channel attention mask, the student spatial attention mask, and the student channel attention mask;
an attention loss function is calculated from the teacher spatial attention mask, the teacher channel attention mask, the student spatial attention mask, and the student channel attention mask.
3. The knowledge distillation based target detection method according to claim 2, wherein a binary mask includes the teacher binary mask and the student binary mask, and the binary mask is calculated by the formula:

M_{i,j} = \begin{cases} 1, & (i,j) \in \text{real frame} \\ 0, & \text{otherwise} \end{cases}

wherein i represents the abscissa of a pixel point in the feature map, j represents the ordinate of the pixel point in the feature map, M represents the binary mask, the real frame is a preset real frame, and the feature map is one of the first feature map and the second feature map.
4. The knowledge distillation based target detection method according to claim 3, wherein a scale mask includes the teacher scale mask and the student scale mask, the scale mask being calculated by:

S_{i,j} = \frac{M_{i,j}}{H \cdot W} + \frac{1 - M_{i,j}}{N_{bg}}

wherein H represents the height of the real frame, W represents the width of the real frame, N_{bg} represents the number of background pixels in the feature map, and S represents the scale mask.
5. The knowledge distillation based target detection method according to claim 4, wherein a spatial attention mask includes the teacher spatial attention mask and the student spatial attention mask, a channel attention mask includes the teacher channel attention mask and the student channel attention mask, and the calculation formula of the spatial attention mask is:

A^{S} = H \cdot W \cdot \mathrm{softmax}\left( G^{S}(F) / T \right), \quad G^{S}(F) = \frac{1}{C} \sum_{c=1}^{C} \left| F_{c} \right|

wherein softmax(·) represents a normalization function, G^{S}(F) represents the absolute average value of the pixels of the feature map at the different spatial positions, T represents the temperature hyper-parameter in knowledge distillation, and A^{S} represents the spatial attention mask;

the calculation formula of the channel attention mask is:

A^{C} = C \cdot \mathrm{softmax}\left( G^{C}(F) / T \right), \quad G^{C}(F) = \frac{1}{H W} \sum_{i=1}^{H} \sum_{j=1}^{W} \left| F_{i,j} \right|

wherein C represents the number of channels, G^{C}(F) represents the absolute average value of the pixels of the feature map on the different channels, and A^{C} represents the channel attention mask.
6. The knowledge distillation based target detection method according to claim 5, wherein the calculation formula of the mask loss function is:

L_{mask} = \sum_{l=1}^{L} \left[ \alpha \sum_{c=1}^{C} \sum_{i=1}^{H} \sum_{j=1}^{W} M_{i,j} S_{i,j} A^{S}_{i,j} A^{C}_{c} \left( F^{T}_{c,i,j} - F^{S}_{c,i,j} \right)^{2} + \beta \sum_{c=1}^{C} \sum_{i=1}^{H} \sum_{j=1}^{W} \left( 1 - M_{i,j} \right) S_{i,j} A^{S}_{i,j} A^{C}_{c} \left( F^{T}_{c,i,j} - F^{S}_{c,i,j} \right)^{2} \right]

wherein \alpha and \beta represent balance factors for balancing the foreground and background in the feature map, F^{T} represents the output feature of the fusion feature module in the teacher network model, F^{S} represents the output feature of the fusion feature module in the student network model, C represents the number of channels, H represents the height of the real frame, W represents the width of the real frame, L represents the number of layers of the feature layer, and L_{mask} represents the mask loss function;

the calculation formula of the attention loss function is:

L_{at} = \gamma \left( l_{1}\left( A^{S}_{T}, A^{S}_{S} \right) + l_{1}\left( A^{C}_{T}, A^{C}_{S} \right) \right)

wherein \gamma represents a balancing factor for balancing the attention mask loss, l_{1}(\cdot) represents the L1 loss function, A^{S}_{T} represents the teacher spatial attention mask, A^{S}_{S} represents the student spatial attention mask, A^{C}_{T} represents the teacher channel attention mask, A^{C}_{S} represents the student channel attention mask, and L_{at} represents the attention loss function.
7. The knowledge distillation based target detection method according to claim 6, wherein the calculation formula of the foreground focal loss function is:

L_{focal} = L_{mask} + L_{at}

wherein L_{focal} represents the foreground focal loss function, L_{mask} represents the mask loss function, and L_{at} represents the attention loss function.
8. The knowledge distillation based target detection method according to claim 1, wherein the calculating a global semantic loss function based on the first feature map and the second feature map comprises:
determining candidate areas of the first feature map and the second feature map, and mapping the first feature map and the second feature map to obtain pixel-level features;
calculating a target vector according to the candidate region and the pixel-level feature to obtain a candidate region characterization;
weighting the candidate region characterization to obtain a context feature representation;
performing splicing processing and convolution processing on the context feature representation and the input feature to obtain an output feature;
calculating a global semantic loss function based on the output features;
the calculation formula of the global semantic loss function is as follows:

L_{global} = \lambda \sum \left( \mathcal{R}(F^{T}) - \mathcal{R}(F^{S}) \right)^{2}

wherein L_{global} represents the global semantic loss function, \lambda represents a balancing factor for balancing the semantic loss, \mathcal{R}(\cdot) represents the context characterization module, F^{T} represents the output feature of the fusion feature module in the teacher network model, and F^{S} represents the output feature of the fusion feature module in the student network model.
9. The knowledge distillation based target detection method according to claims 7 and 8, wherein the calculation formula of the target loss function is:

L = L_{focal} + L_{global}

wherein L represents the target loss function, L_{focal} represents the foreground focal loss function, and L_{global} represents the global semantic loss function.
10. A knowledge distillation-based target detection apparatus, comprising:
the characteristic diagram acquisition module is used for inputting the sample images into a teacher network model and a student network model respectively to acquire a first characteristic diagram output by the teacher network model and a second characteristic diagram output by the student network model;
a calculation module for calculating a mask loss function and an attention loss function based on the first feature map and the second feature map;
a foreground focus loss function module, configured to obtain a foreground focus loss function based on the mask loss function and the attention loss function;
a global semantic loss function module for calculating a global semantic loss function based on the first feature map and the second feature map;
the target loss function module is used for obtaining a target loss function according to the foreground focus loss function and the global semantic loss function;
and the detection module is used for optimizing the parameters of the student network model by utilizing the target loss function to obtain a target detection model, and carrying out target detection by utilizing the target detection model.
CN202311210238.2A 2023-09-19 2023-09-19 Target detection method and device based on knowledge distillation Pending CN116994068A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311210238.2A CN116994068A (en) 2023-09-19 2023-09-19 Target detection method and device based on knowledge distillation


Publications (1)

Publication Number Publication Date
CN116994068A (en) 2023-11-03

Family

ID=88532337

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311210238.2A Pending CN116994068A (en) 2023-09-19 2023-09-19 Target detection method and device based on knowledge distillation

Country Status (1)

Country Link
CN (1) CN116994068A (en)


Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114565045A (en) * 2022-03-01 2022-05-31 北京航空航天大学 Remote sensing target detection knowledge distillation method based on feature separation attention
US20220261599A1 (en) * 2021-02-18 2022-08-18 Irida Labs S.A. Annotating unlabeled images using convolutional neural networks
CN115131627A (en) * 2022-07-01 2022-09-30 贵州大学 Construction and training method of lightweight plant disease and insect pest target detection model
CN115457364A (en) * 2022-08-30 2022-12-09 长沙智能驾驶研究院有限公司 Target detection knowledge distillation method and device, terminal equipment and storage medium


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ZHENDONG YANG et al., "Focal and Global Knowledge Distillation for Detectors", CVPR 2022, pages 3-4.
LI Yang et al., "SAR Image Ship Target Detection Based on Knowledge Distillation", Modern Defense Technology, vol. 51, no. 4.

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117974988A (en) * 2024-03-28 2024-05-03 南京邮电大学 Lightweight target detection method, lightweight target detection device and computer program product
CN117974988B (en) * 2024-03-28 2024-05-31 南京邮电大学 Lightweight target detection method, lightweight target detection device and computer program product

Similar Documents

Publication Publication Date Title
WO2022111219A1 (en) Domain adaptation device operation and maintenance system and method
CN112150493B (en) Semantic guidance-based screen area detection method in natural scene
CN111460968B (en) Unmanned aerial vehicle identification and tracking method and device based on video
CN114202672A (en) Small target detection method based on attention mechanism
CN107133943A (en) A kind of visible detection method of stockbridge damper defects detection
CN108734143A (en) A kind of transmission line of electricity online test method based on binocular vision of crusing robot
CN111008633B (en) License plate character segmentation method based on attention mechanism
CN113313082B (en) Target detection method and system based on multitask loss function
CN112818969A (en) Knowledge distillation-based face pose estimation method and system
CN112287896A (en) Unmanned aerial vehicle aerial image target detection method and system based on deep learning
Li et al. Bifnet: Bidirectional fusion network for road segmentation
CN116994068A (en) Target detection method and device based on knowledge distillation
CN115131747A (en) Knowledge distillation-based power transmission channel engineering vehicle target detection method and system
CN114140672A (en) Target detection network system and method applied to multi-sensor data fusion in rainy and snowy weather scene
CN110659601A (en) Depth full convolution network remote sensing image dense vehicle detection method based on central point
CN117152513A (en) Vehicle boundary positioning method for night scene
CN111126155A (en) Pedestrian re-identification method for generating confrontation network based on semantic constraint
CN114511532A (en) Solar cell surface defect detection method based on feature-guided channel distillation
CN113807185A (en) Data processing method and device
CN117746264A (en) Multitasking implementation method for unmanned aerial vehicle detection and road segmentation
CN112861987A (en) Target detection method under dark light environment
CN117115616A (en) Real-time low-illumination image target detection method based on convolutional neural network
CN116310902A (en) Unmanned aerial vehicle target detection method and system based on lightweight neural network
Li et al. A real-time vehicle window positioning system based on nanodet
CN113670268B (en) Binocular vision-based unmanned aerial vehicle and electric power tower distance measurement method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination