CN116994068A - Target detection method and device based on knowledge distillation - Google Patents


Info

Publication number
CN116994068A
CN116994068A (application CN202311210238.2A)
Authority
CN
China
Prior art keywords
mask
loss function
feature map
student
representing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311210238.2A
Other languages
Chinese (zh)
Inventor
王进
刘明朝
王明择
石英
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hubei Changtou Smart Parking Co ltd
Wuhan University of Technology WUT
Original Assignee
Hubei Changtou Smart Parking Co ltd
Wuhan University of Technology WUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hubei Changtou Smart Parking Co ltd, Wuhan University of Technology WUT filed Critical Hubei Changtou Smart Parking Co ltd
Priority to CN202311210238.2A priority Critical patent/CN116994068A/en
Publication of CN116994068A publication Critical patent/CN116994068A/en
Pending legal-status Critical Current


Classifications

    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Image Analysis (AREA)

Abstract

The invention provides a target detection method and device based on knowledge distillation, wherein the method comprises the following steps: acquiring a first feature map output by a teacher network model and a second feature map output by a student network model; obtaining a foreground focal loss function based on the first feature map and the second feature map; calculating a global semantic loss function based on the first feature map and the second feature map; obtaining a target loss function according to the foreground focal loss function and the global semantic loss function; and optimizing parameters of the student network model by using the target loss function to obtain a target detection model, and performing target detection by using the target detection model. By fusing foreground focal knowledge distillation with global semantic knowledge distillation, the invention improves the detection accuracy of the target detection model while also improving the target detection speed.

Description

Target detection method and device based on knowledge distillation
Technical Field
The invention relates to the technical field of computer vision recognition, in particular to a target detection method and device based on knowledge distillation.
Background
Current target detection algorithms based on deep convolutional neural networks place excessive demands on computational resources, and their heavy memory consumption makes target detection models expensive to run, so the models require lightweight processing. Lightweight processing can significantly reduce a model's parameter count and improve its real-time detection performance, but it also degrades detection accuracy.
Knowledge distillation is a lightweight processing method that transfers information from a teacher network model to a student network model, allowing the student network to recover part of the detection accuracy of a target detection model without additional cost. However, current knowledge distillation methods are mainly designed for image classification. Detecting vehicle targets in road scenes requires not only classifying each target but also localizing its bounding box, which calls for convolutional neural networks with more complex structures and more parameters, so knowledge distillation performs poorly in this setting.
Therefore, how to apply lightweight processing to a target detection model without reducing its detection accuracy is an urgent problem for practitioners in the field.
Disclosure of Invention
In view of the foregoing, it is necessary to provide a target detection method and apparatus based on knowledge distillation, so as to apply lightweight processing to the target detection model without reducing its detection accuracy.
In order to achieve the above object, the present invention provides a method for detecting a target based on knowledge distillation, comprising:
respectively inputting a sample image into a teacher network model and a student network model, and obtaining a first feature map output by the teacher network model and a second feature map output by the student network model;
calculating a mask loss function and an attention loss function based on the first feature map and the second feature map;
obtaining a foreground focus loss function based on the mask loss function and the attention loss function;
calculating a global semantic loss function based on the first feature map and the second feature map;
obtaining a target loss function according to the foreground focal loss function and the global semantic loss function;
and optimizing parameters of the student network model by using the target loss function to obtain a target detection model, and performing target detection by using the target detection model.
In one possible implementation, calculating the mask loss function and the attention loss function based on the first feature map and the second feature map includes:
calculating a teacher binary mask based on the first feature map, and calculating a teacher size mask using the teacher binary mask;
calculating a student binary mask based on the second feature map, and calculating a student size mask using the student binary mask;
calculating a teacher spatial attention mask and a teacher channel attention mask based on the first feature map;
calculating a student spatial attention mask and a student channel attention mask based on the second feature map;
calculating a mask loss function from the teacher binary mask, the teacher size mask, the student binary mask, the student size mask, the teacher spatial attention mask, the teacher channel attention mask, the student spatial attention mask, and the student channel attention mask;
an attention loss function is calculated from the teacher spatial attention mask, the teacher channel attention mask, the student spatial attention mask, and the student channel attention mask.
In one possible implementation, the binary mask includes a teacher binary mask and a student binary mask, where the binary mask is calculated according to the formula:
M_{i,j} = 1, if pixel (i, j) lies inside the real frame; M_{i,j} = 0, otherwise;
wherein i represents the abscissa of the pixel point in the feature map, j represents the ordinate of the pixel point in the feature map, and M_{i,j} represents the binary mask; the real frame is a preset real frame, and the feature map is one of the first feature map and the second feature map.
In one possible implementation, the scale mask includes a teacher scale mask and a student scale mask, where the scale mask is calculated as:
S_{i,j} = 1/(H_gt · W_gt), if M_{i,j} = 1; S_{i,j} = 1/N_bg, otherwise, with N_bg = Σ_{i,j} (1 − M_{i,j});
wherein H_gt represents the height of the real frame, W_gt represents the width of the real frame, N_bg represents the number of pixels outside every real frame, and S_{i,j} represents the scale mask.
In one possible implementation, the spatial attention mask includes a teacher spatial attention mask and a student spatial attention mask, the channel attention mask includes a teacher channel attention mask and a student channel attention mask, and the spatial attention mask is calculated by the formula:
A^S(F) = H · W · softmax(G^S(F) / T), with G^S(F) = (1/C) · Σ_c |F_c|;
wherein softmax(·) represents the normalization function, G^S(F) represents the channel-wise absolute average value of the pixels of the feature map at each spatial position, T represents the temperature hyper-parameter in knowledge distillation, and A^S represents the spatial attention mask;
the channel attention mask is calculated as:
A^C(F) = C · softmax(G^C(F) / T), with G^C(F) = (1/(H·W)) · Σ_{i,j} |F_{i,j}|;
wherein C indicates the number of channels, G^C(F) represents the spatial absolute average value of the pixels of the feature map on each channel, and A^C represents the channel attention mask.
In one possible implementation, the calculation formula of the mask loss function is:
L_fea = α · Σ_l Σ_{c=1}^{C} Σ_{i=1}^{H} Σ_{j=1}^{W} M_{i,j} S_{i,j} A^S_{i,j} A^C_c (F^T_{l,c,i,j} − F^S_{l,c,i,j})² + β · Σ_l Σ_{c=1}^{C} Σ_{i=1}^{H} Σ_{j=1}^{W} (1 − M_{i,j}) S_{i,j} A^S_{i,j} A^C_c (F^T_{l,c,i,j} − F^S_{l,c,i,j})²;
wherein α and β represent balance factors balancing the foreground and background of the feature map, F^T represents the output features of the fusion feature module in the teacher network model, F^S represents the output features of the fusion feature module in the student network model, C indicates the number of channels, H represents the height of the real frame, W represents the width of the real frame, l indicates the index of the feature layer, and L_fea represents the mask loss function;
the calculation formula of the attention loss function is:
L_at = γ · ( l1(A^S_T, A^S_S) + l1(A^C_T, A^C_S) );
wherein γ represents the balance factor for balancing the attention mask loss, l1(·,·) represents the L1 loss function, A^S_T represents the teacher spatial attention mask, A^S_S represents the student spatial attention mask, A^C_T represents the teacher channel attention mask, A^C_S represents the student channel attention mask, and L_at represents the attention loss function.
In one possible implementation, the calculation formula of the foreground focal point loss function is:
L_focal = L_fea + L_at;
wherein L_focal represents the foreground focal loss function, L_fea represents the mask loss function, and L_at represents the attention loss function.
In one possible implementation, calculating the global semantic loss function based on the first feature map and the second feature map includes:
determining candidate areas of the first feature map and the second feature map, and mapping the first feature map and the second feature map to obtain pixel-level features;
calculating a target vector according to the candidate region and the pixel level characteristics to obtain candidate region characterization;
weighting the candidate region characterization to obtain a context feature representation;
performing splicing processing and convolution processing on the context feature representation and the input feature to obtain an output feature;
calculating a global semantic loss function based on the output features;
the calculation formula of the global semantic loss function is as follows:
L_global = λ · Σ ( R(F^T) − R(F^S) )²;
wherein L_global represents the global semantic loss function, λ represents the balance factor for balancing the semantic loss, R(·) represents the context characterization module, F^T represents the output features of the fusion feature module in the teacher network model, and F^S represents the output features of the fusion feature module in the student network model.
In one possible implementation, the calculation formula of the objective loss function is:
L_target = L_focal + L_global;
wherein L_target represents the target loss function, L_focal represents the foreground focal loss function, and L_global represents the global semantic loss function.
In order to achieve the above object, the present invention further provides a target detection device based on knowledge distillation, including:
the feature map acquisition module is used for inputting the sample image into the teacher network model and the student network model respectively, to acquire a first feature map output by the teacher network model and a second feature map output by the student network model;
a calculation module for calculating a mask loss function and an attention loss function based on the first feature map and the second feature map;
the foreground focus loss function module is used for obtaining a foreground focus loss function based on the mask loss function and the attention loss function;
the global semantic loss function module is used for calculating a global semantic loss function based on the first feature map and the second feature map;
the target loss function module is used for obtaining a target loss function according to the foreground focus loss function and the global semantic loss function;
and the detection module is used for optimizing parameters of the student network model by utilizing the target loss function to obtain a target detection model, and carrying out target detection by utilizing the target detection model.
The beneficial effects of adopting this embodiment are as follows: firstly, mask-based foreground focal distillation separates and processes the sample image, guiding the student network to attend to key pixels and channels, and a mask-based foreground focal loss function is calculated; secondly, a global semantic distillation method is provided, which further enhances the learning efficiency of the student network by extracting the global pixel relations that foreground focal distillation blocks; finally, the target loss function is obtained by fusing the foreground focal loss function with the global semantic loss function, completing the training of the detection model. By fusing foreground focal knowledge distillation with global semantic knowledge distillation, the invention improves the detection accuracy of the target detection model while also improving the target detection speed.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the description of the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic flow chart of an embodiment of a method for detecting targets based on knowledge distillation according to the present invention;
fig. 2 is a schematic structural diagram of an embodiment of a target detection device based on knowledge distillation according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention. It will be apparent that the described embodiments are only some, but not all, embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Some of the block diagrams shown in the figures are functional entities and do not necessarily correspond to physically or logically separate entities. These functional entities may be implemented in software or in one or more hardware modules or integrated circuits or in different networks and/or processor systems and/or microcontroller systems.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the invention. The appearances of such phrases in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those of skill in the art will explicitly and implicitly appreciate that the embodiments described herein may be combined with other embodiments.
The knowledge used in distillation algorithms can be divided, according to its transfer form, into label knowledge, intermediate-layer knowledge, relational feature knowledge, and structured feature knowledge. When applied to road-scene target detection, label knowledge provides limited, one-dimensional information and cannot cope with complex road scenes; structured knowledge offers stronger knowledge representation but incurs large computational and optimization costs, whereas road scenes demand high real-time performance under limited computing power; relational feature knowledge has a high implementation threshold and insufficient flexibility, making it difficult to apply to unmanned-driving road scenes. Intermediate-layer knowledge distillation overcomes the single-source limitation of label knowledge and can be flexibly applied to road-scene target detection, but it still has adaptability problems in specific unmanned-driving road scenes, so the accuracy improvement it brings is very limited.
The invention adopts mask-based foreground focal knowledge distillation to separate the foreground region and background region of the image and to guide the student network to attend to key pixels and channels; however, this mask-based method also severs the direct connection between foreground and background. The invention therefore further provides a global semantic distillation method based on context characterization, which enhances the learning of the student network model by extracting the global pixel relations blocked by foreground focal distillation. Finally, a new knowledge distillation loss function is provided to complete the training of the model, so that lightweight processing can be applied to the target detection model without reducing its detection accuracy.
Fig. 1 is a schematic flow chart of an embodiment of a target detection method based on knowledge distillation.
Referring to fig. 1, the present invention provides a method for detecting a target based on knowledge distillation, comprising:
s101, respectively inputting a sample image into a teacher network model and a student network model, and acquiring a first characteristic image output by the teacher network model and a second characteristic image output by the student network model;
s102, calculating a mask loss function and an attention loss function based on the first feature map and the second feature map;
s103, obtaining a foreground focus loss function based on the mask loss function and the attention loss function;
s104, calculating a global semantic loss function based on the first feature map and the second feature map;
s105, obtaining a target loss function according to the foreground focus loss function and the global semantic loss function;
and S106, optimizing parameters of the student network model by using the target loss function to obtain a target detection model, and performing target detection by using the target detection model.
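As an illustrative sketch only (not the patent's reference implementation), steps S101 to S106 can be outlined as a single distillation step; the function names and loss callables here are assumptions for illustration:

```python
import numpy as np

def distillation_step(sample_image, teacher, student,
                      focal_loss_fn, global_loss_fn):
    """One hypothetical distillation step, returning the target loss.

    teacher/student: callables mapping an image to a feature map
    (S101); focal_loss_fn covers S102-S103 (mask + attention terms);
    global_loss_fn covers S104; the sum is the target loss of S105,
    which an optimizer would then minimize in S106.
    """
    f_teacher = teacher(sample_image)    # first feature map
    f_student = student(sample_image)    # second feature map
    l_focal = focal_loss_fn(f_teacher, f_student)
    l_global = global_loss_fn(f_teacher, f_student)
    return l_focal + l_global            # target loss function
```

In practice the returned scalar would be backpropagated through the student network model only, leaving the teacher frozen.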
The beneficial effects of adopting the embodiment are as follows: acquiring a first feature map output by a teacher network model and a second feature map output by a student network model; obtaining a foreground focus loss function based on the first feature map and the second feature map; calculating a global semantic loss function based on the first feature map and the second feature map; obtaining a target loss function according to the foreground focal loss function and the global semantic loss function; and optimizing parameters of the student network model by using the target loss function to obtain a target detection model, and performing target detection by using the target detection model. According to the invention, the foreground focal knowledge distillation and the global semantic knowledge distillation are fused, so that the detection accuracy of the target detection model can be improved while the target detection speed is improved.
The sample image may be a sample image preset in advance, or may be an image captured in real time by the image capturing module.
In order to determine the detection target, a real frame needs to be preset in the sample image for training the teacher network model and the student network model.
In one embodiment, step S102 includes:
calculating a teacher binary mask based on the first feature map, and calculating a teacher size mask using the teacher binary mask;
calculating a student binary mask based on the second feature map, and calculating a student size mask using the student binary mask;
calculating a teacher spatial attention mask and a teacher channel attention mask based on the first feature map;
calculating a student spatial attention mask and a student channel attention mask based on the second feature map;
calculating a mask loss function from the teacher binary mask, the teacher size mask, the student binary mask, the student size mask, the teacher spatial attention mask, the teacher channel attention mask, the student spatial attention mask, and the student channel attention mask;
an attention loss function is calculated from the teacher spatial attention mask, the teacher channel attention mask, the student spatial attention mask, and the student channel attention mask.
It should be explained that the mask loss function calculated from the first feature map and the second feature map separates the foreground and background regions of the two feature maps while weighting targets by scale. This not only improves detection efficiency but also prevents small targets from degrading the knowledge distillation effect, and it lets the student network model learn the focal information in the teacher network model from the two feature maps, thereby improving the performance of knowledge distillation.
In one embodiment, the binary masks include a teacher binary mask and a student binary mask, where the binary masks are calculated as:
M_{i,j} = 1, if pixel (i, j) lies inside the real frame; M_{i,j} = 0, otherwise;
wherein i represents the abscissa of the pixel point in the feature map, j represents the ordinate of the pixel point in the feature map, and M_{i,j} represents the binary mask; the real frame is a preset real frame: when the pixel point is located inside the real frame, the value of M_{i,j} is 1, otherwise the value of M_{i,j} is 0; the feature map is one of the first feature map and the second feature map.
Considering the great difference between foreground and background in unmanned-driving road scenes, the invention separates the foreground and background regions of the feature map through the binary mask so as to balance their contributions to the loss. Note that because the teacher network model and the student network model compute their masks in the same way, the calculation formula of the binary mask applies to both models; likewise, the calculation formulas of the scale mask, the spatial attention mask, the channel attention mask, the mask loss function, the foreground focal loss function, and the global semantic loss function apply to both models, and this is not repeated below.
On the basis of balancing the foreground and background regions, the invention further considers the scale of the target: larger targets cover more pixels and therefore occupy a larger share of the loss, which weakens the distillation effect for small targets. By setting a scale mask, the weights of targets of different scales are balanced, further improving the detection precision of the detection model for targets of different scales.
Further, the scale mask includes a teacher scale mask and a student scale mask, and the calculation formula of the scale mask is:
S_{i,j} = 1/(H_gt · W_gt), if M_{i,j} = 1; S_{i,j} = 1/N_bg, otherwise, with N_bg = Σ_{i,j} (1 − M_{i,j});
wherein H_gt represents the height of the real frame, W_gt represents the width of the real frame, and S_{i,j} represents the scale mask. When the pixel point is located inside the real frame, the value of the scale mask is the reciprocal of the area of the real frame; when the pixel point is located outside the real frame, the value of the scale mask is the reciprocal of the total number of pixels outside the real frame. This reduces the weight of large targets and increases the weight of small targets, thereby balancing targets of different scales.
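The binary mask and scale mask described above can be sketched as follows; the real-frame convention used here (inclusive `(top, left, bottom, right)` pixel indices) is an assumption of this sketch, not specified by the patent:

```python
import numpy as np

def binary_and_scale_masks(h, w, boxes):
    """Binary mask M and scale mask S for an h x w feature map.

    boxes: real frames as (top, left, bottom, right) inclusive pixel
    indices. Inside a frame, S is the reciprocal of that frame's area;
    outside every frame, S is the reciprocal of the number of
    background pixels.
    """
    m = np.zeros((h, w))
    s = np.zeros((h, w))
    for (t, l, b, r) in boxes:
        m[t:b + 1, l:r + 1] = 1.0
        s[t:b + 1, l:r + 1] = 1.0 / ((b - t + 1) * (r - l + 1))
    n_bg = np.sum(m == 0)                # pixels outside every real frame
    if n_bg > 0:
        s[m == 0] = 1.0 / n_bg
    return m, s
```

For overlapping frames a tie-breaking rule would be needed; this sketch simply lets the last frame written win.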
On the basis of the binary mask and the scale mask, the invention further introduces an attention mechanism so that the student network model learns the focal information of the teacher network model in a focused way: the channel attention mechanism captures channel dependencies, and the spatial attention mechanism captures pairwise pixel-level relations, thereby optimizing the student network model.
In one embodiment, the spatial attention mask includes a teacher spatial attention mask and a student spatial attention mask, the channel attention mask includes a teacher channel attention mask and a student channel attention mask, and the spatial attention mask is calculated as:
A^S(F) = H · W · softmax(G^S(F) / T), with G^S(F) = (1/C) · Σ_c |F_c|;
wherein softmax(·) represents the normalization function, G^S(F) represents the channel-wise absolute average value of the pixels of the feature map at each spatial position, T represents the temperature hyper-parameter in knowledge distillation, and A^S represents the spatial attention mask.
The channel attention mask is calculated as:
A^C(F) = C · softmax(G^C(F) / T), with G^C(F) = (1/(H·W)) · Σ_{i,j} |F_{i,j}|;
wherein C indicates the number of channels, G^C(F) represents the spatial absolute average value of the pixels of the feature map on each channel, and A^C represents the channel attention mask.
Further, before determining the spatial attention mask and the channel attention mask, the method further comprises: calculating, based on the first feature map and the second feature map, the absolute average value G^S(F) of the pixels over the spatial positions and the absolute average value G^C(F) of the pixels on the channels.
The invention adjusts the spatial pixel and channel distribution in the characteristic diagram through the spatial attention mask and the channel attention mask, thereby enabling the student network model to learn focus information of the teacher network model in a focused manner.
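A minimal sketch of the spatial and channel attention masks above, assuming a (C, H, W) feature map; the default temperature value is illustrative only:

```python
import numpy as np

def softmax(x):
    # numerically stable softmax over a 1-D array
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_masks(feat, temperature=0.5):
    """Spatial mask A^S (H, W) and channel mask A^C (C,) for feat (C, H, W)."""
    c, h, w = feat.shape
    g_spatial = np.abs(feat).mean(axis=0)       # absolute average over channels
    g_channel = np.abs(feat).mean(axis=(1, 2))  # absolute average over space
    a_spatial = h * w * softmax((g_spatial / temperature).ravel()).reshape(h, w)
    a_channel = c * softmax(g_channel / temperature)
    return a_spatial, a_channel
```

Scaling by H·W and C keeps the mask values averaging to one, so they act as relative emphasis weights rather than changing the overall loss magnitude.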
In one embodiment, the mask loss function is calculated as:
L_fea = α · Σ_l Σ_{c=1}^{C} Σ_{i=1}^{H} Σ_{j=1}^{W} M_{i,j} S_{i,j} A^S_{i,j} A^C_c (F^T_{l,c,i,j} − F^S_{l,c,i,j})² + β · Σ_l Σ_{c=1}^{C} Σ_{i=1}^{H} Σ_{j=1}^{W} (1 − M_{i,j}) S_{i,j} A^S_{i,j} A^C_c (F^T_{l,c,i,j} − F^S_{l,c,i,j})²;
wherein α and β represent balance factors balancing the foreground and background of the feature map, F^T represents the output features of the fusion feature module in the teacher network model, F^S represents the output features of the fusion feature module in the student network model, C indicates the number of channels, H represents the height of the real frame, W represents the width of the real frame, l indicates the index of the feature layer, and L_fea represents the mask loss function;
the calculation formula of the attention loss function is:
L_at = γ · ( l1(A^S_T, A^S_S) + l1(A^C_T, A^C_S) );
wherein γ represents the balance factor for balancing the attention mask loss, l1(·,·) represents the L1 loss function, A^S_T represents the teacher spatial attention mask, A^S_S represents the student spatial attention mask, A^C_T represents the teacher channel attention mask, A^C_S represents the student channel attention mask, and L_at represents the attention loss function.
In one embodiment, the foreground focal point loss function is calculated as:
L_focal = L_fea + L_at;
wherein L_focal represents the foreground focal loss function, L_fea represents the mask loss function, and L_at represents the attention loss function.
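The mask loss, attention loss, and their sum (the foreground focal loss) can be sketched as follows; the default balance factors alpha, beta, and gamma are illustrative placeholders, not values from the patent:

```python
import numpy as np

def mask_loss(ft, fs, m, s, a_s, a_c, alpha=1.0, beta=0.5):
    """Weighted squared error between teacher (ft) and student (fs)
    features of shape (C, H, W), with binary mask m, scale mask s,
    spatial attention a_s (all (H, W)) and channel attention a_c (C,)."""
    w = s * a_s * a_c[:, None, None]        # per-channel, per-pixel weight
    sq = (ft - fs) ** 2
    fg = np.sum(m * w * sq)                 # foreground term
    bg = np.sum((1.0 - m) * w * sq)         # background term
    return alpha * fg + beta * bg

def attention_loss(ast, ass, act, acs, gamma=1.0):
    """L1 distance between teacher and student attention masks."""
    return gamma * (np.abs(ast - ass).mean() + np.abs(act - acs).mean())

def focal_loss(l_mask, l_at):
    """Foreground focal loss: sum of mask and attention losses."""
    return l_mask + l_at
```

A multi-level detector would sum `mask_loss` over the feature layers l; this sketch shows a single level for clarity.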
Through the foreground focal knowledge distillation method, this embodiment makes the student network model focus on the key information in the feature map. However, by separating the foreground region from the background region, foreground focal distillation severs the relation between them, causing the student network model to ignore the correlations among different pixels. For road scenes, accurately detecting a foreground target while combining it with the corresponding background features is essential.
In one embodiment, after a sample picture is acquired, it is input into the teacher network model and the student network model respectively, obtaining the first feature map output by the teacher network model and the second feature map output by the student network model. The teacher network model then performs feature fusion and context characterization on the first feature map to obtain its feature information and context characterization relation, and fuses the two to obtain the global semantic information of the first feature map; the student network model likewise performs feature fusion and context characterization on the second feature map and fuses the results to obtain the global semantic information of the second feature map. The global semantic information of the second feature map is trained against that of the first feature map, yielding the global semantic distillation loss function used to train the student network model, and the parameters of the student network model are set through this loss function. Meanwhile, after the feature information of the first and second feature maps is obtained, foreground focusing is performed on each, the spatial attention masks and channel attention masks of the teacher network model and the student network model are calculated, and the foreground focal distillation loss function is determined; the student network model is parameterized through this loss function, and a target detection model based on knowledge distillation is thereby obtained.
In one embodiment, step S104 includes:
determining candidate areas of the first feature map and the second feature map, and mapping the first feature map and the second feature map to obtain pixel-level features;
calculating a target vector according to the candidate region and the pixel level characteristics to obtain candidate region characterization;
weighting the candidate region characterization to obtain a context feature representation;
performing splicing processing and convolution processing on the context feature representation and the input feature to obtain an output feature;
a global semantic loss function is calculated based on the output features.
In one embodiment, the input feature is first adjusted into a group of features whose channel number equals the class number, serving as the candidate regions, and the pixel-level features are mapped in parallel. From the pixel-level features and the candidate regions, k groups of C-dimensional vectors of the feature map are calculated; the features are then weighted and summed according to the correspondence between each pixel in the pixel-level features and the candidate-region features, yielding a context feature representation with incremental information. Finally, the context feature representation is spliced with the input feature, and the two are fused through convolution to obtain the output feature, thereby supplementing the distilled features with global semantic information.
The calculation formula of the global semantic loss function is as follows:

L_{global} = \lambda \sum \left( \mathcal{R}(F^{T}) - \mathcal{R}(F^{S}) \right)^{2}

wherein L_{global} represents the global semantic loss function, \lambda represents a balancing factor for balancing the semantic loss, \mathcal{R}(\cdot) represents the context characterization module, F^{T} represents the output feature of the fusion feature module in the teacher network model, and F^{S} represents the output feature of the fusion feature module in the student network model.
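Assuming the context-characterized outputs of the teacher and student are already available, the global semantic loss reduces to a balanced squared error; `lam` is a stand-in for the balancing factor λ:

```python
import numpy as np

def global_semantic_loss(teacher_out, student_out, lam=0.5):
    """Global semantic distillation loss: balanced sum of squared
    differences between teacher and student context-characterized
    output features (a sketch; lam is an assumed balance factor)."""
    diff = teacher_out - student_out
    return lam * np.sum(diff ** 2)
```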
In one possible implementation, the calculation formula of the target loss function is:

L = L_{focal} + L_{global}

wherein L represents the target loss function, L_{focal} represents the foreground focal loss function, and L_{global} represents the global semantic loss function.
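Under the definitions above, assembling the target loss is a simple sum; equal weighting of the terms is an assumption of this sketch:

```python
def target_loss(mask_loss, attention_loss, global_loss):
    """Assemble the target loss: the foreground focal loss is the sum
    of the mask loss and attention loss, and the target loss adds the
    global semantic loss (equal weighting assumed)."""
    focal_loss = mask_loss + attention_loss
    return focal_loss + global_loss
```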
It should be noted that the knowledge distillation in the invention is computed only on the intermediate-layer features of the deep convolutional network; the knowledge content is rich, and no high computational overhead or optimization cost is needed. The knowledge distillation method is suitable for road-scene target detection, and the intermediate-layer features can be obtained from the feature fusion module of the detector, so the distillation algorithm based on foreground focus and global semantics can easily be applied to different teacher network models and student network models.
FIG. 2 is a schematic diagram of an embodiment of a knowledge-based distillation target detection apparatus.
Referring to fig. 2, the present invention further provides a knowledge distillation-based object detection apparatus, including:
a feature map obtaining module 21, configured to input the sample image into a teacher network model and a student network model respectively, and obtain a first feature map output by the teacher network model and a second feature map output by the student network model;
a calculation module 22 for calculating a mask loss function and an attention loss function based on the first feature map and the second feature map;
a foreground focus loss function module 23 for obtaining a foreground focus loss function based on the mask loss function and the attention loss function;
a global semantic loss function module 24 for calculating a global semantic loss function based on the first feature map and the second feature map;
a target loss function module 25, configured to obtain a target loss function according to the foreground focal loss function and the global semantic loss function;
the detection module 26 is configured to optimize parameters of the student network model by using the objective loss function to obtain an objective detection model, and perform objective detection by using the objective detection model.
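For illustration, the binary and scale masks consumed by the calculation module can be sketched in numpy as follows; the inverse-background-area weighting of background pixels is an assumption drawn from the focal-distillation literature the application cites, not a statement of the patent's exact formula:

```python
import numpy as np

def binary_and_scale_masks(h, w, boxes):
    """Build a binary mask marking pixels inside any ground-truth
    ('real') frame, and a scale mask weighting foreground pixels by
    the inverse box area and background pixels by the inverse
    background pixel count. boxes: list of (y0, x0, y1, x1)."""
    binary = np.zeros((h, w))
    scale = np.zeros((h, w))
    for (y0, x0, y1, x1) in boxes:
        binary[y0:y1, x0:x1] = 1.0
        scale[y0:y1, x0:x1] = 1.0 / ((y1 - y0) * (x1 - x0))
    n_bg = (binary == 0).sum()
    if n_bg:
        scale[binary == 0] = 1.0 / n_bg
    return binary, scale
```

The scale mask keeps small targets from being drowned out by large ones: every box contributes the same total weight regardless of its area.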
The beneficial effects of adopting the embodiment are as follows: the feature map acquisition module 21 acquires a first feature map output by the teacher network model and a second feature map output by the student network model; the calculation module 22 calculates a mask loss function and an attention loss function based on the first feature map and the second feature map; the foreground focus loss function module 23 obtains a foreground focus loss function based on the mask loss function and the attention loss function; the global semantic loss function module 24 calculates a global semantic loss function based on the first feature map and the second feature map; the objective loss function module 25 obtains an objective loss function according to the foreground focus loss function and the global semantic loss function; the detection module 26 optimizes parameters of the student network model using the objective loss function to obtain an objective detection model, and performs objective detection using the objective detection model. According to the invention, the foreground focal knowledge distillation and the global semantic knowledge distillation are fused, so that the detection accuracy of the target detection model can be improved while the target detection speed is improved.
The foregoing embodiments of the knowledge-distillation-based target detection apparatus may implement the technical solutions described in the foregoing embodiments of the knowledge-distillation-based target detection method; the specific implementation principles of the above modules or units can be found in the corresponding content of the method embodiments and are not repeated here.
The above description of the knowledge-distillation-based target detection method and apparatus provided by the invention uses specific examples to illustrate the principle and implementation of the invention; these examples are intended only to help understand the method and its core idea. Meanwhile, since those skilled in the art may vary the specific embodiments and application scope according to the ideas of the invention, this description should not be construed as limiting the invention.

Claims (10)

1. A method for detecting a target based on knowledge distillation, comprising:
respectively inputting a sample image into a teacher network model and a student network model, and acquiring a first characteristic diagram output by the teacher network model and a second characteristic diagram output by the student network model;
calculating a mask loss function and an attention loss function based on the first feature map and the second feature map;
obtaining a foreground focus loss function based on the mask loss function and the attention loss function;
calculating a global semantic loss function based on the first feature map and the second feature map;
obtaining a target loss function according to the foreground focus loss function and the global semantic loss function;
and optimizing parameters of the student network model by using the target loss function to obtain a target detection model, and performing target detection by using the target detection model.
2. The knowledge distillation based target detection method according to claim 1, wherein the calculating a mask loss function and an attention loss function based on the first feature map and the second feature map comprises:
calculating a teacher binary mask based on the first feature map, and calculating a teacher size mask using the teacher binary mask;
calculating a student binary mask based on the second feature map, calculating a student size mask using the student binary mask;
calculating a teacher spatial attention mask and a teacher channel attention mask based on the first feature map;
calculating a student spatial attention mask and a student channel attention mask based on the second feature map;
calculating a mask loss function from the teacher binary mask, the teacher size mask, the student binary mask, the student size mask, the teacher spatial attention mask, the teacher channel attention mask, the student spatial attention mask, and the student channel attention mask;
an attention loss function is calculated from the teacher spatial attention mask, the teacher channel attention mask, the student spatial attention mask, and the student channel attention mask.
3. The knowledge distillation based target detection method according to claim 2, wherein a binary mask includes the teacher binary mask and the student binary mask, and the binary mask is calculated by the formula:

M_{i,j} = \begin{cases} 1, & (i,j) \in \text{real frame} \\ 0, & \text{otherwise} \end{cases}

wherein i represents the abscissa of a pixel point in the feature map, j represents the ordinate of the pixel point in the feature map, M represents the binary mask, the real frame is a preset real frame, and the feature map is one of the first feature map and the second feature map.
4. The knowledge distillation based target detection method according to claim 3, wherein a scale mask includes the teacher scale mask and the student scale mask, the scale mask being calculated by:

S_{i,j} = \frac{M_{i,j}}{H \cdot W} + \frac{1 - M_{i,j}}{N_{bg}}

wherein H represents the height of the real frame, W represents the width of the real frame, N_{bg} represents the number of background pixels in the feature map, and S represents the scale mask.
5. The knowledge distillation based target detection method according to claim 4, wherein a spatial attention mask includes the teacher spatial attention mask and the student spatial attention mask, a channel attention mask includes the teacher channel attention mask and the student channel attention mask, and the calculation formula of the spatial attention mask is:

A^{S} = H \cdot W \cdot \mathrm{softmax}\left( G^{S}(F) / T \right), \quad G^{S}(F) = \frac{1}{C} \sum_{c=1}^{C} \left| F_{c} \right|

wherein softmax(·) represents a normalization function, G^{S}(F) represents the absolute average value of the pixels of the feature map at the different spatial positions, T represents the temperature hyper-parameter in knowledge distillation, and A^{S} represents the spatial attention mask;

the calculation formula of the channel attention mask is:

A^{C} = C \cdot \mathrm{softmax}\left( G^{C}(F) / T \right), \quad G^{C}(F) = \frac{1}{H W} \sum_{i=1}^{H} \sum_{j=1}^{W} \left| F_{i,j} \right|

wherein C represents the number of channels, G^{C}(F) represents the absolute average value of the pixels of the feature map on the different channels, and A^{C} represents the channel attention mask.
6. The knowledge distillation based target detection method according to claim 5, wherein the calculation formula of the mask loss function is:

L_{mask} = \sum_{l=1}^{L} \left[ \alpha \sum_{c=1}^{C} \sum_{i=1}^{H} \sum_{j=1}^{W} M_{i,j} S_{i,j} A^{S}_{i,j} A^{C}_{c} \left( F^{T}_{c,i,j} - F^{S}_{c,i,j} \right)^{2} + \beta \sum_{c=1}^{C} \sum_{i=1}^{H} \sum_{j=1}^{W} \left( 1 - M_{i,j} \right) S_{i,j} A^{S}_{i,j} A^{C}_{c} \left( F^{T}_{c,i,j} - F^{S}_{c,i,j} \right)^{2} \right]

wherein \alpha and \beta represent balance factors for balancing the foreground and background in the feature map, F^{T} represents the output feature of the fusion feature module in the teacher network model, F^{S} represents the output feature of the fusion feature module in the student network model, C represents the number of channels, H represents the height of the real frame, W represents the width of the real frame, L represents the number of layers of the feature layer, and L_{mask} represents the mask loss function;

the calculation formula of the attention loss function is:

L_{at} = \gamma \left( l_{1}\left( A^{S}_{T}, A^{S}_{S} \right) + l_{1}\left( A^{C}_{T}, A^{C}_{S} \right) \right)

wherein \gamma represents a balancing factor for balancing the attention mask loss, l_{1}(\cdot) represents the L1 loss function, A^{S}_{T} represents the teacher spatial attention mask, A^{S}_{S} represents the student spatial attention mask, A^{C}_{T} represents the teacher channel attention mask, A^{C}_{S} represents the student channel attention mask, and L_{at} represents the attention loss function.
7. The knowledge distillation based target detection method according to claim 6, wherein the calculation formula of the foreground focal loss function is:

L_{focal} = L_{mask} + L_{at}

wherein L_{focal} represents the foreground focal loss function, L_{mask} represents the mask loss function, and L_{at} represents the attention loss function.
8. The knowledge distillation based target detection method according to claim 1, wherein the calculating a global semantic loss function based on the first feature map and the second feature map comprises:
determining candidate areas of the first feature map and the second feature map, and mapping the first feature map and the second feature map to obtain pixel-level features;
calculating a target vector according to the candidate region and the pixel-level feature to obtain a candidate region characterization;
weighting the candidate region characterization to obtain a context feature representation;
performing splicing processing and convolution processing on the context feature representation and the input feature to obtain an output feature;
calculating a global semantic loss function based on the output features;
the calculation formula of the global semantic loss function is as follows:

L_{global} = \lambda \sum \left( \mathcal{R}(F^{T}) - \mathcal{R}(F^{S}) \right)^{2}

wherein L_{global} represents the global semantic loss function, \lambda represents a balancing factor for balancing the semantic loss, \mathcal{R}(\cdot) represents the context characterization module, F^{T} represents the output feature of the fusion feature module in the teacher network model, and F^{S} represents the output feature of the fusion feature module in the student network model.
9. The knowledge distillation based target detection method according to claims 7 and 8, wherein the calculation formula of the target loss function is:

L = L_{focal} + L_{global}

wherein L represents the target loss function, L_{focal} represents the foreground focal loss function, and L_{global} represents the global semantic loss function.
10. A knowledge distillation-based target detection apparatus, comprising:
the characteristic diagram acquisition module is used for inputting the sample images into a teacher network model and a student network model respectively to acquire a first characteristic diagram output by the teacher network model and a second characteristic diagram output by the student network model;
a calculation module for calculating a mask loss function and an attention loss function based on the first feature map and the second feature map;
a foreground focus loss function module, configured to obtain a foreground focus loss function based on the mask loss function and the attention loss function;
a global semantic loss function module for calculating a global semantic loss function based on the first feature map and the second feature map;
the target loss function module is used for obtaining a target loss function according to the foreground focus loss function and the global semantic loss function;
and the detection module is used for optimizing the parameters of the student network model by utilizing the target loss function to obtain a target detection model, and carrying out target detection by utilizing the target detection model.
CN202311210238.2A 2023-09-19 2023-09-19 Target detection method and device based on knowledge distillation Pending CN116994068A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311210238.2A CN116994068A (en) 2023-09-19 2023-09-19 Target detection method and device based on knowledge distillation


Publications (1)

Publication Number Publication Date
CN116994068A (en) 2023-11-03

Family

ID=88532337

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311210238.2A Pending CN116994068A (en) 2023-09-19 2023-09-19 Target detection method and device based on knowledge distillation

Country Status (1)

Country Link
CN (1) CN116994068A (en)


Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114565045A (en) * 2022-03-01 2022-05-31 北京航空航天大学 Remote sensing target detection knowledge distillation method based on feature separation attention
US20220261599A1 (en) * 2021-02-18 2022-08-18 Irida Labs S.A. Annotating unlabeled images using convolutional neural networks
CN115131627A (en) * 2022-07-01 2022-09-30 贵州大学 Construction and training method of lightweight plant disease and insect pest target detection model
CN115457364A (en) * 2022-08-30 2022-12-09 长沙智能驾驶研究院有限公司 Target detection knowledge distillation method and device, terminal equipment and storage medium


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ZHENDONG YANG et al., "Focal and Global Knowledge Distillation for Detectors", CVPR 2022, pages 3-4.
LI Yang et al., "SAR Image Ship Target Detection Based on Knowledge Distillation", Modern Defense Technology, vol. 51, no. 4.

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117974988A (en) * 2024-03-28 2024-05-03 南京邮电大学 Lightweight target detection method, lightweight target detection device and computer program product
CN117974988B (en) * 2024-03-28 2024-05-31 南京邮电大学 Lightweight target detection method, lightweight target detection device and computer program product

Similar Documents

Publication Publication Date Title
WO2022111219A1 (en) Domain adaptation device operation and maintenance system and method
CN112150493B (en) Semantic guidance-based screen area detection method in natural scene
CN111460968B (en) Unmanned aerial vehicle identification and tracking method and device based on video
CN114202672A (en) Small target detection method based on attention mechanism
CN107133943A (en) A kind of visible detection method of stockbridge damper defects detection
CN108734143A (en) A kind of transmission line of electricity online test method based on binocular vision of crusing robot
CN111008633B (en) License plate character segmentation method based on attention mechanism
CN113313082B (en) Target detection method and system based on multitask loss function
CN112818969A (en) Knowledge distillation-based face pose estimation method and system
CN112287896A (en) Unmanned aerial vehicle aerial image target detection method and system based on deep learning
Li et al. Bifnet: Bidirectional fusion network for road segmentation
CN116994068A (en) Target detection method and device based on knowledge distillation
CN115131747A (en) Knowledge distillation-based power transmission channel engineering vehicle target detection method and system
CN114140672A (en) Target detection network system and method applied to multi-sensor data fusion in rainy and snowy weather scene
CN110659601A (en) Depth full convolution network remote sensing image dense vehicle detection method based on central point
CN117152513A (en) Vehicle boundary positioning method for night scene
CN111126155A (en) Pedestrian re-identification method for generating confrontation network based on semantic constraint
CN114511532A (en) Solar cell surface defect detection method based on feature-guided channel distillation
CN113807185A (en) Data processing method and device
CN117746264A (en) Multitasking implementation method for unmanned aerial vehicle detection and road segmentation
CN112861987A (en) Target detection method under dark light environment
CN117115616A (en) Real-time low-illumination image target detection method based on convolutional neural network
CN116310902A (en) Unmanned aerial vehicle target detection method and system based on lightweight neural network
Li et al. A real-time vehicle window positioning system based on nanodet
CN113670268B (en) Binocular vision-based unmanned aerial vehicle and electric power tower distance measurement method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination