CN115546708A - Target detection method and device

Publication number: CN115546708A
Authority: CN (China)
Prior art keywords: map, model, image, loss function, branch
Legal status: Pending
Application number: CN202110734260.1A
Other languages: Chinese (zh)
Inventor: 吴捷
Current Assignee: Beijing Zitiao Network Technology Co Ltd
Original Assignee: Beijing Zitiao Network Technology Co Ltd
Application filed by Beijing Zitiao Network Technology Co Ltd
Priority to CN202110734260.1A
Publication of CN115546708A

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods


Abstract

The embodiments of the disclosure provide a target detection method and device, wherein the method comprises: acquiring an image to be detected; detecting a target in the image through a first model to obtain an attention map and a density map of the image; and generating a target detection map of the image according to the attention map and the density map, wherein the target detection map comprises position identifiers of targets in the image. The first model is a lightweight deep learning model obtained by training based on sample data, a second model and a first loss function; the first loss function comprises a distillation loss function representing the distillation loss between the first model and the second model, and the model scale of the second model is larger than that of the first model. Target detection is thus performed on the image through a lightweight deep learning model obtained by knowledge-distillation training, which improves the efficiency of target detection.

Description

Target detection method and device
Technical Field
The embodiment of the disclosure relates to the technical field of computers, in particular to a target detection method and device.
Background
With the development of business, detection of targets in complex scenes is required, for example, detection of pedestrians in street scenes and detection of vehicles in road scenes.
At present, methods based on density map estimation are used for target detection in complex scenes; such methods reflect both the number of targets and their distribution in the scene. However, these methods must trade off detection quality against real-time performance: typically, after an image is acquired by a camera, it is transferred to a back-end server and target detection is performed on the server's powerful graphics processor, so real-time performance is poor.
Therefore, the efficiency of target detection needs to be improved.
Disclosure of Invention
The embodiment of the disclosure provides a target detection method and device, so as to improve the efficiency of target detection.
In a first aspect, an embodiment of the present disclosure provides a target detection method, including:
acquiring an image to be detected;
detecting a target in the image through a first model to obtain an attention map and a density map of the image;
generating a target detection map of the image according to the attention map and the density map, wherein the target detection map comprises position identifiers of targets in the image;
the first model is a lightweight deep learning model obtained by training based on sample data, a second model and a first loss function, the first loss function comprises a distillation loss function for representing the distillation loss between the first model and the second model, and the model scale of the second model is larger than that of the first model.
In a second aspect, an embodiment of the present disclosure provides an object detection apparatus, including:
an acquisition unit, a detection unit and a generating unit, wherein the acquisition unit is configured to acquire an image to be detected;
the detection unit is used for detecting a target in the image through a first model to obtain an attention map and a density map of the image;
a generating unit, configured to generate a target detection map of the image according to the attention map and the density map;
the first model is a lightweight deep learning model obtained by training based on sample data, a second model and a first loss function, the first loss function comprises a distillation loss function for reflecting distillation loss between the first model and the second model, and the model scale of the second model is larger than that of the first model.
In a third aspect, an embodiment of the present disclosure provides an electronic device, including: at least one processor and memory;
the memory stores computer-executable instructions;
the at least one processor executes computer-executable instructions stored by the memory to cause the at least one processor to perform the object detection method as described above in the first aspect and various possible designs of the first aspect.
In a fourth aspect, the embodiments of the present disclosure provide a computer-readable storage medium, in which computer-executable instructions are stored, and when a processor executes the computer-executable instructions, the target detection method according to the first aspect and various possible designs of the first aspect is implemented.
In a fifth aspect, embodiments of the present disclosure provide a computer program product.
According to the target detection method and device provided by the embodiments of the present disclosure, a target in an image is detected through the first model to obtain an attention map and a density map of the image, and a target detection map of the image is generated according to the attention map and the density map. The first model is a lightweight deep learning model obtained by training based on sample data, a second model and a first loss function; the first loss function comprises a distillation loss function representing the distillation loss between the first model and the second model, and the model scale of the second model is larger than that of the first model. In this way, the lightweight first model is obtained by distilling the larger second model, and target detection is performed on the image with the lightweight first model, which improves the efficiency of target detection.
Drawings
In order to illustrate the embodiments of the present disclosure or the technical solutions in the prior art more clearly, the drawings required for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description show some embodiments of the present disclosure, and a person skilled in the art may obtain other drawings from them without inventive effort.
Fig. 1 is an exemplary diagram of an application scenario provided by an embodiment of the present disclosure;
fig. 2 is a schematic flow chart of a target detection method provided in the embodiment of the present disclosure;
fig. 3 is a schematic flow chart of a training process of a first model in a target detection method according to an embodiment of the present disclosure;
fig. 4 is a schematic flowchart of a training process of a second model in the target detection method according to the embodiment of the present disclosure;
FIG. 5 is a diagram illustrating a model structure of a first model and a second model provided by an embodiment of the disclosure;
fig. 6 is a block diagram of a target detection apparatus provided in the embodiment of the present disclosure;
fig. 7 is a schematic diagram of a hardware structure of an electronic device according to an embodiment of the present disclosure.
Detailed Description
To make the objects, technical solutions and advantages of the embodiments of the present disclosure clearer, the technical solutions of the embodiments are described clearly and completely below with reference to the drawings. Obviously, the described embodiments are some, but not all, of the embodiments of the present disclosure. All other embodiments obtained by a person skilled in the art from the disclosed embodiments without creative effort shall fall within the protection scope of the present disclosure.
Referring to fig. 1, fig. 1 is an exemplary diagram of an application scenario provided in an embodiment of the present disclosure.
As shown in fig. 1, the application scene is, for example, a road scene, and the target is, for example, a pedestrian. In the road scene, video is captured by a roadside camera 101, and real-time pedestrian detection is performed on the images in the video by the camera 101 or by a terminal 102 connected to the camera 101, so as to obtain the number of pedestrians in each image.
The camera and the terminal connected to it are lightweight devices with limited computing capability. How to achieve real-time target detection on such lightweight devices is one of the problems that urgently needs to be solved.
Methods that use a real-time crowd counting model to detect pedestrians in an image generally fall into the following two categories:
Mode one, methods based on pedestrian detection. In this type of method, pedestrians are detected in the image using a conventional detector based on the Histogram of Oriented Gradients (HOG), or using a deep learning algorithm such as the YOLO family of algorithms or Region-based Convolutional Neural Networks (R-CNNs). However, when pedestrians are heavily occluded, this type of method performs poorly and is prone to false detections and missed detections.
Mode two, methods based on head density map estimation. In this type of method, head detection is performed on the image, producing a density map that reflects the number of pedestrians and their distribution. However, this type of method must trade the accuracy of head detection against real-time performance. The common approach is to send the images collected by the camera to a back-end server and to use the server's strong image processing capability to perform head detection with a computationally heavy model; because that model is trained with a traditional binary cross-entropy loss, both the efficiency and the accuracy of head detection still need to be improved.
In order to solve the above problems, an embodiment of the present disclosure provides a target detection method. In this method, a lightweight first model with a smaller model scale and lower computational cost is obtained through knowledge distillation from a second model with a larger model scale and higher computational cost, so that the first model can be conveniently deployed on lightweight devices. During target detection, the first model detects targets in the image to obtain an attention map and a density map of the image, and a target detection map of the image is generated according to the attention map and the density map. In this way, on the one hand, the lightweight first model improves the efficiency of target detection and facilitates real-time target detection on lightweight devices; on the other hand, the resulting target detection map reflects both the number of targets and their distribution, which ensures the accuracy of target detection.
Optionally, the target is a pedestrian, and the target detection is pedestrian detection and human head detection in the image.
For example, the target detection method provided by the embodiment of the present disclosure may be applied in a terminal or a server. When the method is applied to the terminal, the real-time target detection of the image acquired by the terminal is realized. When the method is applied to the server, the target detection of the image sent by the terminal is realized. The terminal device may be a Personal Digital Assistant (PDA), a handheld device with a wireless communication function (e.g., a smart phone or a tablet), a computing device (e.g., a Personal Computer (PC)), an in-vehicle device, a wearable device (e.g., a smart watch or a smart band), and a smart home device (e.g., a smart display device).
Optionally, the terminal is a camera, and the first model is deployed in a chip of the camera. In this way, the images acquired by the camera are detected in real time, realizing real-time target detection on the camera side.
Referring to fig. 2, fig. 2 is a schematic flow chart of a target detection method provided in the embodiment of the present disclosure. As shown in fig. 2, the target detection method includes:
s201: and acquiring an image to be detected.
In an example, the image to be detected may be an image captured by a camera in real time, or one or more frames of images obtained from a video captured by the camera in real time.
In yet another example, the image to be detected may be an image input or selected by a user. For example, the user inputs or selects an image to be detected on a display interface of the terminal, or the server receives from the terminal an image input or selected by the user.
In another example, the image to be detected may be an image played in real time on the terminal. For example, when detecting that an image or video is played, the terminal acquires the image or video frame being played. Therefore, the target detection of the image played in real time on the terminal is realized.
S202: and detecting the target in the image through the first model to obtain an attention map and a density map of the image.
The first model is a lightweight deep learning model obtained by training based on sample data, a second model and a first loss function, the first loss function comprises a distillation loss function for reflecting distillation loss between the first model and the second model, the second model is a deep learning model, and the model scale of the second model is larger than that of the first model.
In the training process of the first model, a trained second model is obtained, and knowledge distillation is performed between the second model and the first model based on the sample images in the sample data and the distillation loss function in the first loss function. In this way, the training of the first model is guided by a trained second model with a large model scale and high target detection accuracy, which improves the target detection accuracy of the lightweight first model.
In this step, after the image to be detected is obtained, the image may be input into the first model directly, or the image may first be preprocessed (for example, denoised, cropped, enhanced, and the like) and the preprocessed image input into the first model. Feature extraction and decoding are performed on the image by the convolutional layers in the first model to extract the attention map and the density map of the image. The attention map emphasizes the clarity and accuracy of target detection and more clearly reflects the position of a single target in the image, while the density map emphasizes the comprehensiveness of target detection and more completely reflects the distribution of multiple targets in the image.
S203: from the attention map and the density map, an object detection map of the image is generated.
The target detection map comprises position identifiers of the targets in the image. For example, when the target is a pedestrian, a highlighted point can be used as the position identifier of a target in the image, so that the number and distribution of highlighted points in the map reflect the number and distribution of targets in the image.
In this step, after the attention map and the density map are obtained, they can be combined to obtain an attention-weighted density map, i.e. the target detection map. The target detection map thus reflects the density distribution of multiple targets in the image and reflects the positions of the targets more accurately, which improves the clarity, accuracy and comprehensiveness of target detection in the image and effectively improves the detection effect.
Combining the attention map and the density map means combining each pixel in the attention map with the corresponding pixel in the density map, that is, fusing each pixel of the attention map into the corresponding pixel of the density map.
In the embodiment of the disclosure, a lightweight first model is obtained based on sample data, a second model and knowledge distillation; the target in the image is detected through the first model to obtain an attention map and a density map of the image; and the attention map and the density map are combined to obtain a target detection map reflecting the positions of the targets in the image. In this way, the efficiency of target detection is improved through the lightweight first model, which can be deployed on lightweight devices, and the accuracy of target detection is improved through knowledge distillation and the combination of the attention map and the density map.
Regarding the model structure of the first model and the second model:
in some embodiments, at least two network branches are included in the first model, each network branch including one or more convolutional layers. The at least two network branches comprise an attention map branch and a density map branch. In the training process of the first model, the attention map branch is used for learning an attention map of the target in a sample image, and the density map branch is used for learning a density map of the target in the sample image; when the first model performs target detection on an image, the attention map branch is used for extracting the attention map of the image, and the density map branch is used for extracting the density map of the image. For convenience of description, the attention map branch in the first model is referred to as the first attention map branch, and the density map branch in the first model is referred to as the first density map branch.
Specifically, the first model includes a plurality of feature extraction layers, a plurality of feature decoding layers, a first attention map branch, and a first density map branch. In the first model, the last feature extraction layer is connected with the first feature decoding layer, and the last feature decoding layer is respectively connected with the first convolution layer in the first attention map branch and the first convolution layer in the first density map branch.
The first model also comprises a network layer for fusing the attention map and the density map. This network layer is connected to the last convolution layer in the first attention map branch and to the last convolution layer in the first density map branch; it receives the attention map output by the first attention map branch and the density map output by the first density map branch, fuses them to obtain the target detection map, and outputs the target detection map.
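To make this two-branch layout concrete, the following is a minimal PyTorch-style sketch, not the patented design: the layer counts, channel widths and module names are assumptions chosen only for illustration. It shows a shared encoder and decoder, a first attention map branch, a first density map branch, and a fusion step.

```python
import torch
import torch.nn as nn

class LightweightDetector(nn.Module):
    """Illustrative sketch of the first model: a shared encoder/decoder followed by
    a first attention map branch, a first density map branch, and a fusion layer.
    Layer counts and channel widths are placeholders, not the patented design."""

    def __init__(self, in_channels: int = 3, base_channels: int = 16):
        super().__init__()
        # Feature extraction layers (encoder); each stage halves the spatial size.
        self.enc1 = nn.Sequential(
            nn.Conv2d(in_channels, base_channels, 3, stride=2, padding=1), nn.ReLU(inplace=True))
        self.enc2 = nn.Sequential(
            nn.Conv2d(base_channels, base_channels * 2, 3, stride=2, padding=1), nn.ReLU(inplace=True))
        # Feature decoding layers (decoder); each stage upsamples back.
        self.dec1 = nn.Sequential(
            nn.Upsample(scale_factor=2), nn.Conv2d(base_channels * 2, base_channels, 3, padding=1), nn.ReLU(inplace=True))
        self.dec2 = nn.Sequential(
            nn.Upsample(scale_factor=2), nn.Conv2d(base_channels, base_channels, 3, padding=1), nn.ReLU(inplace=True))
        # First attention map branch and first density map branch (single-channel outputs).
        self.attention_branch = nn.Conv2d(base_channels, 1, kernel_size=1)
        self.density_branch = nn.Conv2d(base_channels, 1, kernel_size=1)

    def forward(self, x):
        features = self.dec2(self.dec1(self.enc2(self.enc1(x))))
        attention_map = self.attention_branch(features)   # attention map of the image
        density_map = self.density_branch(features)       # density map of the image
        # Fusion layer: attention-weighted density map, i.e. the target detection map.
        detection_map = torch.sigmoid(attention_map) * density_map
        return attention_map, density_map, detection_map
```

The second model described below could follow the same layout with more layers or wider channels.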
In some embodiments, the model structure of the second model is the same as or similar to the model structure of the first model. At least two network branches are included in the second model, each network branch including one or more convolutional layers. The at least two network branches comprise an attention map branch and a density map branch; for their functions, refer to the description of the attention map branch and the density map branch of the first model, which is not repeated here. For convenience of description, the attention map branch in the second model is referred to as the second attention map branch, and the density map branch in the second model is referred to as the second density map branch.
Specifically, the second model includes a plurality of feature extraction layers, a plurality of feature decoding layers, a second attention map branch, and a second density map branch. In the second model, the last feature extraction layer is connected with the first feature decoding layer, and the last feature decoding layer is respectively connected with the first convolution layer in the second attention map branch and the first convolution layer in the second density map branch.
The second model also comprises a network layer for fusing the attention map and the density map. This network layer is connected to the last convolution layer in the second attention map branch and to the last convolution layer in the second density map branch; it receives the attention map output by the second attention map branch and the density map output by the second density map branch, fuses them to obtain the target detection map, and outputs the target detection map.
Optionally, in the first model and/or the second model, the feature extraction layers are Ghost modules from the device-side neural network architecture GhostNet, so that feature extraction is performed with Ghost modules instead of ordinary convolutional layers. A Ghost module extracts more features with fewer parameters and a smaller amount of computation, which improves both the target detection efficiency and the target detection accuracy of the first model.
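For reference, a simplified Ghost module might look as follows. This is a sketch of the GhostNet idea rather than the exact GhostNet implementation; the 50/50 split between primary and cheap features, the kernel sizes, and the assumption of an even output channel count are illustrative choices.

```python
import torch
import torch.nn as nn

class GhostModule(nn.Module):
    """Simplified Ghost module: a primary convolution produces half of the output
    channels, and cheap depthwise convolutions generate the remaining "ghost"
    features. Assumes an even number of output channels."""

    def __init__(self, in_channels: int, out_channels: int):
        super().__init__()
        primary_channels = out_channels // 2          # assumed 50/50 split
        self.primary = nn.Sequential(
            nn.Conv2d(in_channels, primary_channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(primary_channels), nn.ReLU(inplace=True))
        self.cheap = nn.Sequential(                   # depthwise "ghost" branch
            nn.Conv2d(primary_channels, out_channels - primary_channels, kernel_size=3,
                      padding=1, groups=primary_channels, bias=False),
            nn.BatchNorm2d(out_channels - primary_channels), nn.ReLU(inplace=True))

    def forward(self, x):
        primary_features = self.primary(x)
        return torch.cat([primary_features, self.cheap(primary_features)], dim=1)
```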
Optionally, in the first model and/or the second model, a U-Net-style structure is used for information enhancement during feature extraction and decoding, so as to improve the accuracy of the feature map obtained after feature decoding and improve the target detection accuracy of the first model and/or the second model.
For example, in the U-Net-style structure of the first model, the feature map output by the first feature extraction layer is merged with the feature map output by the second-to-last feature decoding layer to obtain the input of the last feature decoding layer, the feature map output by the second feature extraction layer is merged with the feature map output by the third-to-last feature decoding layer to obtain the input of the second-to-last feature decoding layer, and so on; a sketch of this merging pattern is shown below.
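A minimal sketch of that pattern follows, assuming the decoding layers accept concatenated inputs; concatenation is used here as the merge operation, which is one common choice, while the description above only states that the feature maps are merged.

```python
import torch

def decode_with_skips(encoder_features, decoder_layers):
    """encoder_features: encoder outputs [f1, ..., fK], from shallow to deep.
    decoder_layers: feature decoding layers [d1, ..., dK], applied from deep to shallow.
    After the first decoding layer, each decoding layer receives the previous decoded
    feature map merged (here: concatenated) with the encoder feature map of matching
    depth, mirroring the U-Net-style pattern described above."""
    x = decoder_layers[0](encoder_features[-1])
    for idx, layer in enumerate(decoder_layers[1:], start=1):
        skip = encoder_features[-(idx + 1)]            # progressively shallower encoder output
        x = layer(torch.cat([x, skip], dim=1))         # merge, then decode
    return x
```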
Regarding the model scale of the first model and the second model:
in some embodiments, the number of network layers of the second model is greater than the number of network layers of the first model. Therefore, the target detection accuracy of the second model is improved through a larger number of network layers in the second model, and the target detection accuracy of the first model obtained based on the second model and knowledge distillation is further improved.
For example, the number of feature extraction layers in the second model is greater than the number of feature extraction layers in the first model, or the number of feature decoding layers in the second model is greater than the number of feature decoding layers in the first model.
In some embodiments, the number of network layers of the second model is equal to the number of network layers of the first model, where the number of channels of the feature extraction layer in the second model is greater than the number of channels of the corresponding feature extraction layer in the first model, and/or the number of channels of the feature decoding layer in the second model is greater than the number of channels of the corresponding feature decoding layer in the first model. Therefore, in the feature extraction layer and/or the feature decoding layer of the second model, the image features which are not easy to capture by the first model are captured through the larger model capacity, so that the feature extraction accuracy of the first model obtained according to the second model and knowledge distillation training is improved, and the target detection accuracy of the first model is improved.
As an example, the number of channels of a feature extraction layer in the second model being greater than the number of channels of the corresponding feature extraction layer in the first model means: the numbers of channels of the first feature extraction layer, the second feature extraction layer, ..., and the last feature extraction layer of the second model are respectively greater than the numbers of channels of the first feature extraction layer, the second feature extraction layer, ..., and the last feature extraction layer of the first model. The number of channels of a feature extraction layer includes the number of input channels and the number of output channels of that layer.
Further, the number of channels of the feature extraction layer in the second model is a preset multiple of the number of channels of the corresponding feature extraction layer in the first model, and/or the number of channels of the feature decoding layer in the second model is a preset multiple of the number of channels of the corresponding feature decoding layer in the first model, wherein the preset multiple is greater than 1. Therefore, the size relation between the scale of the feature extraction layer and/or the feature decoding layer in the second model and the scale of the feature extraction layer and/or the feature decoding layer in the first model is more stably established through the multiple relation.
In some embodiments, based on the model structure of the first model, one possible implementation manner of S202 includes: in a feature extraction layer of the first model, feature extraction is performed on the image through convolution operation and down-sampling operation, and in a feature decoding layer of the first model, feature decoding is performed on the extracted features through convolution operation and up-sampling operation to obtain image features (also called feature activation values) of the image; and inputting the image features into the first attention map branch and the first density map branch respectively to obtain an attention map and a density map of the image, thereby realizing the extraction of the attention map and the density map of the image in the first model.
In some embodiments, one possible implementation of S203 is: perform a dot product (element-wise multiplication) of the attention map and the density map to obtain the attention-weighted density map. In this way, each pixel in the attention map is combined with the corresponding pixel in the density map, which improves the effect of combining the attention map and the density map. Performing the dot product of the attention map and the density map means treating the attention map as a first matrix of pixel values and the density map as a second matrix of pixel values, multiplying the two matrices element by element, and taking the result as the attention-weighted density map.
Furthermore, the attention map may first be processed by an activation function, and the dot product of the processed attention map and the density map is then taken to obtain the attention-weighted density map. This improves the fusion of the attention map into the density map and the accuracy with which the target detection map reflects the number and density distribution of targets in the image. The activation function is, for example, the sigmoid function.
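Written out in code, this fusion step amounts to a single line (a sketch assuming PyTorch tensors of identical shape for the two maps):

```python
import torch

def fuse_attention_and_density(attention_map: torch.Tensor, density_map: torch.Tensor) -> torch.Tensor:
    # Apply the sigmoid activation to the attention map, then take the element-wise
    # (dot) product with the density map to get the attention-weighted density map.
    return torch.sigmoid(attention_map) * density_map
```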
Subsequently, the training process of the first model and the training process of the second model are described by embodiments.
The following points need to be explained:
1. The training process of the first model is carried out separately from the training process of the second model.
2. The training process of the first model and the training process of the second model may be carried out on the same device or on different devices.
3. The application of the first model to target detection on images and the training process of the first model may be carried out on the same device or on different devices. For example, the trained first model may be deployed on a lightweight device, and target detection on images may be performed on that lightweight device; the first model may be trained on a lightweight device or on a device such as a computer or server, and the training process of the second model may be carried out on a device such as a computer or server.
Referring to fig. 3, fig. 3 is a schematic flowchart of a training process of a first model in a target detection method according to an embodiment of the present disclosure. As shown in fig. 3, one training process of the first model includes:
s301: and performing feature extraction and decoding on the sample image through the first model.
According to the description of the network structure of the first model, the first model comprises a plurality of network layers, and the plurality of network layers comprise an input layer, an intermediate layer and an output layer. In one example, the first model includes an input layer, a plurality of feature extraction layers, a plurality of feature decoding layers, a plurality of convolution layers in a first attention map branch, a plurality of convolution layers in a first density map branch, and an output layer fusing the attention map and the density map and outputting a target detection map, the plurality of feature extraction layers, the plurality of feature decoding layers, the plurality of convolution layers in the first attention map branch, and the plurality of convolution layers in the first density map branch being intermediate layers.
In this step, feature extraction and decoding are performed on the sample image in the sample data through the first model, so that the image features output by each intermediate layer of the first model and the feature map output by the output layer can be obtained. The feature map is the target detection map obtained by the first model performing target detection on the sample image; for ease of distinction, the feature map of the sample image output by the output layer of the first model is called the first feature map.
S302: and extracting and decoding the characteristics of the sample image through the second model.
According to the description of the network structure of the second model, the second model includes a plurality of network layers, and the plurality of network layers include an input layer, an intermediate layer, and an output layer. In one example, the second model includes an input layer, a plurality of feature extraction layers, a plurality of feature decoding layers, a plurality of convolution layers in a second attention map branch, a plurality of convolution layers in a second density map branch, and an output layer fusing the attention map and the density map and outputting a target detection map, the plurality of feature extraction layers, the plurality of feature decoding layers, the plurality of convolution layers in the second attention map branch, and the plurality of convolution layers in the second density map branch being intermediate layers.
In this step, feature extraction and decoding are performed on the sample image in the sample data through the second model, so that image features output by each intermediate layer and a feature map output by the output layer in the second model can be obtained, and the feature map is a target detection map obtained by performing target detection on the sample image through the second model. For the sake of convenience of distinction, the feature map of the sample image output by the output layer in the second model is referred to as a second feature map.
S303: and determining a function value of the first loss function according to the image characteristics extracted by the first model and the image characteristics extracted by the second model.
In this step, a difference between the image feature extracted by the first model and the image feature extracted by the second model is determined, a function value of the distillation loss function in the first loss function is determined according to the difference between the image feature extracted by the first model and the image feature extracted by the second model, and then the function value of the first loss function is determined according to the function value of the distillation loss function. If the first loss function includes only the distillation loss function, the function value of the distillation loss function is the function value of the first loss function.
S304: the first model is modified based on the loss value of the first loss function.
In this step, after the function value of the first loss function is determined, the model parameters of the first model are adjusted based on the function value of the first loss function and an optimization algorithm (for example, a gradient descent algorithm), thereby modifying the first model. In this process, the distillation loss function in the first loss function guides the image features extracted by the first model toward the image features extracted by the second model, so that the target detection accuracy of the first model approaches that of the second model. The current training process of the first model ends after the first model has been modified. At that point, it is determined whether the function value of the first loss function is smaller than a first threshold; if so, the trained first model is obtained, otherwise the next training of the first model continues until the function value of the first loss function is smaller than the first threshold.
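To make the flow of S301 to S304 concrete, here is a hedged sketch of one training iteration in PyTorch. The model interfaces, the loss-function dictionary, the optimizer choice and the threshold handling are assumptions made for illustration; the individual loss terms are defined later in this description.

```python
import torch

def train_step(first_model, second_model, optimizer, images, labels, loss_fns, first_threshold):
    """One training iteration of the first model (S301-S304). The second model
    is the trained teacher and is not updated here."""
    second_model.eval()
    with torch.no_grad():
        _, _, t_det = second_model(images)     # S302: second-model (teacher) forward pass
    s_att, _, s_det = first_model(images)      # S301: first-model (student) forward pass

    # S303: first loss function = branch losses + distillation losses.
    # (Assumes both models store their intermediate feature maps after a forward
    # pass, e.g. first_model.intermediate_features; a real implementation might
    # collect them with forward hooks instead.)
    loss = (loss_fns["den"](s_det, labels["density"])              # first density map branch loss
            + 0.1 * loss_fns["att"](s_att, labels["attention"])    # first attention map branch loss
            + loss_fns["dis"](first_model.intermediate_features,
                              second_model.intermediate_features)  # channel distillation loss
            + loss_fns["ssim"](s_det, t_det))                      # structural similarity loss

    # S304: modify the first model by gradient descent on the first loss function.
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item() < first_threshold   # training stops once the loss falls below the threshold
```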
In some embodiments, the distillation loss function comprises a structural similarity loss function. In this case, one possible implementation of S303 includes: determining the function value of the structural similarity loss function according to the difference between the first feature map output by the output layer of the first model and the second feature map output by the output layer of the second model. In this way, training of the first model is guided by the difference between the first feature map and the second feature map reflected by the structural similarity loss function, so that the first model is optimized toward making the first feature map it outputs approach the second feature map output by the second model.
Optionally, the structural similarity between the first feature map and the second feature map is determined, where the structural similarity includes at least one of the following: brightness similarity, contrast similarity, and pixel vector similarity. The function value of the structural similarity loss function is then determined based on the structural similarity between the first feature map and the second feature map. The similarity of the two feature maps in one or more of brightness, contrast and pixel vector conversely reflects their degree of difference; for example, the greater the brightness similarity of the first feature map and the second feature map, the smaller the difference in their brightness. Therefore, combining the similarity of the first feature map and the second feature map in one or more of brightness, contrast and pixel vector yields the function value of the structural similarity loss function and improves the accuracy of that function value.
Specifically, the first feature map and the second feature map may be compared in terms of one or more of brightness, contrast, and pixel vector, so as to obtain one or more comparison results of brightness similarity, contrast similarity, and pixel vector similarity of the first feature map and the second feature map. And combining the one or more comparison results to obtain the structural similarity of the first characteristic diagram and the second characteristic diagram.
Optionally, the structural similarity between the first feature map and the second feature map may be determined using the Structural Similarity (SSIM) method, since SSIM provides a perception measure that is closer to the Human Visual System (HVS), which improves the accuracy of the computed structural similarity between the first feature map and the second feature map.
In the SSIM method, the first feature map and the second feature map are respectively normalized, and the brightness, contrast and pixel vector similarities of the normalized first feature map and the normalized second feature map are computed to obtain the brightness similarity, contrast similarity and pixel vector similarity of the first feature map and the second feature map. The brightness similarity, contrast similarity and pixel vector similarity are then multiplied to obtain the structural similarity between the first feature map and the second feature map.
Further, following the standard SSIM formulation, the brightness similarity, contrast similarity and pixel vector similarity of the first feature map and the second feature map may respectively be computed as:

L(a, b) = (2 μ_a μ_b + c_1) / (μ_a² + μ_b² + c_1)

C(a, b) = (2 σ_a σ_b + c_2) / (σ_a² + σ_b² + c_2)

S(a, b) = (σ_ab + c_3) / (σ_a σ_b + c_3)

where μ_a and μ_b respectively denote the mean pixel values of the first feature map and the second feature map, σ_a² and σ_b² denote the variances of their pixel values, σ_ab denotes the covariance between their pixel values, and c_1, c_2 and c_3 are small constants used to avoid division by zero. L(a, b) denotes the brightness similarity between the first feature map and the second feature map, C(a, b) denotes their contrast similarity, and S(a, b) denotes their pixel vector similarity.

At this point, the structural similarity between the first feature map and the second feature map may be computed as: SSIM(a, b) = L(a, b) × C(a, b) × S(a, b), where SSIM(a, b) denotes the structural similarity between the first feature map and the second feature map. The structural similarity loss function can then be expressed as: L_ssim = 1 - SSIM(a, b). Therefore, after the structural similarity between the first feature map and the second feature map is obtained, the function value of the structural similarity loss function can be calculated.
Optionally, in addition to directly determining the structural similarity between the whole first feature map and the whole second feature map, the first feature map and the second feature map may each be divided into a plurality of image regions; the structural similarities between the image regions of the first feature map and the corresponding image regions of the second feature map are determined, and the structural similarity of the first feature map and the second feature map is then determined from these region-level structural similarities. Dividing the feature maps into image regions in this way improves the accuracy of the structural similarity between the first feature map and the second feature map.
The image regions in the first feature map and the second feature map have the same size. According to their positions in the feature maps, the image regions of the first feature map correspond one to one with the image regions of the second feature map, and two image regions at corresponding positions in the first feature map and the second feature map form an image region pair.
For each image region pair formed from the image regions of the first feature map and the image regions of the second feature map, the structural similarity between the two image regions of the pair, i.e. the structural similarity of the image region pair, is determined. The structural similarity between the first feature map and the second feature map is then determined from the structural similarities of the image region pairs; for example, it can be determined as the mean of the structural similarities of all image region pairs.
For example, the first feature map and the second feature map are each divided into I image regions of the same size. The structural similarity between each image region in the first feature map and the corresponding image region in the second feature map is determined by SSIM, giving the structural similarities of the I image region pairs. The mean of the structural similarities of the I image region pairs is then taken as the structural similarity between the first feature map and the second feature map, which can be expressed as:

MSSSIM(a, b) = (1/I) × Σ_{i=1}^{I} SSIM(a_i, b_i)

where a_i denotes the i-th image region in the first feature map, b_i denotes the i-th image region in the second feature map, SSIM(a_i, b_i) denotes the structural similarity between a_i and b_i, and I denotes the number of image regions in the first feature map, which equals the number of image regions in the second feature map. In this case, the structural similarity loss function can be expressed as: L_ssim = 1 - MSSSIM(a, b).
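As an illustration, the region-wise structural similarity loss could be sketched as follows. This is a simplified implementation under assumed choices: non-overlapping regions obtained by splitting along the width, hypothetical constant values, and statistics computed over whole regions rather than with a sliding window.

```python
import torch

def region_ssim(a: torch.Tensor, b: torch.Tensor, c1=1e-4, c2=9e-4, c3=4.5e-4):
    """SSIM between two feature-map regions: brightness * contrast * pixel-vector terms."""
    mu_a, mu_b = a.mean(), b.mean()
    var_a, var_b = a.var(unbiased=False), b.var(unbiased=False)
    cov_ab = ((a - mu_a) * (b - mu_b)).mean()
    luminance = (2 * mu_a * mu_b + c1) / (mu_a ** 2 + mu_b ** 2 + c1)
    contrast = (2 * var_a.sqrt() * var_b.sqrt() + c2) / (var_a + var_b + c2)
    structure = (cov_ab + c3) / (var_a.sqrt() * var_b.sqrt() + c3)
    return luminance * contrast * structure

def ssim_distillation_loss(first_map: torch.Tensor, second_map: torch.Tensor, regions: int = 4):
    """L_ssim = 1 - mean SSIM over matching regions of the two output feature maps."""
    first_parts = first_map.chunk(regions, dim=-1)     # split along the width for simplicity
    second_parts = second_map.chunk(regions, dim=-1)
    similarities = [region_ssim(a, b) for a, b in zip(first_parts, second_parts)]
    return 1 - torch.stack(similarities).mean()
```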
In some embodiments, the number of network layers of the second model is equal to that of the first model, while the number of channels of each feature extraction layer in the second model is a preset multiple of the number of channels of the corresponding feature extraction layer in the first model, and/or the number of channels of each feature decoding layer in the second model is a preset multiple of the number of channels of the corresponding feature decoding layer in the first model. Under this structure, the intermediate layers of the teacher network (the second model) contain more channels and can extract more information. The outputs of the intermediate layers of the teacher model can therefore be used as supervision information to guide the training of the student model (the first model), so that the student model obtains information not only from the output layer but also from the intermediate layers. To this end, the distillation loss function may include a channel loss function whose function value reflects the difference between the intermediate-layer outputs of the first model and those of the second model; training of the first model is guided by this difference, enriching the intermediate layers of the first model. In other words, the feature extraction capability and/or feature decoding capability of the intermediate layers of the first model is made to approach that of the intermediate layers of the second model.
When the distillation loss function comprises the channel loss function, a channel convolution layer is connected between each intermediate layer of the first model and the corresponding intermediate layer of the second model; the channel convolution layer establishes a mapping between the image features output by the intermediate layer of the first model and the image features output by the intermediate layer of the second model. In this mapping, since the number of network layers of the second model is equal to the number of network layers of the first model, the network layers of the first model and the network layers of the second model are mapped one to one.
Here, the intermediate layers of the first model include the feature extraction layers and/or the feature decoding layers of the first model, and the intermediate layers of the second model include the feature extraction layers and/or the feature decoding layers of the second model. In this way, the channel convolution layers establish a one-to-one mapping between the feature extraction layers of the first model and the feature extraction layers of the second model, and/or a one-to-one mapping between the feature decoding layers of the first model and the feature decoding layers of the second model.
In this case, another possible implementation of S303 includes: determining the function value of the channel distillation loss function according to the differences between the mutually mapped image features among the image features output by the intermediate layers of the first model and the image features output by the intermediate layers of the second model. In this way, training of the first model is guided by the differences, reflected by the channel distillation loss function, between the intermediate-layer outputs of the first model and those of the second model, so that the first model is optimized toward making its intermediate-layer outputs approach those of the second model, improving the accuracy of feature extraction by the first model.
Optionally, when determining the function value of the channel distillation loss function from the differences between the mutually mapped image features, the difference between each pair of mutually mapped image features may be computed with the L2 norm, and the function value of the channel distillation loss function is determined as the sum of these differences. Further, the channel distillation loss function can be expressed as:

L_dis = Σ_{m=1}^{M} || z̃_m(x) - h_m(z_m(x)) ||₂²

where z̃_m(x) is the image feature output by the m-th intermediate layer of the second model, z_m(x) is the image feature output by the m-th intermediate layer of the first model, z̃_m(x) and z_m(x) are a pair of mutually mapped image features, || · ||₂² denotes computing the difference between z̃_m(x) and h_m(z_m(x)) with the L2 norm, h_m denotes the channel convolution layer, M denotes the number of intermediate layers of the second model (for example, the number of feature extraction layers, the number of feature decoding layers, or the total number of feature extraction and feature decoding layers), and L_dis denotes the channel distillation loss function.
The channel convolution layer is a learnable convolution layer, and the parameters of the learnable convolution layer can be adjusted in the training process of the first model, so that the first model and the channel convolution layer are optimized through the mean square error between the image features which are mapped with each other, and the function value of the channel distillation loss function is minimized.
Furthermore, each channel convolution layer is a learnable 1 × 1 convolution layer; the output of an intermediate layer of the first model is mapped through this learnable 1 × 1 convolution to the output of the corresponding intermediate layer of the second model, raising the number of channels of the intermediate layer of the first model to match that of the corresponding intermediate layer of the second model.
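A possible sketch of this channel distillation term is shown below, with one learnable 1 × 1 convolution per intermediate layer mapping the student (first-model) feature to the channel width of the matching teacher (second-model) feature. Passing the intermediate features in as lists is an assumption made for simplicity; a real implementation might collect them with forward hooks.

```python
import torch
import torch.nn as nn

class ChannelDistillationLoss(nn.Module):
    """L_dis = sum_m || teacher_feat_m - h_m(student_feat_m) ||_2^2, where each
    h_m is a learnable 1x1 convolution raising the student channels to match
    the corresponding teacher intermediate layer."""

    def __init__(self, student_channels, teacher_channels):
        super().__init__()
        self.mappers = nn.ModuleList(
            [nn.Conv2d(s, t, kernel_size=1) for s, t in zip(student_channels, teacher_channels)])

    def forward(self, student_feats, teacher_feats):
        loss = 0.0
        for h_m, z_m, z_teacher in zip(self.mappers, student_feats, teacher_feats):
            loss = loss + torch.sum((z_teacher - h_m(z_m)) ** 2)   # squared L2 difference
        return loss
```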
In some embodiments, based on the model structure that the first model includes the feature extraction layer, the feature decoding layer, the first attention map branch and the first density map branch, the label data corresponding to the sample image includes the actual attention map of the sample image and the actual density map of the sample image, and the first loss function further includes a first attention map branch loss function and a first density map branch loss function. At this time, one possible implementation manner of S303 includes: inputting image features output by a feature decoding layer in the first model into a first attention map branch and a first density map branch respectively to obtain a first attention map and a first density map; and respectively determining the function value of the branch loss function of the first attention map and the function value of the branch loss function of the first density map according to the first attention map, the first density map, the actual attention map and the actual density map. Therefore, the training process of the first model is guided based on the function value of the branch loss function of the first attention map and the function value of the branch loss function of the first density map, the accuracy of the first attention map and the first density map output by the first model is improved, the first attention map output by the first model gradually approaches the actual attention map of the sample image through continuous training, and the first density map output by the first model gradually approaches the actual density map of the sample image.
Alternatively, in the process of determining the function value of the first attention map branch loss function and the function value of the first density map branch loss function according to the first attention map, the first density map, the actual attention map and the actual density map, the function value of the first attention map branch loss function may be determined according to the difference between the first attention map and the actual attention map, and the function value of the first density map branch loss function may be determined according to the difference between the first density map and the actual density map. For example, the mean of the pixel differences between the first attention map and the actual attention map is taken as the function value of the first attention map branch loss function, and the mean of the pixel differences between the first density map and the actual density map is taken as the function value of the first density map branch loss function.
Optionally, in the process of determining the function value of the branch loss function of the first attention map and the function value of the branch loss function of the first density map according to the first attention map, the first density map, the actual attention map and the actual density map, respectively, the first attention map and the first density map may be fused (the specific fusion manner may refer to the description of the foregoing embodiment), so as to obtain the first target detection map of the sample image. And determining a function value of the branch loss function of the first attention map according to the difference between the first attention map and the actual attention map. And determining a function value of the branch loss function of the first density map according to the difference between the first target detection map and the actual density map. Therefore, the first attention graph of the sample image output by the first model approaches to the actual attention graph of the sample image, the first target detection graph of the sample image output by the first model approaches to the actual density graph of the sample image, and the accuracy of the target detection graph output by the first model is improved.
For example, inputting the image features output by the feature decoding layer of the first model into the first attention map branch to obtain the first attention map can be expressed as: A = f_att(X); inputting the image features output by the feature decoding layer of the first model into the first density map branch to obtain the first density map can be expressed as: D = f_den(X); and the first target detection map of the sample image can be obtained by taking the dot product of the first attention map and the first density map, which can be expressed as: D_F = σ(A) ⊙ D. Here X denotes the image features output by the feature decoding layer of the first model, f_att and f_den denote the first attention map branch and the first density map branch respectively, A denotes the first attention map of the sample image, D denotes the first density map of the sample image, D_F denotes the first target detection map of the sample image, ⊙ denotes the dot product (element-wise multiplication), and σ(·) denotes the activation function, which is, for example, the sigmoid function.
Further, the first attention map branch loss function may be expressed as:

L_att = (1/N) × Σ_{i=1}^{N} ||A_i - Â_i||²

where N denotes the number of sample images used in one training process of the first model, A_i denotes the first attention map of the i-th sample image obtained after processing by the first model, Â_i denotes the actual attention map of the i-th sample image, and L_att denotes the first attention map branch loss function. In this way, the differences between the first attention maps and the actual attention maps of all sample images in one training process of the first model are averaged to obtain the function value of the first attention map branch loss function, which improves the accuracy with which this loss function reflects the difference between the first attention maps and the actual attention maps of the sample images.
Optionally, the formula of the first density map branch loss function can be expressed as:

L_den = (1/N) × Σ_{i=1}^{N} ‖D_F,i − D̂_i‖²

wherein D_F,i represents the first target detection map of the i-th sample image obtained after processing by the first model, D̂_i represents the actual density map of the i-th sample image, and L_den represents the first density map branch loss function. Therefore, the differences between the first target detection maps of all sample images and the actual density maps in one training process of the first model are averaged to obtain the function value of the first density map branch loss function, which improves how accurately the first density map branch loss function reflects the difference between the first target detection map of a sample image and the actual density map.
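For ease of understanding only, the two branch losses described above may be sketched as follows (PyTorch-style code; the exact form of the per-pixel difference, here a squared difference averaged over the batch, is an assumption consistent with the averaging described above):

```python
import torch

def attention_branch_loss(A: torch.Tensor, A_gt: torch.Tensor) -> torch.Tensor:
    """L_att: mean difference between the activated predicted attention maps and the actual ones.

    A: first attention maps of a batch, assumed shape (N, 1, H, W); A_gt: actual attention maps.
    """
    return torch.mean((torch.sigmoid(A) - A_gt) ** 2)

def density_branch_loss(D_F: torch.Tensor, D_gt: torch.Tensor) -> torch.Tensor:
    """L_den: mean difference between the fused target detection maps and the actual density maps."""
    return torch.mean((D_F - D_gt) ** 2)
```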
In combination with the above embodiments, it can be seen that the first loss function may include one or more of a first attention map branch loss function, a first density map branch loss function, and a distillation loss function, and the distillation loss function may include one or more of a structural similarity loss function and a channel distillation loss function.
When the first loss function includes a first attention map branch loss function, a first density map branch loss function, and a distillation loss function, one possible implementation of S304 includes: weighting and summing the function value of the first attention map branch loss function, the function value of the first density map branch loss function and the function value of the distillation loss function to obtain a final loss value of the first model; and correcting the first model according to the final loss value of the first model. Therefore, the function value of the first attention map branch loss function, the function value of the first density map branch loss function, and the function value of the distillation loss function, which are obtained based on the sample data, the second model, and the first model under training, are combined, and the first model is corrected according to the final loss value obtained by combining them, thereby improving the training effect of the first model in multiple respects.
Optionally, based on the distillation loss function including a structural similarity loss function and a channel distillation loss function, the overall optimization objective L_s of the first model can be expressed as:

L_s = w_1 × L_den + w_2 × L_att + w_3 × L_dis + w_4 × L_ssim

wherein w_1, w_2, w_3 and w_4 are preset weights. Further, for example, L_s = L_den + 0.1 × L_att + L_dis + L_ssim. Based on this optimization objective, a final loss value of the first model can be obtained.
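Under the same assumptions as the sketches above, the weighted combination of the loss terms may be written as follows (the default weights follow the example instantiation given above; all names are illustrative):

```python
import torch

def student_total_loss(l_den: torch.Tensor, l_att: torch.Tensor,
                       l_dis: torch.Tensor, l_ssim: torch.Tensor,
                       w1: float = 1.0, w2: float = 0.1,
                       w3: float = 1.0, w4: float = 1.0) -> torch.Tensor:
    """L_s = w1*L_den + w2*L_att + w3*L_dis + w4*L_ssim (defaults mirror L_den + 0.1*L_att + L_dis + L_ssim)."""
    return w1 * l_den + w2 * l_att + w3 * l_dis + w4 * l_ssim
```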
Referring to fig. 4, fig. 4 is a schematic flowchart of a training process of a second model in the target detection method according to the embodiment of the present disclosure. As shown in fig. 4, one training process of the second model includes:
S401, performing feature extraction and decoding on the sample image through the second model.
In the process of training the second model, the sample data used may be the same as the sample data used for training the first model, or may be different from the sample data used for training the first model. Wherein the sample data used to train the second model includes the sample image and label data of the sample image.
In this step, feature extraction and decoding are performed on the sample image in the sample data through the second model, so that image features output by each intermediate layer and a feature map output by the output layer in the second model can be obtained, and the feature map is a target detection map obtained by performing target detection on the sample image through the second model. For the sake of convenience of distinction, the feature map of the sample image output by the output layer in the second model of the training process is referred to as a third feature map.
S402, determining a function value of the second loss function according to the third feature map output in the second model and label data corresponding to the sample image.
In this step, the third feature map output in the second model is obtained, the difference between the third feature map and the label data corresponding to the sample image is determined, and the function value of the second loss function is determined according to the difference between the third feature map and the label data corresponding to the sample image.
S403, correcting the second model according to the function value of the second loss function.
In this step, based on the function value of the second loss function and an optimization algorithm (e.g., a gradient descent algorithm), the model parameters of the second model are adjusted to correct the second model, so that in the process of training the second model, the second model is optimized based on the second loss function, the third feature map output by the second model approaches the label data corresponding to the sample image, and the target detection accuracy of the second model is improved. After the second model is corrected, this training process of the second model ends. At this time, it is determined whether the function value of the second loss function is smaller than the second threshold; if so, the trained second model is obtained, otherwise the next training of the second model is continued until the function value of the second loss function is smaller than the second threshold. The first threshold for constraining the training of the first model and the second threshold for constraining the training of the second model may be the same threshold or different thresholds. For example, the second threshold is smaller than the first threshold, so that the training of the second model can be constrained by a smaller threshold on a device with stronger computing power, which improves the target detection accuracy of the second model and in turn the target detection accuracy of the first model.
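A minimal sketch of one possible training loop of the second model is given below, assuming a PyTorch-style teacher module whose forward pass returns the third feature map, an MSE-based second loss function, and the threshold-based stopping rule described above; the optimizer choice, threshold value and all names are illustrative assumptions:

```python
import torch

def train_teacher(teacher, dataloader, second_threshold=0.01, max_epochs=100, lr=1e-4):
    """Train the second (teacher) model until its loss falls below the second threshold."""
    optimizer = torch.optim.Adam(teacher.parameters(), lr=lr)
    loss_fn = torch.nn.MSELoss()
    for epoch in range(max_epochs):
        epoch_loss = 0.0
        for images, labels in dataloader:
            feature_map = teacher(images)        # S401: feature extraction and decoding
            loss = loss_fn(feature_map, labels)  # S402: function value of the second loss
            optimizer.zero_grad()
            loss.backward()                      # S403: correct the second model
            optimizer.step()
            epoch_loss += loss.item()
        if epoch_loss / len(dataloader) < second_threshold:
            break                                # training of the second model is complete
    return teacher
```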
In some embodiments, based on a model structure in which the second model includes a feature extraction layer, a feature decoding layer, a second attention map branch and a second density map branch, the label data corresponding to the sample image includes the actual attention map of the sample image and the actual density map of the sample image, and the second loss function includes a second attention map branch loss function and a second density map branch loss function. At this time, one possible implementation of S402 includes: inputting the third feature map into the second attention map branch and the second density map branch respectively to obtain a second attention map and a second density map; and respectively determining the function value of the second attention map branch loss function and the function value of the second density map branch loss function according to the second attention map, the second density map, the actual attention map and the actual density map. Therefore, the training process of the second model is guided based on the function value of the second attention map branch loss function and the function value of the second density map branch loss function, which improves the accuracy of the second attention map and the second density map output by the second model, so that through continuous training the second attention map output by the second model gradually approaches the actual attention map of the sample image, and the second density map output by the second model gradually approaches the actual density map of the sample image.
Optionally, in the process of determining the function value of the branch loss function of the second attention map and the function value of the branch loss function of the second density map according to the second attention map, the second density map, the actual attention map and the actual density map, respectively, the function value of the branch loss function of the second attention map may be determined according to a difference between the second attention map and the actual attention map, and the function value of the branch loss function of the second density map may be determined according to a difference between the second density map and the actual density map. For example, the mean value of the pixel difference values of the second attention map and the actual attention map is determined as the function value of the branch loss function of the second attention map, and the mean value of the pixel difference values of the second density map and the actual density map is determined as the function value of the branch loss function of the second density map.
Optionally, in the process of determining the function value of the branch loss function of the second attention map and the function value of the branch loss function of the second density map according to the second attention map, the second density map, the actual attention map and the actual density map, respectively, the second attention map and the second density map may be fused (the specific fusion manner may refer to the description of the foregoing embodiment), so as to obtain a second target detection map of the sample image. And determining a function value of the branch loss function of the second attention map according to the difference between the second attention map and the actual attention map. And determining a function value of the branch loss function of the second density map according to the difference between the second target detection map and the actual density map. Therefore, the second attention graph of the sample image output by the second model approaches to the actual attention graph of the sample image, and the second target detection graph of the sample image output by the second model approaches to the actual density graph of the sample image, so that the accuracy of the target detection graph output by the second model is improved.
For example, inputting the image features output by the feature decoding layer in the second model into the second attention map branch, the obtained second attention map can be expressed as: AT = f_T(X_T); inputting the image features output by the feature decoding layer in the second model into the second density map branch, the obtained second density map can be expressed as: DT = g_T(X_T); a second target detection map of the sample image may be obtained by performing a dot product operation on the second attention map and the second density map, and its calculation formula can be expressed as: DT_F = σ(AT) ⊙ DT.
Wherein X_T represents the image features output by the feature decoding layer in the second model, AT represents the second attention map of the sample image, DT represents the second density map of the sample image, and DT_F represents the second target detection map of the sample image.
Further, the formula of the second attention map branch loss function can be expressed as:

L_att = (1/N) × Σ_{i=1}^{N} ‖σ(AT_i) − Â_i‖²

wherein N represents the number of sample images used in one training process of the second model, AT_i represents the second attention map of the i-th sample image obtained after processing by the second model, Â_i represents the actual attention map of the i-th sample image, and L_att here represents the second attention map branch loss function. Therefore, the differences between the second attention maps of all sample images and the actual attention maps in one training process of the second model are averaged to obtain the function value of the second attention map branch loss function, which improves how accurately the second attention map branch loss function reflects the difference between the second attention map of a sample image and the actual attention map.
Optionally, the formula of the second density map branch loss function can be expressed as:

L_den = (1/N) × Σ_{i=1}^{N} ‖DT_F,i − D̂_i‖²

wherein DT_F,i represents the second target detection map of the i-th sample image obtained after processing by the second model, D̂_i represents the actual density map of the i-th sample image, and L_den here represents the second density map branch loss function. Therefore, the differences between the second target detection maps of all sample images and the actual density maps in one training process of the second model are averaged to obtain the function value of the second density map branch loss function, which improves how accurately the second density map branch loss function reflects the difference between the second target detection map of a sample image and the actual density map.
In combination with the above embodiments, it can be seen that the second loss function may include a second attention map branch loss function and/or a second density map branch loss function.
When the second loss function includes a second attention map branch loss function and a second density map branch loss function, one possible implementation of S403 includes: weighting and summing the function value of the second attention map branch loss function and the function value of the second density map branch loss function to obtain a final loss value of the second model; and correcting the second model according to the final loss value of the second model. Therefore, the function value of the second attention map branch loss function and the function value of the second density map branch loss function, which are obtained based on the sample data and the second model under training, are combined, and the second model is corrected according to the final loss value obtained by combining them, thereby improving the training effect of the second model in multiple respects.
Optionally, the overall optimization objective L_T of the second model can be expressed as:

L_T = q_1 × L_den + q_2 × L_att

wherein q_1 and q_2 are preset weights, for example q_1 = 1 and q_2 = 0.1. Based on this optimization objective, a final loss value of the second model can be obtained.
Therefore, based on the above embodiments, the second model is trained first, the first model is then trained based on the trained second model, and the trained first model is applied to target detection on images. The trained first model is a lightweight model and can be deployed on lightweight devices. Thus, on the one hand, the efficiency of target detection by the first model is improved, and real-time detection of targets on lightweight devices can be realized; on the other hand, the target detection accuracy of the first model is improved by the structural similarity loss function representing the difference between the output of the output layer of the first model and the output of the output layer of the second model, and/or the channel distillation loss function representing the difference between the output of the intermediate layer of the first model and the output of the intermediate layer of the second model.
Referring to fig. 5, fig. 5 is a diagram illustrating a model structure of a first model and a second model according to an embodiment of the present disclosure. As shown in fig. 5, the first model and the second model are deep learning models, and the first model and the second model respectively include a plurality of network layers, it can be seen that the model structure of the first model is the same as the model structure of the second model, but the width of the network layer of the first model is smaller than the width of the network layer of the second model, that is, the number of channels of the network layer of the first model is smaller than the number of channels of the network layer of the second model, so that the second model can extract more image features than the first model.
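For illustration only, the relationship between the two models' widths may be sketched as follows (a 4-stage convolutional feature extractor with assumed channel counts; the concrete layer configuration of the models is not limited to this sketch):

```python
import torch.nn as nn

def make_backbone(width: int) -> nn.Module:
    """Identical layer structure; only the channel width (number of channels) differs."""
    layers, in_ch = [], 3
    for stage in range(4):                     # 4 feature extraction layers, as shown in fig. 5
        out_ch = width * (2 ** stage)
        layers += [nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1),
                   nn.BatchNorm2d(out_ch),
                   nn.ReLU(inplace=True)]
        in_ch = out_ch
    return nn.Sequential(*layers)

student_backbone = make_backbone(width=16)   # first model: narrower network layers
teacher_backbone = make_backbone(width=64)   # second model: same structure, more channels
```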
As shown in fig. 5, in the first model, feature extraction is performed on the input image through the network layers for feature extraction (i.e., the feature extraction layers mentioned in the above embodiment); a total of 4 feature extraction layers are shown in fig. 5. The network layers behind the feature extraction layers are feature decoding layers: the extracted image features are input into the feature decoding layers, and the feature map of the input image is obtained by performing 1 × 1 convolution operations on the image features through a plurality of feature decoding layers. Then, the feature map is input into the attention map branch and the density map branch, respectively, to obtain an attention map and a density map. The attention map and the density map are fused to obtain a density map with attention, namely the target detection map of the input image. The image processing process of the second model may refer to that of the first model and is not described again.
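Likewise, the decoding and branching pipeline of fig. 5 may be sketched as follows (the layer sizes and the single 1 × 1 decoding stage are assumptions; only the overall flow of decoding, branching and fusion follows the description above):

```python
import torch
import torch.nn as nn

class DetectionHead(nn.Module):
    """Illustrative sketch: decode features with 1x1 convolutions, split into
    attention and density branches, then fuse them into a density map with attention."""

    def __init__(self, in_ch: int = 16):
        super().__init__()
        self.decode = nn.Sequential(nn.Conv2d(in_ch, in_ch, kernel_size=1),
                                    nn.ReLU(inplace=True))
        self.attention_branch = nn.Conv2d(in_ch, 1, kernel_size=1)   # A = f(X)
        self.density_branch = nn.Conv2d(in_ch, 1, kernel_size=1)     # D = g(X)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        x = self.decode(features)          # feature decoding layers (1x1 convolutions)
        A = self.attention_branch(x)
        D = self.density_branch(x)
        return torch.sigmoid(A) * D        # density map with attention (target detection map)
```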
As shown in fig. 5, there is a 1 × 1 convolution between the first model and the second model, where the 1 × 1 convolution is the channel convolution layer in the foregoing embodiment. A mapping relationship between the image features output by the feature decoding layer of the first model and the image features output by the feature decoding layer of the second model is established by the channel convolution layer, so that the channel distillation loss, i.e., the function value of the channel distillation loss function, can be calculated based on this mapping relationship; the specific calculation process may refer to the foregoing embodiment.
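A minimal sketch of the channel convolution layer and the channel distillation loss is given below, assuming one pair of mutually mapped intermediate features and assumed channel counts (16 for the first model, 64 for the second model); with several mapped layer pairs, the per-pair values would be summed:

```python
import torch
import torch.nn as nn

# A 1x1 convolution maps the first model's features to the second model's channel
# dimension so the two intermediate outputs can be compared (channel counts assumed).
channel_conv = nn.Conv2d(in_channels=16, out_channels=64, kernel_size=1)

def channel_distillation_loss(student_feat: torch.Tensor, teacher_feat: torch.Tensor) -> torch.Tensor:
    """Squared-L2 difference between the mapped student features and the teacher features."""
    mapped = channel_conv(student_feat)            # establish the mapping relationship
    return torch.sum((mapped - teacher_feat) ** 2)
```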
As shown in fig. 5, based on the density map with attention output by the first model and the density map with attention output by the second model, the structural similarity loss, i.e., the function value of the structural similarity loss function, can be calculated; the specific calculation process may refer to the foregoing embodiment.
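A sketch of the structural similarity loss between the two attention-weighted density maps is given below, assuming a standard window-based SSIM formulation that covers the luminance, contrast and structure (pixel-vector) components over local image regions; the window size and stabilizing constants are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def ssim_loss(x: torch.Tensor, y: torch.Tensor, window: int = 11,
              c1: float = 0.01 ** 2, c2: float = 0.03 ** 2) -> torch.Tensor:
    """1 - mean SSIM between x and y, computed over local windows (image regions)."""
    pad = window // 2
    mu_x = F.avg_pool2d(x, window, stride=1, padding=pad)   # per-region luminance (mean)
    mu_y = F.avg_pool2d(y, window, stride=1, padding=pad)
    var_x = F.avg_pool2d(x * x, window, stride=1, padding=pad) - mu_x ** 2   # contrast (variance)
    var_y = F.avg_pool2d(y * y, window, stride=1, padding=pad) - mu_y ** 2
    cov_xy = F.avg_pool2d(x * y, window, stride=1, padding=pad) - mu_x * mu_y  # structure
    ssim = ((2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))
    return 1.0 - ssim.mean()   # lower loss = more similar structure
```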
Fig. 6 is a block diagram of a target detection device according to an embodiment of the present disclosure, which corresponds to the target detection method according to the above embodiment. For ease of illustration, only portions that are relevant to embodiments of the present disclosure are shown. Referring to fig. 6, the object detection apparatus includes: an acquisition unit 601, a detection unit 602, and a generation unit 603.
An acquisition unit 601 configured to acquire an image to be detected;
a detecting unit 602, configured to detect an object in an image through a first model, so as to obtain a focus map and a density map of the image;
a generating unit 603 configured to generate a target detection map of the image according to the attention map and the density map.
The first model is a lightweight deep learning model obtained based on sample data, a second model and a first loss function training, the first loss function comprises a distillation loss function for representing distillation loss between the first model and the second model, and the model scale of the second model is larger than that of the first model.
In one embodiment of the present disclosure, the sample data includes a sample image and label data corresponding to the sample image, and the one-time training process of the first model includes: performing feature extraction and decoding on the sample image through the first model; performing feature extraction and decoding on the sample image through a second model; determining a function value of the first loss function according to the image characteristics extracted by the first model and the image characteristics extracted by the second model; the first model is modified according to the loss value of the first loss function.
In one embodiment of the present disclosure, the distillation loss function comprises a structural similarity loss function, and determining a function value of the first loss function from the first model-extracted image features and the second model-extracted image features comprises: and determining a function value of the structural similarity loss function according to the difference between the first characteristic diagram output by the output layer in the first model and the second characteristic diagram output by the output layer in the second model.
In one embodiment of the present disclosure, determining a function value of a structural similarity loss function according to a difference between a first feature map output by an output layer in a first model and a second feature map output by an output layer in a second model comprises: determining the structural similarity of the first characteristic diagram and the second characteristic diagram, wherein the structural similarity comprises at least one of the following: brightness similarity, contrast similarity, pixel vector similarity; and determining a function value of the structure similarity loss function according to the structure similarity.
In one embodiment of the present disclosure, determining the structural similarity of the first feature map and the second feature map comprises: dividing the first feature map and the second feature map into a plurality of image areas respectively; determining structural similarity of a plurality of image areas of the first feature map and a plurality of image areas of the second feature map; and determining the structural similarity of the first feature map and the second feature map according to the structural similarity of the plurality of image areas of the first feature map and the plurality of image areas of the second feature map.
In one embodiment of the present disclosure, the distillation loss function further includes a channel distillation loss function, a channel convolution layer is connected between the intermediate layer of the first model and the intermediate layer of the second model, and the channel convolution layer is used for establishing a mapping relationship between the image characteristics output by the intermediate layer of the first model and the image characteristics output by the intermediate layer of the second model; determining a function value of the first loss function according to the image features extracted by the first model and the image features extracted by the second model, and further comprising: determining a function value of the channel distillation loss function according to a difference between the image features mapped to each other among the image features of the intermediate layer output of the first model and the image features of the intermediate layer output of the second model.
In one embodiment of the present disclosure, determining the function value of the channel distillation loss function according to the differences between mutually mapped image features, among the image features output by the intermediate layer of the first model and the image features output by the intermediate layer of the second model, includes: calculating the difference between each pair of mutually mapped image features through an L2 norm; and determining the function value of the channel distillation loss function as the sum of the differences between all of the mutually mapped image features.
In one embodiment of the present disclosure, the label data corresponding to the sample image includes an actual attention map of the sample image and an actual density map of the sample image, the first model includes a feature extraction layer, a feature decoding layer, a first attention map branch and a first density map branch, and the first loss function further includes a first attention map branch loss function and a first density map branch loss function; determining a function value of the first loss function based on the image features extracted by the first model and the image features extracted by the second model includes: inputting image features output by the feature decoding layer in the first model into the first attention map branch and the first density map branch respectively to obtain a first attention map and a first density map; and respectively determining the function value of the first attention map branch loss function and the function value of the first density map branch loss function according to the first attention map, the first density map, the actual attention map and the actual density map.
In one embodiment of the present disclosure, determining the function value of the first attention map branch loss function and the function value of the first density map branch loss function according to the first attention map, the first density map, the actual attention map and the actual density map, respectively, includes: fusing the first attention map and the first density map to obtain a first target detection map of the sample image; determining a function value of the first attention map branch loss function according to the difference between the first attention map and the actual attention map; and determining a function value of the first density map branch loss function according to the difference between the first target detection map and the actual density map.
In one embodiment of the present disclosure, the first attention map branch loss function is:

L_att = (1/N) × Σ_{i=1}^{N} ‖σ(A_i) − Â_i‖²

wherein N represents the number of sample images used in one training process of the first model, A_i represents the first attention map of the i-th sample image obtained after processing by the first model, Â_i represents the actual attention map of the i-th sample image, and σ() represents an activation function;

and/or the first density map branch loss function is:

L_den = (1/N) × Σ_{i=1}^{N} ‖D_F,i − D̂_i‖²

wherein D_F,i represents the first target detection map of the i-th sample image obtained after processing by the first model, and D̂_i represents the actual density map of the i-th sample image.
In one embodiment of the present disclosure, modifying the first model based on the loss value of the first loss function includes: weighting and summing the function value of the branch loss function of the first attention map, the function value of the branch loss function of the first density map and the function value of the distillation loss function to obtain a final loss value of the first model; and correcting the first model according to the final loss value of the first model.
In one embodiment of the present disclosure, the sample data includes a sample image and label data corresponding to the sample image, and the one-time training process of the second model includes: performing feature extraction and decoding on the sample image through the second model; determining a function value of a second loss function according to a third feature map output in the second model and label data corresponding to the sample image; and modifying the second model according to the function value of the second loss function.
In one embodiment of the present disclosure, the label data corresponding to the sample image includes an actual attention map of the sample image and an actual density map of the sample image, the second model includes a second attention map branch and a second density map branch, and the second loss function includes a second attention map branch loss function and a second density map branch loss function; determining a function value of the second loss function according to the third feature map output in the second model and the label data corresponding to the sample image, including: inputting the third feature map into a second attention map branch and a second density map branch respectively to obtain a second attention map and a second density map; and respectively determining the function value of the branch loss function of the second attention map and the function value of the branch loss function of the second density map according to the second attention map, the second density map, the actual attention map and the actual density map.
In one embodiment of the present disclosure, determining the function value of the second attention map branch loss function and the function value of the second density map branch loss function according to the second attention map, the second density map, the actual attention map and the actual density map, respectively, includes: fusing the second attention map and the second density map to obtain a second target detection map of the sample image; determining a function value of the second attention map branch loss function according to the difference between the second attention map and the actual attention map; and determining a function value of the second density map branch loss function according to the difference between the second target detection map and the actual density map.
In one embodiment of the present disclosure, modifying the second model according to the function value of the second loss function includes: weighting and summing the function value of the branch loss function of the second attention graph and the function value of the branch loss function of the second density graph to obtain a final loss value of the second model; and correcting the second model according to the final loss value of the second model.
The device provided in this embodiment may be configured to implement the technical solutions of the method embodiments, and the implementation principles and technical effects are similar, which are not described herein again.
Referring to fig. 7, a schematic structural diagram of an electronic device 700 suitable for implementing the embodiment of the present disclosure is shown, where the electronic device 700 may be a terminal device or a server. The terminal device may include, but is not limited to, a mobile terminal such as a mobile phone, a notebook computer, a digital broadcast receiver, a Personal Digital Assistant (PDA), a tablet computer (PAD), a Portable Multimedia Player (PMP) or a vehicle-mounted terminal (e.g., a vehicle-mounted navigation terminal), and a fixed terminal such as a digital TV or a desktop computer. The electronic device shown in fig. 7 is only an example, and should not impose any limitation on the functions and the scope of use of the embodiments of the present disclosure.
As shown in fig. 7, the electronic device 700 may include a processing means (e.g., a central processing unit, a graphics processor, etc.) 701, which may perform various suitable actions and processes according to a program stored in a Read Only Memory (ROM) 702 or a program loaded from a storage means 708 into a Random Access Memory (RAM) 703. In the RAM 703, various programs and data necessary for the operation of the electronic apparatus 700 are also stored. The processing device 701, the ROM 702, and the RAM 703 are connected to each other by a bus 704. An input/output (I/O) interface 705 is also connected to bus 704.
Generally, the following devices may be connected to the I/O interface 705: input devices 706 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; an output device 707 including, for example, a Liquid Crystal Display (LCD), a speaker, a vibrator, and the like; storage 708 including, for example, magnetic tape, hard disk, etc.; and a communication device 709. The communication means 709 may allow the electronic device 700 to communicate wirelessly or by wire with other devices to exchange data. While fig. 7 illustrates an electronic device 700 having various means, it is to be understood that not all illustrated means are required to be implemented or provided. More or fewer devices may alternatively be implemented or provided.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such embodiments, the computer program may be downloaded and installed from a network via the communication means 709, or may be installed from the storage means 708, or may be installed from the ROM 702. The computer program, when executed by the processing device 701, performs the above-described functions defined in the methods of embodiments of the present disclosure.
It should be noted that the computer readable medium in the present disclosure can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In contrast, in the present disclosure, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.
The computer readable medium may be embodied in the electronic device; or may exist separately without being assembled into the electronic device.
The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to perform the method shown in the above embodiments.
Computer program code for carrying out operations for aspects of the present disclosure may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in the embodiments of the present disclosure may be implemented by software or hardware. Where the name of a unit does not in some cases constitute a limitation of the unit itself, for example, the first retrieving unit may also be described as a "unit for retrieving at least two internet protocol addresses".
The functions described herein above may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: field Programmable Gate Arrays (FPGAs), application Specific Integrated Circuits (ASICs), application Specific Standard Products (ASSPs), systems on a chip (SOCs), complex Programmable Logic Devices (CPLDs), and the like.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
In a first aspect, according to one or more embodiments of the present disclosure, there is provided a target detection method, including: acquiring an image to be detected; detecting a target in the image through a first model to obtain a focus image and a density image of the image; generating a target detection map of the image according to the attention map and the density map, wherein the target detection map comprises position marks of targets in the image; the first model is a lightweight deep learning model obtained by training based on sample data, a second model and a first loss function, the first loss function comprises a distillation loss function for reflecting distillation loss between the first model and the second model, and the model scale of the second model is larger than that of the first model.
According to one or more embodiments of the present disclosure, the sample data includes a sample image and label data corresponding to the sample image, and the one-time training process of the first model includes: performing feature extraction and decoding on the sample image through the first model; performing feature extraction and decoding on the sample image through the second model; determining a function value of the first loss function according to the image characteristics extracted by the first model and the image characteristics extracted by the second model; and correcting the first model according to the loss value of the first loss function.
According to one or more embodiments of the present disclosure, the distillation loss function comprises a structural similarity loss function, and the determining a function value of the first loss function from the first model-extracted image features and the second model-extracted image features comprises: and determining a function value of the structural similarity loss function according to the difference between a first feature map output by an output layer in the first model and a second feature map output by an output layer in the second model.
According to one or more embodiments of the present disclosure, the determining a function value of the structural similarity loss function according to a difference between a first feature map output by an output layer in the first model and a second feature map output by an output layer in the second model includes: determining the structural similarity of the first feature map and the second feature map, wherein the structural similarity includes at least one of the following: brightness similarity, contrast similarity and pixel vector similarity; and determining a function value of the structure similarity loss function according to the structure similarity.
According to one or more embodiments of the present disclosure, the determining the structural similarity between the first feature map and the second feature map includes: dividing the first feature map and the second feature map into a plurality of image areas respectively; determining structural similarity of a plurality of image regions of the first feature map and a plurality of image regions of the second feature map; and determining the structural similarity of the first feature map and the second feature map according to the structural similarity of the plurality of image areas of the first feature map and the plurality of image areas of the second feature map.
According to one or more embodiments of the present disclosure, the distillation loss function further includes a channel distillation loss function, a channel convolution layer is connected between the intermediate layer of the first model and the intermediate layer of the second model, and the channel convolution layer is used for establishing a mapping relationship between the image characteristics output by the intermediate layer of the first model and the image characteristics output by the intermediate layer of the second model; the determining a function value of the first loss function according to the image features extracted by the first model and the image features extracted by the second model, further comprising: determining a function value of the channel distillation loss function according to a difference between image features mapped to each other among the image features of the interlayer output of the first model and the image features of the interlayer output of the second model.
According to one or more embodiments of the present disclosure, the determining a function value of the channel distillation loss function according to a difference between image features mapped to each other, among the image features of the intermediate layer output of the first model and the image features of the intermediate layer output of the second model, includes: calculating the difference between each pair of mutually mapped image features through an L2 norm; and determining the function value of the channel distillation loss function as the sum of the differences between all of the mutually mapped image features.
According to one or more embodiments of the present disclosure, the label data corresponding to the sample image includes an actual attention map of the sample image and an actual density map of the sample image, the first model includes a first attention map branch and a first density map branch, and the first loss function further includes a first attention map branch loss function and a first density map branch loss function; the determining a function value of the first loss function from the first model-extracted image features and the second model-extracted image features comprises: inputting a first feature map output by an output layer in the first model into the first attention map branch and the first density map branch respectively to obtain a first attention map and a first density map; and respectively determining a function value of the first attention map branch loss function and a function value of the first density map branch loss function according to the first attention map, the first density map, the actual attention map and the actual density map.
According to one or more embodiments of the present disclosure, the determining, from the first attention map, the first density map, the actual attention map, and the actual density map, a function value of the first attention map branch loss function and a function value of the first density map branch loss function, respectively, includes: fusing the first attention map and the first density map to obtain a first target detection map of the sample image; determining a function value of the first attention map branch loss function according to the difference between the first attention map and the actual attention map; determining a function value of the first density map branch loss function according to a difference between the first target detection map and the actual density map.
In one embodiment of the present disclosure, the first attention map branch loss function is:

L_att = (1/N) × Σ_{i=1}^{N} ‖σ(A_i) − Â_i‖²

wherein N represents the number of sample images used in one training process of the first model, A_i represents the first attention map of the i-th sample image obtained after processing by the first model, Â_i represents the actual attention map of the i-th sample image, and σ() represents an activation function;

and/or the first density map branch loss function is:

L_den = (1/N) × Σ_{i=1}^{N} ‖D_F,i − D̂_i‖²

wherein D_F,i represents the first target detection map of the i-th sample image obtained after processing by the first model, and D̂_i represents the actual density map of the i-th sample image.
According to one or more embodiments of the present disclosure, the modifying the first model according to the loss value of the first loss function includes: weighting and summing the function value of the first attention map branch loss function, the function value of the first density map branch loss function, and the function value of the distillation loss function to obtain a final loss value of the first model; and correcting the first model according to the final loss value of the first model.
According to one or more embodiments of the present disclosure, the sample data includes a sample image and label data corresponding to the sample image, and a training process of the second model includes: performing feature extraction and decoding on the sample image through the second model; determining a function value of a second loss function according to a third feature map output in the second model and label data corresponding to the sample image; and correcting the second model according to the function value of the second loss function.
According to one or more embodiments of the present disclosure, the label data corresponding to the sample image includes an actual attention map of the sample image and an actual density map of the sample image, the second model includes a second attention map branch and a second density map branch, and the second loss function includes a second attention map branch loss function and a second density map branch loss function; determining a function value of a second loss function according to a third feature map output in the second model and label data corresponding to the sample image, including: inputting the third feature map into the second attention map branch and the second density map branch respectively to obtain a second attention map and a second density map; and respectively determining a function value of the branch loss function of the second attention map and a function value of the branch loss function of the second density map according to the second attention map, the second density map, the actual attention map and the actual density map.
According to one or more embodiments of the present disclosure, the determining, from the second attention map, the second density map, the actual attention map, and the actual density map, a function value of the second attention map branch loss function and a function value of the second density map branch loss function, respectively, includes: fusing the second attention map and the second density map to obtain a second target detection map of the sample image; determining a function value of the second attention map branch loss function according to the difference between the second attention map and the actual attention map; determining a function value of the second density map branch loss function according to a difference between the second target detection map and the actual density map.
According to one or more embodiments of the present disclosure, the modifying the second model according to the function value of the second loss function includes: weighting and summing the function value of the branch loss function of the second attention map and the function value of the branch loss function of the second density map to obtain a final loss value of the second model; and correcting the second model according to the final loss value of the second model.
In a second aspect, according to one or more embodiments of the present disclosure, there is provided an object detection apparatus including:
the device comprises an acquisition unit, a detection unit and a processing unit, wherein the acquisition unit is used for acquiring an image to be detected;
the detection unit is used for detecting a target in the image through a first model to obtain an attention map and a density map of the image;
a generation unit configured to generate a target detection map of the image from the attention map and the density map;
the first model is a lightweight deep learning model obtained by training based on sample data, a second model and a first loss function, the first loss function comprises a distillation loss function for reflecting distillation loss between the first model and the second model, and the model scale of the second model is larger than that of the first model.
According to one or more embodiments of the present disclosure, the sample data includes a sample image and label data corresponding to the sample image, and the one-time training process of the first model includes: performing feature extraction and decoding on the sample image through the first model; performing feature extraction and decoding on the sample image through the second model; determining a function value of the first loss function according to the image characteristics extracted by the first model and the image characteristics extracted by the second model; and correcting the first model according to the loss value of the first loss function.
According to one or more embodiments of the present disclosure, the distillation loss function comprises a structural similarity loss function, and the determining a function value of the first loss function from the first model-extracted image features and the second model-extracted image features comprises: and determining a function value of the structural similarity loss function according to the difference between a first feature map output by an output layer in the first model and a second feature map output by an output layer in the second model.
According to one or more embodiments of the present disclosure, the determining a function value of the structural similarity loss function according to a difference between a first feature map output by an output layer in the first model and a second feature map output by an output layer in the second model includes: determining the structural similarity of the first feature map and the second feature map, wherein the structural similarity includes at least one of the following: brightness similarity, contrast similarity, pixel vector similarity; and determining a function value of the structure similarity loss function according to the structure similarity.
According to one or more embodiments of the present disclosure, the determining the structural similarity between the first feature map and the second feature map includes: dividing the first feature map and the second feature map into a plurality of image areas respectively; determining structural similarity of a plurality of image areas of the first feature map and a plurality of image areas of the second feature map; and determining the structural similarity of the first feature map and the second feature map according to the structural similarity of the plurality of image areas of the first feature map and the plurality of image areas of the second feature map.
According to one or more embodiments of the present disclosure, the distillation loss function further includes a channel distillation loss function, a channel convolution layer is connected between the intermediate layer of the first model and the intermediate layer of the second model, and the channel convolution layer is used for establishing a mapping relationship between an image feature output by the intermediate layer of the first model and an image feature output by the intermediate layer of the second model; the determining a function value of the first loss function according to the image features extracted by the first model and the image features extracted by the second model, further comprising: determining a function value of the channel distillation loss function according to a difference between the image features mapped to each other among the image features of the middle layer output of the first model and the image features of the middle layer output of the second model.
According to one or more embodiments of the present disclosure, the determining a function value of the channel distillation loss function according to a difference between image features mapped to each other, among the image features of the intermediate layer output of the first model and the image features of the intermediate layer output of the second model, includes: calculating the difference between each pair of mutually mapped image features through an L2 norm; and determining the function value of the channel distillation loss function as the sum of the differences between all of the mutually mapped image features.
According to one or more embodiments of the present disclosure, the label data corresponding to the sample image includes an actual attention map of the sample image and an actual density map of the sample image, the first model includes a first attention map branch and a first density map branch, and the first loss function further includes a first attention map branch loss function and a first density map branch loss function; the determining a function value of the first loss function from the first model-extracted image features and the second model-extracted image features comprises: inputting a first feature map output by an output layer in the first model into the first attention map branch and the first density map branch respectively to obtain a first attention map and a first density map; and respectively determining a function value of the first attention map branch loss function and a function value of the first density map branch loss function according to the first attention map, the first density map, the actual attention map and the actual density map.
According to one or more embodiments of the present disclosure, the determining, from the first attention map, the first density map, the actual attention map, and the actual density map, a function value of the first attention map branch loss function and a function value of the first density map branch loss function, respectively, includes: fusing the first attention map and the first density map to obtain a first target detection map of the sample image; determining a function value of the first attention map branch loss function according to a difference between the first attention map and the actual attention map; determining a function value of the first density map branch loss function according to a difference between the first target detection map and the actual density map.
In one embodiment of the present disclosure, the first attention map branch loss function is given by a formula that survives only as an image in the source text, wherein N represents the number of sample images adopted in one training process of the first model, A_i represents the first attention map of the ith sample image obtained after processing by the first model, a further term of the formula represents the actual attention map of the ith sample image, and σ() represents an activation function;
and/or the first density map branch loss function is given by a second formula that likewise survives only as an image in the source text, wherein one term of that formula represents the first target detection map of the ith sample image processed by the first model and the other term represents the actual density map of the ith sample image.
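Because the two formula images do not survive text extraction, a plausible formulation consistent with the variable definitions above is offered here as an assumption only, not as the patent's verbatim formulas: a binary cross-entropy loss on the activated attention map and an L2 loss between the fused target detection map and the actual density map.

% Assumed forms of the two branch losses (not the patent's verbatim formulas).
\begin{aligned}
L_{A} &= -\frac{1}{N}\sum_{i=1}^{N}\Big[\hat{A}_i \log\sigma(A_i)
        + \big(1-\hat{A}_i\big)\log\big(1-\sigma(A_i)\big)\Big], \\
L_{D} &= \frac{1}{N}\sum_{i=1}^{N}\big\lVert \hat{D}_i - D_i^{gt} \big\rVert_2^{2},
\end{aligned}

where \hat{A}_i denotes the actual attention map, \hat{D}_i the first target detection map, and D_i^{gt} the actual density map of the ith sample image.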
According to one or more embodiments of the present disclosure, the modifying the first model according to the loss value of the first loss function includes: weighting and summing the function value of the first attention map branch loss function, the function value of the first density map branch loss function, and the function value of the distillation loss function to obtain a final loss value of the first model; and correcting the first model according to the final loss value of the first model.
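A minimal sketch of the weighted summation that yields the final loss value of the first model follows; the particular weight values are assumptions, not taken from the disclosure.

def final_loss(attn_loss, dens_loss, ssim_loss, channel_loss,
               weights=(1.0, 1.0, 0.5, 0.5)):
    # Weighted sum of the first attention map branch loss, the first density map
    # branch loss, and the two distillation losses; the weights are illustrative.
    w_a, w_d, w_s, w_c = weights
    return w_a * attn_loss + w_d * dens_loss + w_s * ssim_loss + w_c * channel_loss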
According to one or more embodiments of the present disclosure, the sample data includes a sample image and label data corresponding to the sample image, and a training process of the second model includes: performing feature extraction and decoding on the sample image through the second model; determining a function value of a second loss function according to a third feature map output in the second model and label data corresponding to the sample image; and correcting the second model according to the function value of the second loss function.
According to one or more embodiments of the present disclosure, the label data corresponding to the sample image includes an actual attention map of the sample image and an actual density map of the sample image, the second model includes a second attention map branch and a second density map branch, and the second loss function includes a second attention map branch loss function and a second density map branch loss function; determining a function value of a second loss function according to a third feature map output in the second model and label data corresponding to the sample image, including: inputting the third feature map into the second attention map branch and the second density map branch respectively to obtain a second attention map and a second density map; and respectively determining the function value of the branch loss function of the second attention map and the function value of the branch loss function of the second density map according to the second attention map, the second density map, the actual attention map and the actual density map.
According to one or more embodiments of the present disclosure, the determining, according to the second attention map, the second density map, the actual attention map, and the actual density map, a function value of the second attention map branch loss function and a function value of the second density map branch loss function, respectively, includes: fusing the second attention map and the second density map to obtain a second target detection map of the sample image; determining a function value of the second attention map branch loss function according to the difference between the second attention map and the actual attention map; and determining a function value of the second density map branch loss function according to the difference between the second target detection map and the actual density map.
According to one or more embodiments of the present disclosure, the modifying the second model according to the function value of the second loss function includes: weighting and summing the function value of the branch loss function of the second attention map and the function value of the branch loss function of the second density map to obtain a final loss value of the second model; and correcting the second model according to the final loss value of the second model.
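A hedged sketch of one training iteration of the second (teacher) model follows; the method names teacher.encode_decode and teacher.head are hypothetical, and the BCE/MSE loss choices, the equal loss weights, and the optimizer interface are assumptions used only for illustration.

import torch.nn.functional as F

def teacher_train_step(teacher, optimizer, sample_image, actual_attn, actual_dens):
    feat = teacher.encode_decode(sample_image)       # feature extraction and decoding (hypothetical method)
    attn, dens, detection = teacher.head(feat)       # second attention / density / target detection maps (hypothetical method)
    # Second attention map branch loss and second density map branch loss.
    attn_loss = F.binary_cross_entropy_with_logits(attn, actual_attn)
    dens_loss = F.mse_loss(detection, actual_dens)
    loss = attn_loss + dens_loss                     # weighted sum (weights assumed to be 1.0)
    optimizer.zero_grad()
    loss.backward()                                  # correct the second model
    optimizer.step()
    return float(loss)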
In a third aspect, according to one or more embodiments of the present disclosure, there is provided an electronic device including: at least one processor and a memory;
the memory stores computer-executable instructions;
the at least one processor executes computer-executable instructions stored by the memory to cause the at least one processor to perform the object detection method as described above in the first aspect and various possible designs of the first aspect.
In a fourth aspect, according to one or more embodiments of the present disclosure, there is provided a computer-readable storage medium having stored therein computer-executable instructions that, when executed by a processor, implement the object detection method as described in the first aspect above and in various possible designs of the first aspect.
In a fifth aspect, according to one or more embodiments of the present disclosure, there is provided an electronic device, including: at least one processor and memory;
the memory stores computer-executable instructions;
the at least one processor executes the computer-executable instructions stored by the memory to cause the at least one processor to perform the object detection method as described above in the first aspect and various possible designs of the first aspect.
In a sixth aspect, according to one or more embodiments of the present disclosure, there is provided a computer-readable storage medium having stored therein computer-executable instructions that, when executed by a processor, implement the object detection method as described in the first aspect above and in various possible designs of the first aspect.
The foregoing description is merely a description of the preferred embodiments of the present disclosure and of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the disclosure is not limited to technical solutions formed by the particular combination of features described above, and also encompasses other technical solutions formed by any combination of the above features or their equivalents without departing from the concept of the disclosure, for example, technical solutions formed by replacing the above features with technical features having similar functions disclosed in the present disclosure (but not limited thereto).
Further, while operations are depicted in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order. Under certain circumstances, multitasking and parallel processing may be advantageous. Likewise, while several specific implementation details are included in the above discussion, these should not be construed as limitations on the scope of the disclosure. Certain features that are described in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims.

Claims (19)

1. A method of target detection, comprising:
acquiring an image to be detected;
detecting a target in the image through a first model to obtain an attention map and a density map of the image;
generating a target detection map of the image according to the attention map and the density map, wherein the target detection map comprises position identifiers of targets in the image;
the first model is a lightweight deep learning model obtained by training based on sample data, a second model and a first loss function, the first loss function comprises a distillation loss function for reflecting distillation loss between the first model and the second model, and the model scale of the second model is larger than that of the first model.
2. The object detection method of claim 1, wherein the sample data comprises a sample image and label data corresponding to the sample image, and the one-time training process of the first model comprises:
performing feature extraction and decoding on the sample image through the first model;
performing feature extraction and decoding on the sample image through the second model;
determining a function value of the first loss function according to the image characteristics extracted by the first model and the image characteristics extracted by the second model;
and correcting the first model according to the loss value of the first loss function.
3. The object detection method of claim 2, wherein the distillation loss function comprises a structural similarity loss function, and the determining a function value of the first loss function according to the image features extracted by the first model and the image features extracted by the second model comprises:
and determining a function value of the structural similarity loss function according to the difference between a first feature map output by an output layer in the first model and a second feature map output by an output layer in the second model.
4. The object detection method of claim 3, wherein determining the function value of the structural similarity loss function according to a difference between a first feature map output by an output layer in the first model and a second feature map output by an output layer in the second model comprises:
determining the structural similarity of the first feature map and the second feature map, wherein the structural similarity includes at least one of the following: brightness similarity, contrast similarity and pixel vector similarity;
and determining a function value of the structure similarity loss function according to the structure similarity.
5. The object detection method of claim 4, wherein the determining the structural similarity of the first feature map and the second feature map comprises:
dividing the first feature map and the second feature map into a plurality of image areas respectively;
determining structural similarity of a plurality of image regions of the first feature map and a plurality of image regions of the second feature map;
and determining the structural similarity of the first feature map and the second feature map according to the structural similarity of the plurality of image areas of the first feature map and the plurality of image areas of the second feature map.
6. The object detection method according to claim 3, wherein the distillation loss function further comprises a channel distillation loss function, a channel convolution layer is connected between the intermediate layer of the first model and the intermediate layer of the second model, and the channel convolution layer is used for establishing a mapping relation between the image characteristics output by the intermediate layer of the first model and the image characteristics output by the intermediate layer of the second model;
the determining a function value of the first loss function according to the image features extracted by the first model and the image features extracted by the second model, further comprising:
determining a function value of the channel distillation loss function according to a difference between image features mapped to each other, among the image features output by the intermediate layer of the first model and the image features output by the intermediate layer of the second model.
7. The object detection method of claim 6, wherein determining the function value of the channel distillation loss function according to a difference between image features mapped to each other, among the image features output by the intermediate layer of the first model and the image features output by the intermediate layer of the second model, comprises:
calculating, through an L2 norm, the difference between each pair of image features that are mapped to each other;
determining a function value of the channel distillation loss function as a sum of differences between all of the mutually mapped image features.
8. The object detection method according to any one of claims 2 to 7, wherein the label data corresponding to the sample image includes an actual attention map of the sample image and an actual density map of the sample image, the first model includes a feature extraction layer, a feature decoding layer, a first attention map branch and a first density map branch, and the first loss function further includes a first attention map branch loss function and a first density map branch loss function;
determining a function value of the first loss function based on the image features extracted by the first model and the image features extracted by the second model, including:
inputting the image features output by a feature decoding layer in the first model into the first attention map branch and the first density map branch respectively to obtain a first attention map and a first density map;
and respectively determining a function value of the first attention map branch loss function and a function value of the first density map branch loss function according to the first attention map, the first density map, the actual attention map and the actual density map.
9. The object detection method of claim 8, wherein the determining, according to the first attention map, the first density map, the actual attention map, and the actual density map, a function value of the first attention map branch loss function and a function value of the first density map branch loss function, respectively, comprises:
fusing the first attention map and the first density map to obtain a first target detection map of the sample image;
determining a function value of the first attention map branch loss function according to a difference between the first attention map and the actual attention map;
determining a function value of the first density map branch loss function according to a difference between the first target detection map and the actual density map.
10. The object detection method of claim 9, wherein the first attention map branch loss function is a formula that survives only as an image in the source text, wherein N represents the number of sample images adopted in one training process of the first model, A_i represents the first attention map of the ith sample image obtained after processing by the first model, a further term of the formula represents the actual attention map of the ith sample image, and σ() represents an activation function;
and/or the first density map branch loss function is a formula that likewise survives only as an image in the source text, wherein one term of that formula represents the first target detection map of the ith sample image processed by the first model and the other term represents the actual density map of the ith sample image.
11. The object detection method of claim 8, wherein modifying the first model according to the loss value of the first loss function comprises:
weighting and summing the function value of the first attention map branch loss function, the function value of the first density map branch loss function, and the function value of the distillation loss function to obtain a final loss value of the first model;
and correcting the first model according to the final loss value of the first model.
12. The object detection method according to any one of claims 1-7 and 9-11, wherein the sample data comprises a sample image and label data corresponding to the sample image, and the one-time training process of the second model comprises:
performing feature extraction and decoding on the sample image through the second model;
determining a function value of a second loss function according to a third feature map output in the second model and label data corresponding to the sample image;
and correcting the second model according to the function value of the second loss function.
13. The object detection method of claim 12, wherein the label data corresponding to the sample image comprises an actual attention map of the sample image and an actual density map of the sample image, the second model comprises a second attention map branch and a second density map branch, and the second loss function comprises a second attention map branch loss function and a second density map branch loss function;
determining a function value of a second loss function according to a third feature map output in the second model and label data corresponding to the sample image, including:
inputting the third feature map into the second attention map branch and the second density map branch respectively to obtain a second attention map and a second density map;
and respectively determining the function value of the branch loss function of the second attention map and the function value of the branch loss function of the second density map according to the second attention map, the second density map, the actual attention map and the actual density map.
14. The object detection method of claim 13, wherein the determining, according to the second attention map, the second density map, the actual attention map, and the actual density map, the function value of the second attention map branch loss function and the function value of the second density map branch loss function, respectively, comprises:
fusing the second attention map and the second density map to obtain a second target detection map of the sample image;
determining a function value of the second attention map branch loss function according to a difference between the second attention map and the actual attention map;
determining a function value of the second density map branch loss function according to a difference between the second target detection map and the actual density map.
15. The object detection method of claim 13 or 14, wherein the modifying the second model according to the function value of the second loss function comprises:
weighting and summing the function value of the branch loss function of the second attention map and the function value of the branch loss function of the second density map to obtain a final loss value of the second model;
and correcting the second model according to the final loss value of the second model.
16. An object detection device comprising:
an acquisition unit for acquiring an image to be detected;
the detection unit is used for detecting a target in the image through a first model to obtain an attention map and a density map of the image;
a generation unit configured to generate a target detection map of the image from the attention map and the density map;
the first model is a lightweight deep learning model obtained by training based on sample data, a second model and a first loss function, the first loss function comprises a distillation loss function for reflecting distillation loss between the first model and the second model, and the model scale of the second model is larger than that of the first model.
17. An electronic device, comprising: at least one processor and a memory;
the memory stores computer-executable instructions;
execution of the computer-executable instructions stored by the memory by the at least one processor causes the at least one processor to perform the object detection method of any of claims 1-15.
18. A computer-readable storage medium having computer-executable instructions stored thereon which, when executed by a processor, implement the object detection method of any one of claims 1 to 15.
19. A computer program product comprising computer executable instructions which, when executed by a processor, implement the object detection method of any one of claims 1 to 15.
CN202110734260.1A 2021-06-30 2021-06-30 Target detection method and device Pending CN115546708A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110734260.1A CN115546708A (en) 2021-06-30 2021-06-30 Target detection method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110734260.1A CN115546708A (en) 2021-06-30 2021-06-30 Target detection method and device

Publications (1)

Publication Number Publication Date
CN115546708A true CN115546708A (en) 2022-12-30

Family

ID=84705766

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110734260.1A Pending CN115546708A (en) 2021-06-30 2021-06-30 Target detection method and device

Country Status (1)

Country Link
CN (1) CN115546708A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination