CN115578613B - Training method of target re-identification model and target re-identification method - Google Patents

Training method of target re-identification model and target re-identification method

Info

Publication number
CN115578613B
CN115578613B
Authority
CN
China
Prior art keywords
feature
image
image sample
feature maps
target
Prior art date
Legal status
Active
Application number
CN202211272814.1A
Other languages
Chinese (zh)
Other versions
CN115578613A (en)
Inventor
张欣彧
王健
冯浩城
Current Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202211272814.1A
Publication of CN115578613A
Application granted
Publication of CN115578613B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/62 Extraction of image or video features relating to a temporal dimension, e.g. time-based feature extraction; Pattern tracking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/50 Context or environment of the image
    • G06V20/52 Surveillance or monitoring of activities, e.g. for recognising suspicious objects

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Image Analysis (AREA)

Abstract

The disclosure provides a training method of a target re-identification model and a target re-identification method, relates to the technical field of artificial intelligence, in particular to the technical fields of deep learning, image processing, and computer vision, and can be applied to scenes such as face recognition and smart cities. The specific implementation scheme is as follows: determining, based on an image sample set, a plurality of first feature maps output by a pre-trained transformer model and a plurality of second feature maps output by a convolutional neural network model to be trained; determining a first distillation loss value from the plurality of first feature maps and the plurality of second feature maps using a first knowledge distillation loss function; determining a second distillation loss value from the plurality of first feature maps and the plurality of second feature maps using a second knowledge distillation loss function; and training the convolutional neural network model to be trained according to the first distillation loss value and the second distillation loss value to obtain a convolutional neural network model capable of performing target re-identification. According to this technical scheme, a convolutional neural network model with both high inference speed and high recognition performance can be obtained through training.

Description

Training method of target re-identification model and target re-identification method
Technical Field
The disclosure relates to the technical field of artificial intelligence, in particular to the technical field of deep learning, image processing and computer vision, and can be applied to scenes such as face recognition and smart cities.
Background
Pedestrian re-identification (Person Re-identification), also known as pedestrian re-ID, is a technique that uses computer vision to determine whether a particular pedestrian is present in an image or video sequence. Pedestrian re-identification aims to compensate for the visual limitations of fixed cameras, can be combined with pedestrian detection and pedestrian tracking technologies, and can be widely applied in fields such as intelligent video surveillance and intelligent security.
Because of differences between camera devices, and because pedestrians are both rigid and flexible objects whose appearance is easily affected by clothing, scale, occlusion, pose, viewing angle, and the like, pedestrian re-identification has become a hot research topic in computer vision that is both valuable and highly challenging.
Disclosure of Invention
The disclosure provides a training method of a target re-identification model and a target re-identification method.
According to an aspect of the present disclosure, there is provided a training method of a target re-identification model, including:
determining, based on an image sample set, a plurality of first feature maps output by a pre-trained transformer model and a plurality of second feature maps output by a convolutional neural network model to be trained;
determining a first distillation loss value from the plurality of first feature maps and the plurality of second feature maps using a first knowledge distillation loss function;
determining a second distillation loss value from the plurality of first feature maps and the plurality of second feature maps using a second knowledge distillation loss function; and
training the convolutional neural network model to be trained according to the first distillation loss value and the second distillation loss value to obtain a convolutional neural network model capable of performing target re-identification.
According to another aspect of the present disclosure, there is provided a target re-identification method, including:
determining a target image; and
performing target re-identification on the target image using the convolutional neural network model trained by the method of any embodiment of the present disclosure, to identify the target object in the target image.
According to another aspect of the present disclosure, there is provided a training apparatus for a target re-identification model, including:
a first determining module, configured to determine, based on the image sample set, a plurality of first feature maps output by the pre-trained transformer model and a plurality of second feature maps output by the convolutional neural network model to be trained;
a second determining module, configured to determine a first distillation loss value from the plurality of first feature maps and the plurality of second feature maps using a first knowledge distillation loss function;
a third determining module, configured to determine a second distillation loss value from the plurality of first feature maps and the plurality of second feature maps using a second knowledge distillation loss function; and
a training module, configured to train the convolutional neural network model to be trained according to the first distillation loss value and the second distillation loss value to obtain a convolutional neural network model capable of performing target re-identification.
According to another aspect of the present disclosure, there is provided a target re-identification apparatus, including:
an image determining module, configured to determine a target image; and
an image recognition module, configured to perform target re-identification on the target image using the convolutional neural network model trained by the method of any embodiment of the present disclosure, to identify the target object in the target image.
According to another aspect of the present disclosure, there is provided an electronic device including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of the embodiments of the present disclosure.
According to another aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing the computer to perform a method according to any one of the embodiments of the present disclosure.
According to another aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements a method according to any of the embodiments of the present disclosure.
According to the technical scheme of the present disclosure, a convolutional neural network model with both high inference speed and high recognition performance can be obtained through training.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The drawings are for a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a schematic diagram of a training method of a target re-identification model according to an embodiment of the present disclosure;
FIG. 2 is a schematic diagram of computing an image block similarity matrix according to an embodiment of the present disclosure;
FIG. 3 is a schematic diagram of computing a first similarity matrix according to an embodiment of the present disclosure;
FIG. 4 is a schematic diagram of a training method of a target re-identification model according to an embodiment of the present disclosure;
FIG. 5 is a schematic diagram of a target re-identification method according to an embodiment of the present disclosure;
FIG. 6 is a schematic diagram of a training apparatus of a target re-identification model according to an embodiment of the present disclosure;
FIG. 7 is a schematic diagram of a target re-identification apparatus according to an embodiment of the present disclosure;
fig. 8 is a block diagram of an electronic device for implementing the methods of embodiments of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
As shown in fig. 1, an embodiment of the present disclosure provides a training method of a target re-identification model, including:
step S101: based on the set of image samples, a plurality of first feature maps output by the pre-trained converter model and a plurality of second feature maps output by the convolutional neural network model to be trained are determined.
Step S102: a first distillation loss value is determined from the plurality of first feature maps and the plurality of second feature maps using a first knowledge distillation loss function.
Step S103: a second distillation loss value is determined from the plurality of first feature maps and the plurality of second feature maps using a second knowledge distillation loss function; and
Step S104: the convolutional neural network model to be trained is trained according to the first distillation loss value and the second distillation loss value to obtain a convolutional neural network model capable of performing target re-identification.
According to the embodiment of the disclosure, it is to be noted that:
the image sample set may include image samples of a plurality of objects, each of the plurality of objects corresponding to the plurality of image samples. For example, a batch of samples (image sample set) is b=p×k, i.e., K ID (identification code, identity document) targets are selected, each ID target selecting P image samples, K and P representing numbers.
The plurality of targets may include any objects, including humans, animals, items, and the like, which is not specifically limited here. The image samples of the plurality of targets may be understood as including, for example, image samples of person A, image samples of person B, and image samples of person C. Each target corresponding to a plurality of image samples may be understood as including, for example, image samples of person A captured from a plurality of angles and/or at a plurality of locations.
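As an illustration, a minimal Python sketch of assembling such a B = P × K batch; the helper name and data layout are assumptions for illustration, not part of the patent:

import random

def sample_pk_batch(samples_by_id, K=4, P=4):
    # samples_by_id: dict mapping a target ID to its list of image paths (assumed layout)
    ids = random.sample(list(samples_by_id), K)   # choose K distinct target IDs
    batch = []
    for tid in ids:
        imgs = samples_by_id[tid]
        # take P images per ID, sampling with replacement if the ID has fewer than P images
        picks = random.sample(imgs, P) if len(imgs) >= P else random.choices(imgs, k=P)
        batch.extend((img, tid) for img in picks)
    return batch                                  # B = P * K (image, ID) pairs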
A pre-trained transformer model may be understood as a pre-trained Transformer model capable of performing target re-identification. The specific structure of the transformer model is not particularly limited here; it only needs to be able to perform the visual task, i.e., to determine, using computer vision techniques, whether a specific target exists in an image or video.
The plurality of first feature maps may be understood as including: first feature maps output by the pre-trained transformer model based on a plurality of different image samples of the same target, and first feature maps output by the pre-trained transformer model based on image samples of a plurality of different targets. The number of first feature maps output by the pre-trained transformer model corresponds to the number of batch samples input to the transformer model.
The convolutional neural network model to be trained may be understood as a convolutional neural network (CNN) model that needs to be trained to have the target re-identification function.
The plurality of second feature maps may be understood as including: second feature maps output by the convolutional neural network model to be trained based on a plurality of different image samples of the same target, and second feature maps output by the convolutional neural network model to be trained based on image samples of a plurality of different targets. The number of second feature maps output by the convolutional neural network model to be trained corresponds to the number of batch samples (the plurality of image samples) input to the model.
The first knowledge distillation loss function and the second knowledge distillation loss function can be selected as needed, as long as a knowledge distillation mechanism can be realized, i.e., the high-quality knowledge learned by the pre-trained transformer model is migrated to the convolutional neural network model using the plurality of first feature maps and the plurality of second feature maps.
Training the convolutional neural network model to be trained according to the first distillation loss value and the second distillation loss value can be understood as optimizing the parameters of the convolutional neural network model to be trained according to the first distillation loss value and the second distillation loss value until the model converges. It can also be understood as optimizing those parameters according to the first distillation loss value, the second distillation loss value, and the loss value obtained from the convolutional neural network model's own loss function, until the model converges.
With the development of deep learning, convolutional neural network models are well supported on a wide range of hardware devices and achieve both high speed and high accuracy. Meanwhile, model structures based on the transformer are emerging. Compared with convolutional neural network models, the transformer structure is better at describing images globally, and therefore surpasses convolutional neural network models on various visual tasks such as recognition, detection, segmentation, and fine-grained classification. However, in resource-constrained environments such as edge devices, the transformer model lacks efficient hardware design support, its inference is slow after deployment, and it is difficult to balance speed and accuracy. The method of the embodiments of the present disclosure effectively solves these problems.
According to the embodiments of the present disclosure, through a knowledge distillation mechanism and an image sample set, the trained transformer model serves as the teacher model and the convolutional neural network model to be distilled and trained serves as the student model, so that the high-quality target re-identification knowledge learned by the transformer model can be migrated to the convolutional neural network model. The convolutional neural network model can thus learn the better feature representations of the transformer model and reach its recognition accuracy, overcoming the problems that the convolutional neural network structure is limited by the receptive field of its convolution kernels, that global feature interaction is difficult to realize, and that generalization ability is poor. The performance of the convolutional neural network model in recognition, detection, segmentation, fine-grained classification, and the like is improved. Moreover, distilling knowledge from the transformer model improves the global feature representation of the convolutional neural network model and hence its accuracy, so that during target re-identification the convolutional neural network model can fuse global and local feature representations, ensuring discriminative features during retrieval. At the same time, the convolutional neural network model obtained through training, with transformer-level performance, is well supported on all kinds of hardware devices, such as GPUs (graphics processing units) and edge devices; its computational inference speed on these devices is ensured, which solves the problem that the high performance of the transformer model is difficult to deploy and support well on limited hardware devices, effectively improving device operation performance and operation speed.
In one application example, a convolutional neural network model for pedestrian re-identification may be trained using the training method of the target re-identification model of the embodiments of the present disclosure.
In one application example, the training method of the target re-identification model of the embodiments of the present disclosure may be used to train a convolutional neural network model applicable to scenes such as face recognition and smart cities.
In one example, the plurality of first feature maps output by the transformer model based on the image sample set may be represented as {G_1, G_2, ..., G_B}, G_b ∈ R^(p×c), where p is the number of patches (image blocks) of the first feature map, p = H × W, H is the height of the first feature map, W is the width of the first feature map, c is the feature dimension, b is the index of a first feature map, and B is the total number of first feature maps. G_1 and G_2 can be understood as different first feature maps of the same ID target, and G_3 is a first feature map of an ID target different from that of G_2.
In one example, the plurality of second feature maps output by the convolutional neural network model to be trained based on the image sample set may be represented as {F_1, F_2, ..., F_B}, F_b ∈ R^(p×c), where p is the number of feature vectors of the second feature map, p = H × W, H is the height of the second feature map, W is the width of the second feature map, c is the feature dimension, b is the index of a second feature map, and B is the total number of second feature maps. F_1 and F_2 can be understood as different second feature maps of the same ID target, and F_3 is a second feature map of an ID target different from that of F_2.
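For concreteness, a minimal PyTorch-style sketch of bringing both outputs to the shared R^(p×c) layout described above; all sizes and tensor names here are illustrative assumptions:

import torch

B, c, H, W = 32, 256, 16, 8          # illustrative sizes: batch, channels, map height/width
cnn_out = torch.randn(B, c, H, W)    # raw second feature maps from the CNN
vit_out = torch.randn(B, H * W, c)   # first feature maps: p = H*W patch tokens of dimension c

# flatten the CNN map so each spatial position becomes one feature vector, matching the tokens
F_maps = cnn_out.flatten(2).transpose(1, 2)   # (B, p, c), p = H * W
G_maps = vit_out                              # already (B, p, c)
assert F_maps.shape == G_maps.shape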
In one implementation, the training method of the target re-identification model of the embodiments of the present disclosure includes steps S101 to S104, where step S101: determining, based on the image sample set, a plurality of first feature maps output by the pre-trained transformer model and a plurality of second feature maps output by the convolutional neural network model to be trained, may include:
Step S1011: determining, based on the image sample set, the plurality of first feature maps output by the pre-trained transformer model.
Step S1012: determining, based on the image sample set, the plurality of second feature maps output by the convolutional neural network model to be trained.
Step S1013: adjusting the sizes of the plurality of first feature maps and/or the plurality of second feature maps in response to the plurality of first feature maps being different in size from the plurality of second feature maps.
According to the embodiment of the disclosure, it is to be noted that:
the specific manner of adjusting the sizes of the plurality of first feature maps and/or the plurality of second feature maps is not particularly limited herein, and it is only necessary to ensure that the adjusted sizes are consistent and do not affect the characterization of the feature vectors.
When the size adjustment is performed, the size of the first feature map may be adjusted to fit the second feature map based on the size of the second feature map. The size of the second feature map may also be adjusted to fit the first feature map based on the size of the first feature map. The size of the first feature map and the size of the second feature map can be adjusted at the same time, so that the sizes of the first feature map and the second feature map are adjusted to be matched. The size adaptation may be understood as a uniform size or a size conforming to a preset scaling law.
According to the embodiments of the present disclosure, by adjusting the sizes of the plurality of first feature maps and/or the plurality of second feature maps, more accurate first and second distillation loss values are obtained when computing with the first and second knowledge distillation loss functions, thereby improving the training effect and training speed of the convolutional neural network model to be trained.
In one implementation, the training method of the target re-identification model of the embodiments of the present disclosure includes steps S101 to S104, where step S1013: adjusting the sizes of the plurality of first feature maps and/or the plurality of second feature maps in response to the plurality of first feature maps being different in size from the plurality of second feature maps, may include:
in response to the plurality of first feature maps and the plurality of second feature maps being different in size, using one of the following for resizing (a sketch follows the list):
downsampling the plurality of first feature maps to adapt the sizes of the plurality of first feature maps and the plurality of second feature maps;
upsampling the plurality of second feature maps to adapt the sizes of the plurality of first feature maps and the plurality of second feature maps; or
downsampling the plurality of first feature maps and upsampling the plurality of second feature maps to adapt the sizes of the plurality of first feature maps and the plurality of second feature maps.
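A minimal sketch of these three options using standard PyTorch resizing operators; the choice of operators (average pooling and bilinear interpolation) is an assumption, since the patent does not fix specific ones:

import torch
import torch.nn.functional as nnf

def match_sizes(G, F_map, mode="down"):
    # G: first (teacher) feature maps reshaped to (B, c, Hg, Wg);
    # F_map: second (student) feature maps, (B, c, Hf, Wf)
    if mode == "down":   # option 1: downsample the first feature maps
        return nnf.adaptive_avg_pool2d(G, F_map.shape[-2:]), F_map
    if mode == "up":     # option 2: upsample the second feature maps
        return G, nnf.interpolate(F_map, size=G.shape[-2:], mode="bilinear", align_corners=False)
    # option 3: downsample the first maps and upsample the second maps toward a middle size
    mid = ((G.shape[-2] + F_map.shape[-2]) // 2, (G.shape[-1] + F_map.shape[-1]) // 2)
    return (nnf.adaptive_avg_pool2d(G, mid),
            nnf.interpolate(F_map, size=mid, mode="bilinear", align_corners=False))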
According to the embodiments of the present disclosure, by adjusting the sizes of the plurality of first feature maps and/or the plurality of second feature maps, more accurate first and second distillation loss values are obtained when computing with the first and second knowledge distillation loss functions, thereby improving the training effect and training speed of the convolutional neural network model to be trained.
In one implementation, the training method of the target re-identification model of the embodiments of the present disclosure includes steps S101 to S104, where step S102: determining a first distillation loss value from the plurality of first feature maps and the plurality of second feature maps using a first knowledge distillation loss function, includes:
Step S1021: calculating, for each first feature map of the plurality of first feature maps, an image block similarity matrix of the first feature map with itself.
Step S1022: calculating, for each second feature map of the plurality of second feature maps, a feature vector similarity matrix of the second feature map with itself.
Step S1023: calculating a first distance from the image block similarity matrix of the first feature map and the feature vector similarity matrix of the second feature map corresponding to the same image sample of the same target in the image sample set.
Step S1024: determining the first distillation loss value from the first distance of each image sample in the image sample set using the first knowledge distillation loss function.
According to the embodiment of the disclosure, it is to be noted that:
Calculating the image block similarity matrix of each first feature map with itself can be understood as performing a similarity calculation between every image block of the first feature map and every image block of the same first feature map, i.e., a self-image-block similarity distillation mechanism. For example, as shown in FIG. 2, the first feature map is a feature map obtained by the transformer model based on an image sample of person A. The first feature map comprises four image blocks, t_1, t_2, t_3 and t_4. Calculating the image block similarity matrix of the first feature map with itself is equivalent to multiplying t_1 with each of t_1, t_2, t_3 and t_4, multiplying t_2 with each of t_1, t_2, t_3 and t_4, and so on. That is, the 2x2 feature matrix of the first feature map is multiplied by the 2x2 feature matrix of the first feature map to obtain a 4x4 image block similarity matrix.
Calculating the feature vector similarity matrix of each second feature map with itself can be understood as performing a similarity calculation between every feature vector of the second feature map and every feature vector of the same second feature map, i.e., a self-image-block (feature vector) similarity distillation mechanism. For example, the second feature map is a feature map obtained by the convolutional neural network model based on an image sample of person A. The second feature map comprises four feature vectors, r_1, r_2, r_3 and r_4; the representation of the feature vector matrix of the second feature map may refer to the representation of the image block matrix of the first feature map in FIG. 2. Calculating the feature vector similarity matrix of the second feature map with itself is equivalent to multiplying r_1 with each of r_1, r_2, r_3 and r_4, multiplying r_2 with each of r_1, r_2, r_3 and r_4, and so on. That is, the 2x2 feature matrix of the second feature map is multiplied by the 2x2 feature matrix of the second feature map to obtain a 4x4 feature vector similarity matrix.
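A minimal sketch of this self-similarity matrix; the patent describes it as block-by-block multiplication, and the L2 normalization (making the products cosine similarities) is an assumption here:

import torch
import torch.nn.functional as nnf

def self_similarity(X):
    # X: (B, p, c); one row per image block (first maps) or feature vector (second maps)
    Xn = nnf.normalize(X, dim=-1)    # normalize so each product below is a cosine similarity
    return Xn @ Xn.transpose(1, 2)   # (B, p, p): every block multiplied with every block

# e.g., Sim(G_b, G_b) = self_similarity(G_maps)[b] and Sim(F_b, F_b) = self_similarity(F_maps)[b]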
According to the embodiments of the present disclosure, since the transformer model divides the image sample into a plurality of image blocks at the front end of the model, and each image block serves as an independent feature vector of the feature map during training, the self-image-block similarity distillation mechanism allows the similarities between the feature vectors of the feature map output by the convolutional neural network model to be matched against the similarities between the image block feature vectors of the feature map of the transformer model, supervised by the first knowledge distillation loss function. In this way, the convolutional neural network model to be trained can fully learn the feature effectiveness of the transformer model, so that the high-quality target re-identification knowledge learned by the transformer model is distilled and transferred to the convolutional neural network model.
In one example, the first distance, calculated from the image block similarity matrix of the first feature map and the feature vector similarity matrix of the second feature map corresponding to the same image sample of the same target in the image sample set, may be computed as a cosine similarity (cosine distance).
In one implementation, the training method of the target re-identification model of the embodiments of the present disclosure includes steps S101 to S104, where the first knowledge distillation loss function may be a mean squared error (MSE) loss function. Step S1024: determining a first distillation loss value from the first distance of each image sample in the image sample set using the first knowledge distillation loss function, includes:
determining the first distillation loss value from the first distance of each image sample in the image sample set using the mean squared error loss function.
According to the embodiment of the disclosure, the first distillation loss value can be calculated more accurately by using the mean square error loss function, so that the effect and the speed of optimizing the convolutional neural network model to be trained by using the first distillation loss value are improved.
In one example, steps S1021 to S1024, which implement "determining a first distillation loss value from the plurality of first feature maps and the plurality of second feature maps using the first knowledge distillation loss function", may be expressed as the following formula:

L1 = (1/B) Σ_{b=1}^{B} MSE(Sim(F_b, F_b), Sim(G_b, G_b))

where B is the number of image samples in the image sample set input at one time to the models (the pre-trained transformer model and the convolutional neural network model to be trained);
b is a subscript denoting a particular feature map;
F_b is a second feature map;
G_b is a first feature map;
Sim(F_b, F_b) corresponds to step S1022: calculating, for each second feature map of the plurality of second feature maps, the feature vector similarity matrix of the second feature map with itself;
Sim(G_b, G_b) corresponds to step S1021: calculating, for each first feature map of the plurality of first feature maps, the image block similarity matrix of the first feature map with itself;
Sim(F_b, F_b) − Sim(G_b, G_b) corresponds to step S1023: calculating the first distance from the image block similarity matrix of the first feature map and the feature vector similarity matrix of the second feature map corresponding to the same image sample of the same target in the image sample set; and
the equation for L1 corresponds to steps S1021 to S1024.
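A minimal sketch of L1, reusing the self_similarity helper from the sketch above; the exact reduction over the batch is an assumption, since the text states only that MSE is applied to the first distances:

import torch.nn.functional as nnf

def l1_distillation_loss(F_maps, G_maps):
    # F_maps, G_maps: (B, p, c) second/first feature maps for the same image samples,
    # sizes already matched as described earlier
    sim_f = self_similarity(F_maps)    # Sim(F_b, F_b) for every sample, (B, p, p)
    sim_g = self_similarity(G_maps)    # Sim(G_b, G_b) for every sample, (B, p, p)
    return nnf.mse_loss(sim_f, sim_g)  # mean squared error of the first distances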
In one implementation, the training method of the target re-identification model of the embodiments of the present disclosure includes steps S101 to S104, where step S103: determining a second distillation loss value from the plurality of first feature maps and the plurality of second feature maps using a second knowledge distillation loss function, includes:
Step S1031: calculating, from the plurality of first feature maps and the plurality of second feature maps, a first similarity matrix of the feature maps corresponding to every two different image samples of the same target in the image sample set. The feature maps corresponding to every two different image samples of the same target comprise: a first feature map of a first image sample and a second feature map of a second image sample of the same target.
Step S1032: calculating a second distance from each first similarity matrix.
Step S1033: calculating, from the plurality of first feature maps and the plurality of second feature maps, a second similarity matrix of the feature maps corresponding to the image samples of every two different targets in the image sample set. The feature maps corresponding to the image samples of every two different targets comprise: a first feature map corresponding to an image sample of a first target and a second feature map corresponding to an image sample of a second target.
Step S1034: calculating a third distance from each second similarity matrix.
Step S1035: determining the second distillation loss value, using the second knowledge distillation loss function, from the second distance corresponding to the same target and the third distance corresponding to every two different targets.
According to the embodiment of the disclosure, it is to be noted that:
the feature dimensions of the feature maps used to calculate the first and second similarity matrices may be the same or similar.
The first feature map of a first image sample and the second feature map of a second image sample of the same target can be understood as, for example, a first feature map output by the pre-trained transformer model based on image sample A1 of target person A, and a second feature map output by the convolutional neural network model to be trained based on image sample A2 of target person A.
The first feature map corresponding to an image sample of a first target and the second feature map corresponding to an image sample of a second target can be understood as, for example, a first feature map output by the pre-trained transformer model based on image sample A1 of target person A, and a second feature map output by the convolutional neural network model to be trained based on image sample B1 of target person B.
Calculating the first similarity matrix of the feature maps corresponding to every two different image samples of the same target in the image sample set can be understood as a cross-image-block similarity distillation mechanism in which a similarity calculation is performed between every image block of the first feature map corresponding to a first image sample of target A and every feature vector of the second feature map corresponding to a second image sample of target A. For example, as shown in FIG. 3, the first feature map is a feature map obtained based on a first image sample of person A. The first feature map comprises four image blocks, s_1, s_2, s_3 and s_4. The second feature map is a feature map obtained based on a second image sample of person A. The second feature map comprises four feature vectors, t_1, t_2, t_3 and t_4. Calculating the first similarity matrix is equivalent to multiplying s_1 with each of t_1, t_2, t_3 and t_4, multiplying s_2 with each of t_1, t_2, t_3 and t_4, and so on. That is, the 2x2 feature matrix of the first feature map is multiplied by the 2x2 feature matrix of the second feature map to obtain a 4x4 first similarity matrix.
Calculating the second similarity matrix of the feature maps corresponding to the image samples of every two different targets in the image sample set can be understood as a cross-image-block similarity distillation mechanism in which a similarity calculation is performed between every image block of the first feature map corresponding to an image sample of target person A and every feature vector of the second feature map corresponding to an image sample of target person B. For example, the first feature map is a feature map obtained based on a first image sample of person A. The first feature map comprises four image blocks, s_1, s_2, s_3 and s_4. The second feature map is a feature map obtained based on a second image sample of person B. The second feature map comprises four feature vectors, t_1, t_2, t_3 and t_4. Calculating the second similarity matrix is equivalent to multiplying s_1 with each of t_1, t_2, t_3 and t_4, multiplying s_2 with each of t_1, t_2, t_3 and t_4, and so on. That is, the 2x2 feature matrix of the first feature map is multiplied by the 2x2 feature matrix of the second feature map to obtain a 4x4 second similarity matrix.
According to the embodiments of the present disclosure, since the image block features of feature maps of the same target should be more similar, and the image block features of feature maps of different targets should be more different, the embodiments of the present disclosure adopt a distillation mechanism based on cross image-block similarity: the similarity is calculated for image blocks between different feature maps of the same target and the distance between them is reduced, and the similarity is calculated for image blocks between feature maps of different targets and the distance between them is increased, with supervision by the second knowledge distillation loss function. In this way, the convolutional neural network model to be trained can fully learn the feature effectiveness of the transformer model, so that the high-quality target re-identification knowledge learned by the transformer model is distilled and transferred to the convolutional neural network model.
In one example, the second distance calculated from each first similarity matrix and the third distance calculated from each second similarity matrix may be computed by cosine similarity (cosine distance).
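A minimal sketch of a cross similarity matrix and a scalar distance derived from it; averaging the matrix over all image-block pairs is an assumption, since the text does not fix the reduction:

import torch.nn.functional as nnf

def cross_similarity(A, B_map):
    # A, B_map: (p, c) feature maps of two different image samples
    An = nnf.normalize(A, dim=-1)
    Bn = nnf.normalize(B_map, dim=-1)
    return An @ Bn.T                   # (p, p) cross image-block similarity matrix

def cross_distance(A, B_map):
    # scalar cosine distance, averaged over all image-block pairs
    return 1.0 - cross_similarity(A, B_map).mean()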
In one implementation, the training method of the target re-identification model of the embodiments of the present disclosure includes steps S101 to S104, wherein the second knowledge distillation loss function is a triplet loss function. Determining the second distillation loss value, using the second knowledge distillation loss function, from the second distance corresponding to the same target and the third distance corresponding to every two different targets, includes:
determining the second distillation loss value, using the triplet loss function, from the second distance corresponding to the same target and the third distance corresponding to every two different targets.
According to the embodiment of the disclosure, the second distillation loss value can be calculated more accurately by using the triplet loss function, so that the effect and the speed of optimizing the convolutional neural network model to be trained by using the second distillation loss value are improved.
In one implementation, the training method of the target re-identification model of the embodiments of the present disclosure includes steps S101 to S104, where calculating, from the plurality of first feature maps and the plurality of second feature maps, a first similarity matrix of the feature maps corresponding to every two different image samples of the same target in the image sample set includes:
calculating, from the plurality of first feature maps and the plurality of second feature maps, a first similarity matrix of the feature maps corresponding to every two different image samples of the same target in the image sample set.
The feature maps corresponding to every two different image samples of the same target comprise: the first feature map of a first image sample of the same target and the second feature map, of a second image sample, that has the lowest similarity to it; and the second feature map of a first image sample of the same target and the first feature map, of a second image sample, that has the lowest similarity to it. The second image sample is a positive sample.
According to the embodiment of the disclosure, it is to be noted that:
the first similarity matrix of the feature images corresponding to each two different image samples of the same target in the image sample set is calculated, and the first similarity matrix can be understood as a distillation mechanism of similarity of each image block of the first feature image corresponding to the first image sample of the target A and each feature vector of the second feature image corresponding to the second image sample of the target A, which has the lowest similarity with the first feature image, are calculated. For example, as shown in fig. 3, the first feature map is a feature map obtained based on a first image sample of the a person. The first feature map comprises four image blocks s 1 、s 2 、s 3 Sum s 4 . The second feature map having the lowest similarity to the first feature map is a feature map obtained based on a second image sample of the a person. The second feature map comprises four feature vectors, t respectively 1 、t 2 、t 3 And t 4 . Calculating a first similarity matrix, corresponding to s 1 Respectively with t 1 、t 2 、t 3 And t 4 Multiplication, s 2 Respectively with t 1 、t 2 、t 3 And t 4 Multiplication, and so on. I.e. the feature matrix of the first feature map of 2x2 is multiplied by the feature matrix of the second feature map of 2x2 to obtain a first similarity matrix of 4x 4. And a distillation mechanism for calculating the similarity between each image block of the second feature map corresponding to the first image sample of the A object and each feature vector of the first feature map corresponding to the second image sample of the A object, which has the lowest similarity with the second feature map, namely the similarity of the crossed image blocks.
According to the embodiments of the present disclosure, since the image block features of feature maps of the same target should be more similar, and the image block features of feature maps of different targets should be more different, the embodiments of the present disclosure adopt a distillation mechanism based on cross image-block similarity, that is, the similarity is calculated for image blocks between different feature maps of the same target and the distance between them is reduced. In this way, the convolutional neural network model to be trained can fully learn the feature effectiveness of the transformer model, so that the high-quality target re-identification knowledge learned by the transformer model is distilled and transferred to the convolutional neural network model.
In one implementation, the training method of the target re-identification model of the embodiments of the present disclosure includes steps S101 to S104, where calculating, from the plurality of first feature maps and the plurality of second feature maps, a second similarity matrix of the feature maps corresponding to the image samples of every two different targets in the image sample set includes:
calculating, from the plurality of first feature maps and the plurality of second feature maps, a second similarity matrix of the feature maps corresponding to the image samples of every two different targets in the image sample set.
The feature maps corresponding to the image samples of every two different targets comprise: the first feature map corresponding to an image sample of a first target and the second feature map, corresponding to an image sample of a second target, that has the lowest similarity to it; and the second feature map corresponding to an image sample of the first target and the first feature map, corresponding to an image sample of the second target, that has the lowest similarity to it. The image sample of the second target is a negative sample.
According to the embodiment of the disclosure, it is to be noted that:
the second similarity matrix of the feature images corresponding to the image samples of each two different targets in the image sample set is calculated, and the similarity calculation can be understood as a distillation mechanism of the similarity of the crossed image blocks, wherein each image block of the first feature image corresponding to the image sample of the target character A and each feature vector of the second feature image corresponding to the image sample of the target character B, which has the lowest similarity with the first feature image, are performed. Example(s) For example, the first feature map is a feature map obtained based on a first image sample of the a person. The first feature map comprises four image blocks s 1 、s 2 、s 3 Sum s 4 . The second feature map is a feature map obtained based on a second image sample of the B person. The second feature map comprises four feature vectors, t respectively 1 、t 2 、t 3 And t 4 . Calculating a second similarity matrix, corresponding to s 1 Respectively with t 1 、t 2 、t 3 And t 4 Multiplication, s 2 Respectively with t 1 、t 2 、t 3 And t 4 Multiplication, and so on. I.e. the feature matrix of the first feature map of 2x2 is multiplied by the feature matrix of the second feature map of 2x2 to obtain a second similarity matrix of 4x 4. And performing similarity calculation on each image block of the second feature map corresponding to the image sample of the A target person and each feature vector of the first feature map with the lowest similarity with the second feature map corresponding to the image sample of the B target person.
According to the embodiments of the present disclosure, since the image block features of feature maps of the same target should be more similar, and the image block features of feature maps of different targets should be more different, the embodiments of the present disclosure adopt a distillation mechanism based on cross image-block similarity, that is, the similarity is calculated for image blocks between feature maps of different targets and the distance between them is increased, with supervision by the second knowledge distillation loss function. In this way, the convolutional neural network model to be trained can fully learn the feature effectiveness of the transformer model, so that the high-quality target re-identification knowledge learned by the transformer model is distilled and transferred to the convolutional neural network model.
In one example, steps S1031 to S1035, which implement "determining a second distillation loss value from the plurality of first feature maps and the plurality of second feature maps using the second knowledge distillation loss function", may be expressed as a formula of the following triplet form, reconstructed from the definitions below:

L2 = (1/B) Σ_{b=1}^{B} [ max(0, d(Sim(F_b, G_b^+)) − d(Sim(F_b, G_b^-)) + α) + max(0, d(Sim(G_b, F_b^+)) − d(Sim(G_b, F_b^-)) + α) ]

where d(·) denotes the distance computed from a similarity matrix;
B is the number of image samples in the image sample set input at one time to the models (the pre-trained transformer model and the convolutional neural network model to be trained);
b is a subscript denoting a particular feature map;
α is the hyperparameter (margin) of the triplet loss function;
F_b is a second feature map;
G_b is a first feature map;
G_b^+ is the first feature map of the positive image sample with the lowest similarity to the second feature map of the same target;
F_b^+ is the second feature map of the positive image sample with the lowest similarity to the first feature map of the same target;
G_b^- is the first feature map of the negative image sample with the lowest similarity to the second feature map of a different target;
F_b^- is the second feature map of the negative image sample with the lowest similarity to the first feature map of a different target;
Sim(F_b, G_b^+) and Sim(G_b, F_b^+) correspond to step S1031: calculating, from the plurality of first feature maps and the plurality of second feature maps, the first similarity matrix of the feature maps corresponding to every two different image samples of the same target in the image sample set, where those feature maps comprise the second feature map of a first image sample of the same target and the first feature map of the second (positive) image sample with the lowest similarity to it, and the first feature map of a first image sample of the same target and the second feature map of the second (positive) image sample with the lowest similarity to it;
d(Sim(F_b, G_b^+)) and d(Sim(G_b, F_b^+)) correspond to step S1032: calculating the second distance from each first similarity matrix;
Sim(F_b, G_b^-) and Sim(G_b, F_b^-) correspond to step S1033: calculating, from the plurality of first feature maps and the plurality of second feature maps, the second similarity matrix of the feature maps corresponding to the image samples of every two different targets in the image sample set, where those feature maps comprise the second feature map corresponding to an image sample of a first target and the first feature map, corresponding to a (negative) image sample of a second target, with the lowest similarity to it, and the first feature map corresponding to an image sample of the first target and the second feature map, corresponding to a (negative) image sample of the second target, with the lowest similarity to it;
d(Sim(F_b, G_b^-)) and d(Sim(G_b, F_b^-)) correspond to step S1034: calculating the third distance from each second similarity matrix; and
the equation for L2 corresponds to steps S1031 to S1035.
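A minimal sketch of this triplet-style distillation term, assuming the symmetric two-hinge form written above and reusing the cross_distance helper from the earlier sketch; the margin value is an assumption:

import torch

def l2_distillation_loss(F_b, G_b, G_pos, F_pos, G_neg, F_neg, alpha=0.3):
    # F_b, G_b: student/teacher feature maps, (p, c), of one image sample;
    # G_pos/F_pos: lowest-similarity positive maps of the same target;
    # G_neg/F_neg: lowest-similarity negative maps of a different target;
    # alpha: the triplet margin hyperparameter
    return (torch.relu(cross_distance(F_b, G_pos) - cross_distance(F_b, G_neg) + alpha)
            + torch.relu(cross_distance(G_b, F_pos) - cross_distance(G_b, F_neg) + alpha))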
In one example, training the convolutional neural network model to be trained in step S104 to obtain a convolutional neural network model capable of performing target re-identification includes:
tuning the parameters of the convolutional neural network model to be trained according to the first distillation loss value and the second distillation loss value;
in a case where it is determined that the parameter-tuned convolutional neural network model has converged, ending the training to obtain a convolutional neural network model capable of performing target re-identification; and
in a case where it is determined that the parameter-tuned convolutional neural network model has not converged, repeating steps S101 to S104, each time tuning the parameters according to the first distillation loss value and the second distillation loss value, until the convolutional neural network model to be trained converges. A sketch of one such parameter-tuning step follows.
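Putting the pieces together, a minimal sketch of one parameter-tuning iteration, reusing l1_distillation_loss from the earlier sketch; the optimizer, the loss weights w1/w2, and any base re-identification loss of the CNN itself are assumptions, not fixed by this example:

import torch

def train_step(cnn_student, vit_teacher, images, optimizer, w1=1.0, w2=1.0):
    with torch.no_grad():
        G_maps = vit_teacher(images)   # first feature maps, (B, p, c); teacher stays frozen
    F_maps = cnn_student(images)       # second feature maps, (B, p, c)
    loss = w1 * l1_distillation_loss(F_maps, G_maps)
    # ... plus w2 times the L2 term accumulated over the positive/negative pairs in the
    # batch, and optionally the CNN's own loss, per the embodiments above
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()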
In an application example, as shown in FIG. 4, a training method of a target re-identification model according to an embodiment of the present disclosure includes:
Step 100: inputting the image sample set into a pre-trained transformer pedestrian re-identification model and a CNN pedestrian re-identification model to be trained, respectively, where the image sample set includes pedestrian pictures of a plurality of ID persons.
Step 110: determining, based on the image sample set, a plurality of first feature maps output by the pre-trained transformer pedestrian re-identification model (e.g., the ID1, ID2, and ID3 first feature maps in FIG. 4) and a plurality of second feature maps output by the CNN pedestrian re-identification model to be trained (e.g., the ID1, ID2, and ID3 second feature maps in FIG. 4), and downsampling the plurality of first feature maps to adapt them to the plurality of second feature maps.
Step 120: determining a first distillation loss value from the plurality of first feature maps and the plurality of second feature maps using the first knowledge distillation loss function, based on the self-image-block similarity distillation mechanism.
Step 130: determining a second distillation loss value from the plurality of first feature maps and the plurality of second feature maps using the second knowledge distillation loss function, based on the cross-image-block similarity distillation mechanism.
Step 140: training the convolutional neural network model to be trained according to the first distillation loss value and the second distillation loss value to obtain a convolutional neural network model capable of performing target re-identification.
As shown in fig. 5, an embodiment of the present disclosure provides a target re-identification method, including:
Step S501: a target image is determined.
Step S502: performing target re-identification on the target image using the convolutional neural network model trained by the method of any embodiment of the present disclosure, to identify the target object in the target image.
According to the embodiments of the present disclosure, the convolutional neural network model trained by the method of any embodiment of the present disclosure can better re-identify the target in the target image.
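As an illustration, a minimal sketch of using the distilled CNN for re-identification by nearest-neighbor feature matching; the pooling of the (p, c) maps into single descriptors and the gallery setup are assumptions:

import torch
import torch.nn.functional as nnf

@torch.no_grad()
def reidentify(cnn_model, query_img, gallery_imgs):
    # pool each (p, c) feature map into one descriptor and L2-normalize it
    def embed(x):
        return nnf.normalize(cnn_model(x).mean(dim=1), dim=-1)
    q = embed(query_img.unsqueeze(0))   # (1, c) query descriptor
    g = embed(gallery_imgs)             # (N, c) gallery descriptors
    scores = (q @ g.T).squeeze(0)       # cosine similarity of the query to each gallery image
    return scores.argmax().item()       # index of the best-matching gallery image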
As shown in fig. 6, an embodiment of the present disclosure provides a training apparatus for a target re-identification model, including:
a first determining module 610, configured to determine, based on the image sample set, a plurality of first feature maps output by the pre-trained transformer model and a plurality of second feature maps output by the convolutional neural network model to be trained;
a second determining module 620, configured to determine a first distillation loss value from the plurality of first feature maps and the plurality of second feature maps using the first knowledge distillation loss function;
a third determining module 630, configured to determine a second distillation loss value from the plurality of first feature maps and the plurality of second feature maps using the second knowledge distillation loss function; and
a training module 640, configured to train the convolutional neural network model to be trained according to the first distillation loss value and the second distillation loss value to obtain a convolutional neural network model capable of performing target re-identification.
In one embodiment, the image sample set includes image samples of a plurality of targets, each target of the plurality of targets corresponding to a plurality of image samples.
In one embodiment, the first determining module 610 includes:
a first determining submodule, configured to determine a plurality of first feature maps output by the pre-trained converter model based on the image sample set;
a second determining submodule, configured to determine a plurality of second feature maps output by the convolutional neural network model to be trained based on the image sample set; and
an adjusting submodule, configured to adjust the sizes of the plurality of first feature maps and/or the plurality of second feature maps in response to the plurality of first feature maps and the plurality of second feature maps being different in size.
In one embodiment, the adjusting submodule is configured to, in response to the plurality of first feature maps and the plurality of second feature maps being different in size, use one of the following to adjust the sizes (a sketch follows this list):
downsampling the plurality of first feature maps so as to adapt the sizes of the plurality of first feature maps and the plurality of second feature maps;
upsampling the plurality of second feature maps so as to adapt the sizes of the plurality of first feature maps and the plurality of second feature maps;
downsampling the plurality of first feature maps and upsampling the plurality of second feature maps so as to adapt the sizes of the plurality of first feature maps and the plurality of second feature maps.
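A minimal sketch of the first option, assuming PyTorch and feature maps stored as (batch, channel, height, width) tensors (the function name and the average-pooling operator are illustrative; the disclosure does not fix the downsampling method):

    import torch.nn.functional as F

    def adapt_sizes(teacher_maps, student_maps):
        # Downsample the (first) feature maps of the teacher to the spatial
        # size of the (second) feature maps of the student when they differ.
        if teacher_maps.shape[-2:] != student_maps.shape[-2:]:
            teacher_maps = F.adaptive_avg_pool2d(teacher_maps, student_maps.shape[-2:])
        return teacher_maps, student_maps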
In one embodiment, the second determination module 620 includes:
a first calculating module, configured to calculate, for each first feature map in the plurality of first feature maps, an image block similarity matrix of that first feature map;
a second calculation module, configured to calculate, for each second feature map in the plurality of second feature maps, a feature vector similarity matrix of that second feature map;
a third calculation module, configured to calculate a first distance according to the image block similarity matrix of the first feature map and the feature vector similarity matrix of the second feature map that correspond to the same image sample of the same target in the image sample set; and
a fourth determining module, configured to determine a first distillation loss value according to the first distance of each image sample in the image sample set by using a first knowledge distillation loss function.
In one embodiment, the first knowledge distillation loss function is a mean square error loss function, and the fourth determining module is configured to determine the first distillation loss value according to the first distance of each image sample in the image sample set by using the mean square error loss function. A sketch of this computation follows.
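Assuming the two feature maps already share a spatial size and that cosine similarity is the pairwise measure (the disclosure does not name the similarity function), the first distillation loss could be sketched as:

    import torch.nn.functional as F

    def pairwise_similarity(feat):
        # feat: (B, C, H, W) -> (B, HW, HW) cosine similarities between all
        # spatial positions (image blocks of the first feature map, feature
        # vectors of the second feature map).
        v = feat.flatten(2).transpose(1, 2)  # (B, HW, C)
        v = F.normalize(v, dim=-1)
        return v @ v.transpose(1, 2)         # (B, HW, HW)

    def first_distillation_loss(teacher_maps, student_maps):
        # First distance: mean square error between the teacher's image block
        # similarity matrix and the student's feature vector similarity matrix,
        # averaged over the image samples in the batch.
        s_t = pairwise_similarity(teacher_maps).detach()  # teacher is frozen
        s_s = pairwise_similarity(student_maps)
        return F.mse_loss(s_s, s_t)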
In one embodiment, the third determination module 630 includes:
and the fourth computing sub-module is used for computing a first similarity matrix of the feature graphs corresponding to each two different image samples of the same target in the image sample set according to the plurality of first feature graphs and the plurality of second feature graphs. The feature map corresponding to each two different image samples of the same target comprises: a first feature map of a first image sample and a second feature map of a second image sample of the same object.
And a fifth calculation sub-module, configured to calculate the second distance according to each first similarity matrix.
And a sixth computing sub-module, configured to compute a second similarity matrix of feature maps corresponding to image samples of each two different targets in the image sample set according to the plurality of first feature maps and the plurality of second feature maps. The feature map corresponding to the image samples of each two different targets comprises: a first feature map corresponding to the image sample of the first object and a second feature map corresponding to the image sample of the second object.
And a seventh calculation sub-module for calculating the third distance according to each second similarity matrix.
And a fifth determining module, configured to determine a second distillation loss value according to a second distance corresponding to the same target and a third distance corresponding to each two different targets by using a second knowledge distillation loss function.
In one embodiment, the second knowledge distillation loss function is a triplet loss function, and the fifth determining module is configured to determine the second distillation loss value according to the second distance corresponding to the same target and the third distance corresponding to each two different targets by using the triplet loss function. A sketch follows.
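As an illustrative sketch only (reducing the distance derived from a cross-model similarity matrix to one minus the cosine similarity of the flattened maps, with an assumed margin; the disclosure specifies neither):

    import torch
    import torch.nn.functional as F

    def cross_similarity_distance(map_a, map_b):
        # Distance between one model's feature map of sample A and the other
        # model's feature map of sample B: 1 - cosine similarity of the
        # flattened maps (an illustrative choice).
        va = F.normalize(map_a.flatten(), dim=0)
        vb = F.normalize(map_b.flatten(), dim=0)
        return 1.0 - torch.dot(va, vb)

    def second_distillation_loss(second_distance, third_distance, margin=0.3):
        # Triplet loss: push the same-target (second) distance below the
        # different-target (third) distance by at least the margin.
        return F.relu(second_distance - third_distance + margin)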
In one embodiment, the fourth computing submodule is configured to:
compute, according to the plurality of first feature maps and the plurality of second feature maps, a first similarity matrix of the feature maps corresponding to each two different image samples of the same target in the image sample set,
wherein the feature maps corresponding to each two different image samples of the same target comprise: the first feature map of a first image sample of the same target and the second feature map of the second image sample with the lowest similarity thereto, and the second feature map of the first image sample of the same target and the first feature map of the second image sample with the lowest similarity thereto; the second image sample is a positive sample.
In one embodiment, the sixth computing submodule is configured to:
compute, according to the plurality of first feature maps and the plurality of second feature maps, a second similarity matrix of the feature maps corresponding to the image samples of each two different targets in the image sample set,
wherein the feature maps corresponding to the image samples of each two different targets comprise: the first feature map corresponding to an image sample of a first target and the second feature map corresponding to the image sample of the second target with the lowest similarity thereto, and the second feature map corresponding to the image sample of the first target and the first feature map corresponding to the image sample of the second target with the lowest similarity thereto; the image sample of the second target is a negative sample. A sketch of this lowest-similarity pairing follows.
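A sketch of the lowest-similarity pairing under the same illustrative distance as above (the helper name is hypothetical; note that the text selects the lowest-similarity candidate for both the positive and the negative pairing):

    import torch

    def lowest_similarity_pair(anchor_map, candidate_maps):
        # Among candidate feature maps from the other model, pick the one whose
        # cross-model similarity to the anchor map is lowest; similarity here is
        # 1 - cross_similarity_distance from the sketch above.
        sims = torch.stack([1.0 - cross_similarity_distance(anchor_map, c)
                            for c in candidate_maps])
        idx = int(sims.argmin())
        return idx, candidate_maps[idx]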
As shown in fig. 7, an embodiment of the present disclosure provides a target re-recognition apparatus, including:
an image determination module 710 for determining a target image.
The image recognition module 720 is configured to perform target re-recognition on the target image by using the convolutional neural network model trained by the method according to any one of the embodiments of the present disclosure, so as to recognize the target object in the target image.
For descriptions of specific functions and examples of each module and sub-module of the apparatus in the embodiments of the present disclosure, reference may be made to the related descriptions of corresponding steps in the foregoing method embodiments, which are not repeated herein.
In the technical solution of the present disclosure, the acquisition, storage, and application of the user personal information involved all comply with the provisions of relevant laws and regulations and do not violate public order and good customs.
According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium and a computer program product.
Fig. 8 illustrates a schematic block diagram of an example electronic device 800 that may be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile apparatuses, such as personal digital assistants, cellular telephones, smartphones, wearable devices, and other similar computing apparatuses. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 8, the device 800 includes a computing unit 801 that can perform various appropriate actions and processes according to a computer program stored in a read-only memory (ROM) 802 or a computer program loaded from a storage unit 808 into a random access memory (RAM) 803. In the RAM 803, various programs and data required for the operation of the device 800 can also be stored. The computing unit 801, the ROM 802, and the RAM 803 are connected to each other by a bus 804. An input/output (I/O) interface 805 is also connected to the bus 804.
Various components in device 800 are connected to I/O interface 805, including: an input unit 806 such as a keyboard, mouse, etc.; an output unit 807 such as various types of displays, speakers, and the like; a storage unit 808, such as a magnetic disk, optical disk, etc.; and a communication unit 809, such as a network card, modem, wireless communication transceiver, or the like. The communication unit 809 allows the device 800 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.
The computing unit 801 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of the computing unit 801 include, but are not limited to, a central processing unit (CPU), a graphics processing unit (GPU), various specialized artificial intelligence (AI) computing chips, various computing units running machine learning model algorithms, a digital signal processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 801 performs the respective methods and processes described above, such as the training method of the target re-recognition model and the target re-recognition method. For example, in some embodiments, the training method of the target re-recognition model and the target re-recognition method may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 808. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 800 via the ROM 802 and/or the communication unit 809. When the computer program is loaded into the RAM 803 and executed by the computing unit 801, one or more steps of the training method of the target re-recognition model and the target re-recognition method described above may be performed. Alternatively, in other embodiments, the computing unit 801 may be configured to perform the training method of the target re-recognition model and the target re-recognition method in any other suitable manner (e.g., by means of firmware).
Various implementations of the systems and techniques described here may be implemented in digital electronic circuitry, integrated circuit systems, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: being implemented in one or more computer programs, where the one or more computer programs may be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a special purpose or general-purpose programmable processor, and may receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowcharts and/or block diagrams to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine, or entirely on a remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local area networks (LANs), wide area networks (WANs), and the internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server incorporating a blockchain.
It should be appreciated that steps may be reordered, added, or deleted using the various forms of flow shown above. For example, the steps recited in the present disclosure may be performed in parallel, sequentially, or in a different order, provided that the desired results of the technical solutions of the present disclosure can be achieved, and no limitation is imposed herein.
The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions, improvements, etc. that are within the principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims (24)

1. A training method of a target re-recognition model, comprising:
determining a plurality of first feature maps output by a pre-trained converter model and a plurality of second feature maps output by a convolutional neural network model to be trained based on the image sample set;
determining a first distillation loss value by using a first knowledge distillation loss function according to an image block similarity matrix of a first feature map and a feature vector similarity matrix of a second feature map corresponding to the same image sample of the same target in the image sample set; wherein the image block similarity matrix of the first feature map is obtained by calculating similarities between every pair of image blocks of the first feature map, and the feature vector similarity matrix of the second feature map is obtained by calculating similarities between every pair of feature vectors of the second feature map;
determining a second distillation loss value by using a second knowledge distillation loss function according to a first similarity matrix of feature maps corresponding to each two different image samples of the same target in the image sample set and a second similarity matrix of feature maps corresponding to each two different image samples of the image sample set; and
and training the convolutional neural network model to be trained according to the first distillation loss value and the second distillation loss value to obtain a convolutional neural network model capable of performing target re-identification.
2. The method of claim 1, wherein the image sample set includes image samples of a plurality of targets, each target of the plurality of targets corresponding to a plurality of image samples.
3. The method of claim 1, wherein the determining, based on the set of image samples, a plurality of first feature maps of the pre-trained converter model output and a plurality of second feature maps of the convolutional neural network model output to be trained comprises:
determining a plurality of first feature maps of the pre-trained converter model output based on the set of image samples;
determining a plurality of second feature maps output by a convolutional neural network model to be trained based on the image sample set; and
adjusting the sizes of the plurality of first feature maps and/or the plurality of second feature maps in response to the plurality of first feature maps and the plurality of second feature maps being different in size.
4. The method according to claim 3, wherein the adjusting the sizes of the plurality of first feature maps and/or the plurality of second feature maps in response to the plurality of first feature maps and the plurality of second feature maps being different in size comprises:
in response to the plurality of first feature maps and the plurality of second feature maps being different in size, one of the following is used for sizing:
downsampling the first plurality of feature maps to adapt the sizes of the first plurality of feature maps and the second plurality of feature maps;
upsampling the second plurality of feature maps to adapt the first plurality of feature maps to the dimensions of the second plurality of feature maps;
and performing downsampling processing on the first feature maps and upsampling processing on the second feature maps so as to adapt the sizes of the first feature maps and the second feature maps.
5. The method according to any one of claims 1 to 4, wherein determining the first distillation loss value from the image block similarity matrix of the first feature map and the feature vector similarity matrix of the second feature map corresponding to the same image sample of the same object in the set of image samples using the first knowledge distillation loss function comprises:
calculating, for each first feature map in the plurality of first feature maps, an image block similarity matrix of the first feature map;
calculating, for each second feature map in the plurality of second feature maps, a feature vector similarity matrix of the second feature map;
calculating a first distance according to the image block similarity matrix of the first feature map and the feature vector similarity matrix of the second feature map corresponding to the same image sample of the same target in the image sample set;
and determining a first distillation loss value according to a first distance of each image sample in the image sample set by using a first knowledge distillation loss function.
6. The method of claim 5, wherein the first knowledge distillation loss function is a mean square error loss function;
wherein the determining a first distillation loss value according to a first distance of each image sample in the image sample set by using a first knowledge distillation loss function comprises:
determining a first distillation loss value by using the mean square error loss function according to the first distance of each image sample in the image sample set.
7. The method of any of claims 1 to 4, wherein determining a second distillation loss value from a first similarity matrix of feature maps corresponding to each two different image samples of the same object in the set of image samples and a second similarity matrix of feature maps corresponding to each two different image samples of the set of image samples using a second knowledge distillation loss function comprises:
calculating, according to the plurality of first feature maps and the plurality of second feature maps, a first similarity matrix of feature maps corresponding to each two different image samples of the same target in the image sample set; wherein the feature maps corresponding to each two different image samples of the same target comprise: a first feature map of a first image sample and a second feature map of a second image sample of the same target;
calculating a second distance according to each first similarity matrix;
calculating, according to the plurality of first feature maps and the plurality of second feature maps, a second similarity matrix of feature maps corresponding to image samples of each two different targets in the image sample set; wherein the feature maps corresponding to the image samples of each two different targets comprise: a first feature map corresponding to the image sample of the first target and a second feature map corresponding to the image sample of the second target;
calculating a third distance according to each second similarity matrix;
and determining a second distillation loss value by using a second knowledge distillation loss function according to the second distance corresponding to the same target and the third distance corresponding to each two different targets.
8. The method of claim 7, wherein the second knowledge distillation loss function is a triplet loss function;
wherein the determining a second distillation loss value according to the second distance corresponding to the same target and the third distance corresponding to each two different targets by using a second knowledge distillation loss function comprises:
determining a second distillation loss value by using the triplet loss function according to the second distance corresponding to the same target and the third distance corresponding to each two different targets.
9. The method of claim 7, wherein the calculating a first similarity matrix for feature maps corresponding to each two different image samples of the same object in the set of image samples from the plurality of first feature maps and the plurality of second feature maps comprises:
calculating, according to the plurality of first feature maps and the plurality of second feature maps, a first similarity matrix of feature maps corresponding to each two different image samples of the same target in the image sample set;
the feature map corresponding to each two different image samples of the same target comprises: the first feature map of the first image sample of the same target and the second feature map of the second image sample with the lowest similarity, and the second feature map of the first image sample of the same target and the first feature map of the second image sample with the lowest similarity; the second image sample is a positive sample.
10. The method of claim 7, wherein the computing a second similarity matrix for feature maps corresponding to image samples of each two different objects in the set of image samples from the plurality of first feature maps and the plurality of second feature maps comprises:
calculating, according to the plurality of first feature maps and the plurality of second feature maps, a second similarity matrix of feature maps corresponding to image samples of each two different targets in the image sample set;
the feature map corresponding to the image samples of each two different targets comprises: a first feature map corresponding to an image sample of a first object and a second feature map corresponding to an image sample of a second object with the lowest similarity, and a second feature map corresponding to an image sample of the first object and a first feature map corresponding to an image sample of a second object with the lowest similarity; the image sample of the second object is a negative sample.
11. A target re-identification method, comprising:
determining a target image;
target re-recognition is performed on the target image by using the convolutional neural network model trained by the method of any one of claims 1 to 10 to identify a target object in the target image.
12. A training device for a target re-recognition model, comprising:
a first determining module, configured to determine, based on the image sample set, a plurality of first feature maps output by the pre-trained converter model and a plurality of second feature maps output by the convolutional neural network model to be trained;
a second determining module, configured to determine a first distillation loss value by using a first knowledge distillation loss function according to an image block similarity matrix of a first feature map and a feature vector similarity matrix of a second feature map corresponding to the same image sample of the same target in the image sample set; wherein the image block similarity matrix of the first feature map is obtained by calculating similarities between every pair of image blocks of the first feature map, and the feature vector similarity matrix of the second feature map is obtained by calculating similarities between every pair of feature vectors of the second feature map;
a third determining module, configured to determine a second distillation loss value according to a first similarity matrix of feature maps corresponding to each two different image samples of the same object in the image sample set and a second similarity matrix of feature maps corresponding to each two different image samples of the image sample set by using a second knowledge distillation loss function; and
a training module, configured to train the convolutional neural network model to be trained according to the first distillation loss value and the second distillation loss value, so as to obtain a convolutional neural network model capable of performing target re-identification.
13. The apparatus of claim 12, wherein the image sample set includes image samples of a plurality of targets, each target of the plurality of targets corresponding to a plurality of image samples.
14. The apparatus of claim 12, wherein the first determination module comprises:
a first determination sub-module for determining a plurality of first feature maps of the pre-trained converter model output based on the set of image samples;
a second determining submodule, configured to determine a plurality of second feature maps output by a convolutional neural network model to be trained based on the image sample set;
and the adjusting sub-module is used for adjusting the sizes of the first feature maps and/or the second feature maps in response to the sizes of the first feature maps and the second feature maps being different.
15. The apparatus of claim 14, wherein the adjustment submodule is to:
in response to the plurality of first feature maps and the plurality of second feature maps being different in size, one of the following is used for sizing:
Downsampling the first plurality of feature maps to adapt the sizes of the first plurality of feature maps and the second plurality of feature maps;
upsampling the second plurality of feature maps to adapt the first plurality of feature maps to the dimensions of the second plurality of feature maps;
and performing downsampling processing on the first feature maps and upsampling processing on the second feature maps so as to adapt the sizes of the first feature maps and the second feature maps.
16. The apparatus of any of claims 12 to 15, wherein the second determination module comprises:
a first computing module, configured to compute, for each first feature map in the plurality of first feature maps, an image block similarity matrix of the first feature map;
a second calculation module, configured to calculate, for each second feature map in the plurality of second feature maps, a feature vector similarity matrix of the second feature map;
a third calculation module, configured to calculate a first distance according to an image block similarity matrix of a first feature map and a feature vector similarity matrix of a second feature map that correspond to the same image sample of the same target in the image sample set;
and a fourth determining module, configured to determine a first distillation loss value according to a first distance of each image sample in the image sample set by using a first knowledge distillation loss function.
17. The apparatus of claim 16, wherein the first knowledge distillation loss function is a mean square error loss function;
the fourth determining module is configured to determine a first distillation loss value according to a first distance of each image sample in the image sample set by using a mean square error loss function.
18. The apparatus of any of claims 12 to 15, wherein the third determination module comprises:
a fourth computing sub-module, configured to compute, according to the plurality of first feature maps and the plurality of second feature maps, a first similarity matrix of feature maps corresponding to each two different image samples of the same object in the image sample set; the feature map corresponding to each two different image samples of the same target comprises: a first feature map of a first image sample and a second feature map of a second image sample of the same object;
a fifth calculation sub-module, configured to calculate a second distance according to each first similarity matrix;
A sixth computing sub-module, configured to compute, according to the plurality of first feature maps and the plurality of second feature maps, a second similarity matrix of feature maps corresponding to image samples of each two different targets in the image sample set; the feature map corresponding to the image samples of each two different targets comprises: a first feature map corresponding to the image sample of the first target and a second feature map corresponding to the image sample of the second target;
a seventh calculation sub-module for calculating a third distance according to each second similarity matrix;
and a fifth determining module, configured to determine a second distillation loss value according to the second distance corresponding to the same target and the third distance corresponding to each two different targets by using a second knowledge distillation loss function.
19. The apparatus of claim 18, wherein the second knowledge distillation loss function is a triplet loss function;
and the fifth determining module is used for determining a second distillation loss value by utilizing the triplet loss function according to the second distance corresponding to the same target and the third distance corresponding to each two different targets.
20. The apparatus of claim 18, wherein the fourth computing submodule is to:
calculate, according to the plurality of first feature maps and the plurality of second feature maps, a first similarity matrix of feature maps corresponding to each two different image samples of the same target in the image sample set;
the feature map corresponding to each two different image samples of the same target comprises: the first feature map of the first image sample of the same target and the second feature map of the second image sample with the lowest similarity, and the second feature map of the first image sample of the same target and the first feature map of the second image sample with the lowest similarity; the second image sample is a positive sample.
21. The apparatus of claim 18, wherein the sixth computing submodule is to:
calculate, according to the plurality of first feature maps and the plurality of second feature maps, a second similarity matrix of feature maps corresponding to image samples of each two different targets in the image sample set;
the feature map corresponding to the image samples of each two different targets comprises: a first feature map corresponding to an image sample of a first object and a second feature map corresponding to an image sample of a second object with the lowest similarity, and a second feature map corresponding to an image sample of the first object and a first feature map corresponding to an image sample of a second object with the lowest similarity; the image sample of the second object is a negative sample.
22. An object re-recognition apparatus comprising:
the image determining module is used for determining a target image;
an image recognition module, configured to perform target re-recognition on the target image by using the convolutional neural network model trained by the method according to any one of claims 1 to 10, so as to recognize a target object in the target image.
23. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1 to 11.
24. A non-transitory computer readable storage medium storing computer instructions for causing the computer to perform the method of any one of claims 1 to 11.
CN202211272814.1A 2022-10-18 2022-10-18 Training method of target re-identification model and target re-identification method Active CN115578613B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211272814.1A CN115578613B (en) 2022-10-18 2022-10-18 Training method of target re-identification model and target re-identification method

Publications (2)

Publication Number Publication Date
CN115578613A (en) 2023-01-06
CN115578613B (en) 2024-03-08

Family

ID=84584149

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211272814.1A Active CN115578613B (en) 2022-10-18 2022-10-18 Training method of target re-identification model and target re-identification method

Country Status (1)

Country Link
CN (1) CN115578613B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111554268A (en) * 2020-07-13 2020-08-18 Tencent Technology (Shenzhen) Co., Ltd. Language identification method based on language model, text classification method and device
CN114494776A (en) * 2022-01-24 2022-05-13 Beijing Baidu Netcom Science and Technology Co., Ltd. Model training method, device, equipment and storage medium
WO2022104550A1 (en) * 2020-11-17 2022-05-27 Huawei Technologies Co., Ltd. Model distillation training method and related apparatus, device, and readable storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11205096B2 (en) * 2018-11-19 2021-12-21 Google Llc Training image-to-image translation neural networks

Also Published As

Publication number Publication date
CN115578613A (en) 2023-01-06

Similar Documents

Publication Publication Date Title
CN108898086B (en) Video image processing method and device, computer readable medium and electronic equipment
CN111797893B (en) Neural network training method, image classification system and related equipment
CN108427927B (en) Object re-recognition method and apparatus, electronic device, program, and storage medium
CN107679513B (en) Image processing method and device and server
CN113343982B (en) Entity relation extraction method, device and equipment for multi-modal feature fusion
CN113920307A (en) Model training method, device, equipment, storage medium and image detection method
CN113361710B (en) Student model training method, picture processing device and electronic equipment
CN113971751A (en) Training feature extraction model, and method and device for detecting similar images
CN111488985A (en) Deep neural network model compression training method, device, equipment and medium
CN108229658B (en) Method and device for realizing object detector based on limited samples
CN111914908B (en) Image recognition model training method, image recognition method and related equipment
CN115482395B (en) Model training method, image classification device, electronic equipment and medium
CN112529180B (en) Method and apparatus for model distillation
WO2022161302A1 (en) Action recognition method and apparatus, device, storage medium, and computer program product
CN115147680B (en) Pre-training method, device and equipment for target detection model
CN113837965B (en) Image definition identification method and device, electronic equipment and storage medium
CN112800932B (en) Method for detecting remarkable ship target in offshore background and electronic equipment
CN113469025A (en) Target detection method and device applied to vehicle-road cooperation, road side equipment and vehicle
CN115578613B (en) Training method of target re-identification model and target re-identification method
CN114913339B (en) Training method and device for feature map extraction model
CN116935368A (en) Deep learning model training method, text line detection method, device and equipment
CN115546554A (en) Sensitive image identification method, device, equipment and computer readable storage medium
CN114821190A (en) Image classification model training method, image classification method, device and equipment
CN114707638A (en) Model training method, model training device, object recognition method, object recognition device, object recognition medium and product
CN113808151A (en) Method, device and equipment for detecting weak semantic contour of live image and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant