CN115578613A - Training method of target re-recognition model and target re-recognition method


Info

Publication number
CN115578613A
CN115578613A (application CN202211272814.1A)
Authority
CN
China
Prior art keywords
feature maps
feature
image
image sample
target
Prior art date
Legal status: Granted
Application number
CN202211272814.1A
Other languages: Chinese (zh)
Other versions: CN115578613B (en)
Inventor
张欣彧
王健
冯浩城
Current Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202211272814.1A
Publication of CN115578613A
Application granted
Publication of CN115578613B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/62 Extraction of image or video features relating to a temporal dimension, e.g. time-based feature extraction; Pattern tracking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/50 Context or environment of the image
    • G06V20/52 Surveillance or monitoring of activities, e.g. for recognising suspicious objects

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Image Analysis (AREA)

Abstract

The disclosure provides a training method for a target re-recognition model and a target re-recognition method, relates to the technical field of artificial intelligence, in particular to deep learning, image processing, and computer vision, and can be applied to scenarios such as face recognition and smart cities. The specific implementation scheme is as follows: based on an image sample set, determine a plurality of first feature maps output by a pre-trained converter model and a plurality of second feature maps output by a convolutional neural network model to be trained; determine a first distillation loss value using a first knowledge distillation loss function based on the plurality of first feature maps and the plurality of second feature maps; determine a second distillation loss value using a second knowledge distillation loss function; and train the convolutional neural network model to be trained according to the first distillation loss value and the second distillation loss value to obtain a convolutional neural network model capable of performing target re-identification. With this technical scheme, a convolutional neural network model with better inference speed and recognition performance can be trained.

Description

Training method of target re-recognition model and target re-recognition method
Technical Field
The disclosure relates to the technical field of artificial intelligence, in particular to the technical field of deep learning, image processing and computer vision, and can be applied to scenes such as face recognition and smart cities.
Background
Pedestrian re-identification (person re-identification), also known as pedestrian re-recognition, is a technique that uses computer vision to determine whether a particular pedestrian is present in an image or video sequence. It aims to compensate for the visual limitations of fixed cameras, can be combined with pedestrian detection/pedestrian tracking techniques, and is widely applicable to fields such as intelligent video surveillance and intelligent security.
Because of the differences between camera devices, and because pedestrians are both rigid and non-rigid, with appearance easily affected by clothing, scale, occlusion, pose, viewing angle, and the like, pedestrian re-identification has become a hot topic in computer vision that is both research-worthy and highly challenging.
Disclosure of Invention
The present disclosure provides a training method of a target re-recognition model and a target re-recognition method.
According to an aspect of the present disclosure, there is provided a training method of a target re-recognition model, including:
determining a plurality of first feature maps output by a pre-trained converter model and a plurality of second feature maps output by a convolutional neural network model to be trained on the basis of an image sample set;
determining a first distillation loss value using a first knowledge distillation loss function according to the plurality of first feature maps and the plurality of second feature maps;
determining a second distillation loss value using a second knowledge distillation loss function according to the plurality of first feature maps and the plurality of second feature maps; and
training the convolutional neural network model to be trained according to the first distillation loss value and the second distillation loss value to obtain a convolutional neural network model capable of performing target re-identification.
According to another aspect of the present disclosure, there is provided an object re-recognition method including:
determining a target image;
and performing target re-identification on the target image using a convolutional neural network model trained by the method of any embodiment of the present disclosure, so as to identify the target object in the target image.
According to another aspect of the present disclosure, there is provided a training apparatus for a target re-recognition model, including:
the first determination module is used for determining a plurality of first feature maps output by a pre-trained converter model and a plurality of second feature maps output by a convolutional neural network model to be trained on the basis of the image sample set;
a second determination module for determining a first distillation loss value using the first knowledge distillation loss function based on the plurality of first feature maps and the plurality of second feature maps;
a third determining module for determining a second distillation loss value using a second knowledge distillation loss function according to the plurality of first feature maps and the plurality of second feature maps; and
the training module is used for training the convolutional neural network model to be trained according to the first distillation loss value and the second distillation loss value, so as to obtain a convolutional neural network model capable of performing target re-identification.
According to another aspect of the present disclosure, there is provided an object re-recognition apparatus including:
an image determination module for determining a target image;
and the image recognition module is used for performing target re-identification on the target image using a convolutional neural network model trained by the method of any embodiment of the present disclosure, so as to identify the target object in the target image.
According to another aspect of the present disclosure, there is provided an electronic device including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to cause the at least one processor to perform a method according to any one of the embodiments of the present disclosure.
According to another aspect of the present disclosure, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform a method according to any one of the embodiments of the present disclosure.
According to another aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements a method according to any one of the embodiments of the present disclosure.
According to the technical scheme, the convolutional neural network model with better inference speed and recognition performance can be trained.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a schematic diagram of a training method of a target re-identification model according to an embodiment of the present disclosure;
FIG. 2 is a schematic diagram of computing an image block similarity matrix according to an embodiment of the present disclosure;
FIG. 3 is a schematic diagram of computing a first similarity matrix according to an embodiment of the present disclosure;
FIG. 4 is a schematic diagram of a training method of a target re-recognition model according to an embodiment of the present disclosure;
FIG. 5 is a schematic diagram of a target re-identification method according to an embodiment of the present disclosure;
FIG. 6 is a schematic diagram of a training apparatus for a target re-recognition model according to an embodiment of the present disclosure;
FIG. 7 is a schematic diagram of an object re-identification apparatus according to an embodiment of the present disclosure;
FIG. 8 is a block diagram of an electronic device used to implement methods of embodiments of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
As shown in fig. 1, an embodiment of the present disclosure provides a training method of a target re-recognition model, including:
step S101: based on the image sample set, a plurality of first feature maps output by the pre-trained converter model and a plurality of second feature maps output by the convolutional neural network model to be trained are determined.
Step S102: a first distillation loss value is determined using a first knowledge distillation loss function based on the plurality of first feature maps and the plurality of second feature maps.
Step S103: a second distillation loss value is determined using a second knowledge distillation loss function based on the plurality of first feature maps and the plurality of second feature maps.
Step S104: the convolutional neural network model to be trained is trained according to the first distillation loss value and the second distillation loss value to obtain a convolutional neural network model capable of performing target re-identification.
According to the embodiments of the present disclosure, it should be noted that:
the image sample set may include image samples of a plurality of objects, and each object in the plurality of objects corresponds to a plurality of image samples. For example, the batch sample (image sample set) is B = P × K, that is, K ID (Identity document) targets are selected, each ID target selects P image samples, and K and P each represent a number.
The plurality of targets may include any objects, such as persons, animals, or things, and is not specifically limited here. Image samples of a plurality of targets may be understood as including image samples of person A, person B, and person C. Each target has multiple image samples, which may be understood as including, for example, image samples of person A captured from multiple angles and/or multiple positions.
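As an illustration of the B = P × K batch composition described above, the following Python sketch draws K ID targets and P image samples per target (a minimal sketch under assumptions; the disclosure does not prescribe any implementation, and the function and data-structure names are hypothetical):

```python
import random

def sample_pk_batch(samples_by_id, k_ids, p_per_id):
    """Draw a batch of B = K x P image samples: K ID targets, P samples each.

    samples_by_id: dict mapping a target ID to a list of its image samples.
    Returns a list of (image, target_id) pairs of length k_ids * p_per_id.
    """
    ids = random.sample(sorted(samples_by_id), k_ids)   # choose K ID targets
    batch = []
    for tid in ids:
        for img in random.sample(samples_by_id[tid], p_per_id):  # P samples per ID
            batch.append((img, tid))
    return batch
```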
The pre-trained converter model may be understood as a Transformer model that has been pre-trained and can perform target re-recognition. The specific structure of the converter model is not limited here; it suffices that the converter model can perform visual tasks, that is, determine whether a specific target exists in an image or video using computer vision techniques.
The plurality of first feature maps can be understood to include: first feature maps output by the pre-trained converter model based on a plurality of different image samples of the same target, and first feature maps output by the pre-trained converter model based on image samples of different targets. The number of first feature maps output by the pre-trained converter model corresponds to the number of batch samples input into the converter model.
The convolutional neural network model to be trained can be understood as a convolutional neural network (CNN) model that requires model training to acquire the target re-recognition function.
The plurality of second feature maps can be understood to include: second feature maps output by the convolutional neural network model to be trained based on a plurality of different image samples of the same target, and second feature maps output by the convolutional neural network model to be trained based on image samples of different targets. The number of second feature maps output by the convolutional neural network model to be trained corresponds to the number of batch samples (the plurality of image samples) input into the model.
The first knowledge distillation loss function and the second knowledge distillation loss function can be selected as needed, as long as a knowledge distillation mechanism can be realized so that the high-quality knowledge learned by the pre-trained converter model can be migrated to the convolutional neural network model using the plurality of first feature maps and the plurality of second feature maps.
Training the convolutional neural network model to be trained according to the first distillation loss value and the second distillation loss value can be understood as optimizing the parameters of the convolutional neural network model to be trained according to the first distillation loss value and the second distillation loss value until the model converges. It can also be understood as optimizing the parameters according to the first distillation loss value, the second distillation loss value, and the loss value obtained from the loss function of the convolutional neural network model to be trained itself, until the model converges.
With the development of deep learning, convolutional neural network models are well supported on all kinds of hardware devices, achieving both high speed and high precision. Model structures based on converter models are gradually emerging. Compared with a convolutional neural network model, the converter model structure can describe an image globally, so it matches the effect of convolutional neural network models on various visual tasks, such as performance in recognition, detection, segmentation, and fine-grained classification. However, on edge devices and in resource-limited environments, the converter model lacks efficient hardware design support; once deployed on such devices, it has a slow inference speed and struggles to balance speed and precision. The method of the disclosed embodiments effectively solves these problems.
According to the disclosed embodiments, through a knowledge distillation mechanism and an image sample set, the trained converter model serves as the teacher model and the convolutional neural network model to be trained serves as the student model, so that the high-quality target recognition knowledge learned by the converter model can be migrated to the convolutional neural network model. The convolutional neural network model can thus learn the better feature representations of the converter model and reach its recognition accuracy, overcoming the problems that a convolutional neural network structure is limited by the receptive field of its convolution kernels, struggles to realize global feature interaction, and generalizes poorly. This improves the performance of the convolutional neural network model in recognition, detection, segmentation, fine-grained classification, and so on.
Through knowledge distillation from the converter model, the global feature characterization of the convolutional neural network model is improved, and in turn its precision, so that during target re-identification the model can fuse global and local feature characterizations, ensuring its feature discrimination capability during retrieval. Meanwhile, the trained convolutional neural network model, which attains the performance of the converter model, is well supported on all kinds of hardware devices, ensuring its computational inference speed on hardware devices such as GPUs (graphics processing units) and edge devices, and solving the problem that a converter model is difficult to deploy well on limited hardware while sustaining high performance. The operating performance and speed of the hardware device are effectively improved.
In one application example, the convolutional neural network model for pedestrian re-recognition may be trained by using the training method of the target re-recognition model of the embodiment of the present disclosure.
In an application example, the convolutional neural network model applicable to scenes such as face recognition and smart cities can be trained by using the training method of the target re-recognition model of the embodiment of the disclosure.
In one example, the plurality of first feature maps output by the converter model based on the image sample set may be represented as {G_1, G_2, …, G_B}, G_b ∈ R^(p×c), where p is the number of patches (image blocks) of the first feature map, p = H × W, H is the height of the first feature map, W is the width of the first feature map, c is the feature dimension, b is the index of a first feature map, and B is the total number of first feature maps. G_1 and G_2 can be understood as different first feature maps of the same ID target, while G_3 is a first feature map of an ID target different from that of G_2.
In one example, the plurality of second feature maps output by the convolutional neural network model to be trained based on the image sample set may be represented as {F_1, F_2, …, F_B}, F_b ∈ R^(p×c), where p is the number of feature vectors of the second feature map, p = H × W, H is the height of the second feature map, W is the width of the second feature map, c is the feature dimension, b is the index of a second feature map, and B is the total number of second feature maps. F_1 and F_2 can be understood as different second feature maps of the same ID target, while F_3 is a second feature map of an ID target different from that of F_2.
In one implementation manner, the training method of the target re-recognition model of the embodiment of the present disclosure includes steps S101 to S104, where step S101: determining a plurality of first feature maps output by the pre-trained converter model and a plurality of second feature maps output by the convolutional neural network model to be trained based on the image sample set may include:
step S1011: based on the set of image samples, a plurality of first feature maps output by the pre-trained converter model are determined.
Step S1012: and determining a plurality of second feature maps output by the convolutional neural network model to be trained on the basis of the image sample set.
Step S1013: and adjusting the sizes of the plurality of first feature maps and/or the plurality of second feature maps in response to the sizes of the plurality of first feature maps and the plurality of second feature maps being different.
According to the embodiments of the present disclosure, it should be noted that:
the specific manner of adjusting the sizes of the plurality of first feature maps and/or the plurality of second feature maps is not specifically limited, and it is sufficient to ensure that the adjusted sizes are consistent and the characterization of the feature vectors is not affected.
When resizing, the size of the first feature map may be adjusted to fit the second feature map based on the size of the second feature map; the size of the second feature map may be adjusted to fit the first feature map based on the size of the first feature map; or the sizes of the first feature map and the second feature map may both be adjusted so that they fit each other. Here, size adaptation can be understood as the sizes being consistent or conforming to a preset scaling rule.
According to the disclosed embodiments, adjusting the sizes of the plurality of first feature maps and/or the plurality of second feature maps yields more accurate first and second distillation loss values in the subsequent calculations with the first and second knowledge distillation loss functions, thereby improving the training effect and training speed of the convolutional neural network model to be trained.
In one implementation, the training method of the target re-recognition model of the embodiment of the present disclosure includes steps S101 to S104, wherein step S1013: in response to the plurality of first feature maps differing in size from the plurality of second feature maps, resizing the plurality of first feature maps and/or the plurality of second feature maps may include:
in response to the first plurality of feature maps differing in size from the second plurality of feature maps, performing a resizing in one of the following ways:
and performing downsampling processing on the plurality of first feature maps so as to adapt the sizes of the plurality of first feature maps and the plurality of second feature maps.
And performing upsampling processing on the plurality of second feature maps to adapt the sizes of the plurality of first feature maps and the plurality of second feature maps.
The plurality of first feature maps are downsampled, and the plurality of second feature maps are upsampled, so that the sizes of the plurality of first feature maps and the plurality of second feature maps are matched.
According to the disclosed embodiments, adjusting the sizes of the plurality of first feature maps and/or the plurality of second feature maps yields more accurate first and second distillation loss values in the subsequent calculations with the first and second knowledge distillation loss functions, thereby improving the training effect and training speed of the convolutional neural network model to be trained.
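The adjustment options listed above can be sketched as follows (a hedged example: PyTorch, adaptive average pooling for downsampling, and bilinear interpolation for upsampling are all assumed choices, not prescribed by the disclosure):

```python
import torch
import torch.nn.functional as F

def adapt_sizes(first_maps, second_maps, mode="down"):
    """Adapt the sizes of the first (converter) and second (CNN) feature maps,
    shaped (B, C, H, W), per step S1013: 'down' downsamples the first feature
    maps, 'up' upsamples the second feature maps."""
    if first_maps.shape[-2:] == second_maps.shape[-2:]:
        return first_maps, second_maps          # already size-adapted
    if mode == "down":
        first_maps = F.adaptive_avg_pool2d(first_maps, second_maps.shape[-2:])
    else:  # mode == "up"
        second_maps = F.interpolate(second_maps, size=first_maps.shape[-2:],
                                    mode="bilinear", align_corners=False)
    return first_maps, second_maps
```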
In one implementation, the training method of the target re-recognition model of the disclosed embodiments includes steps S101 to S104, where step S102: determining a first distillation loss value using a first knowledge distillation loss function based on the plurality of first feature maps and the plurality of second feature maps includes:
Step S1021: for each first feature map of the plurality of first feature maps, calculating the image block similarity matrix of the first feature map with itself.
Step S1022: for each second feature map of the plurality of second feature maps, calculating the feature vector similarity matrix of the second feature map with itself.
Step S1023: calculating a first distance according to the image block similarity matrix of the first feature map and the feature vector similarity matrix of the second feature map that correspond to the same image sample of the same target in the image sample set.
Step S1024: a first distillation loss value is determined using a first knowledge distillation loss function based on a first distance for each image sample in the set of image samples.
According to the embodiments of the present disclosure, it should be noted that:
calculating the similarity matrix of each first feature map and its own image block can be understood as that each image block of the first feature map and each image block of the first feature map are subjected to similarity calculation, i.e. a self-image block similarity distillation mechanism. For example, as shown in fig. 2, the first feature map is a feature map obtained by the converter model based on an a person image sample. The first characteristic diagram comprises four image blocks which are respectively t 1 、t 2 、t 3 And t 4 . Calculating a similarity matrix of the first characteristic diagram and the image block of the first characteristic diagram, which is equivalent to t 1 Are each related to t 1 、t 2 、t 3 And t 4 Multiplication, t 2 Respectively with t 1 、t 2 、t 3 And t 4 Multiply by each other, and so on. Namely, the feature matrix of the 2x2 first feature map is multiplied by the feature matrix of the 2x2 first feature map to obtain a 4x4 image block similarity matrix.
Calculating the feature vector similarity matrix of each second feature map with itself can be understood as performing a similarity calculation between every pair of feature vectors of the second feature map, i.e., a self image-block (feature vector) similarity distillation mechanism. For example, the second feature map is a feature map obtained by the convolutional neural network model based on an image sample of person A. The second feature map comprises four feature vectors, r_1, r_2, r_3, and r_4; the representation of the feature vector matrix of the second feature map may refer to the representation of the image block matrix of the first feature map in fig. 2. Calculating the feature vector similarity matrix of the second feature map with itself is equivalent to multiplying r_1 with each of r_1, r_2, r_3, and r_4, multiplying r_2 with each of r_1, r_2, r_3, and r_4, and so on. That is, the feature matrix of the 2×2 second feature map is multiplied by itself to obtain a 4×4 feature vector similarity matrix.
According to the disclosed embodiments, because the converter model divides the image sample into image blocks at the front end of the model, and each image block serves as an individual feature vector of the feature map during training, the similarities between the feature vectors of the feature map output by the convolutional neural network model can be matched, via the self image-block similarity distillation mechanism, to the similarities between the image-block feature vectors of the converter model's feature map, supervised by the first knowledge distillation loss function. In this way, the convolutional neural network model to be trained can fully learn the feature effectiveness of the converter model, and the high-quality target re-identification knowledge learned by the converter model is distilled and migrated to the convolutional neural network model.
In one example, the first distance, calculated according to the image block similarity matrix of the first feature map and the feature vector similarity matrix of the second feature map corresponding to the same image sample of the same target in the image sample set, may be computed using cosine similarity (cosine distance).
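One possible realization of the self-similarity computation in steps S1021 and S1022 is sketched below (an assumption for illustration: PyTorch, with rows unit-normalized so that each matrix entry is a cosine similarity, in line with the cosine-similarity computation mentioned above):

```python
import torch
import torch.nn.functional as F

def self_similarity(feat):
    """Self image-block (feature vector) similarity matrix, steps S1021/S1022.

    feat: (B, p, c) feature maps flattened to p = H * W patch/feature vectors.
    Returns (B, p, p) matrices: every patch multiplied with every patch of
    the same map, as in the 2x2 -> 4x4 example above."""
    feat = F.normalize(feat, dim=-1)        # unit-norm rows -> cosine similarity
    return feat @ feat.transpose(1, 2)
```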
In one embodiment, the training method of the target re-recognition model of the disclosed embodiments includes steps S101 to S104, wherein the first knowledge distillation loss function may be a mean squared error (MSE) loss function. Step S1024: determining a first distillation loss value using the first knowledge distillation loss function based on the first distance of each image sample in the image sample set includes:
determining a first distillation loss value using the mean squared error loss function according to the first distance of each image sample in the image sample set.
According to the embodiment of the disclosure, the first distillation loss value can be calculated more accurately by using the mean square error loss function, so that the effect and speed of optimizing the convolutional neural network model to be trained by using the first distillation loss value are improved.
In one example, steps S1021 to S1024, which implement "determining the first distillation loss value using the first knowledge distillation loss function based on the plurality of first feature maps and the plurality of second feature maps", can be expressed by the following formula:

L1 = (1/B) · Σ_{b=1..B} ||Sim(F_b, F_b) − Sim(G_b, G_b)||^2

where B is the number of image samples in the image sample set input at one time into the models (the pre-trained converter model and the convolutional neural network model to be trained), b is the index of a feature map, F_b is a second feature map, and G_b is a first feature map.
Here, Sim(F_b, F_b) corresponds to step S1022: calculating the feature vector similarity matrix of each second feature map with itself. Sim(G_b, G_b) corresponds to step S1021: calculating the image block similarity matrix of each first feature map with itself. Sim(F_b, F_b) − Sim(G_b, G_b) corresponds to step S1023: calculating a first distance according to the image block similarity matrix of the first feature map and the feature vector similarity matrix of the second feature map corresponding to the same image sample of the same target in the image sample set. The L1 formula as a whole corresponds to steps S1021 to S1024.
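Continuing the sketch above, the L1 formula can be implemented as follows (illustrative only; it reuses self_similarity() from the previous block):

```python
def first_distillation_loss(F_maps, G_maps):
    """First distillation loss L1, steps S1021-S1024: the mean squared error
    between the student's feature-vector self-similarity Sim(F_b, F_b) and
    the teacher's image-block self-similarity Sim(G_b, G_b), averaged over
    the B image samples."""
    sim_f = self_similarity(F_maps)   # (B, p, p), Sim(F_b, F_b)
    sim_g = self_similarity(G_maps)   # (B, p, p), Sim(G_b, G_b)
    return ((sim_f - sim_g) ** 2).mean()
```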
In one implementation, the training method of the target re-recognition model of the disclosed embodiments includes steps S101 to S104, where step S103: determining a second distillation loss value using a second knowledge distillation loss function based on the plurality of first feature maps and the plurality of second feature maps includes:
Step S1031: calculating a first similarity matrix of the feature maps corresponding to each pair of different image samples of the same target in the image sample set, according to the plurality of first feature maps and the plurality of second feature maps. The feature maps corresponding to a pair of different image samples of the same target include: a first feature map of a first image sample and a second feature map of a second image sample of the same target.
Step S1032: calculating a second distance according to each first similarity matrix.
Step S1033: calculating a second similarity matrix of the feature maps corresponding to the image samples of each pair of different targets in the image sample set, according to the plurality of first feature maps and the plurality of second feature maps. The feature maps corresponding to the image samples of a pair of different targets include: a first feature map corresponding to an image sample of a first target and a second feature map corresponding to an image sample of a second target.
Step S1034: calculating a third distance according to each second similarity matrix.
Step S1035: determining a second distillation loss value using the second knowledge distillation loss function according to the second distance corresponding to the same target and the third distance corresponding to each pair of different targets.
According to the embodiments of the present disclosure, it should be noted that:
the feature dimensions of each feature map used to calculate the first similarity matrix and the second similarity matrix may be the same or similar.
The first feature map of the first image sample and the second feature map of the second image sample of the same object may be understood as follows: for example, a first feature map output by a pre-trained converter model based on an A1 image sample of an A target person, and a second feature map output by a convolutional neural network model to be trained based on an A2 image sample of the A target person.
The first feature map corresponding to the image sample of the first object and the second feature map corresponding to the image sample of the second object may be understood as follows: for example, a first feature map output by a pre-trained converter model based on A1 image samples of an a target person, and a second feature map output by a convolutional neural network model to be trained based on B1 image samples of a B target person.
Calculating the first similarity matrix of the feature maps corresponding to each pair of different image samples of the same target can be understood as performing a similarity calculation between each image block of the first feature map corresponding to a first image sample of target A and each feature vector of the second feature map corresponding to a second image sample of target A, i.e., a cross image-block similarity distillation mechanism. For example, as shown in fig. 3, the first feature map is obtained based on a first image sample of person A and comprises four image blocks, s_1, s_2, s_3, and s_4. The second feature map is obtained based on a second image sample of person A and comprises four feature vectors, t_1, t_2, t_3, and t_4. Calculating the first similarity matrix is equivalent to multiplying s_1 with each of t_1, t_2, t_3, and t_4, multiplying s_2 with each of t_1, t_2, t_3, and t_4, and so on. That is, the feature matrix of the 2×2 first feature map is multiplied by the feature matrix of the 2×2 second feature map to obtain a 4×4 first similarity matrix.
Calculating the second similarity matrix of the feature maps corresponding to the image samples of each pair of different targets can be understood as performing a similarity calculation between each image block of the first feature map corresponding to an image sample of target person A and each feature vector of the second feature map corresponding to an image sample of target person B, i.e., a cross image-block similarity distillation mechanism. For example, the first feature map is obtained based on an image sample of person A and comprises four image blocks, s_1, s_2, s_3, and s_4. The second feature map is obtained based on an image sample of person B and comprises four feature vectors, t_1, t_2, t_3, and t_4. Calculating the second similarity matrix is equivalent to multiplying s_1 with each of t_1, t_2, t_3, and t_4, multiplying s_2 with each of t_1, t_2, t_3, and t_4, and so on. That is, the feature matrix of the 2×2 first feature map is multiplied by the feature matrix of the 2×2 second feature map to obtain a 4×4 second similarity matrix.
According to the disclosed embodiments, since the image block features of feature maps of the same target should be more similar, and the image block features of feature maps of different targets should be more different, the disclosed embodiments adopt a distillation mechanism based on cross image-block similarity: similarity is calculated for the image blocks between different feature maps of the same target and their distance is reduced, while similarity is calculated for the image blocks between feature maps of different targets and their distance is increased, supervised throughout by the second knowledge distillation loss function. In this way, the convolutional neural network model to be trained can fully learn the feature effectiveness of the converter model, and the high-quality target re-identification knowledge learned by the converter model is distilled and migrated to the convolutional neural network model.
In one example, the second distance calculated according to each first similarity matrix and the third distance calculated according to each second similarity matrix may be computed using cosine similarity (cosine distance).
In one implementation, the training method of the target re-recognition model of the disclosed embodiments includes steps S101 to S104, where the second knowledge distillation loss function is a triplet loss function. Determining a second distillation loss value using the second knowledge distillation loss function according to the second distance corresponding to the same target and the third distance corresponding to each pair of different targets includes:
determining a second distillation loss value using the triplet loss function according to the second distance corresponding to the same target and the third distance corresponding to each pair of different targets.
According to the embodiment of the disclosure, the second distillation loss value can be calculated more accurately by using the triplet loss function, and the effect and speed of optimizing the convolutional neural network model to be trained by using the second distillation loss value are further improved.
In one implementation manner, a method for training a target re-recognition model according to an embodiment of the present disclosure includes steps S101 to S104, where calculating a first similarity matrix of feature maps corresponding to each two different image samples of a same target in an image sample set according to a plurality of first feature maps and a plurality of second feature maps includes:
and calculating a first similarity matrix of the feature maps corresponding to every two different image samples of the same target in the image sample set according to the first feature maps and the second feature maps.
Wherein, the characteristic diagram that every two different image samples of same target correspond includes: the first feature map of the first image sample of the same object and the second feature map of the second image sample with the lowest similarity thereto, and the second feature map of the first image sample of the same object and the first feature map of the second image sample with the lowest similarity thereto. The second image sample is a positive sample.
According to the embodiments of the present disclosure, it should be noted that:
the first similarity matrix of the feature maps corresponding to every two different image samples of the same target in the image sample set is calculated, which can be understood as a distillation mechanism for calculating the similarity of each image block of the first feature map corresponding to the first image sample of the target a and each feature vector of the second feature map corresponding to the second image sample of the target a, which has the lowest similarity with the first feature map, i.e. the similarity of cross image blocks. For example, as shown in fig. 3, the first feature map is a feature map obtained based on a first image sample of the a person. The first characteristic diagram comprises four image blocksIs other than s 1 、s 2 、s 3 And s 4 . The second feature map having the lowest similarity to the first feature map is a feature map obtained based on a second image sample of the a person. The second feature map comprises four feature vectors, t 1 、t 2 、t 3 And t 4 . Calculating a first similarity matrix corresponding to s 1 Respectively with t 1 、t 2 、t 3 And t 4 Multiplication, s 2 Respectively with t 1 、t 2 、t 3 And t 4 Multiply by each other, and so on. That is, the feature matrix of the 2x2 first feature map is multiplied by the feature matrix of the 2x2 second feature map to obtain a 4x4 first similarity matrix. And a distillation mechanism which can be understood as that each image block of the second feature map corresponding to the first image sample of the A target is subjected to similarity calculation with each feature vector of the first feature map corresponding to the second image sample of the A target with the lowest similarity with the second feature map, namely the similarity of the crossed image blocks.
According to the disclosed embodiments, since the image block features of feature maps of the same target should be more similar, and the image block features of feature maps of different targets should be more different, the disclosed embodiments adopt a distillation mechanism based on cross image-block similarity, that is, similarity is calculated for the image blocks between different feature maps of the same target and their distances are reduced. In this way, the convolutional neural network model to be trained can fully learn the feature effectiveness of the converter model, and the high-quality target re-identification knowledge learned by the converter model is distilled and migrated to the convolutional neural network model.
In one implementation, the method for training a target re-recognition model according to the embodiment of the present disclosure includes steps S101 to S104, where calculating a second similarity matrix of feature maps corresponding to image samples of every two different targets in an image sample set according to a plurality of first feature maps and a plurality of second feature maps includes:
and calculating a second similarity matrix of the feature maps corresponding to the image samples of every two different targets in the image sample set according to the plurality of first feature maps and the plurality of second feature maps.
Wherein, the characteristic diagram corresponding to the image samples of every two different targets comprises: the image processing method comprises the steps of obtaining a first feature map corresponding to an image sample of a first target and a second feature map corresponding to an image sample of a second target with the lowest similarity, and obtaining a second feature map corresponding to the image sample of the first target and a first feature map corresponding to an image sample of the second target with the lowest similarity. The image samples of the second object are negative samples.
According to the embodiments of the present disclosure, it should be noted that:
the second similarity matrix of the feature maps corresponding to every two different target image samples in the image sample set is calculated, which can be understood as a distillation mechanism for calculating the similarity of each image block of the first feature map corresponding to the image sample of the target person a and each feature vector of the second feature map corresponding to the image sample of the target person B and having the lowest similarity with the first feature map, that is, the similarity of the cross image blocks. For example, the first feature map is a feature map obtained based on a first image sample of the a person. The first characteristic diagram comprises four image blocks which are respectively s 1 、s 2 、s 3 And s 4 . The second feature map is a feature map obtained based on a second image sample of the B person. The second feature map comprises four feature vectors, t 1 、t 2 、t 3 And t 4 . Calculating a second similarity matrix, corresponding to s 1 Respectively with t 1 、t 2 、t 3 And t 4 Multiplication, s 2 Respectively with t 1 、t 2 、t 3 And t 4 Multiply by each other, and so on. That is, the feature matrix of the 2x2 first feature map is multiplied by the feature matrix of the 2x2 second feature map to obtain a 4x4 second similarity matrix. And performing similarity calculation on each image block of the second feature map corresponding to the image sample of the target person A and each feature vector of the first feature map with the lowest similarity with the second feature map corresponding to the image sample of the target person B.
According to the disclosed embodiments, since the image block features of feature maps of the same target should be more similar, and the image block features of feature maps of different targets should be more different, the disclosed embodiments adopt a distillation mechanism based on cross image-block similarity, that is, similarity is calculated for the image blocks between feature maps of different targets and their distances are increased, supervised by the second knowledge distillation loss function. In this way, the convolutional neural network model to be trained can fully learn the feature effectiveness of the converter model, and the high-quality target re-identification knowledge learned by the converter model is distilled and migrated to the convolutional neural network model.
In one example, steps S1031 to S1035, which implement "determining a second distillation loss value using the second knowledge distillation loss function based on the plurality of first feature maps and the plurality of second feature maps", can be expressed by the following formula:

L2 = (1/B) · Σ_{b=1..B} [ max(d(Sim(F_b, G_b^+)) − d(Sim(F_b, G_b^-)) + α, 0) + max(d(Sim(G_b, F_b^+)) − d(Sim(G_b, F_b^-)) + α, 0) ]

where B is the number of image samples in the image sample set input at one time into the models (the pre-trained converter model and the convolutional neural network model to be trained); b is the index of a feature map; α is the margin hyperparameter of the triplet loss function; F_b is a second feature map; G_b is a first feature map; G_b^+ is the first feature map of the positive image sample with the lowest similarity to the second feature map of the same target; F_b^+ is the second feature map of the positive image sample with the lowest similarity to the first feature map of the same target; G_b^- is the first feature map of the negative image sample with the lowest similarity to the second feature map of a different target; and F_b^- is the second feature map of the negative image sample with the lowest similarity to the first feature map of a different target.
Here, Sim(F_b, G_b^+) and Sim(G_b, F_b^+) correspond to step S1031: calculating a first similarity matrix of the feature maps corresponding to each pair of different image samples of the same target, where the second image sample is a positive sample. d(Sim(F_b, G_b^+)) and d(Sim(G_b, F_b^+)) correspond to step S1032: calculating the second distance according to each first similarity matrix. Sim(F_b, G_b^-) and Sim(G_b, F_b^-) correspond to step S1033: calculating a second similarity matrix of the feature maps corresponding to the image samples of each pair of different targets, where the image samples of the second target are negative samples. d(Sim(F_b, G_b^-)) and d(Sim(G_b, F_b^-)) correspond to step S1034: calculating the third distance according to each second similarity matrix. The L2 formula as a whole corresponds to steps S1031 to S1035.
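A sketch of the L2 computation follows (assumptions: PyTorch; cross_distance() reduces each cross similarity matrix to one cosine distance per sample; and the hardest positive/negative feature maps G_b^+, F_b^+, G_b^-, F_b^- are passed in, their selection per the rules above being done elsewhere):

```python
import torch
import torch.nn.functional as F

def cross_distance(a, b):
    """Cosine distance derived from the cross image-block similarity matrix
    Sim(a, b), with a and b shaped (B, p, c)."""
    sim = F.normalize(a, dim=-1) @ F.normalize(b, dim=-1).transpose(1, 2)
    return 1.0 - sim.mean(dim=(1, 2))   # one scalar distance per sample

def second_distillation_loss(F_b, G_b, G_pos, F_pos, G_neg, F_neg, alpha=0.3):
    """Second distillation loss L2, steps S1031-S1035, in triplet form:
    pull same-target cross similarities together, push different-target
    ones apart, with margin alpha."""
    d_pos = cross_distance(F_b, G_pos) + cross_distance(G_b, F_pos)  # second distance
    d_neg = cross_distance(F_b, G_neg) + cross_distance(G_b, F_neg)  # third distance
    return torch.clamp(d_pos - d_neg + alpha, min=0).mean()
```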
In one example, training the convolutional neural network model to be trained in step S104 to obtain a convolutional neural network model capable of performing target re-identification includes:
tuning the parameters of the convolutional neural network model to be trained according to the first distillation loss value and the second distillation loss value;
in the case where the parameter-tuned convolutional neural network model is determined to have converged, finishing the training of the convolutional neural network model to be trained to obtain a convolutional neural network model capable of performing target re-identification;
in the case where the parameter-tuned convolutional neural network model is determined not to have converged, executing steps S101 to S104 in a loop, tuning the parameters according to the first distillation loss value and the second distillation loss value each time, until the resulting convolutional neural network model converges.
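The loop just described might look as follows (a sketch under assumptions: the model call signatures, the hypothetical mine_hard_pairs() helper, and the simple loss-based convergence check are all illustrative; it reuses the loss sketches above, and the student's own loss function could be added to the total, as noted earlier):

```python
def train_until_converged(loader, teacher, student, optimizer,
                          alpha=0.3, max_epochs=100, tol=1e-4):
    """Step S104: tune the CNN (student) with L1 + L2 until convergence,
    keeping the pre-trained converter (teacher) frozen."""
    teacher.eval()
    prev = float("inf")
    for _ in range(max_epochs):
        total = 0.0
        for images, labels in loader:
            with torch.no_grad():
                g = teacher(images)              # first feature maps, (B, p, c)
            f = student(images)                  # second feature maps, (B, p, c)
            g_pos, f_pos, g_neg, f_neg = mine_hard_pairs(f, g, labels)  # hypothetical
            loss = (first_distillation_loss(f, g)
                    + second_distillation_loss(f, g, g_pos, f_pos, g_neg, f_neg, alpha))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            total += loss.item()
        if abs(prev - total) < tol:              # crude convergence criterion
            break
        prev = total
    return student
```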
In an application example, as shown in fig. 4, a training method of a target re-recognition model according to an embodiment of the present disclosure includes:
step 100: and respectively inputting the image sample set into a pre-trained Transformer pedestrian re-identification model and a CNN pedestrian re-identification model to be trained. Wherein the image sample set comprises pedestrian pictures of a plurality of ID persons.
Step 110: based on the image sample set, a plurality of first feature maps (for example, an ID1 first feature map, an ID2 first feature map, and an ID3 first feature map in fig. 4) output by the Transformer pedestrian re-identification model and a plurality of second feature maps (for example, an ID1 second feature map, an ID2 second feature map, and an ID3 second feature map in fig. 4) output by the CNN pedestrian re-identification model to be trained are determined, and the plurality of first feature maps are down-sampled to fit the plurality of first feature maps with the plurality of second feature maps.
Step 120: determine a first distillation loss value based on the self image-block similarity distillation mechanism, using the first knowledge distillation loss function, according to the plurality of first feature maps and the plurality of second feature maps.
Step 130: a second distillation loss value is determined based on the cross image block similarity distillation scheme using a second knowledge distillation loss function based on the plurality of first feature maps and the plurality of second feature maps.
Step 140: and training the convolutional neural network model to be trained according to the first distillation loss value and the second distillation loss value to obtain the convolutional neural network model capable of carrying out target re-identification.
As shown in fig. 5, an embodiment of the present disclosure provides a target re-identification method, including:
step S501: a target image is determined.
Step S502: the target image is subjected to target re-identification by utilizing the convolutional neural network model obtained by training through the method of any embodiment of the disclosure, so that the target object in the target image is identified.
According to the embodiment of the disclosure, the convolutional neural network model trained by the method of any embodiment of the disclosure can better re-identify the target in the target image.
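A minimal inference sketch follows; the gallery-matching step and the cosine-similarity ranking are assumptions about typical re-identification usage rather than steps spelled out in the disclosure:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def re_identify(model, query_image, gallery_images, gallery_ids):
    """Steps S501-S502: embed the target image and rank gallery identities."""
    model.eval()
    q = F.normalize(model(query_image.unsqueeze(0)).flatten(1), dim=1)   # (1, D)
    g = F.normalize(model(gallery_images).flatten(1), dim=1)             # (N, D)
    scores = (q @ g.t()).squeeze(0)                                      # cosine similarities
    best = scores.argmax().item()
    return gallery_ids[best], scores[best].item()
```

In this sketch the identity of the highest-scoring gallery image is returned; a deployed system would typically also apply a threshold to reject unknown targets.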
As shown in fig. 6, an embodiment of the present disclosure provides a training apparatus for a target re-recognition model, including:
a first determining module 610, configured to determine, based on the image sample set, a plurality of first feature maps output by the pre-trained Transformer model and a plurality of second feature maps output by the convolutional neural network model to be trained.
A second determining module 620, configured to determine a first distillation loss value according to the plurality of first feature maps and the plurality of second feature maps by using the first knowledge distillation loss function.
A third determining module 630, configured to determine a second distillation loss value using a second knowledge distillation loss function according to the plurality of first feature maps and the plurality of second feature maps. And
a training module 640, configured to train the convolutional neural network model to be trained according to the first distillation loss value and the second distillation loss value, so as to obtain a convolutional neural network model capable of performing target re-identification.
In one embodiment, the set of image samples includes image samples of a plurality of objects, each object of the plurality of objects corresponding to a plurality of image samples.
In one embodiment, the first determining module 610 includes:
A first determining sub-module for determining a plurality of first feature maps output by the pre-trained Transformer model based on the set of image samples.
And the second determining submodule is used for determining a plurality of second feature maps output by the convolutional neural network model to be trained on the basis of the image sample set.
And the adjusting submodule is used for adjusting the sizes of the plurality of first feature maps and/or the plurality of second feature maps in response to the sizes of the plurality of first feature maps and the plurality of second feature maps being different.
In one embodiment, the adjustment submodule is operable to:
in response to the first plurality of feature maps differing in size from the second plurality of feature maps, performing a resizing in one of the following ways:
and performing downsampling processing on the plurality of first feature maps so as to adapt the sizes of the plurality of first feature maps and the plurality of second feature maps.
And performing upsampling processing on the plurality of second feature maps to adapt the sizes of the plurality of first feature maps and the plurality of second feature maps.
The plurality of first feature maps are downsampled, and the plurality of second feature maps are upsampled, so that the sizes of the plurality of first feature maps and the plurality of second feature maps are matched.
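The three adaptation options can be sketched as follows; the mode names and the pooling/interpolation choices are assumptions of this sketch, not prescribed by the disclosure:

```python
import torch.nn.functional as F

def adapt_sizes(first_maps, second_maps, mode="down_teacher"):
    """Match the spatial sizes of the first (teacher) and second (student) feature maps."""
    if mode == "down_teacher":       # option 1: downsample the first feature maps
        first_maps = F.adaptive_avg_pool2d(first_maps, second_maps.shape[-2:])
    elif mode == "up_student":       # option 2: upsample the second feature maps
        second_maps = F.interpolate(second_maps, size=first_maps.shape[-2:],
                                    mode="bilinear", align_corners=False)
    else:                            # option 3: downsample one and upsample the other
        h = (first_maps.shape[-2] + second_maps.shape[-2]) // 2
        w = (first_maps.shape[-1] + second_maps.shape[-1]) // 2
        first_maps = F.adaptive_avg_pool2d(first_maps, (h, w))
        second_maps = F.interpolate(second_maps, size=(h, w),
                                    mode="bilinear", align_corners=False)
    return first_maps, second_maps
```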
In one embodiment, the second determining module 620 includes:
The first calculation module is used for calculating, for each first feature map of the plurality of first feature maps, an image block similarity matrix between the first feature map and itself.
And the second calculation module is used for calculating, for each second feature map of the plurality of second feature maps, a feature vector similarity matrix between the second feature map and itself.
And the third calculating module is used for calculating the first distance according to the image block similarity matrix of the first feature map and the feature vector similarity matrix of the second feature map corresponding to the same image sample of the same target in the image sample set.
And the fourth determining module is used for determining a first distillation loss value by utilizing a first knowledge distillation loss function according to the first distance of each image sample in the image sample set.
In one embodiment, the first knowledge distillation loss function is a mean square error loss function.
And the fourth determining module is used for determining a first distillation loss value by utilizing a mean square error loss function according to the first distance of each image sample in the image sample set.
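A sketch of this first-loss computation follows, assuming each feature map is flattened into one vector per image block and that similarities are cosine similarities (the similarity measure is not fixed by the text above):

```python
import torch.nn.functional as F

def self_similarity(maps):                       # maps: (B, C, H, W)
    """Pairwise cosine similarity between the H*W block/feature vectors of each map."""
    v = maps.flatten(2).transpose(1, 2)          # (B, H*W, C): one vector per image block
    v = F.normalize(v, dim=-1)
    return v @ v.transpose(1, 2)                 # (B, H*W, H*W) similarity matrices

def first_distillation_loss(first_maps, second_maps):
    """MSE between each sample's image block similarity matrix (teacher) and
    feature vector similarity matrix (student); the mean over the batch is
    taken as the first distillation loss."""
    return F.mse_loss(self_similarity(second_maps), self_similarity(first_maps))
```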
In one embodiment, the third determining module 630 includes:
and the fourth calculating submodule is used for calculating a first similarity matrix of the feature maps corresponding to every two different image samples of the same target in the image sample set according to the plurality of first feature maps and the plurality of second feature maps. Wherein, the feature map corresponding to every two different image samples of the same target comprises: a first feature map of a first image sample and a second feature map of a second image sample of the same object.
And the fifth calculation submodule is used for calculating the second distance according to each first similarity matrix.
And the sixth calculating submodule is used for calculating a second similarity matrix of the feature maps corresponding to the image samples of every two different targets in the image sample set according to the plurality of first feature maps and the plurality of second feature maps. The feature maps corresponding to the image samples of every two different targets include: a first feature map corresponding to the image sample of the first target and a second feature map corresponding to the image sample of the second target.
And the seventh calculation submodule is used for calculating the third distance according to each second similarity matrix.
And the fifth determining module is used for determining a second distillation loss value by utilizing a second knowledge distillation loss function according to the second distance corresponding to the same target and the third distance corresponding to every two different targets.
In one embodiment, the second knowledge distillation loss function is a triplet loss function.
And the fifth determining module is used for determining a second distillation loss value by utilizing a triplet loss function according to the second distance corresponding to the same target and the third distance corresponding to every two different targets.
In one embodiment, the fourth computation submodule is configured to:
and calculating a first similarity matrix of the feature maps corresponding to every two different image samples of the same target in the image sample set according to the first feature maps and the second feature maps.
The feature maps corresponding to every two different image samples of the same target include: the first feature map of the first image sample of the same target and the second feature map of the second image sample with the lowest similarity to it, and the second feature map of the first image sample of the same target and the first feature map of the second image sample with the lowest similarity to it. The second image sample is a positive sample.
In one embodiment, the sixth computation submodule is configured to:
and calculating a second similarity matrix of the feature maps corresponding to the image samples of every two different targets in the image sample set according to the first feature maps and the second feature maps.
The feature maps corresponding to the image samples of every two different targets include: a first feature map corresponding to an image sample of a first target and a second feature map corresponding to an image sample of a second target with the lowest similarity to it, and a second feature map corresponding to the image sample of the first target and a first feature map corresponding to an image sample of the second target with the lowest similarity to it. The image samples of the second target are negative samples.
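A compact sketch of this cross-sample branch follows, under simplifying assumptions: each feature map is pooled to a single embedding, similarity is cosine similarity, a distance is taken as 1 minus similarity, and both the positive and the negative pair are mined by lowest similarity as described above. The exact matrix definitions are given by the original formulas, so this is an illustration only:

```python
import torch
import torch.nn.functional as F

def embed(maps):                                   # (B, C, H, W) -> (B, C) unit vectors
    return F.normalize(maps.mean(dim=(2, 3)), dim=1)

def second_distillation_loss(first_maps, second_maps, labels, margin=0.3):
    """Triplet loss over cross first/second feature-map similarities.

    The second distance comes from the lowest-similarity same-target pair
    (positive), the third distance from the lowest-similarity different-target
    pair (negative). Assumes every identity appears at least twice per batch.
    """
    sim = embed(first_maps) @ embed(second_maps).t()   # cross similarity matrix (B, B)
    same = labels.unsqueeze(0) == labels.unsqueeze(1)
    eye = torch.eye(len(labels), dtype=torch.bool, device=labels.device)
    pos_sim = sim.masked_fill(~same | eye, float("inf")).min(dim=1).values  # lowest-similarity positive
    neg_sim = sim.masked_fill(same, float("inf")).min(dim=1).values         # lowest-similarity negative
    d2 = 1.0 - pos_sim                                 # second distance (same target)
    d3 = 1.0 - neg_sim                                 # third distance (different targets)
    return F.relu(d2 - d3 + margin).mean()
```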
As shown in fig. 7, an embodiment of the present disclosure provides an object re-recognition apparatus, including:
an image determination module 710 for determining a target image.
The image recognition module 720 is configured to perform target re-recognition on the target image by using the convolutional neural network model trained by the method according to any embodiment of the present disclosure, so as to recognize the target object in the target image.
For a description of specific functions and examples of each module and each sub-module of the apparatus in the embodiment of the present disclosure, reference may be made to the related description of the corresponding steps in the foregoing method embodiments, and details are not repeated here.
In the technical solution of the present disclosure, the acquisition, storage, and application of the personal information of the users involved all comply with the provisions of the relevant laws and regulations, and do not violate public order or good morals.
The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.
FIG. 8 illustrates a schematic block diagram of an example electronic device 800 that can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not intended to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 8, the device 800 includes a computing unit 801 which can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 802 or a computer program loaded from a storage unit 808 into a Random Access Memory (RAM) 803. In the RAM 803, various programs and data necessary for the operation of the device 800 can also be stored. The computing unit 801, the ROM 802, and the RAM 803 are connected to each other by a bus 804. An input/output (I/O) interface 805 is also connected to the bus 804.
A number of components in the device 800 are connected to the I/O interface 805, including: an input unit 806 such as a keyboard, a mouse, or the like; an output unit 807 such as various types of displays, speakers, and the like; a storage unit 808, such as a magnetic disk, optical disk, or the like; and a communication unit 809 such as a network card, modem, wireless communication transceiver, etc. The communication unit 809 allows the device 800 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.
Computing unit 801 may be a variety of general and/or special purpose processing components with processing and computing capabilities. Some examples of the computing unit 801 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The computing unit 801 executes the respective methods and processes described above, such as the training method of the target re-recognition model and the target re-recognition method. For example, in some embodiments, the training method of the target re-recognition model and the target re-recognition method may be implemented as computer software programs tangibly embodied on a machine-readable medium, such as the storage unit 808. In some embodiments, part or all of a computer program may be loaded onto and/or installed onto the device 800 via the ROM 802 and/or the communication unit 809. When the computer program is loaded into the RAM 803 and executed by the computing unit 801, one or more steps of the training method of the target re-recognition model and the target re-recognition method described above may be performed. Alternatively, in other embodiments, the computing unit 801 may be configured to perform the training method of the target re-recognition model and the target re-recognition method in any other suitable manner (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuitry, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on a chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user may provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server combined with a blockchain.
It should be understood that the various forms of flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in a different order, which is not limited herein as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims (25)

1. A method of training a target re-recognition model, comprising:
determining a plurality of first feature maps output by a pre-trained Transformer model and a plurality of second feature maps output by a convolutional neural network model to be trained on the basis of an image sample set;
determining a first distillation loss value using a first knowledge distillation loss function based on the plurality of first feature maps and the plurality of second feature maps;
determining a second distillation loss value using a second knowledge distillation loss function based on the plurality of first feature maps and the plurality of second feature maps; and
and training the convolutional neural network model to be trained according to the first distillation loss value and the second distillation loss value to obtain the convolutional neural network model capable of carrying out target re-identification.
2. The method of claim 1, wherein the set of image samples includes image samples of a plurality of objects, each object of the plurality of objects corresponding to a plurality of image samples.
3. The method of claim 1, wherein the determining, based on the set of image samples, a plurality of first feature maps output by a pre-trained transformer model and a plurality of second feature maps output by a convolutional neural network model to be trained comprises:
determining a plurality of first feature maps output by the pre-trained Transformer model based on the image sample set;
determining a plurality of second feature maps output by a convolutional neural network model to be trained on the basis of the image sample set;
and adjusting the sizes of the plurality of first feature maps and/or the plurality of second feature maps in response to the sizes of the plurality of first feature maps and the plurality of second feature maps being different.
4. The method of claim 3, wherein the resizing the plurality of first feature maps and/or the plurality of second feature maps in response to the plurality of first feature maps differing in size from the plurality of second feature maps comprises:
in response to the first plurality of profiles differing in size from the second plurality of profiles, resizing in one of:
down-sampling the plurality of first feature maps to fit the sizes of the plurality of first feature maps and the plurality of second feature maps;
performing upsampling processing on the plurality of second feature maps to adapt the sizes of the plurality of first feature maps and the plurality of second feature maps;
the plurality of first feature maps are downsampled, and the plurality of second feature maps are upsampled so that the sizes of the plurality of first feature maps and the plurality of second feature maps are matched.
5. The method of any of claims 1 to 4, wherein determining a first distillation loss value from the plurality of first profiles and the plurality of second profiles using a first knowledge distillation loss function comprises:
calculating, for each first feature map of the plurality of first feature maps, an image block similarity matrix between the first feature map and itself;
calculating, for each second feature map of the plurality of second feature maps, a feature vector similarity matrix between the second feature map and itself;
calculating a first distance according to the image block similarity matrix of the first feature map and the feature vector similarity matrix of the second feature map corresponding to the same image sample of the same target in the image sample set;
and determining a first distillation loss value by utilizing a first knowledge distillation loss function according to the first distance of each image sample in the image sample set.
6. The method of claim 5, wherein the first knowledge distillation loss function is a mean square error loss function;
determining a first distillation loss value according to a first distance of each image sample in the image sample set by using a first knowledge distillation loss function, including:
and determining a first distillation loss value by utilizing a mean square error loss function according to the first distance of each image sample in the image sample set.
7. The method of any of claims 1 to 4, wherein determining a second distillation loss value using a second knowledge distillation loss function from the plurality of first profiles and the plurality of second profiles comprises:
according to the first feature maps and the second feature maps, calculating a first similarity matrix of feature maps corresponding to every two different image samples of the same target in the image sample set; wherein, the feature maps corresponding to every two different image samples of the same target comprise: a first feature map of a first image sample and a second feature map of a second image sample of the same object;
calculating a second distance according to each first similarity matrix;
according to the first feature maps and the second feature maps, calculating a second similarity matrix of the feature maps corresponding to the image samples of every two different targets in the image sample set; wherein, the feature map corresponding to the image samples of every two different targets comprises: a first feature map corresponding to the image sample of the first target and a second feature map corresponding to the image sample of the second target;
calculating a third distance according to each second similarity matrix;
and determining a second distillation loss value by utilizing a second knowledge distillation loss function according to the second distance corresponding to the same target and the third distance corresponding to each two different targets.
8. The method of claim 7, wherein the second knowledge distillation loss function is a triplet loss function;
determining a second distillation loss value by using a second knowledge distillation loss function according to the second distance corresponding to the same target and the third distance corresponding to each two different targets, wherein the determining comprises the following steps:
and determining a second distillation loss value by utilizing the triple loss function according to the second distance corresponding to the same target and the third distance corresponding to each two different targets.
9. The method according to claim 7, wherein the calculating, according to the plurality of first feature maps and the plurality of second feature maps, a first similarity matrix of feature maps corresponding to every two different image samples of a same target in the image sample set includes:
according to the first feature maps and the second feature maps, calculating a first similarity matrix of feature maps corresponding to every two different image samples of the same target in the image sample set;
wherein, the feature maps corresponding to every two different image samples of the same target comprise: the first feature map of the first image sample of the same object and the second feature map of the second image sample with the lowest similarity to the first feature map of the first image sample of the same object, and the second feature map of the first image sample of the same object and the first feature map of the second image sample with the lowest similarity to the second feature map of the first image sample of the same object; the second image sample is a positive sample.
10. The method according to claim 7, wherein the calculating a second similarity matrix of the feature maps corresponding to the image samples of each two different targets in the image sample set according to the plurality of first feature maps and the plurality of second feature maps comprises:
according to the first feature maps and the second feature maps, calculating a second similarity matrix of the feature maps corresponding to the image samples of every two different targets in the image sample set;
wherein, the feature map corresponding to the image samples of every two different targets comprises: a first feature map corresponding to an image sample of a first target and a second feature map corresponding to an image sample of a second target with the lowest similarity to the first feature map, and a second feature map corresponding to the image sample of the first target and a first feature map corresponding to an image sample of the second target with the lowest similarity to the second feature map; the image samples of the second target are negative samples.
11. A method of object re-identification, comprising:
determining a target image;
performing target re-identification on the target image by using a convolutional neural network model trained by the method of any one of claims 1 to 10 to identify a target object in the target image.
12. A training apparatus for an object re-recognition model, comprising:
the first determining module is used for determining a plurality of first feature maps output by a pre-trained Transformer model and a plurality of second feature maps output by a convolutional neural network model to be trained on the basis of the image sample set;
a second determination module for determining a first distillation loss value using a first knowledge distillation loss function based on the plurality of first feature maps and the plurality of second feature maps;
a third determining module for determining a second distillation loss value using a second knowledge distillation loss function based on the plurality of first feature maps and the plurality of second feature maps; and
and the training module is used for training the convolutional neural network model to be trained according to the first distillation loss value and the second distillation loss value so as to obtain the convolutional neural network model capable of carrying out target re-identification.
13. The apparatus of claim 12, wherein the set of image samples comprises image samples of a plurality of objects, each object of the plurality of objects corresponding to a plurality of image samples.
14. The apparatus of claim 12, wherein the first determining means comprises:
a first determining submodule, configured to determine, based on the image sample set, a plurality of first feature maps output by the pre-trained Transformer model;
the second determining submodule is used for determining a plurality of second feature maps output by the convolutional neural network model to be trained on the basis of the image sample set;
and the adjusting submodule is used for adjusting the sizes of the plurality of first feature maps and/or the plurality of second feature maps in response to the fact that the sizes of the plurality of first feature maps are different from the sizes of the plurality of second feature maps.
15. The apparatus of claim 14, wherein the adjustment submodule is to:
in response to the first plurality of feature maps differing in size from the second plurality of feature maps, performing a resizing in one of:
down-sampling the plurality of first feature maps to fit the sizes of the plurality of first feature maps and the plurality of second feature maps;
performing upsampling processing on the plurality of second feature maps to adapt the sizes of the plurality of first feature maps and the plurality of second feature maps;
and performing down-sampling processing on the plurality of first feature maps and performing up-sampling processing on the plurality of second feature maps so as to adapt the sizes of the plurality of first feature maps and the plurality of second feature maps.
16. The apparatus of any of claims 12 to 15, wherein the second determining means comprises:
the first calculation module is used for calculating, for each first feature map of the plurality of first feature maps, an image block similarity matrix between the first feature map and itself;
the second calculation module is used for calculating, for each second feature map of the plurality of second feature maps, a feature vector similarity matrix between the second feature map and itself;
the third calculation module is used for calculating a first distance according to an image block similarity matrix of the first feature map and a feature vector similarity matrix of the second feature map corresponding to the same image sample of the same target in the image sample set;
and the fourth determining module is used for determining a first distillation loss value by utilizing a first knowledge distillation loss function according to the first distance of each image sample in the image sample set.
17. The apparatus of claim 16, wherein the first knowledge distillation loss function is a mean square error loss function;
and the fourth determining module is used for determining a first distillation loss value by using a mean square error loss function according to the first distance of each image sample in the image sample set.
18. The apparatus of any of claims 12 to 15, wherein the third determining means comprises:
the fourth calculation submodule is used for calculating a first similarity matrix of the feature maps corresponding to every two different image samples of the same target in the image sample set according to the first feature maps and the second feature maps; wherein, the feature maps corresponding to every two different image samples of the same target comprise: a first feature map of a first image sample and a second feature map of a second image sample of the same object;
the fifth calculation submodule is used for calculating the second distance according to each first similarity matrix;
a sixth calculating sub-module, configured to calculate, according to the multiple first feature maps and the multiple second feature maps, a second similarity matrix of feature maps corresponding to image samples of every two different targets in the image sample set; wherein, the feature map corresponding to the image samples of every two different targets comprises: a first feature map corresponding to the image sample of the first target and a second feature map corresponding to the image sample of the second target;
the seventh calculation submodule is used for calculating a third distance according to each second similarity matrix;
and the fifth determining module is used for determining a second distillation loss value by utilizing a second knowledge distillation loss function according to the second distance corresponding to the same target and the third distance corresponding to each two different targets.
19. The apparatus of claim 18, wherein the second knowledge distillation loss function is a triplet loss function;
and the fifth determining module is used for determining a second distillation loss value by utilizing the triple loss function according to the second distance corresponding to the same target and the third distance corresponding to each two different targets.
20. The apparatus of claim 18, wherein the fourth computation submodule is to:
according to the first feature maps and the second feature maps, calculating a first similarity matrix of feature maps corresponding to every two different image samples of the same target in the image sample set;
wherein, the feature maps corresponding to every two different image samples of the same target comprise: the first feature map of the first image sample of the same object and the second feature map of the second image sample with the lowest similarity to the first feature map of the first image sample of the same object, and the second feature map of the first image sample of the same object and the first feature map of the second image sample with the lowest similarity to the second feature map of the first image sample of the same object; the second image sample is a positive sample.
21. The apparatus of claim 18, wherein the sixth computation submodule is to:
according to the first feature maps and the second feature maps, calculating a second similarity matrix of the feature maps corresponding to the image samples of every two different targets in the image sample set;
wherein, the feature map corresponding to the image samples of every two different targets comprises: a first feature map corresponding to an image sample of a first target and a second feature map corresponding to an image sample of a second target with the lowest similarity to the first feature map, and a second feature map corresponding to the image sample of the first target and a first feature map corresponding to an image sample of the second target with the lowest similarity to the second feature map; the image samples of the second target are negative samples.
22. An object re-identification apparatus comprising:
an image determination module for determining a target image;
an image recognition module, configured to perform target re-recognition on the target image by using the convolutional neural network model trained by the method according to any one of claims 1 to 10, so as to identify a target object in the target image.
23. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1 to 11.
24. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any one of claims 1 to 11.
25. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1 to 11.
CN202211272814.1A 2022-10-18 2022-10-18 Training method of target re-identification model and target re-identification method Active CN115578613B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211272814.1A CN115578613B (en) 2022-10-18 2022-10-18 Training method of target re-identification model and target re-identification method

Publications (2)

Publication Number Publication Date
CN115578613A 2023-01-06
CN115578613B CN115578613B (en) 2024-03-08

Family

ID=84584149

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211272814.1A Active CN115578613B (en) 2022-10-18 2022-10-18 Training method of target re-identification model and target re-identification method

Country Status (1)

Country Link
CN (1) CN115578613B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200160113A1 (en) * 2018-11-19 2020-05-21 Google Llc Training image-to-image translation neural networks
CN111554268A (en) * 2020-07-13 2020-08-18 腾讯科技(深圳)有限公司 Language identification method based on language model, text classification method and device
CN114494776A (en) * 2022-01-24 2022-05-13 北京百度网讯科技有限公司 Model training method, device, equipment and storage medium
WO2022104550A1 (en) * 2020-11-17 2022-05-27 华为技术有限公司 Model distillation training method and related apparatus, device, and readable storage medium

Also Published As

Publication number Publication date
CN115578613B (en) 2024-03-08

Similar Documents

Publication Publication Date Title
US11768876B2 (en) Method and device for visual question answering, computer apparatus and medium
CN113920307A (en) Model training method, device, equipment, storage medium and image detection method
CN113971751A (en) Training feature extraction model, and method and device for detecting similar images
CN112561060B (en) Neural network training method and device, image recognition method and device and equipment
CN111488985A (en) Deep neural network model compression training method, device, equipment and medium
US20230068238A1 (en) Method and apparatus for processing image, electronic device and storage medium
CN113963176B (en) Model distillation method and device, electronic equipment and storage medium
CN113642583B (en) Deep learning model training method for text detection and text detection method
CN115482395B (en) Model training method, image classification device, electronic equipment and medium
CN114187459A (en) Training method and device of target detection model, electronic equipment and storage medium
CN112966744A (en) Model training method, image processing method, device and electronic equipment
CN113361710A (en) Student model training method, picture processing device and electronic equipment
US20230046088A1 (en) Method for training student network and method for recognizing image
CN114266897A (en) Method and device for predicting pox types, electronic equipment and storage medium
CN115147680B (en) Pre-training method, device and equipment for target detection model
CN115409855A (en) Image processing method, image processing device, electronic equipment and storage medium
CN114913339A (en) Training method and device of feature map extraction model
CN114120454A (en) Training method and device of living body detection model, electronic equipment and storage medium
CN115273148B (en) Pedestrian re-recognition model training method and device, electronic equipment and storage medium
CN113887535B (en) Model training method, text recognition method, device, equipment and medium
CN116229584A (en) Text segmentation recognition method, system, equipment and medium in artificial intelligence field
CN115879004A (en) Target model training method, apparatus, electronic device, medium, and program product
CN115578613B (en) Training method of target re-identification model and target re-identification method
CN114707638A (en) Model training method, model training device, object recognition method, object recognition device, object recognition medium and product
CN114821190A (en) Image classification model training method, image classification method, device and equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant