CN112597866B - Knowledge distillation-based visible light-infrared cross-modal pedestrian re-identification method - Google Patents

Knowledge distillation-based visible light-infrared cross-modal pedestrian re-identification method Download PDF

Info

Publication number
CN112597866B
CN112597866B (application number CN202011489557.8A)
Authority
CN
China
Prior art keywords
picture
loss function
infrared
visible light
follows
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011489557.8A
Other languages
Chinese (zh)
Other versions
CN112597866A (en)
Inventor
邵昊
高广谓
吴飞
徐国安
岳东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Posts and Telecommunications filed Critical Nanjing University of Posts and Telecommunications
Priority to CN202011489557.8A priority Critical patent/CN112597866B/en
Publication of CN112597866A publication Critical patent/CN112597866A/en
Application granted granted Critical
Publication of CN112597866B publication Critical patent/CN112597866B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 - Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G06F18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F18/24133 - Distances to prototypes
    • G06F18/24137 - Distances to cluster centroïds
    • G06F18/2414 - Smoothing the distance, e.g. radial basis function networks [RBFN]
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G06N3/084 - Backpropagation, e.g. using gradient descent
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/10 - Image acquisition
    • G06V10/12 - Details of acquisition arrangements; Constructional details thereof
    • G06V10/14 - Optical characteristics of the device performing the acquisition or on the illumination arrangements
    • G06V10/143 - Sensing or illuminating at different wavelengths
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/40 - Extraction of image or video features
    • G06V10/44 - Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Human Computer Interaction (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Investigating Or Analysing Materials By Optical Means (AREA)

Abstract

The invention discloses a knowledge distillation-based visible light-infrared cross-modal pedestrian re-identification method. The method is built on a residual network and comprises a feature extraction part, a feature mapping part and a loss function part. K pairs of pictures are first input to the feature extraction part for shallow feature extraction, each pair comprising a visible light picture and an infrared picture of the same target; a knowledge distillation function is introduced and its loss is calculated. The shallow feature extraction results are then input to the feature mapping part, which extracts the features shared by the visible light modality and the infrared modality. Finally, the classification results are output after passing, in sequence, through a GEM pooling layer, a batch normalization layer and a fully connected layer. The invention also designs an improved enumeration loss function, further alleviating the large modality gap between the conventional visible light image modality and the infrared image modality.

Description

Knowledge distillation-based visible light-infrared cross-modal pedestrian re-identification method
Technical Field
The invention relates to the technical field of computer vision, in particular to a knowledge distillation-based visible light-infrared cross-modal pedestrian re-identification method.
Background
Pedestrian re-identification is a popular research topic in computer vision. It combines computer image processing and statistical techniques and is widely applied in security, intelligent surveillance and related fields. Its main difficulties are large intra-class differences (the appearance of the same person can vary greatly) and small inter-class differences (the appearance of different persons can be very similar), which stem mainly from factors such as camera viewing angle, illumination changes, pedestrian pose changes and occlusion. Prior-art pedestrian re-identification algorithms mainly study daytime re-identification based on visible light (RGB) images. However, night scenes are also crucial in surveillance, security and related fields. Although many surveillance cameras can automatically switch from a visible light mode to an infrared mode and thus acquire both color (RGB) images and infrared images, many otherwise excellent pedestrian re-identification algorithms do not support matching between color and infrared images, because of the large modality gap between them: a visible light image has 3 channels containing color information, whereas an infrared image has only 1 channel containing invisible-light information.
At present, pedestrian re-identification algorithms for the visible light-infrared cross-modal setting fall mainly into two classes: (1) methods based on dual-stream networks; (2) methods based on generative adversarial networks. The first class addresses inter-modality differences by aligning the feature distributions of the different modalities; the second class resolves inter-modality differences through modality conversion while preserving identity information.
Disclosure of Invention
The purpose of the invention is as follows: aiming at the problems of conventional pedestrian re-identification methods, such as large intra-class differences, small inter-class differences and susceptibility to occlusion, the invention provides a knowledge distillation-based visible light-infrared cross-modal pedestrian re-identification method that effectively alleviates the large modality gap between the visible light image modality and the infrared image modality.
The technical scheme is as follows: in order to achieve the above purpose, the invention adopts the following technical scheme:
a knowledge distillation-based visible light-infrared cross-modal pedestrian re-identification method is characterized by comprising the following steps:
Step S1: initially input K pairs of pictures into the feature extraction part and perform shallow feature extraction, each of the K pairs comprising a visible light picture and an infrared picture of the same target; the feature extraction is as follows:
I_V = F_V(i_V)
I_T = F_T(i_T)
where i_V denotes a visible light picture, F_V denotes the shallow feature extraction for visible light, and I_V denotes the features extracted from the visible light picture; i_T denotes an infrared picture, F_T denotes the shallow feature extraction for the infrared picture, and I_T denotes the features extracted from the infrared picture;
Step S2: introduce a knowledge distillation function KD Loss and, from the feature pair I_V and I_T obtained in step S1, calculate the loss function as follows:
[formula for L_KD shown as an image in the original document]
Step S3: input the feature pair I_V and I_T obtained in step S1 into the feature mapping part, and extract the modality-shared features of the visible light modality and the infrared modality as follows:
K_V = E(I_V)
K_T = E(I_T)
where E denotes the deep extraction of modality-shared features, and K_V and K_T denote the extracted shared features;
Step S4: pass the modality-shared features K_V and K_T obtained in step S3 through the GEM pooling layer, the batch normalization layer and the fully connected layer in sequence, and output the classification results as follows:
L_V = FC(BN(GEM(K_V)))
L_T = FC(BN(GEM(K_T)))
where GEM denotes the pooling operation, defined as follows:
x = ( (1/|Ω|) Σ_{u ∈ Ω} u^p )^(1/p)
where x is the feature output by the pooling operation and Ω denotes the set of activations in the input feature map; p is a hyper-parameter that can be set in advance or learned by back propagation, with p → ∞ corresponding to maximum pooling and p → 1 to average pooling;
BN denotes the batch normalization operation and FC denotes the fully connected layer;
Step S5: introduce an improved enumeration loss function;
Step S5.1: introduce the inter-class cross-modal enumeration loss function L_c as follows:
[formula for L_c shown as an image in the original document]
where {(x_1, y_1), (x_2, y_2), …, (x_i, y_i), …, (x_n, y_n)} denotes N pairs of pictures from different modalities, x_i is the anchor (standard) sample picture, y_i is the positive sample picture, and y_j is a negative sample picture;
Step S5.2: introduce the intra-class same-modality enumeration loss function L_s as follows:
[formula for L_s shown as an image in the original document]
where {(x_1, y_1), (x_2, y_2), …, (x_i, y_i), …, (x_n, y_n)} denotes N pairs of pictures from different modalities, x_i is the anchor (standard) sample picture, y_i is the positive sample picture, and x_j is a negative sample picture;
Step S5.3: introduce the compact term C as follows:
[formula for C shown as an image in the original document]
where f_r(y_i) denotes the r-th element of f(y_i), f̄(y_i) denotes the mean of f(y_i), and R is the dimension of the output deep local feature representation f(y_i);
the final enumeration loss function is as follows:
L_enumerate = L_c + L_s + λC
where λ is a balancing coefficient that weights the compact term C;
Step S6: integrate the identity information into the overall loss function; specifically, the cross entropy loss functions are designed as follows:
[formulas for L_idv and L_idt shown as images in the original document]
where N denotes the number of identity classes of the samples, L_idt denotes the cross entropy loss function of an infrared picture, L_idv denotes the cross entropy loss function of a visible light picture, q(·) denotes the predicted label, p(·) denotes the true label, x_i denotes a visible light picture, and y_i denotes an infrared picture;
Step S7: based on steps S2, S5 and S6, the final loss function is as follows:
L_total = L_enumerate + L_idv + L_idt + L_KD
has the advantages that:
according to the method, shallow feature extraction is carried out on an infrared image and a visible light image to extract modal unique features of a visible light modality and an infrared modality, then deep feature extraction is carried out on the infrared image and the visible light image in a feature mapping part to extract modal sharing features of the visible light modality and the infrared modality, and finally classification results are output through operations of pooling, batch normalization and the like. A knowledge distillation function is introduced, the difference between a visible light mode and an infrared mode in a shallow network is reduced, a common feature space with different angles can be learned by designing an improved enumeration loss function, and the included angle between mapping features can be effectively constrained through the common feature space. Most of the existing work uses euclidean metric based constraints to account for differences between different modal characteristics. However, these methods cannot learn angle-discriminating embedded features because euclidean distance cannot effectively measure the angle between embedded features, and thus the improved enumeration loss function solves this problem with cosine distances. Since a common feature space that is angularly distinguishable is particularly important for classification based on pedestrian images between mapped features, an improved enumeration loss function may better learn this space.
Drawings
Fig. 1 is an overall network framework diagram of a pedestrian re-identification method provided by the invention.
Detailed Description
The present invention will be further described with reference to the accompanying drawings.
The knowledge distillation-based visible light-infrared cross-modal pedestrian re-identification method is based on the residual network shown in Fig. 1. Specifically, the residual network comprises residual blocks, convolutional layers, batch normalization layers, activation function layers, fully connected layers and pooling layers.
In the figure, stage0, stage1, stage2, stage3 and stage4 respectively denote the shallow convolutional layer, the first residual block, the second residual block, the third residual block and the fourth residual block of the ResNet-50 network. The structures of the shallow convolutional layer and the residual blocks are shown in Table 1 below. GEM denotes the pooling operation, BNNeck is a batch normalization layer, and FC is a fully connected layer.
TABLE 1 Structure of the shallow convolutional layer and the residual blocks
[Table 1 is shown as an image in the original document]
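To make the structure concrete, the following PyTorch sketch (not part of the patent text; class and variable names are illustrative assumptions) shows one way to realize the two-stream design described above: stage0 is duplicated per modality for modality-specific shallow feature extraction, while stage1 to stage4 are shared. Under this reading, F_V and F_T in step S1 correspond to the two stage0 branches and E in step S3 to the shared stages.

```python
import copy
import torch
import torch.nn as nn
from torchvision.models import resnet50

class TwoStreamBackbone(nn.Module):
    """Illustrative sketch of the two-stream ResNet-50 described in Fig. 1.

    stage0 (conv1 + bn1 + relu + maxpool) is duplicated per modality;
    stage1 to stage4 (the four residual blocks) are shared by both modalities.
    """
    def __init__(self):
        super().__init__()
        base = resnet50()  # randomly initialized ResNet-50; load pretrained weights as needed
        stage0 = nn.Sequential(base.conv1, base.bn1, base.relu, base.maxpool)
        self.stage0_visible = stage0                  # F_V: shallow branch for RGB pictures
        self.stage0_infrared = copy.deepcopy(stage0)  # F_T: shallow branch for infrared pictures
        self.shared = nn.Sequential(base.layer1, base.layer2,
                                    base.layer3, base.layer4)  # E: shared deep stages

    def forward(self, i_v, i_t):
        I_V = self.stage0_visible(i_v)    # modality-specific shallow features
        I_T = self.stage0_infrared(i_t)
        K_V = self.shared(I_V)            # modality-shared deep features
        K_T = self.shared(I_T)
        return I_V, I_T, K_V, K_T
```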
The re-identification method mainly comprises 3 parts, namely a feature extraction part, a feature mapping part and a loss function part, and the implementation mode of each part is described in detail below.
(1) Feature extraction section
S1, initially inputting K to a feature extraction part of the picture, and performing shallow feature extraction; each pair of the K pairs of pictures comprises a visible light picture and an infrared picture aiming at the same target; the feature extraction is as follows:
I V =F V (i V )
I T =F T (i T )
wherein i V Representing a visible light picture, F V Indicating shallow feature extraction of visible light, I V Representing features of visible light picture extraction; i.e. i T Representing an infrared picture, F T Indicating shallow feature extraction of an infrared picture, I T Representing the characteristics of infrared picture extraction;
Step S2: in order to reduce the gap between the visible light modality and the infrared modality in the shallow network, introduce a knowledge distillation function KD Loss and, from the feature pair I_V and I_T obtained in step S1, calculate the loss function as follows:
[formula for L_KD shown as an image in the original document]
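The exact expression for L_KD appears only as an image in this extract. As an assumption for illustration, the sketch below aligns the two shallow feature maps with a mean-squared-error term, one common realization of feature-level distillation; it is not necessarily the formula used in the patent.

```python
import torch.nn.functional as F

def kd_loss(I_V, I_T):
    """Hypothetical knowledge-distillation term between the shallow visible and
    infrared feature maps (the patent's exact formula is shown only as an image
    here); a plain MSE alignment is assumed for illustration."""
    return F.mse_loss(I_V, I_T)
```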
(2) Feature mapping section
Step S3: input the feature pair I_V and I_T obtained in step S1 into the feature mapping part, and extract the modality-shared features of the visible light modality and the infrared modality as follows:
K_V = E(I_V)
K_T = E(I_T)
where E denotes the deep extraction of modality-shared features, and K_V and K_T denote the extracted shared features;
Step S4: pass the modality-shared features K_V and K_T obtained in step S3 through the GEM pooling layer, the batch normalization layer and the fully connected layer in sequence, and output the classification results as follows:
L_V = FC(BN(GEM(K_V)))
L_T = FC(BN(GEM(K_T)))
where GEM denotes the pooling operation, BN denotes the batch normalization operation, and FC denotes the fully connected layer.
Because pedestrian re-identification is a fine-grained instance retrieval task, the widely used max pooling and average pooling fail to capture domain-specific discriminative features. Instead of these, a GEM pooling layer is used to convert the 3-dimensional feature maps into 1-dimensional feature vectors. Given the 3-dimensional features, the pooling is defined as follows:
x = ( (1/|Ω|) Σ_{u ∈ Ω} u^p )^(1/p)
where x is the feature output by the pooling operation and Ω denotes the set of activations in the input feature map; p is a hyper-parameter that can be set in advance or learned by back propagation, with p → ∞ corresponding to maximum pooling and p → 1 to average pooling.
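A minimal PyTorch sketch of a GEM pooling layer consistent with the formula above might look as follows; here p is a learnable parameter, one of the two options mentioned in the text, and the clamping epsilon is an implementation detail assumed for numerical stability.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GeM(nn.Module):
    """Generalized-mean pooling: x = ((1/|Ω|) * Σ_u u^p)^(1/p).

    p → ∞ approaches max pooling, p → 1 gives average pooling."""
    def __init__(self, p=3.0, eps=1e-6):
        super().__init__()
        self.p = nn.Parameter(torch.ones(1) * p)  # learnable pooling exponent
        self.eps = eps

    def forward(self, x):
        # x: (B, C, H, W) feature map -> (B, C) pooled feature vector
        x = x.clamp(min=self.eps).pow(self.p)
        x = F.avg_pool2d(x, kernel_size=x.shape[-2:]).pow(1.0 / self.p)
        return x.flatten(1)
```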
after GEM pooling, in order to enable an enumeration loss function to constrain features in a free Euclidean space and simultaneously constrain features near a hypersphere for classification loss, a batch normalization layer is introduced, and then the features are obtained after dropout and activation function operation.
(3) Loss function section
Step S5, introducing an improved enumeration loss function;
this section will discuss the enumerated penalty function proposed in this patent in more detail. The enumeration loss function provided by the invention is inspired by a common triplet loss function, and the calculation formula of the common triplet loss function is as follows:
L_triplet = Σ_i [ ||f(x_i^a) - f(x_i^p)||_2^2 - ||f(x_i^a) - f(x_i^n)||_2^2 + α ]_+
where x_i^a, x_i^p and x_i^n respectively denote the anchor (standard) sample picture, the positive sample picture and the negative sample picture, f(·) denotes the feature extraction operation, ||·||_2^2 denotes the squared Euclidean distance, α denotes a preset hyper-parameter, and [z]_+ = max(z, 0).
The existing triplet loss function raises two key issues: (1) the selection of sample pictures and (2) the setting of the hyper-parameter α. In existing triplet losses, sample pictures are usually selected with an online hard/soft mining strategy and α is set manually, which leaves the question of how to select suitable sample pictures in a cross-modal scene.
Suppose there are N pairs of pictures from different modalities, {(x_1, y_1), (x_2, y_2), …, (x_i, y_i), …, (x_n, y_n)}, input to the network for training, where x_i and y_i form a pair of pictures from different modalities but with the same identity information. If the anchor (standard) sample picture is x_i, then the positive sample picture is y_i, but the negative sample picture can take different forms, x_j or y_j, where j denotes identity information different from i. In one form, the negative sample is selected from pictures with a modality and identity information both different from those of x_i, which gives the following cross-modal triplet loss function:
L_cross = Σ_i [ ||f(x_i) - f(y_i)||_2^2 - ||f(x_i) - f(y_j)||_2^2 + α ]_+
where L_cross denotes the resulting cross-modal triplet loss function.
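For comparison with the enumeration loss introduced next, a batch-wise sketch of this cross-modal triplet loss is given below. It is an illustrative assumption: hardest-negative mining within the batch and averaging over anchors are not specified in the text.

```python
import torch
import torch.nn.functional as F

def cross_modal_triplet_loss(f_x, f_y, labels, alpha=0.3):
    """Illustrative L_cross: anchors are visible features f(x_i), positives are
    the paired infrared features f(y_i), and negatives are infrared features
    f(y_j) of a different identity. alpha is the margin hyper-parameter."""
    dist = torch.cdist(f_x, f_y).pow(2)                # squared Euclidean distances
    pos = dist.diag()                                  # d(x_i, y_i)
    diff_id = labels.unsqueeze(0) != labels.unsqueeze(1)
    neg = dist.masked_fill(~diff_id, float('inf')).min(dim=1).values  # hardest y_j
    return F.relu(pos - neg + alpha).mean()
```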
Because the difference between the infrared modality and the visible light modality is smaller at the local level, and traditional hand-crafted local feature descriptors concentrate on reducing the difference between modalities, an enumeration loss function is proposed on this basis. The purpose of this loss function is to eliminate the local inter-modality differences as completely as possible by means of a deep convolutional network. Specifically, the enumeration loss requires that the distance between the anchor (standard) sample picture x_i and the positive sample picture y_i is smaller not only than the distance to x_j (same modality as the anchor, different identity information) but also than the distance to y_j (different modality from the anchor, different identity information).
Step S5.1: introduce the inter-class cross-modal enumeration loss function L_c as follows:
[formula for L_c shown as an image in the original document]
where {(x_1, y_1), (x_2, y_2), …, (x_i, y_i), …, (x_n, y_n)} denotes N pairs of pictures from different modalities, x_i is the anchor (standard) sample picture, y_i is the positive sample picture, and y_j is a negative sample picture; this embodiment assumes by default that the intra-class cross-modal variation is smaller than the inter-class cross-modal variation.
Step S5.2: introduce the intra-class same-modality enumeration loss function L_s as follows:
[formula for L_s shown as an image in the original document]
where {(x_1, y_1), (x_2, y_2), …, (x_i, y_i), …, (x_n, y_n)} denotes N pairs of pictures from different modalities, x_i is the anchor (standard) sample picture, y_i is the positive sample picture, and x_j is a negative sample picture; this embodiment assumes by default that the intra-class cross-modal variation is smaller than the inter-class same-modality variation.
Step S5.3: the two loss functions above help train deep local feature descriptions while ignoring inter-modality differences during the training phase. In practice, however, training with only the combination of these two losses converges with difficulty, so a compact term C is introduced to make each dimension of the generated deep local feature description as uniformly distributed as possible, yielding a more compact and informative feature description. The compact term C is introduced as follows:
[formula for C shown as an image in the original document]
where f_r(y_i) denotes the r-th element of f(y_i), f̄(y_i) denotes the mean of f(y_i), and R is the dimension of the output deep local feature representation f(y_i). The purpose of the compact term is to avoid overfitting of the network during training (in experiments the network is difficult to converge without it); it helps reduce redundancy, making the deep local features more discriminative and informative.
The final enumeration loss function is as follows:
L_enumerate = L_c + L_s + λC
where λ is a balancing coefficient that weights the compact term C;
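The exact expressions for L_c, L_s and the compact term C appear only as images in this extract, so the PyTorch sketch below is an assumption built from the surrounding description (cosine rather than Euclidean distances, intra-class cross-modal distances pushed below both inter-class cross-modal and inter-class same-modality distances, plus a compactness penalty) rather than a reproduction of the patent's formulas. The margin and the form of C are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def enumerate_loss(f_x, f_y, labels, margin=0.3, lam=0.1):
    """Assumed sketch of L_enumerate = L_c + L_s + lambda * C.

    f_x: visible-light features (N, D); f_y: infrared features (N, D);
    labels: identity labels (N,). Cosine distance 1 - cos(a, b) replaces the
    Euclidean distance, as motivated in the text."""
    x = F.normalize(f_x, dim=1)
    y = F.normalize(f_y, dim=1)
    d_cross = 1.0 - x @ y.t()                          # cross-modal cosine distances
    d_same = 1.0 - x @ x.t()                           # same-modality cosine distances
    pos = d_cross.diag()                               # d(x_i, y_i)
    diff_id = labels.unsqueeze(0) != labels.unsqueeze(1)

    # L_c: intra-class cross-modal distance < inter-class cross-modal distance d(x_i, y_j)
    neg_c = d_cross.masked_fill(~diff_id, float('inf')).min(dim=1).values
    L_c = F.relu(pos - neg_c + margin).mean()

    # L_s: intra-class cross-modal distance < inter-class same-modality distance d(x_i, x_j)
    neg_s = d_same.masked_fill(~diff_id, float('inf')).min(dim=1).values
    L_s = F.relu(pos - neg_s + margin).mean()

    # C: assumed compactness term over the elements of each infrared feature f(y_i)
    C = (f_y - f_y.mean(dim=1, keepdim=True)).pow(2).mean()
    return L_c + L_s + lam * C
```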
step S6, the characteristics of the visible and infrared images may be completely different due to cross-modality variation, and therefore, the loss function will fall into a convergence problem due to incorrect relationship metrics, and is difficult to converge for large datasets. At the same time, learned features cannot account for intra-class variations by using relationship constraints only. The identity information is thus integrated into the overall loss function in this embodiment. This is done using a cross entropy loss function that is widely used. The identity loss function will model the identity specific information to enhance robustness in the feature learning process. The cross entropy loss function is calculated as follows:
[formulas for L_idv and L_idt shown as images in the original document]
where N denotes the number of identity classes of the samples, L_idt denotes the cross entropy loss function of an infrared picture, L_idv denotes the cross entropy loss function of a visible light picture, q(·) denotes the predicted label, p(·) denotes the true label, x_i denotes a visible light picture, and y_i denotes an infrared picture;
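The identity loss is a standard cross-entropy classification loss over the identity labels. A minimal sketch is shown below; the shared classifier head, feature dimension and number of identities are illustrative assumptions.

```python
import torch
import torch.nn as nn

feat_dim = 2048          # assumed ResNet-50 feature dimension after pooling
num_identities = 395     # assumed number of identity classes N in the training set

classifier = nn.Linear(feat_dim, num_identities, bias=False)  # shared identity classifier
ce = nn.CrossEntropyLoss()

def identity_loss(feat_v, feat_t, labels):
    """L_idv + L_idt: cross-entropy on the pooled visible and infrared features."""
    L_idv = ce(classifier(feat_v), labels)   # visible-light identity loss
    L_idt = ce(classifier(feat_t), labels)   # infrared identity loss
    return L_idv, L_idt
```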
step S7, based on steps S2, S5, and S6, the final loss function is as follows:
L total =L enumerate +L idv +L idt +L KD
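Putting the pieces together, one training step under L_total could look like the following sketch, which reuses the illustrative components defined in the earlier snippets and omits the BNNeck layer, dropout and data loading for brevity; the Adam settings follow the experiment description below.

```python
import torch

model = TwoStreamBackbone()   # two-stream ResNet-50 sketch from above
gem = GeM()
params = list(model.parameters()) + list(gem.parameters()) + list(classifier.parameters())
optimizer = torch.optim.Adam(params, lr=1e-4)   # initial learning rate 1e-4

def train_step(i_v, i_t, labels):
    I_V, I_T, K_V, K_T = model(i_v, i_t)          # shallow and deep features
    f_v, f_t = gem(K_V), gem(K_T)                 # pooled 1-D feature vectors
    L_KD = kd_loss(I_V, I_T)                      # shallow-feature distillation term
    L_enum = enumerate_loss(f_v, f_t, labels)     # improved enumeration loss
    L_idv, L_idt = identity_loss(f_v, f_t, labels)
    loss = L_enum + L_idv + L_idt + L_KD          # L_total
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```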
in the experiment, an Adam optimizer is selected to optimize the model, the initial learning rate is set to be 1 x 10-4, and the results of part of the experiment are shown in the following tables 2-3. The present invention is still optimal in accuracy on the SYSU-MM01 dataset without using pre-training experimental processing methods. Compared with a Hi-cmd method, the rank1 value is improved by 8.29%, the index is most important in an actual application scene, and other indexes are obviously improved. Meanwhile, on the RegDB data set, the present invention is still optimal in terms of accuracy without using a pre-training experimental processing method. Compared with the best Edfl method, rank1 is improved by 17, 72%, and other indexes such as Map and the like are also improved obviously. In conclusion, the improvement of the two data sets in the invention is a great difference in the pedestrian re-identification field.
TABLE 2 Experimental results on the SYSU-MM01 dataset
[Table 2 is shown as an image in the original document]
TABLE 3 Experimental results on the RegDB dataset

Method  | Rank 1 | Rank 10 | Rank 20 | mAP
Zero[1] | 17.75  | 34.21   | 44.35   | 18.9
Hcml[2] | 24.44  | 47.53   | 56.78   | 20.8
Hsme[3] | 50.85  | 73.36   | 81.66   | 47
Edfl[5] | 52.58  | 72.1    | 81.47   | 52.98
Ours    | 70.3   | 80.31   | 87.25   | 69.32
The above description is only of the preferred embodiments of the present invention, and it should be noted that: it will be apparent to those skilled in the art that various modifications and adaptations can be made without departing from the principles of the invention and these are intended to be within the scope of the invention.

Claims (1)

1. A knowledge distillation-based visible light-infrared cross-modal pedestrian re-identification method is characterized by comprising the following steps:
Step S1: initially input K pairs of pictures into the feature extraction part and perform shallow feature extraction, each of the K pairs comprising a visible light picture and an infrared picture of the same target; the feature extraction is as follows:
I_V = F_V(i_V)
I_T = F_T(i_T)
where i_V denotes a visible light picture, F_V denotes the shallow feature extraction for visible light, and I_V denotes the features extracted from the visible light picture; i_T denotes an infrared picture, F_T denotes the shallow feature extraction for the infrared picture, and I_T denotes the features extracted from the infrared picture;
Step S2: introduce a knowledge distillation function KD Loss and, from the feature pair I_V and I_T obtained in step S1, calculate the loss function as follows:
[formula for L_KD shown as an image in the original document]
Step S3: input the feature pair I_V and I_T obtained in step S1 into the feature mapping part, and extract the modality-shared features of the visible light modality and the infrared modality as follows:
K_V = E(I_V)
K_T = E(I_T)
where E denotes the deep extraction of modality-shared features, and K_V and K_T denote the extracted shared features;
Step S4: pass the modality-shared features K_V and K_T obtained in step S3 through the GEM pooling layer, the batch normalization layer and the fully connected layer in sequence, and output the classification results as follows:
L_V = FC(BN(GEM(K_V)))
L_T = FC(BN(GEM(K_T)))
where GEM denotes the pooling operation, defined as follows:
x = ( (1/|Ω|) Σ_{u ∈ Ω} u^p )^(1/p)
where x is the feature output by the pooling operation and Ω denotes the set of activations in the input feature map; p is a hyper-parameter that can be set in advance or learned by back propagation, with p → ∞ corresponding to maximum pooling and p → 1 to average pooling;
BN denotes the batch normalization operation and FC denotes the fully connected layer;
Step S5: introduce an improved enumeration loss function;
Step S5.1: introduce the inter-class cross-modal enumeration loss function L_c as follows:
[formula for L_c shown as an image in the original document]
where {(x_1, y_1), (x_2, y_2), …, (x_i, y_i), …, (x_n, y_n)} denotes N pairs of pictures from different modalities, x_i is the anchor (standard) sample picture, y_i is the positive sample picture, and y_j is a negative sample picture;
Step S5.2: introduce the intra-class same-modality enumeration loss function L_s as follows:
[formula for L_s shown as an image in the original document]
where {(x_1, y_1), (x_2, y_2), …, (x_i, y_i), …, (x_n, y_n)} denotes N pairs of pictures from different modalities, x_i is the anchor (standard) sample picture, y_i is the positive sample picture, and x_j is a negative sample picture;
Step S5.3: introduce the compact term C as follows:
[formula for C shown as an image in the original document]
where f_r(y_i) denotes the r-th element of f(y_i), f̄(y_i) denotes the mean of f(y_i), and R is the dimension of the output deep local feature representation f(y_i);
The final enumeration loss function is as follows:
L_enumerate = L_c + L_s + λC
where λ is a balancing coefficient that weights the compact term C;
Step S6: integrate the identity information into the overall loss function; specifically, the cross entropy loss functions are designed as follows:
[formulas for L_idv and L_idt shown as images in the original document]
where N denotes the number of identity classes of the samples, L_idt denotes the cross entropy loss function of an infrared picture, L_idv denotes the cross entropy loss function of a visible light picture, q(·) denotes the predicted label, p(·) denotes the true label, x_i denotes a visible light picture, and y_i denotes an infrared picture;
Step S7: based on steps S2, S5 and S6, the final loss function is as follows:
L_total = L_enumerate + L_idv + L_idt + L_KD.
CN202011489557.8A 2020-12-16 2020-12-16 Knowledge distillation-based visible light-infrared cross-modal pedestrian re-identification method Active CN112597866B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011489557.8A CN112597866B (en) 2020-12-16 2020-12-16 Knowledge distillation-based visible light-infrared cross-modal pedestrian re-identification method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011489557.8A CN112597866B (en) 2020-12-16 2020-12-16 Knowledge distillation-based visible light-infrared cross-modal pedestrian re-identification method

Publications (2)

Publication Number Publication Date
CN112597866A (en) 2021-04-02
CN112597866B (en) 2022-08-02

Family

ID=75196844

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011489557.8A Active CN112597866B (en) 2020-12-16 2020-12-16 Knowledge distillation-based visible light-infrared cross-modal pedestrian re-identification method

Country Status (1)

Country Link
CN (1) CN112597866B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113128460B (en) * 2021-05-06 2022-11-08 东南大学 Knowledge distillation-based multi-resolution pedestrian re-identification method
CN113269117B (en) * 2021-06-04 2022-12-13 重庆大学 Knowledge distillation-based pedestrian re-identification method
CN113283362B (en) * 2021-06-04 2024-03-22 中国矿业大学 Cross-mode pedestrian re-identification method
CN114220124B (en) * 2021-12-16 2024-07-12 华南农业大学 Near infrared-visible light cross-mode double-flow pedestrian re-identification method and system
CN114550220B (en) * 2022-04-21 2022-09-09 中国科学技术大学 Training method of pedestrian re-recognition model and pedestrian re-recognition method
CN114694185B (en) * 2022-05-31 2022-11-04 浪潮电子信息产业股份有限公司 Cross-modal target re-identification method, device, equipment and medium
CN115115919B (en) * 2022-06-24 2023-05-05 国网智能电网研究院有限公司 Power grid equipment thermal defect identification method and device
CN116824695A (en) * 2023-06-07 2023-09-29 南通大学 Pedestrian re-identification non-local defense method based on feature denoising

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110717411A (en) * 2019-09-23 2020-01-21 湖北工业大学 Pedestrian re-identification method based on deep layer feature fusion
CN110796026A (en) * 2019-10-10 2020-02-14 湖北工业大学 Pedestrian re-identification method based on global feature stitching
CN111931637A (en) * 2020-08-07 2020-11-13 华南理工大学 Cross-modal pedestrian re-identification method and system based on double-current convolutional neural network

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110717411A (en) * 2019-09-23 2020-01-21 湖北工业大学 Pedestrian re-identification method based on deep layer feature fusion
CN110796026A (en) * 2019-10-10 2020-02-14 湖北工业大学 Pedestrian re-identification method based on global feature stitching
CN111931637A (en) * 2020-08-07 2020-11-13 华南理工大学 Cross-modal pedestrian re-identification method and system based on double-current convolutional neural network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Research on cross-modal pedestrian re-identification based on generative adversarial networks; Feng Min et al.; Modern Information Technology (《现代信息科技》); 2020-02-25 (Issue 04); full text *

Also Published As

Publication number Publication date
CN112597866A (en) 2021-04-02

Similar Documents

Publication Publication Date Title
CN112597866B (en) Knowledge distillation-based visible light-infrared cross-modal pedestrian re-identification method
CN107577990B (en) Large-scale face recognition method based on GPU (graphics processing Unit) accelerated retrieval
Zhang et al. Chinese sign language recognition with adaptive HMM
Wang et al. Large-scale isolated gesture recognition using convolutional neural networks
US20190057299A1 (en) System for building a map and subsequent localization
WO2021103721A1 (en) Component segmentation-based identification model training and vehicle re-identification methods and devices
CN110503076B (en) Video classification method, device, equipment and medium based on artificial intelligence
CN105184238A (en) Human face recognition method and system
CN108596010B (en) Implementation method of pedestrian re-identification system
JPH06150000A (en) Image clustering device
WO2023279935A1 (en) Target re-recognition model training method and device, and target re-recognition method and device
CN110516533A (en) A kind of pedestrian based on depth measure discrimination method again
CN117746467A (en) Modal enhancement and compensation cross-modal pedestrian re-recognition method
CN113076891A (en) Human body posture prediction method and system based on improved high-resolution network
CN112084895A (en) Pedestrian re-identification method based on deep learning
TW202125323A (en) Processing method of learning face recognition by artificial intelligence module
CN117351518B (en) Method and system for identifying unsupervised cross-modal pedestrian based on level difference
CN117333908A (en) Cross-modal pedestrian re-recognition method based on attitude feature alignment
CN112232147B (en) Method, device and system for self-adaptive acquisition of super-parameters of face model
CN116935329A (en) Weak supervision text pedestrian retrieval method and system for class-level comparison learning
Ran et al. Improving visible-thermal ReID with structural common space embedding and part models
CN109740405B (en) Method for detecting front window difference information of non-aligned similar vehicles
CN111738039A (en) Pedestrian re-identification method, terminal and storage medium
CN114972146A (en) Image fusion method and device based on generation countermeasure type double-channel weight distribution
CN110941994B (en) Pedestrian re-identification integration method based on meta-class-based learner

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant