CN109635728B - Heterogeneous pedestrian re-identification method based on asymmetric metric learning - Google Patents


Info

Publication number: CN109635728B (application CN201811515924.XA)
Authority: CN (China)
Prior art keywords: features, constraint, pedestrian, local, global
Legal status: Active (an assumption, not a legal conclusion)
Other languages: Chinese (zh)
Other versions: CN109635728A
Inventors: 赖剑煌, 程海杰, 张权
Original and current assignee: Sun Yat-sen University
Application filed by Sun Yat-sen University; priority to CN201811515924.XA
Publication of CN109635728A, later granted as CN109635728B

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00: Scenes; Scene-specific elements
    • G06V20/50: Context or environment of the image
    • G06V20/52: Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • G06V20/53: Recognition of crowd images, e.g. recognition of crowd congestion
    • G06V30/00: Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10: Character recognition
    • G06V30/19: Recognition using electronic means
    • G06V30/192: Recognition using electronic means using simultaneous comparisons or correlations of the image signals with a plurality of references
    • G06V30/194: References adjustable by an adaptive method, e.g. learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a heterogeneous pedestrian re-identification method based on asymmetric metric learning, which applies an asymmetric metric to depth features from different modalities. The method projects the depth features of each modality into a shared space using two sparse autoencoders that do not share parameters, while introducing a global constraint and a local constraint on the distances between cross-modal depth features, so that the intra-class distance between features of different modalities is reduced and the inter-class distance is increased; the results of the global and local constraints are back-propagated to the training network as supervision signals to correct its parameters. By reducing the differences between modalities, the network is encouraged to ignore modal information as much as possible and focus on identity information, thereby improving pedestrian feature expression and pedestrian matching accuracy.

Description

Heterogeneous pedestrian re-identification method based on asymmetric metric learning
Technical Field
The invention relates to the field of computer vision, in particular to a heterogeneous pedestrian re-identification method based on asymmetric metric learning.
Background
With the rapid development of modern society, urban population density keeps rising and public safety draws ever more attention. To prevent and respond to safety incidents in time, large numbers of surveillance cameras have been installed in public places. Faced with complex surveillance networks and massive volumes of monitoring data, automatically analyzing and interpreting the information provided by a multi-camera surveillance system plays a positive role in preventing crime and maintaining public security. Pedestrian re-identification has therefore become a hot research topic in the field of computer vision.
Pedestrian re-identification (person re-identification) is a key component of video surveillance research: its goal is to accurately and quickly match a target pedestrian seen in one camera's field of view against the large number of pedestrians seen by the other cameras of the surveillance network. Applying pedestrian re-identification greatly reduces the manual effort required in video monitoring and enables fast, accurate analysis of pedestrians and their behaviour in surveillance video. Mainstream pedestrian re-identification methods mainly match the appearance and color (RGB) features of pedestrians, and can be regarded as RGB-RGB single-modality matching. However, these methods rest on a strong assumption: that the clothing of the same pedestrian remains essentially unchanged across cameras, i.e., short-term pedestrian re-identification. When a pedestrian's clothing changes significantly, or color features become unavailable, the performance of these methods declines dramatically, because the color features now act as interference and the model largely misjudges different pedestrians wearing clothes of the same color as the same person. In recent years, therefore, to overcome the failure of color features under extreme conditions, data of other modalities has been introduced to compensate for RGB data, such as infrared (IR) data; this can be regarded as RGB-IR cross-modal (heterogeneous) pedestrian matching, and the biggest challenge of heterogeneous pedestrian re-identification is how to narrow the modal gap between the modalities.
At present, researchers have proposed a deep zero-padding method to reduce the modal gap between different modalities, but its recognition results are not accurate enough to meet the requirements of practical application.
Disclosure of Invention
The invention aims to overcome the defects and shortcomings of the prior art by providing a heterogeneous pedestrian re-identification method based on asymmetric metric learning. The method addresses the failure of color features under extreme conditions and the low accuracy of existing heterogeneous pedestrian re-identification: by reducing the modal differences between modalities it lets the network ignore modal information as much as possible and focus on identity information, thereby improving pedestrian feature expression and pedestrian matching accuracy.
The purpose of the invention is realized by the following technical scheme: a heterogeneous pedestrian re-identification method based on asymmetric metric learning comprises the following steps:
inputting pedestrian images of two modalities during model training and extracting depth features from each;
applying an asymmetric metric to the depth features of the different modalities, as follows: projecting the depth features of each modality into a shared space using two sparse autoencoders that do not share parameters, while introducing a global constraint and a local constraint on the distances between cross-modal depth features, so that the intra-class distance between features of different modalities is reduced and the inter-class distance is increased; the results of the global and local constraints are back-propagated to the training network as supervision signals to correct its parameters;
and calculating the losses of the global features and the local features from the depth features, and optimizing the training model by minimizing the sum of the global loss, the local loss, and the global and local constraints in the asymmetric metric.
Through the steps, the heterogeneous pedestrian re-identification model can be trained as long as pedestrian re-identification training data of any two modes are given, and the heterogeneous pedestrian re-identification model has the advantages of high precision and high speed in heterogeneous pedestrian matching.
Preferably, the depth features of the images of the different modalities are extracted as follows:
firstly, a ResNet50 classification model pre-trained on the ImageNet dataset is used as the backbone network, which is divided into three branches;
then, from top to bottom, each branch extracts the high-level features of the classification model and horizontally and uniformly partitions them into blocks;
then, through pooling and dimension-reduction operations, each branch obtains several global features and local features of fixed size;
and finally, the global features and local features are concatenated in order to obtain the depth feature of the input image, i.e., the complete feature expression of the pedestrian.
In order to reduce the modal difference among heterogeneous pedestrian data, the invention carries out asymmetric measurement on depth characteristics under different modes, and the steps are as follows:
first, the extracted depth features are divided into two groups F^B = {f_1^B, f_2^B, ..., f_{N_B}^B} and F^R = {f_1^R, f_2^R, ..., f_{N_R}^R}, where B and R denote the RGB modality and the IR modality respectively, and f_i^m denotes the i-th depth feature vector of modality m;
next, two sets of features F are combinedBAnd FRRespectively passing through two sparse self-encoders SAE which do not share parameters, wherein each sparse self-encoder is composed of two fully-connected layers and respectively serves as an encoder E and a decoder D, the encoder E is responsible for projecting different modal characteristics to a shared space, and the decoder D is responsible for remapping the encoded characteristics to a space with the same size as an input characteristic space;
then, a reconstruction loss, denoted l_r, is constructed to constrain the output and input of each SAE to be as consistent as possible:

l_r = ||f^B - D^B(E^B(f^B))||_2 + ||f^R - D^R(E^R(f^R))||_2

where f^B, f^R, E^B, E^R, D^B, D^R denote the features, encoders and decoders of modalities B and R respectively;
and finally, introducing global constraint for constraining the difference between different modal characteristic distributions in a shared space, introducing local constraint for reducing the intra-class distance and increasing the inter-class distance between different modal characteristics, and reversely transmitting the constraint result serving as a supervision signal back to the training model to correct each parameter.
Further, the global constraint is used to constrain the difference between the feature distributions of the different modalities, denoted l_global = W(E^B(f^B), E^R(f^R))^2, where W is the 2-Wasserstein distance: for any two Gaussian distributions X ~ N(m_X, C_X) and Y ~ N(m_Y, C_Y), with m and C denoting the mean and covariance of each distribution,

W(X, Y)^2 = ||m_X - m_Y||_2^2 + Tr(C_X + C_Y - 2 (C_X^{1/2} C_Y C_X^{1/2})^{1/2})
Further, the local constraint is used to reduce the intra-class distance and increase the inter-class distance between features of different modalities, denoted

l_local = max_{p ∈ A(f)} d(f, p) - min_{n ∈ B(f)} d(f, n) + α

where A(f) and B(f) respectively denote the sets of features with the same identity information as, and different identity information from, the feature f, d(·,·) denotes the Euclidean distance between two features, and α is a hyper-parameter controlling the margin between positive and negative sample pairs.
Further, to make the shared space more effective, a sparse loss l_sparse = ||E^B(f^B)||_1 + ||E^R(f^R)||_1 is constructed to constrain the hidden-layer output.
Furthermore, each sparse autoencoder consists of two fully-connected layers with a ReLU activation function.
Preferably, for the extracted depth features, a Triplet Loss function is used to calculate the loss of the global features and a Softmax function is used to calculate the loss of the local features; the goal is to optimize the training model by minimizing the sum of the global loss, the local loss, and the reconstruction loss, sparse loss, global constraint and local constraint in the asymmetric metric, thereby improving the expressive power of the model's features.
Compared with the prior art, the invention has the following advantages and beneficial effects:
1. Aiming at the large modal gap and the resulting difficulty of cross-modal pedestrian matching in existing heterogeneous pedestrian re-identification, the invention proposes an asymmetric metric on depth features of different modalities to reduce the feature distance between the same pedestrian in different modalities; it can be applied to any feature-extraction network to provide supervision information for training, supports end-to-end training, improves the quality of the extracted pedestrian features and accelerates network convergence.
2. The invention trains cooperatively with different loss functions for different kinds of features; compared with training with a single loss function, this purposefully lets the network ignore modal information and focus on pedestrian identity information, yielding a more complete feature expression of the pedestrian and far better accuracy than existing methods.
3. The invention extracts pedestrian features by combining global and local parts; compared with a single kind of feature, this yields a more complete feature expression of the pedestrian and thus higher accuracy.
Drawings
Fig. 1 is a general functional framework diagram of the method of the present embodiment.
Detailed Description
The drawings are for illustrative purposes only and are not to be construed as limiting the patent; it will be understood by those skilled in the art that certain well-known structures in the drawings and descriptions thereof may be omitted. The technical solution of the present invention is further described below with reference to the accompanying drawings and examples.
As shown in fig. 1, the method for re-identifying a heterogeneous pedestrian based on asymmetric metric learning mainly includes three steps of feature extraction, asymmetric metric and classification, and each step is specifically described below.
One, feature extraction
The method for extracting the features of the pedestrian comprises the following steps:
firstly, a ResNet50 classification model pre-trained on the ImageNet dataset is used as the backbone network, which is divided into three branches;
then, from top to bottom, each branch extracts the high-level features of the classification model and horizontally and uniformly partitions them into 1, 2 and 3 blocks respectively;
then, through pooling and dimension-reduction operations, the branches produce 3 global features and 5 local features, each of a fixed size of 256 dimensions;
finally, the 8 256-dimensional features are concatenated in order into a 2048-dimensional feature that serves as the depth feature of the input image, i.e., the complete feature expression of the pedestrian. These features are used in the subsequent asymmetric metric and classification steps.
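The extraction steps above can be sketched with plain NumPy to make the shape bookkeeping concrete. This is a shape-level illustration only, not the patent's implementation: the backbone output is replaced by a random (C, H, W) feature map, the 1x1-convolution dimension reduction by random projection matrices, and all variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def horizontal_stripe_pool(feat_map, n_stripes):
    """Average-pool a (C, H, W) feature map into n_stripes horizontal parts,
    yielding one C-dimensional vector per stripe."""
    c, h, w = feat_map.shape
    bounds = np.linspace(0, h, n_stripes + 1).astype(int)
    return [feat_map[:, bounds[i]:bounds[i + 1], :].mean(axis=(1, 2))
            for i in range(n_stripes)]

def reduce_dim(vec, proj):
    """1x1-conv-style dimension reduction as a linear projection."""
    return proj @ vec

C, H, W, D = 2048, 24, 8, 256              # backbone channels / reduced dimension
feat_map = rng.standard_normal((C, H, W))  # stand-in for the ResNet50 output

parts = []
for n in (1, 2, 3):                      # the three branches: 1, 2, 3 stripes
    glob = feat_map.mean(axis=(1, 2))    # one global feature per branch
    parts.append(reduce_dim(glob, rng.standard_normal((D, C)) * 0.01))
    if n > 1:                            # the 2- and 3-stripe branches also
        for s in horizontal_stripe_pool(feat_map, n):  # give the 5 local features
            parts.append(reduce_dim(s, rng.standard_normal((D, C)) * 0.01))

pedestrian_feature = np.concatenate(parts)  # 3 global + 5 local = 8 x 256 = 2048
```

The count works out as 1 + (1 + 2) + (1 + 3) = 8 part features of 256 dimensions each, matching the 2048-dimensional pedestrian expression described above.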
The invention uses the idea of combining the global and local parts to extract the pedestrian characteristics, and compared with the single characteristics, the invention obtains more complete characteristic expression of the pedestrian, thereby improving the precision.
Two, asymmetric metric
This step projects and reconstructs the features of the different modalities through different projection matrices in order to reduce the modal difference between heterogeneous pedestrian data, as follows:
first, the extracted features f_i^B and f_i^R (B and R denote the RGB modality and the IR modality respectively) are divided by a modality selector into two groups F^B and F^R;
next, the two groups F^B and F^R are respectively passed through two sparse autoencoders (SAE) that do not share parameters; each SAE is composed of two fully connected layers with a ReLU activation function, serving as an encoder E and a decoder D respectively, where the encoder E projects the features of each modality into a shared space and the decoder D remaps the encoded features back to a space of the same size as the input feature space. A reconstruction loss l_r = ||f^B - D^B(E^B(f^B))||_2 + ||f^R - D^R(E^R(f^R))||_2 is then constructed to constrain the output of each SAE to be as consistent as possible with its input, where f^B, f^R, E^B, E^R, D^B, D^R denote the features, encoders and decoders of modalities B and R respectively.
Meanwhile, in order to make the shared space more effective, a sparse loss l_sparse = ||E^B(f^B)||_1 + ||E^R(f^R)||_1 is constructed to constrain the hidden-layer output.
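A minimal NumPy sketch of the two non-parameter-sharing sparse autoencoders and of the losses l_r and l_sparse follows; the random weights, the class name `SparseAE` and the hidden size of 512 are illustrative assumptions, not values taken from the patent.

```python
import numpy as np

rng = np.random.default_rng(1)
relu = lambda x: np.maximum(x, 0.0)

class SparseAE:
    """Two fully connected layers acting as encoder E and decoder D."""
    def __init__(self, d_in, d_shared):
        self.We = rng.standard_normal((d_shared, d_in)) * 0.01
        self.Wd = rng.standard_normal((d_in, d_shared)) * 0.01
    def encode(self, f):       # E: project into the shared space (ReLU layer)
        return relu(self.We @ f)
    def decode(self, h):       # D: map back to the input-sized space
        return self.Wd @ h

sae_B, sae_R = SparseAE(2048, 512), SparseAE(2048, 512)  # no shared parameters

f_B, f_R = rng.standard_normal(2048), rng.standard_normal(2048)
h_B, h_R = sae_B.encode(f_B), sae_R.encode(f_R)

# Reconstruction loss l_r: each SAE's output should stay close to its input.
l_r = (np.linalg.norm(f_B - sae_B.decode(h_B))
       + np.linalg.norm(f_R - sae_R.decode(h_R)))

# Sparse loss l_sparse: L1 penalty on the shared-space (hidden) activations.
l_sparse = np.abs(h_B).sum() + np.abs(h_R).sum()
```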
Finally, in the shared space, a global constraint l_global and a local constraint l_local are introduced to constrain the distances between features of different modalities; the result is back-propagated as a supervision signal to the other modules of the network to correct the parameters of the feature-extraction step, so that feature extraction ignores modal information as much as possible and focuses on pedestrian identity information, thereby improving the expression of the image features.
The global constraint is used to constrain the difference between the feature distributions of the different modalities, denoted l_global = W(E^B(f^B), E^R(f^R))^2, where W is the 2-Wasserstein distance: for any two Gaussian distributions X ~ N(m_X, C_X) and Y ~ N(m_Y, C_Y), with m and C denoting the mean and covariance of each distribution,

W(X, Y)^2 = ||m_X - m_Y||_2^2 + Tr(C_X + C_Y - 2 (C_X^{1/2} C_Y C_X^{1/2})^{1/2})
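Modelling the shared-space features of each modality as a Gaussian, the global constraint can be sketched as below. The closed-form 2-Wasserstein distance between Gaussians is a standard result; the batch size, dimensionality and function names here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

def sqrtm_psd(M):
    """Matrix square root of a symmetric positive semi-definite matrix."""
    w, V = np.linalg.eigh(M)
    return (V * np.sqrt(np.clip(w, 0.0, None))) @ V.T

def wasserstein2_gaussian(X, Y):
    """Squared 2-Wasserstein distance between Gaussians fitted to the rows of
    X and Y: ||m_X - m_Y||^2 + Tr(C_X + C_Y - 2 (C_X^1/2 C_Y C_X^1/2)^1/2)."""
    mX, mY = X.mean(axis=0), Y.mean(axis=0)
    CX, CY = np.cov(X, rowvar=False), np.cov(Y, rowvar=False)
    rX = sqrtm_psd(CX)
    cross = sqrtm_psd(rX @ CY @ rX)
    return float(np.sum((mX - mY) ** 2) + np.trace(CX + CY - 2.0 * cross))

# Toy shared-space embeddings of a batch from each modality.
E_B = rng.standard_normal((64, 8))
E_R = rng.standard_normal((64, 8)) + 0.5  # shifted, so the distance is nonzero

l_global = wasserstein2_gaussian(E_B, E_R)
```

Minimizing l_global pulls the two modality distributions together in the shared space, which is exactly what the constraint is meant to enforce.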
The local constraint is used to reduce the intra-class distance and increase the inter-class distance between features of different modalities, denoted

l_local = max_{p ∈ A(f)} d(f, p) - min_{n ∈ B(f)} d(f, n) + α

where A(f) and B(f) respectively denote the sets of features with the same identity information as, and different identity information from, the feature f, d(·,·) denotes the Euclidean distance between two features, and α is a hyper-parameter controlling the margin between positive and negative sample pairs.
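The local constraint can be sketched directly from its definition: compare the hardest positive (largest same-identity distance) against the hardest negative (smallest different-identity distance). The toy anchor, positive set and negative set below are synthetic stand-ins, and the margin value alpha = 0.3 is an assumption, since the patent leaves α as a hyper-parameter.

```python
import numpy as np

def local_constraint(f, same_id_feats, diff_id_feats, alpha=0.3):
    """l_local = max_p d(f, p) - min_n d(f, n) + alpha: pull the hardest
    positive closer than the hardest negative by a margin alpha."""
    d = lambda a, b: np.linalg.norm(a - b)          # Euclidean distance
    hardest_pos = max(d(f, p) for p in same_id_feats)  # p in A(f)
    hardest_neg = min(d(f, n) for n in diff_id_feats)  # n in B(f)
    return hardest_pos - hardest_neg + alpha

rng = np.random.default_rng(3)
anchor = rng.standard_normal(8)
A = [anchor + 0.05 * rng.standard_normal(8) for _ in range(4)]  # same identity
B = [anchor + 2.0 + rng.standard_normal(8) for _ in range(4)]   # other identities

l_local = local_constraint(anchor, A, B, alpha=0.3)
```

A low (negative) value means every same-identity feature is already closer than every different-identity feature by more than the margin; training drives l_local down.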
According to the invention, two sparse autoencoders that do not share parameters project the features of the different modalities into a shared space, and a global constraint l_global and a local constraint l_local are introduced simultaneously to constrain the distances between features of different modalities, so that the intra-class distance is reduced and the inter-class distance is increased; modal information is thus effectively ignored during feature extraction in favour of identity (ID) information, improving the expression of the image features.
Three, classification
This step applies different losses cooperatively to the different kinds of features so as to constrain the pedestrian features effectively, as follows: for the depth features produced by the feature-extraction step, a Triplet Loss function is used to calculate the loss of the 3 global features and a Softmax function is used to calculate the loss of the 5 local features; the model is then optimized by jointly minimizing the global loss, the local loss, and the sum of the reconstruction loss, sparse loss, global constraint and local constraint from the asymmetric metric module, thereby improving the model's feature expression.
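The joint objective can be sketched as a simple sum of the six scalar terms; the numeric values below are placeholders, and the equal weighting is an assumption, since the patent does not specify loss weights.

```python
# Hypothetical scalar losses from the three modules (values are placeholders).
l_triplet = 0.42   # Triplet Loss on the 3 global features
l_softmax = 1.13   # Softmax (cross-entropy) loss on the 5 local features
l_r, l_sparse, l_global, l_local = 0.08, 0.31, 0.25, 0.10

# The training objective jointly minimizes all six terms; equal weighting is
# assumed here because the patent gives no per-term coefficients.
total_loss = l_triplet + l_softmax + l_r + l_sparse + l_global + l_local
```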
In the invention, different loss functions are adopted for different characteristics in a differentiated mode for collaborative training, and compared with single loss function training, the method purposefully enables the network to ignore modal information and pay more attention to pedestrian identity information as far as possible, so that more complete characteristic expression of pedestrians is obtained, and the method is far better than the existing method in precision.
Experimental results show that on SYSU-MM01, currently the largest cross-modal pedestrian re-identification dataset, the Rank-1 accuracy and mAP of the invention improve from 24.43% and 26.92% to 66.26% and 66.7% respectively, a large performance gain over other methods.
The techniques described herein may be implemented by various means. For example, these techniques may be implemented in hardware, firmware, software, or a combination thereof. For a hardware implementation, the processing modules may be implemented within one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Programmable Logic Devices (PLDs), field-programmable gate arrays (FPGAs), processors, controllers, micro-controllers, electronic devices, other electronic units designed to perform the functions described herein, or a combination thereof.
For a firmware and/or software implementation, the techniques may be implemented with modules (e.g., procedures, steps, flows, and so on) that perform the functions described herein. The firmware and/or software codes may be stored in a memory and executed by a processor. The memory may be implemented within the processor or external to the processor.
Those of ordinary skill in the art will understand that: all or part of the steps for realizing the method embodiments can be completed by hardware related to program instructions, the program can be stored in a computer readable storage medium, and the program executes the steps comprising the method embodiments when executed; and the aforementioned storage medium includes: various media that can store program codes, such as ROM, RAM, magnetic or optical disks.
It should be understood that the above-described embodiments of the present invention are merely examples for clearly illustrating the present invention, and are not intended to limit the embodiments of the present invention. Other variations and modifications will be apparent to persons skilled in the art in light of the above description. And are neither required nor exhaustive of all embodiments. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the claims of the present invention.

Claims (5)

1. A heterogeneous pedestrian re-identification method based on asymmetric metric learning is characterized by comprising the following steps:
inputting pedestrian images in two modes in the process of training a model, and respectively extracting depth features;
carrying out asymmetric measurement on depth features under different modes, and the method comprises the following steps: respectively projecting different modal depth features to a shared space by using two sparse self-encoders which do not share parameters, and simultaneously introducing global constraint and local constraint to constrain the distance between the different modal depth features, so that the intra-class distance between the different modal features is reduced and the inter-class distance between the different modal features is increased; the constraint results of the global constraint and the local constraint are used as supervision signals to be reversely transmitted to a training network for correcting each parameter;
calculating the losses of the global features and the local features according to the depth features, and optimizing the training model by minimizing the sum of the global loss, the local loss, and the global constraint and local constraint in the asymmetric metric;
carrying out asymmetric measurement on depth features under different modes, and the method comprises the following steps:
first, the extracted depth features are divided into two groups F^B = {f_1^B, f_2^B, ..., f_{N_B}^B} and F^R = {f_1^R, f_2^R, ..., f_{N_R}^R}, where B and R denote the RGB modality and the IR modality respectively and f_i^m denotes the i-th depth feature vector of modality m;
next, the two feature groups F^B and F^R are respectively passed through two sparse self-encoders that do not share parameters, where each sparse self-encoder is composed of two fully connected layers serving as an encoder E and a decoder D respectively, the encoder E being responsible for projecting the features of the different modalities into a shared space and the decoder D being responsible for remapping the encoded features back to a space of the same size as the input feature space;
then, a reconstruction loss, denoted l_r, is constructed to constrain the output and input of each SAE to be as consistent as possible:

l_r = ||f^B - D^B(E^B(f^B))||_2 + ||f^R - D^R(E^R(f^R))||_2

where f^B, E^B, D^B denote the features, encoder and decoder of modality B, and f^R, E^R, D^R denote the features, encoder and decoder of modality R;
and finally, in the shared space, a global constraint is introduced to constrain the difference between the feature distributions of the different modalities and a local constraint is introduced to reduce the intra-class distance and increase the inter-class distance between features of different modalities, and the constraint results are back-propagated to the training model as supervision signals to correct each parameter;
the global constraint is used to constrain the difference between the feature distributions of the different modalities, denoted l_global = W(E^B(f^B), E^R(f^R))^2, where W is the 2-Wasserstein distance: for any two Gaussian distributions X ~ N(m_X, C_X) and Y ~ N(m_Y, C_Y), with m and C denoting the mean and covariance of each distribution,

W(X, Y)^2 = ||m_X - m_Y||_2^2 + Tr(C_X + C_Y - 2 (C_X^{1/2} C_Y C_X^{1/2})^{1/2})
the local constraint is used to reduce the intra-class distance and increase the inter-class distance between features of different modalities, denoted

l_local = max_{p ∈ A(f)} d(f, p) - min_{n ∈ B(f)} d(f, n) + α

where A(f) and B(f) respectively denote the sets of features with the same identity information as, and different identity information from, the feature f, d(·,·) denotes the Euclidean distance between two features, and α is a hyper-parameter controlling the margin between positive and negative sample pairs.
2. The asymmetric metric learning-based heterogeneous pedestrian re-identification method according to claim 1, wherein depth features of images in different modalities are extracted, and the method comprises the following steps:
firstly, a ResNet50 classification model pre-trained on an ImageNet data set is used as a main network, and the main network is divided into three branches;
then, from top to bottom, each branch extracts the high-level features of the classification model and horizontally and uniformly divides the high-level features into blocks;
then, obtaining a plurality of global features and local features with fixed sizes by each branch through pooling and dimension reduction operations;
and finally, splicing the global features and the local features together in sequence to obtain the depth features of the input image, namely the complete feature expression of the pedestrian.
3. The asymmetric metric learning-based heterogeneous pedestrian re-identification method according to claim 1, wherein a sparse loss l_sparse = ||E^B(f^B)||_1 + ||E^R(f^R)||_1 is constructed to constrain the hidden-layer output.
4. The asymmetric metric learning-based heterogeneous pedestrian re-identification method according to claim 1, wherein each sparse self-encoder is composed of two fully-connected layers with a ReLU activation function.
5. The asymmetric metric learning-based heterogeneous pedestrian re-identification method as claimed in claim 1, wherein for the extracted depth features, a Triplet Loss function is used to calculate the Loss of global features, a Softmax function is used to calculate the Loss of local features, and the aim of optimizing the training model is to minimize the global Loss, the local Loss and the sum of reconstruction Loss, sparse Loss, global constraint and local constraint in the asymmetric metric.
CN201811515924.XA 2018-12-12 2018-12-12 Heterogeneous pedestrian re-identification method based on asymmetric metric learning Active CN109635728B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811515924.XA CN109635728B (en) 2018-12-12 2018-12-12 Heterogeneous pedestrian re-identification method based on asymmetric metric learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811515924.XA CN109635728B (en) 2018-12-12 2018-12-12 Heterogeneous pedestrian re-identification method based on asymmetric metric learning

Publications (2)

Publication Number Publication Date
CN109635728A CN109635728A (en) 2019-04-16
CN109635728B true CN109635728B (en) 2020-10-13

Family

ID=66073049

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811515924.XA Active CN109635728B (en) 2018-12-12 2018-12-12 Heterogeneous pedestrian re-identification method based on asymmetric metric learning

Country Status (1)

Country Link
CN (1) CN109635728B (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111079939B (en) * 2019-11-28 2021-04-20 支付宝(杭州)信息技术有限公司 Machine learning model feature screening method and device based on data privacy protection
CN112016401B (en) * 2020-08-04 2024-05-17 杰创智能科技股份有限公司 Cross-mode pedestrian re-identification method and device
CN112001438B (en) * 2020-08-19 2023-01-10 四川大学 Multi-mode data clustering method for automatically selecting clustering number
CN112241682B (en) * 2020-09-14 2022-05-10 同济大学 End-to-end pedestrian searching method based on blocking and multi-layer information fusion
CN112016523B (en) * 2020-09-25 2023-08-29 北京百度网讯科技有限公司 Cross-modal face recognition method, device, equipment and storage medium
CN112200093B (en) * 2020-10-13 2022-08-30 北京邮电大学 Pedestrian re-identification method based on uncertainty estimation
CN112699829B (en) * 2021-01-05 2022-08-30 山东交通学院 Vehicle re-identification method and system based on deep features and sparse metric projection
CN113128441B (en) * 2021-04-28 2022-10-14 安徽大学 Vehicle re-identification system and method based on attribute- and state-guided structure embedding
CN114550208A (en) * 2022-02-10 2022-05-27 南通大学 Cross-modal pedestrian re-identification method based on joint global-level and local-level constraints

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106778921A (en) * 2017-02-15 2017-05-31 张烜 Pedestrian re-identification method based on a deep learning encoding model
CN108345860A (en) * 2018-02-24 2018-07-31 江苏测联空间大数据应用研究中心有限公司 Pedestrian re-identification method based on deep learning and distance metric learning

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108960127B (en) * 2018-06-29 2021-11-05 厦门大学 Occluded pedestrian re-identification method based on adaptive deep metric learning


Also Published As

Publication number Publication date
CN109635728A (en) 2019-04-16

Similar Documents

Publication Publication Date Title
CN109635728B (en) Heterogeneous pedestrian re-identification method based on asymmetric metric learning
CN107273872B (en) Deep discriminative network model method for pedestrian re-identification in images or videos
CN109508663B (en) Pedestrian re-identification method based on multi-level supervision network
CN109359552B (en) Efficient cross-camera bidirectional pedestrian tracking method
Huang et al. Highly accurate moving object detection in variable bit rate video-based traffic monitoring systems
Yang et al. Bi-directional progressive guidance network for RGB-D salient object detection
Wu et al. A one-stage domain adaptation network with image alignment for unsupervised nighttime semantic segmentation
US20240029272A1 (en) Matting network training method and matting method
CN109919073B (en) Illumination-robust pedestrian re-identification method
WO2023109709A1 (en) Image stitching positioning detection method based on attention mechanism
Niloy et al. CFL-Net: image forgery localization using contrastive learning
CN111160295A (en) Video pedestrian re-identification method based on region guidance and spatio-temporal attention
CN112365586B (en) Binocular 3D face modeling and stereoscopic judgment method for embedded platforms
CN110472634A (en) Change detection method based on multi-scale deep feature difference fusion network
CN111008608B (en) Night vehicle detection method based on deep learning
Liu et al. Hierarchical multimodal fusion for ground-based cloud classification in weather station networks
CN116342953A (en) Dual-modality object detection model and method based on residual shrinkage attention network
CN117036770A (en) Detection model training and target detection method and system based on cascade attention
CN113627504B (en) Multi-modal multi-scale feature fusion object detection method based on generative adversarial networks
Peng et al. Point-based multilevel domain adaptation for point cloud segmentation
Kong et al. Hole-robust wireframe detection
CN111191549A (en) Two-stage face anti-spoofing detection method
Wang et al. A deep learning-based method for vehicle license plate recognition in natural scene
CN114004299A (en) Parent-child relationship verification algorithm based on deep learning
CN113674230A (en) Method and device for detecting keypoints of indoor backlit faces

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
OL01 Intention to license declared