CN111709364A - Pedestrian re-identification method based on view information and batch feature erasing

Pedestrian re-identification method based on view information and batch feature erasing

Info

Publication number
CN111709364A
Authority
CN
China
Prior art keywords
pedestrian
batch
features
visual angle
identification
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202010549985.9A
Other languages
Chinese (zh)
Inventor
张红
李建华
徐志刚
曹洁
任伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Lanzhou University of Technology
Original Assignee
Lanzhou University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Lanzhou University of Technology filed Critical Lanzhou University of Technology
Priority to CN202010549985.9A priority Critical patent/CN111709364A/en
Publication of CN111709364A publication Critical patent/CN111709364A/en
Withdrawn legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/103 Static body considered as a whole, e.g. static pedestrian or occupant recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/50 Context or environment of the image
    • G06V 20/52 Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • G06V 20/53 Recognition of crowd images, e.g. recognition of crowd congestion

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Multimedia (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a pedestrian re-identification method based on view information and batch feature erasing, belonging to the technical field of computer vision and pattern recognition. The method realizes pedestrian re-identification mainly through the construction of a PSE network model, the training of the PSE model, and the construction of a BFE network model. By combining view information with batch feature erasing, the invention achieves good generalization capability and robustness on three datasets. Using three attention mechanisms together yields a better recognition effect: Rank-1 precision improves by 0.5% over view feature attention alone and by 0.2% over CBAM attention alone, and mAP improves by 1.3% over view feature attention alone and by 1.2% over CBAM attention alone, so the method works well in practice.

Description

Pedestrian re-identification method based on view information and batch feature erasing
Technical Field
The invention belongs to the technical field of computer vision and pattern recognition, and particularly relates to a pedestrian re-identification method based on view information and batch feature erasing.
Background
Pedestrian re-identification means matching pedestrian images captured by non-overlapping cameras. In practical applications the number of surveillance cameras is large, and because pedestrian targets are affected by hardware and environmental differences between cameras, images of the same pedestrian suffer from background changes, illumination changes, posture changes and occlusion. How to select image features with strong discriminative power against these influences, and how to establish a suitable model that makes pedestrian re-identification more efficient and robust, are the problems this invention addresses. Current pedestrian re-identification research usually optimizes a model for only one type of feature, either global features or local features. The invention therefore proposes a method that fully exploits both the global features and the fine-grained features of a pedestrian image: the view features of the pedestrian image are extracted as global features, its fine-grained features are extracted by a batch feature erasing method, and joint learning of the two features lets the model extract more discriminative image features. When establishing the model, an attention mechanism is adopted to simplify the complex structure, forming a mechanism in which several attentions act together, which preserves the accuracy of pedestrian re-identification while reducing the model's parameter count and improving its generalization.
Disclosure of Invention
The invention aims to provide a pedestrian re-identification method based on view information and batch feature erasing, which fully utilizes both the global features and the fine-grained features of a pedestrian image, extracting the view features of the pedestrian image as its global features.
The pedestrian re-identification method based on view information and batch feature erasing comprises the following steps:
1) construction of the PSE network model: with ResNet50 as the basic structure, Block1, Block2, Block3 and Block4 each correspond to the matching block structure of ResNet50; a view classifier branch is added after Block1; after a series of convolution operations on the pedestrian image, softmax yields probability values for the front, back and side orientations of the image, and these values predict the view direction of the image;
2) PSE model training, realized mainly as follows:
① load ImageNet pre-trained weights into the ResNet50-related structure as initialization;
② train the view classifier using the RAP dataset, which contains orientation labels;
③ migrate the trained view classifier to the PSE network, fix the parameters of the view classifier and of Block1, Block2 and Block3, and train the view units with a pedestrian re-identification dataset to initialize their parameters;
④ extract the 14 whole-body joint points of the pedestrian from all pedestrian images with the DeeperCut model;
⑤ take the extracted 14 joint points as input, giving a 17-channel input (the 14 joint maps concatenated with the 3 RGB channels); fix all Block structures and fine-tune the first layer and the final classification layer so that the network adapts to the 17-channel input;
⑥ fine-tune the view classifier with the joint point information extracted from the RAP dataset;
⑦ train the network model with a pedestrian re-identification dataset;
3) construction of the BFE network model: with ResNet50 as the basic structure, two branches are extracted, a Global branch and a Feature Erasing branch; a Bottleneck structure is added in the Feature Erasing branch, and a Mask structure is introduced which randomly erases the same region of all features in a batch; maximum pooling and dimension reduction of the remaining features yield the fine-grained features of the image; finally the feature vectors extracted by the two branches are fused as the final feature vector of the input pedestrian picture.
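For illustration, the following is a minimal PyTorch sketch of the Mask step in step 3), assuming a feature tensor of shape (batch, channels, height, width); the function name and default ratios are illustrative, not taken from the original text.

```python
import torch

def batch_feature_erase(x: torch.Tensor, rh: float = 0.5, rw: float = 1.0) -> torch.Tensor:
    """Erase one randomly located block, shared across the whole batch,
    from a feature tensor x of shape (N, C, H, W). rh and rw are the
    height and width ratios of the erased block."""
    n, c, h, w = x.shape
    eh, ew = max(1, round(h * rh)), max(1, round(w * rw))
    top = torch.randint(0, h - eh + 1, (1,)).item()
    left = torch.randint(0, w - ew + 1, (1,)).item()
    mask = x.new_ones(n, c, h, w)
    # zero the same region of every feature map in the batch, so that
    # roughly the same semantic part is erased when inputs are aligned
    mask[:, :, top:top + eh, left:left + ew] = 0
    return x * mask
```

After this erasing step, maximum pooling and dimension reduction of the remaining activations produce the fine-grained feature vector described above.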
Further, in step 1), when the view direction of the image is "front", the probability value of "front" obtained by softmax is large and the probability values of "back" and "side" are small.
Further, when the batch features of BFE are erased in step 3), if all input images for pedestrian re-identification are roughly aligned, the method erases the same semantic region from all features of a batch.
Further, in step 3), a view unit structure is added after Block3, with different view units learning different orientation information of the pedestrian image; a batch feature erasing branch is also added after the Block3 structure, which continues to extract depth features through a Block4 structure and a Bottleneck structure, then applies the batch feature erasing method followed by pooling and dimension reduction, finally obtaining the fine-grained features of the image.
Further, each view unit is composed of a 1 × 1 convolution layer, a convolutional block attention module (CBAM), a 1 × 1 convolution layer, a Batch Normalization layer and a ReLU layer; the CBAM module applies attention to the input feature map along two independent dimensions, channel and space.
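For illustration, a minimal PyTorch sketch of one view unit with a simplified CBAM follows; the channel sizes, reduction ratio and 7 × 7 spatial kernel are illustrative assumptions, and a recent PyTorch (with Tensor.amax) is assumed.

```python
import torch
import torch.nn as nn

class CBAM(nn.Module):
    """Convolutional Block Attention Module: channel attention followed
    by spatial attention, each applied multiplicatively to the input."""
    def __init__(self, channels: int, reduction: int = 16, spatial_kernel: int = 7):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
        )
        self.spatial = nn.Conv2d(2, 1, spatial_kernel, padding=spatial_kernel // 2)

    def forward(self, x):
        # channel attention from global average- and max-pooled descriptors
        avg = self.mlp(x.mean((2, 3), keepdim=True))
        mx = self.mlp(x.amax(2, keepdim=True).amax(3, keepdim=True))
        x = x * torch.sigmoid(avg + mx)
        # spatial attention from channel-wise average and max maps
        s = torch.cat([x.mean(1, keepdim=True), x.amax(1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial(s))

class ViewUnit(nn.Module):
    """One view unit: 1x1 conv -> CBAM -> 1x1 conv -> BatchNorm -> ReLU."""
    def __init__(self, in_channels: int = 1024, mid_channels: int = 512,
                 out_channels: int = 256):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_channels, mid_channels, 1),
            CBAM(mid_channels),
            nn.Conv2d(mid_channels, out_channels, 1),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.block(x)
```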
Further, the training process of the overall network model comprises the following steps:
1) load the weights of the ResNet50 network part trained on ImageNet to initialize the backbone network;
2) train the view classifier using the RAP dataset, with the learning rate set to 0.0001;
3) fix the parameters of the view classifier and the related ResNet50 structures and, using a pedestrian re-identification dataset, fine-tune only the view unit part and the final pedestrian identity label classification layer;
4) the entire network is trained using a pedestrian re-identification dataset.
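The staged freezing in steps 1) to 4) can be sketched as follows in PyTorch; the attribute names backbone, view_classifier, view_units and id_classifier are hypothetical placeholders for the structures described above, and the learning rate in the sketch is illustrative.

```python
import torch.nn as nn
import torch.optim as optim

def set_trainable(module: nn.Module, trainable: bool) -> None:
    """Freeze or unfreeze every parameter of a sub-network."""
    for p in module.parameters():
        p.requires_grad = trainable

def stage3_optimizer(model: nn.Module) -> optim.Adam:
    """Step 3) above: fix the view classifier and the ResNet50-related
    structures, fine-tune only the view units and the final identity
    classification layer. Attribute names are hypothetical placeholders."""
    set_trainable(model.backbone, False)         # ResNet50-related structures
    set_trainable(model.view_classifier, False)
    set_trainable(model.view_units, True)
    set_trainable(model.id_classifier, True)
    trainable = [p for p in model.parameters() if p.requires_grad]
    return optim.Adam(trainable, lr=1e-4)        # illustrative learning rate
```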
Compared with the prior art, the invention has the following beneficial effects:
1) When all input images for pedestrian re-identification are roughly aligned, the BFE batch feature erasing method erases the same semantic region from all features of a batch, so the model learns the remaining fine-grained features well.
2) The invention introduces the CBAM module, applying channel and spatial attention to the features of each view unit; the three attention mechanisms act together, greatly enriching the extraction of high-level features and letting the three different view units learn features as distinct as possible. The improved view unit thus simplifies the network structure while preserving the accuracy of pedestrian re-identification.
3) By combining view information with batch feature erasing, the invention achieves good generalization capability and robustness on three datasets.
4) Using three attention mechanisms yields a better recognition effect: Rank-1 precision improves by 0.5% over view feature attention alone and by 0.2% over CBAM attention alone; mAP improves by 1.3% over view feature attention alone and by 1.2% over CBAM attention alone.
Detailed Description
The following examples describe specific embodiments of the present invention in further detail. The examples are intended to illustrate the invention, not to limit its scope.
Example 1
The pedestrian re-identification method based on view information and batch feature erasing comprises the following steps:
1) construction of the PSE network model: with ResNet50 as the basic structure, Block1, Block2, Block3 and Block4 each correspond to the matching block structure of ResNet50; a view classifier branch is added after Block1; after a series of convolution operations on the pedestrian image, softmax yields probability values for the front, back and side orientations of the image, and these values predict the view direction of the image;
2) PSE model training, realized mainly as follows:
① load ImageNet pre-trained weights into the ResNet50-related structure as initialization;
② train the view classifier using the RAP dataset, which contains orientation labels;
③ migrate the trained view classifier to the PSE network, fix the parameters of the view classifier and of Block1, Block2 and Block3, and train the view units with a pedestrian re-identification dataset to initialize their parameters;
④ extract the 14 whole-body joint points of the pedestrian from all pedestrian images with the DeeperCut model;
⑤ take the extracted 14 joint points as input, giving a 17-channel input (the 14 joint maps concatenated with the 3 RGB channels); fix all Block structures and fine-tune the first layer and the final classification layer so that the network adapts to the 17-channel input;
⑥ fine-tune the view classifier with the joint point information extracted from the RAP dataset;
⑦ train the network model with a pedestrian re-identification dataset;
3) construction of the BFE network model: with ResNet50 as the basic structure, two branches are extracted, a Global branch and a Feature Erasing branch; a Bottleneck structure is added in the Feature Erasing branch, and a Mask structure is introduced which randomly erases the same region of all features in a batch; maximum pooling and dimension reduction of the remaining features yield the fine-grained features of the image; finally the feature vectors extracted by the two branches are fused as the final feature vector of the input pedestrian picture.
In step 1), when the view direction of the image is "front", the probability value of "front" obtained by softmax is large and the probability values of "back" and "side" are small.
When the batch features of BFE are erased in step 3), if all input images for pedestrian re-identification are roughly aligned, the method erases the same semantic region from all features of a batch. A view unit structure is added after Block3, with different view units learning different orientation information of the pedestrian image; a batch feature erasing branch is also added after the Block3 structure, which continues to extract depth features through a Block4 structure and a Bottleneck structure, then applies the batch feature erasing method followed by pooling and dimension reduction, finally obtaining the fine-grained features of the image. Each view unit is composed of a 1 × 1 convolution layer, a convolutional block attention module (CBAM), a 1 × 1 convolution layer, a Batch Normalization layer and a ReLU layer; the CBAM module applies attention to the input feature map along two independent dimensions, channel and space.
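To make the combined structure concrete, the following PyTorch sketch weights the three view-unit embeddings by the view classifier's softmax probabilities and fuses the result with the fine-grained vector from the batch feature erasing branch. It reuses the ViewUnit sketch given earlier, and all dimensions and names are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ViewWeightedFusion(nn.Module):
    """Combined head: three view units (front / back / side) whose pooled
    embeddings are weighted by the view classifier's softmax probabilities,
    then concatenated with the fine-grained vector of the BFE branch."""

    def __init__(self, in_channels: int = 1024, unit_dim: int = 256):
        super().__init__()
        self.units = nn.ModuleList(
            [ViewUnit(in_channels, out_channels=unit_dim) for _ in range(3)]
        )
        self.pool = nn.AdaptiveAvgPool2d(1)

    def forward(self, block3_feat, view_probs, bfe_feat):
        # view_probs: (N, 3) softmax output of the view classifier
        unit_vecs = [self.pool(u(block3_feat)).flatten(1) for u in self.units]
        # probability-weighted sum of the three view-unit embeddings
        global_feat = sum(p.unsqueeze(1) * v
                          for p, v in zip(view_probs.unbind(1), unit_vecs))
        # fuse the global (view) feature with the fine-grained BFE feature
        return torch.cat([global_feat, bfe_feat], dim=1)
```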
The training process of the overall network model comprises the following steps:
1) load the weights of the ResNet50 network part trained on ImageNet to initialize the backbone network;
2) train the view classifier using the RAP dataset, with the learning rate set to 0.0001;
3) fix the parameters of the view classifier and the related ResNet50 structures and, using a pedestrian re-identification dataset, fine-tune only the view unit part and the final pedestrian identity label classification layer;
4) the entire network is trained using a pedestrian re-identification dataset.
During training, all input images are resized to 384 × 128, and data augmentation uses normalization and random horizontal flipping. Because horizontal flipping is used, the left and right orientations are uniformly classed as "side" in the experiments of the invention. In the batch feature erasing branch, the height and width ratios of the erased region are set to r_h = 0.5 and r_w = 1.0. Throughout network training the batch size is set to 128, and 700 epochs are trained with the Adam optimizer. The learning rate is adjusted as the epoch increases: it is set to 1e-4 × (epoch/5 + 1) while the epoch is below 50, to 1e-3 from epoch 50 to 200, to 1e-4 from epoch 200 to 300, and to 1e-5 after epoch 300 (a sketch of this schedule appears after the loss definition below). The loss function is that used in BFE model training, namely the sum of the Soft margin batch-hard triplet loss and the softmax loss, where the Soft margin batch-hard triplet loss is defined as follows:
$$
\mathcal{L}_{\mathrm{BH}}=\sum_{i=1}^{P}\sum_{a=1}^{K}\left[\,m+\max_{p=1,\dots,K}\left\|f_{\theta}(x_{a}^{i})-f_{\theta}(x_{p}^{i})\right\|_{2}-\min_{\substack{j=1,\dots,P\\ n=1,\dots,K\\ j\neq i}}\left\|f_{\theta}(x_{a}^{i})-f_{\theta}(x_{n}^{j})\right\|_{2}\right]_{+}
$$

$$
\mathcal{L}_{\mathrm{soft}}=\sum_{i=1}^{P}\sum_{a=1}^{K}\log\left(1+\exp\left(\max_{p=1,\dots,K}\left\|f_{\theta}(x_{a}^{i})-f_{\theta}(x_{p}^{i})\right\|_{2}-\min_{\substack{j=1,\dots,P\\ n=1,\dots,K\\ j\neq i}}\left\|f_{\theta}(x_{a}^{i})-f_{\theta}(x_{n}^{j})\right\|_{2}\right)\right)
$$
in the formula, P is the number of different pedestrians in one batch, and K is the number of pictures of each pedestrian. For anchor samples, positive samples, and negative samples, these three samples constitute a triplet. Here, the image features with the same pedestrian identity as the anchor sample but the farthest similarity are selected as the positive samples: and selecting the image features which have different pedestrian identities from the anchor samples and have the closest similarity as the negative samples. The feature vectors thus learned are expressed, and the euclidean distances between samples are calculated.
During testing, pedestrian images are likewise resized to 384 × 128 and normalized. Notably, all images of the query set Xquery and the candidate set Xgallery are also horizontally flipped, and the feature vector learned from the flipped image is averaged with that of the original image to give the final feature vector of the pedestrian image. The Euclidean distance between the feature vector f(q) of each image q in the query set and the feature vector f(g) of each image g in the candidate set is then computed and sorted. Based on the similarity given by the Euclidean distance, a match is counted as correct if the pedestrian IDs are the same and the camera IDs are different.
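For illustration, a PyTorch sketch of this test protocol follows; `model` is assumed to map a batch of (N, 3, 384, 128) images to (N, D) feature vectors, an assumption not spelled out in the original text.

```python
import torch

@torch.no_grad()
def extract_feature(model, images: torch.Tensor) -> torch.Tensor:
    """Average the embeddings of each image and its horizontal flip."""
    f = model(images)
    f_flip = model(torch.flip(images, dims=[3]))   # flip along the width axis
    return (f + f_flip) / 2

def rank_gallery(f_query: torch.Tensor, f_gallery: torch.Tensor) -> torch.Tensor:
    """Return gallery indices sorted by Euclidean distance to each query:
    f_query (Q, D), f_gallery (G, D) -> (Q, G) ranking, nearest first."""
    d = torch.cdist(f_query, f_gallery)
    return torch.argsort(d, dim=1)
```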
1. Experimental results and analysis
The experiments use the deep learning framework PyTorch 1.0.1 with Python 3.7, carried out on three public pedestrian re-identification datasets: Market1501, DukeMTMC-reID and CUHK03. Compared with other advanced methods, the method achieves good recognition precision on the Rank-1 and mAP metrics, showing that it has good pedestrian re-identification performance.
1.1 Comparison of the recognition performance of the method of the invention with other ReID methods
To demonstrate that the method has advanced pedestrian re-identification performance, nine recent pedestrian re-identification methods are selected for comparison experiments; the results are shown in Table 1.1.
[Table 1.1 Rank-1 and mAP precision of the proposed method and nine recent ReID methods (including HA-CNN, PCB, DaRe, BFE and MGN) on Market1501, DukeMTMC-reID, CUHK03-label and CUHK03-detect; individual values are quoted in the analysis below.]
As Table 1.1 shows, the Rank-1 and mAP precision of the method on the three datasets are clearly improved. On Market1501, Rank-1 is 3.6% higher and mAP 11.1% higher than HA-CNN, and 2.4% and 9.5% higher than PCB. On DukeMTMC-reID, Rank-1 is 6.9% higher and mAP 11.9% higher than PCB, and 8.3% and 13.4% higher than HA-CNN. On CUHK03-label, Rank-1 is 14.7% higher and mAP 16.1% higher than DaRe, and 1.2% and 1.3% higher than BFE. On CUHK03-detect, Rank-1 is 1.8% higher and mAP 2.4% higher than BFE, and 14.0% and 15.7% higher than DaRe. Compared with MGN, the method's Rank-1 and mAP on Market1501 and DukeMTMC-reID fall slightly short of optimal, but on CUHK03-label its Rank-1 and mAP exceed MGN by 12.8% and 10.3% respectively, and on CUHK03-detect by 10.5% and 8.7%. Because CUHK03 has fewer training samples, which makes network training harder and overfitting likely, this indicates the method's better recognition performance and generalization capability. Moreover, the MGN model has 8 feature extraction branches and 11 loss functions, so its network structure is very complex. Compared with the BFE model, the Rank-1 precision on Market1501 and DukeMTMC-reID is only slightly improved (+0.4%, +0.1%), but the mAP precision is clearly improved (+1.8%, +1.4%). The experiments show that the method combining view information with batch feature erasing has good generalization capability and robustness on the three datasets.
1.2 Comparison of recognition performance using different view unit modules
Since the view branch of the PSE network is composed of three Block4 structures, each with more than 20 layers, the structure is complex and the parameter count too large. Simply combining the view branch of the PSE network with the batch erasing branch of the BFE network would exhaust computational memory. The invention therefore proposes a simple and effective view unit structure. To verify the performance of this structure, comparative experiments were performed on the view information branch; the results are shown in Table 1.2:
[Table 1.2 Recognition performance of different view unit modules on the view information branch.]
1.3 Comparison of recognition performance with different attention mechanisms
Weighting the view units by the scores predicted by the view classifier is essentially an attention mechanism over feature attributes. On this basis, introducing a CBAM module gives each view unit channel attention and spatial attention, forming three attention mechanisms in total. To verify the effectiveness of this three-attention mechanism, experiments were carried out; the results are shown in Table 1.3.
The experiment again uses the view information branch of the proposed model as the basic structure, with Market1501 as the dataset. "View attention only" introduces just the view classifier to predict image orientation for feature attention, without adding a CBAM module to the view unit structure; "CBAM attention only" uses a single view unit structure with a CBAM module and no view classifier; "View and CBAM attention" is the complete improved view information branch, with both the CBAM attention module and the feature attention of the view information. The results show that CBAM attention alone has a slight advantage over view feature attention alone: Rank-1 is 0.3% higher and mAP 0.1% higher. Using the three attention mechanisms achieves the best recognition effect: Rank-1 improves by 0.5% over view feature attention alone and by 0.2% over CBAM attention alone, and mAP improves by 1.3% and 1.2% respectively. The three-attention method therefore improves model performance the most, verifying its effectiveness.
[Table 1.3 Recognition performance on Market1501 with view feature attention only, CBAM attention only, and view and CBAM attention combined.]
1.4 Comparison of recognition performance with multi-feature fusion
The invention combines view information with the batch feature erasing method: the view information serves as the global feature, and the batch feature erasing method learns fine-grained features. To verify that fusing the two features gives the network model better discriminative performance, experiments were carried out; the results are shown in Table 1.4.
[Table 1.4 Recognition performance on Market1501 of the view information branch, the feature erasing branch, and the whole network (All).]
The feature fusion comparison experiment uses the Market1501 dataset. "View information branch" is the view information branch of the proposed model, which learns the global features of the pedestrian image; "Feature erasing branch" learns the fine-grained features of the pedestrian image; "All" is the whole network structure, i.e., the learned global and fine-grained features fused. As Table 1.4 shows, on Market1501 the feature erasing branch is slightly superior, with Rank-1 0.9% higher and mAP 1.7% higher than the view information branch. When the two branches act together, however, Rank-1 and mAP are optimal, reaching 94.8% and 86.8% respectively: 1.9% and 4.4% higher than the view information branch alone, and 1.0% and 2.7% higher than the feature erasing branch alone.
The analysis shows that combining the global features learned by the view information branch with the fine-grained features learned by the feature erasing branch gives stronger characterization and higher recognition precision than either the view features or the fine-grained features alone, proving the effectiveness of fusing the two features.
1.5 Comparison of recognition performance of the batch feature erasing module on other models
The invention improves the view information structure of the PSE model and adds a BFE module branch to it, thereby extracting more discriminative pedestrian features. Experiments show that the BFE module effectively improves the model's re-identification accuracy and generalizes well across several datasets. To verify whether the BFE module can also improve recognition precision on other network models, two further network models were tested; the results are shown in Table 1.5.
[Table 1.5 Recognition performance on Market1501 of IDE, IDE+BFE, PCB and PCB+BFE.]
The experiment uses the Market1501 dataset, with IDE and PCB as the network models. IDE is a common pedestrian re-identification baseline that slightly modifies ResNet50 and mainly extracts the global features of the pedestrian image. The PCB network uniformly divides the pedestrian features into 6 blocks, trains the model with a separate loss for each block, and mainly mines the local features of the pedestrian image. In the IDE experiment the loss function is the triplet loss with margin set to 1.0, and the learning rate and optimizer are the same as in the experiments above. IDE+BFE adds a BFE branch to the IDE model; the results show it improves Rank-1 and mAP by 2.1% and 3.0% respectively over the IDE model. In the PCB experiment the learning rate starts at 0.1 and is reduced to 0.01 after 40 epochs; PCB reaches a Rank-1 precision of 92.2% and an mAP of 77.8%. After the BFE module is added to the PCB model, the precision instead drops: Rank-1 is only 88.3% and mAP 70.4%, far below PCB's own results. The PCB model extracts local features of the pedestrian image, and adding the BFE module to extract fine-grained features does not improve recognition precision; the fine-grained features become interfering features. The IDE model mainly extracts global features, and there the fine-grained features improve the model's recognition precision under the supervision and complement of the global features.
From the above analysis, in the comparison of the batch feature erasing module on the two network models, recognition precision improves on the IDE model, which extracts global features, but no good experimental effect is obtained on the PCB model, which extracts local features. Hence the BFE module improves a model's recognition precision under the supervision and complement of global features.
1.6 Comparison of recognition performance with different loss functions
The training process adopts a loss function combining the Soft margin batch-hard triplet loss and the softmax loss. This strategy of combining two loss functions improves the model's recognition capability and effectively improves the accuracy of pedestrian re-identification. To verify the superior performance of the Soft margin batch-hard triplet loss function, a comparison experiment against the baseline triplet loss was performed, comparing the combination of triplet loss and softmax loss with the loss function of the invention; the results are shown in Table 1.6.
[Table 1.6 Recognition performance of triplet loss + softmax loss versus Soft margin batch-hard triplet loss + softmax loss.]
In the comparison with the baseline triplet loss, the Soft margin batch-hard triplet loss function achieves the best recognition precision. This loss function avoids setting a margin threshold, and using it makes model training fast and effective.
Finally, it should be noted that: although the present invention has been described in detail with reference to the foregoing embodiments, it will be apparent to those skilled in the art that modifications may be made to the embodiments or portions thereof without departing from the spirit and scope of the invention.

Claims (6)

1. A pedestrian re-identification method based on view information and batch feature erasing, characterized by comprising the following steps:
1) construction of the PSE network model: with ResNet50 as the basic structure, Block1, Block2, Block3 and Block4 each correspond to the matching block structure of ResNet50; a view classifier branch is added after Block1; after a series of convolution operations on the pedestrian image, softmax yields probability values for the front, back and side orientations of the image, and these values predict the view direction of the image;
2) PSE model training, realized mainly as follows:
① load ImageNet pre-trained weights into the ResNet50-related structure as initialization;
② train the view classifier using the RAP dataset, which contains orientation labels;
③ migrate the trained view classifier to the PSE network, fix the parameters of the view classifier and of Block1, Block2 and Block3, and train the view units with a pedestrian re-identification dataset to initialize their parameters;
④ extract the 14 whole-body joint points of the pedestrian from all pedestrian images with the DeeperCut model;
⑤ take the extracted 14 joint points as input, giving a 17-channel input; fix all Block structures and fine-tune the first layer and the final classification layer so that the network adapts to the 17-channel input;
⑥ fine-tune the view classifier with the joint point information extracted from the RAP dataset;
⑦ train the network model with a pedestrian re-identification dataset;
3) construction of the BFE network model: with ResNet50 as the basic structure, two branches are extracted, a Global branch and a Feature Erasing branch; a Bottleneck structure is added in the Feature Erasing branch, and a Mask structure is introduced which randomly erases the same region of all features in a batch; maximum pooling and dimension reduction of the remaining features yield the fine-grained features of the image; finally the feature vectors extracted by the two branches are fused as the final feature vector of the input pedestrian picture.
2. The pedestrian re-identification method based on view information and batch feature erasing of claim 1, wherein in step 1), when the view direction of the image is "front", the probability value of "front" obtained by softmax is large and the probability values of "back" and "side" are small.
3. The pedestrian re-identification method based on view information and batch feature erasing of claim 1, wherein, when the batch features of BFE are erased in step 3), if all input images for pedestrian re-identification are roughly aligned, the method erases the same semantic region from all features of a batch.
4. The pedestrian re-identification method based on view information and batch feature erasing of claim 1, wherein in step 3), a view unit structure is added after Block3, with different view units learning different orientation information of the pedestrian image; a batch feature erasing branch is added after the Block3 structure, which continues to extract depth features through a Block4 structure and a Bottleneck structure, then applies the batch feature erasing method followed by pooling and dimension reduction, finally obtaining the fine-grained features of the image.
5. The pedestrian re-identification method based on view information and batch feature erasing of claim 1, wherein each view unit is composed of a 1 × 1 convolution layer, a convolutional block attention module (CBAM), a 1 × 1 convolution layer, a Batch Normalization layer and a ReLU layer, and the CBAM module applies attention to the input feature map along two independent dimensions, channel and space.
6. The pedestrian re-identification method based on view information and batch feature erasing of claim 1, wherein the training process of the overall network model comprises the following steps:
1) load the weights of the ResNet50 network part trained on ImageNet to initialize the backbone network;
2) train the view classifier using the RAP dataset, with the learning rate set to 0.0001;
3) fix the parameters of the view classifier and the related ResNet50 structures and, using a pedestrian re-identification dataset, fine-tune only the view unit part and the final pedestrian identity label classification layer;
4) train the entire network using a pedestrian re-identification dataset.
CN202010549985.9A 2020-06-16 2020-06-16 Pedestrian re-identification method based on view information and batch feature erasing Withdrawn CN111709364A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010549985.9A CN111709364A (en) 2020-06-16 2020-06-16 Pedestrian re-identification method based on view information and batch feature erasing

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010549985.9A CN111709364A (en) 2020-06-16 2020-06-16 Pedestrian re-identification method based on view information and batch feature erasing

Publications (1)

Publication Number Publication Date
CN111709364A true CN111709364A (en) 2020-09-25

Family

ID=72540679

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010549985.9A Withdrawn CN111709364A (en) 2020-06-16 2020-06-16 Pedestrian re-identification method based on view information and batch feature erasing

Country Status (1)

Country Link
CN (1) CN111709364A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112766353A (en) * 2021-01-13 2021-05-07 南京信息工程大学 Double-branch vehicle re-identification method for enhancing local attention
CN112766353B (en) * 2021-01-13 2023-07-21 南京信息工程大学 Double-branch vehicle re-identification method for strengthening local attention
CN113221770A (en) * 2021-05-18 2021-08-06 青岛根尖智能科技有限公司 Cross-domain pedestrian re-identification method and system based on multi-feature hybrid learning
CN113221770B (en) * 2021-05-18 2024-06-04 青岛根尖智能科技有限公司 Cross-domain pedestrian re-recognition method and system based on multi-feature hybrid learning
CN114022906A (en) * 2021-12-10 2022-02-08 南通大学 Pedestrian re-identification method based on multi-level features and attention mechanism
CN114022906B (en) * 2021-12-10 2024-07-09 南通大学 Pedestrian re-identification method based on multi-level characteristics and attention mechanism

Similar Documents

Publication Publication Date Title
Guo et al. Two-level attention network with multi-grain ranking loss for vehicle re-identification
CN110321813B (en) Cross-domain pedestrian re-identification method based on pedestrian segmentation
CN113408492B (en) Pedestrian re-identification method based on global-local feature dynamic alignment
CN111709364A (en) Pedestrian re-identification method based on view information and batch feature erasing
CN111738143B (en) Pedestrian re-identification method based on expectation maximization
CN110110689B (en) Pedestrian re-identification method
CN109598268A (en) A kind of RGB-D well-marked target detection method based on single flow depth degree network
Li et al. Discriminative semi-coupled projective dictionary learning for low-resolution person re-identification
Si et al. Spatial-driven features based on image dependencies for person re-identification
Chen et al. Vehicle re-identification using distance-based global and partial multi-regional feature learning
Kim et al. Deep stereo confidence prediction for depth estimation
CN110956158A (en) Pedestrian shielding re-identification method based on teacher and student learning frame
Liao et al. Unsupervised foggy scene understanding via self spatial-temporal label diffusion
CN114782977B (en) Pedestrian re-recognition guiding method based on topology information and affinity information
CN112070010A (en) Pedestrian re-recognition method combining multi-loss dynamic training strategy to enhance local feature learning
Cygert et al. Closer look at the uncertainty estimation in semantic segmentation under distributional shift
Tian et al. Domain adaptive object detection with model-agnostic knowledge transferring
CN114495004A (en) Unsupervised cross-modal pedestrian re-identification method
Nikhal et al. Multi-context grouped attention for unsupervised person re-identification
CN113627380A (en) Cross-vision-field pedestrian re-identification method and system for intelligent security and early warning
Jiao et al. Vehicle re-identification in aerial images and videos: Dataset and approach
CN113283320A (en) Pedestrian re-identification method based on channel feature aggregation
Sebastian et al. Dual embedding expansion for vehicle re-identification
CN116343135A (en) Feature post-fusion vehicle re-identification method based on pure vision
CN115082854A (en) Pedestrian searching method oriented to security monitoring video

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20200925