CN113591545B - Deep learning-based multi-level feature extraction network pedestrian re-identification method - Google Patents

Deep learning-based multi-level feature extraction network pedestrian re-identification method

Info

Publication number
CN113591545B
CN113591545B (application CN202110652283.8A)
Authority
CN
China
Prior art keywords
feature
loss
image
pedestrian
branch
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110652283.8A
Other languages
Chinese (zh)
Other versions
CN113591545A (en
Inventor
杨戈
丁鑫
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhuhai Campus Of Beijing Normal University
Original Assignee
Zhuhai Campus Of Beijing Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhuhai Campus Of Beijing Normal University filed Critical Zhuhai Campus Of Beijing Normal University
Priority to CN202110652283.8A priority Critical patent/CN113591545B/en
Publication of CN113591545A publication Critical patent/CN113591545A/en
Application granted granted Critical
Publication of CN113591545B publication Critical patent/CN113591545B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G06F18/2415 — Pattern recognition; classification techniques relating to the classification model based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06F18/253 — Pattern recognition; fusion techniques of extracted features
    • G06N3/04 — Computing arrangements based on biological models; neural networks; architecture, e.g. interconnection topology
    • G06N3/08 — Computing arrangements based on biological models; neural networks; learning methods

Abstract

The invention discloses a deep learning-based multi-level feature extraction network pedestrian re-identification method, which comprises the following steps: 1) during each round of training of the pedestrian re-identification network, extracting the secondary features and the global feature corresponding to each image; 2) processing the global feature and each secondary feature sequentially through BN, a fully connected layer, Circle Loss, label smoothing and Softmax Loss to obtain the corresponding loss values, and additionally computing loss values for the global feature with Center Loss and with Triplet Loss with Adaptive Weights; 3) calculating a total loss value from the individual loss values and adjusting the network parameters; 4) extracting the features of the images to be identified with the trained network, calculating the similarity between each image feature to be identified and the target image feature, and sorting by similarity to obtain a list of pictures similar to the target image.

Description

Deep learning-based multi-level feature extraction network pedestrian re-identification method
Technical Field
The invention relates to a pedestrian re-identification algorithm based on deep learning. Pedestrian re-identification (Person Re-identification, ReID) retrieves, from the data produced by multiple cameras, the positions and corresponding times of a target pedestrian over a short period, so as to reconstruct the activity trajectory of the target pedestrian and support behaviour analysis.
Background
The nature of convolution is feature fusion: it is local at the spatial level and, by default, fully fused at the channel level; obviously, the information of different channels does not contribute equally to the result. The SE (Squeeze-and-Excitation) module was therefore proposed (see Hu J, Shen L, Sun G. Squeeze-and-excitation networks[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018: 7132-7141). The SE module moves the view away from the spatial level and focuses on the relationship between feature channels, so that the features concentrate on the more important channels.
The SE module can be regarded as a channel attention mechanism. Considering that the channel domain alone is not comprehensive enough, the channel attention mechanism was combined with a spatial attention mechanism in CBAM (Convolutional Block Attention Module), where experiments showed that connecting the channel and spatial attention mechanisms in series works better than connecting them in parallel (see Woo S, Park J, Lee J Y, et al. CBAM: Convolutional block attention module[C]//Proceedings of the European Conference on Computer Vision (ECCV). 2018: 3-19). Attention makes the features extracted by a neural network focus more on salient features, but the global features obtained by the SE module may ignore some very fine yet discriminative features. Researchers therefore began to study how to obtain finer-grained features; the two mainstream directions for obtaining local features at present are horizontal slicing and human-pose partitioning. The pose-partitioning approach requires a separately trained human-pose recognition network before the parts can be used. Features obtained by horizontal slicing seem to represent only part of a picture, but after extraction by a deep network their receptive field actually covers the whole image, only with different emphasis. The literature shows that even local features obtained by horizontal slicing can still achieve good performance (see Sun Y, Zheng L, Yang Y, et al. Beyond part models: Person retrieval with refined part pooling (and a strong convolutional baseline)[C]//Proceedings of the European Conference on Computer Vision (ECCV). 2018: 480-496), and that training with both global and local features achieves even better results (see the same reference and Luo H, Jiang W, Zhang X, et al. AlignedReID++: Dynamically matching local information for person re-identification[J]. Pattern Recognition, 2019, 94: 53-61).
An AlignedReID-style method has also been proposed (see Fan X, Jiang W, Luo H, et al. SphereReID: Deep hypersphere manifold embedding for person re-identification[J]. Journal of Visual Communication and Image Representation, 2019, 60: 51-58). After feature extraction, the algorithm splits into two branches: the global feature obtained by global average pooling is used for hard-sample mining, and 8 local features obtained by horizontal average pooling of the feature map are used as horizontal slices; however, the local features obtained by direct horizontal slicing can suffer from the misalignment problem.
The literature proposes the Circle Loss function, which formally unifies classification losses and metric losses. This loss function achieves good results on tasks such as face recognition, vehicle re-identification and pedestrian re-identification (see Sun Y, Cheng C, Zhang Y, et al. Circle loss: A unified perspective of pair similarity optimization[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020: 6398-6407).
Combining a classification loss with a metric loss can achieve better results; however, with multiple combined losses, the different losses are prone to converging out of sync. The literature proposes a method named BNNeck (see Luo H, Gu Y, Liao X, et al. Bag of Tricks and a Strong Baseline for Deep Person Re-Identification[C]//2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW). IEEE, 2019: 1487-1495): the features used to compute the classification loss are batch-normalized, while the features used to compute the triplet loss are not. The classification loss then performs better, and the out-of-sync convergence of the different losses is alleviated.
However, practice shows that connecting the channel attention mechanism in series with the spatial attention mechanism does not work well in the pedestrian re-identification task.
Disclosure of Invention
To solve the above problems, the invention aims to provide a deep learning-based multi-level feature extraction network pedestrian re-identification method. An FRM method is designed for the feature extraction stage that uses the spatial attention mechanism only "partially", so that the network obtains multi-level and diverse features; combined with several loss functions, the algorithm achieves good results.
The technical scheme of the invention is as follows:
a multi-level feature extraction network pedestrian re-identification method based on deep learning comprises the following steps:
1) During each round of training of the pedestrian re-identification network, a batch of data is drawn from the pedestrian re-identification dataset and expanded with data enhancement measures;
2) Training the pedestrian re-identification network with the expanded data: the expanded data of the current batch are passed through Block1 and Block2 in sequence for feature extraction to obtain the feature vector F = {x_1, ..., x_b}, which is then fed into the first branch and the second branch of the pedestrian re-identification network; x_b denotes the feature data of the b-th image, c the number of channels of the feature map of an image, h its height and w its width; Block1 and Block2 are feature extraction modules in SEResNet-50, SFGM is the secondary feature generation module, GeM is generalized mean pooling, and BN is batch normalization;
3) The first branch extracts the global feature of each image from the feature vector F; the second branch extracts the secondary features of each image from the feature vector F;
4) The global feature and each secondary feature are processed in sequence by a BN operation, a fully connected layer, Circle Loss, label smoothing and Softmax Loss to obtain the corresponding loss values; in addition, a loss value is computed for the global feature with the auxiliary loss function Center Loss, and another loss value with the loss function Triplet Loss with Adaptive Weights;
5) Calculating a total loss value from the loss values obtained in step 4), and adjusting and optimizing the network parameters of the pedestrian re-identification network according to the total loss value to obtain the pedestrian re-identification network after one batch of training;
6) Repeating steps 1)-5) until the set number of rounds is completed (the whole training set is traversed once per round), yielding the trained pedestrian re-identification network;
7) For each image to be identified, the trained pedestrian re-identification network extracts its global feature and secondary features; the global feature and the secondary features are batch-normalized (BN) and then fused to obtain the feature of the image to be identified. The similarity between each image feature to be identified and the target image feature is calculated, and the results are sorted by similarity to obtain a list of pictures similar to the target image.
Further, the method by which the second branch extracts the secondary features is as follows: a secondary feature vector to be extracted is first generated by SFGM and then processed by Block3-1, after which it is fed into two branches, a second-level branch one and a second-level branch two. The second-level branch one passes the input feature through Block4-1 and GeM in sequence to obtain the secondary first-level feature. The second-level branch two obtains a new secondary feature vector to be extracted from the input feature through SFGM; after passing through Block4-2, this vector is divided into two branches, a third-level branch one and a third-level branch two. The third-level branch one obtains the secondary second-level feature from the vector through GeM; the third-level branch two passes it through SFGM and GeM in sequence to obtain the secondary third-level feature. Block3-1, Block4-1 and Block4-2 are feature extraction modules in SEResNet-50.
Further, the data processing of the secondary feature generation module SFGM is as follows: with the input feature vector denoted OF, maximum pooling and mean pooling are applied to OF along the channel dimension, and the two pooled feature vectors are concatenated along the channel dimension to obtain the feature vector F1; a convolution operation is applied to F1 to obtain the feature vector F2; F2 is batch-normalized (BN operation); finally, a Sigmoid activation function yields a spatial weight matrix M with values in [0, 1]. A Mask of the positions whose weight in M is greater than P is generated; through the Mask, the values greater than P in M are set to W and the remaining values are set to 1, giving the spatial weight matrix M1; M1 is expanded to the spatial weight matrix M2 with the same channel dimension as OF; M2 is point-wise multiplied with OF and the result is added to OF to obtain the feature vector F3; F3 is activated by the ReLU function and then output.
Further, each image is processed with the following data enhancement measures:
11) adjusting the image to a height-to-width ratio of 2:1 to generate a new image;
12) randomly flipping the image horizontally to generate a new image;
13) randomly cropping the image to generate a new image;
14) normalizing the image to generate a new image;
15) randomly erasing part of the image to generate a new image.
Further, in step 11), the image is resized to [height, width] = [256, 128] pixels; branch one extracts the global feature from the feature vector F through Block3, Block4 and GeM in sequence, with the stride in Block4 adjusted from 2 to 1; Block3 and Block4 are feature extraction modules in SEResNet-50.
Further, in step 4), Circle Loss is used to adjust the inter-class distance, and Softmax Loss is then used so that all classes attain the maximum log-likelihood in the probability space.
Further, the loss value calculated by the Triplet Loss with Adaptive Weights is

L_{TriAda}(a) = \log\left(1 + \exp\left( \sum_{p \in P(a)} w_{ap}\, d(x_a, x_p) - \sum_{n \in N(a)} w_{an}\, d(x_a, x_n) \right)\right),

with w_{ap} = \frac{\exp(d(x_a, x_p))}{\sum_{p' \in P(a)} \exp(d(x_a, x_{p'}))} and w_{an} = \frac{\exp(-d(x_a, x_n))}{\sum_{n' \in N(a)} \exp(-d(x_a, x_{n'}))},

where d(x_a, x_p) and d(x_a, x_n) denote the distance between the target x_a and the positive sample x_p and the negative sample x_n, respectively; P(a) and N(a) denote the positive and negative sample sets of the target a.
Further, the loss value calculated by Center Loss is

L_C = \frac{1}{2} \sum_{i=1}^{B} \left\| x_i - c_{y_i} \right\|_2^2,

where B is the number of samples, x_i denotes the feature of sample i, and c_{y_i} denotes the center of the class of sample i.
Compared with the prior art, the invention has the following positive effects:
In the person re-identification (ReID) task, the movement of a pedestrian often spans several camera views, the direction and behaviour of the pedestrian cannot be constrained, and irrelevant people or objects present in different scenes interfere with acquiring information about the target pedestrian. Meanwhile, surveillance systems are characterized by a fixed shooting angle for each camera, different angles between cameras, and low image resolution. These characteristics make the pedestrian re-identification task difficult. The invention proposes a SEResNet-50-based Multistage Feature Extraction Network (MFEN). Extracting richer and more diverse pedestrian features from low-quality images effectively improves the re-identification capability of the network, and MFEN acquires multi-level key features in the images through the Feature Re-extraction Method (FRM) proposed by the invention. Experiments show that, compared with AANet-50, mAP is improved by 3.85% and Rank-1 by 0.71% on the Market1501 dataset, and mAP is improved by 2.74% and Rank-1 by 1.28% on the DukeMTMC-reID dataset.
Drawings
Fig. 1 is a block diagram of a multi-level feature extraction network as a whole.
Fig. 2 is a block diagram of the secondary feature generation module.
Fig. 3 is the receiver operating characteristic (ROC) curve.
Detailed Description
In order to further explain the objects, features and advantages of the technical solution of the present invention, the present invention will be further described in detail with reference to the accompanying drawings. The specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
In order to extract fine-grained features more efficiently, the invention proposes a SEResNet-50-based Multistage Feature Extraction Network (MFEN). MFEN can be divided into three parts: data enhancement; FRM and multi-level feature fusion; and multi-loss function joint training.
The main structure of MFEN is shown in Fig. 1, in which the dashed connections are the test route. SFGM: secondary feature generation module; FRM: feature re-extraction method; GeM: generalized mean pooling; BN: batch normalization; Block: a feature extraction module of SEResNet-50 — Blocks 1-4 have the same structure, but parameters such as the number of channels are set differently according to the input data (for example, Block1 maps input data "1*2" through a function (attribute = 1) to output data "4*2", then Block2 maps input data "4*2" through a function (attribute = 4) to output data "16*2", and so on); Loss refers to the loss function.
1.1 Data enhancement
Because the public pedestrian re-identification datasets are small, the pedestrian re-identification network easily overfits during training. To alleviate overfitting, during each batch of training the invention applies the following data enhancement measures to each image in the pedestrian re-identification dataset (a sketch of the resulting pipeline is given at the end of this subsection):
(1) The image is resized to [height, width] = [256, 128] pixels. The 2:1 height-to-width ratio matches the shape of a human body and prevents distortion. The size values were chosen, first, so that the side lengths are powers of two and computing resources are fully used; second, because once the resized image is larger than the original, further increasing the side lengths brings little gain (see Luo H, Gu Y, Liao X, et al. Bag of Tricks and a Strong Baseline for Deep Person Re-Identification[C]//2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW). IEEE, 2019: 1487-1495); and third, because excessively large values noticeably burden the running speed of the network.
(2) The image is randomly flipped horizontally with probability 0.5, to enhance the generalization capability of the model and alleviate overfitting.
(3) The image is randomly cropped with a margin of 10 pixels, to enhance the generalization capability of the model and alleviate overfitting.
(4) The image is normalized to reduce the influence of affine transformations. The normalization uses two sets of values computed by sampling the ImageNet training set: mean [0.485, 0.456, 0.406] and standard deviation [0.229, 0.224, 0.225].
(5) The image is randomly erased with probability 0.5. Random erasing selects a rectangular area in the original image and replaces the pixels in the area with random values, producing an occlusion-like effect, which enhances the generalization capability of the model and alleviates overfitting.
A complete pass of the training set during training is called one round (epoch), and the pedestrian re-identification network is trained for a set number of rounds. Because some of the above methods are random, the expanded data generated for the same image change at every round of training, which achieves the effect of data expansion while keeping the amount of computation low and thus benefits training.
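As a concrete illustration of the above measures, the following is a minimal sketch of such a pipeline using torchvision; the pad-then-crop interpretation of measure (3) and the use of the built-in RandomErasing transform are assumptions for illustration, not the patent's actual implementation.

```python
import torchvision.transforms as T

# Sketch of the training-time data enhancement pipeline described above.
train_transform = T.Compose([
    T.Resize((256, 128)),                 # (1) height:width = 2:1
    T.RandomHorizontalFlip(p=0.5),        # (2) random horizontal flip
    T.Pad(10),                            # assumed pad-then-crop realisation
    T.RandomCrop((256, 128)),             # (3) random crop
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406],
                std=[0.229, 0.224, 0.225]),  # (4) ImageNet statistics
    T.RandomErasing(p=0.5),               # (5) random erasing on the tensor
])
```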
1.2 FRM and feature fusion
Extracting richer and more diverse pedestrian features from low-quality images effectively improves the re-identification capability of the network. To obtain more effective and diverse features, the invention proposes the Feature Re-extraction Method (FRM). To better extract key features, the stride in the Block4 module of SEResNet-50 is adjusted from 2 to 1, so that the size of the final feature map grows from the original 8×4 to 16×8, which gives the network finer-grained features.
Connecting the channel attention mechanism in series with the spatial attention mechanism should, in theory, improve the recognition capability of the network. However, in the pedestrian re-identification task the invention found through experiments that this is not the case. This is probably because the images in pedestrian re-identification are of poor quality — occlusion, information problems caused by shooting angle or distance, low resolution, and so on — so that the sharpened focus introduced by adding the spatial attention mechanism instead leads the network to wrong judgements. Training the network jointly with global and local features gives better results, but local features extracted by horizontal slicing suffer from the misalignment problem.
Combining the above considerations, the invention designs a "partially used" spatial attention mechanism: the spatial attention mechanism is only used to judge the importance of features from a global view, and then a portion of the features is stripped away instead of being mechanically "hard-cut". By re-extracting the feature vector after the stripping, more effective features can be obtained.
Initially, FRM attempted to remove non-key features, on the reasoning that irrelevant objects would interfere with the judgement of the network, but experiments showed that this was not effective. This may be because the network's judgement of which features are important is not necessarily correct: the "non-key" feature being removed might be part of the target pedestrian, while a tree in the background is kept.
FRM was therefore modified to remove key features instead; the secondary features extracted in this way focus more on detail features that are relatively inconspicuous. Stripped layer by layer, these features appear more hierarchical than a single global feature, so the network obtains more information. This self-learned feature-stripping behaviour also has a data-augmentation effect: the network learns from more varied feature maps than the original dataset provides, which improves its generalization capability. Experiments show that FRM brings a significant improvement in network performance.
FRM is a combination of a secondary feature generation module (Secondary Feature Generation Module, SFGM) and a feature extraction section.
Let the input data of a batch be F = {x_1, ..., x_b}, F ∈ R^{b×c×h×w}, where x ∈ R^{c×h×w} is the data of a single picture, b ∈ N+ is the number of pictures processed simultaneously in a batch (the batch size), x_b is the data of the b-th picture, c ∈ N+ is the number of channels of the feature map, h ∈ N+ its height, and w ∈ N+ its width.
As shown in Fig. 1, after data enhancement the images pass through Blocks 1 and 2 of SEResNet-50 to give the feature vector F ∈ R^{b×512×32×16}, which is fed to SFGM. At this point the pedestrian re-identification network splits into two branches: the first-level branch one goes through Blocks 3 and 4 and GeM to extract the global feature; the first-level branch two extracts secondary features through FRM. In the FRM method, the feature vector first generates a secondary feature vector to be extracted through SFGM; this vector is processed by Block3-1 and then split into two branches again: the second-level branch one obtains the secondary first-level feature through Block4-1 and GeM, while the second-level branch two obtains a new secondary feature vector to be extracted through SFGM. After Block4-2, that vector is again split into two branches: the third-level branch one obtains the secondary second-level feature through GeM, and the third-level branch two obtains the secondary third-level feature through SFGM and GeM.
In SFGM, with input feature vector OF ∈ R^{b×c×h×w}, OF is first subjected to maximum pooling and mean pooling along the channel dimension, and the two pooled feature vectors are concatenated along the channel dimension to obtain the feature vector F1 ∈ R^{b×2×h×w}. A convolution with a 7×7 kernel is applied to F1 to obtain the feature vector F2 ∈ R^{b×1×h×w}. F2 is batch-normalized (BN operation). Finally, a Sigmoid activation function yields the spatial weight matrix M ∈ R^{b×1×h×w} with values in [0, 1]. The BN operation keeps the input out of the saturated region, avoiding the vanishing gradients that the Sigmoid activation can otherwise cause, and accelerates network convergence.
A Mask ∈ R^{b×1×h×w} of the positions whose weight in the spatial weight matrix M is greater than P is generated; judged through the Mask, the values greater than P in M are set to W (W = 0 in the adopted configuration) and the remaining values are set to 1, giving the spatial weight matrix M1 ∈ R^{b×1×h×w}. M1 is expanded to the spatial weight matrix M2 ∈ R^{b×c×h×w} with the same channel dimension as the feature vector OF. M2 is point-wise multiplied with OF and the result is added to OF, completing a residual unit and giving the feature vector F3 ∈ R^{b×c×h×w}. F3 is activated by the ReLU function. The SFGM flow is shown in Fig. 2.
When no residual unit is added to SFGM, the performance of the network is not stable enough; after the residual unit is added, this instability is relieved. The invention attributes this to the residual unit alleviating the loss of a large number of key features and the network degradation that may occur during re-extraction of the feature map.
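The following PyTorch sketch illustrates the SFGM data flow described above; the class name, the default threshold P and the suppression value W are illustrative assumptions, and the code has not been checked against the original implementation.

```python
import torch
import torch.nn as nn

class SFGM(nn.Module):
    """Sketch of the Secondary Feature Generation Module as described in the text.
    P and W are hyper-parameters (the text reports P in [0.85, 0.95] and W = 0
    in the adopted configuration); defaults here are assumptions."""
    def __init__(self, p_threshold=0.90, w_value=0.0):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size=7, padding=3, bias=False)
        self.bn = nn.BatchNorm2d(1)
        self.p = p_threshold
        self.w = w_value

    def forward(self, of):                        # of: (b, c, h, w)
        max_map, _ = of.max(dim=1, keepdim=True)  # max pooling over the channel dim
        avg_map = of.mean(dim=1, keepdim=True)    # mean pooling over the channel dim
        f1 = torch.cat([max_map, avg_map], dim=1)        # F1: (b, 2, h, w)
        m = torch.sigmoid(self.bn(self.conv(f1)))        # spatial weights M in [0, 1]
        mask = m > self.p                                # positions of key features
        m1 = torch.where(mask, torch.full_like(m, self.w), torch.ones_like(m))
        m2 = m1.expand_as(of)                            # broadcast to channel dim
        f3 = of * m2 + of                                # residual unit
        return torch.relu(f3)
```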
Before the final feature extraction is completed, generalized mean pooling (GeM) is applied to obtain the vector F4 ∈ R^{b×c×1×1}. The GeM formula is shown in (1):

F4_c = \left( \frac{1}{n} \sum_{x \in f_c} x^{p} \right)^{1/p}    (1)

where f_c ∈ R^{h×w} denotes the single-channel feature map of the c-th channel, x denotes a feature value, n = h×w, and p is a specified parameter. GeM is called generalized because it reduces to global average pooling when p = 1 and to maximum pooling when p → ∞. When p > 1, contrast is enhanced, making the pooled feature focus more on local features.
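A minimal sketch of GeM pooling consistent with formula (1) follows; making p a learnable parameter is a common practice but an assumption with respect to the text.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GeM(nn.Module):
    """Generalized mean pooling, Eq. (1): ((1/n) * sum x^p)^(1/p) over h*w."""
    def __init__(self, p=3.0, eps=1e-6):
        super().__init__()
        self.p = nn.Parameter(torch.ones(1) * p)  # learnable p (assumption)
        self.eps = eps

    def forward(self, x):                       # x: (b, c, h, w)
        x = x.clamp(min=self.eps).pow(self.p)   # element-wise x^p
        x = F.adaptive_avg_pool2d(x, 1)         # mean over the h*w positions
        return x.pow(1.0 / self.p)              # (b, c, 1, 1)
```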
MFEN thus obtains four features: the global feature and the secondary first-level, second-level and third-level features. At test time, the four features are fused along the channel dimension and cosine similarity is computed. The cosine similarity matrix is then sorted; the larger the similarity, the more similar the pictures. The information of the corresponding picture is obtained from the label associated with the feature.
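The test-time fusion and ranking step can be sketched as follows; the function name and the assumption that the four features have already been concatenated along the channel dimension are illustrative, not taken from the patent.

```python
import torch
import torch.nn.functional as F

def rank_gallery(query_feats, gallery_feats):
    """Rank gallery images for every query by cosine similarity.
    query_feats: (num_query, d), gallery_feats: (num_gallery, d),
    where d is the fused (concatenated) feature dimension."""
    q = F.normalize(query_feats, dim=1)
    g = F.normalize(gallery_feats, dim=1)
    sim = q @ g.t()                              # cosine similarity matrix
    return sim.argsort(dim=1, descending=True)   # most similar pictures first
```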
1.3 Multi-loss function Joint training
The invention combines 4 loss functions in the network: Triplet Loss with Adaptive Weights (see Lawen H, Ben-Cohen A, Protter M, et al. Attention network robustification for person re-id[J]. arXiv preprint arXiv:1910.07038, 2019), Softmax Loss, Center Loss (see Wen Y, Zhang K, Li Z, et al. A discriminative feature learning approach for deep face recognition[C]//European Conference on Computer Vision. Springer, Cham, 2016: 499-515) and Circle Loss.
Considering that the classification loss works better after feature normalization, the network applies a BN operation to the features before computing Softmax Loss. Because the secondary features may be unreliable due to missing parts of the key features, only the global feature is used to compute Triplet Loss with Adaptive Weights and Center Loss.
Softmax Loss is a common classification loss function. With a target score vector d = f(x) ∈ R^C, where C is the number of classes and y the position of the true label, the formula is shown in (2):

L_{Softmax} = -\log \frac{e^{d_y}}{\sum_{c=1}^{C} e^{d_c}}    (2)
In fully supervised learning the labels are one-hot vectors, which impose a strong supervision constraint and deepen overfitting. To alleviate this, the network uses label smoothing when computing the classification loss (see Szegedy C, Vanhoucke V, Ioffe S, et al. Rethinking the inception architecture for computer vision[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016: 2818-2826), whose formula is shown in (3):

q_i = \begin{cases} 1 - \varepsilon + \dfrac{\varepsilon}{N}, & i = y \\ \dfrac{\varepsilon}{N}, & i \neq y \end{cases}    (3)

where q_i denotes the i-th component of the smoothed label vector q ∈ R^C of a sample, N ∈ N+ is the number of classes, and ε = 0.1 is the smoothing parameter.
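A sketch of Softmax Loss with label smoothing following formulas (2) and (3) is given below; it is one possible implementation, not the patent's code, and the number of classes is taken from the logit dimension.

```python
import torch
import torch.nn as nn

class LabelSmoothingSoftmaxLoss(nn.Module):
    """Cross-entropy over softmax probabilities with smoothed targets, Eqs. (2)-(3)."""
    def __init__(self, epsilon=0.1):
        super().__init__()
        self.epsilon = epsilon

    def forward(self, logits, targets):          # logits: (B, N), targets: (B,)
        n = logits.size(1)
        log_probs = torch.log_softmax(logits, dim=1)
        smooth = torch.full_like(log_probs, self.epsilon / n)        # eps/N everywhere
        smooth.scatter_(1, targets.unsqueeze(1),
                        1.0 - self.epsilon + self.epsilon / n)       # true class weight
        return (-smooth * log_probs).sum(dim=1).mean()
```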
The goal of Softmax Loss is to give all classes the maximum log-likelihood in the probability space; however, the pedestrian re-identification task is not merely a classification problem. The classification boundary between features may be clear while their inter-class distances are smaller than the intra-class distances, so a better metric space is needed. Better results can be obtained by using Circle Loss to adjust the inter-class distance and then applying Softmax Loss. The goal of Circle Loss can be understood as making the smallest within-class similarity larger than the largest between-class similarity. Circle Loss is shown in formula (4):

L_{Circle} = \log \left( 1 + \sum_{i=1}^{K} \sum_{j=1}^{L} \exp\big( \gamma ( s_n^j - s_p^i + m ) \big) \right)    (4)

where K ∈ N+ is the number of positive samples, L ∈ N+ is the number of negative samples, γ ∈ R is the scaling factor, s_n^j is the similarity between the target and a picture j of a different identity, s_p^i is the similarity between the target and a picture i of the same identity, and m ∈ R is the inter-class margin. For example, if a batch contains 24 pictures — 4 pictures each of 6 pedestrians — then for a given target picture the 5×4 = 20 pictures of the other 5 persons are of different identities, and the remaining 3 pictures of the same pedestrian are of the same identity.
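The pair-similarity form of formula (4) can be sketched per anchor as follows; the default γ and m values are placeholders, not the trained settings, and a log-sum-exp is used for numerical stability.

```python
import torch

def circle_style_loss(sim_pos, sim_neg, gamma=64.0, margin=0.25):
    """Eq. (4): log(1 + sum_i sum_j exp(gamma * (s_n^j - s_p^i + m))).
    sim_pos: (K,) similarities to same-identity samples,
    sim_neg: (L,) similarities to different-identity samples."""
    diff = gamma * (sim_neg.unsqueeze(0) - sim_pos.unsqueeze(1) + margin)  # (K, L)
    # Appending a zero term makes logsumexp compute log(1 + sum(exp(diff))) stably.
    terms = torch.cat([diff.reshape(-1), diff.new_zeros(1)])
    return torch.logsumexp(terms, dim=0)
```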
Compared with the hard-sample Triplet Loss, which focuses only on the most difficult samples, Triplet Loss with Adaptive Weights considers that attending only to extreme cases is insufficient: if a certain number of annotations are wrong, the hard-sample Triplet Loss makes the network difficult to converge. Because a hard margin loses the power to keep pushing the inter-class distance once the set value is reached, the invention uses a soft margin, which continuously pulls positive samples closer and pushes negative samples away. The formulas are shown in (5), (6) and (7):

L_{TriAda}(a) = \log\left(1 + \exp\left( \sum_{p \in P(a)} w_{ap}\, d(x_a, x_p) - \sum_{n \in N(a)} w_{an}\, d(x_a, x_n) \right)\right)    (5)

w_{ap} = \frac{\exp\big(d(x_a, x_p)\big)}{\sum_{p' \in P(a)} \exp\big(d(x_a, x_{p'})\big)}    (6)

w_{an} = \frac{\exp\big(-d(x_a, x_n)\big)}{\sum_{n' \in N(a)} \exp\big(-d(x_a, x_{n'})\big)}    (7)

where d(x_a, x_p) and d(x_a, x_n) denote the distance between the target x_a and the positive sample x_p and the negative sample x_n, respectively, and P(a) and N(a) denote the positive and negative sample sets of the target a. As (6) and (7) show, the larger the distance of a positive sample, the larger its weight; the larger the distance of a negative sample, the smaller its weight.
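Formulas (5)-(7) can be sketched per anchor as follows; the function signature is an illustrative simplification, with distances to positives and negatives assumed to be precomputed.

```python
import torch
import torch.nn.functional as F

def triplet_loss_adaptive_weights(dist_ap, dist_an):
    """Eqs. (5)-(7): softmax-weighted positive/negative distances with a soft margin.
    dist_ap: (P,) anchor-to-positive distances, dist_an: (N,) anchor-to-negative distances."""
    w_p = torch.softmax(dist_ap, dim=0)    # farther positives get larger weights (Eq. 6)
    w_n = torch.softmax(-dist_an, dim=0)   # closer negatives get larger weights (Eq. 7)
    pulled = (w_p * dist_ap).sum()
    pushed = (w_n * dist_an).sum()
    return F.softplus(pulled - pushed)     # soft margin: log(1 + exp(.)) as in Eq. (5)
```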
Center Loss is an auxiliary loss function; it reduces the intra-class spacing to achieve an aggregation effect, but cannot change the inter-class spacing. The formula is shown in (8):

L_C = \frac{1}{2} \sum_{i=1}^{B} \left\| x_i - c_{y_i} \right\|_2^2    (8)

where B is the number of samples, x_i denotes the feature of sample i, and c_{y_i} denotes the center of the class of sample i.
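A sketch of Center Loss as in formula (8) follows; keeping the class centers as a learnable parameter updated by the optimizer is the usual practice but an assumption with respect to the text.

```python
import torch
import torch.nn as nn

class CenterLoss(nn.Module):
    """Eq. (8): 0.5 * sum_i ||x_i - c_{y_i}||^2 over the batch."""
    def __init__(self, num_classes, feat_dim):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(num_classes, feat_dim))

    def forward(self, features, labels):   # features: (B, d), labels: (B,)
        centers = self.centers[labels]     # class center of every sample
        # Many implementations average over the batch instead of summing.
        return 0.5 * (features - centers).pow(2).sum()
```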
In the invention, Circle Loss is connected in series with Softmax Loss to obtain the classification loss, which is then combined in parallel with Triplet Loss with Adaptive Weights and Center Loss to obtain the total loss. The structure is shown in Fig. 1, and the specific composition ratios in Tables 3 and 4. The trained parameters are saved to a file once the network reaches the desired performance.
2 Experiments
2.1 Experimental Environment and important parameters
The experimental environment of the invention is shown in Table 1. The network uses the Ranger optimizer, which combines the RAdam and LookAhead optimizers and thus achieves higher performance. For the results in Tables 12 and 13, weights pre-trained with ResNet on the ImageNet dataset were used.
Table 1 experimental environment
Graphics card NVIDIA RTX 2080
Video memory 8 GB
CPU Intel Core i7 (5th generation)
Memory 16 GB
Operating system Linux
Development language Python 3.6
Integrated development environment PyCharm
Experiment framework PyTorch 1.3
GPU computing platform NVIDIA CUDA 10
Deep learning acceleration library NVIDIA cuDNN 7
The important hyper-parameters are shown in Table 2. A training batch size of 4*4 means that 4 pictures are taken from each of 4 different pedestrians.
TABLE 2 important super parameter Table
In the initial stage of model training, a large amount of new data causes the model parameters to be adjusted rapidly; using a large learning rate at this point easily traps the model in a local optimum, while keeping the learning rate low for a long time makes convergence too slow. In the later stage of training, the lowest point of the loss is sought and a lower learning rate is needed. The invention therefore adopts the WarmUp learning strategy. The formula is shown in (9):
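Since formula (9) is not reproduced here, the following sketch only illustrates a typical WarmUp-style schedule (linear warm-up followed by step decay); every numeric value in it is an assumption rather than the patent's actual hyper-parameters.

```python
def warmup_lr(epoch, base_lr=3.5e-4, warmup_epochs=10, milestones=(40, 70), gamma=0.1):
    """Illustrative WarmUp schedule; base LR, warm-up length and decay milestones
    are placeholders, not the values used in the patent."""
    if epoch <= warmup_epochs:
        return base_lr * epoch / warmup_epochs      # linear warm-up
    decay = sum(1 for m in milestones if epoch > m)
    return base_lr * (gamma ** decay)               # step decay afterwards
```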
The proportional composition of the network's loss functions is shown in Table 3. Considering that the value of Triplet Loss with Adaptive Weights is lower than that of Softmax Loss, the proportion of Softmax Loss is reduced so that the network is more sensitive to changes in the former.
TABLE 3 loss function proportion Table
Loss function Composition ratio
Softmax Loss 0.5
Triplet loss with adaptive weights 1
Center Loss 0.0005
Triplet Loss with Adaptive Weights and Center Loss use only the global feature. The Softmax Loss composition ratios of the global feature and of the three secondary features extracted by FRM are shown in Table 4. The proportions of the secondary features are relatively small because, unlike the global feature, they are missing part of the features. Comparative tests show that this ratio works relatively well.
Table 4 Softmax Loss ratio
Feature Composition ratio
Global feature 1/2
Secondary first-level feature 1/4
Secondary second-level feature 1/8
Secondary third-level feature 1/8
2.2 Data sets
Market1501 dataset: a large single-frame dataset collected on a university campus in the daytime in 2015. The dataset contains 32,668 pictures of 1,501 pedestrians captured by 6 cameras. The training set contains 19,732 pictures of 751 pedestrians; the test (gallery) set contains 12,936 pictures of 750 pedestrians; the query set contains 3,368 pictures. All pictures have a resolution of 128×64 and are cropped automatically by a detector. The labels are divided into machine-annotated and manually-annotated; academia currently mainly uses the machine-annotated labels. Dataset characteristics: some dirty data, a large dataset, a query set collected in summer, and multiple pictures of the same pedestrian.
DukeMTMC-reID dataset: a large single-frame dataset collected on the Duke University campus in the daytime in 2016. The dataset contains 36,411 pictures of 1,404 pedestrians captured by 8 cameras, all cropped and annotated manually. The training set contains 16,522 pictures of 702 pedestrians; the test (gallery) set contains 17,661 pictures of 702 pedestrians; the query set contains 2,228 pictures. The picture resolution is 204×61. Dataset characteristics: a large dataset with special pedestrian attributes such as gender and the presence or absence of a backpack; the weather was cold during collection, so pedestrians wear heavy clothing and occlusion is severe.
2.3 Main metrics of the algorithm
Mean average precision (mAP) and the first hit rate (Rank-1 Accuracy) are currently the most important indicators for measuring the performance of pedestrian re-identification algorithms.
Rank-1 Accuracy is the hit rate of the picture with the highest confidence in the returned results. When there are multiple query images, the average is taken.
The mAP formulas are shown in (10), (11) and (12):

P(k) = \frac{TP}{TP + FP}    (10)

AP = \frac{1}{n} \sum_{k} P(k)\, \Omega(k)    (11)

mAP = \frac{1}{N_q} \sum_{q=1}^{N_q} AP_q    (12)

where Ω(k) indicates whether the k-th query result has the same ID as the query (1 if so, 0 otherwise); n ∈ N+ is the number of all pictures with the same ID as the query picture; N_q ∈ N+ is the number of query pictures; TP (True Positive) is the number of positives predicted as positive; FP (False Positive) is the number of negatives predicted as positive.
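The two metrics can be sketched as follows under the reconstructed formulas (10)-(12); the junk/distractor handling used by the official evaluation protocols is omitted, so this is an illustration rather than the benchmark code.

```python
import numpy as np

def rank1_and_map(sim, query_ids, gallery_ids):
    """Compute Rank-1 Accuracy and mAP from a similarity matrix.
    sim: (num_query, num_gallery) array; query_ids, gallery_ids: int arrays of IDs."""
    rank1_hits, aps = [], []
    for q in range(sim.shape[0]):
        order = np.argsort(-sim[q])                                 # most similar first
        matches = (gallery_ids[order] == query_ids[q]).astype(np.float32)
        rank1_hits.append(matches[0])                               # top-1 hit or miss
        hits = np.cumsum(matches)
        precision = hits / (np.arange(len(matches)) + 1)            # P(k), Eq. (10)
        aps.append((precision * matches).sum() / max(matches.sum(), 1))  # AP, Eq. (11)
    return float(np.mean(rank1_hits)), float(np.mean(aps))          # Rank-1, mAP (12)
```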
2.4 Comparative experiments and analysis
Tables 5 and 6 are ablation experiments of the method on the Market1501 dataset and the DukeMTMC-reID dataset, respectively. CA refers to the channel attention mechanism; FRM refers to adding the FRM method of the invention.
Table 5 ablation experimental contrast (Market 1501)
Table 6 ablation experimental comparison (DukeMTMC-reID)
For scheme α in Tables 5 and 6, the invention considers that the initial learning rate was too high, so the network fell into a local optimum; mAP on the Market1501 dataset therefore rises by 28.3% after Circle Loss is added, but the real improvement is not that dramatic. The effect of Circle Loss is that the inter-class distance becomes larger than with the original Softmax Loss, and normalization maps the feature vectors onto a hypersphere, reducing the influence of the modulus on the classification result. Adding CA increases the performance of the network to a certain extent, allowing it to identify the features of the target pedestrian more accurately. GeM makes the feature vectors extracted by the network focus more on locally prominent features. In scheme MFEN-A, the Block4-2 of the secondary third-level feature and the Block4-1 of the secondary second-level feature share parameters; in MFEN-B they do not. As the data in Tables 5 and 6 show, scheme MFEN-A works better on the Market1501 dataset and scheme MFEN-B works better on the DukeMTMC-reID dataset. The invention attributes this to the latter dataset being larger and containing more occlusion, so the extracted secondary features are more useful and performance increases with the number of parameters, whereas on the former dataset overfitting is aggravated and performance degrades slightly.
After the FRM method is added, mAP on the Market1501 dataset improves by 1.4% and Rank-1 by 0.8%; on the DukeMTMC-reID dataset, mAP improves by 2.4% and Rank-1 by 2.5%.
Tables 7-11 are parameter comparison experiments, all on the Market1501 dataset.
A comparison of W values in FRM is shown in Table 7; it indicates that completely removing the selected features (W = 0) is a good choice.
Comparison of W values in Table 7 FRM
W value mAP Rank-1
0.5 85.6% 94.0%
0.1 84.7% 94.0%
0 86.1% 94.4%
Table 8 compares the parameter-sharing schemes of the BN layers after GeM. Scheme BN(α) means the BN layer parameters are not shared at all; scheme BN(β) means they are all shared; scheme BN(γ) means the BN layer parameters are shared among the re-extracted features but not with the BN layer of the global feature. In experiments the invention found that, when batch-normalizing the features, scheme BN(α) and scheme BN(β) both reduce network performance, while scheme BN(γ) gives better results.
The invention considers that after the salient features have been erased three times, the resulting feature vectors are no longer sufficient to reflect the numerical distribution of the whole dataset, so batch normalization cannot be performed well; sharing among the re-extracted features relieves this situation without affecting the global feature.
Table 8 GeM post BN layer parameter sharing scheme comparison
Parameter sharing scheme mAP Rank-1
BN(α) 84.7% 94.0%
BN(β) 84.2% 94.0%
BN(γ) 86.1% 94.4%
Table 9 compares the Circle Loss parameter-sharing schemes. Scheme C(α) means all Circle Loss parameters are shared; scheme C(β) means the parameters of the global feature and of the secondary features are not shared, while the Circle Loss parameters are shared among the secondary features; scheme C(γ) means the Circle Loss parameters are not shared at all. Scheme C(γ) works better.
Table 9 circle loss parameter sharing scheme
Scheme mAP Rank-1
C(α) 85.5% 94.3%
C(β) 85.0% 94.2%
C(γ) 86.1% 94.4%
A comparison of P-value sets in FRM is shown in Table 10, where P-value set = [P value (secondary first-level feature), P value (secondary second-level feature), P value (secondary third-level feature)]. The network uses the P-value set [0.85, 0.90, 0.95]. From Table 10 a rough rule can be obtained: the deeper the network, the larger the P value of the extracted feature should be. The deeper the layer, the smaller the extracted feature map and the more abstract and accurate the extracted features; and the smaller the feature map, the larger the receptive field of each feature vector, so a low P value makes it too easy to remove too many key features.
Comparison of the P-value sets in Table 10 FRM
Table 11 compares network performance at different batch sizes; batch sizes of 32 and above were trained with 4 GTX 1080 Ti (11 GB) graphics cards. As Table 11 shows, increasing the batch size within a certain range improves the performance of the network, but beyond that performance decreases. P is the number of different pedestrians and K the number of pictures per pedestrian. The Market1501 dataset is compared using the MFEN-A scheme and DukeMTMC-reID using the MFEN-B scheme. For MFEN, a batch size of 6*4 is optimal.
Table 11 comparison of network performance at different Batch sizes
The receiver operating characteristic (ROC) curve of MFEN on the Market1501 dataset is shown in Fig. 3. The horizontal axis is the false acceptance rate, i.e. the rate at which non-targets are judged as the target; the vertical axis is the hit rate, i.e. the number of target pictures retrieved divided by the number of target pictures in the dataset.
Tests were performed on the Market1501 dataset, where RK denotes re-ranking. MFEN was trained for 120 rounds (epochs), which took 4.5 hours. The final experimental results are shown in Table 12.
Table 12 Market1501 dataset
Table 13 shows the performance of MFEN on the DukeMTMC-reID dataset and a comparison with other pedestrian re-identification algorithms. MFEN was trained for 140 rounds (epochs), taking 6.5 hours.
Table 13 DukeMTMC-reID dataset
The secondary features extracted by MFEN focus more on detail features that are not particularly prominent. Stripped layer by layer, these features appear more hierarchical than a single global feature, so the network obtains more information. The network also learns from more images than the original dataset provides, which improves its generalization capability. Together with the combination of multiple loss functions, MFEN achieves good performance on the fully supervised, single-labelled, single-frame task.
The specific embodiments of the invention and the accompanying drawings disclosed above are intended to assist in understanding the content and spirit of the invention and are not intended to limit the invention. Any modification, substitution, improvement, etc. made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (4)

1. A multi-level feature extraction network pedestrian re-identification method based on deep learning comprises the following steps:
1) When each round of training is carried out on the pedestrian re-identification network, extracting a batch of data from the pedestrian re-identification data set each time and expanding the data by adopting data enhancement measures;
2) Training the pedestrian re-identification network with the expanded data; wherein the expanded data of the current batch are passed through Block1 and Block2 in sequence for feature extraction to obtain the feature vector F = {x_1, ..., x_b}, which is then fed into the first branch and the second branch of the pedestrian re-identification network; x_b denotes the feature data of the b-th image; Block1 and Block2 are feature extraction modules in SEResNet-50, SFGM is the secondary feature generation module, GeM is generalized mean pooling, and BN is batch normalization;
3) The first branch extracts the global feature of each image from the feature vector F; the second branch extracts the secondary features of each image from the feature vector F; the method by which the second branch extracts the secondary features is as follows: a secondary feature vector to be extracted is first generated by SFGM and then processed by Block3-1, after which it is fed into two branches, a second-level branch one and a second-level branch two; the second-level branch one passes the input feature through Block4-1 and GeM in sequence to obtain the secondary first-level feature; the second-level branch two obtains a new secondary feature vector to be extracted from the input feature through SFGM; after passing through Block4-2, this vector is divided into two branches, a third-level branch one and a third-level branch two; the third-level branch one obtains the secondary second-level feature from the vector through GeM; the third-level branch two passes it through SFGM and GeM in sequence to obtain the secondary third-level feature; Block3-1, Block4-1 and Block4-2 are feature extraction modules in SEResNet-50; the data processing of the secondary feature generation module SFGM is as follows: with the input feature vector denoted OF, maximum pooling and mean pooling are applied to OF along the channel dimension, and the two pooled feature vectors are concatenated along the channel dimension to obtain the feature vector F1; a convolution operation is applied to F1 to obtain the feature vector F2; F2 is batch-normalized (BN operation); finally, a Sigmoid activation function yields a spatial weight matrix M with values in [0, 1]; a Mask of the positions whose weight in M is greater than P is then generated; through the Mask, the values greater than P in M are set to W and the remaining values are set to 1, giving the spatial weight matrix M1; M1 is expanded to the spatial weight matrix M2 with the same channel dimension as OF; M2 is point-wise multiplied with OF and the result is added to OF to obtain the feature vector F3; F3 is activated by the ReLU function and then output;
4) The global feature and each secondary feature are processed in sequence by a BN operation, a fully connected layer, Circle Loss, label smoothing and Softmax Loss to obtain the corresponding loss values; a loss value is computed for the global feature with the auxiliary loss function Center Loss, and another loss value with the loss function Triplet Loss with Adaptive Weights; the loss value calculated by the Triplet Loss with Adaptive Weights is

L_{TriAda}(a) = \log\left(1 + \exp\left( \sum_{p \in P(a)} w_{ap}\, d(x_a, x_p) - \sum_{n \in N(a)} w_{an}\, d(x_a, x_n) \right)\right),

with w_{ap} = \frac{\exp(d(x_a, x_p))}{\sum_{p' \in P(a)} \exp(d(x_a, x_{p'}))} and w_{an} = \frac{\exp(-d(x_a, x_n))}{\sum_{n' \in N(a)} \exp(-d(x_a, x_{n'}))},

where d(x_a, x_p) and d(x_a, x_n) denote the distance between the target x_a and the positive sample x_p and the negative sample x_n, respectively, and P(a) and N(a) denote the positive and negative sample sets of the target a; the loss value calculated by Center Loss is

L_C = \frac{1}{2} \sum_{i=1}^{B} \| x_i - c_{y_i} \|_2^2,

where B is the number of samples, x_i denotes the feature of sample i, and c_{y_i} denotes the center of the class of sample i;
5) Calculating a total loss value according to the loss values obtained in the step 4), and adjusting and optimizing network parameters of the pedestrian re-recognition network according to the total loss value to obtain a pedestrian re-recognition network after a batch of training is completed;
6) Repeating the steps 1) to 5) until the training of the set round is completed, and obtaining a pedestrian re-identification network after the training is completed;
7) For each image to be identified, the trained pedestrian re-identification network extracts its global feature and secondary features; the global feature and the secondary features are batch-normalized (BN) and then fused to obtain the feature of the image to be identified; the similarity between each image feature to be identified and the target image feature is calculated, and the results are sorted by similarity to obtain a list of pictures similar to the target image.
2. The method of claim 1, wherein the data enhancement is applied to each image separately by:
11) adjusting the image to a height-to-width ratio of 2:1 to generate a new image;
12) randomly flipping the image horizontally to generate a new image;
13) randomly cropping the image to generate a new image;
14) normalizing the image to generate a new image;
15) randomly erasing part of the image to generate a new image.
3. The method of claim 2, wherein in step 11), the image is resized to [height, width] = [256, 128] pixels; branch one extracts the global feature from the feature vector F through Block3, Block4 and GeM in sequence, with the stride in Block4 adjusted from 2 to 1; Block3 and Block4 are feature extraction modules in SEResNet-50.
4. The method of claim 1, wherein in step 4), the Circle Loss is used to adjust the inter-class distances, and then Softmax Loss is used to maximize the log likelihood of all classes in the probability space.
CN202110652283.8A 2021-06-11 2021-06-11 Deep learning-based multi-level feature extraction network pedestrian re-identification method Active CN113591545B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110652283.8A CN113591545B (en) 2021-06-11 2021-06-11 Deep learning-based multi-level feature extraction network pedestrian re-identification method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110652283.8A CN113591545B (en) 2021-06-11 2021-06-11 Deep learning-based multi-level feature extraction network pedestrian re-identification method

Publications (2)

Publication Number Publication Date
CN113591545A CN113591545A (en) 2021-11-02
CN113591545B true CN113591545B (en) 2024-05-24

Family

ID=78243649

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110652283.8A Active CN113591545B (en) 2021-06-11 2021-06-11 Deep learning-based multi-level feature extraction network pedestrian re-identification method

Country Status (1)

Country Link
CN (1) CN113591545B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114663685B (en) * 2022-02-25 2023-07-04 江南大学 Pedestrian re-recognition model training method, device and equipment
CN114419678B (en) * 2022-03-30 2022-06-14 南京甄视智能科技有限公司 Training and recognition method, device, medium and equipment based on pedestrian re-recognition

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019010950A1 (en) * 2017-07-13 2019-01-17 北京大学深圳研究生院 Depth discrimination network model method for pedestrian re-recognition in image or video
CN109977893A (en) * 2019-04-01 2019-07-05 厦门大学 Depth multitask pedestrian recognition methods again based on the study of level conspicuousness channel
WO2021043168A1 (en) * 2019-09-05 2021-03-11 华为技术有限公司 Person re-identification network training method and person re-identification method and apparatus
JP6830707B1 (en) * 2020-01-23 2021-02-17 同▲済▼大学 Person re-identification method that combines random batch mask and multi-scale expression learning
CN111507217A (en) * 2020-04-08 2020-08-07 南京邮电大学 Pedestrian re-identification method based on local resolution feature fusion
CN112200111A (en) * 2020-10-19 2021-01-08 厦门大学 Global and local feature fused occlusion robust pedestrian re-identification method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Cross-dataset person re-identification method based on multi-pooling fusion and a background elimination network; 李艳凤, 张斌, 孙嘉, 陈后金, 朱锦雷; Journal on Communications; 2020-12-31 (No. 10); full text *
Research on person re-identification with multi-granularity feature fusion; 张良, 车进, 杨琦; Chinese Journal of Liquid Crystals and Displays; 2020-06-15 (No. 06); full text *

Also Published As

Publication number Publication date
CN113591545A (en) 2021-11-02

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant