CN110728221A - Multi-attribute constrained pedestrian re-identification method - Google Patents

Multi-attribute constrained pedestrian re-identification method

Info

Publication number
CN110728221A
CN110728221A (Application CN201910941997.3A)
Authority
CN
China
Prior art keywords
net
layer
pedestrian
neurons
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910941997.3A
Other languages
Chinese (zh)
Inventor
全红艳
刘超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
East China Normal University
Original Assignee
East China Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by East China Normal University filed Critical East China Normal University
Priority to CN201910941997.3A
Publication of CN110728221A
Legal status: Pending

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/103 Static body considered as a whole, e.g. static pedestrian or occupant recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a multi-attribute constrained pedestrian re-identification method which combines global features and local features during learning and training to obtain more accurate identification results.

Description

Multi-attribute constrained pedestrian re-identification method
Technical Field
The invention relates to the technical field of pedestrian re-identification, in particular to a multi-attribute constrained pedestrian re-identification method.
Background
In recent years, with the rise of Internet technology, intelligent video surveillance has developed rapidly in public security work and plays an important role in maintaining social security and stability. Pedestrian re-identification is dedicated to judging whether a specific pedestrian appears in a given image or video sequence. Traditional pedestrian re-identification techniques express pedestrian features through colour, texture and other low-level visual features; however, such manually extracted low-level features struggle with large illumination differences, low resolution, partial occlusion between pedestrians and similar problems. Pedestrian re-identification techniques based on deep learning overcome the shortcomings of traditional manual feature extraction, yet they still face the following difficulties: the environment around the pedestrian contains complex and redundant background interference, and pedestrian postures vary with the capturing camera. Moreover, existing methods rarely exploit the local attribute information of pedestrians. How to use attribute information efficiently to learn locally distinguishable pedestrian features, build an effective deep learning model and realize pedestrian re-identification is therefore a practical problem that urgently needs to be solved.
disclosure of Invention
Aiming at the defects of the prior art and the practical problems in pedestrian re-identification such as low resolution, occlusion and non-uniform pedestrian posture, the invention aims to provide an efficient pedestrian re-identification method by constructing a convolutional neural network with a deep learning strategy.
the specific technical scheme for realizing the purpose of the invention is as follows:
a multi-attribute constraint pedestrian re-identification method is characterized in that a single RGB image with the resolution of H multiplied by W is input, H is 128, 256 or 384, and W is H/2, and the method specifically comprises the following steps:
step 1: data set construction
Download the single-person image data set Market1501 from the website http://blog.fangchengjin.cn/reid-marker-1501.html, and select M images from Market1501 to construct a data set R = {h_η | 0 ≤ η ≤ M−1}, with 20000 ≤ M ≤ 40000. Each image in R contains X kinds of pedestrian attributes, 6 ≤ X ≤ 30. N attributes, including gender, age, hairstyle and the like, are taken out of the X, with X−4 ≤ N ≤ X−2, and are encoded with one-hot codes. The images in R have K categories, 500 ≤ K ≤ 2000, and each pedestrian image is assigned a category J_c, 1 ≤ c ≤ K;
The training set α is constructed as follows: ε images are taken out of R (the constraint on ε is given only as a formula image in the original). The image resolution is scaled to H×W; with any pixel in the pedestrian region as the centre, the RGB intensities of the three channels are randomly perturbed within a surrounding square region whose proportion of the pixels of the whole image is S, 2% ≤ S ≤ 20%. The processed images are denoted I_i, 0 ≤ i ≤ ε; the I_i constitute the training set α, and the remaining images in R constitute the test set γ;
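As a concrete illustration of this Step-1 augmentation, the following is a minimal NumPy/OpenCV sketch under stated assumptions: the perturbation amplitude (±30 intensity levels), the uniform choice of the patch centre and the file name sample.jpg are assumptions, since the text only fixes the area ratio S.

```python
import numpy as np
import cv2

def perturb_rgb_patch(image, s_ratio=0.1):
    """Step-1 style augmentation sketch: choose a random pixel as the centre of a
    square patch covering a fraction s_ratio of the image pixels, and randomly
    perturb the three RGB channel intensities inside that patch."""
    h, w, _ = image.shape
    side = int(np.sqrt(s_ratio * h * w))              # square side so that area/(h*w) = S
    cy, cx = np.random.randint(0, h), np.random.randint(0, w)
    y0, y1 = max(0, cy - side // 2), min(h, cy + side // 2)
    x0, x1 = max(0, cx - side // 2), min(w, cx + side // 2)
    noise = np.random.randint(-30, 31, size=(y1 - y0, x1 - x0, 3))   # assumed range
    patch = image[y0:y1, x0:x1].astype(np.int16) + noise
    out = image.copy()
    out[y0:y1, x0:x1] = np.clip(patch, 0, 255).astype(np.uint8)
    return out

# usage: scale to H x W = 256 x 128 (cv2.resize takes (width, height)), then perturb
img = cv2.resize(cv2.imread("sample.jpg"), (128, 256))
aug = perturb_rgb_patch(img, s_ratio=0.1)
```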
step 2: constructing neural networks
The neural network consists of three sub-networks: a low/mid-level feature sub-network S-Net, a global feature sub-network G-Net and a local fine-grained feature sub-network L-Net;
For S-Net, the input is I_i, with input tensor shape H×W×3, and the output is two features of different scales, A_i and B_i: A_i is a high-resolution feature of size m×n×1024 and B_i is a low-resolution feature of size a×b×2048, where m is 8, 16 or 24, n = m/2, a = m/2 and b = a/2;
For G-Net, B_i is the input and the output is the class D of I_i, 0 < D < K−1;
For L-Net, A_i is the input and the output has two results: one is the class of I_i, and the other is the probabilities of the N pedestrian attributes of the sample image I_i;
For S-Net, a ResNet101 structure is adopted; on the basis of the basic ResNet101 structure, the maximum pooling layer and the fully connected layer after the residual modules are removed;
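The following PyTorch sketch shows one way to realize such an S-Net from the torchvision ResNet101, returning A_i from the third residual stage and B_i from the fourth; the module names (stem, layer1, ...) come from torchvision, and the exact tap points are an assumption consistent with the stated feature sizes.

```python
import torch.nn as nn
from torchvision import models

class SNet(nn.Module):
    """Sketch of the low/mid-level feature sub-network: a ResNet101 trunk with the
    final average-pooling and fully connected layers removed, returning the
    layer3 output as A_i (1024 channels) and the layer4 output as B_i (2048 channels)."""
    def __init__(self, pretrained=False):
        super().__init__()
        backbone = models.resnet101(pretrained=pretrained)
        self.stem = nn.Sequential(backbone.conv1, backbone.bn1,
                                  backbone.relu, backbone.maxpool)
        self.layer1, self.layer2 = backbone.layer1, backbone.layer2
        self.layer3, self.layer4 = backbone.layer3, backbone.layer4

    def forward(self, x):                 # x: (batch, 3, H, W), e.g. H=256, W=128
        x = self.stem(x)
        x = self.layer2(self.layer1(x))
        a = self.layer3(x)                # A_i: (batch, 1024, 16, 8) for a 256x128 input
        b = self.layer4(a)                # B_i: (batch, 2048, 8, 4)
        return a, b
```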
For G-Net, the network structure is set to the following 3 layers: a max pooling layer, a convolutional layer and a fully connected layer. The max pooling layer has 2048 channels and a spatial pooling scale of a×b; the convolutional layer consists of convolution, batch normalization, ReLU activation and Dropout operations, with 1024 convolution kernels of shape 1×1; the number of neurons in the fully connected layer is P, 0 < P < K−1;
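A minimal sketch of such a G-Net head is given below, assuming the a×b max pooling collapses the spatial dimensions and the fully connected layer acts on the 1024-dimensional pooled feature; P = 751 and the Dropout rate of 0.5 are taken from the embodiment.

```python
import torch.nn as nn

class GNet(nn.Module):
    """Sketch of the global feature sub-network: a x b max pooling, a 1x1 conv block
    (Conv-BN-ReLU-Dropout) reducing 2048 -> 1024 channels, and a fully connected
    classifier with P outputs."""
    def __init__(self, a=8, b=4, num_classes=751, p_drop=0.5):
        super().__init__()
        self.pool = nn.MaxPool2d(kernel_size=(a, b))          # 2048 x a x b -> 2048 x 1 x 1
        self.conv = nn.Sequential(
            nn.Conv2d(2048, 1024, kernel_size=1),
            nn.BatchNorm2d(1024),
            nn.ReLU(inplace=True),
            nn.Dropout(p_drop),
        )
        self.fc = nn.Linear(1024, num_classes)

    def forward(self, b_feat):            # b_feat: (batch, 2048, a, b) from S-Net
        g = self.conv(self.pool(b_feat)).flatten(1)   # (batch, 1024) global feature
        return g, self.fc(g)                          # feature and class scores
```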
For L-Net, the structure is arranged as a spatial transformation network G_s to which 4 sub-branch structures are connected:
The input of G_s is A_i; its output part is a neuron structure with the same number of neurons as A_i, and these neurons store the global spatial feature e. G_s consists of a localization module, a grid generation module and a sampling module:
1) The localization module consists of 2 convolutional layers, 1 global average pooling layer and 1 fully connected layer; the numbers of kernels in the 2 convolutional layers are 512 and 128 respectively, and the fully connected layer has 6 neurons, which store the spatial affine transformation parameters of the image;
2) The grid generation module has m×n neurons; each neuron stores the spatial-domain coordinate corresponding to each feature of A_i, and this coordinate map is denoted O;
3) The sampling module consists of m×n neurons, with A_i and O as input; each neuron receives the result of bilinear interpolation over the neighbourhood pixels around its sampling position;
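These three modules follow the standard spatial transformer pattern. The sketch below is one possible PyTorch realization, assuming a reasonably recent PyTorch, 3×3 kernels in the localization convolutions (the kernel sizes are not stated in the text) and identity initialization of the affine parameters.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialTransformer(nn.Module):
    """Sketch of G_s: a localization net (two convs, global average pooling, a
    6-neuron FC producing 2x3 affine parameters), followed by grid generation
    and bilinear sampling over the input feature map A_i."""
    def __init__(self, in_channels=1024):
        super().__init__()
        self.localization = nn.Sequential(
            nn.Conv2d(in_channels, 512, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(512, 128, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc_theta = nn.Linear(128, 6)
        # initialize to the identity transform so training starts from "no warp"
        nn.init.zeros_(self.fc_theta.weight)
        self.fc_theta.bias.data.copy_(torch.tensor([1., 0., 0., 0., 1., 0.]))

    def forward(self, a_feat):                       # a_feat: (batch, 1024, m, n)
        theta = self.fc_theta(self.localization(a_feat).flatten(1)).view(-1, 2, 3)
        grid = F.affine_grid(theta, a_feat.size(), align_corners=False)   # grid generation (O)
        return F.grid_sample(a_feat, grid, align_corners=False)           # bilinear sampling -> e
```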
Connected to G_s are 4 sub-branch structures:
Among them, 3 branch structures are designed as follows: their input comes from horizontally splitting the feature e into three equal parts, i.e. the upper, middle and lower thirds of e, giving features Q_1, Q_2 and Q_3. Besides a spatial transformation network whose structure is exactly the same as G_s, each of these branch structures also contains 1 max pooling layer, 1 convolutional layer and 2 fully connected layers: the convolutional layer has 256 kernels of shape 1×1, the tensor after convolution is reshaped to 1×256 and fed into the first fully connected layer, which has 256 neurons; the 2nd fully connected layer follows the 1st and is divided into 3 groups of neurons, with 2, 2 and P neurons respectively;
In addition to these 3 branch structures there is a 4th branch structure. The features output by the 3 branch structures are concatenated into a tensor R (its shape is given only as a formula image in the original), and R is the input of the 4th branch structure, which consists of a max pooling layer, a convolutional layer and 2 fully connected layers: the max pooling layer has 1024 channels, with a spatial pooling scale given only as a formula image in the original; the convolutional layer consists of convolution, batch normalization, ReLU activation and Dropout operations, with 1024 kernels of shape 1×1; the tensor after convolution is reshaped to 1×1024 and fed into the first fully connected layer, which has 1024 neurons. The structure of the 2 fully connected layers is the same as the fully connected structure in the 3 branches, except that the 2nd fully connected layer is divided into 2 groups of neurons, with 2 and P neurons respectively;
and step 3: training of neural networks
The sample images in the test set γ are divided into a sample data set β and a test data set δ in the ratio 1:4. The network model is trained with the training set α; the pedestrian samples to be identified are taken from β, and β together with δ is used to evaluate and test the performance of the network;
During training, S-Net and G-Net are first trained simultaneously for 50 epochs; then S-Net, G-Net and L-Net are trained simultaneously for 200 epochs; finally the parameters of the first two layers of S-Net are fixed and the parameters of the other layers of S-Net, G-Net and L-Net are fine-tuned for 100 epochs;
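A short sketch of the final fine-tuning stage is given below; it reuses the SNet sketch above and assumes that "the first two layers" correspond to its stem and layer1, which the text does not specify.

```python
# Freeze the first two layers of S-Net and fine-tune only the remaining parameters
# ("first two layers" is here assumed to mean the stem and layer1 of the SNet sketch).
s_net = SNet(pretrained=False)
for module in (s_net.stem, s_net.layer1):
    for p in module.parameters():
        p.requires_grad = False
trainable = [p for p in s_net.parameters() if p.requires_grad]   # passed to the optimizer
```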
The loss for G-Net is defined as:

L_g = -Σ_d q_d log(p_d)    (1)

where p_d denotes the probability that I_i belongs to class d and q_d denotes the label value of I_i for class d;
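A minimal sketch of this classification loss in PyTorch is shown below; the batch size and class count are illustrative, and F.cross_entropy with integer labels is equivalent to the one-hot form above.

```python
import torch
import torch.nn.functional as F

logits = torch.randn(32, 751)              # batch of class scores from G-Net (P = 751)
labels = torch.randint(0, 751, (32,))      # ground-truth class indices
loss_g = F.cross_entropy(logits, labels)   # -sum_d q_d log p_d for one-hot q
```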
The loss for L-Net is defined as:

[Formula (2): the L-Net loss L_l, built from the class loss L_E and the attribute loss L_T with the hyper-parameter ρ; given only as a formula image in the original]

where L_E denotes a class loss defined in the same manner as formula (1), with p_d now denoting the probability that a local feature of I_i belongs to class d and q_d the label of the local feature of I_i for class d; L_T is also defined in the same manner as formula (1), but when computing L_T there are only 2 categories, belonging and not belonging, and p_d is the binary classification probability of whether a local feature of L-Net belongs to an attribute of I_i; ρ is a hyper-parameter constant, taken as 0.25;
When S-Net, G-Net and L-Net are trained simultaneously, the loss is defined as:

[Formula (3): the loss L_t used when S-Net, G-Net and L-Net are trained simultaneously; given only as a formula image in the original]

where, within the same batch of training samples, the features obtained after the G-Net convolution operations for two different pedestrian sample images belonging to the same class x are denoted f_x^(1) and f_x^(2), and the feature obtained by the G-Net convolution operations for a pedestrian sample image not belonging to class x is denoted f_y; θ is a hyper-parameter constant, taken as 1.0; Z is the number of same-class samples in the same batch, and U denotes the number of classes of samples in the same batch;
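Since formula (3) is available only as an image, the sketch below shows a standard triplet-style margin loss over the quantities just described; this is an assumption about its form, not the patent's verbatim formula: same-class feature pairs are pulled together and features of a different class are pushed at least θ away.

```python
import torch
import torch.nn.functional as F

def margin_loss(f_anchor, f_pos, f_neg, theta=1.0):
    """Illustrative margin loss over G-Net features: d(anchor, positive) should be
    smaller than d(anchor, negative) by at least the margin theta."""
    d_pos = F.pairwise_distance(f_anchor, f_pos)   # distance between two samples of class x
    d_neg = F.pairwise_distance(f_anchor, f_neg)   # distance to a sample of another class
    return F.relu(d_pos - d_neg + theta).mean()

# illustrative usage with random 1024-dimensional features
fa, fp, fn = torch.randn(32, 1024), torch.randn(32, 1024), torch.randn(32, 1024)
loss_t = margin_loss(fa, fp, fn, theta=1.0)
```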
When the parameters of the other layers of S-Net, G-Net and L-Net are fine-tuned, the total loss is defined as:

L_Y = L_g + L_l + L_t    (4)
and 4, step 4: pedestrian re-identification method
A sample is selected from the sample data set β and input into the network, which is built from the trained model parameters. The network makes a prediction, and the features obtained from G-Net and L-Net are concatenated to give the predicted feature e. The features of each sample in δ are predicted in the same way, and the Euclidean distance between each of these features and e is measured; the sample in δ for which the Euclidean distance reaches its minimum gives the pedestrian re-identification result.
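A minimal sketch of this nearest-neighbour retrieval is shown below; the feature dimension and tensor names are illustrative, and the concatenation of the G-Net and L-Net outputs is assumed to have already been performed.

```python
import torch

def retrieve(query_feat, gallery_feats):
    """Return the index of the gallery sample (from delta) whose concatenated
    G-Net/L-Net feature has the minimum Euclidean distance to the query feature e."""
    dists = torch.cdist(query_feat.unsqueeze(0), gallery_feats).squeeze(0)
    return torch.argmin(dists).item()

# illustrative usage with random 2048-dimensional features
query = torch.randn(2048)
gallery = torch.randn(100, 2048)
best = retrieve(query, gallery)
```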
The invention is efficient and practical. It combines global features and local features for learning and training and obtains more accurate recognition results. The network structure of the method comprises a low/mid-level feature sub-network, a global feature sub-network and a local fine-grained feature sub-network. The low/mid-level feature sub-network learns the mid- and low-level features of pedestrians; the global feature sub-network further learns high-level pedestrian semantic features and can effectively distinguish the overall characteristics of pedestrians; the local fine-grained feature sub-network is designed with a spatial transformation network and, by combining the low/mid-level features with the learning of local human appearance information, effectively improves the accuracy of local pedestrian appearance recognition. The invented pedestrian re-identification method can achieve a high recognition rate in complex background environments.
The invention can be applied in fields such as intelligent security, video surveillance and pedestrian retrieval, can serve as a support for face recognition technology, and has high application value in public-security investigation work, image retrieval and other scenarios.
Drawings
FIG. 1 is a pedestrian search result diagram of Rank-10 according to the present invention;
fig. 2 is a diagram illustrating the result of the pedestrian attribute prediction according to the present invention.
Detailed Description
Examples
The invention is further described below with reference to the accompanying drawings;
the embodiment is implemented under a Windows 1064-bit operating system on a PC, and the hardware configuration thereof is
Figure BDA0002223165640000051
CoreTMi5-75003.4GHz, a video card NVIDIA GeForce GTX 10606G, a deep learning frame Pythrch 0.4, a programming language adopting Python 3.6, and mainly utilizing Python libraries of OpenCV 3.3.1 and NumPy 1.14.3;
the specific technical scheme for realizing the invention is as follows: a pedestrian re-identification method with multi-attribute constraint is characterized in that a single RGB image with H multiplied by W resolution is input, H is 256, W is 128, a convolutional neural network is constructed, and meanwhile, the difficulties of low image resolution, shielding, inconsistent pedestrian postures and the like are considered, and the method specifically comprises the following steps:
step 1: data set construction
Download the single-person image data set Market1501 from the website http://blog.fangchengjin.cn/reid-marker-1501.html, and select M images from Market1501 to construct a data set R = {h_η | 0 ≤ η ≤ M−1}, with M = 29419. Each image in R contains X kinds of pedestrian attributes, X = 27. N attributes are taken out of the X, with N = 7 and X−4 ≤ N ≤ X−2, and are encoded with one-hot codes; they include gender, age, hairstyle and the like. The images in R have K categories, K = 1501, and each pedestrian image is assigned a category J_c, 1 ≤ c ≤ K;
The training set α is constructed as follows: ε images are taken out of R (the constraint on ε is given only as a formula image in the original). The image resolution is scaled to H×W; with any pixel in the pedestrian region as the centre, the RGB intensities of the three channels are randomly perturbed within a surrounding square region whose proportion of the pixels of the whole image is S, 2% ≤ S ≤ 20%. The processed images are denoted I_i, 0 ≤ i ≤ ε; the I_i constitute the training set α, and the remaining images in R constitute the test set γ;
step 2: constructing neural networks
The neural network consists of three sub-networks: a low/mid-level feature sub-network S-Net, a global feature sub-network G-Net and a local fine-grained feature sub-network L-Net;
For S-Net, the input is I_i, with input tensor shape H×W×3, and the output is two features of different scales, A_i and B_i: A_i is a high-resolution feature of size m×n×1024, with m = 16 and n = 8, and B_i is a low-resolution feature of size a×b×2048, with a = 8 and b = 4;
For G-Net, B_i is the input and the output is the class D of I_i, 0 < D < K−1;
For L-Net, A_i is the input and the output has two results: one is the class of I_i, and the other is the probabilities of the N pedestrian attributes of the sample image I_i;
For S-Net, a ResNet101 structure is adopted; on the basis of the basic ResNet101 structure, the maximum pooling layer and the fully connected layer after the residual modules are removed;
For G-Net, the network structure is set to the following 3 layers: a max pooling layer, a convolutional layer and a fully connected layer. The max pooling layer has 2048 channels and a spatial pooling scale of a×b; the convolutional layer consists of convolution, batch normalization, ReLU activation and Dropout operations, with 1024 convolution kernels of shape 1×1; the number of neurons in the fully connected layer is P, with P = 751;
For L-Net, the structure is arranged as a spatial transformation network G_s to which 4 sub-branch structures are connected:
The input of G_s is A_i; its output part is a neuron structure with the same number of neurons as A_i, and these neurons store the global spatial feature e. G_s consists of a localization module, a grid generation module and a sampling module:
1) The localization module consists of 2 convolutional layers, 1 global average pooling layer and 1 fully connected layer; the numbers of kernels in the 2 convolutional layers are 512 and 128 respectively, and the fully connected layer has 6 neurons, which store the spatial affine transformation parameters of the image;
2) The grid generation module has m×n neurons; each neuron stores the spatial-domain coordinate corresponding to each feature of A_i, and this coordinate map is denoted O;
3) The sampling module consists of m×n neurons, with A_i and O as input; each neuron receives the result of bilinear interpolation over the neighbourhood pixels around its sampling position;
Connected to G_s are 4 sub-branch structures:
Among them, 3 branch structures are designed as follows: their input comes from horizontally splitting the feature e into three equal parts, i.e. the upper, middle and lower thirds of e, giving features Q_1, Q_2 and Q_3. Besides a spatial transformation network whose structure is exactly the same as G_s, each of these branch structures also contains 1 max pooling layer, 1 convolutional layer and 2 fully connected layers: the convolutional layer has 256 kernels of shape 1×1, the tensor after convolution is reshaped to 1×256 and fed into the first fully connected layer, which has 256 neurons; the 2nd fully connected layer follows the 1st and is divided into 3 groups of neurons, with 2, 2 and P neurons respectively;
In addition to these 3 branch structures there is a 4th branch structure. The features output by the 3 branch structures are concatenated into a tensor R (its shape is given only as a formula image in the original), and R is the input of the 4th branch structure, which consists of a max pooling layer, a convolutional layer and 2 fully connected layers: the max pooling layer has 1024 channels, with a spatial pooling scale given only as a formula image in the original; the convolutional layer consists of convolution, batch normalization, ReLU activation and Dropout operations, with 1024 kernels of shape 1×1; the tensor after convolution is reshaped to 1×1024 and fed into the first fully connected layer, which has 1024 neurons. The structure of the 2 fully connected layers is the same as the fully connected structure in the 3 branches, except that the 2nd fully connected layer is divided into 2 groups of neurons, with 2 and P neurons respectively;
and step 3: training of neural networks
The sample images in the test set γ are divided into a sample data set β and a test data set δ in the ratio 1:4. The network model is trained with the training set α; the pedestrian samples to be identified are taken from β, and β together with δ is used to evaluate and test the performance of the network;
During training, S-Net and G-Net are first trained simultaneously for 50 epochs; then S-Net, G-Net and L-Net are trained simultaneously for 200 epochs; finally the parameters of the first two layers of S-Net are fixed and the parameters of the other layers of S-Net, G-Net and L-Net are fine-tuned for 100 epochs;
The loss for G-Net is defined as:

L_g = -Σ_d q_d log(p_d)    (1)

where p_d denotes the probability that I_i belongs to class d and q_d denotes the label value of I_i for class d;
The loss for L-Net is defined as:

[Formula (2): the L-Net loss L_l, built from the class loss L_E and the attribute loss L_T with the hyper-parameter ρ; given only as a formula image in the original]

where L_E denotes a class loss defined in the same manner as formula (1), with p_d now denoting the probability that a local feature of I_i belongs to class d and q_d the label of the local feature of I_i for class d; L_T is also defined in the same manner as formula (1), but when computing L_T there are only 2 categories, belonging and not belonging, and p_d is the binary classification probability of whether a local feature of L-Net belongs to an attribute of I_i; ρ is a hyper-parameter constant, taken as 0.25;
When S-Net, G-Net and L-Net are trained simultaneously, the loss is defined as:

[Formula (3): the loss L_t used when S-Net, G-Net and L-Net are trained simultaneously; given only as a formula image in the original]

where, within the same batch of training samples, the features obtained after the G-Net convolution operations for two different pedestrian sample images belonging to the same class x are denoted f_x^(1) and f_x^(2), and the feature obtained by the G-Net convolution operations for a pedestrian sample image not belonging to class x is denoted f_y; θ is a hyper-parameter constant, taken as 1.0; Z is the number of same-class samples in the same batch, and U denotes the number of classes of samples in the same batch;
When the parameters of the other layers of S-Net, G-Net and L-Net are fine-tuned, the total loss is defined as:

L_Y = L_g + L_l + L_t    (4)
and 4, step 4: pedestrian re-identification method
A sample is selected from the sample data set β and input into the network, which is built from the trained model parameters. The network makes a prediction, and the features obtained from G-Net and L-Net are concatenated to give the predicted feature e. The features of each sample in δ are predicted in the same way, and the Euclidean distance between each of these features and e is measured; the sample in δ for which the Euclidean distance reaches its minimum gives the pedestrian re-identification result.
The hyper-parameters during training are set as follows: the Dropout rate is 0.5; the optimizer is stochastic gradient descent (SGD); the batch size is 32. In the stage where S-Net and G-Net are trained simultaneously, the learning rate is 0.01 and the training period is 50 epochs; in the stage where S-Net, G-Net and L-Net are trained simultaneously, the initial learning rate is 0.01, the training period is 200 epochs, and the learning rate is reduced by 10% every 50 epochs; in the fine-tuning stage, the learning rate is 0.001 and the training period is 50 epochs.
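The following is a minimal sketch of this optimizer setup in PyTorch; the dummy module stands in for the parameters of the three sub-networks, and the momentum value is an assumption, since the text only specifies SGD, the learning rates, the batch size and the decay schedule.

```python
import torch
import torch.nn as nn

# dummy module standing in for the parameters of S-Net, G-Net and L-Net
model = nn.Linear(2048, 751)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
# learning rate reduced by 10% every 50 epochs during the 200-epoch joint stage
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=50, gamma=0.9)

for epoch in range(200):
    # ... one pass over the 32-sample batches, minimizing L_Y = L_g + L_l + L_t ...
    scheduler.step()
```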
Fig. 1 shows visual retrieval results. Six images of pedestrians to be retrieved are randomly selected from β; the selected pedestrian images cover occlusion, low pixel counts, inconsistent pedestrian postures and similar situations. Each row is a group of retrieval results, and the first column of each row is the pedestrian image to be retrieved. It can be seen that although the retrieved images have low resolution, contain partial occlusion and show inconsistent pedestrian postures, the model still achieves accurate retrieval results; the accuracy of the retrieval results for the 6 randomly selected images reaches 100%.
Fig. 2 shows the pedestrian attribute prediction results, for which 4 pedestrian images are randomly selected from β; each example comprises an input pedestrian and the attribute prediction result. The results show that the model can accurately predict the pedestrian attributes; in addition, although the input images contain redundant background, the model can still accurately locate the pedestrian attributes, which improves the accuracy of attribute recognition.

Claims (1)

1. A multi-attribute constrained pedestrian re-identification method, characterized in that a single RGB image of resolution H×W is input, where H is 128, 256 or 384 and W = H/2, the method specifically comprising the following steps:
step 1: data set construction
Download the single-person image data set Market1501 from the website http://blog.fangchengjin.cn/reid-marker-1501.html, and select M images from Market1501 to construct a data set R = {h_η | 0 ≤ η ≤ M−1}, with 20000 ≤ M ≤ 40000. Each image in R contains X kinds of pedestrian attributes, 6 ≤ X ≤ 30. N attributes, including gender, age, hairstyle and the like, are taken out of the X, with X−4 ≤ N ≤ X−2, and are encoded with one-hot codes. The images in R have K categories, 500 ≤ K ≤ 2000, and each pedestrian image is assigned a category J_c, 1 ≤ c ≤ K;
The training set α is constructed as follows: ε images are taken out of R (the constraint on ε is given only as a formula image in the original). The image resolution is scaled to H×W; with any pixel in the pedestrian region as the centre, the RGB intensities of the three channels are randomly perturbed within a surrounding square region whose proportion of the pixels of the whole image is S, 2% ≤ S ≤ 20%. The processed images are denoted I_i, 0 ≤ i ≤ ε; the I_i constitute the training set α, and the remaining images in R constitute the test set γ;
step 2: constructing neural networks
The neural network consists of three sub-networks: a low/mid-level feature sub-network S-Net, a global feature sub-network G-Net and a local fine-grained feature sub-network L-Net;
For S-Net, the input is I_i, with input tensor shape H×W×3, and the output is two features of different scales, A_i and B_i: A_i is a high-resolution feature of size m×n×1024 and B_i is a low-resolution feature of size a×b×2048, where m is 8, 16 or 24, n = m/2, a = m/2 and b = a/2;
For G-Net, B_i is the input and the output is the class D of I_i, 0 < D < K−1;
For L-Net, A_i is the input and the output has two results: one is the class of I_i, and the other is the probabilities of the N pedestrian attributes of the sample image I_i;
For S-Net, a ResNet101 structure is adopted; on the basis of the basic ResNet101 structure, the maximum pooling layer and the fully connected layer after the residual modules are removed;
For G-Net, the network structure is set to the following 3 layers: a max pooling layer, a convolutional layer and a fully connected layer. The max pooling layer has 2048 channels and a spatial pooling scale of a×b; the convolutional layer consists of convolution, batch normalization, ReLU activation and Dropout operations, with 1024 convolution kernels of shape 1×1; the number of neurons in the fully connected layer is P, 0 < P < K−1;
For L-Net, the structure is arranged as a spatial transformation network G_s to which 4 sub-branch structures are connected:
The input of G_s is A_i; its output part is a neuron structure with the same number of neurons as A_i, and these neurons store the global spatial feature e. G_s consists of a localization module, a grid generation module and a sampling module:
1) The localization module consists of 2 convolutional layers, 1 global average pooling layer and 1 fully connected layer; the numbers of kernels in the 2 convolutional layers are 512 and 128 respectively, and the fully connected layer has 6 neurons, which store the spatial affine transformation parameters of the image;
2) The grid generation module has m×n neurons; each neuron stores the spatial-domain coordinate corresponding to each feature of A_i, and this coordinate map is denoted O;
3) The sampling module consists of m×n neurons, with A_i and O as input; each neuron receives the result of bilinear interpolation over the neighbourhood pixels around its sampling position;
Connected to G_s are 4 sub-branch structures:
Among them, 3 branch structures are designed as follows: their input comes from horizontally splitting the feature e into three equal parts, i.e. the upper, middle and lower thirds of e, giving features Q_1, Q_2 and Q_3. Besides a spatial transformation network whose structure is exactly the same as G_s, each of these branch structures also contains 1 max pooling layer, 1 convolutional layer and 2 fully connected layers: the convolutional layer has 256 kernels of shape 1×1, the tensor after convolution is reshaped to 1×256 and fed into the first fully connected layer, which has 256 neurons; the 2nd fully connected layer follows the 1st and is divided into 3 groups of neurons, with 2, 2 and P neurons respectively;
In addition to these 3 branch structures there is a 4th branch structure. The features output by the 3 branch structures are concatenated into a tensor R (its shape is given only as a formula image in the original), and R is the input of the 4th branch structure, which consists of a max pooling layer, a convolutional layer and 2 fully connected layers: the max pooling layer has 1024 channels, with a spatial pooling scale given only as a formula image in the original; the convolutional layer consists of convolution, batch normalization, ReLU activation and Dropout operations, with 1024 kernels of shape 1×1; the tensor after convolution is reshaped to 1×1024 and fed into the first fully connected layer, which has 1024 neurons. The structure of the 2 fully connected layers is the same as the fully connected structure in the 3 branches, except that the 2nd fully connected layer is divided into 2 groups of neurons, with 2 and P neurons respectively;
and step 3: training of neural networks
The sample images in the test set γ are divided into a sample data set β and a test data set δ in the ratio 1:4. The network model is trained with the training set α; the pedestrian samples to be identified are taken from β, and β together with δ is used to evaluate and test the performance of the network;
During training, S-Net and G-Net are first trained simultaneously for 50 epochs; then S-Net, G-Net and L-Net are trained simultaneously for 200 epochs; finally the parameters of the first two layers of S-Net are fixed and the parameters of the other layers of S-Net, G-Net and L-Net are fine-tuned for 100 epochs;
The loss for G-Net is defined as:

L_g = -Σ_d q_d log(p_d)    (1)

where p_d denotes the probability that I_i belongs to class d and q_d denotes the label value of I_i for class d;
The loss for L-Net is defined as:

[Formula (2): the L-Net loss L_l, built from the class loss L_E and the attribute loss L_T with the hyper-parameter ρ; given only as a formula image in the original]

where L_E denotes a class loss defined in the same manner as formula (1), with p_d now denoting the probability that a local feature of I_i belongs to class d and q_d the label of the local feature of I_i for class d; L_T is also defined in the same manner as formula (1), but when computing L_T there are only 2 categories, belonging and not belonging, and p_d is the binary classification probability of whether a local feature of L-Net belongs to an attribute of I_i; ρ is a hyper-parameter constant, taken as 0.25;
When S-Net, G-Net and L-Net are trained simultaneously, the loss is defined as:

[Formula (3): the loss L_t used when S-Net, G-Net and L-Net are trained simultaneously; given only as a formula image in the original]

where, within the same batch of training samples, the features obtained after the G-Net convolution operations for two different pedestrian sample images belonging to the same class x are denoted f_x^(1) and f_x^(2), and the feature obtained by the G-Net convolution operations for a pedestrian sample image not belonging to class x is denoted f_y; θ is a hyper-parameter constant, taken as 1.0; Z is the number of same-class samples in the same batch, and U denotes the number of classes of samples in the same batch;
When the parameters of the other layers of S-Net, G-Net and L-Net are fine-tuned, the total loss is defined as:

L_Y = L_g + L_l + L_t    (4)
and 4, step 4: pedestrian re-identification method
A sample is selected from the sample data set β and input into the network, which is built from the trained model parameters. The network makes a prediction, and the features obtained from G-Net and L-Net are concatenated to give the predicted feature e. The features of each sample in δ are predicted in the same way, and the Euclidean distance between each of these features and e is measured; the sample in δ for which the Euclidean distance reaches its minimum gives the pedestrian re-identification result.
CN201910941997.3A 2019-09-30 2019-09-30 Multi-attribute constrained pedestrian re-identification method Pending CN110728221A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910941997.3A CN110728221A (en) 2019-09-30 2019-09-30 Multi-attribute constrained pedestrian re-identification method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910941997.3A CN110728221A (en) 2019-09-30 2019-09-30 Multi-attribute constrained pedestrian re-identification method

Publications (1)

Publication Number Publication Date
CN110728221A 2020-01-24

Family

ID=69218671

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910941997.3A Pending CN110728221A (en) 2019-09-30 2019-09-30 Multi-attribute constrained pedestrian re-identification method

Country Status (1)

Country Link
CN (1) CN110728221A (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108960127A (en) * 2018-06-29 2018-12-07 厦门大学 Occluded pedestrian re-identification method based on adaptive deep metric learning

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108960127A (en) * 2018-06-29 2018-12-07 厦门大学 Occluded pedestrian re-identification method based on adaptive deep metric learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
CHAO LIU, HONGYANG QUAN: "A Global-Local Architecture Constrained by Multiple Attributes for Person Re-identification" *

Similar Documents

Publication Publication Date Title
CN108764372B (en) Data set construction method and device, mobile terminal, and readable storage medium
CN108537742B (en) Remote sensing image panchromatic sharpening method based on generation countermeasure network
Cordonnier et al. Differentiable patch selection for image recognition
CN108052911B (en) Deep learning-based multi-mode remote sensing image high-level feature fusion classification method
CN105740894B (en) Semantic annotation method for hyperspectral remote sensing image
CN112651978B (en) Sublingual microcirculation image segmentation method and device, electronic equipment and storage medium
CN109271895B (en) Pedestrian re-identification method based on multi-scale feature learning and feature segmentation
CN113221641B (en) Video pedestrian re-identification method based on generation of antagonism network and attention mechanism
CN109684922B (en) Multi-model finished dish identification method based on convolutional neural network
CN112347970B (en) Remote sensing image ground object identification method based on graph convolution neural network
CN109741341A (en) Image segmentation method based on super-pixels and long short-term memory network
Gu et al. Blind image quality assessment via vector regression and object oriented pooling
CN111652273B (en) Deep learning-based RGB-D image classification method
CN114782997B (en) Pedestrian re-recognition method and system based on multi-loss attention self-adaptive network
Zulfiqar et al. AI-ForestWatch: semantic segmentation based end-to-end framework for forest estimation and change detection using multi-spectral remote sensing imagery
CN111626357B (en) Image identification method based on neural network model
CN110427819A (en) Method and related device for identifying PPT frames in an image
CN111832479B (en) Video target detection method based on improved self-adaptive anchor point R-CNN
CN114548256A (en) Small sample rare bird identification method based on comparative learning
CN111639697B (en) Hyperspectral image classification method based on non-repeated sampling and prototype network
CN113111716A (en) Remote sensing image semi-automatic labeling method and device based on deep learning
CN110688966B (en) Semantic guidance pedestrian re-recognition method
CN114898158A (en) Small sample traffic abnormity image acquisition method and system based on multi-scale attention coupling mechanism
CN115272956A (en) Chicken health degree monitoring method based on improved YOLOv5
CN113077438B (en) Cell nucleus region extraction method and imaging method for multi-cell nucleus color image

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20200124)