CN110728221A - Multi-attribute constrained pedestrian re-identification method - Google Patents

Multi-attribute constrained pedestrian re-identification method

Info

Publication number
CN110728221A
CN110728221A (Application CN201910941997.3A)
Authority
CN
China
Prior art keywords
net
layer
pedestrian
neurons
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910941997.3A
Other languages
Chinese (zh)
Inventor
全红艳
刘超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
East China Normal University
Original Assignee
East China Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by East China Normal University filed Critical East China Normal University
Priority to CN201910941997.3A
Publication of CN110728221A
Legal status: Pending

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/103 Static body considered as a whole, e.g. static pedestrian or occupant recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a multi-attribute constrained pedestrian re-identification method which combines global features and local features during learning and training to obtain more accurate identification results.

Description

Multi-attribute constrained pedestrian re-identification method
Technical Field
The invention relates to the technical field of pedestrian re-identification, in particular to a multi-attribute constrained pedestrian re-identification method.
Background
In recent years, with the rise of Internet technology, intelligent video surveillance has developed rapidly in public security work and plays an important role in maintaining social security and stability. Pedestrian re-identification is dedicated to judging whether a specific pedestrian appears in a given image or video sequence. Traditional pedestrian re-identification techniques express pedestrian features through colour, texture and other low-level visual features; however, such manually extracted low-level features struggle with large illumination differences, low resolution, partial occlusion between pedestrians and similar problems. Pedestrian re-identification techniques based on deep learning overcome the shortcomings of traditional manual feature extraction, yet they still face the following difficulties: the environment around the pedestrian contains complex and redundant background interference, and pedestrian postures vary with the capturing camera. Moreover, existing methods rarely exploit the local attribute information of pedestrians. How to use attribute information efficiently to learn locally distinguishable pedestrian features, build an effective deep learning model and realize pedestrian re-identification is therefore a practical problem that urgently needs to be solved.
disclosure of Invention
Aiming at the defects of the prior art and the practical problems in pedestrian re-identification such as low resolution, occlusion and non-uniform pedestrian posture, the invention aims to provide an efficient pedestrian re-identification method by constructing a convolutional neural network with a deep learning strategy.
the specific technical scheme for realizing the purpose of the invention is as follows:
a multi-attribute constraint pedestrian re-identification method is characterized in that a single RGB image with the resolution of H multiplied by W is input, H is 128, 256 or 384, and W is H/2, and the method specifically comprises the following steps:
step 1: data set construction
Download the single-person image data set Market1501 from the website http://blog.fangchengjin.cn/reid-marker-1501.html, and select M images from Market1501 to construct a data set R = {h_η | 0 ≤ η ≤ M−1}, with 20000 ≤ M ≤ 40000. Each image in R contains X kinds of pedestrian attributes, 6 ≤ X ≤ 30. N attributes, including gender, age, hairstyle and the like, are taken out of the X, with X−4 ≤ N ≤ X−2, and are encoded with one-hot codes. The images in R have K categories, 500 ≤ K ≤ 2000, and each pedestrian image is assigned a category J_c, 1 ≤ c ≤ K;
The training set α is constructed as follows: ε images are taken out of R (the constraint on ε is given only as a formula image in the original). The image resolution is scaled to H×W; with any pixel in the pedestrian region as the centre, the RGB intensities of the three channels are randomly perturbed within a surrounding square region whose proportion of the pixels of the whole image is S, 2% ≤ S ≤ 20%. The processed images are denoted I_i, 0 ≤ i ≤ ε; the I_i constitute the training set α, and the remaining images in R constitute the test set γ;
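As a concrete illustration of this Step-1 augmentation, the following is a minimal NumPy/OpenCV sketch under stated assumptions: the perturbation amplitude (±30 intensity levels), the uniform choice of the patch centre and the file name sample.jpg are assumptions, since the text only fixes the area ratio S.

```python
import numpy as np
import cv2

def perturb_rgb_patch(image, s_ratio=0.1):
    """Step-1 style augmentation sketch: choose a random pixel as the centre of a
    square patch covering a fraction s_ratio of the image pixels, and randomly
    perturb the three RGB channel intensities inside that patch."""
    h, w, _ = image.shape
    side = int(np.sqrt(s_ratio * h * w))              # square side so that area/(h*w) = S
    cy, cx = np.random.randint(0, h), np.random.randint(0, w)
    y0, y1 = max(0, cy - side // 2), min(h, cy + side // 2)
    x0, x1 = max(0, cx - side // 2), min(w, cx + side // 2)
    noise = np.random.randint(-30, 31, size=(y1 - y0, x1 - x0, 3))   # assumed range
    patch = image[y0:y1, x0:x1].astype(np.int16) + noise
    out = image.copy()
    out[y0:y1, x0:x1] = np.clip(patch, 0, 255).astype(np.uint8)
    return out

# usage: scale to H x W = 256 x 128 (cv2.resize takes (width, height)), then perturb
img = cv2.resize(cv2.imread("sample.jpg"), (128, 256))
aug = perturb_rgb_patch(img, s_ratio=0.1)
```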
step 2: constructing neural networks
The neural network consists of three sub-networks: a low/mid-level feature sub-network S-Net, a global feature sub-network G-Net and a local fine-grained feature sub-network L-Net;
For S-Net, the input is I_i, with input tensor shape H×W×3, and the output is two features of different scales, A_i and B_i: A_i is a high-resolution feature of size m×n×1024 and B_i is a low-resolution feature of size a×b×2048, where m is 8, 16 or 24, n = m/2, a = m/2 and b = a/2;
For G-Net, B_i is the input and the output is the class D of I_i, 0 < D < K−1;
For L-Net, A_i is the input and the output has two results: one is the class of I_i, and the other is the probabilities of the N pedestrian attributes of the sample image I_i;
For S-Net, a ResNet101 structure is adopted; on the basis of the basic ResNet101 structure, the maximum pooling layer and the fully connected layer after the residual modules are removed;
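The following PyTorch sketch shows one way to realize such an S-Net from the torchvision ResNet101, returning A_i from the third residual stage and B_i from the fourth; the module names (stem, layer1, ...) come from torchvision, and the exact tap points are an assumption consistent with the stated feature sizes.

```python
import torch.nn as nn
from torchvision import models

class SNet(nn.Module):
    """Sketch of the low/mid-level feature sub-network: a ResNet101 trunk with the
    final average-pooling and fully connected layers removed, returning the
    layer3 output as A_i (1024 channels) and the layer4 output as B_i (2048 channels)."""
    def __init__(self, pretrained=False):
        super().__init__()
        backbone = models.resnet101(pretrained=pretrained)
        self.stem = nn.Sequential(backbone.conv1, backbone.bn1,
                                  backbone.relu, backbone.maxpool)
        self.layer1, self.layer2 = backbone.layer1, backbone.layer2
        self.layer3, self.layer4 = backbone.layer3, backbone.layer4

    def forward(self, x):                 # x: (batch, 3, H, W), e.g. H=256, W=128
        x = self.stem(x)
        x = self.layer2(self.layer1(x))
        a = self.layer3(x)                # A_i: (batch, 1024, 16, 8) for a 256x128 input
        b = self.layer4(a)                # B_i: (batch, 2048, 8, 4)
        return a, b
```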
For G-Net, the network structure is set to the following 3 layers: a max pooling layer, a convolutional layer and a fully connected layer. The max pooling layer has 2048 channels and a spatial pooling scale of a×b; the convolutional layer consists of convolution, batch normalization, ReLU activation and Dropout operations, with 1024 convolution kernels of shape 1×1; the number of neurons in the fully connected layer is P, 0 < P < K−1;
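A minimal sketch of such a G-Net head is given below, assuming the a×b max pooling collapses the spatial dimensions and the fully connected layer acts on the 1024-dimensional pooled feature; P = 751 and the Dropout rate of 0.5 are taken from the embodiment.

```python
import torch.nn as nn

class GNet(nn.Module):
    """Sketch of the global feature sub-network: a x b max pooling, a 1x1 conv block
    (Conv-BN-ReLU-Dropout) reducing 2048 -> 1024 channels, and a fully connected
    classifier with P outputs."""
    def __init__(self, a=8, b=4, num_classes=751, p_drop=0.5):
        super().__init__()
        self.pool = nn.MaxPool2d(kernel_size=(a, b))          # 2048 x a x b -> 2048 x 1 x 1
        self.conv = nn.Sequential(
            nn.Conv2d(2048, 1024, kernel_size=1),
            nn.BatchNorm2d(1024),
            nn.ReLU(inplace=True),
            nn.Dropout(p_drop),
        )
        self.fc = nn.Linear(1024, num_classes)

    def forward(self, b_feat):            # b_feat: (batch, 2048, a, b) from S-Net
        g = self.conv(self.pool(b_feat)).flatten(1)   # (batch, 1024) global feature
        return g, self.fc(g)                          # feature and class scores
```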
For L-Net, the structure is arranged as a spatial transformation network G_s to which 4 sub-branch structures are connected:
The input of G_s is A_i; its output part is a neuron structure with the same number of neurons as A_i, and these neurons store the global spatial feature e. G_s consists of a localization module, a grid generation module and a sampling module:
1) The localization module consists of 2 convolutional layers, 1 global average pooling layer and 1 fully connected layer; the numbers of kernels in the 2 convolutional layers are 512 and 128 respectively, and the fully connected layer has 6 neurons, which store the spatial affine transformation parameters of the image;
2) The grid generation module has m×n neurons; each neuron stores the spatial-domain coordinate corresponding to each feature of A_i, and this coordinate map is denoted O;
3) The sampling module consists of m×n neurons, with A_i and O as input; each neuron receives the result of bilinear interpolation over the neighbourhood pixels around its sampling position;
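These three modules follow the standard spatial transformer pattern. The sketch below is one possible PyTorch realization, assuming a reasonably recent PyTorch, 3×3 kernels in the localization convolutions (the kernel sizes are not stated in the text) and identity initialization of the affine parameters.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialTransformer(nn.Module):
    """Sketch of G_s: a localization net (two convs, global average pooling, a
    6-neuron FC producing 2x3 affine parameters), followed by grid generation
    and bilinear sampling over the input feature map A_i."""
    def __init__(self, in_channels=1024):
        super().__init__()
        self.localization = nn.Sequential(
            nn.Conv2d(in_channels, 512, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(512, 128, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc_theta = nn.Linear(128, 6)
        # initialize to the identity transform so training starts from "no warp"
        nn.init.zeros_(self.fc_theta.weight)
        self.fc_theta.bias.data.copy_(torch.tensor([1., 0., 0., 0., 1., 0.]))

    def forward(self, a_feat):                       # a_feat: (batch, 1024, m, n)
        theta = self.fc_theta(self.localization(a_feat).flatten(1)).view(-1, 2, 3)
        grid = F.affine_grid(theta, a_feat.size(), align_corners=False)   # grid generation (O)
        return F.grid_sample(a_feat, grid, align_corners=False)           # bilinear sampling -> e
```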
Connected to G_s are 4 sub-branch structures:
Among them, 3 branch structures are designed as follows: their input comes from horizontally splitting the feature e into three equal parts, i.e. the upper, middle and lower thirds of e, giving features Q_1, Q_2 and Q_3. Besides a spatial transformation network whose structure is exactly the same as G_s, each of these branch structures also contains 1 max pooling layer, 1 convolutional layer and 2 fully connected layers: the convolutional layer has 256 kernels of shape 1×1, the tensor after convolution is reshaped to 1×256 and fed into the first fully connected layer, which has 256 neurons; the 2nd fully connected layer follows the 1st and is divided into 3 groups of neurons, with 2, 2 and P neurons respectively;
In addition to these 3 branch structures there is a 4th branch structure. The features output by the 3 branch structures are concatenated into a tensor R (its shape is given only as a formula image in the original), and R is the input of the 4th branch structure, which consists of a max pooling layer, a convolutional layer and 2 fully connected layers: the max pooling layer has 1024 channels, with a spatial pooling scale given only as a formula image in the original; the convolutional layer consists of convolution, batch normalization, ReLU activation and Dropout operations, with 1024 kernels of shape 1×1; the tensor after convolution is reshaped to 1×1024 and fed into the first fully connected layer, which has 1024 neurons. The structure of the 2 fully connected layers is the same as the fully connected structure in the 3 branches, except that the 2nd fully connected layer is divided into 2 groups of neurons, with 2 and P neurons respectively;
and step 3: training of neural networks
The sample images in the test set γ are divided into a sample data set β and a test data set δ in the ratio 1:4. The network model is trained with the training set α; the pedestrian samples to be identified are taken from β, and β together with δ is used to evaluate and test the performance of the network;
During training, S-Net and G-Net are first trained simultaneously for 50 epochs; then S-Net, G-Net and L-Net are trained simultaneously for 200 epochs; finally the parameters of the first two layers of S-Net are fixed and the parameters of the other layers of S-Net, G-Net and L-Net are fine-tuned for 100 epochs;
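A short sketch of the final fine-tuning stage is given below; it reuses the SNet sketch above and assumes that "the first two layers" correspond to its stem and layer1, which the text does not specify.

```python
# Freeze the first two layers of S-Net and fine-tune only the remaining parameters
# ("first two layers" is here assumed to mean the stem and layer1 of the SNet sketch).
s_net = SNet(pretrained=False)
for module in (s_net.stem, s_net.layer1):
    for p in module.parameters():
        p.requires_grad = False
trainable = [p for p in s_net.parameters() if p.requires_grad]   # passed to the optimizer
```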
The loss for G-Net is defined as:

L_g = -Σ_d q_d log(p_d)    (1)

where p_d denotes the probability that I_i belongs to class d and q_d denotes the label value of I_i for class d;
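A minimal sketch of this classification loss in PyTorch is shown below; the batch size and class count are illustrative, and F.cross_entropy with integer labels is equivalent to the one-hot form above.

```python
import torch
import torch.nn.functional as F

logits = torch.randn(32, 751)              # batch of class scores from G-Net (P = 751)
labels = torch.randint(0, 751, (32,))      # ground-truth class indices
loss_g = F.cross_entropy(logits, labels)   # -sum_d q_d log p_d for one-hot q
```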
The loss for L-Net is defined as:

[Formula (2): the L-Net loss L_l, built from the class loss L_E and the attribute loss L_T with the hyper-parameter ρ; given only as a formula image in the original]

where L_E denotes a class loss defined in the same manner as formula (1), with p_d now denoting the probability that a local feature of I_i belongs to class d and q_d the label of the local feature of I_i for class d; L_T is also defined in the same manner as formula (1), but when computing L_T there are only 2 categories, belonging and not belonging, and p_d is the binary classification probability of whether a local feature of L-Net belongs to an attribute of I_i; ρ is a hyper-parameter constant, taken as 0.25;
When S-Net, G-Net and L-Net are trained simultaneously, the loss is defined as:

[Formula (3): the loss L_t used when S-Net, G-Net and L-Net are trained simultaneously; given only as a formula image in the original]

where, within the same batch of training samples, the features obtained after the G-Net convolution operations for two different pedestrian sample images belonging to the same class x are denoted f_x^(1) and f_x^(2), and the feature obtained by the G-Net convolution operations for a pedestrian sample image not belonging to class x is denoted f_y; θ is a hyper-parameter constant, taken as 1.0; Z is the number of same-class samples in the same batch, and U denotes the number of classes of samples in the same batch;
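Since formula (3) is available only as an image, the sketch below shows a standard triplet-style margin loss over the quantities just described; this is an assumption about its form, not the patent's verbatim formula: same-class feature pairs are pulled together and features of a different class are pushed at least θ away.

```python
import torch
import torch.nn.functional as F

def margin_loss(f_anchor, f_pos, f_neg, theta=1.0):
    """Illustrative margin loss over G-Net features: d(anchor, positive) should be
    smaller than d(anchor, negative) by at least the margin theta."""
    d_pos = F.pairwise_distance(f_anchor, f_pos)   # distance between two samples of class x
    d_neg = F.pairwise_distance(f_anchor, f_neg)   # distance to a sample of another class
    return F.relu(d_pos - d_neg + theta).mean()

# illustrative usage with random 1024-dimensional features
fa, fp, fn = torch.randn(32, 1024), torch.randn(32, 1024), torch.randn(32, 1024)
loss_t = margin_loss(fa, fp, fn, theta=1.0)
```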
When the parameters of the other layers of S-Net, G-Net and L-Net are fine-tuned, the total loss is defined as:

L_Y = L_g + L_l + L_t    (4)
and 4, step 4: pedestrian re-identification method
A sample is selected from the sample data set β and input into the network, which is built from the trained model parameters. The network makes a prediction, and the features obtained from G-Net and L-Net are concatenated to give the predicted feature e. The features of each sample in δ are predicted in the same way, and the Euclidean distance between each of these features and e is measured; the sample in δ for which the Euclidean distance reaches its minimum gives the pedestrian re-identification result.
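A minimal sketch of this nearest-neighbour retrieval is shown below; the feature dimension and tensor names are illustrative, and the concatenation of the G-Net and L-Net outputs is assumed to have already been performed.

```python
import torch

def retrieve(query_feat, gallery_feats):
    """Return the index of the gallery sample (from delta) whose concatenated
    G-Net/L-Net feature has the minimum Euclidean distance to the query feature e."""
    dists = torch.cdist(query_feat.unsqueeze(0), gallery_feats).squeeze(0)
    return torch.argmin(dists).item()

# illustrative usage with random 2048-dimensional features
query = torch.randn(2048)
gallery = torch.randn(100, 2048)
best = retrieve(query, gallery)
```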
The invention is efficient and practical. It combines global features and local features for learning and training and obtains more accurate recognition results. The network structure of the method comprises a low/mid-level feature sub-network, a global feature sub-network and a local fine-grained feature sub-network. The low/mid-level feature sub-network learns the mid- and low-level features of pedestrians; the global feature sub-network further learns high-level pedestrian semantic features and can effectively distinguish the overall characteristics of pedestrians; the local fine-grained feature sub-network is designed with a spatial transformation network and, by combining the low/mid-level features with the learning of local human appearance information, effectively improves the accuracy of local pedestrian appearance recognition. The invented pedestrian re-identification method can achieve a high recognition rate in complex background environments.
The invention can be applied in fields such as intelligent security, video surveillance and pedestrian retrieval, can serve as a support for face recognition technology, and has high application value in public-security investigation work, image retrieval and other scenarios.
Drawings
FIG. 1 is a pedestrian search result diagram of Rank-10 according to the present invention;
fig. 2 is a diagram illustrating the result of the pedestrian attribute prediction according to the present invention.
Detailed Description
Examples
The invention is further described below with reference to the accompanying drawings;
the embodiment is implemented under a Windows 1064-bit operating system on a PC, and the hardware configuration thereof is
Figure BDA0002223165640000051
CoreTMi5-75003.4GHz, a video card NVIDIA GeForce GTX 10606G, a deep learning frame Pythrch 0.4, a programming language adopting Python 3.6, and mainly utilizing Python libraries of OpenCV 3.3.1 and NumPy 1.14.3;
the specific technical scheme for realizing the invention is as follows: a pedestrian re-identification method with multi-attribute constraint is characterized in that a single RGB image with H multiplied by W resolution is input, H is 256, W is 128, a convolutional neural network is constructed, and meanwhile, the difficulties of low image resolution, shielding, inconsistent pedestrian postures and the like are considered, and the method specifically comprises the following steps:
step 1: data set construction
Download the single-person image data set Market1501 from the website http://blog.fangchengjin.cn/reid-marker-1501.html, and select M images from Market1501 to construct a data set R = {h_η | 0 ≤ η ≤ M−1}, with M = 29419. Each image in R contains X kinds of pedestrian attributes, X = 27. N attributes are taken out of the X, with N = 7 and X−4 ≤ N ≤ X−2, and are encoded with one-hot codes; they include gender, age, hairstyle and the like. The images in R have K categories, K = 1501, and each pedestrian image is assigned a category J_c, 1 ≤ c ≤ K;
The training set α is constructed as follows: ε images are taken out of R (the constraint on ε is given only as a formula image in the original). The image resolution is scaled to H×W; with any pixel in the pedestrian region as the centre, the RGB intensities of the three channels are randomly perturbed within a surrounding square region whose proportion of the pixels of the whole image is S, 2% ≤ S ≤ 20%. The processed images are denoted I_i, 0 ≤ i ≤ ε; the I_i constitute the training set α, and the remaining images in R constitute the test set γ;
step 2: constructing neural networks
The neural network consists of three sub-networks: a low/mid-level feature sub-network S-Net, a global feature sub-network G-Net and a local fine-grained feature sub-network L-Net;
For S-Net, the input is I_i, with input tensor shape H×W×3, and the output is two features of different scales, A_i and B_i: A_i is a high-resolution feature of size m×n×1024, with m = 16 and n = 8, and B_i is a low-resolution feature of size a×b×2048, with a = 8 and b = 4;
For G-Net, B_i is the input and the output is the class D of I_i, 0 < D < K−1;
For L-Net, A_i is the input and the output has two results: one is the class of I_i, and the other is the probabilities of the N pedestrian attributes of the sample image I_i;
For S-Net, a ResNet101 structure is adopted; on the basis of the basic ResNet101 structure, the maximum pooling layer and the fully connected layer after the residual modules are removed;
For G-Net, the network structure is set to the following 3 layers: a max pooling layer, a convolutional layer and a fully connected layer. The max pooling layer has 2048 channels and a spatial pooling scale of a×b; the convolutional layer consists of convolution, batch normalization, ReLU activation and Dropout operations, with 1024 convolution kernels of shape 1×1; the number of neurons in the fully connected layer is P, with P = 751;
For L-Net, the structure is arranged as a spatial transformation network G_s to which 4 sub-branch structures are connected:
The input of G_s is A_i; its output part is a neuron structure with the same number of neurons as A_i, and these neurons store the global spatial feature e. G_s consists of a localization module, a grid generation module and a sampling module:
1) The localization module consists of 2 convolutional layers, 1 global average pooling layer and 1 fully connected layer; the numbers of kernels in the 2 convolutional layers are 512 and 128 respectively, and the fully connected layer has 6 neurons, which store the spatial affine transformation parameters of the image;
2) The grid generation module has m×n neurons; each neuron stores the spatial-domain coordinate corresponding to each feature of A_i, and this coordinate map is denoted O;
3) The sampling module consists of m×n neurons, with A_i and O as input; each neuron receives the result of bilinear interpolation over the neighbourhood pixels around its sampling position;
Connected to G_s are 4 sub-branch structures:
Among them, 3 branch structures are designed as follows: their input comes from horizontally splitting the feature e into three equal parts, i.e. the upper, middle and lower thirds of e, giving features Q_1, Q_2 and Q_3. Besides a spatial transformation network whose structure is exactly the same as G_s, each of these branch structures also contains 1 max pooling layer, 1 convolutional layer and 2 fully connected layers: the convolutional layer has 256 kernels of shape 1×1, the tensor after convolution is reshaped to 1×256 and fed into the first fully connected layer, which has 256 neurons; the 2nd fully connected layer follows the 1st and is divided into 3 groups of neurons, with 2, 2 and P neurons respectively;
In addition to these 3 branch structures there is a 4th branch structure. The features output by the 3 branch structures are concatenated into a tensor R (its shape is given only as a formula image in the original), and R is the input of the 4th branch structure, which consists of a max pooling layer, a convolutional layer and 2 fully connected layers: the max pooling layer has 1024 channels, with a spatial pooling scale given only as a formula image in the original; the convolutional layer consists of convolution, batch normalization, ReLU activation and Dropout operations, with 1024 kernels of shape 1×1; the tensor after convolution is reshaped to 1×1024 and fed into the first fully connected layer, which has 1024 neurons. The structure of the 2 fully connected layers is the same as the fully connected structure in the 3 branches, except that the 2nd fully connected layer is divided into 2 groups of neurons, with 2 and P neurons respectively;
and step 3: training of neural networks
The sample images in the test set γ are divided into a sample data set β and a test data set δ in the ratio 1:4. The network model is trained with the training set α; the pedestrian samples to be identified are taken from β, and β together with δ is used to evaluate and test the performance of the network;
During training, S-Net and G-Net are first trained simultaneously for 50 epochs; then S-Net, G-Net and L-Net are trained simultaneously for 200 epochs; finally the parameters of the first two layers of S-Net are fixed and the parameters of the other layers of S-Net, G-Net and L-Net are fine-tuned for 100 epochs;
The loss for G-Net is defined as:

L_g = -Σ_d q_d log(p_d)    (1)

where p_d denotes the probability that I_i belongs to class d and q_d denotes the label value of I_i for class d;
The loss for L-Net is defined as:

[Formula (2): the L-Net loss L_l, built from the class loss L_E and the attribute loss L_T with the hyper-parameter ρ; given only as a formula image in the original]

where L_E denotes a class loss defined in the same manner as formula (1), with p_d now denoting the probability that a local feature of I_i belongs to class d and q_d the label of the local feature of I_i for class d; L_T is also defined in the same manner as formula (1), but when computing L_T there are only 2 categories, belonging and not belonging, and p_d is the binary classification probability of whether a local feature of L-Net belongs to an attribute of I_i; ρ is a hyper-parameter constant, taken as 0.25;
When S-Net, G-Net and L-Net are trained simultaneously, the loss is defined as:

[Formula (3): the loss L_t used when S-Net, G-Net and L-Net are trained simultaneously; given only as a formula image in the original]

where, within the same batch of training samples, the features obtained after the G-Net convolution operations for two different pedestrian sample images belonging to the same class x are denoted f_x^(1) and f_x^(2), and the feature obtained by the G-Net convolution operations for a pedestrian sample image not belonging to class x is denoted f_y; θ is a hyper-parameter constant, taken as 1.0; Z is the number of same-class samples in the same batch, and U denotes the number of classes of samples in the same batch;
When the parameters of the other layers of S-Net, G-Net and L-Net are fine-tuned, the total loss is defined as:

L_Y = L_g + L_l + L_t    (4)
and 4, step 4: pedestrian re-identification method
A sample is selected from the sample data set β and input into the network, which is built from the trained model parameters. The network makes a prediction, and the features obtained from G-Net and L-Net are concatenated to give the predicted feature e. The features of each sample in δ are predicted in the same way, and the Euclidean distance between each of these features and e is measured; the sample in δ for which the Euclidean distance reaches its minimum gives the pedestrian re-identification result.
The hyper-parameters during training are set as follows: the Dropout rate is 0.5; the optimizer is stochastic gradient descent (SGD); the batch size is 32. In the stage where S-Net and G-Net are trained simultaneously, the learning rate is 0.01 and the training period is 50 epochs; in the stage where S-Net, G-Net and L-Net are trained simultaneously, the initial learning rate is 0.01, the training period is 200 epochs, and the learning rate is reduced by 10% every 50 epochs; in the fine-tuning stage, the learning rate is 0.001 and the training period is 50 epochs.
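The following is a minimal sketch of this optimizer setup in PyTorch; the dummy module stands in for the parameters of the three sub-networks, and the momentum value is an assumption, since the text only specifies SGD, the learning rates, the batch size and the decay schedule.

```python
import torch
import torch.nn as nn

# dummy module standing in for the parameters of S-Net, G-Net and L-Net
model = nn.Linear(2048, 751)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
# learning rate reduced by 10% every 50 epochs during the 200-epoch joint stage
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=50, gamma=0.9)

for epoch in range(200):
    # ... one pass over the 32-sample batches, minimizing L_Y = L_g + L_l + L_t ...
    scheduler.step()
```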
Fig. 1 shows visual retrieval results. Six images of pedestrians to be retrieved are randomly selected from β; the selected pedestrian images cover occlusion, low pixel counts, inconsistent pedestrian postures and similar situations. Each row is a group of retrieval results, and the first column of each row is the pedestrian image to be retrieved. It can be seen that although the retrieved images have low resolution, contain partial occlusion and show inconsistent pedestrian postures, the model still achieves accurate retrieval results; the accuracy of the retrieval results for the 6 randomly selected images reaches 100%.
Fig. 2 shows the pedestrian attribute prediction results, for which 4 pedestrian images are randomly selected from β; each example comprises an input pedestrian and the attribute prediction result. The results show that the model can accurately predict the pedestrian attributes; in addition, although the input images contain redundant background, the model can still accurately locate the pedestrian attributes, which improves the accuracy of attribute recognition.

Claims (1)

1. A multi-attribute constrained pedestrian re-identification method, characterized in that a single RGB image of resolution H×W is input, where H is 128, 256 or 384 and W = H/2, the method specifically comprising the following steps:
step 1: data set construction
Download the single-person image data set Market1501 from the website http://blog.fangchengjin.cn/reid-marker-1501.html, and select M images from Market1501 to construct a data set R = {h_η | 0 ≤ η ≤ M−1}, with 20000 ≤ M ≤ 40000. Each image in R contains X kinds of pedestrian attributes, 6 ≤ X ≤ 30. N attributes, including gender, age, hairstyle and the like, are taken out of the X, with X−4 ≤ N ≤ X−2, and are encoded with one-hot codes. The images in R have K categories, 500 ≤ K ≤ 2000, and each pedestrian image is assigned a category J_c, 1 ≤ c ≤ K;
The training set α is constructed as follows: ε images are taken out of R (the constraint on ε is given only as a formula image in the original). The image resolution is scaled to H×W; with any pixel in the pedestrian region as the centre, the RGB intensities of the three channels are randomly perturbed within a surrounding square region whose proportion of the pixels of the whole image is S, 2% ≤ S ≤ 20%. The processed images are denoted I_i, 0 ≤ i ≤ ε; the I_i constitute the training set α, and the remaining images in R constitute the test set γ;
step 2: constructing neural networks
The neural network consists of three sub-networks: a low/mid-level feature sub-network S-Net, a global feature sub-network G-Net and a local fine-grained feature sub-network L-Net;
For S-Net, the input is I_i, with input tensor shape H×W×3, and the output is two features of different scales, A_i and B_i: A_i is a high-resolution feature of size m×n×1024 and B_i is a low-resolution feature of size a×b×2048, where m is 8, 16 or 24, n = m/2, a = m/2 and b = a/2;
For G-Net, B_i is the input and the output is the class D of I_i, 0 < D < K−1;
For L-Net, A_i is the input and the output has two results: one is the class of I_i, and the other is the probabilities of the N pedestrian attributes of the sample image I_i;
For S-Net, a ResNet101 structure is adopted; on the basis of the basic ResNet101 structure, the maximum pooling layer and the fully connected layer after the residual modules are removed;
For G-Net, the network structure is set to the following 3 layers: a max pooling layer, a convolutional layer and a fully connected layer. The max pooling layer has 2048 channels and a spatial pooling scale of a×b; the convolutional layer consists of convolution, batch normalization, ReLU activation and Dropout operations, with 1024 convolution kernels of shape 1×1; the number of neurons in the fully connected layer is P, 0 < P < K−1;
For L-Net, the structure is arranged as a spatial transformation network G_s to which 4 sub-branch structures are connected:
The input of G_s is A_i; its output part is a neuron structure with the same number of neurons as A_i, and these neurons store the global spatial feature e. G_s consists of a localization module, a grid generation module and a sampling module:
1) The localization module consists of 2 convolutional layers, 1 global average pooling layer and 1 fully connected layer; the numbers of kernels in the 2 convolutional layers are 512 and 128 respectively, and the fully connected layer has 6 neurons, which store the spatial affine transformation parameters of the image;
2) The grid generation module has m×n neurons; each neuron stores the spatial-domain coordinate corresponding to each feature of A_i, and this coordinate map is denoted O;
3) The sampling module consists of m×n neurons, with A_i and O as input; each neuron receives the result of bilinear interpolation over the neighbourhood pixels around its sampling position;
Connected to G_s are 4 sub-branch structures:
Among them, 3 branch structures are designed as follows: their input comes from horizontally splitting the feature e into three equal parts, i.e. the upper, middle and lower thirds of e, giving features Q_1, Q_2 and Q_3. Besides a spatial transformation network whose structure is exactly the same as G_s, each of these branch structures also contains 1 max pooling layer, 1 convolutional layer and 2 fully connected layers: the convolutional layer has 256 kernels of shape 1×1, the tensor after convolution is reshaped to 1×256 and fed into the first fully connected layer, which has 256 neurons; the 2nd fully connected layer follows the 1st and is divided into 3 groups of neurons, with 2, 2 and P neurons respectively;
In addition to these 3 branch structures there is a 4th branch structure. The features output by the 3 branch structures are concatenated into a tensor R (its shape is given only as a formula image in the original), and R is the input of the 4th branch structure, which consists of a max pooling layer, a convolutional layer and 2 fully connected layers: the max pooling layer has 1024 channels, with a spatial pooling scale given only as a formula image in the original; the convolutional layer consists of convolution, batch normalization, ReLU activation and Dropout operations, with 1024 kernels of shape 1×1; the tensor after convolution is reshaped to 1×1024 and fed into the first fully connected layer, which has 1024 neurons. The structure of the 2 fully connected layers is the same as the fully connected structure in the 3 branches, except that the 2nd fully connected layer is divided into 2 groups of neurons, with 2 and P neurons respectively;
and step 3: training of neural networks
The sample images in the test set γ are divided into a sample data set β and a test data set δ in the ratio 1:4. The network model is trained with the training set α; the pedestrian samples to be identified are taken from β, and β together with δ is used to evaluate and test the performance of the network;
During training, S-Net and G-Net are first trained simultaneously for 50 epochs; then S-Net, G-Net and L-Net are trained simultaneously for 200 epochs; finally the parameters of the first two layers of S-Net are fixed and the parameters of the other layers of S-Net, G-Net and L-Net are fine-tuned for 100 epochs;
The loss for G-Net is defined as:

L_g = -Σ_d q_d log(p_d)    (1)

where p_d denotes the probability that I_i belongs to class d and q_d denotes the label value of I_i for class d;
The loss for L-Net is defined as:

[Formula (2): the L-Net loss L_l, built from the class loss L_E and the attribute loss L_T with the hyper-parameter ρ; given only as a formula image in the original]

where L_E denotes a class loss defined in the same manner as formula (1), with p_d now denoting the probability that a local feature of I_i belongs to class d and q_d the label of the local feature of I_i for class d; L_T is also defined in the same manner as formula (1), but when computing L_T there are only 2 categories, belonging and not belonging, and p_d is the binary classification probability of whether a local feature of L-Net belongs to an attribute of I_i; ρ is a hyper-parameter constant, taken as 0.25;
When S-Net, G-Net and L-Net are trained simultaneously, the loss is defined as:

[Formula (3): the loss L_t used when S-Net, G-Net and L-Net are trained simultaneously; given only as a formula image in the original]

where, within the same batch of training samples, the features obtained after the G-Net convolution operations for two different pedestrian sample images belonging to the same class x are denoted f_x^(1) and f_x^(2), and the feature obtained by the G-Net convolution operations for a pedestrian sample image not belonging to class x is denoted f_y; θ is a hyper-parameter constant, taken as 1.0; Z is the number of same-class samples in the same batch, and U denotes the number of classes of samples in the same batch;
When the parameters of the other layers of S-Net, G-Net and L-Net are fine-tuned, the total loss is defined as:

L_Y = L_g + L_l + L_t    (4)
and 4, step 4: pedestrian re-identification method
A sample is selected from the sample data set β and input into the network, which is built from the trained model parameters. The network makes a prediction, and the features obtained from G-Net and L-Net are concatenated to give the predicted feature e. The features of each sample in δ are predicted in the same way, and the Euclidean distance between each of these features and e is measured; the sample in δ for which the Euclidean distance reaches its minimum gives the pedestrian re-identification result.
CN201910941997.3A 2019-09-30 2019-09-30 Multi-attribute constrained pedestrian re-identification method Pending CN110728221A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910941997.3A CN110728221A (en) 2019-09-30 2019-09-30 Multi-attribute constrained pedestrian re-identification method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910941997.3A CN110728221A (en) 2019-09-30 2019-09-30 Multi-attribute constrained pedestrian re-identification method

Publications (1)

Publication Number Publication Date
CN110728221A 2020-01-24

Family

ID=69218671

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910941997.3A Pending CN110728221A (en) 2019-09-30 2019-09-30 Multi-attribute constrained pedestrian re-identification method

Country Status (1)

Country Link
CN (1) CN110728221A (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108960127A (en) * 2018-06-29 2018-12-07 厦门大学 Occluded pedestrian re-identification method based on adaptive deep metric learning

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108960127A (en) * 2018-06-29 2018-12-07 厦门大学 Occluded pedestrian re-identification method based on adaptive deep metric learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
CHAO LIU, HONGYANG QUAN: "A Global-Local Architecture Constrained by Multiple Attributes for Person Re-identification" *

Similar Documents

Publication Publication Date Title
CN108764372B (en) Data set construction method and device, mobile terminal, and readable storage medium
CN108537742B (en) Remote sensing image panchromatic sharpening method based on generation countermeasure network
Cordonnier et al. Differentiable patch selection for image recognition
CN108052911B (en) Deep learning-based multi-mode remote sensing image high-level feature fusion classification method
CN105740894B (en) Semantic annotation method for hyperspectral remote sensing image
CN112651978B (en) Sublingual microcirculation image segmentation method and device, electronic equipment and storage medium
CN109271895B (en) Pedestrian re-identification method based on multi-scale feature learning and feature segmentation
CN113221641B (en) Video pedestrian re-identification method based on generation of antagonism network and attention mechanism
CN109684922B (en) Multi-model finished dish identification method based on convolutional neural network
CN112347970B (en) Remote sensing image ground object identification method based on graph convolution neural network
CN109741341A (en) Image segmentation method based on super-pixels and long short-term memory network
Gu et al. Blind image quality assessment via vector regression and object oriented pooling
CN111652273B (en) Deep learning-based RGB-D image classification method
CN114782997B (en) Pedestrian re-recognition method and system based on multi-loss attention self-adaptive network
Zulfiqar et al. AI-ForestWatch: semantic segmentation based end-to-end framework for forest estimation and change detection using multi-spectral remote sensing imagery
CN111626357B (en) Image identification method based on neural network model
CN110427819A (en) Method and related device for identifying PPT frames in an image
CN111832479B (en) Video target detection method based on improved self-adaptive anchor point R-CNN
CN114548256A (en) Small sample rare bird identification method based on comparative learning
CN111639697B (en) Hyperspectral image classification method based on non-repeated sampling and prototype network
CN113111716A (en) Remote sensing image semi-automatic labeling method and device based on deep learning
CN110688966B (en) Semantic guidance pedestrian re-recognition method
CN114898158A (en) Small sample traffic abnormity image acquisition method and system based on multi-scale attention coupling mechanism
CN115272956A (en) Chicken health degree monitoring method based on improved YOLOv5
CN113077438B (en) Cell nucleus region extraction method and imaging method for multi-cell nucleus color image

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20200124)