CN111401113A - Pedestrian re-identification method based on human body posture estimation - Google Patents

Pedestrian re-identification method based on human body posture estimation

Info

Publication number
CN111401113A
CN111401113A (application CN201910006045.2A)
Authority
CN
China
Prior art keywords
picture
pedestrian
human body
neural network
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910006045.2A
Other languages
Chinese (zh)
Inventor
于耀
邱睿
周余
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University
Original Assignee
Nanjing University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University filed Critical Nanjing University
Priority to CN201910006045.2A priority Critical patent/CN111401113A/en
Publication of CN111401113A publication Critical patent/CN111401113A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10: Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/103: Static body considered as a whole, e.g. static pedestrian or occupant recognition
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a pedestrian re-identification method based on human body posture estimation, belonging to the field of computer vision. The invention addresses the following problem: in the pedestrian re-identification task, recognition is not robust because the human body posture and the camera angle are not fixed. The key part of the algorithm is to obtain the human body joint points in a pedestrian picture with a human body posture estimation algorithm and to divide the pedestrian picture at those joint points into pictures of the individual body parts. Neural networks are trained on the pictures of each body part and extract the feature vectors of the parts. The feature vectors are then fused, and finally a distance metric is applied to the fused vectors. The method improves robustness to unfixed pedestrian postures in the pedestrian re-identification task and possesses a certain degree of novelty.

Description

Pedestrian re-identification method based on human body posture estimation
Technical Field
The invention belongs to the fields of computer vision and deep learning, and mainly relates to a pedestrian re-identification method based on posture estimation.
Background
Pedestrian re-identification is a key task in multi-camera surveillance systems. The task is to decide whether pedestrians captured by two different cameras, at different times and places, are the same person. Pedestrian re-identification is a key technology for cross-camera pedestrian tracking and key-person retrieval. Related research is growing, but many difficulties remain unsolved, such as differing resolutions across cameras, illumination variation and irrelevant background noise. At the same time, the human body is not rigid: pedestrians move differently and camera shooting angles differ, which introduces the problem of posture change. All of the above problems degrade recognition accuracy to varying degrees.
Existing pedestrian re-identification frameworks mainly comprise feature extraction and feature metric learning. Most methods, however, either ignore the influence of pedestrian posture change when extracting image features or perform an implicit body-part alignment by training a neural network. Such operations lose the spatial relationships among images of body parts, so accuracy drops when matching pictures with large differences in human body posture.
Disclosure of Invention
The invention provides a pedestrian re-identification method based on human body posture estimation. The method comprises the following steps:
Step 1: taking frames from pedestrian videos collected by a plurality of cameras and detecting the pedestrians in each frame;
Step 2: acquiring the joint point positions of a pedestrian in the picture through posture estimation, and accordingly dividing the pedestrian picture into an upper-body picture and a lower-body picture;
Step 3: training 3 neural networks that respectively extract feature vectors from the original picture, the upper-body picture and the lower-body picture;
Step 4: feature fusion, matching and sorting.
For step 1, a plurality of cameras distributed on different streets shoot different scenes. Frames are taken from the recorded videos, and each frame contains a varying number of pedestrians. A pedestrian detection method based on a convolutional neural network yields a bounding box for each pedestrian, and cropping each bounding box gives a pedestrian foreground picture.
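For illustration only, the following is a minimal sketch of such a detection step. The patent does not name a specific detector; a pre-trained Faster R-CNN from torchvision is assumed here, and the function name crop_pedestrians is hypothetical.

```python
import torch
import torchvision
from torchvision.transforms.functional import to_tensor
from PIL import Image

# Assumed detector: torchvision's pre-trained Faster R-CNN (the patent only
# specifies "a pedestrian detection method based on a convolutional neural network").
detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
detector.eval()

def crop_pedestrians(frame: Image.Image, score_thresh: float = 0.8):
    """Return one foreground crop per pedestrian detected in a video frame."""
    with torch.no_grad():
        pred = detector([to_tensor(frame)])[0]
    crops = []
    for box, label, score in zip(pred["boxes"], pred["labels"], pred["scores"]):
        if label.item() == 1 and score.item() >= score_thresh:  # COCO class 1 = person
            x1, y1, x2, y2 = [int(v) for v in box.tolist()]
            crops.append(frame.crop((x1, y1, x2, y2)))          # bounding-box screenshot
    return crops
```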
For step 2, the joint points of the pedestrian in the foreground picture are predicted with a method based on a convolutional neural network. All joint points are two-dimensional and labeled in the form (x, y), with x the abscissa and y the ordinate of the joint point. From the ordinates $y_1$ and $y_2$ of the left and right hip joints, the average

$$\bar{y} = \frac{y_1 + y_2}{2}$$

is computed, and with $\bar{y}$ as the boundary the pedestrian picture is divided into an upper half and a lower half. The human body is a non-rigid object, and pedestrian pictures shot by street surveillance cameras exhibit a variety of postures. Compared with divisions based on other image features, dividing the body into parts at the joint points better resists the adverse effects of body-shape change. The overall deformation of the upper body and the lower body is relatively small; at the same time, these parts cover large picture areas and cannot be completely occluded.
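A minimal sketch of this division, assuming the hip ordinates y1 and y2 have already been produced by a 2D posture estimator; split_by_hips is an illustrative name, not part of the patent:

```python
from PIL import Image

def split_by_hips(crop: Image.Image, y1: float, y2: float):
    """Divide a pedestrian crop into upper and lower body at y = (y1 + y2) / 2."""
    w, h = crop.size
    y_mean = int(round((y1 + y2) / 2.0))
    y_mean = max(1, min(h - 1, y_mean))     # keep both halves non-empty
    upper = crop.crop((0, 0, w, y_mean))    # upper-body picture
    lower = crop.crop((0, y_mean, w, h))    # lower-body picture
    return upper, lower
```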
For step 3, the pedestrian whole-body pictures and the upper-body and lower-body pictures obtained from posture estimation form three independent training sets, which are used to train, under the same strategy, three convolutional neural networks with identical structure but independent weights. The upper-body, lower-body and original pictures are preprocessed: each picture is scaled so that it becomes square, and then randomly flipped left and right. The preprocessed pictures are fed into the networks for two-stage training. The first stage trains with the Softmax loss function:

$$S_i = \frac{e^{a_i}}{\sum_{j=1}^{T} e^{a_j}}, \qquad L_1 = -\sum_{i=1}^{T} y_i \log S_i$$

where $a_j$ is the linear prediction score of the network output for class j, $y_i$ equals 1 when the input sample belongs to class i and 0 otherwise, $S_i$ is the probability that the current sample belongs to class i, T is the total number of classes, and $L_1$ is the numerical value of the loss. When training with the Softmax loss has reached a certain precision, training enters the second stage and continues with the Triplet loss function:
$$L_2 = \sum_{i=1}^{n} \Big[\, \big\| f(x_i^a) - f(x_i^p) \big\|_2^2 - \big\| f(x_i^a) - f(x_i^n) \big\|_2^2 + \alpha \,\Big]_+$$

where $[z]_+ = \max(z, 0)$ and f(x) is the mapping from an input picture to the neural-network feature vector. Here $x_i^a$ is a pedestrian picture randomly selected from the data set, $x_i^p$ is another picture with the same label, and $x_i^n$ is a picture with a different label; the triplets $(x_i^a, x_i^p, x_i^n)$ number n in total over the training set, and α is the margin, i.e. the boundary padding value. The term $\| f(x_i^a) - f(x_i^p) \|_2^2$, the squared two-norm (squared Euclidean distance) between feature vectors, measures the intra-class distance of same-class samples; likewise, $\| f(x_i^a) - f(x_i^n) \|_2^2$ measures the inter-class distance between feature vectors of different classes. The loss is greater than zero only when the inter-class squared Euclidean distance fails to exceed the intra-class one by more than α, so optimizing the network pulls together the features of pictures of the same pedestrian and pushes apart the features of pictures of different pedestrians.
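The following sketch restates the Triplet loss above in code; PyTorch is assumed (the patent does not prescribe a framework), and the margin default of 0.5 anticipates the preferred value given in the embodiment:

```python
import torch
import torch.nn.functional as F

def triplet_loss(f_a: torch.Tensor, f_p: torch.Tensor, f_n: torch.Tensor,
                 alpha: float = 0.5) -> torch.Tensor:
    """f_a, f_p, f_n: (n, d) anchor, positive and negative feature vectors."""
    d_ap = (f_a - f_p).pow(2).sum(dim=1)      # squared intra-class distance
    d_an = (f_a - f_n).pow(2).sum(dim=1)      # squared inter-class distance
    return F.relu(d_ap - d_an + alpha).sum()  # [.]_+ hinge summed over triplets
```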
When a neural network is trained with the Triplet loss function, the training pictures must be grouped into triplets as input data. A large number of triplet combinations yield a zero value of the Triplet loss; such triplets contribute nothing to training and should be excluded, so a hard-example mining method based on mini-batch gradient descent is adopted during network training. The input pictures are first sorted by label to obtain a sequence List_a = [x_1, x_2, ..., x_n], where x_i denotes the i-th input sample after sorting. A unit coefficient c is set and List_a is partitioned into groups of c samples; the order between groups is randomly shuffled while the order within each group is left unchanged, giving List_b. Finally a mini-batch size is set and List_b is sliced accordingly, which yields all the mini-batches required for training.
Each training iteration uses only the samples in one mini-batch, and triplets are constructed in each iteration as follows: all same-class pairs in the mini-batch are formed, and for each such pair every sample of a different class in the mini-batch is attached, yielding a triplet. All triplets obtained in the mini-batch are fed through the network, the final Triplet loss value $L_2$ of each is computed, and all triplets are sorted by $L_2$ from large to small. The top-ranked triplets are called hard examples. If the network were trained with hard examples only, however, training would likely collapse. A hard-example coefficient λ between 0 and 1 is therefore set, giving the proportion of hard triplets among all triplets used in one iteration, and a coefficient n gives the number of triplets entering one iteration. The final set of triplets participating in training then consists of the λn top-ranked hard triplets and (1 − λ)n randomly selected triplets, as sketched below.
With the trained networks as feature extractors, the whole-body, upper-body and lower-body pictures of a query pedestrian picture are input into the corresponding networks, and the output of the last fully connected layer of each network is extracted as a feature vector.
For step 4, the three feature vectors obtained in step 3 are connected end to end into a joint feature vector:

$$F' = [F_1\; F_2\; F_3]$$

where $F_1$, $F_2$ and $F_3$ are the feature vectors extracted from the whole-body, upper-body and lower-body pictures respectively. The joint feature vector retains both the global information of the picture and the local information of the pedestrian's upper and lower body, giving better robustness. At the same time, the joint feature vector fixes the global and local features of the pedestrian to specific vector dimensions, i.e. it performs an explicit alignment operation, which reduces the influence of pedestrian posture change on re-identification accuracy.
Finally, the feature distance is measured. Given a candidate set G of pedestrian pictures, the target pedestrian of a query picture p must be found inside G. The joint feature vector $F'_p$ of picture p and the joint feature vectors of all pictures in the set are extracted, and the Euclidean distance between $F'_p$ and every joint feature vector in G is computed:

$$G = \{x_1, x_2, \ldots, x_m\}, \qquad G' = \{F'_1, F'_2, \ldots, F'_m\}, \qquad d_i = \big\| F'_p - F'_i \big\|_2$$

where $F'_i$ is the joint feature vector of the i-th sample in G and $d_i$ is the Euclidean distance between $F'_p$ and $F'_i$. The smaller $d_i$ is, the more similar the pedestrian in that sample is to the target pedestrian. The samples $x_i$ in G are sorted by $d_i$ from small to large, and a fixed number of top-ranked candidate pictures is output.
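A sketch of this measurement and ranking step, assuming the joint feature vectors are already stacked into PyTorch tensors:

```python
import torch

def rank_gallery(f_p: torch.Tensor, gallery: torch.Tensor, top_k: int = 10):
    """f_p: (d,) joint feature of query p; gallery: (m, d) joint features of set G."""
    d = torch.norm(gallery - f_p.unsqueeze(0), dim=1)  # d_i = ||F'_p - F'_i||_2
    return torch.argsort(d)[:top_k]                    # indices sorted small to large
```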
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the principles of the invention without limiting it. In the drawings:
fig. 1 is a system architecture diagram.
Fig. 2 is a distribution diagram of the joint points.
FIG. 3 is a combined feature extraction graph.
Detailed Description
The embodiments of the present invention are described in detail below with reference to the accompanying drawings, so that it can be fully understood and reproduced how the technical means solve the problems and achieve the technical effects. The preferred examples are given solely for illustration and are not intended to limit the scope of the invention.
The execution of the algorithm is described in detail below. As shown in fig. 1, the pedestrian re-identification method based on human body posture estimation of the present invention comprises the following steps:
step 1: multi-camera pedestrian detection
Cameras are installed on different streets. Frames are taken from the recorded videos, and each frame contains a varying number of pedestrians. A pedestrian detection method based on a convolutional neural network yields the foreground picture of each pedestrian.
Step 2: pedestrian attitude estimation
Joint points are predicted with a pedestrian posture estimation method based on a neural network. As shown in fig. 2, the posture estimation method outputs the pixel position of each joint point in the image. The joint points are the nose, neck, left eye, right eye, left ear, right ear, left shoulder, right shoulder, left elbow, right elbow, left wrist, right wrist, left hip, right hip, left knee, right knee, left ankle and right ankle. Each joint point is two-dimensional, of the form (x, y), where x is the abscissa and y the ordinate of the pixel at the joint point. From the ordinates $y_1$ and $y_2$ of the left and right hip joints, the average

$$\bar{y} = \frac{y_1 + y_2}{2}$$

is computed, and finally the pedestrian foreground picture is divided at the boundary $\bar{y}$ into an upper-body picture and a lower-body picture.
And step 3: training neural network to extract feature vector
To train the neural networks better, the upper-body, lower-body and whole-body pictures obtained in step 2 are first preprocessed. The pictures are scaled to a square, preferably 256 × 256 pixels, and then randomly mirror-flipped left and right, the flipping probability preferably being set to 50%. The means $(\mu_R, \mu_G, \mu_B)$ of the R, G and B channels are computed over all pictures, and the mean of the corresponding channel is subtracted from every picture pixel to realize mean regularization, as sketched below.
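A sketch of this preprocessing pipeline, assuming torchvision; the channel means shown are placeholders for the values the embodiment computes over its own training pictures:

```python
import torchvision.transforms as T

# Placeholder channel means; the embodiment computes these over all training pictures.
channel_mean = [0.485, 0.456, 0.406]

preprocess = T.Compose([
    T.Resize((256, 256)),                  # scale to a 256 x 256 square
    T.RandomHorizontalFlip(p=0.5),         # random left-right mirror flip
    T.ToTensor(),
    T.Normalize(mean=channel_mean, std=[1.0, 1.0, 1.0]),  # mean subtraction only
])
```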
Two stages are used in the neural network training process; they differ in the loss function used to train the network. The first stage trains with the Softmax loss function:

$$S_i = \frac{e^{a_i}}{\sum_{j=1}^{T} e^{a_j}}, \qquad L_1 = -\sum_{i=1}^{T} y_i \log S_i$$

where $a_i$ is the linear prediction score of the network output for class i, $y_i$ is the label of the input sample, $S_i$ is the probability that the current sample belongs to class i, and $L_1$ is the resulting loss value. In the first stage the network gradually converges; when $L_1$ no longer decreases, training enters the second stage and continues with the Triplet loss function instead:
$$L_2 = \sum_{i=1}^{n} \Big[\, \big\| f(x_i^a) - f(x_i^p) \big\|_2^2 - \big\| f(x_i^a) - f(x_i^n) \big\|_2^2 + \alpha \,\Big]_+$$

where f(x) is the mapping from an input picture to the neural-network feature vector, $x_i^a$ is a pedestrian picture randomly selected from the data set, $x_i^p$ is another picture with the same label, and $x_i^n$ is a picture with any different label; the total number of triplets in the training set is n, and α is the margin, i.e. the boundary padding value, preferably set to 0.5.
During the second stage of training, the hard-example mining method based on mini-batch gradient descent is adopted. The input pictures are sorted by label into a sequence List_a = [x_1, x_2, ..., x_n], where x_i denotes the i-th input sample after sorting. A unit coefficient c is set, preferably 4, so that every 4 samples form a subgroup; with the subgroups as the smallest unit, List_a is shuffled to obtain List_b. The mini-batch size is preferably set to 40, and List_b is sliced with a step of 40, which yields all the mini-batches required for training; a sketch of this construction follows below. After the mini-batches are obtained, hard examples are mined among the samples. A hard-example proportion coefficient λ is set, preferably λ = 0.2 at the start of training and λ = 0.5 in the middle of training, and a coefficient n gives the number of triplets entering one training iteration. The final set of triplets participating in training then consists of the λn top-ranked hard triplets and (1 − λ)n randomly selected triplets.
As shown in fig. 3, training on the upper-body, lower-body and whole-body pictures through the above steps yields network 1, network 2 and network 3. The output of the penultimate fully connected layer of each network is extracted as the feature vector, the number of neurons in that layer preferably being set to 256.
And 4, step 4: feature fusion and matching
The whole-body, upper-body and lower-body pictures of the picture to be queried are input into the corresponding neural networks. The resulting feature vectors are concatenated end to end into a 768-dimensional joint feature vector. The Euclidean distance between the joint feature vector of the target pedestrian picture and the joint feature vector of every picture in the set to be retrieved is computed, and the set is sorted by this distance from small to large. Preferably, the first 10 pictures are output as the result, as sketched below.
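A sketch of this fusion step; net_whole, net_upper and net_lower stand for the trained networks 1 to 3, each assumed to output a 256-dimensional feature:

```python
import torch

def joint_feature(net_whole, net_upper, net_lower, whole, upper, lower):
    """Concatenate the three part features into F' = [F1 F2 F3] (768 dimensions)."""
    with torch.no_grad():
        f1 = net_whole(whole)   # (1, 256) whole-body feature F1
        f2 = net_upper(upper)   # (1, 256) upper-body feature F2
        f3 = net_lower(lower)   # (1, 256) lower-body feature F3
    return torch.cat([f1, f2, f3], dim=1)
```

The ranked retrieval itself then reuses the distance computation shown after step 4 of the summary above.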
Although embodiments of the present invention have been shown and described, the above description is provided only to aid understanding of the invention and is not intended to limit it. Those skilled in the art may make various changes in form and detail without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (5)

1. A pedestrian re-identification method based on human body posture estimation is characterized by comprising the following main steps:
step one, taking frames of pedestrian videos collected by a plurality of cameras, detecting pedestrians in each frame of image, and obtaining a pedestrian foreground picture;
step two, acquiring the joint point positions of the pedestrian in the pedestrian picture with a posture estimation method, and accordingly dividing the pedestrian picture into an upper-body picture and a lower-body picture;
step three, training a neural network, extracting features of the original picture, the upper half body picture and the lower half body picture through the neural network, and outputting feature vectors of all parts of the pedestrian;
and step four, fusing the obtained feature vectors of all the parts to obtain a combined feature vector. And (5) carrying out distance measurement through the joint feature vector, and outputting a final result.
2. The method of claim 1, characterized in that in step two, joint point prediction is performed on the pedestrian foreground picture with a human body posture estimation method based on a convolutional neural network. The pixel position of each predicted two-dimensional joint point is labeled (x, y), where x is the abscissa and y the ordinate of the joint-point pixel. From the ordinates $y_1$ and $y_2$ of the left and right hip joints, the average

$$\bar{y} = \frac{y_1 + y_2}{2}$$

is computed, and with $\bar{y}$ as the boundary the pedestrian picture is divided into an upper-body picture and a lower-body picture.
3. The method of claim 1, characterized in that in step three the neural network is trained in two stages using different loss functions. The first stage uses the Softmax loss function:

$$S_i = \frac{e^{a_i}}{\sum_{j=1}^{T} e^{a_j}}, \qquad L_1 = -\sum_{i=1}^{T} y_i \log S_i$$

where $a_i$ is the linear prediction score of the network output for class i, $y_i$ is the label of the input sample, $S_i$ is the probability that the current sample belongs to class i, and T is the total number of classes. When the network converges and the loss value no longer decreases, the Triplet loss function is used:

$$L_2 = \sum_{i=1}^{N} \Big[\, \big\| f(x_i^a) - f(x_i^p) \big\|_2^2 - \big\| f(x_i^a) - f(x_i^n) \big\|_2^2 + \alpha \,\Big]_+$$

where f(x) is the mapping from an input picture to the feature vector, $x_i^a$ is a randomly selected pedestrian picture, $x_i^p$ is another picture with the same label, and $x_i^n$ is a picture with a different label; the total number of triplets in the training set is N, and α is the boundary padding value.
4. The method of claim 1, characterized in that in step three the training of claim 3 is applied independently to the pedestrian's whole-body, upper-body and lower-body pictures. The output of the last fully connected layer of each network is extracted to obtain the global and local feature expressions of each sample.
5. The method of claim 1, characterized in that in step four the three feature expressions obtained in step three are concatenated into a joint feature vector, which explicitly fixes the global and local feature information of different samples to the same vector dimensions.
CN201910006045.2A 2019-01-02 2019-01-02 Pedestrian re-identification method based on human body posture estimation Pending CN111401113A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910006045.2A CN111401113A (en) 2019-01-02 2019-01-02 Pedestrian re-identification method based on human body posture estimation

Publications (1)

Publication Number Publication Date
CN111401113A 2020-07-10

Family

ID=71432230

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910006045.2A Pending CN111401113A (en) 2019-01-02 2019-01-02 Pedestrian re-identification method based on human body posture estimation

Country Status (1)

Country Link
CN (1) CN111401113A (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106778527A (en) * 2016-11-28 2017-05-31 中通服公众信息产业股份有限公司 A kind of improved neutral net pedestrian recognition methods again based on triple losses
CN106650827A (en) * 2016-12-30 2017-05-10 南京大学 Human body posture estimation method and system based on structure guidance deep learning
CN107832672A (en) * 2017-10-12 2018-03-23 北京航空航天大学 A kind of pedestrian's recognition methods again that more loss functions are designed using attitude information

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112990144A (en) * 2021-04-30 2021-06-18 德鲁动力科技(成都)有限公司 Data enhancement method and system for pedestrian re-identification
CN113255695A (en) * 2021-05-21 2021-08-13 广州广电运通金融电子股份有限公司 Feature extraction method and system for target re-identification
CN113378729A (en) * 2021-06-16 2021-09-10 西安理工大学 Pose embedding-based multi-scale convolution feature fusion pedestrian re-identification method
CN113378729B (en) * 2021-06-16 2024-05-10 西安理工大学 Multi-scale convolution feature fusion pedestrian re-identification method based on pose embedding
CN114663917A (en) * 2022-03-14 2022-06-24 清华大学 Multi-view-angle-based multi-person three-dimensional human body pose estimation method and device
CN116052220A (en) * 2023-02-07 2023-05-02 北京多维视通技术有限公司 Pedestrian re-identification method, device, equipment and medium
CN116052220B (en) * 2023-02-07 2023-11-24 北京多维视通技术有限公司 Pedestrian re-identification method, device, equipment and medium

Similar Documents

Publication Publication Date Title
CN107832672B (en) Pedestrian re-identification method for designing multi-loss function by utilizing attitude information
CN111160297B (en) Pedestrian re-identification method and device based on residual attention mechanism space-time combined model
CN111259850B (en) Pedestrian re-identification method integrating random batch mask and multi-scale representation learning
Zhong et al. Grayscale enhancement colorization network for visible-infrared person re-identification
Xie et al. Comparator networks
CN112861720B (en) Remote sensing image small sample target detection method based on prototype convolutional neural network
Laskar et al. Camera relocalization by computing pairwise relative poses using convolutional neural network
Bhagat et al. Indian sign language gesture recognition using image processing and deep learning
CN111401113A (en) Pedestrian re-identification method based on human body posture estimation
CN111325115B (en) Cross-modal countervailing pedestrian re-identification method and system with triple constraint loss
CN111814661B (en) Human body behavior recognition method based on residual error-circulating neural network
CN107145826B (en) Pedestrian re-identification method based on double-constraint metric learning and sample reordering
CN109800794B (en) Cross-camera re-identification fusion method and system for appearance similar targets
CN109472191B (en) Pedestrian re-identification and tracking method based on space-time context
CN110717411A (en) Pedestrian re-identification method based on deep layer feature fusion
CN111709311A (en) Pedestrian re-identification method based on multi-scale convolution feature fusion
CN110909605A (en) Cross-modal pedestrian re-identification method based on contrast correlation
CN111738048B (en) Pedestrian re-identification method
CN112001278A (en) Crowd counting model based on structured knowledge distillation and method thereof
CN112507853B (en) Cross-modal pedestrian re-recognition method based on mutual attention mechanism
CN113610046B (en) Behavior recognition method based on depth video linkage characteristics
TW202141424A (en) Target tracking method and apparatus, storage medium
CN111639580A (en) Gait recognition method combining feature separation model and visual angle conversion model
CN111582154A (en) Pedestrian re-identification method based on multitask skeleton posture division component
CN112329662B (en) Multi-view saliency estimation method based on unsupervised learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication (application publication date: 20200710)