CN113378729B - Multi-scale convolution feature fusion pedestrian re-identification method based on pose embedding - Google Patents

Multi-scale convolution feature fusion pedestrian re-identification method based on pose embedding

Info

Publication number
CN113378729B
Authority
CN
China
Prior art keywords
image
pedestrian
feature
images
resnet
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110667913.9A
Other languages
Chinese (zh)
Other versions
CN113378729A (en)
Inventor
廖开阳
雷浩
郑元林
章明珠
范冰
黄港
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xi'an University of Technology
Original Assignee
Xi'an University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xi'an University of Technology
Priority to CN202110667913.9A
Publication of CN113378729A
Application granted
Publication of CN113378729B
Legal status: Active (current)

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/25 - Fusion techniques
    • G06F18/253 - Fusion techniques of extracted features
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Processing (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a multi-scale convolution feature fusion pedestrian re-identification method based on pose embedding, which comprises the following steps: an original pedestrian image is preprocessed by random erasing to obtain a pedestrian image, and baseline network optimization is performed on a ResNet-50 network model to extract deep convolution features; a salient human body image is extracted from the original pedestrian image; pose extraction is first performed on the salient human body image, and local semantic features are then extracted from the body part images; the deep convolution features and the local semantic features are weighted and fused, and distance measurement is performed on the weighted fused features to generate an initial measurement list; the images in the initial measurement list are re-ranked according to a re-ranking algorithm to obtain the correct matching ranking of the images, and pedestrian matching images are output to identify specific pedestrians. The accuracy of identification and localization can be greatly improved.

Description

Multi-scale convolution feature fusion pedestrian re-identification method based on pose embedding
Technical Field
The invention belongs to the technical field of image processing methods, and relates to a multi-scale convolution feature fusion pedestrian re-identification method based on pose embedding.
Background
In recent years, artificial intelligence has been an important focus of technological development and has taken the lead among the technologies we use; its application in intelligent monitoring has likewise become extremely important. With the expansion of cities, monitoring systems have become more widespread, with thousands of cameras distributed across streets. As the number of cameras grows, relying solely on manual monitoring is extremely expensive, and no human operator can watch so many feeds at once. The pedestrian re-identification technique has therefore attracted the attention of researchers: it helps people monitor, track and identify pedestrians. Since humans mainly receive and perceive information from the outside world through vision, and human vision can extract the desired information directly from cluttered images, researchers want cameras to capture objects in the environment as effectively and quickly as the human visual system does; this goal ultimately led to the current pedestrian re-identification technique. Pedestrian re-identification is widely used; for example, intelligent monitoring systems require it. The technique relies on the strong processing power of computers: a video monitoring system can automatically filter out useless information and actively identify human bodies, enabling comprehensive monitoring and a 24-hour system for early warning and post-warning evidence collection. Pedestrian traffic statistics also use this technique: it likewise exploits the computer's processing power to filter out unwanted information and to identify and count pedestrians automatically, while pedestrians appearing multiple times in different areas are not counted repeatedly, so that pedestrian traffic can be counted effectively and accurately.
One key factor affecting the accuracy of pedestrian re-identification is pedestrian misalignment; the mutual occlusion among body parts and the continuous variation of posture that it causes are a great challenge for pedestrian re-identification research. First, the posture of a pedestrian changes constantly during movement, and the pedestrian inevitably adopts various postures, which means that local changes of the body within the bounding box are unpredictable. For example, pedestrians may put their hands behind their backs or over their heads while moving, creating local occlusion from misalignment, which has a significant impact on the extracted features. Second, detection when pedestrians are irregularly aligned also affects the accuracy of pedestrian re-identification research. One method commonly used in the pedestrian re-identification field is to divide the bounding box into horizontal stripes; however, this method holds up only under slight vertical deviation. When the vertical deviation is severe, the stripes covering the body and head may be matched against background, causing false matches in the pedestrian re-identification task; the horizontal striping method is thus not ideal under severe misalignment. Moreover, as a pedestrian changes posture the background changes continuously, so the convolutional neural network may erroneously weight the background and degrade recognition accuracy. Therefore, how to overcome the misalignment and background variation caused by changes in pedestrian posture is the key to improving the accuracy of pedestrian re-identification.
Disclosure of Invention
The invention aims to provide a multi-scale convolution feature fusion pedestrian re-identification method based on pose embedding, which solves the problem in the prior art of low pedestrian re-identification accuracy caused by the misalignment and background variation that result from changes in pedestrian posture.
The technical scheme adopted by the invention is that the multi-scale convolution feature fusion pedestrian re-identification method based on pose embedding comprises the following steps:
step 1, preprocessing an original pedestrian image by random erasing to obtain a pedestrian image, performing baseline network optimization on a ResNet-50 network model, and inputting the pedestrian image into the optimized ResNet-50 network model to obtain deep convolution features;
step 2, extracting features from the original pedestrian image as the input image to obtain a salient human body image;
step 3, firstly performing pose extraction on the salient human body image using a pose convolution machine to obtain body part images, and then inputting the body part images into a ResNet-50 network to extract local semantic features;
step 4, performing weighted fusion on the deep convolution features and the local semantic features to obtain weighted fused features, measuring the distances between the images in the image test library and the image query library and the fused features respectively, and generating an initial measurement list from the measured results;
and step 5, re-ranking the images in the initial measurement list according to a re-ranking algorithm to obtain the correct matching ranking of the images, and outputting pedestrian matching images to identify specific pedestrians.
The invention is also characterized in that:
The baseline network optimization of the ResNet-50 network model is performed as follows:
the loss function of the ResNet-50 network model is optimized by combining the Softmax loss and the Triplet loss, the optimized loss function being:
L = Σ_{i=1}^{m} L_i;
where m is the number of loss functions being combined, and the Triplet loss term is:
L_t = [D(f_a, f_p) - D(f_a, f_n) + a]+;
where f_a is the feature vector of the anchor sample, f_p is the feature vector of the positive sample, f_n is the feature vector of the negative sample, D(·,·) is the distance between two feature vectors, a is the minimum margin between the anchor-positive and anchor-negative distances, and [·]+ means that when the value in [ ] is greater than zero it is taken as the loss, and when it is less than zero the loss is zero.
Step 2 specifically comprises the following steps:
Step 2.1, removing the last pooling stage of the VGG-16 network structure to obtain the network structure, taking the original pedestrian image as the input image, inputting it into the network structure, and outputting a feature map;
Step 2.2, deconvolving the feature map to the size of the input image, and adding a new convolution layer to generate a predicted saliency map;
Step 2.3, first applying a convolution layer with kernel size 1×1 in the network structure to the conv1-2 layer to generate a boundary prediction, then adding the boundary prediction to the predicted saliency map to obtain a refined boundary, and then applying a convolution layer to the refined result to obtain the salient human body image.
Step 3 specifically comprises the following steps:
Step 3.1, using the salient human body image as the input of a pose estimator to locate 14 joint points;
Step 3.2, grouping the 14 located human joints into 6 sub-areas, cropping, rotating and resizing the 6 sub-areas to fixed sizes and orientations, and combining them to form a stitched body part image;
Step 3.3, performing pose transformation on the size of each body part in the stitched body part image to obtain the body part image;
Step 3.4, inputting the body part image into a ResNet-50 network for training, and extracting local semantic features.
The specific process of step 5 is as follows:
encoding the k-reciprocal nearest neighbors by weighting into a single vector to form the k-reciprocal feature, calculating the Jaccard distance between the pedestrian test image p and the image set using the k-reciprocal features, and finally weighting the original distance and the Jaccard distance between the pedestrian test image p and the image set to obtain the distance formula; and calculating the distances between the images in the initial measurement list and the fused features according to the distance formula, re-ranking to obtain the correct matching ranking of the images, and outputting the pedestrian matching images to identify the specific pedestrian.
The beneficial effects of the invention are as follows:
The invention discloses a multi-scale convolution feature fusion pedestrian re-identification method based on pose embedding. Deep global features and local semantic features are fused, distance measurement between different images is performed on the fused weighted features, and images of the same pedestrian are identified and retrieved; the method identifies and retrieves pedestrian images in an original image database to obtain images of a specific pedestrian, and is therefore well suited to a pose-embedding-based multi-scale convolution feature fusion pedestrian re-identification system. The performance of the baseline network is improved by random erasing and the triplet loss function, and the local features obtained by pose estimation are aggregated by feature weighting with the global features obtained by the baseline network, achieving global optimization; this facilitates target identification and localization, accelerates the algorithm, and improves the stability of the system. The method can greatly improve the accuracy of identification and localization, and can be used not only for target identification and retrieval on pedestrian images but also in other fields.
Drawings
FIG. 1 is a flow chart of a multi-scale convolution feature fusion pedestrian re-identification method based on pose embedding;
FIG. 2 is a graph of random erasure processing effect of a multi-scale convolution feature fusion pedestrian re-identification method based on pose embedding;
FIG. 3 is a schematic diagram of the triplet loss of the multi-scale convolution feature fusion pedestrian re-identification method based on pose embedding;
FIG. 4 is a pose embedding effect diagram of the multi-scale convolution feature fusion pedestrian re-identification method based on pose embedding.
Detailed Description
The invention will be described in detail below with reference to the drawings and the detailed description.
A multi-scale convolution feature fusion pedestrian re-identification method based on pose embedding is shown in fig. 1, and comprises the following steps:
Step 1, an image database is established; the database consists of pedestrian images collected manually and corrected by computer, 72000 images in total. The original pedestrian images are preprocessed by random erasing to obtain pedestrian images, baseline network optimization is performed on the ResNet-50 network model, and the pedestrian images are input into the optimized ResNet-50 network model to obtain deep convolution features;
Step 1.1, randomly erasing an original pedestrian image by adopting a random erasing enhancement processing method to obtain a pedestrian image;
Specifically, random erasing augmentation (Random Erasing Augmentation, REA) is an effective data enhancement method. It occludes the training images: a rectangular area with randomly generated position and size is placed in the image, part of the pedestrian image is occluded, and the pixel values in the occluded area are set to random values. In this way, overfitting is reduced and the convergence of the network model is improved, thereby improving the performance of the deep learning model.
During training of the network model, for the original training data set, assume that the probability of randomly erasing an image is P, so that the probability of leaving it unchanged is 1-P. In the random erasing process a rectangular area is generated with probability P to occlude the image, with the position and area of the occlusion chosen at random.
Assume that the size of the image to be randomly erased, i.e. the original pedestrian image, is:
S=W×H (1);
where W is the width of the pedestrian image and H is the height of the pedestrian image.
Assume that the area of the randomly erased rectangular region is S_e, constrained to lie within the range specified by a minimum value S_l and a maximum value S_h, and that the aspect ratio of the erased region is r_e. The height H_e and width W_e of the randomly erased rectangular region are then:
H_e = √(S_e × r_e) (2);
W_e = √(S_e / r_e) (3);
where S_e is the area of the erased rectangular frame, r_e is the aspect ratio of the erased rectangular frame, H_e is the height of the erased rectangular frame, and W_e is the width of the erased rectangular frame.
A point P = (x_e, y_e) is selected at random on the original pedestrian image. If formulas (4) and (5) are satisfied:
x_e + W_e ≤ W (4);
y_e + H_e ≤ H (5);
then the rectangular area to be erased in the original pedestrian image is (x_e, y_e, x_e + W_e, y_e + H_e), and each pixel in this rectangular area is assigned a random value, replacing the original content of the selected region. If the randomly selected point P = (x_e, y_e) does not satisfy formulas (4) and (5), the above procedure is repeated and a new point P = (x_e, y_e) is selected in the image until a suitable random point is found. Finally, the randomly erased original pedestrian image (i.e. the pedestrian image) is output, as shown in FIG. 2.
Step 1.2, optimizing the loss function of the ResNet-50 network model by combining the Softmax loss and the Triplet loss;
Specifically, in the field of pedestrian re-identification the triplet loss (Triplet loss) is also widely used, most often in combination with the Softmax loss in a network model. As shown in FIG. 3, when using the triplet loss function, three pictures are taken as inputs to the network: {x_a, x_p, x_n}, where x_a is the anchor sample (Anchor), randomly selected from the data set used to train the network model, x_p is a training sample whose pedestrian identity belongs to the same class as the anchor sample, i.e. the positive sample, and x_n is a training sample whose pedestrian identity belongs to a different class from the anchor sample, i.e. the negative sample. These training samples are input into identical network branches for feature extraction, as shown in FIG. 3; after learning with the Triplet loss, the distance between the anchor sample and the positive sample is minimized and the distance between the anchor sample and the negative sample is maximized. The formula for calculating the Triplet loss is:
L_t = [D(f_a, f_p) - D(f_a, f_n) + a]+ (6);
where f_a is the feature vector of the anchor sample, f_p is the feature vector of the positive sample, f_n is the feature vector of the negative sample, D(·,·) is the distance between two feature vectors, a is the minimum margin between the anchor-positive and anchor-negative distances, and [·]+ means that when the value in [ ] is greater than zero it is taken as the loss, and when it is less than zero the loss is zero;
From the objective function it can be seen that when the distance between f_a and f_n is less than the distance between f_a and f_p plus a, the value in [ ] is greater than zero and there is a loss; when the distance between f_a and f_n is greater than or equal to the distance between f_a and f_p plus a, the loss is zero.
Through the triplet loss function, the network model shortens the distance between pedestrian images with the same label and lengthens the distance between pedestrian images with different labels, making the trained network model more discriminative.
The loss function of the ResNet-50 network model is optimized by combining the Softmax loss and the Triplet loss, the optimized loss function being:
L = Σ_{i=1}^{m} L_i (7);
where m is the number of loss functions being combined;
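A minimal PyTorch sketch of formulas (6) and (7) follows, assuming the Euclidean distance for D(·,·), a margin a = 0.3, and m = 2 loss terms (cross-entropy as the Softmax loss plus the triplet loss); the helper names are illustrative.

```python
import torch
import torch.nn.functional as F

def triplet_loss(f_a, f_p, f_n, margin=0.3):
    """Formula (6): L_t = [D(f_a, f_p) - D(f_a, f_n) + a]+ over a batch.

    f_a, f_p, f_n: (batch, dim) feature vectors of anchor, positive and
    negative samples; margin = 0.3 is an assumed setting.
    """
    d_ap = F.pairwise_distance(f_a, f_p)   # anchor-positive distance
    d_an = F.pairwise_distance(f_a, f_n)   # anchor-negative distance
    return torch.clamp(d_ap - d_an + margin, min=0).mean()

def combined_loss(logits, labels, f_a, f_p, f_n):
    """Formula (7) with m = 2: sum of the Softmax (cross-entropy) loss
    and the Triplet loss."""
    return F.cross_entropy(logits, labels) + triplet_loss(f_a, f_p, f_n)
```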
Step 1.3, inputting the pedestrian image into the optimized ResNet-50 network model to obtain the deep convolution features.
Step 2, extracting features from the original pedestrian image as the input image, and separating foreground from background to obtain a salient human body image;
Step 2.1, removing the last pooling stage of the VGG-16 network structure to obtain the network structure, taking the original pedestrian image as the input image, inputting it into the network structure, and outputting a feature map;
Specifically, because the VGG-16 model performs well in image classification and generalizes well, the saliency model is also built on VGG-16. Given an input image of size W×H, the size of the output map of the full network is [W/2^5, H/2^5]; that is, a network architecture built on the complete VGG-16 reduces the feature map by a factor of 32 relative to the input. In this embodiment the last pooling stage of VGG-16 is removed, which enlarges the output feature map and balances semantic context against image detail. The feature map output by the network structure of the present invention is therefore reduced by a factor of 16 compared with the input image.
Step 2.2, the integrated feature map already contains various saliency cues, so it can be used to predict the saliency map. Specifically, the feature map is deconvolved to the size of the input image, and a new convolution layer is added to generate a predicted saliency map;
Step 2.3, boundary refinement is added by introducing a short connection into the prediction, and foreground and background are further separated by boundary refinement; the bottom-level features are expected to help predict object boundaries, and they have the same spatial resolution as the input image. Specifically, a convolution layer with kernel size 1×1 in the network structure is applied to the conv1-2 layer to generate a boundary prediction, the boundary prediction is added to the predicted saliency map to obtain a refined boundary, and a convolution layer is then applied to the refined result to obtain the salient human body image.
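The saliency branch of step 2 could be assembled roughly as in the sketch below, assuming a torchvision VGG-16 backbone; the layer indices, channel widths, and the use of bilinear upsampling in place of a learned deconvolution are assumptions for illustration, not the claimed structure.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import vgg16

class SaliencyNet(nn.Module):
    """Sketch of step 2: VGG-16 without its last pooling stage (stride 16
    instead of 32), upsampling back to input size, and a boundary branch
    built from a 1x1 convolution on conv1-2."""
    def __init__(self):
        super().__init__()
        feats = vgg16(weights=None).features    # torchvision >= 0.13 API
        self.conv1_2 = feats[:4]                # conv1-1, relu, conv1-2, relu
        self.backbone = feats[:-1]              # drop the last pooling stage
        self.score = nn.Conv2d(512, 1, kernel_size=1)    # saliency prediction
        self.boundary = nn.Conv2d(64, 1, kernel_size=1)  # 1x1 conv on conv1-2
        self.refine = nn.Conv2d(1, 1, kernel_size=3, padding=1)

    def forward(self, x):
        H, W = x.shape[2:]
        sal = self.score(self.backbone(x))       # coarse map at 1/16 resolution
        sal = F.interpolate(sal, size=(H, W),    # stand-in for deconvolution
                            mode='bilinear', align_corners=False)
        edge = self.boundary(self.conv1_2(x))    # boundary prediction
        return torch.sigmoid(self.refine(sal + edge))  # refined saliency map
```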
Step 3, pose extraction is first performed on the salient human body image using a pose convolution machine to obtain body part images, and the body part images are then input into a ResNet-50 network to extract local semantic features. Specifically, pose extraction uses an off-the-shelf convolutional pose machine model, a sequential convolution structure that can detect 14 body joints, namely the head, neck, left and right shoulders, left and right elbows, left and right wrists, left and right hips, left and right knees, and left and right ankles, as shown in FIG. 4.
Step 3.1, using the salient human body image as the input of the pose estimator to locate 14 joint points, the 14 joints being the head, neck, right shoulder, right elbow, right wrist, left shoulder, left elbow, left wrist, left hip, left knee, left ankle, right hip, right knee and right ankle;
Step 3.2, grouping the 14 located human joints into 6 sub-areas (head, upper body, left arm, right arm, left leg and right leg) as human body parts, cropping, rotating and resizing the 6 sub-areas to fixed sizes and orientations, and combining them to form a stitched body part image; because the 6 parts of the human body differ in size, black areas inevitably appear in the stitched image;
Step 3.3, performing pose transformation on the size of each body part in the stitched body part image to obtain the body part image;
Since black areas appear in the stitched body part image, the size of each body part must be pose-transformed to remove them; the part sizes are determined mainly by observation. For example, in this embodiment the width of an arm is observed to be about 20 pixels and the width of a leg about 30 pixels; decreasing these parameters loses information, while increasing them may introduce more background noise. System performance nevertheless remains stable as long as the parameter variation is small: when a part size varies within a small range, the discriminative information it contains changes little, so the network can still learn a discriminative embedding given the supervision signal.
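The part extraction of steps 3.1-3.2 can be sketched as follows. The joint-to-part grouping and the 15-pixel padding are assumptions for illustration (the embodiment only reports observed arm and leg widths of about 20 and 30 pixels); rotating and resizing each crop to a fixed size and stitching them are left to the image library of choice.

```python
import numpy as np

# Assumed grouping of the 14 joints (indexed in the order listed in
# step 3.1) into the 6 body sub-areas.
PARTS = {
    "head":       [0, 1],          # head, neck
    "upper_body": [2, 5, 8, 11],   # both shoulders and both hips
    "right_arm":  [2, 3, 4],       # right shoulder, elbow, wrist
    "left_arm":   [5, 6, 7],       # left shoulder, elbow, wrist
    "left_leg":   [8, 9, 10],      # left hip, knee, ankle
    "right_leg":  [11, 12, 13],    # right hip, knee, ankle
}

def crop_part(img, joints, idxs, pad=15):
    """Crop the padded bounding box around the given joints."""
    pts = joints[idxs]
    x0, y0 = np.maximum(pts.min(axis=0).astype(int) - pad, 0)
    x1, y1 = pts.max(axis=0).astype(int) + pad
    return img[y0:y1, x0:x1]

def split_into_parts(img, joints):
    """joints: (14, 2) array of (x, y) locations from the pose estimator.
    Returns the 6 part crops keyed by part name."""
    return {name: crop_part(img, joints, idxs) for name, idxs in PARTS.items()}
```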
Step 3.4, dividing the body part images into a test set and a training set, inputting them into a ResNet-50 network for training, and extracting local semantic features. The ResNet-50 network in this step does not share weights with the ResNet-50 network optimized in step 1; instead, a separate set of weights is trained to discriminate the local semantic images and extract the local semantic features.
Step 4, performing weighted fusion on the deep convolution features and the local semantic features to obtain weighted fused features, measuring the distances between the images in the image test library and the image query library and the fused features respectively, generating an initial measurement list ranking from the measured results, and returning the query scores. The feature-weighted aggregation is:
d = α·f_DEEP + (1-α)·f_SOD (8);
where the parameter 0 ≤ α ≤ 1 represents the relative weight of the deep global feature and the local semantic feature.
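Formula (8) is a convex combination of the two feature vectors; a one-line sketch is given below, where the default α = 0.5 is an assumed value (the description only constrains 0 ≤ α ≤ 1).

```python
import numpy as np

def fuse_features(f_deep, f_sod, alpha=0.5):
    """Formula (8): d = alpha * f_DEEP + (1 - alpha) * f_SOD."""
    return alpha * np.asarray(f_deep) + (1.0 - alpha) * np.asarray(f_sod)
```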
Step 5, re-ranking the images in the initial measurement list according to a re-ranking algorithm to obtain the correct matching ranking of the images, and outputting pedestrian matching images to identify specific pedestrians.
Specifically, given a pedestrian test image p and an image set G = {g_i | i = 1, 2, …, N}, the k-reciprocal nearest neighbors are encoded by weighting into a single vector to form the k-reciprocal feature; the Jaccard distance between the pedestrian test image p and the image set is then calculated using the k-reciprocal features; finally, the original distance and the Jaccard distance between the pedestrian test image p and the image set are weighted to obtain the final distance. The distances between the images in the initial measurement list and the fused features are calculated and sorted to obtain the correct matching ranking of the images, and the pedestrian matching images are output to identify the specific pedestrian.
Step 5.1, first, a pedestrian image p is given for testing and an image set G = {g_i | i = 1, 2, …, N} is given as the pedestrian image reference; the original distance between the pedestrian image p and a reference image g_i is measured by the Mahalanobis distance, as shown in formula (9):
d(p, g_i) = (x_p - x_gi)^T M (x_p - x_gi) (9);
where x_p is the appearance feature of the test image p, x_gi is the appearance feature of the reference image g_i, and M is a positive semi-definite matrix.
The list is initialized according to the original distances between the test image p and the reference images g_i, giving the ranked list:
L(p, G) = {g_1^0, g_2^0, …, g_N^0}, where d(p, g_i^0) < d(p, g_{i+1}^0) (10);
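Formulas (9) and (10) can be sketched as below; taking M as the identity matrix, which reduces the Mahalanobis distance to the squared Euclidean distance, is an assumption for illustration.

```python
import numpy as np

def mahalanobis(x_p, x_g, M):
    """Formula (9): d(p, g_i) = (x_p - x_gi)^T M (x_p - x_gi),
    with M positive semi-definite."""
    diff = x_p - x_g
    return float(diff @ M @ diff)

def initial_ranking(x_p, gallery, M=None):
    """Sort the gallery by original distance to the probe, giving the
    initial list L(p, G) of formula (10)."""
    M = np.eye(x_p.shape[0]) if M is None else M
    dists = np.array([mahalanobis(x_p, x_g, M) for x_g in gallery])
    order = np.argsort(dists)          # ascending distance: best match first
    return order, dists[order]
```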
Step 5.2, the purpose of the re-ranking strategy is to reorder the initial ranked list L(p, G) so that more correctly matched image samples are ranked at the top of the list, thereby improving the recognition accuracy of pedestrian re-identification.
The top k ranked samples in the initial ranked list are defined as the k-nearest neighbors (k-nn):
N(p, k) = {g_1^0, g_2^0, …, g_k^0}, |N(p, k)| = k (11);
The k-reciprocal nearest neighbors (k-rnn) are then expressed as:
R(p, k) = {g_i | (g_i ∈ N(p, k)) ∧ (p ∈ N(g_i, k))} (12);
However, due to influencing factors such as brightness variation, pose variation, viewing-angle variation and occlusion, a correctly matched sample may be excluded from the nearest neighbors. To solve this problem, each candidate nearest-neighbor set is converted into a more robust set:
R*(p, k) ← R(p, k) ∪ R(g_i, ½k), s.t. |R(p, k) ∩ R(g_i, ½k)| ≥ (2/3)|R(g_i, ½k)|, ∀ g_i ∈ R(p, k) (13);
that is, for each sample g_i in the original set R(p, k), its ½k-reciprocal nearest-neighbor set R(g_i, ½k) is found, and when the number of samples it shares with R(p, k) satisfies the above condition, its union with R(p, k) is taken, so that after expansion more positive samples are added to the set R(p, k);
Step 5.3, weights are reassigned according to the original distances between the query image and its nearest neighbors, and the k-reciprocal nearest-neighbor set of a sample image is encoded by a Gaussian kernel into an N-dimensional vector V_p = [V_{p,g_1}, V_{p,g_2}, …, V_{p,g_N}], defined as:
V_{p,g_i} = exp(-d(p, g_i)), if g_i ∈ R*(p, k); V_{p,g_i} = 0, otherwise (14);
Since nearer neighbors are assigned larger weights and farther neighbors smaller weights, the numbers of candidates in the intersection and union needed to calculate the Jaccard distance can be computed as:
|V_p ∩ V_{g_i}| = Σ_{j=1}^{N} min(V_{p,g_j}, V_{g_i,g_j}) (15);
|V_p ∪ V_{g_i}| = Σ_{j=1}^{N} max(V_{p,g_j}, V_{g_i,g_j}) (16);
The intersection uses the minimum operation to take the smaller value in each dimension of the two feature vectors as the degree to which g_i is contained by both, while the maximum operation of the union counts the total set of matching candidates in the two sets;
Step 5.4, the final Jaccard distance is expressed as:
d_J(p, g_i) = 1 - Σ_{j=1}^{N} min(V_{p,g_j}, V_{g_i,g_j}) / Σ_{j=1}^{N} max(V_{p,g_j}, V_{g_i,g_j}) (17);
The initial ranked list is modified by combining the original distance and the Jaccard distance, and the final distance is defined as:
d*(p, g_i) = (1-λ)·d_J(p, g_i) + λ·d(p, g_i) (18);
where λ is a weighting parameter balancing the two distances: when λ = 0 only the Jaccard distance is considered, when λ = 1 only the original distance is considered, and λ = 0.3 is set here;
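The whole of step 5 condenses into the sketch below, operating on a square matrix of original distances over the probe and gallery images together. It follows formulas (11)-(18); the ½k expansion neighbourhood and the 2/3 overlap threshold of formula (13) follow the standard k-reciprocal re-ranking scheme, and the parameter defaults and helper names are illustrative.

```python
import numpy as np

def k_reciprocal_rerank(dist, k=20, lam=0.3):
    """Re-rank an (n x n) matrix of original distances; O(n^3), intended
    for small galleries only."""
    n = dist.shape[0]
    order = np.argsort(dist, axis=1)

    def knn(i, kk):                 # N(i, kk), formula (11)
        return order[i, :kk]

    def reciprocal(i, kk):          # R(i, kk), formula (12)
        return np.array([j for j in knn(i, kk) if i in knn(j, kk)], dtype=int)

    V = np.zeros((n, n))
    for i in range(n):
        R = reciprocal(i, k)
        R_star = set(R.tolist())    # expanded set R*(i, k), formula (13)
        for j in R:
            Rj = reciprocal(j, k // 2)
            if len(np.intersect1d(R, Rj)) >= (2 / 3) * len(Rj):
                R_star |= set(Rj.tolist())
        idx = np.array(sorted(R_star), dtype=int)
        V[i, idx] = np.exp(-dist[i, idx])   # Gaussian-kernel encoding, formula (14)

    d_final = np.zeros_like(dist)
    for p in range(n):
        for g in range(n):
            inter = np.minimum(V[p], V[g]).sum()    # formula (15)
            union = np.maximum(V[p], V[g]).sum()    # formula (16)
            d_j = 1 - inter / union if union > 0 else 1.0   # Jaccard distance (17)
            d_final[p, g] = (1 - lam) * d_j + lam * dist[p, g]  # formula (18)
    return d_final
```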
Step 5.5, the distances between the images in the initial measurement list and the fused features are calculated using formula (18) and sorted to obtain the correct matching ranking of the images; the pedestrian matching images are output to identify the specific pedestrian, completing the identification.
Through the above approach, the multi-scale convolution feature fusion pedestrian re-identification method based on pose embedding is mainly used to retrieve and query corresponding pedestrian pictures from a large pedestrian image database; given one image, pictures of the same pedestrian in the image database can be found. By separating foreground and background, the influence of complex backgrounds is filtered out, and local pedestrian features are extracted under that influence by estimating human key points; the baseline network's input images are preprocessed by random erasing to strengthen the robustness of the network model, so that more robust global features are extracted. Finally, the similarity measurement between features is improved by deep weighted fusion of features at different scales and by the re-ranking method.

Claims (1)

1. The multi-scale convolution feature fusion pedestrian re-identification method based on pose embedding is characterized by comprising the following steps of:
step 1, preprocessing an original pedestrian image by random erasing to obtain a pedestrian image, performing baseline network optimization on a ResNet-50 network model, and inputting the pedestrian image into the optimized ResNet-50 network model to obtain deep convolution features;
step 2, extracting features from the original pedestrian image as the input image to obtain a salient human body image;
step 3, firstly performing pose extraction on the salient human body image using a pose convolution machine to obtain body part images, and then inputting the body part images into a ResNet-50 network to extract local semantic features;
step 4, performing weighted fusion on the deep convolution features and the local semantic features to obtain weighted fused features, measuring the distances between the images in the image test library and the image query library and the fused features respectively, and generating an initial measurement list from the measured distances;
step 5, reordering the images in the initial measurement list according to a reordering algorithm to obtain correct matching ranking of the images, and outputting pedestrian matching images to identify specific pedestrians;
the baseline network optimization of the ResNet-50 network model is performed as follows:
the loss function of the ResNet-50 network model is optimized by combining the Softmax loss and the Triplet loss, the optimized loss function being:
L = Σ_{i=1}^{m} L_i;
where m is the number of loss functions being combined, and the Triplet loss term is:
L_t = [D(f_a, f_p) - D(f_a, f_n) + a]+;
where f_a is the feature vector of the anchor sample, f_p is the feature vector of the positive sample, f_n is the feature vector of the negative sample, D(·,·) is the distance between two feature vectors, a is the minimum margin between the anchor-positive and anchor-negative distances, and [·]+ means that when the value in [ ] is greater than zero it is taken as the loss, and when it is less than zero the loss is zero;
Step 2 specifically comprises the following steps:
Step 2.1, removing the last pooling stage of the VGG-16 network structure to obtain the network structure, taking the original pedestrian image as the input image, inputting it into the network structure, and outputting a feature map;
Step 2.2, deconvolving the feature map to the size of the input image, and adding a new convolution layer to generate a predicted saliency map;
Step 2.3, first applying a convolution layer with kernel size 1×1 in the network structure to the conv1-2 layer to generate a boundary prediction, then adding the boundary prediction to the predicted saliency map to obtain a refined boundary, and then applying a convolution layer to the refined result to obtain the salient human body image;
Step 3 specifically comprises the following steps:
Step 3.1, using the salient human body image as the input of a pose estimator to locate 14 joint points;
Step 3.2, grouping the 14 located human joints into 6 sub-areas, cropping, rotating and resizing the 6 sub-areas to fixed sizes and orientations, and combining them to form a stitched body part image;
Step 3.3, performing pose transformation on the size of each body part in the stitched body part image to obtain the body part image;
Step 3.4, inputting the body part image into a ResNet-50 network for training, and extracting local semantic features;
encoding the k-reciprocal nearest neighbors by weighting into a single vector to form the k-reciprocal feature, calculating the Jaccard distance between the pedestrian test image p and the image set using the k-reciprocal features, and finally weighting the original distance and the Jaccard distance between the pedestrian test image p and the image set to obtain the distance formula; and calculating the distances between the images in the initial measurement list and the fused features according to the distance formula, re-ranking to obtain the correct matching ranking of the images, and outputting the pedestrian matching images to identify the specific pedestrian.
CN202110667913.9A 2021-06-16 2021-06-16 Multi-scale convolution feature fusion pedestrian re-identification method based on pose embedding Active CN113378729B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110667913.9A CN113378729B (en) 2021-06-16 2021-06-16 Multi-scale convolution feature fusion pedestrian re-identification method based on pose embedding


Publications (2)

Publication Number Publication Date
CN113378729A CN113378729A (en) 2021-09-10
CN113378729B (en) 2024-05-10

Family

ID=77572789

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110667913.9A Active CN113378729B (en) 2021-06-16 2021-06-16 Multi-scale convolution feature fusion pedestrian re-identification method based on pose embedding

Country Status (1)

Country Link
CN (1) CN113378729B (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105787439A (en) * 2016-02-04 2016-07-20 广州新节奏智能科技有限公司 Depth image human body joint positioning method based on convolution nerve network
CN109359684A (en) * 2018-10-17 2019-02-19 苏州大学 Fine granularity model recognizing method based on Weakly supervised positioning and subclass similarity measurement
CN109740541A (en) * 2019-01-04 2019-05-10 重庆大学 A kind of pedestrian weight identifying system and method
CN110163110A (en) * 2019-04-23 2019-08-23 中电科大数据研究院有限公司 A kind of pedestrian's recognition methods again merged based on transfer learning and depth characteristic
CN110717411A (en) * 2019-09-23 2020-01-21 湖北工业大学 Pedestrian re-identification method based on deep layer feature fusion
CN111401113A (en) * 2019-01-02 2020-07-10 南京大学 Pedestrian re-identification method based on human body posture estimation
CN111709311A (en) * 2020-05-27 2020-09-25 西安理工大学 Pedestrian re-identification method based on multi-scale convolution feature fusion
CN111783736A (en) * 2020-07-23 2020-10-16 上海高重信息科技有限公司 Pedestrian re-identification method, device and system based on human body semantic alignment
CN111860147A (en) * 2020-06-11 2020-10-30 北京市威富安防科技有限公司 Pedestrian re-identification model optimization processing method and device and computer equipment


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Partial pedestrian re-identification based on pose-guided alignment network; Zheng Ye; Zhao Jieyu; Wang Chong; Zhang Yi; Computer Engineering; 2020-05-15 (No. 05); full text *

Also Published As

Publication number Publication date
CN113378729A (en) 2021-09-10

Similar Documents

Publication Publication Date Title
CN108960140B (en) Pedestrian re-identification method based on multi-region feature extraction and fusion
CN107832672B (en) Pedestrian re-identification method for designing multi-loss function by utilizing attitude information
CN111709311B (en) Pedestrian re-identification method based on multi-scale convolution feature fusion
WO2019232894A1 (en) Complex scene-based human body key point detection system and method
CN109598268A (en) A kind of RGB-D well-marked target detection method based on single flow depth degree network
CN111046856B (en) Parallel pose tracking and map creating method based on dynamic and static feature extraction
CN107133569A (en) The many granularity mask methods of monitor video based on extensive Multi-label learning
Zulkifley Two streams multiple-model object tracker for thermal infrared video
CN112329559A (en) Method for detecting homestead target based on deep convolutional neural network
CN111709317B (en) Pedestrian re-identification method based on multi-scale features under saliency model
CN109271927A (en) A kind of collaboration that space base is multi-platform monitoring method
CN110543817A (en) Pedestrian re-identification method based on posture guidance feature learning
Fu et al. Complementarity-aware Local-global Feature Fusion Network for Building Extraction in Remote Sensing Images
CN111680560A (en) Pedestrian re-identification method based on space-time characteristics
Pang et al. Analysis of computer vision applied in martial arts
Zhang [Retracted] Sports Action Recognition Based on Particle Swarm Optimization Neural Networks
CN115830643B (en) Light pedestrian re-recognition method based on posture guiding alignment
CN113378729B (en) Multi-scale convolution feature fusion pedestrian re-identification method based on pose embedding
CN114973305B (en) Accurate human body analysis method for crowded people
CN114663835A (en) Pedestrian tracking method, system, equipment and storage medium
CN105809719A (en) Object tracking method based on pixel multi-coding-table matching
CN115880332A (en) Target tracking method for low-altitude aircraft visual angle
CN116912670A (en) Deep sea fish identification method based on improved YOLO model
CN114360058A (en) Cross-visual angle gait recognition method based on walking visual angle prediction
Tan A Method for Identifying Sports Behaviors in Sports Adversarial Project Training Based on Image Block Classification under the Internet of Things

Legal Events

Code Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant