CN112016661B - Pedestrian re-identification method based on erasure significance region - Google Patents


Info

Publication number
CN112016661B
CN112016661B (application CN202010842675.6A)
Authority
CN
China
Prior art keywords
pictures
picture
erasure
pedestrian
calculating
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010842675.6A
Other languages
Chinese (zh)
Other versions
CN112016661A (en
Inventor
沈栋 (Shen Dong)
蔡登 (Cai Deng)
何晓飞 (He Xiaofei)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU filed Critical Zhejiang University ZJU
Priority to CN202010842675.6A priority Critical patent/CN112016661B/en
Publication of CN112016661A publication Critical patent/CN112016661A/en
Application granted granted Critical
Publication of CN112016661B publication Critical patent/CN112016661B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a pedestrian re-identification method based on erasing salient regions, comprising the following steps: (1) randomly select P different people from the training data and K pictures for each person, giving P × K pictures as one batch of training data; (2) locate the salient region of each picture according to the similarity between pictures; (3) erase a certain proportion of the salient regions according to a preset erasure area and probability; (4) extract features from the pictures with salient regions erased, pool them with an adaptive global pooling layer, and compute an error from the features; (5) compute the feature vector of the original picture and compute an error from it; (6) back-propagate the gradients of the errors from steps (4) and (5) to train the model; (7) apply the trained model to pedestrian re-identification. With this method, the model learns better and richer feature expressions, and the re-identification effect is improved.

Description

Pedestrian re-identification method based on erasure significance region
Technical Field
The invention belongs to the field of pedestrian re-identification, and particularly relates to a pedestrian re-identification method based on an erasure significance region.
Background
Pedestrian re-identification is a very important research area: given a picture of a person, it requires returning additional pictures of that person from a database. Thanks to the wide application of deep learning, the field has developed rapidly over the past few years. However, pedestrian re-identification remains a very challenging task due to occlusion, changes in pedestrian pose, background clutter, and camera-view differences. It plays a very important role in the security field: for example, it can help reunite children lost in an amusement park with their parents, or quickly locate and track criminal suspects using city cameras. How to improve re-identification performance has therefore attracted extensive attention from academia and governments of various countries.
As a pedestrian-level query task, re-identification needs to learn as rich visual information as possible to achieve high accuracy. However, as training proceeds, the model pays more and more attention to salient, easily discriminative visual information, neglecting other visual features and reducing accuracy; the effect degrades sharply when the distinction between 2 pictures of pedestrians is not obvious. For example, given 2 pictures in which the pedestrians both wear green coats and carry brownish-green bags but have different shoes, a model that focuses only on the salient regions and ignores the other characteristics will wrongly conclude that the 2 pictures show the same person. To address this phenomenon, several lines of work have been proposed in academia and industry. Some methods use a multi-branch structure to learn visual information from different areas and combine the features learned by the different branches into a richer feature expression; some use attention mechanisms to mine richer visual information; others use an erasing mechanism that randomly erases partial regions to force the model to learn more abundant features. For example, "Batch DropBlock Network for Person Re-identification and Beyond", published at the International Conference on Computer Vision in 2019, discloses the BDB erasing method; however, BDB enhances the model by randomly erasing a partial region, not by erasing the salient region.
Disclosure of Invention
The invention provides a pedestrian re-identification method based on erasing salient regions. By erasing the salient regions, the model is forced to learn richer features, which remarkably improves the re-identification effect.
A pedestrian re-identification method based on an erasure significance area comprises the following steps:
(1) selecting a training data set and randomly sampling from it to construct multiple groups of training data; each group is constructed as follows: randomly select P different people from the training data set and randomly select K pictures for each person, giving P × K pictures as one group of training data;
(2) for each group of training data, positioning to obtain a significant area of the picture according to the similarity of the picture;
(3) erasing a certain proportion of significant areas according to preset erasing areas and probabilities;
(4) extracting features from the pictures with salient regions erased, performing the pooling operation with an adaptive global pooling layer to alleviate the over-erasure problem, and computing an error from the features;
(5) calculating a feature vector of the original image to prevent information loss, and calculating an error by using the feature vector;
(6) returning a gradient training network model by combining the errors obtained in the steps (4) and (5);
(7) after training is finished, input the pedestrian picture to be queried into the model, sort the pictures in the gallery by the similarity between their feature vectors and that of the query picture, and take the identity of the top-ranked picture as the final recognition result.
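The P × K sampling of step (1) can be sketched as follows. The function and variable names are illustrative, not from the patent, and sampling with replacement for people who have fewer than K pictures is an assumption.

```python
import random

def sample_pk_batch(dataset, P, K):
    """Randomly pick P identities, then K pictures per identity (step (1))."""
    ids = random.sample(list(dataset), P)            # P distinct people
    batch = []
    for pid in ids:
        pics = dataset[pid]
        # assumption: sample with replacement if a person has fewer than K pictures
        chosen = (random.sample(pics, K) if len(pics) >= K
                  else random.choices(pics, k=K))
        batch.extend((pid, pic) for pic in chosen)
    return batch                                     # P*K (identity, picture) pairs

# toy dataset: 10 people with 6 pictures each (illustrative file names)
dataset = {i: [f"img_{i}_{j}.jpg" for j in range(6)] for i in range(10)}
batch = sample_pk_batch(dataset, P=4, K=3)
print(len(batch))  # 12
```

Each batch therefore always contains exactly P × K pictures, which the saliency step below relies on to find a same-person partner for every picture.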
By erasing the salient regions, the method of the invention enables the model to learn richer features, obviously improving the re-identification effect; the adaptive global pooling alleviates the over-erasure problem, making the method easier to apply and deploy in real environments. Meanwhile, the proposed method of locating salient regions can effectively explain the model's query results and support visual display. Unlike other current visualization methods, it places no special requirements on the structure of the model.
The specific process of the step (2) is as follows:
(2-1) for each of the P different person pictures, finding a picture which has the smallest cosine distance with the picture and belongs to the same person among the K pictures corresponding to the picture, and calculating the cosine distance between the pictures;
(2-2) calculating the weight of the query pair by using the cosine distance, and multiplying the weight by the cosine distance to be used as the contribution degree of the query pair;
(2-3) accumulating the contribution degrees of all P pictures to serve as the overall similarity of the group of training data, and then returning the gradient;
(2-4) multiplying the gradient and the characteristic value, and then scaling to the size of the input picture to obtain a final saliency map M; the value in M represents the degree to which the position contributes to the final matching result; the formula is as follows:
M_{i,j} = ReLU( Σ_k ∂S_{q,g}/∂A^k_{i,j} · A^k_{i,j} )

where A^k_{i,j} is the feature value used to locate the salient region, ∂S_{q,g}/∂A^k_{i,j} is its gradient, i, j denote the pixel position, k denotes the k-th feature-map channel, and ReLU is the activation function; M_{i,j} represents the degree of importance of position (i, j) in the original image domain.
In the step (2-1), the cosine distance is calculated as:

S_{q,g} = (f_q · f_g) / (‖f_q‖ · ‖f_g‖)

where f_q and f_g are the feature vectors of the 2 pictures, the numerator is the dot product of the vectors, and ‖f_q‖ and ‖f_g‖ denote the L2 norms of the features.
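The cosine similarity above can be written directly in code; this is a minimal NumPy sketch with illustrative names.

```python
import numpy as np

def cosine_similarity(f_q, f_g):
    """S_{q,g} = (f_q · f_g) / (||f_q|| * ||f_g||), as in step (2-1)."""
    return float(np.dot(f_q, f_g) /
                 (np.linalg.norm(f_q) * np.linalg.norm(f_g)))

print(cosine_similarity(np.array([1.0, 0.0]), np.array([1.0, 0.0])))  # 1.0
print(cosine_similarity(np.array([1.0, 0.0]), np.array([0.0, 1.0])))  # 0.0
```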
The specific process of the step (3) is as follows:
(3-1) first preset an erasure probability p and an erasure count r;
(3-2) sample a value p0 uniformly at random from [0, 1];
(3-3) if p0 is greater than p, do not erase the salient region; output the original picture and go directly to step (4);
(3-4) if p0 is less than or equal to p, sort the values of the saliency map M obtained in step (2) from high to low;
(3-5) find the coordinates of the first r regions with the largest values in the saliency map M;
(3-6) set the pixel values at these coordinates in the original picture to 0, obtaining a picture with the salient regions erased.
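Steps (3-1) to (3-6) can be sketched as follows. The block size used to delimit a "region" and the block-sum scoring are illustrative assumptions, since the patent does not fix the region shape here.

```python
import random
import numpy as np

def erase_salient(img, M, p=0.5, r=2, block=4):
    """With probability p, zero out the r most salient block x block
    regions of img according to the saliency map M (same H x W shape).
    Region shape and scoring are illustrative choices."""
    if random.random() > p:                  # p0 > p: keep the original picture
        return img
    out = img.copy()
    H, W = M.shape
    scores = []                              # summed saliency per block
    for y in range(0, H, block):
        for x in range(0, W, block):
            scores.append((float(M[y:y + block, x:x + block].sum()), y, x))
    for _, y, x in sorted(scores, reverse=True)[:r]:
        out[y:y + block, x:x + block] = 0    # erase: set pixel values to 0
    return out
```

With p = 1.0 the erasure always fires, which is convenient for testing; in training, p trades off how often the model sees erased versus intact pictures.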
The specific process of the step (4) is as follows:
(4-1) inputting the picture processed in the step (3) into the constructed network model;
(4-2) obtaining a characteristic map of the image by using a basic module in front of the network model;
(4-3) obtain the feature vector with the adaptive pooling operation P-pooling, calculated as:

f_k = ( (1/|A_k|) · Σ_{a∈A_k} a^l )^{1/l}

where A denotes the input feature map, A_k denotes its k-th channel, and l is a parameter learned by the model under the supervision of the training error; when l → ∞, P-pooling is global maximum pooling, and when l → 1, P-pooling is global mean pooling;
and (4-4) calculating training errors by using the feature vectors and subsequent modules of the network model, wherein the training errors comprise softmax and triplet loss.
Optionally, the network model uses resnet50 as the basic network, the basic modules are res_conv4&5, i.e. the 4th and 5th layers of resnet50, and the subsequent module is Batch Normalization.
To alleviate the over-erasure problem and prevent the loss caused by erasing too many salient regions, step (4) of the present invention adopts an adaptive pooling operation that combines the advantages of global average pooling and global maximum pooling: it retains the robustness of global average pooling while extracting salient regions as efficiently as global maximum pooling, which relieves the over-erasure problem to a certain extent.
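The P-pooling operation of step (4-3) — a generalized-mean pooling — can be sketched as follows. Here l is a fixed argument for illustration, whereas in the patent it is learned during training.

```python
import numpy as np

def p_pooling(A, l=3.0):
    """Adaptive P-pooling (generalized mean) over a C x H x W feature map:
    f_k = ( (1/|A_k|) * sum_{a in A_k} a**l ) ** (1/l).
    l = 1 gives global average pooling; l -> infinity approaches global
    max pooling. Assumes non-negative activations (e.g. post-ReLU)."""
    C = A.shape[0]
    flat = A.reshape(C, -1)
    return np.mean(flat ** l, axis=1) ** (1.0 / l)

rng = np.random.default_rng(0)
A = rng.uniform(0.1, 1.0, size=(4, 8, 8))
f_avg = p_pooling(A, l=1.0)     # equals the per-channel spatial mean
f_big = p_pooling(A, l=200.0)   # close to the per-channel spatial max
```

Intermediate values of l interpolate between the two extremes, which is what lets a learned l balance robustness against salient-region emphasis.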
Meanwhile, step (5) of the invention can also be trained by using all original pictures, thereby further preventing information loss.
The specific process of the step (5) is as follows:
(5-1) inputting a complete original picture into a network;
(5-2) obtaining a feature map of the image by using a basic module in front of the network model;
(5-3) pooling the feature vectors and calculating training errors including softmax and triplet loss using subsequent modules of the network model.
In the specific retrieval, the invention combines the characteristics of the branch of the erasure significant area and the characteristics of the branch of the original image as characteristic expressions for matching.
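The retrieval-time matching described above can be sketched as follows. Concatenating the two branch features is an assumption about how "merged" is realized, and the names are illustrative.

```python
import numpy as np

def retrieve(query_branches, gallery_branches, gallery_ids):
    """Concatenate the erasure-branch and original-branch features into one
    vector per picture, then rank the gallery by cosine similarity to the
    query; returns gallery identities, best match first (step (7))."""
    q = np.concatenate(query_branches)
    G = np.stack([np.concatenate(b) for b in gallery_branches])
    q = q / np.linalg.norm(q)
    G = G / np.linalg.norm(G, axis=1, keepdims=True)
    order = np.argsort(-(G @ q))             # descending cosine similarity
    return [gallery_ids[i] for i in order]

# toy example: gallery picture "A" has the same merged feature as the query
query = (np.array([1.0, 0.0]), np.array([0.0, 1.0]))
gallery = [(np.array([1.0, 0.0]), np.array([0.0, 1.0])),
           (np.array([0.0, 1.0]), np.array([1.0, 0.0]))]
ranked = retrieve(query, gallery, ["A", "B"])
print(ranked[0])  # A
```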
Compared with the prior art, the invention has the following beneficial effects:
1. the invention provides a brand-new method for positioning the salient region, which has no special requirements on the structure of the model, and the found salient region can effectively explain the query result.
2. Based on the proposed positioning significance region, the invention provides a brand-new training method, which forces the model to learn richer feature expressions by erasing the significance region; meanwhile, the problem of over-erasure is effectively relieved by utilizing a self-adaptive pooling operation.
Drawings
FIG. 1 is a schematic flow chart of a pedestrian re-identification method based on an erasure significant area according to the present invention;
FIG. 2 is a schematic diagram of the overall model architecture of the present invention;
FIG. 3 is a schematic diagram of an erase method according to the present invention;
FIG. 4 is a schematic diagram of qualitative analysis and visual display according to the present invention.
Detailed Description
The invention will be described in further detail below with reference to the drawings and examples, which are intended to facilitate the understanding of the invention without limiting it in any way.
As shown in fig. 1, a pedestrian re-identification method based on an erasure significant area includes the following steps:
and S01, sampling the constructed data.
Locating the salient region requires a query picture and a database picture, so the training data must be constructed and divided accordingly. The invention adopts random identity sampling: first P persons are randomly selected, then K pictures are selected for each person, forming one batch of training data.
And S02, positioning the salient region.
For each picture q, the salient region is located according to the database picture g returned for q: the salient region reflects which parts of the pictures the model at the current stage relies on when matching the query picture q with the database picture g and judging the 2 pictures to be the same person. The invention therefore needs to obtain a saliency map M. The value of M represents the degree to which each position contributes to the matching result of q and g; a larger value indicates that the pixel at that position is more prominent and plays a more important role when the model judges the 2 pictures to be the same person.
The invention defines f_q and f_g as the feature expressions corresponding to the 2 pictures and measures the distance between them with the cosine similarity:

S_{q,g} = (f_q · f_g) / (‖f_q‖ · ‖f_g‖)
To obtain the saliency map M, the gradient ∂S_{q,g}/∂A^k_{i,j} of S_{q,g} with respect to the feature values A^k_{i,j} used to locate the salient region is first computed, where i, j denote the pixel location and k denotes the k-th feature-map channel. The gradient ∂S_{q,g}/∂A^k_{i,j} expresses the degree to which A^k_{i,j} contributes to the final match.
Using the obtained gradients ∂S_{q,g}/∂A^k_{i,j} as weights, a weighted sum is taken over the channel dimension of the feature maps, and the result is fed into the ReLU activation function to obtain the final saliency map M:

M_{i,j} = ReLU( Σ_k ∂S_{q,g}/∂A^k_{i,j} · A^k_{i,j} )
For each picture q, the picture g that has the smallest cosine distance to q and belongs to the same person is found; these 2 pictures are treated as a pair of query picture and database-returned picture. During training, however, the data is not explicitly divided into query and database sets: the random identity sampling described earlier selects P persons for each batch and K pictures per person to form one batch of training data. Within the batch, each picture q is paired with the same-person picture g at smallest cosine distance, and M_{i,j} is then obtained by the calculation above.
Because 2 different pictures may find the same picture as their database picture g, weights derived from the cosine similarity S_{q,g} are used so that picture pairs with higher similarity play a more important role. The weights are modeled over the whole batch of training data, and a single gradient pass then yields the saliency map for each batch.
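Given the feature map and its gradient with respect to S_{q,g}, the channel-wise combination that yields the saliency map can be sketched as follows. The gradient itself would come from the network's backward pass, and the upsampling of M to the input-picture size is omitted.

```python
import numpy as np

def saliency_map(A, grad):
    """Combine feature map A (C x H x W) with its gradient w.r.t. the
    similarity S_{q,g} (same shape): weighted sum over the channel
    dimension, then ReLU, giving M of shape H x W."""
    return np.maximum((grad * A).sum(axis=0), 0.0)

A = np.ones((2, 2, 2))                              # toy feature map
g = np.array([[[1.0, -1.0], [0.5, 0.0]],
              [[1.0, -1.0], [0.5, 0.0]]])           # toy gradients
M = saliency_map(A, g)
print(M)  # [[2. 0.] [1. 0.]]
```

Positions where the gradient-weighted activations sum to a negative value are clipped to zero by the ReLU, so M highlights only positions that support the match.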
S03, erasing method.
As shown in fig. 3, the step of erasing the saliency areas is as follows:
First, an erasure probability p and an erasure count r are preset, and a value p0 is sampled uniformly at random from [0, 1]. If p0 is greater than p, the salient region is not erased; the original picture is output and step 4 is carried out directly.
If p0 is less than or equal to p, the values of the saliency map M obtained in step 2 are sorted from high to low, the coordinates of the first r regions with the largest values in M are found, and the pixel values at those coordinates in the original picture are set to 0, yielding a picture with the salient regions erased.
S04, obtaining a training loss of erasing the picture of the salient region.
As shown in fig. 2, resnet50 is used as the basic network; a picture with the salient region erased is input into the network, and the adaptive pooling operation P-pooling is used to obtain the feature vector. P-pooling is calculated as:

f_k = ( (1/|A_k|) · Σ_{a∈A_k} a^l )^{1/l}

where A denotes the input feature map, A_k denotes its k-th channel, and l is a parameter learned by the model under the supervision of the training error. When l → ∞, P-pooling is global maximum pooling; when l → 1, it is global mean pooling.
As shown in FIG. 2, training errors are next calculated using the feature vectors and subsequent modules, including softmax and triplet loss.
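The triplet-loss part of the training error can be sketched as follows (a standard margin-based triplet loss; the margin value is an illustrative choice, and the softmax classification loss is omitted).

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.3):
    """Margin-based triplet loss, one of the two training errors of step
    (4-4): pull the anchor toward a picture of the same person (positive)
    and push it away from a picture of a different person (negative)."""
    d_ap = np.linalg.norm(anchor - positive)   # anchor-positive distance
    d_an = np.linalg.norm(anchor - negative)   # anchor-negative distance
    return max(d_ap - d_an + margin, 0.0)

a = np.array([0.0, 0.0])
p = np.array([0.1, 0.0])   # same person, nearby feature
n = np.array([5.0, 0.0])   # different person, far feature
print(triplet_loss(a, p, n))  # 0.0 (easy triplet, already separated)
```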
BESM is the proposed erasing-salient-region method (steps 2 and 3). res_conv4 denotes the preceding layers layer1, layer2, and layer3 of the resnet50 model. GAP stands for Global Average Pooling; P-pooling is the adaptive pooling operation described in step 4. Reference Vector denotes the vector used at final retrieval time. BN denotes Batch Normalization.
S05, obtaining a training loss for the complete original picture.
As shown in fig. 2, resnet50 is used as a basic network, the original complete picture is input into the network, the feature map of the image is obtained by using the former basic module, the feature vector is pooled, and the training error is calculated by using the subsequent module, wherein the training error comprises softmax and triplet loss.
And S06, updating the training model.
And (5) calculating a gradient according to the errors obtained in the steps (4) and (5), and updating the training model.
At the time of specific retrieval, the characteristics of the branches of the erasure saliency area and the characteristics of the branches of the original image are merged together to be used as characteristic expressions for matching.
To verify the effectiveness of the method of the present invention, comparative experiments of the present invention and existing methods are presented next.
In order to verify the effectiveness of the invention, experiments were carried out on 5 data sets of pedestrian re-identification and vehicle re-identification, and quantitative and qualitative analyses were carried out. In pedestrian re-identification, experiments were performed using Market-1501, MSMT17 and DukeMTMC-reiD.
The results of comparative experiments according to the invention are shown in table 1. In table 1, the first column is the base model and the middle column is the one-by-one addition of the individual details of the method. The method of the invention is obviously improved compared with a basic model (Base), and the effectiveness is proved by performing comparison tests on all details.
TABLE 1
(Table 1 is provided as an image in the original publication.)
This experiment compares against the best currently published methods; the results are shown in table 2. Base is the basic model, and the last column is the proposed method. Overall, the method of the present invention achieves higher accuracy than the other methods, reaching the current best results on most indicators of the three data sets.
TABLE 2
(Table 2 is provided as an image in the original publication.)
In addition, this embodiment shows several examples demonstrating that adding the method of the present invention helps the model learn richer feature information and thereby obtain better results, as shown in fig. 4.
In FIG. 4, "vis" is a visualization of the salient region in the query image, "query" denotes the query picture, and "rank-1" denotes the returned result. BESM denotes the proposed method of erasing salient regions. (a) and (b) show that the method helps the model find the correct result by finding more comprehensive features. In (c) and (d), the method and the base model give the same correct query results, but the proposed method finds more distinct salient regions, which further demonstrates the effectiveness of the present invention.
The embodiments described above are intended to illustrate the technical solutions and advantages of the present invention, and it should be understood that the above-mentioned embodiments are only specific embodiments of the present invention, and are not intended to limit the present invention, and any modifications, additions and equivalents made within the scope of the principles of the present invention should be included in the scope of the present invention.

Claims (5)

1. A pedestrian re-identification method based on an erasure significant area is characterized by comprising the following steps:
(1) selecting a training data set, randomly sampling from the training data set, and constructing a plurality of groups of training data; the construction mode of each group of training data is as follows: randomly selecting P different human pictures from the training data set, and randomly selecting K pictures for each human to obtain P multiplied by K pictures as a group of training data;
(2) for each group of training data, positioning to obtain a significant area of the picture according to the similarity of the picture;
(3) erasing a certain proportion of significant areas according to preset erasing areas and probabilities; the specific process is as follows:
(3-1) first preset an erasure probability p and an erasure count r;
(3-2) sample a value p0 uniformly at random from [0, 1];
(3-3) if p0 is greater than p, do not erase the salient region; output the original picture and go directly to step (4);
(3-4) if p0 is less than or equal to p, sort the values of the saliency map M obtained in step (2) from high to low;
(3-5) find the coordinates of the first r regions with the largest values in the saliency map M;
(3-6) set the pixel values at these coordinates in the original picture to 0, obtaining a picture with the salient regions erased;
(4) extracting the features of the pictures with the significant areas erased, performing pooling operation by using a self-adaptive global pooling layer, relieving the problem of excessive erasure, and calculating errors by using the features; the specific process is as follows:
(4-1) inputting the picture processed in the step (3) into the constructed network model;
(4-2) obtaining a feature map of the image by using a basic module in front of the network model;
(4-3) obtain the feature vector with the adaptive pooling operation P-pooling, calculated as:

f_k = ( (1/|A_k|) · Σ_{a∈A_k} a^l )^{1/l}

where A denotes the input feature map, A_k denotes its k-th channel, and l is a parameter learned by the model under the supervision of the training error; when l → ∞, P-pooling is global maximum pooling, and when l → 1, P-pooling is global mean pooling;
(4-4) calculating training errors including softmax and triplet loss using the feature vectors and subsequent modules of the network model;
(5) calculating a feature vector of the original image to prevent information loss, and calculating an error by using the feature vector;
(6) returning a gradient training network model by combining the errors obtained in the steps (4) and (5);
(7) after the model training is finished, inputting the pedestrian pictures to be inquired into the model, carrying out similarity sorting on the feature vectors of the pedestrian pictures to be inquired and the feature vectors of each picture in the picture library, and selecting the identity of the picture with the top sorting as a final recognition result.
2. The pedestrian re-identification method based on the erasure significant area according to claim 1, wherein the specific process of the step (2) is as follows:
(2-1) for each of the P different person pictures, finding a picture which has the smallest cosine distance with the picture and belongs to the same person among the K pictures corresponding to the picture, and calculating the cosine distance between the pictures;
(2-2) calculating the weight of the query pair by using the cosine distance, and multiplying the weight by the cosine distance to be used as the contribution degree of the query pair;
(2-3) accumulating the contribution degrees of all P pictures to serve as the overall similarity of the group of training data, and then returning the gradient;
(2-4) multiply the gradient with the feature value, then scale to the size of the input picture to obtain the final saliency map M; the value in M represents the degree to which the position contributes to the final matching result; the formula is:

M_{i,j} = ReLU( Σ_k ∂S_{q,g}/∂A^k_{i,j} · A^k_{i,j} )

where A^k_{i,j} is the feature value used to locate the salient region, ∂S_{q,g}/∂A^k_{i,j} is its gradient, i, j denote the pixel position, k denotes the k-th feature-map channel, and ReLU is the activation function; M_{i,j} indicates the importance of position (i, j) in the original image domain.
3. The pedestrian re-identification method based on the erasure significance area according to claim 2, wherein in the step (2-1), the cosine distance is calculated by the formula:
S_{q,g} = (f_q · f_g) / (‖f_q‖ · ‖f_g‖)

where f_q and f_g are the feature expressions corresponding to the 2 pictures, the numerator is the dot product of the vectors, and ‖f_q‖ and ‖f_g‖ denote the L2 norms of the features.
4. The pedestrian re-identification method based on erasing salient regions according to claim 1, wherein the network model adopts resnet50 as the basic network, the basic module is res_conv4&5, i.e. the 4th and 5th layers of resnet50, and the subsequent module is Batch Normalization.
5. The pedestrian re-identification method based on the erasure significant area according to claim 1, wherein the specific process of the step (5) is as follows:
(5-1) inputting a complete original picture into a network;
(5-2) obtaining a feature map of the image by using a basic module in front of the network model;
(5-3) pooling the feature vectors and calculating training errors including softmax and triplet loss using subsequent modules of the network model.
CN202010842675.6A 2020-08-20 2020-08-20 Pedestrian re-identification method based on erasure significance region Active CN112016661B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010842675.6A CN112016661B (en) 2020-08-20 2020-08-20 Pedestrian re-identification method based on erasure significance region

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010842675.6A CN112016661B (en) 2020-08-20 2020-08-20 Pedestrian re-identification method based on erasure significance region

Publications (2)

Publication Number Publication Date
CN112016661A CN112016661A (en) 2020-12-01
CN112016661B true CN112016661B (en) 2022-05-06

Family

ID=73505258

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010842675.6A Active CN112016661B (en) 2020-08-20 2020-08-20 Pedestrian re-identification method based on erasure significance region

Country Status (1)

Country Link
CN (1) CN112016661B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI790658B (en) * 2021-06-24 2023-01-21 曜驊智能股份有限公司 image re-identification method

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2367966B (en) * 2000-10-09 2003-01-15 Motorola Inc Method and apparatus for determining regions of interest in images and for image transmission
CN104463890B (en) * 2014-12-19 2017-05-24 北京工业大学 Stereoscopic image significance region detection method
CN107609509A (en) * 2017-09-09 2018-01-19 北京工业大学 A kind of action identification method based on motion salient region detection
CN109034035A (en) * 2018-07-18 2018-12-18 电子科技大学 Pedestrian's recognition methods again based on conspicuousness detection and Fusion Features
CN109583379A (en) * 2018-11-30 2019-04-05 常州大学 A kind of pedestrian's recognition methods again being aligned network based on selective erasing pedestrian
CN110443174B (en) * 2019-07-26 2021-08-10 浙江大学 Pedestrian re-identification method based on decoupling self-adaptive discriminant feature learning

Also Published As

Publication number Publication date
CN112016661A (en) 2020-12-01

Similar Documents

Publication Publication Date Title
CN110321813B (en) Cross-domain pedestrian re-identification method based on pedestrian segmentation
CN111709311B (en) Pedestrian re-identification method based on multi-scale convolution feature fusion
CN112818931A (en) Multi-scale pedestrian re-identification method based on multi-granularity depth feature fusion
CN110598543B (en) Model training method based on attribute mining and reasoning and pedestrian re-identification method
CN112750140A (en) Disguised target image segmentation method based on information mining
CN109784288B (en) Pedestrian re-identification method based on discrimination perception fusion
CN112949740B (en) Small sample image classification method based on multilevel measurement
CN106295564A (en) The action identification method that a kind of neighborhood Gaussian structures and video features merge
CN109087337B (en) Long-time target tracking method and system based on hierarchical convolution characteristics
CN114332544B (en) Image block scoring-based fine-grained image classification method and device
CN111476310B (en) Image classification method, device and equipment
CN112084895B (en) Pedestrian re-identification method based on deep learning
CN112801051A (en) Method for re-identifying blocked pedestrians based on multitask learning
CN114299542A (en) Video pedestrian re-identification method based on multi-scale feature fusion
CN116704611A (en) Cross-visual-angle gait recognition method based on motion feature mixing and fine-granularity multi-stage feature extraction
CN112016661B (en) Pedestrian re-identification method based on erasure significance region
CN114579794A (en) Multi-scale fusion landmark image retrieval method and system based on feature consistency suggestion
CN117422963A (en) Cross-modal place recognition method based on high-dimension feature mapping and feature aggregation
CN115661754A (en) Pedestrian re-identification method based on dimension fusion attention
CN115909488A (en) Method for re-identifying shielded pedestrian through attitude guidance and dynamic feature extraction
CN115050044A (en) Cross-modal pedestrian re-identification method based on MLP-Mixer
CN114022906A (en) Pedestrian re-identification method based on multi-level features and attention mechanism
CN113051962A (en) Pedestrian re-identification method based on twin Margin-Softmax network combined attention machine
CN116486101B (en) Image feature matching method based on window attention
CN116824306B (en) Training method of pen stone fossil image recognition model based on multi-mode metadata

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant