CN115909488A - Method for re-identifying occluded pedestrians through pose guidance and dynamic feature extraction - Google Patents

Method for re-identifying occluded pedestrians through pose guidance and dynamic feature extraction

Info

Publication number
CN115909488A
Authority
CN
China
Prior art keywords
image
features
human body
pedestrian
pose
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211406567.XA
Other languages
Chinese (zh)
Inventor
林菲
陈绮萌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Dianzi University
Original Assignee
Hangzhou Dianzi University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Dianzi University
Priority to CN202211406567.XA
Publication of CN115909488A
Legal status: Pending

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T: CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00: Road transport of goods or passengers
    • Y02T 10/10: Internal combustion engine [ICE] based vehicles
    • Y02T 10/40: Engine management systems

Landscapes

  • Image Analysis (AREA)

Abstract

The invention discloses a method for re-identifying occluded pedestrians through pose guidance and dynamic feature extraction. The main difficulty of occluded pedestrian re-identification is that, because of occlusion, pedestrian bodies are often incomplete and the pedestrian features extracted from images contain background noise. The invention uses a human pose estimation network to extract human keypoints and automatically selects the unoccluded body parts from the image patches output by the encoder through matching, which optimizes the alignment of the unoccluded parts and reduces the interference of occluders in the image. In addition, the method exploits the complementary characteristics of the attention mechanism and convolution operations, generates a dynamic unoccluded mask for the image with the help of the pose information, achieves dynamic alignment of the unoccluded parts, and maintains intra-class compactness and inter-class separation through an angular margin loss. The invention combines metric loss and classification loss for end-to-end network training, which improves training efficiency and training stability.

Description

Method for re-identifying occluded pedestrians through pose guidance and dynamic feature extraction
Technical Field
The invention belongs to the technical field of computer vision and deep learning, and particularly relates to a pedestrian re-identification method for occluded scenes based on pose guidance and dynamic feature extraction.
Background
Pedestrian re-identification is a subtask of image retrieval in computer vision: given a query image of a pedestrian, the goal is to find images of the same pedestrian captured at other times and places. With the spread of surveillance cameras, pedestrian re-identification is important for intelligent security and "city brain" applications. Existing pedestrian re-identification mainly comprises two steps: 1) extracting image features that characterize the pedestrian; 2) measuring the distance between image features, commonly with the Euclidean or cosine distance. With the development of deep learning, researchers have proposed deep-learning-based pedestrian re-identification methods that can be trained end-to-end to learn image features that are robust and discriminative with respect to pedestrian appearance. However, existing methods perform poorly when pedestrians are occluded.
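As a minimal illustration of the second step, the sketch below compares two feature vectors under both common metrics; the 768-dimensional features and variable names are assumptions for exposition, not part of the invention:

```python
import torch
import torch.nn.functional as F

# Illustrative 768-dimensional pedestrian features (sizes and names are assumptions).
f_query = torch.randn(768)
f_gallery = torch.randn(768)

euclidean = torch.dist(f_query, f_gallery, p=2)                  # L2 distance
cosine_dist = 1.0 - F.cosine_similarity(f_query, f_gallery, dim=0)

print(f"Euclidean: {euclidean:.4f}, cosine distance: {cosine_dist:.4f}")
```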
The main difficulty of occluded pedestrian re-identification is that, because of occlusion, pedestrian bodies are often incomplete and the pedestrian features extracted from images contain background noise. Comparing such features with other pedestrian features is difficult, which leads to unsatisfactory results.
Disclosure of Invention
The invention aims to provide a method for re-identifying occluded pedestrians through pose guidance and dynamic feature extraction.
In a first aspect, the invention provides a pedestrian re-identification method for occluded scenes via pose guidance and dynamic feature extraction, comprising the following steps:
step 1, preprocessing the identified image to obtain a two-dimensional matrix.
Step 2, inputting the two-dimensional matrix obtained in the step 1 into an encoder; global and local features of the identified image are extracted.
And 3, extracting the key point information of the human body posture through a human body posture estimation network to obtain the position and confidence coefficient containing the key point of the human body and a heat map generated by the position.
And 4, obtaining the local features of the posture guidance based on the local features and the human posture key point information.
4-1. The last two dimensions of the heatmap obtained in step 3 are combined and passed through a fully connected layer.
And 4-2, multiplying the heat map and the local features by elements to obtain a group of posture guiding features.
4-3, calculating the cosine distance between each posture guide feature and each group of local features, and matching the group of local features with the closest distance for each heat map.
And 4-4, adding the corresponding local feature to each attitude guide feature to obtain the attitude guide local feature.
And 5, multiplying the heat map obtained in the step 3 by the global features extracted in the step 2 after global average pooling is carried out on the heat map, so as to obtain the image features embedded with the attitude information.
And 6, extracting an image dynamic feature mask of the identified image through a dynamic feature generator.
And 7, removing the part of the attitude-guided local features obtained in the step 4, wherein the confidence coefficient of the part is lower than the threshold value, splicing the part of the attitude-guided local features with the image features embedded with the attitude information obtained in the step 5 and the image dynamic feature mask obtained in the step 6 to obtain final image features, respectively calculating the distances between the features and all the image features in the image library, sequencing the distances according to the ascending order, and taking the identity information of the image features corresponding to the minimum distance as the identity of the pedestrian in the identified image.
Preferably, the process of step 1 is specifically as follows:
1-1. Preprocess the identified image and resize it into a standard identified image.
1-2. Perform dimension conversion on the standard identified image through a convolutional layer, then merge the first two dimensions of the resulting features to obtain a two-dimensional matrix.
1-3. Add an image-patch position token and a camera position token of the same length to the two-dimensional matrix obtained in step 1-2, then concatenate a learnable classification token.
Preferably, in step 2 the encoder adopts a Vision Transformer network comprising 12 self-attention layers. Before the features are input into the last self-attention layer, the features other than the classification token are divided into several groups; each group is input into a weight-shared self-attention layer and passed through a batch normalization layer, yielding several groups of local features, while the classification token serves as the global feature.
Preferably, in step 3 the human pose estimation network performs convolution operations on the identified image and encodes the position information of the human keypoints into several groups of heatmaps representing the human pose keypoints.
Preferably, the human pose estimation network adopts an OpenPose network.
Preferably, the specific process of step 6 is as follows:
6-1. Concatenate the output features of the 1st, 3rd, 5th, 7th, 9th and 11th layers of the encoder, excluding the classification token.
6-2. Input the feature map obtained in step 6-1 into a dynamic feature generator consisting of 6 groups of convolutional layers, each of which halves the number of channels; then perform max pooling, and finally obtain the dynamic human feature mask through a fully connected layer.
Preferably, the expression of the total loss L of the recognition method during model training is as follows:

$$L = \lambda_1 L_{cls} + \lambda_2 L_{tri} + L_{A\text{-}cls}$$

where $L_{cls}$ is the cross-entropy loss:

$$L_{cls} = -\frac{1}{B}\sum_{i=1}^{B}\log p_{y}$$

B is the number of images in one training batch; N is the total number of pedestrians during training; $p_j$ is the probability that the network predicts the identified image as class j, and y is the actual class of the identified image, so $p_y$ is the probability assigned to the true class.

$L_{tri}$ is the triplet loss:

$$L_{tri} = \max\big(D(f_a, f_p) - D(f_a, f_n) + m,\ 0\big)$$

$f_a$, $f_p$, $f_n$ are an anchor image, a positive sample image of the same class as the anchor, and a negative sample image of a different class from the anchor; $D(\cdot,\cdot)$ is the Euclidean distance function; m is the margin coefficient.

$L_{A\text{-}cls}$ is the angular margin loss:

$$L_{A\text{-}cls} = -\frac{1}{B}\sum_{i=1}^{B}\log\frac{e^{s\cos(\theta_{y_i}+m)}}{e^{s\cos(\theta_{y_i}+m)}+\sum_{j=1,\,j\neq y_i}^{N}e^{s\cos\theta_j}}$$

$$\cos\theta_j = \cos\big(F_{Pose},\, D_j\big)$$

N is the total number of pedestrian IDs during training; s is a hyperparameter adjusting the loss scale; m is the margin coefficient; $F_{Pose}$ is the pose-embedded global image feature; $D_{y_i}$ is the dynamic mask of the image with pedestrian label $y_i$; $\cos(\cdot,\cdot)$ computes the cosine between two vectors.
In a second aspect, the invention provides a computer apparatus comprising a memory and at least one processor; the memory stores computer-executable instructions; the at least one processor executes computer-executable instructions stored by the memory, causing the at least one processor to perform the aforementioned identification method.
In a third aspect, the invention provides a readable storage medium storing computer instructions which, when executed by a processor, implement the identification method described above.
The invention has the following beneficial effects:
1. The invention uses a human pose estimation network to extract human keypoints and automatically selects the unoccluded body parts from the image patches output by the encoder through matching, which optimizes the alignment of the unoccluded parts and reduces the interference of occluders in the image.
2. The invention exploits the complementary characteristics of the attention mechanism and convolution operations, generates a dynamic unoccluded mask for the image with the help of the pose information, achieves dynamic alignment of the unoccluded parts, and maintains intra-class compactness and inter-class separation through an angular margin loss. The invention combines metric loss and classification loss for end-to-end network training, which improves training efficiency and training stability.
Drawings
FIG. 1 is a schematic flow chart of the present invention.
FIG. 2 is a schematic diagram of the model structure of the present invention.
Detailed Description
The invention is further described below with reference to the accompanying drawings.
As shown in fig. 1 and fig. 2, the pedestrian re-identification method for occluded scenes via pose guidance and dynamic feature extraction performs identification with a pedestrian re-identification model comprising an encoder, a human pose estimation network, a local feature aggregation module, a global feature aggregation module and a dynamic feature generator.
The pedestrian re-identification method for occluded scenes comprises the following steps:
step 1, image preprocessing.
1-1, preprocessing the identified image, and adjusting the image to be identified to be a standard identified image of 256 multiplied by 128 multiplied by 3.
1-2, converting the dimension of the standard identified image into 14 x 768 through a convolution layer with the convolution kernel size of 16 x 16, the step pitch of 16 and the channel number of 768, and then combining the first two dimensions to form a 196 x 768 two-dimensional matrix.
1-3, adding the image block position mark (token) and the camera position mark (token) with the same length to the two-dimensional matrix obtained in the step 1-2, then splicing the two-dimensional matrix with the last learnable classification mark (token), and finally inputting the two-dimensional matrix into the encoder, wherein the dimension of the two-dimensional matrix is 197 multiplied by 768.
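A minimal PyTorch sketch of this preprocessing is given below; the module name, the number of cameras, and the use of additive learnable embeddings are assumptions for illustration, and the patch count follows from the input size and stride:

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Patchify with a strided convolution, add position and camera tokens,
    and prepend a learnable classification token (illustrative sketch)."""
    def __init__(self, img_size=(256, 128), patch=16, dim=768, num_cameras=8):
        super().__init__()
        num_patches = (img_size[0] // patch) * (img_size[1] // patch)
        self.proj = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)   # 16x16 kernel, stride 16
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))            # classification token
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, dim))  # patch position tokens
        self.cam_embed = nn.Parameter(torch.zeros(num_cameras, 1, dim))  # camera position tokens

    def forward(self, img, cam_id):
        x = self.proj(img).flatten(2).transpose(1, 2)    # (B, num_patches, 768)
        x = x + self.pos_embed + self.cam_embed[cam_id]  # add both token types
        cls = self.cls_token.expand(x.size(0), -1, -1)
        return torch.cat([cls, x], dim=1)                # (B, num_patches + 1, 768)

tokens = PatchEmbedding()(torch.randn(2, 3, 256, 128), torch.tensor([0, 1]))
print(tokens.shape)  # torch.Size([2, 129, 768]) for a 256x128 input with stride 16
```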
Step 2: extract the global features and local features through the encoder.
The encoder consists of 12 self-attention layers. Before the features other than the classification token are input into the last self-attention layer, they are divided into K groups, where K takes a value from 5 to 20; each group is input into a weight-shared self-attention layer, yielding K groups of local features, while the classification token serves as the global feature. Together the local features form a K×768 matrix.
Optionally, the step of extracting the image global feature and the grouped local features by the encoder specifically includes:
(1) The pedestrian image is cut into N non-overlapping patches by a fixed-size sliding window, and an input sequence is obtained through a fully connected layer.
(2) A position-embedding token and a camera-position-embedding token are added to the input sequence, and a learnable classification token is concatenated, jointly forming the encoder input matrix, which comprises the classification token and the two-dimensional matrix with position-embedding information added.
(3) After the encoder extracts the features, they divide into the image global feature Fg, taken at the classification token, and the local features Fp, taken at the remaining positions.
(4) At the last layer of the encoder, Fp is divided into K groups, and each group passes through a weight-shared self-attention layer to give K grouped local features.
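The grouped, weight-shared self-attention of the last encoder layer can be sketched as follows; pooling each group to a single 768-dimensional vector and the default group count are assumptions made for illustration:

```python
import torch
import torch.nn as nn

class GroupedLocalAttention(nn.Module):
    """Split patch tokens into K groups and run one shared attention layer per group."""
    def __init__(self, dim=768, num_groups=18, num_heads=12):
        super().__init__()
        self.num_groups = num_groups  # K = M = 18 for OpenPose-style keypoints (assumption)
        # A single attention layer whose weights are shared across all groups.
        self.shared_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.BatchNorm1d(dim)

    def forward(self, tokens):
        cls, patches = tokens[:, :1], tokens[:, 1:]       # classification token = global feature
        groups = patches.chunk(self.num_groups, dim=1)
        local_feats = []
        for g in groups:
            out, _ = self.shared_attn(g, g, g)            # same weights for every group
            local_feats.append(self.norm(out.mean(dim=1)))  # pool each group to one 768-d vector
        return cls.squeeze(1), torch.stack(local_feats, dim=1)  # (B, 768) and (B, K, 768)
```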
Step 3: extract the human pose keypoint information through the human pose estimation network.
An existing human pose estimation network performs convolution operations on the standard identified image and encodes the position information of the human keypoints into M groups of heatmaps representing the M human pose keypoints. The pose estimation network adopts OpenPose. The heatmaps have size M×64×32; to integrate the local features with the heatmaps, K = M.
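The heatmap encoding can be pictured as a Gaussian placed at each detected keypoint. The following sketch assumes a fixed σ and uses the keypoint confidence as peak height, neither of which is specified by the patent:

```python
import torch

def keypoints_to_heatmaps(keypoints, conf, height=64, width=32, sigma=2.0):
    """Encode M keypoints (x, y) into M Gaussian heatmaps, scaled by confidence."""
    ys = torch.arange(height).view(1, height, 1).float()
    xs = torch.arange(width).view(1, 1, width).float()
    kx = keypoints[:, 0].view(-1, 1, 1)  # (M, 1, 1)
    ky = keypoints[:, 1].view(-1, 1, 1)
    d2 = (xs - kx) ** 2 + (ys - ky) ** 2
    return conf.view(-1, 1, 1) * torch.exp(-d2 / (2 * sigma ** 2))  # (M, 64, 32)

# Example: 18 OpenPose-style keypoints in heatmap coordinates.
kpts = torch.rand(18, 2) * torch.tensor([32.0, 64.0])
heatmaps = keypoints_to_heatmaps(kpts, conf=torch.rand(18))
print(heatmaps.shape)  # torch.Size([18, 64, 32])
```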
Step 4: the local feature aggregation module obtains a group of pose-guided local features based on the local features and the human pose keypoint information.
4-1. The last two dimensions of the heatmaps are merged, giving size M×2048, and one fully connected layer then maps them to M×768.
4-2. The heatmaps and the local features are multiplied element-wise, giving a group of pose-guided features of size M×768.
4-3. The cosine distance between each pose-guided feature and each group of local features is computed, and each heatmap is matched with the closest group of local features. The matched group represents the features in the whole image most likely to correspond to that heatmap's keypoint, so the body features at each keypoint can be extracted.
4-4. Each pose-guided feature is added to its matched local feature, giving a group of aggregated pose-guided local features.
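A minimal sketch of steps 4-1 to 4-4, assuming K = M so that the element-wise product in step 4-2 is defined (class and function names are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PoseGuidedAggregation(nn.Module):
    """Fuse pose heatmaps with grouped local features (steps 4-1 to 4-4, sketch)."""
    def __init__(self, dim=768, heatmap_size=64 * 32):
        super().__init__()
        self.fc = nn.Linear(heatmap_size, dim)  # 4-1: M x 2048 -> M x 768

    def forward(self, heatmaps, local_feats):
        # heatmaps: (B, M, 64, 32); local_feats: (B, K, 768) with K == M
        h = self.fc(heatmaps.flatten(2))                  # (B, M, 768)
        pose_feats = h * local_feats                      # 4-2: element-wise product
        # 4-3: cosine similarity between every pose feature and every local group
        sim = F.cosine_similarity(pose_feats.unsqueeze(2),
                                  local_feats.unsqueeze(1), dim=-1)  # (B, M, K)
        match = sim.argmax(dim=2)                         # closest group per heatmap
        matched = torch.gather(local_feats, 1,
                               match.unsqueeze(-1).expand(-1, -1, h.size(-1)))
        return pose_feats + matched                       # 4-4: aggregation
```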
Step 5: the global feature aggregation module obtains the pose-embedded image features based on the human pose keypoint information and the global features.
The M groups of heatmaps obtained in step 3 are globally average-pooled and then multiplied with the global features extracted by the encoder in step 2, giving the image features embedded with pose information.
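Under one plain reading of step 5, pooling each heatmap to a scalar and scaling the global feature, the operation reduces to the sketch below; this interpretation is an assumption, since the patent does not spell out the pooled shapes:

```python
import torch

def embed_pose_into_global(heatmaps, global_feat):
    """Scale the global feature by globally average-pooled heatmaps (one possible reading)."""
    # heatmaps: (B, M, 64, 32) -> (B, M) -> (B, 1) keypoint-visibility score
    pooled = heatmaps.mean(dim=(2, 3)).mean(dim=1, keepdim=True)
    return pooled * global_feat  # (B, 768) pose-embedded global image feature
```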
Step 6: extract the dynamic feature mask of the image through the dynamic feature generator.
6-1. The output features of the 1st, 3rd, 5th, 7th, 9th and 11th encoder layers, excluding the classification token, are concatenated into a feature map of size 6×196×768, which is then reshaped to 768×6×w×h.
6-2. The feature map is input into a dynamic feature generator consisting of 6 groups of convolutional layers, each of which halves the number of channels; after max pooling, a fully connected layer finally produces the dynamic human feature mask of size 1×768.
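A sketch of the dynamic feature generator under the assumption of 3×3 convolutions, a flattened 4-D input, and a sigmoid-valued mask; the patent fixes only the six channel-halving groups, the max pooling and the fully connected layer:

```python
import torch
import torch.nn as nn

class DynamicFeatureGenerator(nn.Module):
    """Six conv groups, each halving the channel count, then max-pool and FC (sketch)."""
    def __init__(self, in_ch=768, dim=768):
        super().__init__()
        layers, ch = [], in_ch
        for _ in range(6):  # 768 -> 384 -> 192 -> 96 -> 48 -> 24 -> 12 channels
            layers += [nn.Conv2d(ch, ch // 2, kernel_size=3, padding=1), nn.ReLU()]
            ch //= 2
        self.convs = nn.Sequential(*layers)
        self.pool = nn.AdaptiveMaxPool2d(1)
        self.fc = nn.Linear(ch, dim)              # -> 1 x 768 dynamic mask

    def forward(self, x):                         # x: (B, 768, H, W) stacked encoder features
        h = self.pool(self.convs(x)).flatten(1)   # (B, 12)
        return torch.sigmoid(self.fc(h))          # mask values in (0, 1), an assumption
```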
Step 7: remove those pose-guided local features obtained in step 4 whose confidence is below the threshold, and concatenate the remainder with the pose-embedded image features obtained in step 5 and the dynamic feature mask obtained in step 6 to obtain the final image features; compute the distances between these features and all image features in the gallery and sort them in ascending order; the ranking is output as the result of the pedestrian re-identification model.
The position and confidence of each keypoint are given by the pose estimation network when the human keypoint information is generated. The confidence ranges from 0 to 1, and a human keypoint is considered visible when its confidence exceeds a threshold γ; only pose-guided local features with confidence greater than γ are used to generate the final feature. Here γ = 0.5.
The distance between pedestrian image features can be measured with the Euclidean distance, the cosine distance, or any other reasonable metric. After the model is trained, the features of every image in the gallery are computed offline and stored, so that at query time only the query image needs to be passed through the network to extract its features, after which its distances to all gallery images can be computed quickly.
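A sketch of the inference-time feature assembly and ranking; zeroing invisible parts (rather than dropping them) and the use of the Euclidean distance are assumptions made so that all features keep a fixed length:

```python
import torch

def rank_gallery(pose_locals, kpt_conf, global_pose_feat, dyn_mask, gallery, gamma=0.5):
    """Assemble the final image feature and rank the gallery by ascending distance (sketch)."""
    # The patent drops low-confidence parts; here they are zeroed so that every
    # feature keeps a fixed length and can be compared with torch.cdist.
    visible = (kpt_conf > gamma).float().unsqueeze(-1)       # (M, 1) visibility at threshold
    parts = (pose_locals * visible).flatten()                # (M * 768,)
    query = torch.cat([parts, global_pose_feat, dyn_mask])   # final image feature
    dists = torch.cdist(query.unsqueeze(0), gallery).squeeze(0)  # Euclidean distances
    order = dists.argsort()                                  # ascending: best match first
    return order, dists[order]
```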
The training process of the pedestrian re-identification model is as follows:
(1) A pedestrian re-identification dataset is acquired.
The dataset is divided into a training set, a query set and a gallery set; during training, each batch contains k images of each of P randomly selected pedestrians as the identified images.
(2) The global features and local features of the images are extracted by the encoder.
(3) The human pose keypoint information is extracted through the pose estimation network.
(4) The pose-guided human keypoint features are obtained from the local features and the human pose keypoint information through the feature aggregator.
(5) The pose-embedded image features are obtained from the global features and the human pose keypoint information through the feature embedding layer.
(6) The dynamic feature mask of the image is extracted through the dynamic feature generator.
(7) The classification loss and the triplet loss are computed after global average pooling of the pose-guided human keypoint features.
(8) The cross-entropy loss and the triplet loss of the pose-embedded image features are computed.
The cross-entropy loss is:

$$L_{cls} = -\frac{1}{B}\sum_{i=1}^{B}\log p_{y}$$

where B is the number of images in one training batch; N is the total number of pedestrian IDs during training; $p_j$ is the probability that the network predicts the identified image as the j-th class, and y is the actual class of the identified image, so $p_y$ is the probability assigned to the true class.
The triplet loss is:

$$L_{tri} = \max\big(D(f_a, f_p) - D(f_a, f_n) + m,\ 0\big)$$

where $f_a$, $f_p$, $f_n$ are an anchor image, a positive sample image of the same class as the anchor, and a negative sample image of a different class from the anchor; $D(\cdot,\cdot)$ is the Euclidean distance function; and m is the margin coefficient.
(9) The angular margin loss is computed from the pose-embedded image features and the dynamic feature mask.
The angular margin loss is:

$$L_{A\text{-}cls} = -\frac{1}{B}\sum_{i=1}^{B}\log\frac{e^{s\cos(\theta_{y_i}+m)}}{e^{s\cos(\theta_{y_i}+m)}+\sum_{j=1,\,j\neq y_i}^{N}e^{s\cos\theta_j}}$$

$$\cos\theta_j = \cos\big(F_{Pose},\, D_j\big)$$

where B is the number of images in one training batch; N is the total number of pedestrian IDs during training; s is a hyperparameter adjusting the loss scale; m is the margin coefficient; $F_{Pose}$ is the pose-embedded global image feature; $D_{y_i}$ is the dynamic mask of the image with pedestrian label $y_i$; $\cos(\cdot,\cdot)$ computes the cosine between two vectors.
(10) The total loss L in the training phase is the weighted sum of the cross-entropy loss $L_{cls}$, the triplet loss $L_{tri}$ and the angular margin loss $L_{A\text{-}cls}$:

$$L = \lambda_1 L_{cls} + \lambda_2 L_{tri} + L_{A\text{-}cls}$$

where $\lambda_1$ and $\lambda_2$ are hyperparameters adjusting the loss scale; a minimal sketch of this combined loss is given after the training steps.
(11) The pedestrian re-identification model is trained with the goal of minimizing the total loss L.
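A minimal PyTorch sketch of the combined training loss, assuming the ArcFace-style angular margin form reconstructed above; the naive triplet selection and the λ values are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def angular_margin_loss(pose_feats, dyn_masks, labels, s=30.0, m=0.5):
    """ArcFace-style loss between pose-embedded features and per-ID dynamic masks."""
    # pose_feats: (B, 768); dyn_masks: (N, 768), one mask per pedestrian ID.
    cos = F.normalize(pose_feats) @ F.normalize(dyn_masks).t()  # (B, N) cosines
    theta = torch.acos(cos.clamp(-1 + 1e-7, 1 - 1e-7))
    logits = s * torch.cos(theta)
    target = theta.gather(1, labels.view(-1, 1))                # theta at the true ID
    logits = logits.scatter(1, labels.view(-1, 1), s * torch.cos(target + m))
    return F.cross_entropy(logits, labels)

def total_loss(logits, feats, pose_feats, dyn_masks, labels,
               lambda1=1.0, lambda2=1.0, margin=0.3):
    """L = lambda1 * L_cls + lambda2 * L_tri + L_A-cls (weights are illustrative)."""
    l_cls = F.cross_entropy(logits, labels)
    n = feats.size(0) // 3  # naive (anchor, positive, negative) split, illustration only
    l_tri = nn.TripletMarginLoss(margin=margin)(feats[:n], feats[n:2*n], feats[2*n:3*n])
    l_a = angular_margin_loss(pose_feats, dyn_masks, labels)
    return lambda1 * l_cls + lambda2 * l_tri + l_a
```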
An embodiment of the invention provides a storage device storing a plurality of programs suitable for being loaded by a processor to implement the pedestrian re-identification method for occluded scenes via pose guidance and dynamic feature extraction described above.
An embodiment of the invention provides a processing device comprising a processor adapted to execute programs and a storage device adapted to store a plurality of programs, the programs being suitable for being loaded and executed by the processor to implement the pedestrian re-identification method for occluded scenes via pose guidance and dynamic feature extraction described above.

Claims (9)

1. A method for re-identifying occluded pedestrians through pose guidance and dynamic feature extraction, characterized by comprising the following steps:
step 1, preprocessing an identified image to obtain a two-dimensional matrix;
step 2, inputting the two-dimensional matrix obtained in step 1 into an encoder, and extracting global features and local features of the identified image;
step 3, extracting human pose keypoint information through a human pose estimation network, obtaining the position and confidence of each human keypoint and the heatmaps generated from the positions;
step 4, obtaining pose-guided local features based on the local features and the human pose keypoint information;
4-1, merging the last two dimensions of the heatmaps obtained in step 3 and passing the result through a fully connected layer;
4-2, multiplying the heatmaps and the local features element-wise to obtain a group of pose-guided features;
4-3, computing the cosine distance between each pose-guided feature and each group of local features, and matching each heatmap with the closest group of local features;
4-4, adding the matched local feature to each pose-guided feature to obtain the pose-guided local features;
step 5, applying global average pooling to the heatmaps obtained in step 3 and multiplying them with the global features extracted in step 2 to obtain image features embedded with pose information;
step 6, extracting a dynamic feature mask of the identified image through a dynamic feature generator;
and step 7, removing those pose-guided local features obtained in step 4 whose confidence is below the threshold, concatenating the remainder with the pose-embedded image features obtained in step 5 and the dynamic feature mask obtained in step 6 to obtain the final image features, computing the distances between these features and all image features in the gallery, sorting the distances in ascending order, and taking the identity of the image feature with the smallest distance as the identity of the pedestrian in the identified image.
2. The method for re-identifying occluded pedestrians through pose guidance and dynamic feature extraction according to claim 1, characterized in that the process of step 1 is specifically as follows:
1-1, preprocessing the identified image and resizing it into a standard identified image;
1-2, performing dimension conversion on the standard identified image through a convolutional layer, then merging the first two dimensions of the resulting features to obtain a two-dimensional matrix;
and 1-3, adding an image-patch position token and a camera position token of the same length to the two-dimensional matrix obtained in step 1-2, then concatenating a learnable classification token.
3. The method for re-identifying occluded pedestrians through pose guidance and dynamic feature extraction according to claim 1, characterized in that in step 2 the encoder adopts a Vision Transformer network comprising 12 self-attention layers; before the features are input into the last self-attention layer, the features other than the classification token are divided into several groups, each group is input into a weight-shared self-attention layer and passed through a batch normalization layer to obtain several groups of local features, and the classification token serves as the global feature.
4. The method for re-identifying occluded pedestrians through pose guidance and dynamic feature extraction according to claim 1, characterized in that in step 3 the human pose estimation network performs convolution operations on the identified image and encodes the position information of the human keypoints into several groups of heatmaps representing the human pose keypoints.
5. The method for re-identifying occluded pedestrians through pose guidance and dynamic feature extraction according to claim 1, characterized in that the human pose estimation network adopts an OpenPose network.
6. The method for re-identifying occluded pedestrians through pose guidance and dynamic feature extraction according to claim 3, characterized in that the specific process of step 6 is as follows:
6-1, concatenating the output features of the 1st, 3rd, 5th, 7th, 9th and 11th layers of the encoder, excluding the classification token;
and 6-2, inputting the feature map obtained in step 6-1 into a dynamic feature generator consisting of 6 groups of convolutional layers, each of which halves the number of channels, then performing max pooling, and finally obtaining the dynamic human feature mask through a fully connected layer.
7. The method for re-identifying occluded pedestrians through pose guidance and dynamic feature extraction according to claim 1, characterized in that the expression of the total loss L during model training is as follows:

$$L = \lambda_1 L_{cls} + \lambda_2 L_{tri} + L_{A\text{-}cls}$$

where $L_{cls}$ is the cross-entropy loss:

$$L_{cls} = -\frac{1}{B}\sum_{i=1}^{B}\log p_{y}$$

B is the number of images in one training batch; N is the total number of pedestrians during training; $p_j$ is the probability that the network predicts the identified image as the j-th class, and y is the actual class of the identified image;

$L_{tri}$ is the triplet loss:

$$L_{tri} = \max\big(D(f_a, f_p) - D(f_a, f_n) + m,\ 0\big)$$

$f_a$, $f_p$, $f_n$ are an anchor image, a positive sample image of the same class as the anchor, and a negative sample image of a different class from the anchor; $D(\cdot,\cdot)$ is the Euclidean distance function; m is the margin coefficient;

$L_{A\text{-}cls}$ is the angular margin loss:

$$L_{A\text{-}cls} = -\frac{1}{B}\sum_{i=1}^{B}\log\frac{e^{s\cos(\theta_{y_i}+m)}}{e^{s\cos(\theta_{y_i}+m)}+\sum_{j=1,\,j\neq y_i}^{N}e^{s\cos\theta_j}}$$

$$\cos\theta_j = \cos\big(F_{Pose},\, D_j\big)$$

N is the total number of pedestrian IDs during training; s is a hyperparameter adjusting the loss scale; m is the margin coefficient; $F_{Pose}$ is the pose-embedded global image feature; $D_{y_i}$ is the dynamic mask of the image with pedestrian label $y_i$; $\cos(\cdot,\cdot)$ computes the cosine between two vectors.
8. A computer device comprising a memory and at least one processor, characterized in that the memory stores computer-executable instructions, and the at least one processor executes the computer-executable instructions stored by the memory, causing the at least one processor to perform the identification method of any one of claims 1-7.
9. A readable storage medium storing computer instructions, characterized in that the computer instructions, when executed by a processor, implement the identification method of any one of claims 1-7.
CN202211406567.XA 2022-11-10 2022-11-10 Method for re-identifying occluded pedestrians through pose guidance and dynamic feature extraction Pending CN115909488A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211406567.XA 2022-11-10 2022-11-10 Method for re-identifying occluded pedestrians through pose guidance and dynamic feature extraction

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211406567.XA 2022-11-10 2022-11-10 Method for re-identifying occluded pedestrians through pose guidance and dynamic feature extraction

Publications (1)

Publication Number Publication Date
CN115909488A 2023-04-04

Family

ID=86488898

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211406567.XA Method for re-identifying occluded pedestrians through pose guidance and dynamic feature extraction 2022-11-10 2022-11-10

Country Status (1)

Country Link
CN (1) CN115909488A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117409299A (en) * 2023-12-15 2024-01-16 Wuhan Textile University Method for predicting intra-image occlusion relations based on multi-scale pooling Transformers
CN117409299B (en) * 2023-12-15 2024-03-05 Wuhan Textile University Method for predicting intra-image occlusion relations based on multi-scale pooling Transformers

Similar Documents

Publication Publication Date Title
CN111709311B (en) Pedestrian re-identification method based on multi-scale convolution feature fusion
US9202144B2 (en) Regionlets with shift invariant neural patterns for object detection
CN110503076B (en) Video classification method, device, equipment and medium based on artificial intelligence
CN105069481B (en) Natural scene multiple labeling sorting technique based on spatial pyramid sparse coding
CN105678231A (en) Pedestrian image detection method based on sparse coding and neural network
CN110175615B (en) Model training method, domain-adaptive visual position identification method and device
CN110659665A (en) Model construction method of different-dimensional features and image identification method and device
CN111612024B (en) Feature extraction method, device, electronic equipment and computer readable storage medium
CN105956570B (en) Smiling face's recognition methods based on lip feature and deep learning
CN112597324A (en) Image hash index construction method, system and equipment based on correlation filtering
CN113705596A (en) Image recognition method and device, computer equipment and storage medium
CN115909488A (en) Method for re-identifying occluded pedestrians through pose guidance and dynamic feature extraction
CN103714340A (en) Self-adaptation feature extracting method based on image partitioning
CN112668662B (en) Outdoor mountain forest environment target detection method based on improved YOLOv3 network
CN113536986B (en) Dense target detection method in remote sensing image based on representative features
CN111488797B (en) Pedestrian re-identification method
CN108399413B (en) Picture shooting area identification and geographical positioning method and device
CN115050044B (en) Cross-modal pedestrian re-identification method based on MLP-Mixer
CN112016661B (en) Pedestrian re-identification method based on erasure significance region
CN114494972A (en) Target tracking method and system combining channel selection and position optimization
CN111626417B (en) Closed loop detection method based on unsupervised deep learning
Liu et al. Body-structure based feature representation for person re-identification
Xu et al. A Novel Place Recognition Network using Visual Sequences and LiDAR Point Clouds for Autonomous Vehicles
CN114168783B (en) Multi-scene pose regression method and system based on memory bank mechanism
CN116912664A (en) Gait recognition method and device based on pre-training large model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination