CN112861605A - Multi-person gait recognition method based on space-time mixed characteristics - Google Patents

Multi-person gait recognition method based on space-time mixed characteristics

Info

Publication number
CN112861605A
CN112861605A (application CN202011570903.5A)
Authority
CN
China
Prior art keywords
pedestrian
features
gait
network
gait recognition
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011570903.5A
Other languages
Chinese (zh)
Inventor
成科扬
何霄兵
王文杉
师文喜
司宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhenjiang Zhaoyuan Intelligent Technology Co ltd
Jiangsu University
Electronic Science Research Institute of CTEC
Original Assignee
Zhenjiang Zhaoyuan Intelligent Technology Co ltd
Jiangsu University
Electronic Science Research Institute of CTEC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date: 2020-12-26
Publication date: 2021-05-28
Application filed by Zhenjiang Zhaoyuan Intelligent Technology Co ltd, Jiangsu University, Electronic Science Research Institute of CTEC filed Critical Zhenjiang Zhaoyuan Intelligent Technology Co ltd
Priority to CN202011570903.5A
Publication of CN112861605A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • G06V40/23 Recognition of whole body movements, e.g. for sport training
    • G06V40/25 Recognition of walking or running movements, e.g. gait recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F18/24133 Distances to prototypes
    • G06F18/24137 Distances to cluster centroïds
    • G06F18/2414 Smoothing the distance, e.g. radial basis function networks [RBFN]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Psychiatry (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a multi-person gait recognition method based on space-time mixed characteristics. The method first introduces a hybrid mask network to perform pedestrian detection and segmentation, with an association head added to the network to extract low-dimensional pedestrian features. A pedestrian contour sequence to be identified is then selected and fed into a gait recognition model based on a pseudo-three-dimensional residual network for feature extraction. The model uses pseudo-three-dimensional residual networks to extract features from the upper and lower halves of the pedestrian contour separately, and the two sets of features are concatenated during horizontal pyramid pooling. Finally, the identity information of the pedestrian is output through Euclidean distance measurement. The disclosed multi-person gait recognition method can solve the problem that pedestrian gait recognition cannot be performed in complex scenes.

Description

Multi-person gait recognition method based on space-time mixed characteristics
Technical Field
The invention relates to the technical fields of computer vision, pattern recognition, and the like, mainly concerns the identification of pedestrians in surveillance video, and has wide application in crime prevention, forensic identification, social security, and other areas.
Background
Unlike other biometric technologies (e.g., face, fingerprint, and iris), gait is a biometric feature that can be recognized at a distance without the cooperation of the subject. Gait recognition therefore has high practical value and broad application prospects.
Gait recognition has made a series of advances in the past decade, but research still largely remains at the stage of single-pedestrian gait recognition, and multi-person gait recognition is still an open problem. Current single-person gait recognition methods fall mainly into two categories: model-based methods and appearance-based methods. Model-based methods extract features by modeling the human body structure and the local motion patterns of different body parts. Some early model-based methods even labeled different body parts manually, or used special equipment to acquire the joint positions of the human body, at heavy computational expense. Later, with the development of pose estimation, Liao et al. proposed a pose-based gait recognition method in 2017 that made considerable progress.
Appearance-based methods typically use human body contours as raw input data. The gait energy image, one of the most popular features, is obtained by aligning and averaging contours; it has a low computational cost and can achieve a relatively high recognition rate. This approach still does not perform well enough, however, because the gait energy image loses some temporal information. Recently, some researchers have used body contours directly as input data rather than their average. Wu et al. first used a deep learning model to extract features from a sequence of human body contours in 2017. Chao et al. in 2018 treated gait as a set of individual, non-continuous silhouettes and extracted invariant features from this set. Experiments show that exploiting the temporal features between frames achieves better performance than gait energy images.
Single-person gait recognition technology has thus made great progress, but in a real application scenario a surveillance video will not contain only a single pedestrian; to address this problem, a multi-person gait recognition method based on space-time mixed features is proposed.
Disclosure of Invention
The purpose of the invention is as follows: in real application scenarios, surveillance video contains not only gait under single-person walking conditions but also gait under multi-person walking conditions. However, current gait recognition technology remains at the laboratory stage, that is, no pedestrians or moving objects other than the target may appear in the video. The invention therefore aims to solve the multi-person gait recognition problem by combining pedestrian segmentation and tracking with gait recognition technology, so that gait recognition can truly be deployed in real scenes and save more resources in the field of social security.
1. A multi-person gait recognition method based on space-time mixed characteristics is characterized by comprising the following steps:
step 1.1: segmenting and tracking the pedestrians in the original video frames by using a pedestrian segmentation and tracking method;
step 1.2: storing the gait contour sequence of each pedestrian in its corresponding folder;
step 1.3: selecting the pedestrian gait contour sequence to be identified and extracting features through a gait recognition network;
step 1.4: outputting the identity information of the pedestrian through Euclidean distance measurement.
2. The multi-person gait recognition method based on space-time mixed features as claimed in claim 1, wherein the method of pedestrian segmentation and tracking in step 1.1 is as follows:
step 2.1: performing feature extraction on the video frames by using two three-dimensional convolutional layers;
step 2.2: adopting a hybrid mask network to detect and segment pedestrians;
step 2.3: extending the hybrid mask network with an association head, which takes the feature maps of the regions generated by the network as input and extracts an association vector for each region; the Euclidean distances between association vectors are used to associate detections over time into tracks, thereby realizing pedestrian tracking.
Step 2.4: computing the triplet association loss by selecting, for each sample, the farthest positive sample and the closest negative sample, and using it to optimize the whole tracking module; the association loss is as follows:

$$L_{assoc} = \frac{1}{|D|} \sum_{d \in D} \Big[ \max_{e \in D,\, id_e = id_d} \| a_d - a_e \|_2 - \min_{e \in D,\, id_e \neq id_d} \| a_d - a_e \|_2 + \alpha \Big]_+$$

wherein $D$ is the set of detections in the video, $d$ and $e$ are detections at time frames $t_d$ and $t_e$ respectively, $a_d$ and $a_e$ are their association vectors, and $\alpha$ is a margin threshold.
3. The pedestrian segmentation and tracking method according to claim 2, wherein the method for extracting the association vector in step 2.3 is as follows:
step 3.1: repeatedly up-sampling the feature map of the last convolutional layer of the hybrid mask network and element-wise adding it to the feature map at each pyramid level, obtaining new feature maps of the different pyramid levels with stronger representation capability;
step 3.2: taking the corresponding region of the bottom-level feature map of the new pyramid as the input of the association module and extracting the low-dimensional features of the pedestrian.
4. The multi-person gait recognition method based on space-time mixed features as claimed in claim 1, wherein the method for extracting features through the gait recognition network in step 1.3 comprises the following steps:
step 4.1: regarding the gait as a sequence of continuous pedestrian contours, and extracting the space-time mixed features of the upper half and the lower half of the contour through two pseudo-three-dimensional residual network main pipelines, while adding features of different layers to a multi-layer global pipeline;
step 4.2: extracting features at 4 scales by utilizing horizontal pyramid pooling;
step 4.3: optimizing the whole network model by joint training with triplet loss and center loss, wherein:
the triplet loss function is as follows:

$$L_{tri} = \big[ D(f_a, f_p) - D(f_a, f_n) + a \big]_+$$

where $D(\cdot,\cdot)$ is the Euclidean distance, $f_a$, $f_p$ and $f_n$ are the feature representations of the sample, the positive sample and the negative sample respectively, $a$ is a margin threshold, and $[\cdot]_+$ means that the value inside the brackets is taken as the loss when it is greater than 0 and the loss is 0 otherwise;
the center loss function is as follows:

$$L_{center} = \frac{1}{2} \sum_{i=1}^{m} \| x_i - c_{y_i} \|_2^2$$

where $x_i$ is the feature before the fully connected layer and $c_{y_i}$ denotes the feature center of class $y_i$.
5. The gait recognition model of claim 4, wherein the method of extracting the space-time mixed features of the upper half and the lower half of the pedestrian contour in step 4.1 comprises:
step 5.1: horizontally dividing the input feature map into an upper part and a lower part, and extracting features from each through its own pseudo-three-dimensional residual network main pipeline;
step 5.2: concatenating the two extracted feature vectors when they are subjected to horizontal pyramid pooling.
The beneficial effects of the invention are as follows:
on the basis of single-pedestrian gait recognition technology, a solution for multi-person gait recognition is provided by combining pedestrian segmentation and tracking techniques, greatly improving the feasibility of deploying gait recognition in practical applications.
Drawings
FIG. 1 is a core structure diagram of a multi-person gait recognition method based on space-time mixed features according to the invention;
FIG. 2 is a schematic diagram of a pedestrian segmentation and tracking model structure;
FIG. 3 is a schematic diagram of a hybrid mask network architecture;
FIG. 4 is a schematic diagram of a convolutional network and feature pyramid module structure;
FIG. 5 is a schematic diagram of a gait recognition model structure;
FIG. 6 is a schematic diagram of a pseudo-three-dimensional residual block structure.
Detailed Description
The invention will be further explained with reference to the drawings.
As shown in FIG. 1, the multi-person gait recognition method based on space-time mixed features of the invention specifically comprises the following steps:
Step 1: input the original video sequence into the pedestrian segmentation and tracking model to obtain the pedestrian segmentation and tracking results. The structure of the pedestrian segmentation and tracking model is shown in FIG. 2.
Step 1.1: to enhance the temporal correlation between frames, feature extraction is performed on the original video frames using two three-dimensional convolutional layers, both with convolution kernels of size 3x3x3 and stride 1x1x1. A linear rectification function is then applied as the activation function, followed by a three-dimensional max pooling operation with kernel size 1x1x1.
step 1.2: and adopting a hybrid mask network to detect and divide the pedestrians. The hybrid mask network is a network for unifying target detection and instance segmentation, and the specific structure is shown in fig. 3: the system mainly comprises a feature extraction module, a target detection module and an instance segmentation module.
The feature extraction module inputs the feature map extracted in step 1, and the feature map mainly comprises a convolution network and a feature pyramid module, and the specific structure and parameters are shown in fig. 4.
The target detection module, i.e. the detector module, is divided into classification branches and regression branches, each adding 4 convolutional layers after the feature map. The input and output channels of each convolutional layer are 256, the size of the convolutional kernel is 3x3, the step and the padding are 1, and then group normalization is carried out and a linear rectification function is adopted as an activation function.
The example partitioning module is composed of a top module, a bottom module and a mixing module. The top module adds a single convolutional layer to each detector, wherein the number of input and output channels is 256 and 4 respectively, and the size of the convolutional kernel is 1x1, so as to generate 4 prediction example attention diagrams. The input of the bottom module is some characteristic diagrams of C2-C5 or P2-P5, for example, C3 and C5 are selected, and then C5 is up-sampled by 4 times and spliced with C3. Then, after a convolution kernel with the size of 3x3 and the number of output channels of 4, 4 prediction score maps are generated. And finally, the mixing module sequentially multiplies the matrix of the example attention diagram and the matrix corresponding to the score diagram by elements, and then adds the 4 results to obtain the mask.
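The mixing step can be sketched as follows, assuming the per-instance attention maps have already been cropped and resized to the score-map resolution (tensor shapes are assumptions):

```python
import torch

def blend_masks(attention_maps, score_maps):
    """Element-wise multiply each of the 4 instance attention maps with its
    corresponding score map, then sum the 4 products to obtain one mask.

    attention_maps: (num_instances, 4, H, W) per-instance attention maps
    score_maps:     (4, H, W) score maps shared by all instances
    returns:        (num_instances, H, W) mask logits
    """
    return (attention_maps * score_maps.unsqueeze(0)).sum(dim=1)

masks = blend_masks(torch.rand(5, 4, 56, 56), torch.rand(4, 56, 56))
```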
Step 1.3: the hybrid mask network is extended with an association head, a fully connected layer that takes the region feature maps generated by the network as input and extracts a 128-dimensional association vector for each region; the Euclidean distances between association vectors are used to associate detections over time into tracks, thereby achieving pedestrian tracking.
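A sketch of such an association head, assuming the region features have already been pooled to a fixed size (the pooled dimensions are assumptions; only the 128-dimensional output is stated in the text):

```python
import torch
import torch.nn as nn

class AssociationHead(nn.Module):
    """Fully connected layer mapping a pooled region feature map to a
    128-dimensional association vector (step 1.3)."""
    def __init__(self, in_features=256 * 7 * 7, embed_dim=128):
        super().__init__()
        self.fc = nn.Linear(in_features, embed_dim)

    def forward(self, region_feats):  # (num_regions, channels, height, width)
        return self.fc(region_feats.flatten(start_dim=1))

head = AssociationHead()
vectors = head(torch.randn(10, 256, 7, 7))  # (10, 128)
distances = torch.cdist(vectors, vectors)   # pairwise Euclidean distances
```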
Step 1.4: the triplet association loss is computed by selecting, for each detection, the farthest positive sample and the closest negative sample, and is used to optimize the whole tracking module; the association loss is as follows:

$$L_{assoc} = \frac{1}{|D|} \sum_{d \in D} \Big[ \max_{e \in D,\, id_e = id_d} \| a_d - a_e \|_2 - \min_{e \in D,\, id_e \neq id_d} \| a_d - a_e \|_2 + \alpha \Big]_+$$

wherein $D$ is the set of detections in the video, $d$ and $e$ are detections at time frames $t_d$ and $t_e$ respectively, $a_d$ and $a_e$ are their association vectors, and $\alpha$ is a margin threshold.
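A minimal PyTorch sketch of this batch-hard association loss under the stated mining rule (the margin value and batching are assumptions):

```python
import torch

def association_loss(vectors, ids, alpha=0.2):
    """For each detection, take the farthest positive and the closest negative
    in association-vector space, then apply a hinge with margin alpha (step 1.4).

    vectors: (N, 128) association vectors of all detections in the batch
    ids:     (N,) integer track identities
    """
    dists = torch.cdist(vectors, vectors)        # (N, N) Euclidean distances
    same = ids.unsqueeze(0) == ids.unsqueeze(1)  # True where identities match
    hardest_pos = dists.masked_fill(~same, float("-inf")).amax(dim=1)
    hardest_neg = dists.masked_fill(same, float("inf")).amin(dim=1)
    return torch.clamp(hardest_pos - hardest_neg + alpha, min=0).mean()

loss = association_loss(torch.randn(16, 128), torch.randint(0, 4, (16,)))
```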
In general, the pedestrian segmentation and tracking model is trained on the target dataset KITTI MOTS, using adaptive moment estimation (Adam) as the optimizer with a learning rate of 5e-7, for 40 epochs. During training, a mini-batch consisting of 8 adjacent frames of a single video is used as input.
Step 2: the segmentation and tracking results obtained in step 1 are post-processed to generate pedestrian gait contour sequences. Specifically, for each frame, each target generates a corresponding binary mask, which is converted into a binary image and stored sequentially in the folder corresponding to that target's identity number.
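A sketch of this post-processing step, assuming the masks arrive as per-frame boolean arrays keyed by track identity (the file layout and naming are assumptions):

```python
import os
import numpy as np
from PIL import Image

def save_silhouettes(frame_idx, masks_by_id, out_root="silhouettes"):
    """Convert each target's binary mask to a binary image and store it in the
    folder named after the target's identity number (step 2).

    masks_by_id: dict mapping track id -> (H, W) boolean numpy array
    """
    for track_id, mask in masks_by_id.items():
        folder = os.path.join(out_root, f"{track_id:03d}")
        os.makedirs(folder, exist_ok=True)
        image = Image.fromarray(mask.astype(np.uint8) * 255)  # 0/255 binary image
        image.save(os.path.join(folder, f"{frame_idx:06d}.png"))

save_silhouettes(0, {1: np.zeros((128, 88), dtype=bool)})
```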
Step 3: the pedestrian contour sequence to be identified is selected and features are extracted using the gait recognition model, whose structure is shown in FIG. 5.
Step 3.1: the gait is regarded as a sequence of continuous pedestrian contours, and the space-time mixed features of the upper and lower halves of the contour are extracted by 2 pseudo-three-dimensional residual network main pipelines. At the same time, features from different layers are added to a multi-layer global pipeline in order to exploit features at different depths. The pseudo-three-dimensional residual network is composed of different pseudo-three-dimensional residual block structures and three-dimensional pooling. The basic idea is to decouple a 3x3x3 three-dimensional convolution kernel into a 1x3x3 two-dimensional spatial convolution and a 3x1x1 one-dimensional temporal convolution. Combining this with the idea of residual learning units, the spatial and temporal convolutions are arranged in series and in parallel to form pseudo-three-dimensional residual block structures A and B respectively; the specific structure is shown in FIG. 6.
Step 3.2: to make feature extraction both local and global, features at 4 scales are extracted using horizontal pyramid pooling, which divides the feature maps into horizontal strips according to the scale, applies max pooling and average pooling to each strip, and adds the two pooling results element-wise. After horizontal pyramid pooling, parameter-independent fully connected layers are used to improve the discriminability of the feature ensemble.
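A sketch of horizontal pyramid pooling over a feature map, assuming 1, 2, 4, and 8 strips for the 4 scales (the strip counts and embedding size are assumptions; the text fixes only the number of scales and the max-plus-average combination):

```python
import torch
import torch.nn as nn

class HorizontalPyramidPooling(nn.Module):
    """Split the feature map into horizontal strips at 4 scales, apply max and
    average pooling to each strip, sum the two, then pass each strip through
    its own (parameter-independent) fully connected layer (step 3.2)."""
    def __init__(self, channels=128, embed_dim=256, scales=(1, 2, 4, 8)):
        super().__init__()
        self.scales = scales
        self.fcs = nn.ModuleList(
            [nn.Linear(channels, embed_dim) for _ in range(sum(scales))]
        )

    def forward(self, x):  # x: (batch, channels, height, width)
        feats, fc_iter = [], iter(self.fcs)
        for s in self.scales:
            for strip in x.chunk(s, dim=2):          # split along the height axis
                pooled = strip.amax(dim=(2, 3)) + strip.mean(dim=(2, 3))
                feats.append(next(fc_iter)(pooled))  # (batch, embed_dim)
        return torch.stack(feats, dim=1)             # (batch, 15, embed_dim)

out = HorizontalPyramidPooling()(torch.randn(2, 128, 16, 11))
```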
Step 3.3: finally, the whole model is optimized by joint training with triplet loss and center loss. The triplet loss maximizes the inter-class difference by pushing the distance between a sample and its positive sample to be smaller than the distance between the sample and its negative sample; the loss function is as follows:

$$L_{tri} = \big[ D(f_a, f_p) - D(f_a, f_n) + a \big]_+$$

In the above formula, $D(\cdot,\cdot)$ is the Euclidean distance, $f_a$, $f_p$ and $f_n$ are the feature representations of the sample, the positive sample and the negative sample respectively, $a$ is a margin threshold, and $[\cdot]_+$ means that the value inside the brackets is taken as the loss when it is greater than 0 and the loss is 0 otherwise.
The center loss focuses on the uniformity of the intra-class distribution, pulling features to cluster uniformly around their class center and thereby minimizing the intra-class difference; the loss function is as follows:

$$L_{center} = \frac{1}{2} \sum_{i=1}^{m} \| x_i - c_{y_i} \|_2^2$$

In the above formula, $x_i$ is the feature before the fully connected layer and $c_{y_i}$ denotes the feature center of class $y_i$.
In general, the gait recognition model is trained on the target dataset CASIA-B with sets of aligned contours of size 64 x 44. The base number of contours per set in training is set to 30, the number of people p in each batch is set to 8, and the number of frame sets per person k is set to 16. Adaptive moment estimation (Adam) is chosen as the optimizer, with the triplet loss and center loss as the loss functions. The learning rate is set to 1e-4, and training runs for 80000 iterations.
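A sketch of the p x k batch construction described above, reading k as the number of contour sets drawn per person (an interpretation; the dataset interface is an assumption):

```python
import random

def sample_pk_batch(sequences_by_id, p=8, k=16, frames=30):
    """Build one training batch: p identities, k contour sequences per identity,
    each re-sampled to a fixed number of frames.

    sequences_by_id: dict mapping person id -> list of silhouette-path lists
    """
    batch = []
    for pid in random.sample(sorted(sequences_by_id), p):
        for seq in random.choices(sequences_by_id[pid], k=k):
            clip = random.choices(seq, k=frames)  # fix the contour count at 30
            batch.append((pid, clip))
    return batch

# Dummy dataset: 10 identities, each with 5 sequences of 40 silhouette paths
data = {i: [[f"{i}/{s}/{f}.png" for f in range(40)] for s in range(5)] for i in range(10)}
batch = sample_pk_batch(data)  # 8 * 16 = 128 (identity, clip) pairs
```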
Step 4: Euclidean distance measurement is performed between the feature vector of the pedestrian to be identified and the feature vectors of the known pedestrians in the database, thereby determining the pedestrian's identity.
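A minimal sketch of this matching step (the gallery layout is an assumption):

```python
import torch

def identify(probe_vec, gallery_vecs, gallery_ids):
    """Return the identity of the gallery entry closest to the probe under the
    Euclidean distance (step 4).

    probe_vec:    (dim,) feature vector of the pedestrian to identify
    gallery_vecs: (num_entries, dim) feature vectors of known pedestrians
    gallery_ids:  list of identity labels, one per gallery entry
    """
    dists = torch.norm(gallery_vecs - probe_vec.unsqueeze(0), dim=1)
    return gallery_ids[dists.argmin().item()]

who = identify(torch.randn(256), torch.randn(100, 256),
               [f"id_{i}" for i in range(100)])
```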
The above detailed description is only a specific description of a possible embodiment of the present invention and is not intended to limit its scope; equivalent embodiments or modifications that do not depart from the technical spirit of the present invention should be included within its scope.

Claims (5)

1. A multi-person gait recognition method based on space-time mixed characteristics is characterized by comprising the following steps:
step 1.1: segmenting and tracking the pedestrians in the original video frames by using a pedestrian segmentation and tracking method;
step 1.2: storing the gait contour sequence of each pedestrian in its corresponding folder;
step 1.3: selecting the pedestrian gait contour sequence to be identified and extracting features through a gait recognition network;
step 1.4: outputting the identity information of the pedestrian through Euclidean distance measurement.
2. The multi-person gait recognition method based on space-time mixed features as claimed in claim 1, wherein the method of pedestrian segmentation and tracking in step 1.1 is as follows:
step 2.1: performing feature extraction on the video frames by utilizing two three-dimensional convolutional layers;
step 2.2: adopting a hybrid mask network to detect and segment pedestrians;
step 2.3: extending the hybrid mask network with an association head, which takes the feature maps of the regions generated by the network as input and extracts an association vector for each region; the Euclidean distances between association vectors are used to associate detections over time into tracks, thereby realizing pedestrian tracking.
Step 2.4: computing the triplet association loss by selecting, for each sample, the farthest positive sample and the closest negative sample, and using it to optimize the whole tracking module; the association loss is as follows:

$$L_{assoc} = \frac{1}{|D|} \sum_{d \in D} \Big[ \max_{e \in D,\, id_e = id_d} \| a_d - a_e \|_2 - \min_{e \in D,\, id_e \neq id_d} \| a_d - a_e \|_2 + \alpha \Big]_+$$

wherein $D$ is the set of detections in the video, $d$ and $e$ are detections at time frames $t_d$ and $t_e$ respectively, $a_d$ and $a_e$ are their association vectors, and $\alpha$ is a margin threshold.
3. The pedestrian segmentation and tracking method according to claim 2, wherein the method for extracting the association vector in step 2.3 is as follows:
step 3.1: repeatedly up-sampling the feature map of the last convolutional layer of the hybrid mask network and element-wise adding it to the feature map at each pyramid level, obtaining new feature maps of the different pyramid levels with stronger representation capability;
step 3.2: taking the corresponding region of the bottom-level feature map of the new pyramid as the input of the association module and extracting the low-dimensional features of the pedestrian.
4. The multi-person gait recognition method based on space-time mixed features as claimed in claim 1, wherein the method for extracting features through the gait recognition network in step 1.3 comprises the following steps:
step 4.1: regarding the gait as a sequence of continuous pedestrian contours, and extracting the space-time mixed features of the upper half and the lower half of the contour through 2 pseudo-three-dimensional residual network main pipelines, while adding features of different layers to a multi-layer global pipeline;
step 4.2: extracting features at 4 scales by utilizing horizontal pyramid pooling;
step 4.3: optimizing the whole network model by joint training with triplet loss and center loss, wherein:
the triplet loss function is as follows:

$$L_{tri} = \big[ D(f_a, f_p) - D(f_a, f_n) + a \big]_+$$

in the above formula, $D(\cdot,\cdot)$ is the Euclidean distance, $f_a$, $f_p$ and $f_n$ are the feature representations of the sample, the positive sample and the negative sample respectively, $a$ is a margin threshold, and $[\cdot]_+$ means that the value inside the brackets is taken as the loss when it is greater than 0 and the loss is 0 otherwise;
the center loss function is as follows:

$$L_{center} = \frac{1}{2} \sum_{i=1}^{m} \| x_i - c_{y_i} \|_2^2$$

in the above formula, $x_i$ is the feature before the fully connected layer and $c_{y_i}$ denotes the feature center of class $y_i$.
5. The gait recognition model of claim 4, wherein the method of extracting the space-time mixed features of the upper half and the lower half of the pedestrian contour in step 4.1 comprises:
step 5.1: horizontally dividing the input feature map into an upper part and a lower part, and extracting features from each through its own pseudo-three-dimensional residual network main pipeline;
step 5.2: concatenating the two extracted feature vectors when they are subjected to horizontal pyramid pooling.
CN202011570903.5A, filed 2020-12-26 (priority 2020-12-26): Multi-person gait recognition method based on space-time mixed characteristics; published as CN112861605A; status: Pending.

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202011570903.5A | 2020-12-26 | 2020-12-26 | Multi-person gait recognition method based on space-time mixed characteristics (CN112861605A)

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202011570903.5A | 2020-12-26 | 2020-12-26 | Multi-person gait recognition method based on space-time mixed characteristics (CN112861605A)

Publications (1)

Publication Number | Publication Date
CN112861605A | 2021-05-28

Family

ID=75997338

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202011570903.5A | Multi-person gait recognition method based on space-time mixed characteristics (CN112861605A) | 2020-12-26 | 2020-12-26

Country Status (1)

Country | Link
CN | CN112861605A

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113408448A (en) * 2021-06-25 2021-09-17 之江实验室 Method and device for extracting local features of three-dimensional space-time object and identifying object
CN113806463A (en) * 2021-09-06 2021-12-17 李莉 Trajectory similarity calculation method based on space-time pyramid matching
CN113887358A (en) * 2021-09-23 2022-01-04 南京信息工程大学 Gait recognition method based on partial learning decoupling representation
CN113887358B (en) * 2021-09-23 2024-05-31 南京信息工程大学 Gait recognition method based on partial learning decoupling characterization
CN114140883A (en) * 2021-12-10 2022-03-04 沈阳康泰电子科技股份有限公司 Gait recognition method and device
CN114419395A (en) * 2022-01-20 2022-04-29 江苏大学 Online target detection model training method based on intermediate position coding


Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination