CN114140873A - Gait recognition method based on convolutional neural network multi-level features - Google Patents
Gait recognition method based on convolutional neural network multi-level features
- Publication number: CN114140873A
- Application number: CN202111317122.XA
- Authority
- CN
- China
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06F18/22 — Pattern recognition; analysing; matching criteria, e.g. proximity measures
- G06F18/253 — Pattern recognition; analysing; fusion techniques of extracted features
- G06N3/045 — Neural networks; architecture, e.g. interconnection topology; combinations of networks
- G06N3/084 — Neural networks; learning methods; backpropagation, e.g. using gradient descent
Abstract
A gait recognition method based on multi-level features of a convolutional neural network comprises the following steps: reading an input gait image sequence and preprocessing it; extracting features at different levels from the image sequence through feature-extraction branches with different convolution kernel sizes; fusing the extracted features of the different levels in different ways to obtain and store the features ultimately used for gait recognition; computing a loss from the feature maps and labels with a loss function, and updating the convolutional neural network model by back-propagation until a preset condition is met, generating the final recognition model; and inputting the gait target to be recognized into the final recognition model, matching the model output against the stored feature values, and taking the stored value with the highest similarity as the recognition result. The invention applies online data augmentation to the input gait silhouettes, which increases the diversity of the input data and improves the robustness of the algorithm to pedestrian silhouettes in real-world conditions.
Description
Technical Field
The invention relates to the fields of computer vision and deep learning, and in particular to a gait recognition method based on multi-level features of a convolutional neural network.
Background
Gait is a physiological and behavioral biometric characteristic that describes a person's walking pattern. Compared with biometrics such as the face, fingerprint, and iris, gait recognition is contactless, works at long range, is difficult to disguise, and requires no active cooperation from the subject; gait recognition technology is therefore widely applied in access control systems, security surveillance, human-computer interaction, and other fields.
With the development of deep learning and convolutional neural networks, gait recognition has also made great progress. Mainstream gait recognition methods are generally divided into template-based and sequence-based approaches, both of which recognize pedestrians from gait silhouette sequences. Although sequence-based methods such as the GaitSet and GaitPart algorithms achieve good results on public datasets, GaitSet recognizes from the whole silhouette and does not capture local features, while GaitPart extracts local features but at only a single feature level; the recognition accuracy of both algorithms in real-world conditions remains insufficient.
Disclosure of Invention
In view of the above, the present invention has been developed to provide a gait recognition method based on multi-level features of a convolutional neural network that overcomes, or at least partially solves, the above problems.
In order to solve this technical problem, the embodiments of the present application disclose the following technical solution:
a gait recognition method based on multi-level features of a convolutional neural network comprises the following steps:
s100, reading an input gait image sequence and preprocessing it;
s200, extracting features at different levels from the image sequence through feature-extraction branches with different convolution kernel sizes;
s300, fusing the extracted features of the different levels in different ways to obtain and store the features ultimately used for gait recognition;
s400, computing a loss from the feature maps and labels with a loss function, and updating the convolutional neural network model by back-propagation until a preset condition is met, generating the final recognition model;
s500, inputting the gait target to be recognized into the final recognition model, matching the model output against the stored feature values, and taking the stored value with the highest similarity as the recognition result.
Further, the specific method of S100 is: the CASIA-B open-source gait recognition dataset is downloaded and the length of each gait sequence in the dataset is checked; if a sequence is longer than 150 frames, the middle 30 frames are taken, and if it is shorter than 30 frames, the sequence is discarded. Each input gait frame is cropped to 64 × 64 according to the silhouette edges and the center line in the height and width directions.
Further, the specific method of S100 also includes: a morphological opening or closing operation is randomly applied to the image with probability ratio, where the kernel size of the opening and closing operations is set to 3 and ratio is set to 0.2 during training and to 0 during inference; the number of input sequences in each batch is P × K, where P is the number of pedestrians per batch during training or inference and K is the number of gait sequences selected per pedestrian under different viewing-angle, clothing, backpack, or walking conditions.
Further, the specific method of S200 includes: features of different granularities are extracted from the input image sequence by three convolutions of sizes 1 × 1, 3 × 3, and 5 × 5, each with 1 input channel, 32 output channels, and stride 2; the input is n image sequences of s = 30 frames of size 64 × 64, from which the features of different granularities are extracted; the 3 feature maps of spatial size 32 × 32 are then concatenated along the channel dimension by a Concat layer, finally yielding feature maps containing information at different levels.
Furthermore, the Block layer consists of two convolution layers with kernel size 3 and stride 1 followed by a max pooling with stride 2; the first convolution layer has 96 input channels and 64 output channels, the second has 64 input channels and 128 output channels, and the feature maps containing information at different levels extracted by the Block layer have the output shape n × 30 × 128 × 16 × 16.
Further, the specific method of S300 includes: the multi-level features extracted in S200 are fused in the W and H directions of the feature map using a Block_W layer, a Block_H layer, an HPM layer, a VPM layer, and a Concat2 layer; the Block_W and Block_H layers each consist of 2 convolution layers with kernel size 3 and stride 1, where the first convolution has 128 input and 256 output channels and the second has 256 input and 256 output channels.
Further, the specific method of S300 also includes: the Block_W layer divides the features output by the feature-extraction layer into n equal parts along the W direction, extracts local features from each of the n vertical strip feature maps by convolution, aggregates the features by concatenation, and extracts strip-shaped local information in the W direction with global max set pooling; the Block_H layer likewise divides the features into n equal parts along the H direction, extracts local features from each of the n horizontal strip feature maps by convolution, aggregates the features by concatenation, and extracts strip-shaped local information in the H direction with global max set pooling; the HPM layer maps each H-direction pooled feature to the discrimination space with its own fully connected layer; the VPM layer maps each W-direction pooled feature to the discrimination space with its own fully connected layer; and the Concat2 layer contains a Concat operation that splices the output feature maps of the HPM and VPM layers together along the channel dimension, forming a feature map containing local features in both the W and H directions.
Further, in S400 a triplet loss (Triplet loss) and a center loss (Center loss) are used for joint training, and the loss function is designed as:

LOSS = L_tri + β · L_Cen

where the joint loss consists of the triplet loss L_tri, the center loss L_Cen, and a factor β balancing the weights of the two losses. The triplet loss is

L_tri = (1/N) · Σ_{i=1}^{N} [ d(x_i^a, x_i^p) − d(x_i^a, x_i^n) + α ]_+

where N is the number of samples in a training batch, x_i^a is a randomly chosen training sample, x_i^p is a positive sample of the same class as x_i^a, x_i^n is a negative sample of a different class, d(x_i^a, x_i^p) is the intra-class Euclidean distance between the chosen sample and the same-class sample, d(x_i^a, x_i^n) is the Euclidean distance between the chosen sample and the different-class sample, and α is the minimum margin between the inter-class and intra-class distances. The center loss is

L_Cen = (1/2) · Σ_{i=1}^{N} ‖x_i − c_{y_i}‖²

where x_i is a sample feature and c_{y_i} is the center of its class, obtained during the training process.
Further, the training process is optimized with the Adam algorithm, the learning rate is set to 0.0001 and the momentum to 0.9, and training stops when the training loss falls below 0.0000001 or the number of iterations exceeds 100,000.
The technical solution provided by the embodiments of the invention has at least the following beneficial effects:
The disclosed gait recognition method based on multi-level features of a convolutional neural network applies online data augmentation to the input gait silhouettes, increasing the diversity of the input data and improving the robustness of the algorithm to pedestrian silhouettes in real-world conditions. The method extracts gait information with convolutions of different receptive fields and represents the gait with separate local features along the height (h) and width (w) directions, and it trains with a triplet loss and a center loss, which increases the distance between different classes, reduces the distance within the same class, and improves the accuracy of gait recognition.
The technical solution of the present invention is further described in detail by the accompanying drawings and embodiments.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the principles of the invention and not to limit the invention. In the drawings:
fig. 1 is a flowchart of a gait recognition method based on a convolutional neural network multi-level feature according to embodiment 1 of the present invention;
fig. 2 is a structural diagram of a convolutional neural network in a gait recognition method based on multi-level features of the convolutional neural network in embodiment 1 of the present invention.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
In order to solve the problems in the prior art, an embodiment of the invention provides a gait recognition method based on multi-level features of a convolutional neural network.
Example 1
The embodiment discloses a gait recognition method based on multi-level features of a convolutional neural network, as shown in fig. 1, comprising the following steps:
s100, reading an input gait image sequence, and preprocessing the image sequence;
In this embodiment, the specific method of S100 is: the CASIA-B open-source gait recognition dataset is downloaded and the length of each gait sequence in the dataset is checked; if a sequence is longer than 150 frames, the middle 30 frames are taken, and if it is shorter than 30 frames, the sequence is discarded. Each input gait frame is cropped to 64 × 64 according to the silhouette edges and the center line in the height and width directions.
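The length filtering and cropping described above can be sketched as follows. The function and parameter names are illustrative, and locating the crop window from the silhouette's horizontal centre of mass is one plausible reading of the "edge and center line" rule, not a detail stated in the patent:

```python
import numpy as np

def preprocess_sequence(frames, min_len=30, long_len=150, take=30, out_size=64):
    """Sketch of the S100 preprocessing: drop sequences shorter than 30
    frames, keep the middle 30 frames of sequences longer than 150, and
    crop each silhouette to 64x64 around its horizontal centre line."""
    n = len(frames)
    if n < min_len:
        return None                      # sequence discarded
    if n > long_len:
        mid = n // 2
        frames = frames[mid - take // 2 : mid + take // 2]
    cropped = []
    for f in frames:
        h, w = f.shape
        # locate the silhouette's horizontal centre from its pixel mass
        cols = f.sum(axis=0)
        cx = int(round((cols * np.arange(w)).sum() / max(cols.sum(), 1)))
        left = int(np.clip(cx - out_size // 2, 0, max(w - out_size, 0)))
        patch = f[:out_size, left:left + out_size]
        # pad if the frame is smaller than the target crop
        pad_h, pad_w = out_size - patch.shape[0], out_size - patch.shape[1]
        patch = np.pad(patch, ((0, pad_h), (0, pad_w)))
        cropped.append(patch)
    return np.stack(cropped)
```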
In some preferred embodiments, the specific method of S100 also includes: a morphological opening or closing operation is randomly applied to the image with probability ratio, where the kernel size of the opening and closing operations is set to 3 and ratio is set to 0.2 during training and to 0 during inference; the number of input sequences in each batch is P × K, where P is the number of pedestrians per batch during training or inference and K is the number of gait sequences selected per pedestrian under different viewing-angle, clothing, backpack, or walking conditions. As an example, P may be set to 8 and K to 16.
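A minimal sketch of this opening/closing augmentation, using SciPy's binary morphology with a 3 × 3 kernel. The function name and the choice of `scipy.ndimage` are illustrative assumptions, not mandated by the patent:

```python
import numpy as np
from scipy.ndimage import binary_opening, binary_closing

def augment_silhouette(img, ratio=0.2, rng=None):
    """Online augmentation sketch for S100: with probability `ratio`
    apply a random morphological opening or closing with a 3x3 kernel
    (ratio = 0.2 during training, 0 at inference)."""
    rng = rng if rng is not None else np.random.default_rng()
    if rng.random() >= ratio:
        return img
    kernel = np.ones((3, 3), dtype=bool)
    op = binary_opening if rng.random() < 0.5 else binary_closing
    return op(img > 0, structure=kernel).astype(img.dtype)
```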
S200, extracting features at different levels from the image sequence through feature-extraction branches with different convolution kernel sizes. In this embodiment, as shown in fig. 2, the specific method of S200 includes: features of different granularities are extracted from the input image sequence by three convolutions of sizes 1 × 1, 3 × 3, and 5 × 5, each with 1 input channel, 32 output channels, and stride 2; for n image sequences of s = 30 frames of size 64 × 64, each branch outputs a feature map of shape n × 30 × 32 × 32 × 32 after extracting features at its granularity. The 3 feature maps of spatial size 32 × 32 are concatenated along the channel dimension by a Concat layer, whose output is n × 30 × 96 × 32 × 32. The Block layer consists of two convolution layers with kernel size 3 and stride 1 followed by a max pooling with stride 2; the first convolution has 96 input and 64 output channels, the second has 64 input and 128 output channels, and the feature maps containing information at different levels extracted by the Block layer have the output shape n × 30 × 128 × 16 × 16.
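The branch-and-concatenate stem plus Block layer can be sketched in PyTorch as below; the frames are folded into the batch dimension, so the n × 30 input frames become N = n·30 samples. The class and attribute names are illustrative, not from the patent:

```python
import torch
import torch.nn as nn

class MultiGranularityStem(nn.Module):
    """Sketch of the S200 branches: three parallel convolutions
    (1x1, 3x3, 5x5, stride 2, 1 -> 32 channels) concatenated on the
    channel axis, followed by the Block layer (two 3x3 stride-1 convs,
    96 -> 64 -> 128, plus stride-2 max pooling)."""
    def __init__(self):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(1, 32, k, stride=2, padding=k // 2) for k in (1, 3, 5)
        ])
        self.block = nn.Sequential(
            nn.Conv2d(96, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
        )

    def forward(self, x):                                 # x: (N, 1, 64, 64)
        x = torch.cat([b(x) for b in self.branches], dim=1)  # (N, 96, 32, 32)
        return self.block(x)                                 # (N, 128, 16, 16)
```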
S300, fusing the extracted features of the different levels in different ways to obtain and store the features ultimately used for gait recognition. Specifically, in this embodiment, a Block_W layer, a Block_H layer, an HPM layer, a VPM layer, and a Concat2 layer are used to fuse the multi-level features extracted in S200 in the W and H directions of the feature map; the Block_W and Block_H layers each consist of 2 convolution layers with kernel size 3 and stride 1, where the first convolution has 128 input and 256 output channels and the second has 256 input and 256 output channels.
The Block_W layer divides the features output by the feature-extraction layer into n equal parts along the W direction, extracts local features from each of the n vertical strip feature maps by convolution, aggregates the features by concatenation, and extracts strip-shaped local information in the W direction with global max set pooling; the Block_H layer likewise divides the features into n equal parts along the H direction, extracts local features from each of the n horizontal strip feature maps by convolution, aggregates the features by concatenation, and extracts strip-shaped local information in the H direction with global max set pooling; the HPM layer maps each H-direction pooled feature to the discrimination space with its own fully connected layer; the VPM layer maps each W-direction pooled feature to the discrimination space with its own fully connected layer; and the Concat2 layer contains a Concat operation that splices the output feature maps of the HPM and VPM layers together along the channel dimension, forming a feature map containing local features in both the W and H directions.
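The two-direction strip fusion can be sketched as follows. The strip count and embedding size are illustrative assumptions (the patent leaves n unspecified), and per-frame set pooling is folded into the spatial max for brevity:

```python
import torch
import torch.nn as nn

class StripFusion(nn.Module):
    """Sketch of the S300 fusion: refine the stem output with Block_H and
    Block_W (two 3x3 convs each, 128 -> 256 -> 256), cut the result into
    n horizontal (H) and n vertical (W) strips, global-max-pool each
    strip, map every pooled strip to the discrimination space with its
    own fully connected layer (HPM for H strips, VPM for W strips), and
    concatenate the results."""
    def __init__(self, n_strips=4, in_ch=128, mid_ch=256, emb=128):
        super().__init__()
        def refine():
            return nn.Sequential(
                nn.Conv2d(in_ch, mid_ch, 3, padding=1), nn.ReLU(inplace=True),
                nn.Conv2d(mid_ch, mid_ch, 3, padding=1), nn.ReLU(inplace=True))
        self.block_h, self.block_w = refine(), refine()
        self.n = n_strips
        self.hpm = nn.ModuleList([nn.Linear(mid_ch, emb) for _ in range(n_strips)])
        self.vpm = nn.ModuleList([nn.Linear(mid_ch, emb) for _ in range(n_strips)])

    def forward(self, x):                            # x: (N, 128, 16, 16)
        fh, fw = self.block_h(x), self.block_w(x)
        outs = [fc(s.amax(dim=(2, 3)))               # max-pool each H strip
                for s, fc in zip(fh.chunk(self.n, dim=2), self.hpm)]
        outs += [fc(s.amax(dim=(2, 3)))              # max-pool each W strip
                 for s, fc in zip(fw.chunk(self.n, dim=3), self.vpm)]
        return torch.cat(outs, dim=1)                # (N, 2 * n_strips * emb)
```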
S400, computing a loss from the feature maps and labels with a loss function, and updating the convolutional neural network model by back-propagation until a preset condition is met, generating the final recognition model;
In this embodiment, a triplet loss (Triplet loss) and a center loss (Center loss) are used for joint training, and the loss function is designed as:

LOSS = L_tri + β · L_Cen

where the joint loss consists of the triplet loss L_tri, the center loss L_Cen, and a factor β balancing the weights of the two losses. The triplet loss is

L_tri = (1/N) · Σ_{i=1}^{N} [ d(x_i^a, x_i^p) − d(x_i^a, x_i^n) + α ]_+

where N is the number of samples in a training batch, x_i^a is a randomly chosen training sample, x_i^p is a positive sample of the same class as x_i^a, x_i^n is a negative sample of a different class, d(x_i^a, x_i^p) is the intra-class Euclidean distance between the chosen sample and the same-class sample, d(x_i^a, x_i^n) is the Euclidean distance between the chosen sample and the different-class sample, and α is the minimum margin between the inter-class and intra-class distances. The center loss is

L_Cen = (1/2) · Σ_{i=1}^{N} ‖x_i − c_{y_i}‖²

where x_i is a sample feature and c_{y_i} is the center of its class, obtained during the training process. The training process is optimized with the Adam algorithm, the learning rate is set to 0.0001 and the momentum to 0.9, and training stops when the training loss falls below 0.0000001 or the number of iterations exceeds 100,000.
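The joint objective above can be sketched as follows. The batch-hard mining strategy, the margin α = 0.2, and the weight β = 0.003 are illustrative assumptions (the patent does not fix these choices), and `centers` stands in for the learned per-class centers:

```python
import torch

def joint_loss(feats, labels, centers, alpha=0.2, beta=0.003):
    """Sketch of LOSS = L_tri + beta * L_Cen: a triplet loss with
    margin alpha over the hardest positive/negative per anchor, plus a
    center loss against per-class centers of shape (num_classes, dim)."""
    d = torch.cdist(feats, feats)                      # pairwise distances
    same = labels[:, None] == labels[None, :]
    eye = torch.eye(len(labels), dtype=torch.bool)
    pos = d.masked_fill(~same | eye, 0)                # intra-class distances
    neg = d.masked_fill(same, float('inf'))            # inter-class distances
    # hardest positive minus hardest negative, hinged at the margin
    tri = (pos.max(dim=1).values - neg.min(dim=1).values + alpha).clamp(min=0)
    l_tri = tri.mean()
    # center loss: squared distance of each sample to its class center
    l_cen = 0.5 * ((feats - centers[labels]) ** 2).sum(dim=1).mean()
    return l_tri + beta * l_cen
```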
S500, inputting the gait target to be recognized into the final recognition model, matching the model output against the stored feature values, and taking the stored value with the highest similarity as the recognition result. During gait recognition, a query set is first established, i.e. the gait feature values of all targets are stored; for a gait target to be identified, its feature value is computed according to S100-S300 and its Euclidean distance to each stored target is calculated one by one, and the stored target with the smallest Euclidean distance is taken as the recognition result.
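The nearest-neighbour matching step can be sketched as below; the function and variable names are illustrative:

```python
import numpy as np

def identify(query_feat, gallery_feats, gallery_ids):
    """Sketch of S500 matching: compare the query embedding against the
    stored gallery feature values and return the identity with the
    smallest Euclidean distance, together with that distance."""
    dists = np.linalg.norm(gallery_feats - query_feat, axis=1)
    best = int(np.argmin(dists))
    return gallery_ids[best], float(dists[best])
```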
The gait recognition method based on multi-level features of a convolutional neural network disclosed in this embodiment applies online data augmentation to the input gait silhouettes, increasing the diversity of the input data and improving the robustness of the algorithm to pedestrian silhouettes in real-world scenes. The method extracts gait information with convolutions of different receptive fields and represents the gait with separate local features along the height (h) and width (w) directions, and it trains with a triplet loss and a center loss, which increases the distance between different classes, reduces the distance within the same class, and improves the accuracy of gait recognition.
It should be understood that the specific order or hierarchy of steps in the processes disclosed is an example of exemplary approaches. Based upon design preferences, it is understood that the specific order or hierarchy of steps in the processes may be rearranged without departing from the scope of the present disclosure. The accompanying method claims present elements of the various steps in a sample order, and are not intended to be limited to the specific order or hierarchy presented.
In the foregoing detailed description, various features are grouped together in a single embodiment for the purpose of streamlining the disclosure. This method of disclosure is not to be interpreted as reflecting an intention that the claimed embodiments of the subject matter require more features than are expressly recited in each claim. Rather, as the following claims reflect, invention lies in less than all features of a single disclosed embodiment. Thus, the following claims are hereby expressly incorporated into the detailed description, with each claim standing on its own as a separate preferred embodiment of the invention.
Those of skill would further appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate this interchangeability of hardware and software, various illustrative components, blocks, modules, circuits, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present disclosure.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. Of course, the storage medium may also be integral to the processor. The processor and the storage medium may reside in an ASIC. The ASIC may reside in a user terminal. Of course, the processor and the storage medium may reside as discrete components in a user terminal.
For a software implementation, the techniques described herein may be implemented with modules (e.g., procedures, functions, and so on) that perform the functions described herein. The software codes may be stored in memory units and executed by processors. The memory unit may be implemented within the processor or external to the processor, in which case it can be communicatively coupled to the processor via various means as is known in the art.
What has been described above includes examples of one or more embodiments. It is, of course, not possible to describe every conceivable combination of components or methodologies for purposes of describing the aforementioned embodiments, but one of ordinary skill in the art may recognize that many further combinations and permutations of various embodiments are possible. Accordingly, the embodiments described herein are intended to embrace all such alterations, modifications and variations that fall within the scope of the appended claims. Furthermore, to the extent that the term "includes" is used in either the detailed description or the claims, such term is intended to be inclusive in a manner similar to the term "comprising" as "comprising" is interpreted when employed as a transitional word in a claim. Furthermore, any use of the term "or" in the specification of the claims is intended to mean a "non-exclusive or".
Claims (9)
1. A gait recognition method based on multi-level features of a convolutional neural network, characterized by comprising the following steps:
s100, reading an input gait image sequence and preprocessing it;
s200, extracting features at different levels from the image sequence through feature-extraction branches with different convolution kernel sizes;
s300, fusing the extracted features of the different levels in different ways to obtain and store the features ultimately used for gait recognition;
s400, computing a loss from the feature maps and labels with a loss function, and updating the convolutional neural network model by back-propagation until a preset condition is met, generating the final recognition model;
s500, inputting the gait target to be recognized into the final recognition model, matching the model output against the stored feature values, and taking the stored value with the highest similarity as the recognition result.
2. The gait recognition method based on multi-level features of a convolutional neural network according to claim 1, characterized in that the specific method of S100 is: the CASIA-B open-source gait recognition dataset is downloaded and the length of each gait sequence in the dataset is checked; if a sequence is longer than 150 frames, the middle 30 frames are extracted; if it is shorter than 30 frames, the sequence is discarded; each input gait frame is cropped to 64 × 64 according to the silhouette edges and the center line in the height and width directions.
3. The gait recognition method based on multi-level features of a convolutional neural network according to claim 1, characterized in that the specific method of S100 also includes: a morphological opening or closing operation is randomly applied to the image with probability ratio, where the kernel size of the opening and closing operations is set to 3 and ratio is set to 0.2 during training and to 0 during inference; the number of input sequences in each batch is P × K, where P is the number of pedestrians per batch during training or inference and K is the number of gait sequences selected per pedestrian under different viewing-angle, clothing, backpack, or walking conditions.
4. The gait recognition method based on multi-level features of a convolutional neural network according to claim 1, characterized in that the specific method of S200 includes: features of different granularities are extracted from the input image sequence by three convolutions of sizes 1 × 1, 3 × 3, and 5 × 5, each with 1 input channel, 32 output channels, and stride 2; the input is n image sequences of s = 30 frames of size 64 × 64, from which the features of different granularities are extracted; the 3 feature maps of spatial size 32 × 32 are then concatenated along the channel dimension by a Concat layer, finally yielding feature maps containing information at different levels.
5. The gait recognition method based on multi-level features of a convolutional neural network according to claim 4, characterized in that the Block layer consists of two convolution layers with kernel size 3 and stride 1 followed by a max pooling with stride 2; the first convolution layer has 96 input channels and 64 output channels, the second has 64 input channels and 128 output channels, and the feature maps containing information at different levels extracted by the Block layer have the output shape n × 30 × 128 × 16 × 16.
6. The gait recognition method based on the multi-level features of the convolutional neural network as claimed in claim 1, wherein the specific method of S300 comprises: fusing the multi-level features extracted in S200 along the W and H directions of the feature map using a Block_W layer, a Block_H layer, an HPM layer, a VPM layer, and a Concat2 layer; the Block_W layer and the Block_H layer each consist of 2 convolutional layers with kernel size 3 and stride 1, the first convolutional layer having 128 input channels and 256 output channels, and the second convolutional layer having 256 input channels and 256 output channels.
7. The gait recognition method based on the multi-level features of the convolutional neural network as claimed in claim 6, wherein the specific method of S300 further comprises: the Block_W layer divides the features output by the feature extraction layer into n equal parts along the W direction, extracts local features from each of the n strip feature maps split along the W direction using a convolution operation, concatenates (concat) the features, and extracts strip-wise local information in the W direction using global max pooling; the Block_H layer divides the features output by the feature extraction layer into n equal parts along the H direction, extracts local features from each of the n strip feature maps split along the H direction using a convolution operation, concatenates the features, and extracts strip-wise local information in the H direction using global max pooling; the HPM layer maps each H-direction pooled feature to a discrimination space using an independent fully connected layer; the VPM layer maps each W-direction pooled feature to a discrimination space using an independent fully connected layer; and the Concat2 layer contains a Concat operation that splices the output feature maps of the HPM layer and the VPM layer along the channel dimension to form a feature map containing local features in both the W and H directions.
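A minimal NumPy sketch of the strip partitioning and global max pooling described above. The strip count n = 4 and the 128 × 16 × 16 map size are illustrative, and the per-strip convolutions and independent fully connected layers of HPM/VPM are omitted; the sketch only shows the split-and-pool geometry:

```python
import numpy as np

def strip_max_pool(fmap, n, axis):
    """Split a (C, H, W) feature map into n equal strips along `axis`
    (1 = H direction, 2 = W direction) and global-max-pool each strip,
    returning an (n, C) matrix of strip descriptors."""
    strips = np.split(fmap, n, axis=axis)
    return np.stack([s.max(axis=(1, 2)) for s in strips])

fmap = np.random.rand(128, 16, 16)          # C=128 map, as from the Block layer
h_feats = strip_max_pool(fmap, 4, axis=1)   # 4 H-direction strips -> (4, 128)
w_feats = strip_max_pool(fmap, 4, axis=2)   # 4 W-direction strips -> (4, 128)
both = np.concatenate([h_feats, w_feats])   # Concat2-style stacking -> (8, 128)
```

In the claimed method, each of the 8 strip descriptors would then pass through its own fully connected layer (HPM for the H strips, VPM for the W strips) before concatenation.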
8. The gait recognition method based on the multi-level features of the convolutional neural network as claimed in claim 1, wherein in S400, joint training is performed using a triplet loss and a center loss, and the loss function is designed as follows:
LOSS = L_tri + β · L_Cen
wherein the joint loss consists of the triplet loss L_tri, the center loss L_Cen, and a weight balance factor β between the two losses; the triplet loss is

L_tri = (1/N) Σ_{i=1}^{N} max( d(x_i^a, x_i^p) − d(x_i^a, x_i^n) + α, 0 )

wherein N is the number of samples in a training batch, x_i^a is a sample chosen at random during training, x_i^p is the positive sample corresponding to the training sample, x_i^n is the negative sample corresponding to the training sample, d(x_i^a, x_i^p) is the intra-class Euclidean distance between the randomly selected sample and a same-class sample, d(x_i^a, x_i^n) is the Euclidean distance between the randomly selected sample and a different-class sample, and α is the minimum margin between the inter-class distance and the intra-class distance corresponding to the sample; the center loss is

L_Cen = (1/2) Σ_{i=1}^{N} ||x_i − c_{y_i}||²

wherein x_i is the feature of the i-th sample and c_{y_i} is the class center of all samples of its class, obtained during the training process.
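A NumPy sketch of the joint loss. The batch-mean normalization of the center loss and the default values of α and β are assumptions for illustration; the claim defines β only as a learned weight balance factor between the two losses:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, alpha=0.2):
    """Mean over the batch of max(d(a,p) - d(a,n) + alpha, 0)."""
    d_ap = np.linalg.norm(anchor - positive, axis=1)   # intra-class distances
    d_an = np.linalg.norm(anchor - negative, axis=1)   # inter-class distances
    return np.maximum(d_ap - d_an + alpha, 0.0).mean()

def center_loss(features, labels, centers):
    """0.5 * mean squared distance of each feature to its class center."""
    return 0.5 * ((features - centers[labels]) ** 2).sum(axis=1).mean()

def joint_loss(anchor, positive, negative, labels, centers,
               alpha=0.2, beta=0.01):
    """LOSS = L_tri + beta * L_Cen, per the claimed design."""
    return triplet_loss(anchor, positive, negative, alpha) \
        + beta * center_loss(anchor, labels, centers)
```

When the anchor coincides with its positive and the negative is far away, the triplet term vanishes, which is the behavior the margin α is meant to enforce.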
9. The gait recognition method based on the multi-level features of the convolutional neural network as claimed in claim 8, wherein the training process is optimized using the Adam algorithm, the learning rate is set to 0.0001, the momentum is 0.9, and training is stopped when the loss of the training process is less than 0.0000001 or the number of iterations exceeds 100000.
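The stopping rule of claim 9 can be stated directly in code (the function and parameter names are illustrative; the thresholds are the ones given in the claim):

```python
def should_stop(loss, iteration, loss_eps=1e-7, max_iter=100_000):
    """Stop when the training loss drops below 1e-7 or the
    iteration count exceeds 100000, per claim 9."""
    return loss < loss_eps or iteration > max_iter
```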
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111317122.XA CN114140873A (en) | 2021-11-09 | 2021-11-09 | Gait recognition method based on convolutional neural network multi-level features |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114140873A true CN114140873A (en) | 2022-03-04 |
Family
ID=80392761
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114627424A (en) * | 2022-03-25 | 2022-06-14 | 合肥工业大学 | Gait recognition method and system based on visual angle transformation |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20170243058A1 (en) * | 2014-10-28 | 2017-08-24 | Watrix Technology | Gait recognition method based on deep learning |
CN110110668A (en) * | 2019-05-08 | 2019-08-09 | 湘潭大学 | A kind of gait recognition method based on feedback weight convolutional neural networks and capsule neural network |
CN110969087A (en) * | 2019-10-31 | 2020-04-07 | 浙江省北大信息技术高等研究院 | Gait recognition method and system |
US20200320284A1 (en) * | 2018-04-12 | 2020-10-08 | Tencent Technology (Shenzhen) Company Ltd | Media processing method, related apparatus, and storage medium |
CN111985332A (en) * | 2020-07-20 | 2020-11-24 | 浙江工业大学 | Gait recognition method for improving loss function based on deep learning |
CN112818755A (en) * | 2021-01-13 | 2021-05-18 | 华中科技大学 | Gait recognition method based on active learning |
WO2021115159A1 (en) * | 2019-12-09 | 2021-06-17 | 中兴通讯股份有限公司 | Character recognition network model training method, character recognition method, apparatuses, terminal, and computer storage medium therefor |
US20210224524A1 (en) * | 2020-01-22 | 2021-07-22 | Board Of Trustees Of Michigan State University | Systems And Methods For Gait Recognition Via Disentangled Representation Learning |
Non-Patent Citations (1)
Title |
---|
LUO Jinmei; LUO Jian; LI Yanmei; ZHAO Xu: "Research on Face Recognition Algorithm Based on Multi-Feature Fusion CNN", Aeronautical Computing Technique, no. 03, 25 May 2019 (2019-05-25) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||