CN116682141A - Multi-label pedestrian attribute identification method and medium based on multi-scale progressive perception

Info

Publication number: CN116682141A
Application number: CN202310657643.2A
Authority: CN (China)
Legal status: Pending (the listed status is an assumption, not a legal conclusion)
Other languages: Chinese (zh)
Inventors: 陈婷婷, 陈明明, 杨光, 林国凤, 张勤, 黄智财, 薛鹏辉, 郭泽扬
Current and original assignee: Xiamen Huaxia University
Application filed by: Xiamen Huaxia University

Classifications

    • G06V 40/10: Human or animal bodies, e.g. vehicle occupants or pedestrians; body parts, e.g. hands
    • G06N 3/0464: Convolutional networks [CNN, ConvNet]
    • G06N 3/08: Learning methods
    • G06V 10/40: Extraction of image or video features
    • G06V 10/82: Image or video recognition or understanding using neural networks
    • Y02T 10/40: Engine management systems


Abstract

The invention discloses a multi-label pedestrian attribute identification method based on multi-scale progressive perception, comprising the following steps: inputting a pedestrian image into a backbone network and extracting features through a plurality of residual convolution blocks of the backbone network to obtain attribute feature information; constructing and training a plurality of multi-scale progressive perception models; embedding the trained multi-scale progressive perception models into the backbone network, and feeding the attribute feature information output by each target residual convolution block into its corresponding multi-scale progressive perception model to obtain multi-scale feature information; processing the multi-scale feature information with a global average pooling layer and sending it to a first attribute prediction layer for attribute probability prediction; and using the first attribute prediction layer corresponding to each later multi-scale progressive perception model to progressively constrain the first attribute prediction layer corresponding to the preceding multi-scale progressive perception model. The invention also discloses a computer-readable storage medium. The method improves the feature robustness over the whole pedestrian attribute area.

Description

Multi-label pedestrian attribute identification method and medium based on multi-scale progressive perception
Technical Field
The invention relates to the technical field of pedestrian attribute identification, in particular to a multi-label pedestrian attribute identification method and medium based on multi-scale progressive perception.
Background
Pedestrian attribute recognition aims to identify multiple attributes (e.g., long hair, business wear, leather shoes, glasses, age, gender, etc.) in a single pedestrian image. With the rapid development of surveillance technology, a large number of monitoring systems have been deployed in public places. Pedestrian attribute identification, a technique for acquiring semantic attribute information about a specific target, has therefore attracted growing attention in recent years and has become a key technology in video surveillance applications. It is also increasingly the primary means of facilitating pedestrian re-identification and pedestrian retrieval research. However, despite many years of effort, pedestrian attribute recognition remains challenging, because pedestrian poses, viewpoints, illumination changes, imperfect pedestrian detection, occlusion, and similar factors all affect the recognition results.
In the past few years, many approaches have demonstrated their effectiveness on the pedestrian attribute recognition task. Unlike a conventional image classification task, where each image belongs to a single category, a pedestrian image typically carries multiple attribute labels that must be classified jointly. Pedestrian attribute recognition is therefore treated as a multi-label task: to predict the presence of a particular attribute, the region where that attribute appears must be located.
Existing methods use a backbone network to extract features, append a linear classification layer containing multiple binary classifiers, and predict the pedestrian attributes under the constraint of a binary cross-entropy loss function. However, these methods ignore that the appearance and position of the same attribute change with pedestrian pose, so that the attribute presents itself in different visual forms; a backbone that learns only global feature information cannot cope with such intra-class attribute variation. For example, when a person carries a backpack on the left side and faces the camera, the backpack appears on the left of the image; when the pedestrian faces away from the camera, it appears on the right. Moreover, the backbone applies the same convolution kernels across the whole image to extract features, so local pedestrian attribute regions are often overlooked. Because previous methods focus on global information and lack learning of local feature information, the extracted features cannot be robust to all attributes.
Disclosure of Invention
In view of the above, the present invention aims to provide a multi-label pedestrian attribute recognition method based on multi-scale progressive perception, which embeds a multi-scale progressive perception model into a backbone network for feature learning of local regions. The multi-scale progressive perception model proposed by the invention can be applied to various backbone networks to strengthen the learning of local attribute feature information by existing pedestrian attribute methods. In addition, for feature information at different scales, a dynamic aggregation strategy is used to combine the various features so as to improve the feature robustness over the whole pedestrian attribute area.
In order to achieve the technical purpose, the invention adopts the following technical scheme:
the invention provides a multi-label pedestrian attribute identification method based on multi-scale progressive perception, which comprises the following steps:
step 1, inputting a pedestrian image into a backbone network, and performing feature extraction through a plurality of residual convolution blocks of the backbone network to obtain the attribute feature information output by each residual convolution block;
step 2, constructing a plurality of multi-scale progressive perception models, and training each multi-scale progressive perception model;
step 3, embedding a plurality of trained multi-scale progressive perception models into the backbone network, and inputting attribute characteristic information output by a target residual convolution block into the corresponding multi-scale progressive perception model to obtain multi-scale characteristic information;
step 4, after the multi-scale characteristic information is processed by the global average pooling layer, the multi-scale characteristic information is sent to the first attribute prediction layer to predict attribute probability;
and step 5, using the first attribute prediction layer corresponding to each later multi-scale progressive perception model to progressively constrain the first attribute prediction layer corresponding to the preceding multi-scale progressive perception model.
Further, the step 1 specifically includes:
step 11, the i-th pedestrian image x_i in the acquired pedestrian data set D is taken as the input of the backbone network; the pedestrian attribute label corresponding to x_i is defined as y_i ∈ {0,1}^M, where M is the number of pedestrian attribute categories, 0 indicates that a pedestrian attribute is absent, and 1 indicates that it is present;
step 12, the backbone network comprises l residual convolution blocks which are sequentially connected, the pedestrian image is used as the input of the 1 st residual convolution block, and the output of the current residual convolution block is used as the input of the next residual convolution block;
step 13, the pedestrian image x_i undergoes feature extraction through the residual convolution blocks of the backbone network to obtain the corresponding attribute feature information, expressed as:
F_l = B_l(x | θ_1, …, θ_l)    (1)
where F_l denotes the attribute feature information output by the l-th residual convolution block, B_l denotes the composition of the 1st through l-th residual convolution blocks in the backbone network, and θ_1, …, θ_l are the training parameters of the 1st through l-th residual convolution blocks.
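Equation (1) can be illustrated with a minimal PyTorch sketch. This is an illustration only, not the patent's code: the block structure, channel widths, and input size are assumptions (the embodiment itself uses ResNet50); the sketch only shows how the per-block outputs F_1 … F_l are collected.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Minimal residual block standing in for one backbone stage."""
    def __init__(self, cin, cout):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(cin, cout, 3, stride=2, padding=1),
            nn.BatchNorm2d(cout), nn.ReLU(inplace=True),
            nn.Conv2d(cout, cout, 3, padding=1), nn.BatchNorm2d(cout))
        self.skip = nn.Conv2d(cin, cout, 1, stride=2)  # match shape for the residual sum
    def forward(self, x):
        return torch.relu(self.conv(x) + self.skip(x))

class Backbone(nn.Module):
    """Chain of l residual blocks; forward returns every block's output F_1..F_l."""
    def __init__(self, channels=(3, 64, 128, 256, 512)):
        super().__init__()
        self.blocks = nn.ModuleList(
            ResidualBlock(cin, cout) for cin, cout in zip(channels, channels[1:]))
    def forward(self, x):
        feats = []
        for block in self.blocks:
            x = block(x)          # output of block p becomes input of block p+1 (step 12)
            feats.append(x)
        return feats

feats = Backbone()(torch.randn(2, 3, 256, 128))  # a batch of 256x128 pedestrian crops
```

Each element of `feats` corresponds to one F_l of Eq. (1); later steps tap these intermediate outputs rather than only the final one.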
Further, the step 2 specifically includes:
step 21, constructing a plurality of multi-scale progressive perception models;
step 22, training each multi-scale progressive perception model by using a binary cross entropy as a loss function;
the expression of the loss function is:
wherein L is bce Representing a loss function, N and M representing data amounts, i representing numbers of pedestrian image sheets, j representing numbers of pedestrian attributes,representing the ith pedestrian image x i Sending the image x of the ith pedestrian into a multi-scale progressive perception model i The model of the jth pedestrian attribute predicts a probability value; y is i,j Jth pedestrian attribute tag value, ω, representing ith pedestrian image j Representing an imbalance suppression factor; log represents a logarithmic function, σ represents an activation function, e represents an index, r j Representing the positive sample proportion of the jth pedestrian attribute in the training set.
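The weighted binary cross-entropy described above can be written out in plain Python. This is a sketch under the stated definitions; the function names are our own.

```python
import math

def sigmoid(z):
    """Logistic activation sigma(z)."""
    return 1.0 / (1.0 + math.exp(-z))

def weighted_bce(logits, labels, pos_ratio):
    """Weighted BCE of Eq. (2).

    logits, labels: N x M nested lists (raw scores and 0/1 labels);
    pos_ratio: length-M list of r_j, the positive-sample ratio per attribute.
    """
    n = len(logits)
    total = 0.0
    for i in range(n):
        for j, r in enumerate(pos_ratio):
            p = sigmoid(logits[i][j])
            y = labels[i][j]
            # imbalance suppression factor: e^(1-r_j) for positives, e^(r_j) for negatives
            omega = y * math.exp(1.0 - r) + (1.0 - y) * math.exp(r)
            total += omega * (y * math.log(p) + (1.0 - y) * math.log(1.0 - p))
    return -total / n

# one image, one attribute, logit 0 -> sigma = 0.5, so loss = e^0.5 * ln 2
loss = weighted_bce([[0.0]], [[1]], [0.5])
```

Rare attributes (small r_j) thus receive a larger positive weight e^(1−r_j), counteracting label imbalance.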
Further, the step 3 specifically includes:
step 31, taking the 1st through (l−1)-th residual convolution blocks in the backbone network as target residual convolution blocks, and setting the number of multi-scale progressive perception models according to the number of target residual convolution blocks;
step 32, embedding a plurality of trained multi-scale progressive perception models into the backbone network;
step 33, the output F_p of the p-th target residual convolution block is input into the p-th multi-scale progressive perception model to obtain the p-th scale feature information, where p is a positive integer and 1 ≤ p ≤ l − 1.
Further, the step 33 specifically includes:
step 331, taking attribute characteristic information output by each target residual convolution block in the backbone network as input of a corresponding multi-scale progressive perception model;
step 332, in the multi-scale progressive perception model, sending the attribute characteristic information into a dimension-reducing convolution layer for dimension-reducing operation;
step 333, the dimension-reduced attribute feature information is fed into a plurality of branches, each of which applies a different convolution kernel to extract features at a different scale, yielding different first scale features;
step 334, the first scale feature extracted from the convolution kernel in each branch is subjected to a full connection layer to adjust the dimension of the first scale feature, and then nonlinear processing is performed through an activation function to obtain a second scale feature;
step 335, multiplying the second scale feature and the first scale feature in each branch as the output feature of the branch;
step 336, adding the output features of the multiple branches to obtain multi-scale feature information.
Further, the dimension-reducing convolution layer adopts a 1x1 convolution kernel.
Further, three branches are provided, each equipped with a different convolution kernel: a 3x3 convolution kernel, a 5x5 convolution kernel, and a 7x7 convolution kernel, respectively.
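Steps 331-336 together with these kernel choices can be sketched as a single PyTorch module. This is a hedged illustration: the reduced channel width and the exact wiring of the FC gate (here computed from globally pooled branch features) are assumptions not fixed by the text.

```python
import torch
import torch.nn as nn

def stacked3x3(ch, n):
    """n stacked 3x3 convs: n=1 -> 3x3, n=2 -> 5x5, n=3 -> 7x7 receptive field."""
    return nn.Sequential(*[nn.Conv2d(ch, ch, 3, padding=1) for _ in range(n)])

class MSPPBlock(nn.Module):
    """One multi-scale progressive perception module (sketch)."""
    def __init__(self, cin, cmid=64):
        super().__init__()
        self.reduce = nn.Conv2d(cin, cmid, 1)   # step 332: 1x1 dimension-reducing conv
        self.branches = nn.ModuleList(stacked3x3(cmid, n) for n in (1, 2, 3))
        self.gates = nn.ModuleList(nn.Linear(cmid, cmid) for _ in range(3))

    def forward(self, x):
        x = self.reduce(x)
        out = 0
        for branch, fc in zip(self.branches, self.gates):
            f1 = branch(x)                              # step 333: first scale feature
            g = torch.sigmoid(fc(f1.mean(dim=(2, 3))))  # step 334: FC + Sigmoid -> second scale feature
            out = out + f1 * g[:, :, None, None]        # steps 335-336: multiply, then sum branches
        return out

y = MSPPBlock(256)(torch.randn(2, 256, 16, 8))  # e.g. features from one backbone stage
```

Since each 3x3 conv uses padding 1, all branches preserve the spatial size, so their outputs can be summed directly into the multi-scale feature information.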
Further, in step 5, the first attribute prediction layer corresponding to each later multi-scale progressive perception model progressively constrains, via the L2 norm, the first attribute prediction layer corresponding to the preceding multi-scale progressive perception model.
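One way to realize this L2-norm progressive constraint, under the interpretation that only the earlier prediction layer is pulled toward the later one, is sketched below; the layer sizes and the detach choice are assumptions, not the patent's verbatim formulation.

```python
import torch
import torch.nn as nn

def progressive_l2(pred_layers):
    """Sum of L2 distances between adjacent first attribute prediction layers.

    pred_layers: list of nn.Linear, ordered from earlier to later perception models.
    The later layer is detached so the gradient only updates the earlier layer,
    aligning its optimization direction with the later model's.
    """
    penalty = torch.zeros(())
    for prev, nxt in zip(pred_layers, pred_layers[1:]):
        penalty = penalty + (prev.weight - nxt.weight.detach()).norm(p=2)
    return penalty

layers = [nn.Linear(64, 26) for _ in range(3)]  # e.g. M = 26 attributes (assumed)
reg = progressive_l2(layers)                    # add this to the training loss
```

The penalty would be added to the BCE loss during training; at inference it plays no role, which is consistent with the plug-and-play claim.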
Further, after the step 5, the method further includes:
and 6, after the attribute characteristic information output by the last residual convolution block is processed by the global average pooling layer, the attribute characteristic information is sent to a second attribute prediction layer, and the output result of the second attribute prediction layer is used as a final pedestrian attribute identification result.
The invention also provides a computer readable storage medium, on which a computer program is stored, which when executed by a processor implements the multi-tag pedestrian attribute identification method based on multi-scale progressive perception as described above.
By adopting the technical scheme, compared with the prior art, the invention has the beneficial effects that:
the invention introduces a scale progressive perception model, which is embedded into a backbone network thereof and is used for learning the characteristics of a local area. The scale progressive perception model provided by the invention can be applied to various backbone networks to promote the characteristic information learning of the local attribute by the existing pedestrian attribute method. In addition, for the characteristic information of different scales, a dynamic aggregation strategy is used for combining various characteristics so as to improve the characteristic robustness of the whole pedestrian attribute area. In addition, the multi-scale progressive perception model provided by the invention is plug and play, no extra calculation cost is generated during reasoning, and experiments on a plurality of data sets prove that the proposed method can bring about remarkable performance improvement.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a schematic diagram of a multi-tag pedestrian attribute identification method based on multi-scale progressive sensing according to an embodiment of the present invention.
Fig. 2 is a frame diagram of a multi-scale progressive perception model provided by an embodiment of the present invention.
Fig. 3 is a schematic diagram of a computer readable storage medium according to an embodiment of the present invention.
The reference numerals in the figures illustrate:
the multi-scale progressive perception model comprises a residual convolution block 1, a multi-scale progressive perception model 2, a global average pooling layer 3, a first attribute prediction layer 4 and a second attribute prediction layer 5.
Detailed Description
The invention is described in further detail below with reference to the drawings and examples. It is specifically noted that the following examples are only for illustrating the present invention, but do not limit the scope of the present invention. Likewise, the following examples are only some, but not all, of the examples of the present invention, and all other examples, which a person of ordinary skill in the art would obtain without making any inventive effort, are within the scope of the present invention.
Referring to fig. 1 and 2, the multi-label pedestrian attribute identification method based on multi-scale progressive perception of the invention comprises the following steps:
step 1, inputting a pedestrian image into a backbone network (ResNet50, which in this embodiment comprises 4 residual convolution blocks 1), and performing feature extraction through the residual convolution blocks 1 of the backbone network to obtain the attribute feature information output by each residual convolution block 1;
in this embodiment, the step 1 specifically includes:
step 11, the i-th pedestrian image x_i in the acquired pedestrian data set D is taken as the input of the backbone network; the pedestrian attribute label corresponding to x_i is defined as y_i ∈ {0,1}^M, where M is the number of pedestrian attribute categories, 0 indicates that a pedestrian attribute is absent, and 1 indicates that it is present;
step 12, the backbone network comprises l residual convolution blocks 1 which are sequentially connected, the pedestrian image is used as the input of the 1 st residual convolution block 1, and the output of the current residual convolution block 1 is used as the input of the next residual convolution block 1;
step 13, the pedestrian image x_i undergoes feature extraction through the residual convolution blocks 1 of the backbone network to obtain the corresponding attribute feature information, expressed as:
F_l = B_l(x | θ_1, …, θ_l)    (1)
where F_l denotes the attribute feature information output by the l-th residual convolution block 1, B_l denotes the composition of the 1st through l-th residual convolution blocks 1 in the backbone network, and θ_1, …, θ_l are the training parameters of the 1st through l-th residual convolution blocks 1.
Step 2, constructing a plurality of multi-scale progressive perception models 2, and training each multi-scale progressive perception model 2;
in this embodiment, the step 2 specifically includes:
step 21, constructing a plurality of multi-scale progressive perception models 2;
step 22, the whole pedestrian attribute identification problem is regarded as a multi-label classification task, binary cross entropy (BCELoss) is adopted as a loss function, and each multi-scale progressive perception model 2 is trained through the loss function;
the expression of the loss function is:
wherein L is bce Representing a loss function, N and M representing data amounts, i representing numbers of pedestrian image sheets, j representing numbers of pedestrian attributes,representing the ith pedestrian image x i Feeding the image into a multi-scale progressive perception model 2, and aiming at an ith pedestrian image x i The model of the jth pedestrian attribute predicts a probability value; y is i,j Jth pedestrian attribute tag value, ω, representing ith pedestrian image j Representing an imbalance suppression factor; log represents a logarithmic function, σ represents an activation function, e represents an index, r j Representing the positive sample proportion of the jth pedestrian attribute in the training set.
Step 3, embedding a plurality of trained multi-scale progressive perception models 2 into the backbone network, and inputting attribute characteristic information output by the target residual convolution block 1 into the corresponding multi-scale progressive perception models 2 to obtain multi-scale characteristic information;
in this embodiment, the step 3 specifically includes:
step 31, taking the 1st through (l−1)-th residual convolution blocks 1 in the backbone network as target residual convolution blocks 1, and setting the number of multi-scale progressive perception models according to the number of target residual convolution blocks 1;
step 32, embedding a plurality of trained multi-scale progressive perception models 2 into the backbone network; for feature learning of a local region;
step 33, the output F_p of the p-th target residual convolution block 1 is input into the p-th multi-scale progressive perception model 2 to obtain the p-th scale feature information, where p is a positive integer and 1 ≤ p ≤ l − 1. This serves to extract features at different scales and push the network to learn local attribute region features.
In this embodiment, the step 33 specifically includes:
step 331, taking attribute characteristic information output by each target residual convolution block 1 in the backbone network as input of a corresponding multi-scale progressive perception model 2;
step 332, in the multi-scale progressive perception model 2, sending the attribute characteristic information into a dimension-reducing convolution layer for dimension-reducing operation; in this embodiment, the dimension-reducing convolution layer uses a 1×1 convolution kernel (conv), and the purpose of this step is to reduce the feature dimension to reduce the calculation amount.
Step 333, the dimension-reduced attribute feature information is fed into a plurality of branches, each of which applies a different convolution kernel to extract features at a different scale, yielding different first scale features;
in this embodiment, the branches are provided with 3, and 3 branches are respectively provided with 3 different convolution kernels, and respectively adopt: a 3x3 convolution kernel (left-most branch of fig. 2), a 5x5 convolution kernel (middle branch of fig. 2), and a 7x7 convolution kernel (right-most branch of fig. 2), wherein the 5x5 convolution kernel is constructed using two 3x3 convolution kernels and the 7x7 convolution kernel is constructed using three 3x3 convolution kernels. Two 3x3 convolution kernels are used for each of the 3 different convolution kernels in order to reduce the number of parameters to achieve light weight. The input features are respectively subjected to 3x3, 5x5 and 7x7 to extract features with different scales, so that regional features with different pedestrian attributes are covered, and the extracted features have the most characterization force.
Step 334, the first scale feature extracted from the convolution kernel in each branch is subjected to a full connection layer (FC layer) to adjust the dimension of the first scale feature, and then nonlinear processing (adding nonlinearity) is performed through an activation function (Sigmoid function) to obtain a second scale feature;
step 335, multiplying the second scale feature and the first scale feature in each branch as the output feature of the branch;
step 336, adding the output features of the multiple branches to obtain multi-scale feature information. The characterization of the local feature information is increased through the acquisition of the features with different scales, so that the robustness of the features is enriched.
Step 4, after the multi-scale characteristic information is processed by a global average pooling layer 3 (GAP), the multi-scale characteristic information is sent to a first attribute prediction layer 4 for attribute probability prediction;
and 5, performing progressive constraint on the first attribute prediction layer 4 corresponding to the previous multi-scale progressive perception model 2 by the first attribute prediction layer 4 corresponding to the next multi-scale progressive perception model 2. The purpose of this step is to further constrain the parametric training of the model.
In this embodiment, step 5 uses the L2 norm so that the first attribute prediction layer 4 corresponding to each later multi-scale progressive perception model 2 progressively constrains the first attribute prediction layer 4 corresponding to the preceding multi-scale progressive perception model 2. To prevent the parameters of the earlier convolution blocks from failing to be updated in time due to vanishing gradients during model training, the first attribute prediction layer 4 of the later multi-scale progressive perception model 2 is used to progressively constrain the first attribute prediction layer 4 of the preceding multi-scale progressive perception model 2, so that the parameter optimization direction of the preceding multi-scale progressive perception model 2 stays consistent with that of the later one. That is, the adjacent first attribute prediction layers 4 in fig. 1 are bound by the progressive constraint (L2 norm) shown in fig. 1, and training with this progressive constraint facilitates the parameter updates of the earlier residual convolution blocks 1.
In this embodiment, the step 5 further includes:
and 6, processing the attribute characteristic information output by the last residual convolution block 1 through the global average pooling layer 3, sending the processed attribute characteristic information into a second attribute prediction layer 5, and taking the output result of the second attribute prediction layer 5 as a final pedestrian attribute identification result.
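The inference path of step 6, which uses only the last block's features, global average pooling 3, and the second attribute prediction layer 5, can be sketched as follows; the feature shape and attribute count are illustrative assumptions. Because the multi-scale modules and first attribute prediction layers are bypassed here, the embedded models add no inference cost.

```python
import torch
import torch.nn as nn

classifier = nn.Linear(512, 26)            # second attribute prediction layer (M = 26 assumed)
feat = torch.randn(2, 512, 16, 8)          # attribute features from the last residual block
pooled = feat.mean(dim=(2, 3))             # global average pooling -> shape (2, 512)
probs = torch.sigmoid(classifier(pooled))  # per-attribute probabilities
preds = (probs > 0.5).int()                # final multi-label attribute decisions
```

Thresholding each sigmoid output independently yields the multi-label prediction vector y ∈ {0,1}^M described in step 11.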
As shown in fig. 3, an embodiment of the present invention further provides a computer readable storage medium, on which a computer program is stored, which when executed by a processor implements the above-mentioned multi-tag pedestrian attribute identification method based on multi-scale progressive perception.
In addition, each functional unit in each embodiment of the present invention may be integrated in one processing unit, each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium, including several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) or a processor to execute all or part of the steps of the methods of the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, or other various media capable of storing program code.
The foregoing description is only a partial embodiment of the present invention, and is not intended to limit the scope of the present invention, and all equivalent devices or equivalent processes using the descriptions and the drawings of the present invention or directly or indirectly applied to other related technical fields are included in the scope of the present invention.

Claims (10)

1. The multi-label pedestrian attribute identification method based on multi-scale progressive perception is characterized by comprising the following steps of:
step 1, inputting a pedestrian image into a backbone network, and performing feature extraction through a plurality of residual convolution blocks of the backbone network to obtain the attribute feature information output by each residual convolution block;
step 2, constructing a plurality of multi-scale progressive perception models, and training each multi-scale progressive perception model;
step 3, embedding a plurality of trained multi-scale progressive perception models into the backbone network, and inputting attribute characteristic information output by a target residual convolution block into the corresponding multi-scale progressive perception model to obtain multi-scale characteristic information;
step 4, after the multi-scale characteristic information is processed by the global average pooling layer, the multi-scale characteristic information is sent to the first attribute prediction layer to predict attribute probability;
step 5, the first attribute prediction layer corresponding to each subsequent multi-scale progressive perception model performs a progressive constraint on the first attribute prediction layer corresponding to the preceding multi-scale progressive perception model.
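The wiring recited in claim 1 — residual blocks feeding per-scale perception modules, each followed by pooling and an attribute prediction layer, with the deepest features producing the final result — can be sketched as follows. This is an illustrative skeleton only: the block, module, and layer internals are stand-in callables, and all function and parameter names here are hypothetical, not from the patent text.

```python
# Minimal sketch of the claim-1 pipeline, assuming a ResNet-style backbone
# of l sequential residual blocks. Internals are stubbed out as callables.

def run_pipeline(image, blocks, msp_modules, pred_layers, gap, final_pred):
    """blocks: list of residual-block callables (step 1);
    msp_modules: one multi-scale progressive perception module per
    target block (step 3); pred_layers: first attribute prediction
    layers (step 4); final_pred: prediction head on the deepest features."""
    feats = []
    x = image
    for block in blocks:                       # step 1: feature extraction
        x = block(x)
        feats.append(x)
    logits_per_scale = []
    for f, msp, pred in zip(feats[:-1], msp_modules, pred_layers):
        multi_scale = msp(f)                   # step 3: multi-scale features
        pooled = gap(multi_scale)              # step 4: global average pooling
        logits_per_scale.append(pred(pooled))  # first attribute prediction
    final = gap(feats[-1])                     # deepest features -> final head
    return logits_per_scale, final_pred(final)
```

In training, the per-scale predictions would additionally be tied together by the progressive constraint of step 5; at inference only the final head's output is kept.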
2. The multi-tag pedestrian attribute identification method based on multi-scale progressive sensing as set forth in claim 1, wherein the step 1 specifically includes:
step 11, the i-th pedestrian image x_i in the acquired pedestrian dataset D is taken as the input of the backbone network, and the pedestrian attribute label corresponding to the i-th pedestrian image x_i is defined as y_i ∈ {0,1}^M, wherein M represents the number of pedestrian attribute categories, 0 indicates that the pedestrian attribute is absent, and 1 indicates that the pedestrian attribute is present;
step 12, the backbone network comprises l residual convolution blocks which are sequentially connected, the pedestrian image is used as the input of the 1 st residual convolution block, and the output of the current residual convolution block is used as the input of the next residual convolution block;
step 13, the pedestrian image x_i undergoes feature extraction through the residual convolution blocks of the backbone network to obtain the corresponding attribute feature information, expressed as:

F_l = B_l(x | θ_1, …, θ_l) (1)

wherein F_l represents the attribute feature information output by the l-th residual convolution block; B_l represents the composition of the 1st through l-th residual convolution blocks in the backbone network, and θ_1, …, θ_l represent the training parameters of the 1st through l-th residual convolution blocks in the backbone network.
3. The multi-tag pedestrian attribute identification method based on multi-scale progressive perception according to claim 2, wherein the step 2 specifically includes:
step 21, constructing a plurality of multi-scale progressive perception models;
step 22, training each multi-scale progressive perception model by using a binary cross entropy as a loss function;
the expression of the loss function is:

L_bce = -(1/N) Σ_{i=1}^{N} Σ_{j=1}^{M} ω_j ( y_{i,j} log σ(p_{i,j}) + (1 − y_{i,j}) log(1 − σ(p_{i,j})) ) (2)

ω_j = y_{i,j} e^{1 − r_j} + (1 − y_{i,j}) e^{r_j} (3)

wherein L_bce represents the loss function, N represents the number of pedestrian images and M the number of pedestrian attributes, i indexes the pedestrian images and j indexes the pedestrian attributes; p_{i,j} represents the probability value predicted by the model for the j-th pedestrian attribute after the i-th pedestrian image x_i is sent into the multi-scale progressive perception model; y_{i,j} represents the j-th pedestrian attribute label value of the i-th pedestrian image, and ω_j represents the imbalance suppression factor; log represents the logarithmic function, σ represents the activation function, e represents the exponential, and r_j represents the positive-sample proportion of the j-th pedestrian attribute in the training set.
4. The multi-tag pedestrian attribute identification method based on multi-scale progressive perception of claim 3, wherein the step 3 specifically includes:
step 31, taking the 1st to the (l−1)-th residual convolution blocks in the backbone network as target residual convolution blocks, and setting the number of multi-scale progressive perception models according to the number of target residual convolution blocks;
step 32, embedding a plurality of trained multi-scale progressive perception models into the backbone network;
step 33, the output F_p of the p-th target residual convolution block is input into the p-th multi-scale progressive perception model to obtain the p-th scale feature information, wherein p is a positive integer with 1 ≤ p ≤ l−1.
5. The multi-tag pedestrian attribute identification method based on multi-scale progressive sensing of claim 4, wherein the step 33 specifically includes:
step 331, taking attribute characteristic information output by each target residual convolution block in the backbone network as input of a corresponding multi-scale progressive perception model;
step 332, in the multi-scale progressive perception model, sending the attribute characteristic information into a dimension-reducing convolution layer for dimension-reducing operation;
step 333, the dimension-reduced attribute feature information is fed into a plurality of branches, where different convolution kernels on the different branches extract features at different scales, yielding different first scale features;
step 334, the first scale feature extracted from the convolution kernel in each branch is subjected to a full connection layer to adjust the dimension of the first scale feature, and then nonlinear processing is performed through an activation function to obtain a second scale feature;
step 335, multiplying the second scale feature and the first scale feature in each branch as the output feature of the branch;
step 336, adding the output features of the multiple branches to obtain multi-scale feature information.
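Steps 331 through 336 describe a multi-branch module with a gated (squeeze-and-excitation-style) combination: reduce dimensions, run parallel branch convolutions, gate each branch's first scale feature through a fully connected layer and activation, multiply, and sum. A toy 1-D sketch of that wiring follows; the branch "convolutions" and the fully connected layer are stand-ins (elementwise scalings and dot products), since the claim fixes only the overall structure, and all names are illustrative.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def msp_module(feature, branch_convs, branch_fcs):
    """feature: toy 1-D feature vector; branch_convs: per-branch callables
    standing in for the 3x3/5x5/7x7 convolutions; branch_fcs: per-branch
    fully connected weight vectors used to produce the scalar gate."""
    reduced = [v * 0.5 for v in feature]           # step 332: 1x1 reduction stub
    outputs = []
    for conv, fc in zip(branch_convs, branch_fcs):
        first = conv(reduced)                      # step 333: first scale feature
        # step 334: FC adjusts the dimension, activation gives the gate
        gate = sigmoid(sum(w * v for w, v in zip(fc, first)))
        outputs.append([gate * v for v in first])  # step 335: multiply
    # step 336: element-wise sum over all branch outputs
    return [sum(vals) for vals in zip(*outputs)]
```

The multiply-then-sum pattern lets the module learn, per branch, how strongly each receptive-field scale should contribute to the fused multi-scale feature.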
6. The multi-tag pedestrian attribute identification method based on multi-scale progressive perception of claim 5 wherein the dimension-reduction convolution layer employs a 1x1 convolution kernel.
7. The multi-tag pedestrian attribute identification method based on multi-scale progressive sensing as claimed in claim 5, wherein 3 branches are provided, each of the 3 branches having a different convolution kernel: a 3x3 convolution kernel, a 5x5 convolution kernel, and a 7x7 convolution kernel, respectively.
8. The multi-tag pedestrian attribute recognition method based on multi-scale progressive sensing according to claim 1, wherein in the step 5, an L2 norm is used so that the first attribute prediction layer corresponding to each subsequent multi-scale progressive sensing model progressively constrains the first attribute prediction layer corresponding to the preceding multi-scale progressive sensing model.
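Claim 8's L2-norm progressive constraint can be sketched as an auxiliary regularization term. The sketch assumes the norm is taken over the difference between the prediction outputs (or, equivalently in spirit, the weights) of consecutive scales' first attribute prediction layers; that placement is an assumption, and the function names are illustrative.

```python
def l2_constraint(prev_pred, next_pred):
    """Squared L2 distance between the prediction vectors of two
    consecutive scales, added to the training loss as a penalty."""
    return sum((a - b) ** 2 for a, b in zip(prev_pred, next_pred))

def progressive_penalty(preds_per_scale):
    """Sum the constraint over consecutive scales (shallow -> deep), so
    each later prediction layer pulls the earlier one toward agreement."""
    return sum(l2_constraint(p, q)
               for p, q in zip(preds_per_scale, preds_per_scale[1:]))
```

During training this penalty would be scaled by a hyperparameter and added to the weighted cross-entropy loss, encouraging shallow-scale predictions to stay consistent with deeper, more discriminative ones.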
9. The multi-tag pedestrian attribute identification method based on multi-scale progressive perception of claim 1, further comprising, after step 5:
step 6, the attribute feature information output by the last residual convolution block is processed by the global average pooling layer and then sent to a second attribute prediction layer, and the output of the second attribute prediction layer is taken as the final pedestrian attribute recognition result.
10. A computer readable storage medium having stored thereon a computer program, which when executed by a processor implements the multi-tag pedestrian attribute identification method based on multi-scale progressive perception as claimed in any one of claims 1 to 9.
CN202310657643.2A 2023-06-05 2023-06-05 Multi-label pedestrian attribute identification method and medium based on multi-scale progressive perception Pending CN116682141A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310657643.2A CN116682141A (en) 2023-06-05 2023-06-05 Multi-label pedestrian attribute identification method and medium based on multi-scale progressive perception

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310657643.2A CN116682141A (en) 2023-06-05 2023-06-05 Multi-label pedestrian attribute identification method and medium based on multi-scale progressive perception

Publications (1)

Publication Number Publication Date
CN116682141A true CN116682141A (en) 2023-09-01

Family

ID=87781815

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310657643.2A Pending CN116682141A (en) 2023-06-05 2023-06-05 Multi-label pedestrian attribute identification method and medium based on multi-scale progressive perception

Country Status (1)

Country Link
CN (1) CN116682141A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118115729A (en) * 2024-04-26 2024-05-31 齐鲁工业大学(山东省科学院) Image fake region identification method and system with multi-level and multi-scale feature interaction


Similar Documents

Publication Publication Date Title
EP3327583B1 (en) Method and device for searching a target in an image
Xu et al. No-reference/blind image quality assessment: a survey
JP2017062781A (en) Similarity-based detection of prominent objects using deep cnn pooling layers as features
Chen et al. A localization/verification scheme for finding text in images and video frames based on contrast independent features and machine learning methods
Liu et al. Smooth filtering identification based on convolutional neural networks
Cheng et al. Sparse representations based attribute learning for flower classification
Xiao et al. Local phase quantization plus: A principled method for embedding local phase quantization into fisher vector for blurred image recognition
Jain et al. An efficient image forgery detection using biorthogonal wavelet transform and improved relevance vector machine
CN112749737A (en) Image classification method and device, electronic equipment and storage medium
CN116682141A (en) Multi-label pedestrian attribute identification method and medium based on multi-scale progressive perception
Defriani et al. Recognition of regional traditional house in Indonesia using Convolutional Neural Network (CNN) method
CN116310563A (en) Noble metal inventory management method and system
CN111814562A (en) Vehicle identification method, vehicle identification model training method and related device
Siddiqi Fruit-classification model resilience under adversarial attack
Yousaf et al. Patch-CNN: deep learning for logo detection and brand recognition
Sabeena et al. Convolutional block attention based network for copy-move image forgery detection
Paul et al. Dimensionality reduction of hyperspectral images: a data-driven approach for band selection
Chawla et al. Classification of computer generated images from photographic images using convolutional neural networks
CN116029760A (en) Message pushing method, device, computer equipment and storage medium
Turtinen et al. Contextual analysis of textured scene images.
CN112507912B (en) Method and device for identifying illegal pictures
CN112084371B (en) Movie multi-label classification method and device, electronic equipment and storage medium
CN116958615A (en) Picture identification method, device, equipment and medium
Cristin et al. Image forgery detection using supervised learning algorithm
Vijayan et al. Contextual background modeling using deep convolutional neural network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination