CN111488834A - Crowd counting method based on multi-level feature fusion - Google Patents

Crowd counting method based on multi-level feature fusion

Info

Publication number
CN111488834A
Authority
CN
China
Prior art keywords
crowd
feature
convolution
layer
density map
Prior art date
Legal status
Granted
Application number
CN202010284030.5A
Other languages
Chinese (zh)
Other versions
CN111488834B (en)
Inventor
霍占强
路斌
宋素玲
雒芬
乔应旭
Current Assignee
Henan University of Technology
Original Assignee
Henan University of Technology
Priority date
Filing date
Publication date
Application filed by Henan University of Technology
Priority to CN202010284030.5A
Publication of CN111488834A
Application granted
Publication of CN111488834B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/50 Context or environment of the image
    • G06V20/52 Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • G06V20/53 Recognition of crowd images, e.g. recognition of crowd congestion
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks


Abstract

The invention relates to a crowd counting method based on multi-level feature fusion, which comprises the following steps: preprocess the acquired crowd images and generate the corresponding crowd density maps from the annotation information; construct a multi-level feature fusion crowd counting network and initialize its weight parameters; input the preprocessed crowd images and density maps into the network and complete forward propagation; calculate the loss between the forward-propagation result and the true density map and update the model parameters; iterate forward propagation and parameter updating a specified number of times; and obtain the crowd density map to derive the estimated number of people. The method overcomes the problem of crowd scale variation in the crowd counting task and makes crowd counting more accurate.

Description

Crowd counting method based on multi-level feature fusion
Technical Field
The invention relates to the fields of image-based crowd counting and deep learning, and in particular to a crowd counting method based on deep learning.
Background
Crowd counting is an important problem in image processing and computer vision. Its aim is to automatically generate a crowd density map from crowd images and to estimate the number of people in the scene. Crowd counting is widely applied in traffic scheduling, security prevention and control, city management, and other fields.
Traditional crowd counting methods require complex preprocessing of crowd images and manually designed, hand-extracted human-body features, and the features must be re-extracted when crossing scenes, so their adaptability is poor. In recent years, the successful application of convolutional neural networks has brought a major breakthrough to the crowd counting task. Zhang et al. [1] proposed a convolutional neural network model suited to crowd counting that is trained end to end without foreground segmentation or hand-designed feature extraction; it obtains high-level features through multiple convolution layers and improves counting performance across scenes. However, crowd scales differ greatly between crowded scenes, and even within a single image the density and distribution of the crowd vary with distance from the camera, so this method is less accurate on scenes with large differences in crowd scale.
To address the problem of crowd scale variation, existing research has mainly focused on extracting features at several different scales to reduce the influence of scale change. Zhang et al. [2] proposed a multi-branch convolutional neural network in which each branch is composed of convolution kernels of a different size, so that features of different scales are extracted by the different branches. Cao et al. [3] proposed a scale-aware network that tackles scale variation with feature-extraction modules built from convolution kernels of different sizes. All of these methods handle crowd scale variation by extracting features of different scales with convolution kernels of different sizes. The scale of a crowd in an image varies continuously, however, while convolution kernels of different sizes can only extract crowd features at discrete scales and therefore ignore crowds at the other scales. The problem of crowd scale differences across scenes thus remains incompletely solved.
Reference documents:
1. C. Zhang, H. Li, X. Wang, and X. Yang. Cross-Scene Crowd Counting via Deep Convolutional Neural Networks [C]. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, 833-841.
2. Y. Zhang, D. Zhou, S. Chen, et al. Single-Image Crowd Counting via Multi-Column Convolutional Neural Network [C]. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, 589-597.
3. X. Cao, Z. Wang, Y. Zhao, and F. Su. Scale Aggregation Network for Accurate and Efficient Crowd Counting [C]. European Conference on Computer Vision, 2018, 734-750.
Disclosure of Invention
The invention provides a crowd counting method based on multi-level feature fusion, aiming to solve the prior-art problem of crowd scale differences across scenes. The method mainly comprises the following steps:
step S1: preprocessing the acquired crowd image, and generating a corresponding crowd density map by using the labeling information;
step S2: constructing a multi-level feature fused crowd counting network;
step S3: initializing a network weight parameter;
step S4: inputting the crowd images and crowd density maps preprocessed in step S1 into the network to complete forward propagation;
step S5: calculating the loss between the forward-propagation result of step S4 and the true density map, and updating the model parameters;
step S6: iterating steps S4, S5 a specified number of times;
step S7: and acquiring a crowd density map to obtain the estimated number of people.
Compared with current methods that handle crowd scale variation with multi-branch, multi-size convolution kernels, the invention provides a method based on multi-level feature fusion. In the VGG16 feature extractor contained in the network, the shallow output features carry the spatial and texture information of the crowd, while the high-level output features carry its semantic information: the shallow features describe where the crowd is located, and the high-level features supply the specific details of the crowd characteristics. By fusing the low-level and high-level features, the method effectively handles crowd scale variation and overcomes the drawback of multi-branch, multi-size-kernel methods, which can only extract crowd features at discrete scales. The proposed method is therefore more accurate than existing methods.
Drawings
Fig. 1 is a flowchart of a crowd counting method based on multi-level feature fusion according to the present invention.
Fig. 2 is a diagram of a crowd counting network structure based on multi-level feature fusion according to the present invention.
Fig. 3 is a structural diagram of a channel domain attention module of a crowd counting network based on multi-level feature fusion according to the present invention.
Detailed Description
Fig. 1 is a flowchart of the crowd counting method based on multi-level feature fusion according to the present invention. The method mainly comprises the following steps: preprocess the acquired crowd images and generate the corresponding crowd density maps from the annotation information; construct the multi-level feature fusion crowd counting network and initialize its weight parameters; input the preprocessed crowd images and density maps into the network and complete forward propagation; calculate the loss between the forward-propagation result and the true density map and update the model parameters; iterate forward propagation and parameter updating a specified number of times; and obtain the crowd density map to derive the estimated number of people. The implementation details of each step are as follows:
step S1: preprocessing the acquired crowd image, and generating a corresponding crowd density map by using the labeling information, wherein the specific mode is as follows:
step S11: the collected crowd image is subjected to centralization processing, specifically, the average value corresponding to the channel is subtracted from the elements on the three channels of the image R, G and B, and then the average value is divided by the standard deviation corresponding to the channel, wherein the average value corresponding to the three channels of R, G and B is (0.485,0.456,0.406), and the corresponding standard deviation is (0.229,0.224, 0.225).
Step S12: Generate a position matrix from the provided annotation information. Specifically, create an all-zero matrix with the same resolution as the corresponding image, then set to 1 the element at each coordinate given by the annotations.
Step S13: Randomly crop fixed-size image blocks and matrices from the centered crowd images and their position matrices; in the specific embodiment of the invention the crop size is 400 × 400.
Step S14: Generate the corresponding crowd density map by convolving the position matrix with a Gaussian kernel. Specifically, generate two one-dimensional Gaussian convolution kernels with μ = 15 and σ = 4, transpose one of them and multiply it with the other to obtain a two-dimensional Gaussian kernel, then convolve that kernel with the elements of value 1 in the position matrix to produce the crowd density map.
Step S15: Down-sample the density map generated in step S14 to 200 × 200 resolution. Specifically, convolve the density map with a 2 × 2 convolution kernel whose parameters are all 1, using a stride of 2.
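A minimal NumPy/SciPy sketch of steps S12, S14 and S15, assuming head annotations given as (x, y) pixel coordinates; scipy's gaussian_filter stands in for the explicit outer product of two one-dimensional kernels described in step S14, and the function names are illustrative:

import numpy as np
from scipy.ndimage import gaussian_filter

def make_density_map(points, height, width, sigma=4.0):
    # Step S12: all-zero position matrix, 1 at each annotated head coordinate.
    pos = np.zeros((height, width), dtype=np.float32)
    for x, y in points:
        pos[int(y), int(x)] = 1.0
    # Step S14: blur the position matrix with a Gaussian kernel (sigma = 4).
    return gaussian_filter(pos, sigma=sigma)

def downsample_2x(density):
    # Step S15: 2x2 all-ones kernel with stride 2 (sum pooling), e.g. 400x400 -> 200x200;
    # summing rather than averaging preserves the total head count.
    h, w = density.shape
    return density[: h - h % 2, : w - w % 2].reshape(h // 2, 2, w // 2, 2).sum(axis=(1, 3))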
Step S2: Construct the multi-level feature fusion crowd counting network, as shown in Fig. 2, as follows:
step S21: a VGG16 network was built that did not contain a full connectivity layer.
Step S22: Build the channel-domain attention module, as shown in Fig. 3. Specifically, build a global average pooling layer that pools the input feature X into a 1 × 1 × C feature, append two fully connected layers with C/4 and C neurons respectively, follow them with a Sigmoid activation layer, and multiply the activation output element-wise with the input feature X to obtain the output of the channel-domain attention module.
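A sketch of this module in PyTorch. The structure follows the text (pool, fully connected layers with C/4 and C neurons, Sigmoid, element-wise scaling); note the text states no activation between the two fully connected layers, unlike a standard squeeze-and-excitation block, and the class name is illustrative:

import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    # Step S22: global average pool -> FC(C/4) -> FC(C) -> Sigmoid -> reweight channels.
    def __init__(self, channels):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)          # N x C x H x W -> N x C x 1 x 1
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // 4),
            nn.Linear(channels // 4, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        n, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(n, c)).view(n, c, 1, 1)
        return x * w                                  # element-wise multiplication with X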
Step S23: Fuse the output features X50 and X40 of the fifth and fourth layers of the VGG16 network constructed in step S21. Specifically, apply an upsampling operation to the fifth-layer output X50 (the magnification factors of all upsampling layers in the invention are 2), concatenate the upsampled feature with the fourth-layer output X40 along the channel domain, feed the concatenated feature into a channel-domain attention module, and feed the module's output into a convolution block consisting of two 3 × 3 convolution layers with 256 channels to obtain the block's output feature X41.
Step S24: Fuse the output features X40 and X30 of the fourth and third layers of the VGG16 network with the feature X41 obtained in step S23. Specifically, upsample X40 and concatenate the result with X30 along the channel domain; feed the concatenated feature into a convolution block of two 3 × 3 convolution layers with 128 channels to obtain feature X31; upsample X41 to obtain feature X32; concatenate X31 and X32 along the channel domain, feed the result into a channel-domain attention module, and feed the module's output into a convolution block of two 3 × 3 convolution layers with 128 channels to obtain the block's output feature X33.
Step S25: Fuse the output features X30 and X20 of the third and second layers of the VGG16 network with the features X31 and X33 obtained in step S24. Specifically, upsample X30 and concatenate the result with X20 along the channel domain; feed the concatenated feature into a convolution block of two 3 × 3 convolution layers with 64 channels to obtain feature X21; upsample X31 to obtain feature X22; concatenate X21 and X22 along the channel domain and feed the result into a convolution block of two 3 × 3 convolution layers with 64 channels to obtain feature X23; upsample X33 to obtain feature X24; concatenate X23 and X24 along the channel domain, feed the result into a channel-domain attention module, feed the module's output into a convolution block consisting of two 3 × 3 convolution layers with 64 channels and one 3 × 3 convolution layer with 32 channels, and feed the block's output into a 1 × 1 convolution layer with 1 channel. This completes the construction of the multi-level feature fusion crowd counting network.
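Steps S23 to S25 repeat one pattern: upsample the deeper feature by a factor of 2, concatenate with the shallower feature along the channel axis, apply channel attention, and convolve. A hedged sketch of that pattern, reusing the ChannelAttention module from the step S22 sketch; the ReLU activations, the bilinear upsampling mode and the example channel sizes are assumptions not stated in the text, and the plain branches of steps S24 and S25 use the same building blocks without the attention module:

import torch
import torch.nn as nn

def conv_block(in_ch, out_ch, n_convs=2):
    # n_convs 3x3 convolution layers with out_ch channels, as in steps S23-S25.
    layers = []
    for i in range(n_convs):
        layers += [nn.Conv2d(in_ch if i == 0 else out_ch, out_ch, 3, padding=1),
                   nn.ReLU(inplace=True)]           # activation assumed, not in the text
    return nn.Sequential(*layers)

class FusionStage(nn.Module):
    # One fusion step: upsample the deep feature x2, splice with the shallow feature
    # on the channel domain, apply channel attention, then a convolution block.
    def __init__(self, deep_ch, shallow_ch, out_ch):
        super().__init__()
        self.up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)
        self.attn = ChannelAttention(deep_ch + shallow_ch)  # from the step S22 sketch
        self.conv = conv_block(deep_ch + shallow_ch, out_ch)

    def forward(self, deep, shallow):
        x = torch.cat([self.up(deep), shallow], dim=1)
        return self.conv(self.attn(x))

# Step S23, for example: X41 = FusionStage(512, 512, 256)(X50, X40)
# (512 channels for the VGG16 fourth/fifth layer outputs is an assumed layer split.)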
Step S3: Initialize the network weight parameters. Specifically, for the crowd counting network obtained in step S2, the feature extractor VGG16 is initialized with the ImageNet classification weights of VGG16 without the fully connected layers, and all other convolution layers and fully connected layers are initialized from a normal distribution with μ = 0 and σ = 0.01.
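A sketch of this initialization in PyTorch; the μ = 0, σ = 0.01 values are from the text, while the module names in the usage comment are hypothetical:

import torch.nn as nn

def init_weights(module):
    # Step S3: Normal(mu=0, sigma=0.01) initialization for the non-backbone layers.
    if isinstance(module, (nn.Conv2d, nn.Linear)):
        nn.init.normal_(module.weight, mean=0.0, std=0.01)
        if module.bias is not None:
            nn.init.zeros_(module.bias)

# Hypothetical usage: apply to everything except the pretrained VGG16 backbone.
# decoder.apply(init_weights)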
Step S4: Input the crowd images and crowd density maps preprocessed in step S1 into the network and complete forward propagation.
Step S5: Calculate the loss between the forward-propagation result of step S4 and the true density map fed to the network, and update the model parameters as follows:
Step S51: Calculate the mean square error loss L_MSE between the forward-propagation result and the true density map:
L_MSE = (1/N) Σ_{i=1}^{N} || D_i^est - D_i^gt ||²

where N is the number of input samples propagated forward at one time (N = 8 in the invention), D_i^est denotes the density map computed by forward propagation for the current i-th datum, and D_i^gt denotes the true density map of the current i-th datum.
Step S52: Use the loss L_MSE computed in step S51 to update the model parameters by stochastic gradient descent.
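Steps S4, S51 and S52 together correspond to one standard training step. A hedged PyTorch sketch with the batch size N = 8 from the text; model denotes the network built in step S2 (assumed in scope), the learning rate and momentum are assumptions, and nn.MSELoss averages over all elements while the patent's formula averages over the N samples:

import torch
import torch.nn as nn

criterion = nn.MSELoss()                      # L_MSE of step S51
optimizer = torch.optim.SGD(model.parameters(), lr=1e-6, momentum=0.9)  # values assumed

def train_step(images, gt_density):
    # images: N x 3 x 400 x 400 batch (N = 8); gt_density: N x 1 x 200 x 200.
    pred = model(images)                      # forward propagation (step S4)
    loss = criterion(pred, gt_density)        # compare with the true density maps (S51)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                          # stochastic gradient descent update (S52)
    return loss.item()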
Step S6: Iterate steps S4 and S5 a specified number of times; here the number of iterations is 50.
Step S7: Obtain the crowd density map to derive the estimated number of people. Specifically, the number of people contained in the crowd image is obtained by summing all pixels of the crowd density map computed by the model.
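Step S7 then reduces to a sum over the predicted density map; continuing the assumptions of the previous sketches:

import torch

with torch.no_grad():
    density = model(image_tensor.unsqueeze(0))    # 1 x 1 x H x W density map
    estimated_count = density.sum().item()        # estimated number of people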

Claims (1)

1. A crowd counting method based on multi-level feature fusion, characterized by comprising the following steps:
step S1: preprocessing the acquired crowd images and generating the corresponding crowd density maps from the annotation information, specifically:
step S11: centering the acquired crowd images, specifically, for each of the three image channels R, G and B, subtracting the channel mean and dividing by the channel standard deviation, the means for the R, G and B channels being (0.485, 0.456, 0.406) and the standard deviations being (0.229, 0.224, 0.225);
step S12: generating a position matrix from the provided annotation information, specifically, creating an all-zero matrix with the same resolution as the corresponding image and setting to 1 the element at each coordinate given by the annotations;
step S13: randomly cropping fixed-size image blocks and matrices from the centered crowd images and the corresponding position matrices, the crop size being 400 × 400 in the specific embodiment of the invention;
step S14: generating the corresponding crowd density map by convolving the position matrix with a Gaussian kernel, specifically, generating two one-dimensional Gaussian convolution kernels with μ = 15 and σ = 4, transposing one of them and multiplying it with the other to obtain a two-dimensional Gaussian kernel, and convolving that kernel with the elements of value 1 in the position matrix to produce the crowd density map;
step S15: down-sampling the density map generated in step S14 to 200 × 200 resolution, specifically, convolving the density map with a 2 × 2 convolution kernel whose parameters are all 1, using a stride of 2;
step S2: constructing the multi-level feature fusion crowd counting network, specifically:
step S21: building a VGG16 network without the fully connected layers;
step S22: building a channel-domain attention module, specifically, building a global average pooling layer that pools the input feature X into a 1 × 1 × C feature, appending two fully connected layers with C/4 and C neurons respectively, following them with a Sigmoid activation layer, and multiplying the activation output element-wise with the input feature X to obtain the output of the channel-domain attention module;
step S23: fusing the output features X50 and X40 of the fifth and fourth layers of the VGG16 network constructed in step S21, specifically, applying an upsampling operation to the fifth-layer output X50 (the magnification factors of all upsampling layers in the invention are 2), concatenating the upsampled feature with the fourth-layer output X40 along the channel domain, feeding the concatenated feature into a channel-domain attention module, and feeding the module's output into a convolution block consisting of two 3 × 3 convolution layers with 256 channels to obtain the block's output feature X41;
step S24: fusing the output features X40 and X30 of the fourth and third layers of the VGG16 network with the feature X41 obtained in step S23, specifically, upsampling X40 and concatenating the result with X30 along the channel domain, feeding the concatenated feature into a convolution block of two 3 × 3 convolution layers with 128 channels to obtain feature X31, upsampling X41 to obtain feature X32, concatenating X31 and X32 along the channel domain, feeding the result into a channel-domain attention module, and feeding the module's output into a convolution block of two 3 × 3 convolution layers with 128 channels to obtain the block's output feature X33;
step S25: fusing the output features X30 and X20 of the third and second layers of the VGG16 network with the features X31 and X33 obtained in step S24, specifically, upsampling X30 and concatenating the result with X20 along the channel domain, feeding the concatenated feature into a convolution block of two 3 × 3 convolution layers with 64 channels to obtain feature X21, upsampling X31 to obtain feature X22, concatenating X21 and X22 along the channel domain and feeding the result into a convolution block of two 3 × 3 convolution layers with 64 channels to obtain feature X23, upsampling X33 to obtain feature X24, concatenating X23 and X24 along the channel domain, feeding the result into a channel-domain attention module, feeding the module's output into a convolution block consisting of two 3 × 3 convolution layers with 64 channels and one 3 × 3 convolution layer with 32 channels, and feeding the block's output into a 1 × 1 convolution layer with 1 channel, thereby completing the construction of the multi-level feature fusion crowd counting network;
step S3: initializing the network weight parameters, specifically, for the crowd counting network obtained in step S2, initializing the feature extractor VGG16 with the ImageNet classification weights of VGG16 without the fully connected layers, and initializing all other convolution layers and fully connected layers from a normal distribution with μ = 0 and σ = 0.01;
step S4: inputting the crowd images and crowd density maps preprocessed in step S1 into the network to complete forward propagation;
step S5: calculating the loss between the forward-propagation result of step S4 and the true density map fed to the network, and updating the model parameters, specifically:
step S51: calculating the mean square error loss L_MSE between the forward-propagation result and the true density map:
L_MSE = (1/N) Σ_{i=1}^{N} || D_i^est - D_i^gt ||²

where N is the number of input samples propagated forward at one time (N = 8 in the invention), D_i^est denotes the density map computed by forward propagation for the current i-th datum, and D_i^gt denotes the true density map of the current i-th datum;
step S52: using the loss L_MSE computed in step S51 to update the model parameters by stochastic gradient descent;
step S6: iterating steps S4 and S5 a specified number of times, the number of iterations being 50;
step S7: obtaining the crowd density map to derive the estimated number of people, specifically, summing all pixels of the crowd density map computed by the model to obtain the number of people contained in the crowd image.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010284030.5A 2020-04-13 2020-04-13 Crowd counting method based on multi-level feature fusion


Publications (2)

Publication Number Publication Date
CN111488834A true CN111488834A (en) 2020-08-04
CN111488834B CN111488834B (en) 2023-07-04

Family

ID=71792806

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010284030.5A Active CN111488834B (en) 2020-04-13 2020-04-13 Crowd counting method based on multi-level feature fusion

Country Status (1)

Country Link
CN (1) CN111488834B (en)



Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107301387A (en) * 2017-06-16 2017-10-27 华南理工大学 A kind of image Dense crowd method of counting based on deep learning
CN109271960A (en) * 2018-10-08 2019-01-25 燕山大学 A kind of demographic method based on convolutional neural networks
CN109598220A (en) * 2018-11-26 2019-04-09 山东大学 A kind of demographic method based on the polynary multiple dimensioned convolution of input
CN109903339A (en) * 2019-03-26 2019-06-18 南京邮电大学 A kind of video group personage's position finding and detection method based on multidimensional fusion feature
CN110705344A (en) * 2019-08-21 2020-01-17 中山大学 Crowd counting model based on deep learning and implementation method thereof

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112801340A (en) * 2020-12-16 2021-05-14 北京交通大学 Crowd density prediction method based on multilevel city information unit portrait
CN112801340B (en) * 2020-12-16 2024-04-26 北京交通大学 Crowd density prediction method based on multi-level city information unit portraits

Also Published As

Publication number Publication date
CN111488834B (en) 2023-07-04

Similar Documents

Publication Publication Date Title
CN112541503B (en) Real-time semantic segmentation method based on context attention mechanism and information fusion
CN110210551B (en) Visual target tracking method based on adaptive subject sensitivity
CN107704857A (en) A kind of lightweight licence plate recognition method and device end to end
CN113344806A (en) Image defogging method and system based on global feature fusion attention network
CN111815665B (en) Single image crowd counting method based on depth information and scale perception information
CN112396607A (en) Streetscape image semantic segmentation method for deformable convolution fusion enhancement
CN113449735B (en) Semantic segmentation method and device for super-pixel segmentation
CN107506792B (en) Semi-supervised salient object detection method
CN111640116B (en) Aerial photography graph building segmentation method and device based on deep convolutional residual error network
CN105243154A (en) Remote sensing image retrieval method and system based on significant point characteristics and spare self-encodings
CN108921850B (en) Image local feature extraction method based on image segmentation technology
CN114048822A (en) Attention mechanism feature fusion segmentation method for image
CN111833360B (en) Image processing method, device, equipment and computer readable storage medium
CN112348870A (en) Significance target detection method based on residual error fusion
CN113269224A (en) Scene image classification method, system and storage medium
CN112967218A (en) Multi-scale image restoration system based on wire frame and edge structure
CN116797787B (en) Remote sensing image semantic segmentation method based on cross-modal fusion and graph neural network
CN116258757A (en) Monocular image depth estimation method based on multi-scale cross attention
CN111488834B (en) Crowd counting method based on multi-level feature fusion
CN115049945A (en) Method and device for extracting lodging area of wheat based on unmanned aerial vehicle image
CN117726954A (en) Sea-land segmentation method and system for remote sensing image
CN111275076B (en) Image significance detection method based on feature selection and feature fusion
CN113553949A (en) Tailing pond semantic segmentation method based on photogrammetric data
CN112560719A (en) High-resolution image water body extraction method based on multi-scale convolution-multi-core pooling
CN116543165A (en) Remote sensing image fruit tree segmentation method based on dual-channel composite depth network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant