CN111488834B - Crowd counting method based on multi-level feature fusion - Google Patents

Crowd counting method based on multi-level feature fusion

Info

Publication number
CN111488834B
CN111488834B
Authority
CN
China
Prior art keywords
crowd
convolution
layer
feature
density map
Prior art date
Legal status
Active
Application number
CN202010284030.5A
Other languages
Chinese (zh)
Other versions
CN111488834A (en)
Inventor
霍占强
路斌
宋素玲
雒芬
乔应旭
Current Assignee
Henan University of Technology
Original Assignee
Henan University of Technology
Priority date
Filing date
Publication date
Application filed by Henan University of Technology
Priority to CN202010284030.5A
Publication of CN111488834A
Application granted
Publication of CN111488834B

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/50 - Context or environment of the image
    • G06V20/52 - Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • G06V20/53 - Recognition of crowd images, e.g. recognition of crowd congestion
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/25 - Fusion techniques
    • G06F18/253 - Fusion techniques of extracted features
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a crowd counting method based on multi-level feature fusion, which comprises the following steps: preprocessing the obtained crowd images and generating corresponding crowd density maps from the labeling information; constructing a crowd counting network with multi-level feature fusion; initializing the network weight parameters; inputting the preprocessed crowd images and crowd density maps into the network to complete forward propagation; calculating the loss between the forward propagation result and the true density map and updating the model parameters; iterating forward propagation and parameter updating for the specified number of times; and obtaining the crowd density map and the estimated number of people. The method provided by the invention can overcome the problem of crowd scale variation in the crowd counting task and makes crowd counting more accurate.

Description

Crowd counting method based on multi-level feature fusion
Technical Field
The invention relates to the field of image crowd counting and deep learning, in particular to a crowd counting method based on deep learning.
Background
Crowd counting is an important problem in the fields of image processing and computer vision. Its goal is to automatically generate a crowd density map from a crowd image and to estimate the number of people in the scene. Crowd counting is widely applied in traffic scheduling, safety prevention and control, urban management, and other fields.
Traditional crowd counting methods require complex preprocessing of crowd images and manually designed, hand-extracted human-body features; the features must be re-extracted when the scene changes, so adaptability is poor. In recent years, the successful application of convolutional neural networks has brought a significant breakthrough to the crowd counting task. Zhang et al. [1] proposed a convolutional neural network model suitable for crowd counting that is trained end to end, without foreground segmentation or hand-crafted feature extraction, and that obtains high-level features after multiple convolution layers, improving cross-scene counting performance. However, crowd scale differs greatly between crowded scenes, and within a single image the density and distribution of the crowd vary because people stand at different distances from the camera, so the accuracy of this method is low in scenes with large differences in crowd scale.
To address crowd scale variation, existing research has focused mainly on extracting features at several different scales to reduce the influence of scale change. Zhang et al. [2] proposed a multi-branch convolutional neural network in which each branch consists of convolution kernels of a different size, so that the different branches extract features at different scales. Cao et al. [3] proposed a scale-aware network that addresses scale variation with feature extraction modules composed of convolution kernels of different sizes. These methods handle crowd scale variation by extracting features at different scales with convolution kernels of different sizes. However, the variation of crowd scale within an image is continuous, whereas convolution kernels of different sizes can only extract crowd features at a few discrete scales and ignore crowds at the remaining scales. The problem of scale differences between crowds in different scenes is therefore not completely solved.
References:
1. C. Zhang, H. Li, X. Wang, and X. Yang. Cross-Scene Crowd Counting via Deep Convolutional Neural Networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, 833-841.
2. Y. Zhang, D. Zhou, S. Chen, et al. Single-Image Crowd Counting via Multi-Column Convolutional Neural Network. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, 589-597.
3. X. Cao, Z. Wang, Y. Zhao, and F. Su. Scale Aggregation Network for Accurate and Efficient Crowd Counting. European Conference on Computer Vision, 2018, 734-750.
Disclosure of the Invention
To solve the problem of crowd scale differences across different scenes in the prior art, the invention provides a crowd counting method based on multi-level feature fusion. It mainly comprises the following steps:
step S1: preprocessing the obtained crowd images, and generating corresponding crowd density maps by using labeling information;
step S2: constructing a crowd counting network with multi-level feature fusion;
step S3: initializing network weight parameters;
step S4: inputting the crowd images and crowd density maps preprocessed in step S1 into the network to complete forward propagation;
step S5: calculating the loss between the forward propagation result of step S4 and the true density map, and updating the model parameters;
step S6: iterating steps S4 and S5 for the specified number of times;
step S7: obtaining the crowd density map and the estimated number of people.
Compared with current methods that address crowd scale variation with multi-branch networks and convolution kernels of multiple sizes, the invention provides a method based on multi-level feature fusion. In the VGG16 feature extractor contained in the network, the shallow output features carry the spatial and texture information of the crowd, while the high-level output features carry its semantic information: shallow features describe where the crowd is located, and high-level features provide the specific details of the crowd. By fusing the low-level features with the high-level features, the method effectively handles crowd scale variation and overcomes the drawback of multi-branch, multi-size-kernel methods, which can only extract crowd features at discrete scales. Compared with existing methods, the proposed method is therefore more accurate.
Drawings
Fig. 1 is a flow chart of a crowd counting method based on multi-level feature fusion according to the invention.
Fig. 2 is a diagram of a crowd counting network based on multi-level feature fusion according to the present invention.
Fig. 3 is a block diagram of a channel domain attention module of a crowd counting network based on multi-level feature fusion according to the invention.
Detailed Description
Fig. 1 is a flowchart of the crowd counting method based on multi-level feature fusion according to the invention. The method mainly comprises the following steps: preprocessing the obtained crowd images and generating corresponding crowd density maps from the labeling information; constructing a crowd counting network with multi-level feature fusion; initializing the network weight parameters; inputting the preprocessed crowd images and crowd density maps into the network to complete forward propagation; calculating the loss between the forward propagation result and the true density map and updating the model parameters; iterating forward propagation and parameter updating for the specified number of times; and obtaining the crowd density map and the estimated number of people. The implementation details of each step are as follows:
step S1: preprocessing the obtained crowd images, and generating corresponding crowd density maps by using labeling information, wherein the specific mode is as follows:
step S11: the collected crowd images are subjected to centering treatment in a specific mode that the average value corresponding to the channels is subtracted from elements on three channels R, G and B of the images, and then the elements are divided by the standard deviation corresponding to the channels, wherein the average value corresponding to the channels R, G and B is (0.485,0.456,0.406), and the corresponding standard deviation is (0.229,0.224,0.225).
Step S12: A position matrix is generated from the provided labeling information. Specifically, a matrix of zeros with the same resolution as the corresponding image is created, and the element at the position given by each coordinate in the labeling information is set to 1.
Step S13: The centered crowd image and the corresponding position matrix are randomly cropped into image blocks and matrices of fixed size; in the specific embodiment of the invention, the crop size is 400 × 400.
Step S14: The corresponding crowd density map is generated by convolving the position matrix with a Gaussian kernel. Specifically, two one-dimensional Gaussian kernels with μ = 15 and σ = 4 are generated; one of them is transposed and multiplied with the other to obtain a two-dimensional Gaussian convolution kernel, which is then convolved with the elements of value 1 in the position matrix to generate the crowd density map.
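A sketch of steps S12 and S14, under the assumptions that μ = 15 denotes the length of the one-dimensional Gaussian kernel, that the annotations are (x, y) head coordinates, and that the kernel is normalized to sum to 1 (the patent does not state the normalization); the function name is illustrative:

```python
import numpy as np
from scipy.signal import convolve2d

def density_map(points, height, width, ksize=15, sigma=4.0):
    """points: iterable of (x, y) head coordinates from the annotation file."""
    # Step S12: position matrix with a 1 at every annotated head position.
    pos = np.zeros((height, width), dtype=np.float32)
    for x, y in points:
        if 0 <= int(y) < height and 0 <= int(x) < width:
            pos[int(y), int(x)] = 1.0
    # Step S14: 2-D Gaussian kernel as the product of a 1-D kernel and its transpose.
    ax = np.arange(ksize) - (ksize - 1) / 2.0
    g1d = np.exp(-(ax ** 2) / (2 * sigma ** 2))
    g1d /= g1d.sum()                 # normalization is an assumption; it makes each
    g2d = np.outer(g1d, g1d)         # person contribute exactly 1 to the map's integral
    # Convolve the position matrix with the 2-D Gaussian kernel.
    return convolve2d(pos, g2d, mode="same")
```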
Step S15: The density map generated in step S14 is downsampled to a resolution of 200 × 200. Specifically, the density map is convolved with a 2 × 2 kernel whose parameters are all 1, using a stride of 2.
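A sketch of this downsampling step; a 2 × 2 all-ones kernel with stride 2 sums each 2 × 2 block, so halving the resolution preserves the total count carried by the map (the function name is illustrative):

```python
import torch
import torch.nn.functional as F

def downsample_density(density):
    """density: (H, W) array or tensor; returns an (H/2, W/2) tensor."""
    d = torch.as_tensor(density, dtype=torch.float32).reshape(1, 1, *density.shape)
    kernel = torch.ones(1, 1, 2, 2)              # 2x2 kernel with all parameters equal to 1
    return F.conv2d(d, kernel, stride=2)[0, 0]   # stride-2 convolution (step S15)
```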
Step S2: the crowd counting network with multi-level feature fusion is constructed, as shown in fig. 2, in the following specific manner:
step S21: a VGG16 network is built that does not contain a fully connected layer.
Step S22: The channel-domain attention module is built, as shown in fig. 3. Specifically, a global average pooling layer is built that pools the input feature X into a 1 × 1 × C feature; two fully connected layers with C/4 and C neurons, respectively, are added after the pooling layer; a Sigmoid activation layer is built after the two fully connected layers; and the output of the activation layer is multiplied element-wise with the input feature X to obtain the output of the channel-domain attention module.
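A sketch of the channel-domain attention module in PyTorch; the ReLU between the two fully connected layers is an assumption, since the patent only specifies the pooling layer, the two fully connected layers of widths C/4 and C, and the Sigmoid:

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Channel-domain attention module of step S22 (an SE-style block)."""
    def __init__(self, channels):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)          # pools X into a 1 x 1 x C feature
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // 4),
            nn.ReLU(inplace=True),                   # activation between the FC layers is an assumption
            nn.Linear(channels // 4, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):                            # x: (N, C, H, W)
        n, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(n, c)).view(n, c, 1, 1)
        return x * w                                 # element-wise multiplication with the input X
```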
Step S23: The output features X50 and X40 of the fifth and fourth layers of the VGG16 network constructed in step S21 are fused. Specifically, the fifth-layer output feature X50 is upsampled (the magnification factor of the upsampling layer is 2), the upsampled feature and the fourth-layer output feature X40 are concatenated along the channel dimension, the concatenated feature is input into a channel-domain attention module, and the output of the attention module is input into a convolution block consisting of two 3 × 3 convolution layers with 256 channels, giving the output feature X41 of the convolution block.
Step S24: The output features X40 and X30 of the fourth and third layers of the VGG16 network constructed in step S21 and the feature X41 obtained in step S23 are fused. Specifically, feature X40 is upsampled and the result is concatenated with feature X30 along the channel dimension; the concatenated feature is input into a convolution block consisting of two 3 × 3 convolution layers with 128 channels, giving feature X31. Feature X41 is upsampled to obtain feature X32; features X31 and X32 are concatenated along the channel dimension, the concatenated feature is input into a channel-domain attention module, and the output of the attention module is input into a convolution block consisting of two 3 × 3 convolution layers with 128 channels, giving the output feature X33 of the convolution block.
Step S25: The output features X30 and X20 of the third and second layers of the VGG16 network constructed in step S21 and the features X31 and X33 obtained in step S24 are fused. Specifically, feature X30 is upsampled and the result is concatenated with feature X20 along the channel dimension; the concatenated feature is input into a convolution block consisting of two 3 × 3 convolution layers with 64 channels, giving feature X21. Feature X31 is upsampled to obtain feature X22; features X21 and X22 are concatenated along the channel dimension and input into a convolution block consisting of two 3 × 3 convolution layers with 64 channels, giving the output feature X23. Feature X33 is upsampled to obtain feature X24; features X23 and X24 are concatenated along the channel dimension, the concatenated feature is input into a channel-domain attention module, the output of the attention module is input into a convolution block consisting of two 3 × 3 convolution layers with 64 channels and one 3 × 3 convolution layer with 32 channels, and the output of that convolution block is input into a 1 × 1 convolution layer, completing the construction of the crowd counting network with multi-level feature fusion.
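A sketch of the whole network of steps S21 to S25 in PyTorch is given below. The mapping of the "second to fifth layer outputs" X20, X30, X40, X50 onto torchvision's VGG16 stages (the outputs of conv2_2, conv3_3, conv4_3, and conv5_3), the bilinear upsampling mode, and the ReLU activations inside the convolution blocks are assumptions; the channel counts follow from the concatenations described above, and the compact ChannelAttention class repeats the module sketched after step S22 so that the block is self-contained:

```python
import torch
import torch.nn as nn
from torchvision import models

class ChannelAttention(nn.Module):            # same SE-style module as in step S22
    def __init__(self, c):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(nn.Linear(c, c // 4), nn.ReLU(inplace=True),
                                nn.Linear(c // 4, c), nn.Sigmoid())

    def forward(self, x):
        n, c, _, _ = x.shape
        return x * self.fc(self.pool(x).view(n, c)).view(n, c, 1, 1)

def conv_block(in_ch, out_ch, extra_32=False):
    """Two 3x3 convolutions with out_ch channels (plus an optional 3x3/32 layer, step S25)."""
    layers = [nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
              nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True)]
    if extra_32:
        layers += [nn.Conv2d(out_ch, 32, 3, padding=1), nn.ReLU(inplace=True)]
    return nn.Sequential(*layers)

class MultiLevelFusionNet(nn.Module):
    def __init__(self):
        super().__init__()
        vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1).features
        # Assumed mapping of the "layer 2-5 outputs" onto torchvision VGG16 stages.
        self.stage2 = vgg[:9]     # X20: 128 channels, 1/2 resolution
        self.stage3 = vgg[9:16]   # X30: 256 channels, 1/4 resolution
        self.stage4 = vgg[16:23]  # X40: 512 channels, 1/8 resolution
        self.stage5 = vgg[23:30]  # X50: 512 channels, 1/16 resolution
        self.up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)
        self.att54 = ChannelAttention(1024)                 # step S23: 512 + 512 channels
        self.conv41 = conv_block(1024, 256)
        self.conv31 = conv_block(768, 128)                  # step S24: 512 + 256 channels
        self.att33 = ChannelAttention(384)                  # 128 + 256 channels
        self.conv33 = conv_block(384, 128)
        self.conv21 = conv_block(384, 64)                   # step S25: 256 + 128 channels
        self.conv23 = conv_block(192, 64)                   # 64 + 128 channels
        self.att_out = ChannelAttention(192)                # 64 + 128 channels
        self.conv_out = conv_block(192, 64, extra_32=True)
        self.head = nn.Conv2d(32, 1, 1)                     # final 1x1 convolution

    def forward(self, x):                                   # x: (N, 3, 400, 400)
        x20 = self.stage2(x)
        x30 = self.stage3(x20)
        x40 = self.stage4(x30)
        x50 = self.stage5(x40)
        # Step S23: fuse X50 and X40.
        x41 = self.conv41(self.att54(torch.cat([self.up(x50), x40], dim=1)))
        # Step S24: fuse X40, X30 and X41.
        x31 = self.conv31(torch.cat([self.up(x40), x30], dim=1))
        x33 = self.conv33(self.att33(torch.cat([x31, self.up(x41)], dim=1)))
        # Step S25: fuse X30, X20, X31 and X33, then predict the density map.
        x21 = self.conv21(torch.cat([self.up(x30), x20], dim=1))
        x23 = self.conv23(torch.cat([x21, self.up(x31)], dim=1))
        out = self.conv_out(self.att_out(torch.cat([x23, self.up(x33)], dim=1)))
        return self.head(out)                               # (N, 1, 200, 200)
```

Under this assumed mapping, a 400 × 400 input yields X20 to X50 at resolutions 200, 100, 50, and 25, so every ×2 upsampling matches the next feature map and the predicted density map comes out at 200 × 200, the same size as the ground truth produced in step S15.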
Step S3: The network weight parameters are initialized. Specifically, for the crowd counting network obtained in step S2, the feature extractor VGG16 is initialized with the ImageNet classification weights of VGG16 without the fully connected layers, and the other convolution layers and the fully connected layers are initialized from a normal distribution with μ = 0 and σ = 0.01.
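A sketch of this initialization for the network sketched above; the "stage" attribute prefix and the zero bias initialization are assumptions, and the VGG16 stages already carry their ImageNet weights because they were built from the pretrained model:

```python
import torch.nn as nn

def init_non_backbone(module):
    """Normal-distribution initialization (mu = 0, sigma = 0.01) of step S3."""
    if isinstance(module, (nn.Conv2d, nn.Linear)):
        nn.init.normal_(module.weight, mean=0.0, std=0.01)
        if module.bias is not None:
            nn.init.zeros_(module.bias)       # bias handling is an assumption

# model = MultiLevelFusionNet()               # hypothetical network from the sketch above
# for name, m in model.named_modules():
#     if not name.startswith("stage"):        # keep the pretrained VGG16 stages untouched
#         init_non_backbone(m)
```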
Step S4: The crowd images and crowd density maps preprocessed in step S1 are input into the network to complete forward propagation.
Step S5: The loss between the forward propagation result of step S4 and the true density map input to the network is calculated, and the model parameters are updated, in the following specific manner:
step S51, calculating the mean square error loss L of the forward propagation result and the true density map MSE The specific mode is as follows:
Figure BDA0002447808570000061
where N represents the number of samples of the input data that are propagated forward once, n=8 in the present invention,
Figure BDA0002447808570000062
density map representing the current ith data forward propagation calculation,/for>
Figure BDA0002447808570000063
Representing the true density map of the current ith data.
Step S52: The model parameters are updated from the loss L_MSE calculated in step S51 using stochastic gradient descent.
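A sketch of one training iteration covering steps S4, S51, and S52 with batch size N = 8. nn.MSELoss averages over all pixels rather than summing per-sample squared norms, which differs from the formula above only by a constant factor; the learning rate and momentum are illustrative, since the patent does not state them:

```python
import torch
import torch.nn as nn

def train_step(model, images, gt_density, optimizer, criterion):
    """images: (8, 3, 400, 400); gt_density: (8, 1, 200, 200) ground-truth maps."""
    pred = model(images)                 # step S4: forward propagation
    loss = criterion(pred, gt_density)   # step S51: mean squared error loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                     # step S52: stochastic gradient descent update
    return loss.item()

# criterion = nn.MSELoss()
# optimizer = torch.optim.SGD(model.parameters(), lr=1e-6, momentum=0.9)
```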
Step S6: Steps S4 and S5 are iterated for the specified number of times; in the specific embodiment, the number of iterations is 50.
Step S7: The crowd density map is obtained and the estimated number of people is computed. Specifically, all pixels of the crowd density map computed by the model are summed to obtain the number of people contained in the crowd image.
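A sketch of step S7; because each annotated person contributes (approximately) one unit to the integral of the density map, summing all pixels of the predicted map gives the estimated head count (the function name is illustrative):

```python
import torch

def estimate_count(model, image):
    """image: a preprocessed (3, H, W) tensor; returns the estimated number of people."""
    model.eval()
    with torch.no_grad():
        density = model(image.unsqueeze(0))   # (1, 1, H/2, W/2) predicted density map
    return float(density.sum())               # step S7: sum of all density-map pixels
```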
Compared with current methods that address crowd scale variation with multi-branch networks and convolution kernels of multiple sizes, the invention provides a method based on multi-level feature fusion. In the VGG16 feature extractor contained in the network, the shallow output features carry the spatial and texture information of the crowd, while the high-level output features carry its semantic information: shallow features describe where the crowd is located, and high-level features provide the specific details of the crowd. By fusing the low-level features with the high-level features, the method effectively handles crowd scale variation and overcomes the drawback of multi-branch, multi-size-kernel methods, which can only extract crowd features at discrete scales. Compared with existing methods, the proposed method is therefore more accurate.

Claims (1)

1. A crowd counting method based on multi-level feature fusion, characterized by comprising the following steps:
step S1: preprocessing the obtained crowd images, and generating corresponding crowd density maps by using labeling information, wherein the specific mode is as follows:
step S11: centering the collected crowd images; specifically, the elements on the three channels R, G, and B of the image are reduced by the mean of the corresponding channel and then divided by the standard deviation of the corresponding channel, the means of channels R, G, and B being (0.485, 0.456, 0.406) and the corresponding standard deviations being (0.229, 0.224, 0.225);
step S12: generating a position matrix from the provided labeling information, by creating a matrix of zeros with the same resolution as the corresponding image and setting the element at the position given by each coordinate in the labeling information to 1;
step S13: randomly cropping the centered crowd image and the corresponding position matrix into image blocks and matrices of fixed size, the crop size being 400 × 400 in a specific embodiment;
step S14: generating the corresponding crowd density map by convolving the position matrix with a Gaussian kernel; specifically, two one-dimensional Gaussian kernels with μ = 15 and σ = 4 are generated, one of them is transposed and multiplied with the other to obtain a two-dimensional Gaussian convolution kernel, and this two-dimensional kernel is convolved with the elements of value 1 in the position matrix to generate the crowd density map;
step S15: downsampling the density map generated in step S14 to a resolution of 200 × 200; specifically, the density map is convolved with a 2 × 2 kernel whose parameters are all 1, using a stride of 2;
step S2: the crowd counting network with multi-level feature fusion is constructed in the following specific mode:
step S21: building a VGG16 network that does not contain fully connected layers;
step S22: building the channel-domain attention module; specifically, a global average pooling layer is built that pools the input feature X into a 1 × 1 × C feature, two fully connected layers with C/4 and C neurons, respectively, are added after the pooling layer, a Sigmoid activation layer is built after the two fully connected layers, and the output of the activation layer is multiplied element-wise with the input feature X to obtain the output of the channel-domain attention module;
step S23: fusing the output features X50 and X40 of the fifth and fourth layers of the VGG16 network constructed in step S21; specifically, the fifth-layer output feature X50 is upsampled (the magnification factor of the upsampling layer is 2), the upsampled feature and the fourth-layer output feature X40 are concatenated along the channel dimension, the concatenated feature is input into a channel-domain attention module, and the output of the attention module is input into a convolution block consisting of two 3 × 3 convolution layers with 256 channels, giving the output feature X41 of the convolution block;
step S24: fusing the output features X40 and X30 of the fourth and third layers of the VGG16 network constructed in step S21 with the feature X41 obtained in step S23; specifically, feature X40 is upsampled and the result is concatenated with feature X30 along the channel dimension, and the concatenated feature is input into a convolution block consisting of two 3 × 3 convolution layers with 128 channels, giving feature X31; feature X41 is upsampled to obtain feature X32; features X31 and X32 are concatenated along the channel dimension, the concatenated feature is input into a channel-domain attention module, and the output of the attention module is input into a convolution block consisting of two 3 × 3 convolution layers with 128 channels, giving the output feature X33 of the convolution block;
step S25: fusing the output features X30 and X20 of the third and second layers of the VGG16 network constructed in step S21 with the features X31 and X33 obtained in step S24; specifically, feature X30 is upsampled and the result is concatenated with feature X20 along the channel dimension, and the concatenated feature is input into a convolution block consisting of two 3 × 3 convolution layers with 64 channels, giving feature X21; feature X31 is upsampled to obtain feature X22; features X21 and X22 are concatenated along the channel dimension and input into a convolution block consisting of two 3 × 3 convolution layers with 64 channels, giving the output feature X23; feature X33 is upsampled to obtain feature X24; features X23 and X24 are concatenated along the channel dimension, the concatenated feature is input into a channel-domain attention module, the output of the attention module is input into a convolution block consisting of two 3 × 3 convolution layers with 64 channels and one 3 × 3 convolution layer with 32 channels, and the output of that convolution block is input into a 1 × 1 convolution layer, completing the construction of the crowd counting network with multi-level feature fusion;
step S3: initializing the network weight parameters; specifically, for the crowd counting network obtained in step S2, the feature extractor VGG16 is initialized with the ImageNet classification weights of VGG16 without the fully connected layers, and the other convolution layers and fully connected layers are initialized from a normal distribution with μ = 0 and σ = 0.01;
step S4: inputting the crowd images and crowd density maps preprocessed in step S1 into the network to complete forward propagation;
step S5: calculating the loss between the forward propagation result of step S4 and the true density map input to the network, and updating the model parameters, in the following specific manner:
step S51: calculating the mean squared error loss L_MSE between the forward propagation result and the true density map as

L_{MSE} = \frac{1}{2N} \sum_{i=1}^{N} \left\| \hat{D}_i - D_i \right\|_2^2

where N is the number of samples in one forward pass, N = 8, \hat{D}_i denotes the density map computed by the forward propagation for the i-th sample, and D_i denotes the true density map of the i-th sample;
step S52: updating the model parameters from the loss L_MSE calculated in step S51 using stochastic gradient descent;
step S6: iterating steps S4 and S5 for the specified number of times, the number of iterations being 50;
step S7: obtaining the crowd density map and the estimated number of people; specifically, all pixels of the crowd density map computed by the model are summed to obtain the number of people contained in the crowd image.
CN202010284030.5A 2020-04-13 2020-04-13 Crowd counting method based on multi-level feature fusion Active CN111488834B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010284030.5A CN111488834B (en) 2020-04-13 2020-04-13 Crowd counting method based on multi-level feature fusion

Publications (2)

Publication Number Publication Date
CN111488834A CN111488834A (en) 2020-08-04
CN111488834B (en) 2023-07-04

Family

ID=71792806

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010284030.5A Active CN111488834B (en) 2020-04-13 2020-04-13 Crowd counting method based on multi-level feature fusion

Country Status (1)

Country Link
CN (1) CN111488834B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112801340B (en) * 2020-12-16 2024-04-26 北京交通大学 Crowd density prediction method based on multi-level city information unit portraits

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107301387A (en) * 2017-06-16 2017-10-27 华南理工大学 A kind of image Dense crowd method of counting based on deep learning
CN109271960A (en) * 2018-10-08 2019-01-25 燕山大学 A kind of demographic method based on convolutional neural networks
CN109598220A (en) * 2018-11-26 2019-04-09 山东大学 A kind of demographic method based on the polynary multiple dimensioned convolution of input
CN109903339A (en) * 2019-03-26 2019-06-18 南京邮电大学 A kind of video group personage's position finding and detection method based on multidimensional fusion feature
CN110705344A (en) * 2019-08-21 2020-01-17 中山大学 Crowd counting model based on deep learning and implementation method thereof

Also Published As

Publication number Publication date
CN111488834A (en) 2020-08-04

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant