CN111815665A

CN111815665A - Single image crowd counting method based on depth information and scale perception information

Info

Publication number: CN111815665A
Application number: CN202010662406.1A
Authority: CN
Inventors: 田玲; 朱大勇; 张栗粽; 罗光春; 邬丹丹; 董文琦
Original assignee: University of Electronic Science and Technology of China
Current assignee: University of Electronic Science and Technology of China
Priority date: 2020-07-10
Filing date: 2020-07-10
Publication date: 2020-10-23
Anticipated expiration: 2040-07-10
Also published as: CN111815665B

Abstract

The invention relates to a computer vision technology, and discloses a single image crowd counting method based on depth information and scale perception information, which improves the prediction capability and reduces the calculation complexity. The method comprises the following steps: s1, performing Gaussian mapping on head center coordinate data corresponding to the input sample picture to generate a preliminary truth-value density map, and correcting the preliminary truth-value density map based on depth information obtained by a depth estimation algorithm to obtain a truth-value density map; s2, predicting the crowd density map of the input sample picture by adopting a density estimation network to generate a predicted density map, calculating loss errors according to the predicted density map and the true density map, adjusting network parameters through gradient back propagation, and generating a density prediction model through iteration; and S3, when counting the crowd of a single image, generating a predicted density map of the image by using a density prediction model, and calculating to obtain the total number of people in the image.

Description

Single image crowd counting method based on depth information and scale perception information

Technical Field

The invention relates to a computer vision technology, in particular to a single image crowd counting method based on depth information and scale perception information.

Background

The crowd counting aims to output a crowd density graph corresponding to a picture after the picture is input and processed by a network model, and finally, the probability of the number of people corresponding to each pixel on the density graph is summed to obtain the final total number of people. The crowd counting task is challenging due to problems of occlusion, view angle change, crowd size change, and distribution diversity.

In the early methods, each pedestrian in the crowd was mainly located by a target detector, and the number of detected targets was the counting result. However, these methods use manual features for classifier training and perform poorly in highly crowded scenes. In order to solve the problem of counting crowds in a complex scene, a crowd density graph is generated by using a convolutional neural network, and counting performance is improved by capturing scale change.

In 2016, Zhang et al proposed an MCNN algorithm to cope with scale changes, which consisted of three branch networks, each of which sampled features using different sized receptive fields. And for a given picture, processing by three branch networks respectively, performing channel fusion on the obtained result, and finally obtaining a final density map by 1-by-1 convolution. But since this design only involves convolution of three different scales, each class can only serve a certain density level. However, there are dense variations and uneven population distribution in the actual scene, and it is not possible to strictly classify the population pictures into which category, so the effectiveness of the MCNN algorithm is limited by the number of branches.

Cao et al proposed an SANet algorithm to improve the scale-aware structure in 2018, integrated scale information using an inclusion structure, and performing convolution operation on each convolution layer using a plurality of convolution kernels, fusing each part of information, and fully sharing information from the bottom layer to the top layer. The network comprises four inclusion structures, and scale reduction is performed by using transposed convolution after each inclusion structure, so that the generated density map has the same size as an input density map, and pixel-level supervision can be performed. However, in the crowd counting scene, the pedestrians far away in the image appear as small targets under the influence of the camera angle. Such small objects are numerous in images and are the main subject of investigation. Although multi-scale information can be integrated by using the inclusion structure, with the forward transmission of a network, features are highly abstracted, and detail features of small targets are lost, so that the prediction capability of the final small targets is reduced. In addition, the scale reduction is carried out by using the transposition convolution, the calculation complexity is high, and the performance of the method has no outstanding advantages in a certain training batch range.

Disclosure of Invention

The technical problem to be solved by the invention is as follows: a single image crowd counting method based on depth information and scale perception information is provided, prediction capability is improved, and calculation complexity is reduced.

The technical scheme adopted by the invention for solving the technical problems is as follows:

a single image crowd counting method based on depth information and scale perception information comprises the following steps:

s1, performing Gaussian mapping on head center coordinate data corresponding to the input sample picture to generate a preliminary truth-value density map, and correcting the preliminary truth-value density map based on depth information obtained by a depth estimation algorithm to obtain a truth-value density map;

s2, predicting the crowd density map of the input sample picture by adopting a density estimation network to generate a predicted density map, calculating loss errors according to the predicted density map and the true density map, adjusting network parameters through gradient back propagation, and generating a density prediction model through iteration;

and S3, when counting the crowd of a single image, generating a predicted density map of the image by using a density prediction model, and calculating to obtain the total number of people in the image.

As a further optimization, step S1 specifically includes:

s11, performing Gaussian distribution mapping on the head coordinate points in the sample picture label data by a Gaussian kernel with a fixed size, and superposing mapping values at all positions of the image to form a primary true-value density graph F₁(x)；

S12, carrying out Gaussian distribution mapping on the head coordinate points in the sample picture label data by using a geometric self-adaptive Gaussian kernel, and superposing mapping values at all positions of the image to form a primary true-value density graph F₂(x)；

S13, extracting depth information of each pixel position in the input sample picture by adopting a monocular depth estimation algorithm to form a depth estimation map depth (x);

s14, determining a final truth density map by using a threshold segmentation algorithm based on the information of the depth estimation map depth (x):

wherein, being a preset segmentation threshold, F₁(i, j) represents the preliminary truth density plot F₁(x) Value, F, corresponding to the middle coordinate (i, j)₂(i, j) represents the preliminary truth density plot F₂(x) The coordinate (i, j) in the Depth (x) represents a Depth value at the coordinate (i, j) in the Depth (x), and the M (i, j) represents a value corresponding to the coordinate (i, j) in the final truth density map.

As a further optimization, in step S2, the density estimation network includes: the system comprises a basic feature extraction module, a multi-scale capture module and a scale transfer module; the basic feature extraction module is used for extracting low-level features such as textures of the pictures; the multi-scale capturing module is used for further extracting picture features, fusing multi-scale information and storing detailed features of small targets; the scale transfer module is used for scale reduction of the feature map and lifting the feature map to the size of the input picture.

As a further optimization, the basic feature extraction module consists of a convolutional layer before conv4_3 in the VGG16 network; the multi-scale capturing module adopts four dense connecting layers, each layer uses a 3 multiplied by 3 convolution kernel to extract features, the resolution of a feature map is kept unchanged by using edge filling, and the growth rate of convolution is set to be 256; the scale transfer module adopts sub-pixel convolution to restore the scale of the feature map and increase the resolution of the feature map to the size of the input picture.

As a further optimization, in step S2, the calculating of the loss error according to the predicted density map and the true density map specifically includes:

and measuring the error between the predicted density graph and the true density graph by using Euclidean distance as a loss function, wherein the expression is as follows:

wherein, F (X)_i(ii) a Theta) is a predicted density map of the network output, theta represents a learning parameter in the network, X_iThe ith picture that represents the input is shown,

the truth density map of the ith picture is shown, and N is the number of training pictures.

The invention has the beneficial effects that:

(1) the supervision information is more accurate:

the invention utilizes depth information to guide the generation of the truth density map, and the generated truth density map is more accurate than the truth density map generated by the traditional single mode. The information is used for guiding network training, and the predicted density graph is closer to a true value.

(2) A wide range of dimensional changes can be captured:

the invention constructs the multi-scale capture module suitable for the current scene by utilizing dense connection, fuses multi-scale information and reserves more detailed characteristics of small targets, thereby being beneficial to improving the prediction performance of the network on the multi-scale targets.

(3) Scale reduction is performed with low computational complexity:

the invention utilizes the sub-pixel convolution module to carry out scale reduction, avoids the problem that bilinear interpolation upsampling is used to ignore the self characteristics of the image, and simultaneously avoids the computational complexity of using transposition convolution upsampling.

Drawings

FIG. 1 is a flow chart of a crowd counting algorithm based on depth information and scale perception information according to the present invention;

FIG. 2 is a diagram of a process of generating a truth density map;

FIG. 3 is a diagram of a process for generating a predicted density map for population counting by a density estimation network.

Detailed Description

The invention aims to provide a method for counting the population of a single image based on depth information and scale perception information, which improves the prediction capability and reduces the calculation complexity. The core idea is as follows: (1) training a prediction model: firstly, generating a preliminary true density map, then, correcting the preliminary true density map based on depth information obtained by a depth estimation algorithm so as to obtain a true density map, wherein the true density map is used for a predicted density map generated by a point-to-point supervised density estimation network, adjusting network parameters through gradient back transmission according to an error between the true density map and the predicted density map, and generating a final prediction model through iteration; (2) and realizing the prediction of the density graph of the input picture based on the trained prediction model, and calculating the total number of people in the graph.

In the invention, the truth density map is generated not by single fixed Gaussian kernel mapping or geometric self-adaptive Gaussian kernel mapping but by the source analysis of the crowd picture, the size of the target close to the camera in the picture is large, and the distance between the targets is large. Targets farther from the camera are affected by the viewing angle, smaller targets, and smaller distances between targets. In view of this, the depth information of the picture is introduced to guide the generation of the true density map, and a more accurate density map is obtained to supervise and predict the generation of the density map.

In the process of acquiring the predicted density map, the dense connection structure is used, the small target detail features are fully reserved while the multi-scale features are fused, and the problem that the small target detail features are lost when the multi-scale features are captured by the conventional method is solved. In order to improve the resolution of the prediction graph, the method of filling dimension information by using channel information makes full use of the information of the image. The influence of manual characteristics caused by the adoption of linear interpolation upsampling in the conventional method is avoided, and meanwhile, the calculation complexity caused by the transposition convolution mode is avoided.

In a specific implementation, as shown in fig. 1, a crowd counting algorithm flow based on depth information and scale perception information in the present invention includes the following steps:

s1: obtaining a truth density map of an input sample picture:

in order to obtain a truth-value density map label of an input sample picture, Gaussian mapping needs to be performed on head center coordinate data corresponding to the input picture to generate a preliminary truth-value density map. And then correcting the preliminary true density map based on the depth information obtained by the depth estimation algorithm, wherein the obtained true density icon is used for a predicted density map generated by the point-to-point supervised density estimation network.

Here, two gaussian mapping methods are used, which are a fixed gaussian kernel function and a geometric adaptive gaussian kernel function, respectively, and the preliminary truth density maps generated by the two methods are fused by depth information to generate a final truth density map, which is specifically shown in fig. 2.

S11, fixing a Gaussian kernel mapping mode:

let the coordinate of a head label point be x_iUsing (x-x)_i) Indicating a gaussian distribution position, so a picture with N persons' heads can be represented as

The corresponding population density map may be expressed as f (x) ═ G (x) ·_σ(x) In that respect Wherein G is_σ(x) Representing a gaussian kernel function, the closer the coordinates are to the center point, the larger the value, and σ represents the size of the region range upon which the function acts. This density function assumes that each head is marked with a point x_iThe distribution in the image space is independent of each other, but the range of the region involved by different samples is different in size in the three-dimensional space due to the influence of perspective distortion.

S12, a geometric self-adaptive Gaussian kernel mapping mode:

through each oneAverage distance decision related parameters of person to neighboring object for each head marker x in the picture_iMarks m nearby objects as

The average distance between the objects is

The distribution of the image in the crowd is Gaussian kernel

Wherein sigma_iAnd

and (4) correlating. The density map generated by the method can be expressed as

Wherein

Beta is a hyperparameter.

S13, extraction of the depth estimation map:

the method calculates the depth map corresponding to the input sample picture through a monocular depth estimation algorithm, corrects the Gaussian mapping value of each position in the picture by utilizing a threshold segmentation algorithm based on the information of the depth map, and fuses the information of the two density maps to form the final density map. Specifically, the monadepth algorithm may be used to estimate the depth information of the input sample picture, and the input sample picture passes through the monadepth algorithm model to obtain a gray scale map, where each pixel value on the map represents the distance from the camera to the surface of the object.

S14, fusing the density maps generated in S11 and S12 by using the depth information:

suppose the input picture is X ∈ R^h×h×cWhere h denotes the size of the picture and c denotes the dimension of the picture, using a fixedObtaining a true density map F by a Gaussian kernel function₁(x) In that respect Using geometric self-adaptive Gaussian kernel function mapping to the input picture to obtain a true value density graph F₂(x) In that respect And processing the input picture through a monodepth model to obtain a depth information map depth (x). The two types of true density map information obtained are further processed based on the depth information, and a true density map F is obtained₁(x)、F₂(x) The operation of segmentation according to the preset depth threshold is as follows:

wherein F₁(i, j) shows a density map F₁(x) Value, F, corresponding to the middle coordinate (i, j)₂(i, j) shows a density map F₂(x) The coordinate (i, j) in the Depth (x) represents a Depth value at the coordinate (i, j) in the Depth (x), and the M (i, j) represents a value corresponding to the coordinate (i, j) in the final truth density map.

S2, obtaining a predicted density map (density estimation map) of the input sample picture based on the density estimation network:

the density estimation network employed in the present invention consists of three main components: the system comprises a basic feature extraction module, a multi-scale capture module and a scale reduction module; the basic feature extraction module is mainly used for extracting low-level features such as textures of pictures; the multi-scale capturing module is used for further extracting the features, fusing multi-scale information and storing the detailed features of the small target; the scale transfer module is mainly used for scale reduction of the feature map and lifting the feature map to the size of the input picture.

S21, a basic feature extraction module:

the model may use layers in a pre-trained VGG module. By taking pictures with size of 256 × 256 as input, analyzing the convolutional layer in VGG16, the scope of the receptive field of conv4_3 layer reaches 172, which is far beyond the scale of large target. In the current scenario, the ratio of the scale of the large target in the picture is less than one half, and finally, the basic feature extraction module adopted by the invention is composed of the convolutional layer before conv4_3 in the VGG 16.

S22, a multi-scale capturing module:

in order to keep the detail features of the small targets in the current scene, the feature information output by the basic feature extraction module is sequentially transmitted backwards through the multi-scale module, and the performance bottleneck caused by the loss of the detail information in the conventional research method is avoided. Unlike the random short connections of Resnet, dense connections ensure the greatest degree of information sharing between layers. Through receptive field analysis, four layers of dense connection are added, and the receptive field range can meet the extraction of semantic information of all size targets. To ensure that the module can extract enough context information while avoiding too high a growth rate, the module extracts features using a 3 × 3 convolution kernel per layer, keeps the resolution of the feature map unchanged using edge-filling, and sets the growth rate of the convolution to 256. Since the output channel of the basic network is 512-dimensional, dimension conversion into a feature map of 256 channels is required before entering the scale capture module.

S23, a scale reduction module:

the module improves the resolution of the feature map based on sub-pixel convolution, and the size of the picture is changed into 1/8 as the basic feature extraction part performs 8 times down-sampling on the picture by using three times of pooling operation. In the multi-scale capture module, the multi-layer feature maps need to be connected through channels, so that the feature map size is kept unchanged by using edge filling. For scale reduction we need 8 times up-sampling the feature map. Since the number of low resolution feature maps in the sub-pixel convolution operation must be the square number of the upsampling factor, a 1 × 1 convolution is added after the multi-scale capture module structure, and the number of channels of the feature maps is adjusted to be the square number of the upsampling factor of 64. Finally, dimension filling of the feature map is performed using the channel features.

The process of generating the predicted density map based on the density estimation network with the structure is shown in fig. 3, wherein an input sample picture firstly passes through a basic feature extraction part to extract basic features, then enters a multi-scale feature capture module to fuse scale information, and finally is subjected to scale reduction to generate the predicted density map.

After a prediction density map of an input sample picture is generated, calculating loss errors according to the prediction density map and a true density map, adjusting network parameters through gradient back propagation, and generating a density prediction model through iteration;

in the training process, a population density estimation algorithm is trained by using Euclidean distance as a loss function, wherein Euclidean loss is mainly used for calculating estimation errors at a pixel level and has the expression of

Wherein F (X)_i(ii) a θ) is a density estimation map of the network output, θ represents a learning parameter in the network, X_iThe ith picture that represents the input is shown,

the truth label graph of the ith picture is shown, and N is the number of training pictures.

Inputting a density prediction model: crowd picture and label data for training

And (3) outputting: predicted density map

The training process is as follows:

1. data preprocessing: obtaining a preliminary truth density map F of a picture₁(X_i)、F₂(X_i) (ii) a Depth map information Depth (X) of picture_i) (ii) a Determining a depth segmentation threshold; based on the depth information, a final truth density map M (X) is obtained_i)；

2. Model parameters are initialized, and then the model is trained until the model converges: loading pictures according to batches; extracting basic characteristics and updating a characteristic diagram F_i ¹∈R^h1×h1×c1←X_i∈R^h×h×c(ii) a Channel transformation F for feature map_i ¹∈R^h1×h1×c2←F_i ¹∈R^h1 ^×h1×c1(ii) a Multi-scale capturing and updating feature map F_i ²←F_i ¹(ii) a Channel transformation of feature maps

Scale reduction to obtain prediction chart

Calculating M (X)_i)，

Updating the model parameters.

The initialization of the model parameters, except for the VGG part participating in training, the convolution kernel parameters of the rest parts are initialized by using a Gaussian function, and the standard deviation of the parameters is set to be 0.01. Optimization of the model uses Adam algorithm to replace conventional random gradient descent algorithm, and in order to enable the model to be converged quickly, a fixed learning rate is set to be 1e^-5。

After a stable density prediction model is trained, a predicted density map of the image can be generated for an input image by using the model in practical application, and after the density map is obtained, the total number of people in the map can be obtained through summation of pixel points, which is the conventional calculation and is not repeated herein.

Claims

1. The single image crowd counting method based on the depth information and the scale perception information is characterized by comprising the following steps of:

2. The method of claim 1,

it is characterized in that step S1 specifically includes:

3. The method of claim 1,

wherein, in step S2, the density estimation network includes: the system comprises a basic feature extraction module, a multi-scale capture module and a scale transfer module; the basic feature extraction module is used for extracting low-level features such as textures of the pictures; the multi-scale capturing module is used for further extracting picture features, fusing multi-scale information and storing detailed features of small targets; the scale transfer module is used for scale reduction of the feature map and lifting the feature map to the size of the input picture.

4. The method of claim 3, wherein the method for counting the population of single-image based on the depth information and the scale perception information,

the basic feature extraction module is composed of convolutional layers before conv4_3 in a VGG16 network; the multi-scale capturing module adopts four dense connecting layers, each layer uses a 3 multiplied by 3 convolution kernel to extract features, the resolution of a feature map is kept unchanged by using edge filling, and the growth rate of convolution is set to be 256; the scale transfer module adopts sub-pixel convolution to restore the scale of the feature map and increase the resolution of the feature map to the size of the input picture.

5. The method for counting the population of single-image based on depth information and scale perception information as claimed in any one of claims 1 to 4,

in step S2, calculating a loss error according to the predicted density map and the true density map specifically includes:

truth density map representing the ith picture, N is trainingThe number of pictures.