CN115331171A - Crowd counting method and system based on depth information and significance information - Google Patents
Crowd counting method and system based on depth information and significance information
- Publication number
- CN115331171A CN115331171A CN202210992920.0A CN202210992920A CN115331171A CN 115331171 A CN115331171 A CN 115331171A CN 202210992920 A CN202210992920 A CN 202210992920A CN 115331171 A CN115331171 A CN 115331171A
- Authority
- CN
- China
- Prior art keywords
- information
- crowd
- significance
- prediction
- depth information
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/50—Context or environment of the image
- G06V20/52—Surveillance or monitoring of activities, e.g. for recognising suspicious objects
- G06V20/53—Recognition of crowd images, e.g. recognition of crowd congestion
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/50—Depth or shape recovery
- G06T7/55—Depth or shape recovery from multiple images
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/46—Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
- G06V10/462—Salient features, e.g. scale invariant feature transforms [SIFT]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20081—Training; Learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20084—Artificial neural networks [ANN]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/30—Subject of image; Context of image processing
- G06T2207/30204—Marker
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/30—Subject of image; Context of image processing
- G06T2207/30242—Counting objects in image
Abstract
The invention discloses a crowd counting method and system based on depth information and saliency information, comprising the following steps: collecting a crowd sample image of a designated area; inputting the collected crowd sample image into a trained density map prediction model based on saliency information and depth information; and outputting the total number of people in the crowd sample image. The method introduces crowd saliency information into the crowd counting field: head annotation points are treated as human eye attention points, Gaussian blur is used to generate visual saliency labels for crowd counting, and a deep learning network is trained and tested to obtain visual saliency information that assists the training of crowd counting. By combining visual saliency information with depth information to assist crowd counting, the depth information can be corrected by the saliency information, the interference caused by regions without crowd information is reduced, and the counting accuracy is improved.
Description
Technical Field
The invention belongs to the field of computer vision, and particularly relates to a crowd counting method based on depth information and saliency information.
Background
The task of dense crowd counting is to estimate the number of people contained in an image or video. With the growth of the global population and the increase of human social activities, large crowds often gather in public places such as transportation hubs and entertainment venues, posing great hidden dangers to public safety. Dense crowd counting is widely applied in video surveillance, traffic control and metropolitan safety, and researchers in many countries have carried out extensive research on it. Crowd counting methods can also be extended to similar tasks in other fields, such as cell counting in microscopic medical images, vehicle estimation in traffic congestion, and large-scale biological sample surveys.
Traditional crowd counting methods can be mainly divided into detection-based and regression-based methods; as crowd density increases, both struggle with severe occlusion among people. Because deep learning models have strong feature extraction capability, research on deep-learning-based crowd counting has achieved many excellent results. The current mainstream approach is to predict a density map from the original image with a convolutional neural network and compute the number of people from the density map.
Wang et al. first introduced the convolutional neural network (CNN) into the crowd counting field and proposed an end-to-end CNN regression model suitable for dense crowd scenes. The model improves the AlexNet network, replacing the final fully connected layer with a single-neuron layer to directly predict the crowd count. Its disadvantage is that it cannot capture the spatial distribution of people in the scene, and it performs poorly in dense crowds or complex scenes. Zhang et al. proposed the multi-column convolutional neural network MCNN for crowd counting; each branch network adopts convolution kernels of a different size to extract feature information of targets at different scales, reducing counting errors caused by target size variation due to perspective changes. Although multi-branch counting networks achieve better counting results, their higher model complexity brings new problems: many parameters, difficult training, and structural redundancy. To this end, Li et al. proposed CSRNet, a dilated convolutional neural network model suitable for dense crowd counting. Instead of the widely used multi-branch structure, CSRNet uses a VGG16 network with the fully connected layers removed as the front end and a 6-layer dilated convolutional network as the back end, forming a single-column counting network that greatly reduces the parameter count and the training difficulty. Meanwhile, dilated convolution enlarges the receptive field while keeping the resolution of the input image, preserving more image detail, so the generated crowd density maps are of higher quality.
To address the large variation in target size caused by different distances between the camera and the crowd, researchers have introduced auxiliary information to assist counting. Shi et al. combined perspective information with crowd counting to improve accuracy; perspective information reflects the depth difference across the whole image and is somewhat similar to a depth map. Xu et al. used image depth information to segment the scene into far-view and near-view regions, then applied different mechanisms (density-map-based and detection-based) to estimate the counts of the two regions and summed them to obtain the total. Yang et al. used a pre-trained depth branch to provide depth information for crowd counting; depth information reflects crowd density to a certain extent and implies scale change information, but ignoring the influence of depth information outside the crowd regions can adversely affect the counting result.
Disclosure of Invention
The invention aims to provide a crowd counting method and system based on depth information and saliency information, which assist crowd counting by combining visual saliency information with depth information, correct the depth information using the saliency information, reduce interference caused by regions without crowd information, and improve the counting accuracy.
To achieve this purpose, the invention adopts the following technical scheme:
in a first aspect, the present invention provides a crowd counting method based on depth information and saliency information, comprising:
collecting a crowd sample image of a designated area;
inputting the collected crowd sample image into a trained density map prediction model based on saliency information and depth information;
and outputting the total number of people in the crowd sample image.
Further, the density map prediction model is constructed by the following method:
carrying out depth prediction on an input crowd sample image by using an image depth information prediction network to obtain image depth information;
inputting the input crowd sample image, its corresponding predicted saliency information and the depth information together into a crowd density map prediction network, correcting the depth information by using the saliency information, and using the corrected depth information to guide the training of the density map prediction network to generate a density map prediction model.
Further, the saliency information is generated by predicting a saliency map of the input crowd sample image with a saliency prediction model, and the saliency prediction model is constructed by the following method:
carrying out Gaussian blur on the head annotation data corresponding to the input crowd sample image to generate a ground-truth saliency map;
predicting saliency information of the input crowd sample image by using a visual saliency prediction network to generate a predicted saliency map;
and calculating a loss function from the predicted saliency map and the ground-truth saliency map, adjusting network parameters through gradient back propagation, and generating a saliency prediction model through iteration.
Further, the Gaussian blur of the head annotation data corresponding to the input crowd sample image is performed with a Gaussian kernel function with a standard deviation of 19.
Further, the density map prediction network training comprises:
for an input crowd sample image R with corresponding depth map D and saliency map S, at the l-th layer of the encoder, let R_l, D_l and S_l denote the output feature maps of the preceding convolution layers of the encoder; the depth features are corrected by the saliency features of the corresponding layer as follows:

V_l = sigmoid(Φ_s(S_l))

D_l = V_l ⊙ D_l

where Φ_s denotes a 1 × 1 convolution layer, V_l is the weight for encoder layer l computed by the sigmoid function, and ⊙ denotes element-wise multiplication; applying the weight V_l to D_l highlights the crowd regions and reduces the influence of depth information from non-crowd regions;

the corrected depth information D_l is then used to weight R_l as

R_l = R_l ⊙ D_l

where ⊙ denotes element-wise multiplication;

then R_l, D_l and S_l are input to the subsequent network.
Further, the density map prediction network includes: an encoding module, a depth correction and embedding module, an enhanced multi-scale module and a decoding module;
the encoding module is used for extracting multi-level features of the input image;
the depth correction and embedding module is used for correcting and fusing the depth information;
the enhanced multi-scale module is used for extracting and fusing multi-scale comprehensive features;
the decoding module is used for outputting a predicted density map of the same size as the input image.
Further, the encoding module is a pre-trained VGG16 network; the enhanced multi-scale module comprises multi-branch 3 × 3 convolutions with different dilation rates, and the dilated convolutions provide a larger receptive field than ordinary convolution operations; the decoding module is a 7-layer dilated convolutional network used for outputting a predicted density map of the same size as the input image.
Further, outputting the total number of people in the crowd sample image comprises:
generating a predicted density map of the crowd sample image by using the density prediction model, and summing all pixels of the predicted density map to obtain the total number of people in the image.
In a second aspect, the present invention also provides a people counting system based on depth information and saliency information, comprising a processor and a storage medium;
the storage medium is to store instructions;
the processor is configured to operate in accordance with the instructions to perform the steps of the method of the first aspect.
Compared with the prior art, the invention has the following beneficial effects:
(1) The method introduces crowd saliency information into the crowd counting field, treats head annotation points as human eye attention points, generates visual saliency labels for crowd counting by Gaussian blur, trains and tests with a deep learning network to obtain visual saliency information for crowd counting, and uses it to assist the training of crowd counting;
(2) The method assists crowd counting by combining visual saliency information with depth information; the depth information can be corrected by the saliency information, the interference caused by regions without crowd information is reduced, and the counting accuracy is improved.
Drawings
Fig. 1 is a flowchart of a crowd counting method based on depth information and saliency information according to an embodiment of the present invention;
FIG. 2 is a diagram illustrating an overall network architecture of population counting according to an embodiment;
fig. 3 is a schematic network structure diagram of an enhanced multi-scale module according to an embodiment.
Detailed Description
The present invention will be further described with reference to the accompanying drawings and specific examples, which are only used for more clearly illustrating the technical solutions of the present invention, and the protection scope of the present invention is not limited thereby.
Example 1
As shown in fig. 1 to 3, a method for counting people based on depth information and saliency information includes: collecting a crowd sample image of a designated area; inputting the collected crowd sample images into a trained density map prediction model based on significance information and depth information; and outputting the total number of people in the crowd sample image.
In this embodiment, for the acquired crowd sample image, a crowd counting method based on depth information and saliency information is adopted, and an application process thereof is shown in fig. 1, and specifically involves the following steps:
step 1) carrying out Gaussian blur on head annotation data corresponding to the input crowd sample image to generate a truth significance map.
This embodiment takes inspiration from the human eye attention prediction dataset SALICON, which contains 20,000 images selected from the Microsoft COCO dataset and is so far the largest dataset in the field of eye-fixation prediction. That dataset did not record eye movements with an eye tracker; instead, annotators on the Amazon crowdsourcing platform clicked with a mouse on the positions they attended to, and all preprocessed mouse-click samples of the same image were Gaussian-blurred to generate a ground-truth saliency map. For a crowd counting dataset, this is similar to the process of head annotation. Accordingly, Gaussian blur with a Gaussian kernel of standard deviation 19 is applied to the head annotation data corresponding to the input sample image to generate the ground-truth saliency map.
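The ground-truth saliency generation described above can be sketched as follows: place an impulse at every head annotation point, then convolve with a Gaussian of standard deviation 19. The function and argument names are illustrative, not from the patent, and a real implementation would use an optimized separable or FFT-based blur rather than this direct convolution:

```python
import numpy as np

def gaussian_kernel(size, sigma):
    """Build a normalized 2-D Gaussian kernel of the given odd size."""
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    k = np.exp(-(xx ** 2 + yy ** 2) / (2.0 * sigma ** 2))
    return k / k.sum()

def saliency_ground_truth(shape, head_points, sigma=19.0):
    """Ground-truth saliency map: Gaussian blur of head annotation impulses.

    shape: (H, W) of the image; head_points: list of (row, col) head centers.
    """
    h, w = shape
    impulse = np.zeros((h, w), dtype=np.float64)
    for r, c in head_points:
        impulse[int(r), int(c)] += 1.0
    size = int(6 * sigma) | 1           # odd kernel covering about +-3 sigma
    kernel = gaussian_kernel(size, sigma)
    pad = size // 2
    padded = np.pad(impulse, pad)
    out = np.zeros_like(impulse)
    for r in range(h):                  # direct (slow) 2-D convolution
        for c in range(w):
            out[r, c] = np.sum(padded[r:r + size, c:c + size] * kernel)
    return out
```

For a single head far from the image border, the blurred map integrates to 1 and peaks at the annotated point, which is the behavior the density/saliency labels rely on.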
Step 2) Predict saliency information of the input crowd sample image with a visual saliency prediction network to generate a predicted saliency map; calculate a loss function from the predicted saliency map and the ground-truth saliency map, adjust network parameters through gradient back propagation, and iterate to obtain a saliency prediction model; this model is then used to predict the saliency map of an input sample image, producing the predicted saliency information.
Step 3) Carry out depth prediction on the input sample image using an image depth information prediction network to obtain image depth information.
In this embodiment, a pre-trained depth information prediction network model performs depth prediction on the input sample image; the predicted depth map adapts well to various scene layouts and reflects the varying distance from different positions to the camera.
Step 4) Input the sample image, its corresponding predicted saliency information and the depth information together into the crowd density map prediction network; correct the depth information using the saliency information, and use the corrected depth information to guide the training of the density map prediction network to generate a density map prediction model.
In this embodiment, the overall structure of the crowd density map prediction network is shown in fig. 2. The input sample image and its corresponding predicted saliency information are input into the crowd density map prediction network together with the depth information. For the input sample image R with corresponding depth map D and saliency map S, at the l-th layer of the encoder let R_l, D_l and S_l denote the output feature maps of the preceding convolution layers of the encoder. The depth features are corrected by the saliency features of the corresponding layer as follows:

V_l = sigmoid(Φ_s(S_l))

D_l = V_l ⊙ D_l

where Φ_s denotes a 1 × 1 convolution layer, V_l is the weight for encoder layer l computed by the sigmoid function, and ⊙ denotes element-wise multiplication. Applying the weight V_l to D_l highlights the crowd regions and suppresses the influence of depth information from non-crowd regions.

The corrected depth information D_l is then used to weight R_l as

R_l = R_l ⊙ D_l

where ⊙ denotes element-wise multiplication.

R_l, D_l and S_l are then input to the subsequent network.
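The saliency-gated depth correction above can be sketched in NumPy for a single encoder layer. The function name and the explicit `w_1x1`/`b_1x1` parameters are illustrative (in the network these are the learned weights of the 1 × 1 convolution Φ_s), and in practice this would be a few lines inside a deep learning framework:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def depth_correction(r_l, d_l, s_l, w_1x1, b_1x1):
    """Saliency-gated depth correction for one encoder layer.

    r_l, d_l, s_l: feature maps of shape (C, H, W) for image, depth, saliency.
    w_1x1, b_1x1: weights (C_out, C) and bias (C_out,) of the 1x1 conv Phi_s.
    """
    # A 1x1 convolution is a per-pixel linear map across channels.
    phi = np.tensordot(w_1x1, s_l, axes=([1], [0])) + b_1x1[:, None, None]
    v_l = sigmoid(phi)           # V_l = sigmoid(Phi_s(S_l)), gate in (0, 1)
    d_corr = v_l * d_l           # D_l = V_l (.) D_l, element-wise
    r_weighted = r_l * d_corr    # R_l = R_l (.) D_l, element-wise
    return r_weighted, d_corr, v_l
```

Because the gate V_l lies in (0, 1), the corrected depth can only attenuate non-crowd regions; it never amplifies depth responses beyond their original values.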
To deal with scale variation of the crowd, most previous works adopt a multi-column architecture; for example, MCNN uses three columns of sub-networks to extract features at different scales, but the scale diversity of the features is limited by the number of columns. To address this, an enhanced multi-scale module is proposed, borrowing the architectural idea of Inception and further using dilated convolution to perform scale enhancement on the input feature map, as shown in fig. 3.
Dilated convolution provides a larger receptive field than an ordinary convolution and can capture the area around boundaries and richer context information. The head size of a person in a crowd-scene image always varies greatly, and a single receptive field cannot adapt to this scale variation; therefore 3 × 3 convolutions with dilation rates d of 1, 2 and 4 are used in parallel to capture features, better adapting to the diverse crowd distributions in crowd scenes.
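The effect of the three dilation rates can be checked with the standard formula for the spatial extent of a dilated kernel, k + (k − 1)(d − 1); the function name here is illustrative:

```python
def effective_kernel_size(k, d):
    """Effective spatial extent of a k x k convolution with dilation d."""
    return k + (k - 1) * (d - 1)

# The three parallel branches of the enhanced multi-scale module use
# 3 x 3 kernels with dilation rates 1, 2 and 4.
for d in (1, 2, 4):
    print("dilation", d, "-> extent", effective_kernel_size(3, d))
```

So the three branches see 3 × 3, 5 × 5 and 9 × 9 neighborhoods with the same number of parameters per kernel, which is what lets the module cover several head scales at once.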
Finally, the feature map is input into the decoder module, a 7-layer dilated convolutional network that uses its larger receptive field to extract deeper important information and outputs a predicted density map of the same size as the input image. The difference between the predicted density map and the label is measured by the Euclidean distance, as shown in formula (1):

L(Θ) = (1 / 2N) Σ_{i=1}^{N} ||F(X_i; Θ) − D(X_i)||²    (1)

In formula (1), X_i is the input image, F(X_i; Θ) is the density map estimated by the network with parameters Θ, D(X_i) is the ground-truth density map, and N is the number of training samples. Network parameters are adjusted through gradient back propagation, and the density prediction network is trained iteratively.
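The Euclidean training loss of formula (1) can be sketched directly; `euclidean_loss` is an illustrative name, and in a framework this would be an MSE-style criterion over density maps:

```python
import numpy as np

def euclidean_loss(pred_maps, gt_maps):
    """Formula (1): squared Euclidean distance between predicted and
    ground-truth density maps, averaged over the N training samples."""
    n = len(pred_maps)
    total = 0.0
    for f, d in zip(pred_maps, gt_maps):
        diff = np.asarray(f, dtype=np.float64) - np.asarray(d, dtype=np.float64)
        total += np.sum(diff ** 2)
    return total / (2.0 * n)
```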
Step 5) To count the crowd in a single image, generate the predicted density map of the image with the density prediction model, and sum all pixels of the predicted density map to obtain the total number of people in the image.
After a stable density prediction model is trained, the model generates a predicted density map for an input image; the total count is then obtained by summing the pixel values of the density map.
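The final counting step is a plain summation over the density map; the rounding to a whole person is an illustrative choice, not specified by the patent:

```python
import numpy as np

def count_from_density(density_map):
    """Total crowd count = sum of all pixel values of the predicted
    density map (rounded here to the nearest whole person)."""
    return int(round(float(np.sum(density_map))))
```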
Example 2
Based on the crowd counting method based on the depth information and the significance information described in embodiment 1, the embodiment provides a crowd counting system based on the depth information and the significance information, which comprises a processor and a storage medium; the storage medium is used for storing instructions; the processor is configured to operate in accordance with the instructions to perform the steps of the method of embodiment 1.
Having described embodiments of the present invention, the foregoing description is intended to be exemplary, not exhaustive, and not limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and technical principles of the described embodiments, and such modifications and variations should also be considered as within the scope of the present invention.
Claims (9)
1. A crowd counting method based on depth information and significance information is characterized by comprising the following steps:
collecting a crowd sample image of a designated area;
inputting the collected crowd sample images into a trained density map prediction model based on significance information and depth information;
and outputting the total number of people in the crowd sample image.
2. The population counting method based on depth information and significance information according to claim 1, wherein the density map prediction model is constructed by the following method:
carrying out depth prediction on an input crowd sample image by using an image depth information prediction network to obtain image depth information;
the input crowd sample image, the corresponding prediction significance information and the depth information are input into a crowd density map prediction network together, the depth information is corrected by using the significance information, and the corrected depth information is used for guiding the density map prediction network to train so as to generate a density map prediction model.
3. The method of claim 2, wherein the saliency information is generated by prediction of a saliency map of an input crowd sample image using a saliency prediction model, the saliency prediction model being constructed by:
carrying out Gaussian blur on head annotation data corresponding to the input crowd sample image to generate a truth value significance map;
predicting significance information of the input crowd sample image by using a visual significance prediction network to generate a prediction significance map;
and calculating a loss function according to the prediction significance map and the truth value significance map, adjusting network parameters through gradient back propagation, and generating a significance prediction model through iteration.
4. The method according to claim 3, wherein the Gaussian blur of the head labeling data corresponding to the input human sample image is performed by using a Gaussian kernel function with a standard deviation of 19.
5. The population counting method based on depth information and significance information according to claim 2, wherein the density map prediction network training comprises:
for an input crowd sample image R with corresponding depth map D and saliency map S, at the l-th layer of the encoder, let R_l, D_l and S_l denote the output feature maps of the preceding convolution layers of the encoder; the depth features are corrected by the saliency features of the corresponding layer as follows:

V_l = sigmoid(Φ_s(S_l))

D_l = V_l ⊙ D_l

where Φ_s denotes a 1 × 1 convolution layer, V_l is the weight for encoder layer l, and ⊙ denotes element-wise multiplication;

the corrected depth information D_l is then used to weight R_l as

R_l = R_l ⊙ D_l

where ⊙ denotes element-wise multiplication;

then R_l, D_l and S_l are input to the subsequent network.
6. The crowd counting method based on depth information and significance information according to claim 5, wherein the density map prediction network comprises: the device comprises a coding module, a depth correction and embedding module, an enhanced multi-scale module and a decoding module;
the encoding module is used for extracting multi-level features of an input image;
the depth correction and embedding module is used for correcting and fusing depth information;
the enhanced multi-scale module is used for extracting and fusing multi-scale comprehensive features;
the decoding module is used for outputting a prediction density map with the same size as the input image.
7. The people counting method based on depth information and saliency information of claim 6, characterized in that said encoder module is a pre-trained VGG16 network; the enhanced multi-scale module comprises a multi-branch 3 × 3 convolution with different expansion rates; the decoding module is a 7-layer expansion convolution network and is used for outputting a prediction density map with the same size as the input image.
8. The method of claim 1, wherein outputting the total number of people in the crowd sample image comprises:
generating a predicted density map of the crowd sample image by using the density prediction model, and summing all pixels of the predicted density map to obtain the total number of people in the image.
9. A crowd counting system based on depth information and saliency information, characterized by: comprising a processor and a storage medium;
the storage medium is used for storing instructions;
the processor is configured to operate in accordance with the instructions to perform the steps of the method of any one of claims 1 to 8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210992920.0A CN115331171A (en) | 2022-08-18 | 2022-08-18 | Crowd counting method and system based on depth information and significance information |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115331171A true CN115331171A (en) | 2022-11-11 |
Family
ID=83925597
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210992920.0A Pending CN115331171A (en) | 2022-08-18 | 2022-08-18 | Crowd counting method and system based on depth information and significance information |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115331171A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117456449A (en) * | 2023-10-13 | 2024-01-26 | 南通大学 | Efficient cross-modal crowd counting method based on specific information |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110276316B (en) | Human body key point detection method based on deep learning | |
CN113065558B (en) | Lightweight small target detection method combined with attention mechanism | |
Ying et al. | Multi-attention object detection model in remote sensing images based on multi-scale | |
CN112446342B (en) | Key frame recognition model training method, recognition method and device | |
WO2019136591A1 (en) | Salient object detection method and system for weak supervision-based spatio-temporal cascade neural network | |
Wang et al. | FE-YOLOv5: Feature enhancement network based on YOLOv5 for small object detection | |
CN113012172A (en) | AS-UNet-based medical image segmentation method and system | |
CN111027505B (en) | Hierarchical multi-target tracking method based on significance detection | |
CN112150493A (en) | Semantic guidance-based screen area detection method in natural scene | |
CN112633220B (en) | Human body posture estimation method based on bidirectional serialization modeling | |
CN108830170B (en) | End-to-end target tracking method based on layered feature representation | |
CN113192124B (en) | Image target positioning method based on twin network | |
CN110827320B (en) | Target tracking method and device based on time sequence prediction | |
CN114140469B (en) | Depth layered image semantic segmentation method based on multi-layer attention | |
CN111507184B (en) | Human body posture detection method based on parallel cavity convolution and body structure constraint | |
Zhu et al. | A multi-scale and multi-level feature aggregation network for crowd counting | |
CN116524062A (en) | Diffusion model-based 2D human body posture estimation method | |
CN116596966A (en) | Segmentation and tracking method based on attention and feature fusion | |
CN115331171A (en) | Crowd counting method and system based on depth information and significance information | |
CN117351414A (en) | Crowd density estimation method based on deep neural network | |
CN114596515A (en) | Target object detection method and device, electronic equipment and storage medium | |
CN116778187A (en) | Salient target detection method based on light field refocusing data enhancement | |
CN112053386B (en) | Target tracking method based on depth convolution characteristic self-adaptive integration | |
CN115965905A (en) | Crowd counting method and system based on multi-scale fusion convolutional network | |
Yao et al. | MLP-based Efficient Convolutional Neural Network for Lane Detection |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||