CN114155210B - Crowd counting method based on attention mechanism and standardized dense cavity space multi-scale fusion network - Google Patents

Crowd counting method based on attention mechanism and standardized dense cavity space multi-scale fusion network

Info

Publication number
CN114155210B
Authority
CN
China
Prior art keywords
module
network
convolution
standardized
dense
Prior art date
Legal status
Active
Application number
CN202111358807.9A
Other languages
Chinese (zh)
Other versions
CN114155210A (en
Inventor
王巍
孟佳娜
刘爽
云健
张建新
刘勇奎
多俊杰
Current Assignee
Dalian Minzu University
Original Assignee
Dalian Minzu University
Priority date
Filing date
Publication date
Application filed by Dalian Minzu University
Priority to CN202111358807.9A
Publication of CN114155210A
Application granted
Publication of CN114155210B
Legal status: Active (current)


Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06T — IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 — Image analysis
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 — Pattern recognition
    • G06F 18/20 — Analysing
    • G06F 18/29 — Graphical models, e.g. Bayesian networks
    • G06N — COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 — Computing arrangements based on biological models
    • G06N 3/02 — Neural networks
    • G06N 3/04 — Architecture, e.g. interconnection topology
    • G06N 3/045 — Combinations of networks
    • G06N 3/08 — Learning methods
    • G06T 3/00 — Geometric image transformations in the plane of the image
    • G06T 3/40 — Scaling of whole images or parts thereof, e.g. expanding or contracting
    • G06T 3/4007 — Scaling based on interpolation, e.g. bilinear interpolation
    • G06T 2207/00 — Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 — Special algorithmic details
    • G06T 2207/20076 — Probabilistic image processing
    • G06T 2207/20081 — Training; Learning
    • G06T 2207/30 — Subject of image; Context of image processing
    • G06T 2207/30196 — Human being; Person
    • G06T 2207/30242 — Counting objects in image
    • Y02T 10/40 — Engine management systems


Abstract

The invention discloses a crowd counting method based on SDA-SFANet, which comprises the following steps: acquiring a crowd counting data set and preprocessing the data during training; establishing a multi-scale-aware crowd counting network model and integrating an attention mechanism module into the VGG network; introducing an HDC standardized design to optimize and improve the ASPP module into a standardized SDASPP module, and adding the SDASPP module to the VGG network; training the model on the collected data sets; restoring the feature maps output by the VGG extractor, the context-aware module and the SDASPP module to the original image size by cascading and up-sampling to obtain a density map; generating the ground truth D(x) using a Gaussian method with a fixed-standard-deviation kernel; constructing the density map by convolution with a Gaussian kernel; calculating the loss of the predicted crowd count with a Bayesian loss function and feeding it back to the SDA-SFANet; and continuing to train the network and optimize its parameters until a model that satisfactorily predicts crowd counting results is obtained.

Description

Crowd counting method based on attention mechanism and standardized dense cavity space multi-scale fusion network
Technical Field
The invention belongs to the technical field of computer vision, and particularly relates to a crowd counting method based on an attention mechanism and a standardized dense cavity space multi-scale fusion network (Multi-Scale Fusion Network with Attention and Standard Dense Atrous Spatial Pyramid Pooling, abbreviated SDA-SFANet), applied to crowd counting and statistical monitoring in monitored public places such as cities, roads and scenic spots.
Background
In recent years, video monitoring technology has developed rapidly, and all-weather monitoring cameras now cover every corner of cities. Given the shortcomings of manual supervision, researchers have proposed combining computer vision techniques to count people, applying them to densely crowded scenes and gradually replacing manual screening and monitoring, which can greatly improve crowd detection accuracy and efficiency.
Combined with deep-learning target detection methods, video monitoring based on all-weather cameras is widely applied to intelligent video monitoring and tracking, pedestrian and traffic flow control, public-area safety maintenance and similar fields. In scenes requiring crowd counting, however, accuracy is strongly affected by varying individual scales, perspective distortion, uneven lighting and related problems. Building a deep-learning model that adapts to different scenes, as required by real-time monitoring of crowded areas, can effectively mitigate the accuracy loss caused by scale variation, occlusion and uneven illumination, output a crowd density map, and report the predicted value to the relevant supervisory personnel. Research and development of an intelligent, automatic, real-time counting algorithm for crowded places, combined with an early-warning mechanism to guarantee their safety, is therefore of great significance.
In a crowd counting implementation, a computer takes an image or video of a crowd in a scene as input, handles problems such as occlusion and illumination, perceives scenes with large scale variation, and applies a threshold for a fixed scene to judge whether the crowd is dangerously dense and to report that risk. The task essentially belongs to the segmentation and statistics tasks of computer vision: an algorithmic model outputs a crowd density map, from which the number of people in the scene is predicted. A crowd counting method must run at high detection speed while maintaining high counting accuracy, so that it meets the demands of its application sites, predicts potential risks truly in real time, eliminates existing hazards promptly, and guarantees the safety of those sites.
Disclosure of Invention
The invention aims to provide a crowd counting method based on an attention mechanism and a standardized dense cavity space multi-scale fusion network (SDA-SFANet), with higher detection speed and accuracy.
In order to achieve the above purpose, the technical scheme adopted by the invention is as follows:
A crowd counting method based on an attention mechanism and a standardized dense cavity space multi-scale fusion network (SDA-SFANet), comprising:
Step 1: acquiring a crowd counting data set and preprocessing the data during training; applying an image enhancement strategy to images whose shorter side is smaller than 512 by first resizing the shorter side to 512; cropping the image into 400×400 fixed-size blocks at random positions, then randomly flipping the image horizontally with probability 0.5, with the probability of data enhancement set to 0.3; for data sets containing gray-scale images, randomly converting color images to gray-scale with probability 0.1;
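The preprocessing of step 1 can be sketched as follows. This is a minimal NumPy sketch, not the patent's implementation: the function name `augment` is an assumption, a nearest-neighbour resize stands in for a proper interpolated resize, and the global 0.3 enhancement probability is omitted for brevity.

```python
import random
import numpy as np

def augment(img, crop=400, short=512, p_flip=0.5, p_gray=0.1):
    """img: H x W x 3 array. Resize short side to 512 if needed, take a
    random 400x400 crop, flip horizontally with p=0.5, gray with p=0.1."""
    h, w = img.shape[:2]
    s = short / min(h, w)
    if s > 1:  # shorter side below 512: scale the image up
        nh, nw = int(round(h * s)), int(round(w * s))
        yi = (np.arange(nh) * h / nh).astype(int)   # nearest-neighbour rows
        xi = (np.arange(nw) * w / nw).astype(int)   # nearest-neighbour cols
        img = img[yi][:, xi]
        h, w = nh, nw
    y = random.randint(0, h - crop)                 # random crop position
    x = random.randint(0, w - crop)
    patch = img[y:y + crop, x:x + crop]
    if random.random() < p_flip:                    # random horizontal flip
        patch = patch[:, ::-1]
    if random.random() < p_gray:                    # random gray conversion
        g = patch.mean(axis=2, keepdims=True)
        patch = np.repeat(g, 3, axis=2)
    return patch
```

Whatever the input size, the output is always a fixed 400×400×3 patch, which is what lets fixed-size batches be formed during training.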
Step 2: establishing a multi-scale-aware crowd counting network model; truncating VGG16 and VGG19 in the VGG family, selecting the first 13 convolution layers of VGG16 and the first 16 convolution layers of VGG19 as extractors of multi-scale image feature information; integrating an attention mechanism module (Convolutional Block Attention Module, abbreviated CBAM) into the VGG network; adding a multi-scale perception module to the VGG network to perform multi-scale perception on the crowd image; the multi-scale perception module comprises a context-aware module (Context-Aware Network, abbreviated CAN) and a dense cavity space pyramid pooling module (Atrous Spatial Pyramid Pooling Module, abbreviated ASPP); integrating the dense cavity space pyramid pooling module into the VGG network reduces the mean absolute error (Mean Absolute Error, abbreviated MAE) of the system by about 3% and improves performance in scenes with large scale variation, especially sparse scenes; thereby establishing a multi-scale fusion network (SDA-SFANet) model integrating the attention mechanism and the dense cavity space;
Step 3: according to the characteristics of crowd counting, introducing a hybrid dilated convolution (Hybrid Dilated Convolution, abbreviated HDC) standardized design to optimize and improve the dense cavity space pyramid pooling module, using a standardized dense cavity space pyramid pooling module (Standard Dense Atrous Spatial Pyramid Pooling Module, abbreviated SDASPP) with the void rate combination [2,3,7,17,2,3,7,17] to eliminate the grid effect produced when cavity convolutions are combined, so that pixels which would otherwise never participate in the computation can do so and local information is no longer lost; this yields the standardized dense cavity space pyramid pooling module (SDASPP module) and, in turn, an improved multi-scale fusion network (SDA-SFANet) model integrating the attention mechanism and the standardized dense cavity space;
Step 4: training the improved multi-scale fusion network (SDA-SFANet) model of step 3, which fuses the attention mechanism and the standardized dense cavity space pooling module, using the collected ShanghaiTech and UCF-QNRF data sets;
Step 5: up-sampling the output feature map of the SDASPP module by a factor of 2 using bilinear interpolation, then concatenating it with the output feature map of the context-aware module; passing the concatenated feature map through 1×1×256 and 3×3×256 convolution layers and concatenating it with conv3-3 and the up-sampled features; up-sampling the feature map from the SDASPP module by a factor of 4 before the 1×1×128 and 3×3×128 convolution layers; then up-sampling the 128 fused features by a factor of 2, concatenating them with conv2-2, and passing the output feature maps through 1×1×64, 3×3×64 and 3×3×32 convolution layers in cascade to obtain the final density map; because 3 up-sampling layers are used, the SDA-SFANet model recovers a high-resolution feature map at 1/2 of the original input size;
Step 6: generating the ground truth D(x) for the final density map output by the cascading and up-sampling of step 5, using a Gaussian method with a fixed-standard-deviation kernel, and adaptively determining a spread parameter δ for each person in the picture from the average distance between that person and his neighbours; the final density map is constructed by convolution with a Gaussian kernel, a process expressed as:
D(x) = Σ_{i=1}^{C} δ(x − x_i) * G_σ(x)    (1)
where C represents the number of people, G_σ represents the Gaussian kernel and x_i represents a head annotation;
Step 7: calculating the loss of the predicted crowd count using a Bayesian loss function L_Bayes and feeding it back to the multi-scale fusion network model based on the attention mechanism and the standardized dense cavity space; the Bayesian loss function L_Bayes can be expressed as:
L_Bayes = Σ_{n=1}^{N} F(1 − E[c_n])    (2)
where L_Bayes represents the Bayesian loss function, F represents the mean absolute error loss, N is the number of head annotations and E[c_n] represents the expected count for the n-th annotation;
Step 8: repeating steps 5, 6 and 7, continuously training the network and optimizing its parameters until a model whose crowd counting predictions are satisfactory is obtained.
Step 9: population counts were performed using the model obtained in step 8.
By adopting the above technical scheme, the crowd counting method based on the attention mechanism and the standardized dense cavity space multi-scale fusion network (SDA-SFANet) has the following beneficial effects:
Aiming at the limitations of traditional methods on high-resolution images, the invention uses a crowd counting algorithm based on a convolutional neural network to obtain higher accuracy and better robustness. The research revolves around the problems of overlapping occlusion, uneven lighting and differing scales in crowd counting; how to prevent the accuracy of a crowd counting algorithm from declining in sparse scenes while enlarging the receptive field of the convolution kernel and providing more feature diversity is the key point of this research. The network is locally optimized and adjusted for different scenes and requirements, and a simplified network is provided for resource-constrained conditions to meet the needs of practical applications.
Compared with the prior art, the invention remedies the defects and shortcomings of existing crowd counting methods mainly in the following three aspects, with the corresponding beneficial effects:
(1) The SDA-SFANet model is designed and implemented, and a standardized dense cavity space pyramid pooling module (SDASPP module) replaces the original cavity space pyramid pooling module; for the same void rate combination, densely connected cavity convolutions provide a larger receptive field and more feature diversity than parallel cavity convolutions, and in sparse scenes they provide higher accuracy and better performance under uneven illumination, occlusion and similar conditions; combined with the advantage that a feature pyramid outputs a fixed-size result regardless of the input image size, this makes the dense cavity space pyramid pooling module better suited to sparse crowd counting scenes; the up-sampling operation, combined with the multi-scale information provided by the VGG extractor and the context-aware module, yields a density map and a predicted value based on the multi-scale convolutional neural network output, effectively mitigating the influence of differing individual scales, perspective distortion and related problems, and achieving better performance on high-resolution images;
(2) Aiming at the grid effect that arises when the void rates in a cavity convolution combination are poorly designed, the invention introduces the HDC standardized design to adjust the dense cavity space and its void rate combination, proposing the combination [2,3,7,17,2,3,7,17]; it satisfies the HDC requirements that the void rates share no common divisor greater than 1 and that the maximum distance between two non-zero points of the second layer is smaller than the convolution kernel size, and the void rates are arranged in a saw-tooth pattern, effectively avoiding the grid effect of the traditional dense cavity space pyramid pooling module; dense cavity space pyramid pooling modules with different void rate combinations are designed for different scenes, improving accuracy and running time in a targeted manner;
(3) Because of the notable effect of attention mechanisms in neural networks, an attention mechanism module (CBAM) is added to the VGG extractor, applying attention over both channels and spatial positions; the network thus allocates its limited resources efficiently, pays more attention to the important channels and spatial positions, and selects the most critical information from the many signals in the network before input to the context-aware module and the dense cavity space pyramid pooling module, improving both the accuracy and the efficiency of the network.
Repeated experimental tests show that the invention achieves higher detection speed and counting accuracy.
Drawings
FIG. 1 is a schematic diagram of a crowd counting network model of the present invention;
In the figure: 1 is the input crowd image; 2, 3, 4, 5 and 6 are convolution parts of the VGG network with convolution stride 1, where 2, 3, 4 and 5 each end in a pooling layer of stride 2 and 6 contains only convolution layers with no pooling layer; 7 is the attention mechanism module (CBAM); 8 is the context-aware module; 9 is the SDASPP module; 10, 12 and 14 are up-sampling operations; 11 is the first connection block, consisting of conv1×1×256 and conv3×3×256; 13 is the second connection block, consisting of conv1×1×128 and conv3×3×128; 15 is the third connection block, consisting of conv1×1×64, conv3×3×64 and conv3×3×32; and 16 is the final crowd density map and predicted output.
FIG. 2 is a schematic diagram of the network architecture of the attention mechanism module (CBAM);
FIG. 3 is a schematic diagram of a network architecture of a context awareness module;
FIG. 4 is a network architecture schematic of SDASPP modules;
Fig. 5 is an effect diagram of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings.
The invention provides a crowd counting method based on an attention mechanism and a standardized dense cavity space multi-scale fusion network (SDA-SFANet). A schematic diagram of the crowd counting network model is shown in figure 1, in which: the attention mechanism module (CBAM) 7 has the network structure shown in figure 2; the context-aware module 8 has the network structure shown in figure 3; and the SDASPP module 9 has the network structure shown in figure 4. The method comprises the following steps:
Step 1: acquire a crowd counting data set from actual scenes and preprocess the data during training using an image enhancement strategy: if the shorter side of an image is smaller than 512, first resize that side to 512; crop the image at a random position into a fixed-size block, i.e. a 400×400 image block; then randomly flip the image horizontally with probability 0.5, with the probability of data enhancement set to 0.3. For data sets containing gray-scale images, randomly convert color images to gray-scale with probability 0.1. Because the images in the UCF-QNRF data set are large, they are resized to a fixed 1024×768 before any data enhancement, for both training and testing; the strategies adopted are small-range random resizing, image cropping, horizontal flipping and gamma adjustment.
Step 2: establish the multi-scale-aware crowd counting network model. First, the image is fed into an encoder to learn useful high-level semantics; the feature map is then sent to a multi-scale perception module to highlight the multi-scale features of the target objects and their context. An attention mechanism module (CBAM), adding attention over channels and space, is connected after layer 7 of VGG16 or layer 9 of VGG19. The SDA-SFANet architecture has two multi-scale perception modules: the one connected, after integration with the attention mechanism module, to layer 13 of VGG16 or layer 16 of VGG19 is the standardized dense cavity space pyramid pooling module (SDASPP module), and the one connected to layer 10 of VGG16 or layer 12 of VGG19 is the context-aware module (CAN). Finally, the decoder path fuses the multi-scale features into a density map using cascading and bilinear up-sampling. Because an attention mechanism is used before the final density map is generated, the crowd is more easily segmented from the background; the mechanism thus separates the noisy background and focuses the network model on the region of interest. Except for the last convolution layer, which predicts the final density map, every convolution layer applies batch normalization and a ReLU activation function. A multi-scale fusion network model based on the attention mechanism and the dense cavity space is thereby established.
Step 3: according to the characteristics of crowd counting, introduce the HDC standardized design to improve the dense cavity space pyramid pooling module, eliminating the grid effect produced when cavity convolutions are combined, allowing pixels which would otherwise never participate in the computation to do so, and avoiding the loss of local information; this yields the standardized dense cavity space pyramid pooling module, and in turn the improved multi-scale fusion network model integrating the attention mechanism and the standardized dense cavity space. Following the HDC requirements, cavity convolutions with different void rates are combined and used alternately, which to a certain extent resolves the grid problems in which some pixels never participate in the computation and local information is lost. The HDC standardized design strategy must satisfy the following three points:
1. the void ratio of the void convolution combination cannot have a common divisor greater than 1;
2. the maximum distance between two non-zero points of the second layer in the cavity convolution combination must be smaller than or equal to the convolution kernel size; the maximum-distance formula between non-zero points is:
MAX(DT_n) = MAX[DT_{n+1} − 2d_n, DT_{n+1} − 2(DT_{n+1} − d_n), d_n]    (3)
where DT_n is the distance between two non-zero points of the n-th layer and d_n is the void rate of the n-th layer's cavity convolution;
3. The void fraction is set to be zigzag.
According to these three conditions and the characteristics of crowd counting, the void rate combination of the SDASPP module is designed as [2,3,7,17,2,3,7,17], which avoids the grid effect and excessive zero padding and improves network accuracy and running time.
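The three HDC conditions above can be checked programmatically. The sketch below (the helper name `hdc_ok` is an assumption) applies rule 1 and the maximum-distance recurrence of formula (3) to one period of the proposed combination; the saw-tooth arrangement (rule 3) is left to the designer.

```python
from functools import reduce
from math import gcd

def hdc_ok(rates, kernel=3):
    """Check a void-rate combination against the HDC design rules."""
    # Rule 1: the rates may share no common divisor greater than 1.
    if reduce(gcd, rates) > 1:
        return False
    # Rule 2: compute the maximum distance between two non-zero points,
    # top-down via M_n = max[M_{n+1} - 2*d_n, 2*d_n - M_{n+1}, d_n],
    # down to the second layer; it must not exceed the kernel size.
    m = rates[-1]
    for d in reversed(rates[1:-1]):
        m = max(m - 2 * d, 2 * d - m, d)
    return m <= kernel

# One period of the SDASPP combination satisfies the rules:
print(hdc_ok([2, 3, 7, 17]))   # True
# A combination with common divisor 2 exhibits the grid effect:
print(hdc_ok([2, 4, 8]))       # False
```

For [2, 3, 7, 17] the recurrence gives M_3 = 7 and M_2 = 3 ≤ 3, matching the requirement stated above for a 3×3 kernel.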
Step 4: the improved multi-scale fusion network model of step 3, which fuses the attention mechanism and the standardized dense cavity space, is trained using the collected ShanghaiTech and UCF-QNRF data sets; the model is trained with the Adam optimizer wrapped in Lookahead, because this combination shows a faster convergence rate than the standard Adam optimizer, and experiments demonstrate that it can improve the performance of the model.
Step 5: in the SDA-SFANet network structure, the output feature map of the SDASPP module is first up-sampled by a factor of 2 using bilinear interpolation and then concatenated with the output feature map of the CAN; the concatenated feature map next passes through 1×1×256 and 3×3×256 convolution layers; the fused features are likewise up-sampled by a factor of 2 and concatenated with conv3-3 and the up-sampled features; the feature map from the SDASPP module is up-sampled by a factor of 4 to form a skip connection before the 1×1×128 and 3×3×128 convolution layers; this skip connection helps the network retain the multi-scale features learned from the high-level image representation; finally, the 128 fused features are up-sampled by a factor of 2, concatenated with conv2-2, and passed through 1×1×64, 3×3×64 and 3×3×32 convolution layers in turn. Since 3 up-sampling layers are used, the model can retrieve a high-resolution feature map, i.e. the final density map, at 1/2 of the original input size.
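As a quick consistency check of the decoder geometry described above, the arithmetic below assumes feature strides of 1/16 for the SDASPP output, 1/8 for the CAN output, 1/4 for conv3-3 and 1/2 for conv2-2 (an inference from the pooling layers listed in figure 1, not a statement from the patent), and shows how three ×2 up-sampling layers recover 1/2 of the input size.

```python
def decoder_output_size(h, w):
    """Trace the spatial stride through the SDA-SFANet decoder sketch."""
    stride = 16                 # assumed SDASPP output stride (after 4 poolings)
    for _ in range(3):          # three x2 bilinear up-sampling layers
        stride //= 2            # 1/16 -> 1/8 -> 1/4 -> 1/2
    return h // stride, w // stride

print(decoder_output_size(1024, 768))  # (512, 384): half the input size
# the SDASPP skip connection joins the 1/4-scale stage, hence the x4 factor:
print(16 // 4)                          # 4
```

Under these assumed strides, the ×4 skip-connection factor is exactly what is needed to bring the 1/16-scale SDASPP features to the 1/4-scale stage where conv3-3 is concatenated.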
Step 6: to generate the ground truth D(x), the method processes the density map output by the cascading and up-sampling of step 5 with a Gaussian method using a fixed-standard-deviation kernel. The spread parameter δ of each person is adaptively determined from the average distance between that person and his neighbours in the picture. Because the adaptive Gaussian kernel can generate an accurate final crowd density map even when the perspective of the input image is unknown, much information is retained even under severe occlusion, producing a comparatively accurate crowd density map; an effect diagram of the invention is shown in figure 5. This is the advantage of using adaptive Gaussian kernels to generate the final crowd density map.
Assuming a head annotation at pixel x_i, denoted δ(x − x_i), the final density map can be constructed by convolution with a Gaussian kernel; where C represents the number of people and G_σ represents the Gaussian kernel, the process can be expressed as:
D(x) = Σ_{i=1}^{C} δ(x − x_i) * G_σ(x)    (1)
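Formula (1) can be sketched in NumPy as below. The fixed sigma, window size and helper names are illustrative assumptions; the adaptive variant described above would vary sigma per head according to neighbour distances.

```python
import numpy as np

def gaussian_kernel(size, sigma):
    """A size x size Gaussian kernel, normalized so each head adds 1."""
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    k = np.exp(-(xx ** 2 + yy ** 2) / (2.0 * sigma ** 2))
    return k / k.sum()

def density_map(shape, heads, sigma=4.0, ksize=15):
    """Place a normalized Gaussian at each head annotation x_i."""
    H, W = shape
    D = np.zeros(shape)
    k = gaussian_kernel(ksize, sigma)
    r = ksize // 2
    for y, x in heads:
        # clip the kernel window at the image borders
        y0, y1 = max(0, y - r), min(H, y + r + 1)
        x0, x1 = max(0, x - r), min(W, x + r + 1)
        D[y0:y1, x0:x1] += k[r - (y - y0):r + (y1 - y), r - (x - x0):r + (x1 - x)]
    return D

D = density_map((100, 100), [(30, 30), (50, 70), (80, 20)])
print(round(float(D.sum()), 6))  # 3.0: the map integrates to the head count C
```

The key property, used by the counting step, is that the density map integrates to the number of annotated heads (up to border truncation).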
Step 7: calculate the loss of the predicted crowd count using the Bayesian loss function L_Bayes and feed it back to the multi-scale perception network fusing the attention mechanism and the dense cavity space; for the probability that a head exists at a given spatial position, the inner product of the probability vector and the density map gives the expected number of people in the whole picture; the Bayesian loss function L_Bayes can be expressed as:
L_Bayes = Σ_{n=1}^{N} F(1 − E[c_n])    (2)
where L_Bayes represents the Bayesian loss function, F represents the mean absolute error loss, N is the number of head annotations and E[c_n] represents the expected count for the n-th annotation.
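The Bayesian loss of formula (2) can be sketched as follows. The posterior construction with a Gaussian likelihood per annotated head follows the description above (probability vector per pixel, inner product with the density map), while the value of sigma, the absence of a background term and the helper name are illustrative assumptions.

```python
import numpy as np

def bayesian_loss(density, points, sigma=8.0):
    """Sum over heads of F(1 - E[c_n]), with F the absolute error."""
    H, W = density.shape
    ys, xs = np.mgrid[0:H, 0:W]
    pix = np.stack([ys.ravel(), xs.ravel()], 1).astype(float)    # (M, 2)
    pts = np.asarray(points, dtype=float)                        # (N, 2)
    d2 = ((pts[:, None, :] - pix[None, :, :]) ** 2).sum(-1)      # (N, M)
    lik = np.exp(-d2 / (2.0 * sigma ** 2))                       # Gaussian likelihood
    post = lik / lik.sum(axis=0, keepdims=True)                  # p(y_n | x_m)
    expected = post @ density.ravel()                            # E[c_n]
    return float(np.abs(1.0 - expected).sum())                   # sum_n F(1 - E[c_n])

# A density map putting exactly one unit of mass at each head gives ~0 loss:
D = np.zeros((50, 50))
D[10, 10] = D[40, 40] = 1.0
print(bayesian_loss(D, [(10, 10), (40, 40)]) < 1e-4)  # True
```

Because each pixel's density mass is softly assigned to the nearest annotations, the loss supervises a count per head rather than a pixel-wise density target.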
Step 8: continuously repeat steps 5, 6 and 7, optimizing the network and its parameters until a model with good crowd counting predictions is obtained.
Step 9: population counts were performed using the model obtained in step 8.
For the performance evaluation indices, the method uses two criteria, Mean Absolute Error (MAE) and Mean Squared Error (MSE), to compare the predicted value accumulated from the network's output density map with the true value. The MAE and MSE formulas are as follows:
MAE = (1/N) Σ_{i=1}^{N} |y_i − ŷ_i|    (4)
MSE = sqrt( (1/N) Σ_{i=1}^{N} (y_i − ŷ_i)^2 )    (5)
where y_i represents the true count of the i-th test image and ŷ_i represents the predicted count.
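The two evaluation indices can be computed as below; note that, following the common convention of crowd counting evaluation, the quantity reported as "MSE" is taken here as the root of the mean squared error (an assumption about the intended formula).

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error between true and predicted counts."""
    y, p = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.mean(np.abs(y - p)))

def mse(y_true, y_pred):
    """Root of the mean squared error, reported as 'MSE' by convention."""
    y, p = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y - p) ** 2)))

print(mae([100, 200, 300], [110, 190, 300]))  # ≈ 6.667
print(mse([100, 200, 300], [110, 190, 300]))  # ≈ 8.165
```

MAE reflects counting accuracy, while the squared-error criterion penalizes large outliers more heavily and is commonly read as a robustness indicator.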

Claims (2)

1. A crowd counting method based on an attention mechanism and a standardized dense cavity space multi-scale fusion network, characterized in that the method comprises the following steps:
Step 1: acquiring a crowd counting data set, and preprocessing the data in the training process; processing an image with a shorter side smaller than 512 by adopting an image enhancement strategy, and firstly adjusting the shorter side of the image to 512; clipping the image into 400×400 fixed-size image blocks at random positions, then randomly and horizontally flipping the image with a probability of 0.5, and setting the probability of data enhancement to 0.3; for a dataset with gray-scale images, randomly changing the color image to a gray-scale image with a probability of 0.1;
Step 2: establishing a multi-scale perceived crowd counting network model, cutting off VGG16 and VGG19 in a VGG network, selecting a convolution layer of a front 13 layer of the VGG16 network and a convolution layer of a front 16 layer of the VGG19 network as extractors, and extracting image multi-scale characteristic information; integrating an attention mechanism module into the VGG network; adding a multi-scale sensing module into the VGG network to perform multi-scale sensing on the crowd image; the multi-scale sensing module comprises a context sensing module and a dense cavity space pyramid pooling module; a dense cavity space pyramid pooling module is integrated into a VGG network, and a multiscale integrated network model integrating an attention mechanism and a dense cavity space is established;
Step 3: according to the characteristics of crowd counting, a mixed cavity convolution standardization design is introduced to optimize and improve a dense cavity space pyramid pooling module, a standardized dense cavity space pyramid pooling module with a cavity rate combination of [2,3,7,17,2,3,7,17] is used, a grid effect generated during cavity convolution combination is solved, the problem that pixels which cannot participate in operation all the time can participate in operation and local information is lost is solved, a standardized dense cavity space pyramid pooling module is obtained, and an improved integration attention mechanism and a standardized dense cavity space multiscale fusion network model are obtained;
Step 4: training the multiscale fusion network model of the improved fused attention mechanism and standardized dense cavity space pooling module described in step 3 by utilizing the collected SHANGHAITECH dataset and UCF-QNRF dataset;
Step 5: Up-sampling the output feature map of the standardized dense cavity space pyramid pooling module by a factor of 2 using bilinear interpolation, then concatenating it with the output feature map of the context-aware module; the concatenated feature map is passed through 1×1×256 and 3×3×256 convolution layers and connected with conv3-3 and the up-sampled features; the feature map from the standardized dense cavity space pyramid pooling module is up-sampled by a factor of 4 before passing through 1×1×128 and 3×3×128 convolution layers; the 128-channel fused features are then up-sampled by a factor of 2 and connected with conv2-2, and the output feature maps are concatenated in cascade through 1×1×64, 3×3×64 and 3×3×32 convolution layers to obtain the final density map;
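The basic fusion operation of step 5 — bilinear up-sampling of a deep feature map followed by channel concatenation with a shallower one — can be sketched in numpy as follows (a minimal 2-D illustration with align-corners interpolation; function names are assumptions, and a real network would apply this per channel on tensors):

```python
import numpy as np

def bilinear_upsample(a, factor=2):
    """Bilinearly resample a 2-D map by an integer factor (align_corners style)."""
    h, w = a.shape
    H, W = h * factor, w * factor
    ys = np.linspace(0, h - 1, H)
    xs = np.linspace(0, w - 1, W)
    y0 = np.floor(ys).astype(int); y1 = np.minimum(y0 + 1, h - 1)
    x0 = np.floor(xs).astype(int); x1 = np.minimum(x0 + 1, w - 1)
    wy = (ys - y0)[:, None]
    wx = (xs - x0)[None, :]
    top = a[np.ix_(y0, x0)] * (1 - wx) + a[np.ix_(y0, x1)] * wx
    bot = a[np.ix_(y1, x0)] * (1 - wx) + a[np.ix_(y1, x1)] * wx
    return top * (1 - wy) + bot * wy

def fuse(deep, skip):
    """Up-sample each deep channel by 2 and concatenate with the skip channels."""
    up = np.stack([bilinear_upsample(c) for c in deep])   # (C1, 2H, 2W)
    return np.concatenate([up, skip], axis=0)             # (C1 + C2, 2H, 2W)
```

Corner values are preserved exactly and a constant map stays constant, which is what makes repeated 2× up-sampling safe for density maps.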
Step 6: generating groud truth D (x) the final density map of the cascade and up-sampled output in step 5 using a gaussian method with a fixed standard deviation kernel, and adaptively determining a propagation parameter δ for each person in the picture based on the average distance between each person and its neighbors; the final density map is constructed by convolution with a gaussian kernel, a process expressed as:
Wherein C represents the population, G σ represents the Gaussian kernel, and x i represents a head annotation;
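Ground-truth generation amounts to placing one normalized Gaussian per head annotation, so the density map integrates to the person count. A numpy sketch (the helper name and fixed σ are illustrative; the patent also describes choosing the spread adaptively per head from neighbour distances):

```python
import numpy as np

def density_map(shape, heads, sigma=4.0):
    """Ground-truth density: one normalized Gaussian per (row, col) head position."""
    h, w = shape
    yy, xx = np.mgrid[0:h, 0:w]
    D = np.zeros(shape)
    for (r, c) in heads:
        g = np.exp(-((yy - r) ** 2 + (xx - c) ** 2) / (2 * sigma ** 2))
        D += g / g.sum()   # normalize so each person contributes exactly 1
    return D
```

Because each kernel is renormalized over the image grid, summing the map recovers the head count C exactly, even for annotations near the image border.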
Step 7: calculating a loss value of the predicted crowd count by using a Bayesian loss function L Bayes, and feeding back the loss value to a multiscale fusion network model based on a fused attention mechanism and a standardized dense cavity space; the bayesian penalty function L Bayes is expressed as:
wherein L Bayes represents a Bayesian loss function, F represents the average absolute error loss thereat, and E [ c n ] represents the expected value thereat;
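A numpy sketch of this loss, following the standard Bayesian-loss construction (Ma et al., ICCV 2019) that the claim's notation matches — E[c_n] is the predicted density weighted by the posterior p(y_n | x_m) of each pixel belonging to head n, and F is the L1 distance. All names and the Gaussian likelihood σ are illustrative assumptions:

```python
import numpy as np

def bayes_loss(pred_density, pixel_pos, head_pos, sigma=8.0):
    """Bayesian crowd-counting loss sketch.
    pred_density: (M,) predicted density at M pixel locations
    pixel_pos:    (M, 2) pixel coordinates
    head_pos:     (N, 2) annotated head coordinates"""
    # squared distances between every pixel and every head -> (M, N)
    d2 = ((pixel_pos[:, None, :] - head_pos[None, :, :]) ** 2).sum(-1)
    lik = np.exp(-d2 / (2 * sigma ** 2))            # Gaussian likelihood
    post = lik / lik.sum(axis=1, keepdims=True)     # posterior p(y_n | x_m)
    expected = post.T @ pred_density                # E[c_n], shape (N,)
    return np.abs(1.0 - expected).sum()             # F = L1 distance to count 1
```

Each head should receive an expected count of exactly 1, so an all-zero prediction incurs a loss equal to the number of annotated people.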
Step 8: repeating the step 5, the step 6 and the step 7, continuously optimizing the network, and optimizing the parameters until a model satisfactory to the crowd counting result prediction is obtained;
Step 9: population counts were performed using the model obtained in step 8.
2. The crowd counting method based on the attention mechanism and the standardized dense cavity space multi-scale fusion network according to claim 1, characterized in that:
In step 3, the hybrid cavity convolution standardization design must satisfy the following three conditions:
1) The cavity rates in the cavity convolution combination cannot share a common divisor greater than 1;
2) The maximum distance between two non-zero points at the second layer of the cavity convolution combination must be smaller than or equal to the convolution kernel size; the maximum distance between non-zero points satisfies:

MAX(DT_n) = MAX[DT_{n+1} − 2d_n, DT_{n+1} − 2(DT_{n+1} − d_n), d_n] (3)

wherein DT_n is the maximum distance between two non-zero points at the n-th layer, and d_n is the cavity rate of the n-th layer;
3) The cavity rates are arranged in a zigzag (sawtooth) pattern;
according to the three conditions that the hybrid cavity convolution standardization design must satisfy, the cavity-rate combination of the SDASPP (standardized dense cavity space pyramid pooling) module is designed as [2,3,7,17,2,3,7,17] in accordance with the characteristics of crowd counting.
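The first two conditions can be checked mechanically; the sketch below (helper names are illustrative) implements formula (3) with DT_N = d_N for the last layer, iterating down to the second layer, and verifies the combination [2,3,7,17] for a 3×3 kernel:

```python
from math import gcd
from functools import reduce

def dt2(rates):
    """DT_2 from formula (3):
    MAX(DT_n) = MAX[DT_{n+1} - 2*d_n, DT_{n+1} - 2*(DT_{n+1} - d_n), d_n],
    starting from DT_N = d_N and recursing down to the second layer."""
    DT = rates[-1]
    for d in reversed(rates[1:-1]):
        DT = max(DT - 2 * d, DT - 2 * (DT - d), d)
    return DT

def satisfies_hdc(rates, kernel=3):
    # condition 1: no common divisor greater than 1 among the cavity rates
    # condition 2: DT_2 must not exceed the kernel size
    # (condition 3, the zigzag arrangement, is a layout choice not checked here)
    return reduce(gcd, rates) == 1 and dt2(rates) <= kernel
```

For [2,3,7,17] the recursion gives DT_3 = max(17−14, −3, 7) = 7 and DT_2 = max(7−6, −1, 3) = 3 ≤ 3, so the combination passes; a combination such as [1,2,9] fails condition 2 and [2,4,8] fails condition 1.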
CN202111358807.9A 2021-11-17 2021-11-17 Crowd counting method based on attention mechanism and standardized dense cavity space multi-scale fusion network Active CN114155210B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111358807.9A CN114155210B (en) 2021-11-17 2021-11-17 Crowd counting method based on attention mechanism and standardized dense cavity space multi-scale fusion network


Publications (2)

Publication Number Publication Date
CN114155210A CN114155210A (en) 2022-03-08
CN114155210B true CN114155210B (en) 2024-04-26

Family

ID=80456724

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111358807.9A Active CN114155210B (en) 2021-11-17 2021-11-17 Crowd counting method based on attention mechanism and standardized dense cavity space multi-scale fusion network

Country Status (1)

Country Link
CN (1) CN114155210B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114898284B (en) * 2022-04-08 2024-03-12 西北工业大学 Crowd counting method based on feature pyramid local difference attention mechanism
CN115359271B (en) * 2022-08-15 2023-04-18 中国科学院国家空间科学中心 Large-scale invariance deep space small celestial body image matching method
CN117876824B (en) * 2024-03-11 2024-05-10 华东交通大学 Multi-modal crowd counting model training method, system, storage medium and equipment

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110942015A (en) * 2019-11-22 2020-03-31 上海应用技术大学 Crowd density estimation method
CN111242036A (en) * 2020-01-14 2020-06-05 西安建筑科技大学 Crowd counting method based on encoding-decoding structure multi-scale convolutional neural network
CN112885464A (en) * 2021-03-12 2021-06-01 华东师范大学 Internal nasal disease real-time auxiliary diagnosis and treatment system based on Att-Res2-CE-Net


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Building recognition in high-resolution remote sensing images based on deep learning; Song Tingqiang; Li Jixu; Zhang Xinye; Computer Engineering and Applications; 2020-12-31 (No. 08); full text *



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant