CN113887473A

CN113887473A - Improved normalized deformable convolution population counting method

Info

Publication number: CN113887473A
Application number: CN202111204377.5A
Authority: CN
Inventors: 吕伟刚; 仲芯; 覃静; 张树刚
Original assignee: Ocean University of China
Current assignee: Ocean University of China
Priority date: 2021-10-15
Filing date: 2021-10-15
Publication date: 2022-01-04
Anticipated expiration: 2041-10-15
Also published as: CN113887473B

Abstract

The invention belongs to the technical field of computer vision, and particularly relates to a method for estimating the number of people by using a convolutional neural network. A method for population counting based on improved normalized deformable convolution comprising: constructing a normalized deformable convolution neural network; and constraining the position of the characteristic graph sampling point of the input image by utilizing the normalized deformable convolution neural network to obtain the accurate human head characteristic. The invention provides a normalized deformable convolution (NDConv), which limits the offset of a sampling point to a certain extent, can obtain information in an effective area of the sampling point, does not increase extra calculated amount, and improves the accuracy of the neural network prediction crowd counting.

Description

Improved normalized deformable convolution population counting method

Technical Field

The invention belongs to the technical field of computer vision, and particularly relates to a method for estimating the number of people by using a convolutional neural network.

Background

With the rapid expansion of the world population and the rapid progression of urbanization, the likelihood of population crowding has increased dramatically. In some casesNext, the stepping event in a large-scale activity may cause harm to public safety, and a prerequisite for preventing public safety crisis caused by an excessively large number of people is to monitor the number of people and keep the scale thereof within a certain range. The population has been counted in some studies^[1]-[5]The above idea is implemented by estimating the number of people in a public area (such as a campus, a shopping mall, and a train station). Moreover, remote intelligent camera equipment^[6]-[8]The improvement of the advancement provides a good hardware foundation for the research related to the population counting. Thus, people counting has gained a wide range of attention in the field of computer vision.

Dense crowd has the characteristics of irregular shape, high noise, partial shielding and the like, so that the crowd still has a larger promotion space in the aspects of accurately counting and constructing data sets of different scenes and people numbers. At present, population counting algorithms can be divided into three categories: method based on detection^[9]-[10]Feature-based regression method^[11]-[12]And a method based on a convolutional neural network^[13]-[15]. In contrast, the convolutional neural network-based approach performs better in terms of computational accuracy, training efficiency, and robustness. Although the research of population counting by using the convolutional neural network has made great progress, the depth model in the network cannot adapt to the geometric change of the scale, and is still an obstacle to the improvement of population counting precision. Method based on multi-scale feature fusion^[16]-[17]The above problems are effectively solved, however, there are some limitations. Firstly, the network parameters and the computation time are significantly increased and the training efficiency is greatly reduced. Furthermore, these methods do not effectively solve the convolution geometry transformation problem.

To address the above problems, Microsoft Asian institute has proposed a deformable convolution (Deformable convolution)^[18]To enhance the ability of convolutional neural networks to model geometric transformations, it can model continuous scale features to handle population counting problems. Then, they improve the deformable convolution by providing the network with additional convolution layer and a method of adding weight to offset for accurate feature extraction, and propose a deformable convolution v2 version (deformable convolution v2)^[19]More advanced performance is achieved. However, it is difficult to sample the crowd-sourcing feature directly and uniformly, since neither version of the deformable convolution can control the offset of the sample points. Thus, there is still room for great improvement in the performance of deformable convolution. Based on the foregoing discussion, the present invention recognizes that it is difficult for a deformable convolutional network to uniformly sample and collect rich information. Because the offset is not restricted, the shape of the sampling area is not a regular graph any more, so that the head characteristics of the crowd can not be uniformly sampled, and the accurate prediction of the number of people in the crowd can not be realized.

Disclosure of Invention

The invention aims to provide a population counting method based on improved normalized deformable convolution, which realizes the uniform distribution of sampling points in the deformable convolution and controls the offset of the sampling points by adding normalized deformable loss in the deformable convolution, thereby obtaining more complete human head characteristic information and finally realizing the remarkable improvement of performance.

In order to achieve the purpose, the invention adopts a technical scheme that: a method for population counting based on improved normalized deformable convolution comprising: constructing a normalized deformable convolution neural network; and constraining the position of the characteristic graph sampling point of the input image by utilizing the normalized deformable convolution neural network to obtain the accurate human head characteristic.

Further, the normalized deformable convolution neural network mainly consists of a modified VGG-16 network; 5 layers of expansion convolution and the last layer of normalized deformable convolution are arranged before the pooling layer of the improved VGG-16 network; the normalized deformable convolution uses the following loss function:

wherein the content of the first and second substances,

which represents the total loss of training,

is the loss of density;

to normalize the deformable losses; λ is a regularization coefficient, and the value range is (0, 1).

Further, the step of calculating the normalized deformable loss comprises:

(1) constraining the positions of the center sample point E, the horizontal sample point D, F, the vertical sample point B, H and the diagonal sample point A, C, G, I of the feature map obtained by convolution:

for the center sample point E, the loss equation is:

wherein, Delta E_x、ΔE_yRepresenting the offset of the central sampling point E in the horizontal direction and the vertical direction relative to the sampling point E before offset;

for a horizontal sampling point, the loss formula is:

wherein, Δ D_x、ΔD_yRepresenting the offset amount of the horizontal sampling point D in the horizontal direction and the vertical direction relative to the sampling point D before offset; Δ F_x、ΔF_yRepresenting the offset amount of the horizontal sampling point F in the horizontal direction and the vertical direction relative to the sampling point F before offset;

for a vertical sampling point, the loss formula is:

wherein, Delta B_x、ΔB_yRepresenting the offset amount of the horizontal sampling point B in the horizontal direction and the vertical direction relative to the sampling point B before offset; Δ H_x、ΔH_yRepresenting the offset of the horizontal sampling point H in the horizontal direction and the vertical direction relative to the sampling point H before offset;

for diagonal sampling points, the loss formula is:

wherein a, b, c, d, e, f, g, h and i respectively represent coordinates of sampling points before shifting;

(2) calculating normalized deformable loss:

further, the density loss formula is as follows:

wherein, Y_iIs a density map of the number of persons, P (I)_i(ii) a Φ) is the density map of the estimated population, and N is the batch size.

Further, the front-end convolutional layer of the improved VGG-16 network is subjected to batch normalization operations.

Compared with the prior art, the invention has the beneficial effects that:

(1) a normalized deformable convolution (NDConv) is provided, which limits the offset of a sampling point to a certain extent, can obtain all information in the effective area of the sampling point, and does not increase extra calculation amount; (2) normalized deformable convolution (NDConv) achieves better performance gains over multiple crowd count data sets than existing methods.

Drawings

Fig. 1 (a) shows an offset of a post-offset sample with respect to a pre-offset sample obtained by a conventional deformable convolution; (b) the offset of the sampling point after the offset relative to the sampling point before the offset is obtained by the normalized deformable convolution;

FIG. 2 is a schematic structural diagram of a normalized deformable convolutional neural network proposed in the present invention;

fig. 3 is a visualization picture display of part of the original picture and the mark point of the OUC _ crown data set proposed by the present invention.

Detailed Description

In order to facilitate an understanding of the invention, the invention is described in more detail below with reference to the accompanying drawings and specific examples. Preferred embodiments of the present invention are shown in the drawings. This invention may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete.

The embodiment of the invention provides an improved normalized deformable convolution-based crowd counting method, which mainly comprises the following steps:

firstly, constructing a normalized deformable convolution neural network.

The normalized deformable convolution neural network constructed by the invention mainly consists of an improved CSRNet network, wherein the CSRNet network^[20]The original front-end network model is VGG-16, a Batch Normalization layer (Batch norm) is inserted between adjacent convolutional layers of the VGG-16 network, Batch Normalization operation is carried out, and the VGG-16-BN network model is generated and named as the VGG-16-BN network.

Before the VGG-16-BN pooling layer, 6 layers of convolution are added, wherein the first 5 layers are expansion convolution, and the last layer is normalized deformable convolution, so that the normalized deformable convolution neural network is formed, and is recorded as NDConv, and the network structure is shown in figure 2.

Secondly, utilize the normalized deformable convolution neural network (NDConv) of structure to retrain the characteristic map sampling point position offset of input image, obtain more accurate people's head characteristic to promote crowd's counting accuracy, specifically do:

(1) with convolution of 3 × 3, 9 shifted samples, namely a center sample E, a horizontal sample D, F, a vertical sample B, H, and a diagonal sample A, C, G, I, are obtained for each convolution.

Fig. 1 (a) shows the offset of a post-offset sampling point relative to a pre-offset sampling point obtained by using a conventional deformable convolution; (b) the offset of the sampling point after the offset relative to the sampling point before the offset, which is obtained by adopting the normalized deformable convolution network, is shown; the shifted sampling points are represented by capital letters A-I, as indicated by dots in (a), (b) of FIG. 1; the sample points before offset are indicated by lower case letters a-i, as shown by the triangles in (a), (b) of fig. 1; the length of the arrow indicates the magnitude of the offset.

(2) And carrying out normalized constraint on sampling points in the horizontal direction and the vertical direction, and enabling the sampling points to be located on the coordinate axes as much as possible, namely B, D, F and H to be close to the coordinate axes as much as possible. For the center sample point E, it is made as close to the origin E as possible. For the diagonal sampling points A, C, G and I, the parallelogram principle is used, so that the sampling points A, C, G and I and other points can respectively form 4 parallelograms. These four parallelograms are: EBAD, EBCF, EHIF, EHGD.

The coordinates of a-i are respectively: a (-r, -r), b (0, -r), c (r, -r), d (-r,0), e (0,0), f (r,0), g (-r, r), h (-r,0), i (r, r).

The offset of point a from point a is denoted as Δ a and is expressed as: (Delta A)_x，ΔA_y) Then, the coordinates of point a are as shown in equation (1):

A_x＝-r+ΔA_x；

A_y＝-r+ΔA_y； (1)

the offset of point E from sample point E is denoted as deltae,expressed as: (Δ E)_x，ΔE_y)；

The offset of point B from sample point B is denoted Δ B and is expressed as: (Delta B)_x，ΔB_y)；

The offset of point C from sample point C is denoted Δ C and is expressed as: (Δ C)_x，ΔC_y)；

The offset of point D from sample point D is denoted as Δ D and is represented as: (Δ D)_x，ΔD_y)；

The offset of point F from sample point F is denoted as Δ F and is expressed as: (Δ F)_x，ΔF_y)；

The offset of point G from sample point G is denoted Δ G and is expressed as: (Δ G)_x，ΔG_y)；

The offset of point H from sample point H is denoted Δ H and is expressed as: (Δ H)_x，ΔH_y)；

The offset of point I from sample point I is denoted as Δ I and is expressed as: (Delta I)_x，ΔI_y)；

Thus, the normalized deformable loss (NDLoss) includes 5 parts in total:

for sampling point E, the position of E point is kept unchanged as much as possible, and the loss is shown in formula (2):

for two horizontal sampling points D, F, trying to make D and F on the horizontal axis (i.e., the vertical component of the offset is as 0 as possible), while keeping both points D and F near or far from the origin at the same distance, the penalty is given by equation (3):

for two vertical sample points B, H, trying to make B and H on the vertical axis (i.e., the horizontal component of the offset is as much as 0), while keeping both points B and H as close to or as far from the origin as possible, the penalty is given by equation (4):

for the four diagonal sampling points A, C, G, I, the diagonal points a, C, G and I are made to form four parallelograms with other points around as shown by the dotted lines, and the loss is shown in equation (5):

in formula (5), a, b, c, d, e, f, g, h, and i respectively represent coordinates of corresponding points, that is, a ═ r, -r, b ═ 0, -r, c ═ r, -r, d ═ r,0, e ═ 0, f ═ r,0, g ═ r, h ═ r,0, and i ═ r, r.

Thus, the normalized deformable convolution (NDConv) loss is shown in equation (6):

the density estimation loss is shown in equation (7):

wherein, Y_iIs a density map of the number of persons, P (I)_i(ii) a Φ) is a density map of the estimated number of people, and N is the batch size (batch size).

The final loss is shown in equation (8):

wherein, λ is a regularization coefficient, and the value range is (0, 1).

In order to examine the performance of the normalized deformable convolution neural network and the method provided by the invention, the invention adopts four methodsThe data set was trained and validated as ShanghaiTech, respectively^[5]、UCF-QRNF^[21]、UCF_CC_50^[22]And OUC _ Crowd as proposed by the present invention. The details of the experiment are described below:

1. data set and implementation details

1) Data set

The difference in performance between the normalized deformable convolution population counting method of the present invention and other existing population count prediction methods was evaluated on three datasets ShanghaiTech, UCF-QNRF and UCFCC50 commonly used for population counting, as well as the OUC _ Crowd dataset proposed by the present invention.

Shanghaitech the Shanghaitech dataset consists of part A and part B. ShanghaiTech a contains 482 pictures, unlike ShanghaiTech a, ShanghaiTech B contains many more high resolution images, with a population number from 9 to 578, and a total of 716 pictures of the population.

UCF _ QNRF: compared to the ShanghaiTech dataset, UCF _ QNRF is a population count dataset containing a larger number of pictures, containing 1535 high resolution images and 125 ten thousand head marker points. In this data set there were 1201 images of extremely crowded scenes, the rest being test images, the maximum population count in one image being 12865.

UCF _ CC _ 50: the data set contains 50 images. Although the number of images is small, the number of people per image varies greatly, from 94 to 4543. According to the existing research^[22]The data set was divided into five subsets and then 5-fold cross-validation was performed.

OUC _ Crowd: OUC _ Crowd is a data set constructed by the invention, consists of 529 Crowd images shot in different scenes of a campus, is a data set with sparse Crowd and variable scenes, and has 379 training images in total, and the rest are test images, wherein the minimum number of people in each image is 1. The data set comprises crowd pictures of different indoor and outdoor scenes such as campus roads, playgrounds, classrooms and gymnasiums, in order to ensure the marking accuracy and reduce the interference of marking errors on experimental results, after the head of each picture is marked manually, secondary inspection is carried out by utilizing a marking point visualization mode to remove missed pictures, as shown in fig. 3, the first action is an original picture, the second action is a picture marked manually, and the marking points are shown as white dots in the picture.

2) Evaluation index

Reference to reported studies^[23]-[25]The present invention adopts the Mean Absolute Error (Mean Absolute Error) and the Root Mean Square Error (Root Mean Square Error) as evaluation indexes, and the definitions of the Mean Absolute Error and the Root Mean Square Error are shown in the following formulas (9) and (10):

where N is the number of pictures, Ci and

true population counts and predicted population counts.

3) Implementation details

CSRNet^[20]The original front-end network model is VGG-16^[26]In order to further improve the accuracy of the network model training result, Batch Normalization (Batch Normalization) operation is carried out on the convolution layer in the VGG-16 to generate the VGG-16-BN network model. Next, for CSRNet consisting of VGG-16-BN^[20]The main network is modified, namely 6 layers of expansion convolutions are added before the VGG-16-BN pooling layer, and the last layer of expansion convolution is replaced by the deformable convolution. After making the above changes, the new network CSRNet becomes the performance starting point (baseline) of the experiment.

And replacing the last expansion convolution of the 6 layers of expansion convolutions in the VGG-16-BN with the normalized deformable convolution, and recording the expansion convolution as NDConv, namely the normalized deformable convolution neural network constructed by the invention.

During training Adam is used^[27]The learning rate of the optimizer is 1e-4,the pixel size of all images was adjusted to 400 × 400, the batch size (batch size) was set to 4, and the regularization parameter λ was set to 0.001. The evaluation begins at the 100 th epoch and the training process ends at the 400 th epoch.

2. Evaluation and comparison

Results of experiments on ShanghaiTech a and B data sets are shown in table 1. Compared with the existing method, the method provided by the invention has the best performance, and proves the advantage of restricting the offset in the deformable convolution. For shanghaiitecha, the method of the invention achieved an optimum mean absolute error of 61.4, a 5.5% improvement in performance compared to the starting point of performance (baseline) CSRNet. Evaluation of Shanghai TechB did not yield expectations, CFF^[28]And SPN^[29]When the prior method reaches the average absolute error of 7.2, the method of the invention obtains 7.8, but compared with CSRNet, the performance of the normalized deformable convolution reaches 13.3 percent.

Table 1: comparison of normalized Deformable convolution and results of other prior art methods in ShanghaiTechA and ShanghaiTechB

The results of the comparison between the method of the present invention and other prior art methods for the data sets UCF _ QNRF and UCF _ CC _50 are recorded in table 2. Compared to the performance starting point CSRNet, the method of the invention achieves a significant gain in mean absolute error. On both data sets UCF _ QNRF and UCF _ CC _50, mean absolute errors of 91.2 and 167.2 were obtained, respectively, achieving 4.5% and 4.2% performance improvement. It is worth mentioning that, compared with other existing methods, on the premise that the CSRNet performance is greatly improved, the method further reduces the average absolute error, and thus, the effectiveness of the normalized deformable convolution provided by the invention in constraining the offset can be proved.

Table 2: comparison of experimental results of normalized deformable convolution and other existing methods at UCF _ QNRF and UCF _ CC _50

3. Ablation experiment

In the experiment, the influence of the number of deformable convolution layers in the network on the training result is shown firstly, so that the reliability of parameter selection in the experiment is proved. On the other hand, the significance of constraint offsets and the effectiveness of normalizing deformable losses are verified compared to hard constraints.

(1) The original network expansion convolution is replaced by the deformable convolution and the normalized deformable convolution, and the influence of the replacing layer number on the network performance

In CSRNet^[20]The network structure of (1) has 6 layers of expansion convolution, and each layer of the expansion convolution is gradually replaced by deformable convolution from back to front. Each layer of the deformable convolution is then replaced with a normalized deformable convolution and the performance of the network is shown in table 3. As the number of alternative convolution layers increases, the performance of both the performance starting point CSRNet and the normalized deformable convolution decreases, with the mean absolute error result peaks at 167.2 and 184.6, respectively, and the CSRNet peak at 172.9 and 187.4 MAEs, respectively. As can be seen from table 3, the best experimental results can be obtained by replacing only the last layer of the dilation convolution.

Table 3: influence of number of layers of deformable convolution on experimental results

2) Comparison with hard constraints

To verify the effectiveness of the present invention in normalizing the deformable loss (NDloss), only the x-component of the offset on the y-axis is removed, which is referred to as the hard constraint. The performance effect of the hard constraint is then compared to the normalized deformation loss (NDloss), which results in a better performance effect than the hard constraint, as shown in table 4. For the hard constraint, the mean absolute error is found to be 96.1. Hard constraints did not have significant performance improvement compared to baseline CSRNet. The normalized deformable loss (NDloss) proposed by the present invention is demonstrated to be superior to the existing deformable convolution by comparing performance with hard constraints.

Table 4: comparison of hard constraint and NDConv Experimental results

The final experimental results are shown in table 5, the performance starting point (CSRNet) and NDConv proposed by the invention both reach a better average absolute error of 15.3, and the experimental results are not changed after many times of training and testing. This result indicates that the normalized deformable convolution NDConv still has great improvement and room for improvement in improving the performance of sparse crowd and scene-variant datasets.

Table 5: experimental results on the OUC _ Crowd data set

The invention provides a normalized deformable convolution NDConv, and a new normalized deformable loss (NDloss) plays an important role in the normalized deformable convolution (NDConv). On the premise of not increasing the calculated amount, the offset of sampling points in the deformable convolution is limited, so that the sampling points are more uniform, and the information sampling is more sufficient. The normalized deformable convolution (NDConv) has the advantages of rapid and accurate information acquisition, small information loss and the like. The experiment of the invention takes the crowd counting network as an example, and the result shows that the normalized deformable convolution (NDConv) can effectively improve the network performance and improve the accuracy of crowd counting prediction.

Reference documents:

[1]Wang Q,Gao J,Lin W,et al.NWPU-Crowd:A Large-Scale Benchmark for Crowd Counting and Localization[J].IEEE Transactions on Pattern Analysis and Machine Intelligence,2020:3013269.

[2]Mazzeo P L,Contino R,Spagnolo P,et al.MH-MetroNet—A Multi-Head CNN for Passenger-Crowd Attendance Estimation[J].Journal of Imaging,2020,6(7):62.

[3]Sindagi V A,Patel V M.Generating High-Quality Crowd Density Maps Using Contextual Pyramid CNNs[C]//2017IEEE International Conference on Computer Vision(ICCV),2017,206:1879-1888.

[4]Zan S,Yi X,Ni B,et al.Crowd Counting via Adversarial Cross-Scale Consistency Pursuit[C]//2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition(CVPR),2018:00550.

[5]Zhang Y,Zhou D,Chen S,et al.Single-Image Crowd Counting via Multi-Column Convolutional Neural Network[C]//2016 IEEE Conference on Computer Vision and Pattern Recognition(CVPR),2016,70:589-597.

[6]Feris,R.S,et al.Large-Scale Vehicle Detection,Indexing,and Search in Urban Surveillance Videos[J].Multimedia,IEEE Transactions on,2012,14(1):28-42.

[7]Wang G,Li B,Zhang Y,et al.Background Modeling and Referencing for Moving Cameras-Captured Surveillance Video Coding in HEVC[J].IEEE Transactions on Multimedia,2018:2921-2934.

[8]Ran E,Moses Y.Tracking in a Dense Crowd Using Multiple Cameras[J].International Journal of Computer Vision,2010,88(1):129-143.

[9]Brostow G J,Cipolla R.Unsupervised Bayesian Detection of Independent Motion in Crowds[C]//2006 IEEE/CVF Conference on Computer Vision and Pattern Recognition(CVPR),2006:320.

[10]Rabaud V,Belongie S.Counting Crowded Moving Objects[C]//2006 IEEE/CVF Conference on Computer Vision and Pattern Recognition(CVPR),2006:92.

[11]Chan A B,Vasconcelos N.Bayesian Poisson regression for crowd counting[C]//2010 IEEE International Conference on Computer Vision(ICCV),2010:5459191.

[12]Ke C,Chen C L,Gong S,et al.Feature Mining for Localised Crowd Counting[C]//2013 British Machine Vision Conference(BMVC),2013:1–11.

[13]Zitouni M S,Bhaskar H,Dias J,et al.Advances and trends in visual crowd analysis:A systematic survey and evaluation of crowd modelling techniques[J].Neurocomputing,2016,186:139-159.

[14]Sindagi V A,Patel V M.A Survey of Recent Advances in CNN-based Single Image Crowd Counting and Density Estimation[J].Pattern Recognition Letters,2017,107:3-16.

[15]Chong S,Ai H,Bo B.End-to-end crowd counting via joint learning local and global count[C]//2016 IEEE International Conference on Image Processing(ICIP),2016:7532551.

[16]Wang X,Lv R,Zhao Y,et al.Multi-Scale Context Aggregation Network with Attention-Guided for Crowd Counting[C]//2020 15th IEEE International Conference on Signal Processing(ICSP),2020:9321067.

[17]Zhu L,Zhang H,Ali S,et al.Crowd counting via Multi-Scale Adversarial Convolutional Neural Networks[J].Journal of Intelligent Systems,2020,30(1):180-191.

[18]Dai J,Qi H,Xiong Y,et al.Deformable Convolutional Networks[C]//2017 IEEE International Conference on Computer Vision(ICCV),2017:764-773.

[19]Zhu X,Hu H,Lin S,et al.Deformable ConvNets V2:More Deformable,Better Results[C]//2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition(CVPR),2019:00953.

[20]Li Y,Zhang X,Chen D.CSRNet:Dilated Convolutional Neural Networks for Understanding the Highly Congested Scenes[C]//2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition(CVPR),2018:00120.

[21]Idrees H,Tayyab M,Athrey K,et al.Composition Loss for Counting,Density Map Estimation and Localization in Dense Crowds[J].Lecture Notes in Computer Science,2018:544-559.

[22]Idrees H,Saleemi I,Seibert C,et al.Multi-Source Multi-Scale Counting in Extremely Dense Crowd Images[C]//2013 IEEE Conference on Computer Vision and Pattern Recognition(CVPR),2013:329.

[23]Yang Y,Li G,Du D,et al.Embedding Perspective Analysis Into Multi-Column Convolutional Neural Network for Crowd Counting[J].IEEE Transactions on Image Processing,2021,30:1395-1407.

[24]Yan Z,Y Yuan,Zuo W,et al.Perspective-Guided Convolution Networks for Crowd Counting[C]//2019 IEEE/CVF International Conference on Computer Vision(ICCV),2019:00104.

[25]Liu N,Long Y,Zou C,et al.ADCrowdNet:An Attention-Injective Deformable Convolutional Network for Crowd Understanding[C]//2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition(CVPR),2019:3220-3229.

[26]Simonyan K,Zisserman A.Very Deep Convolutional Networks for Large-Scale Image Recognition[J].Computer Science,arXiv preprint arXiv:1409.1556,2014.

[27]Kingma D,Ba J.Adam:AMethod for Stochastic Optimization[J].Computer Science,2014:273-297.

[28]Shi Z,Mettes P,Snoek C.Counting with Focus for Free[C]//2019 IEEE International Conference on Computer Vision(ICCV),2019:00430.

[29]Xu C,Qiu K,Fu J,et al.Learn to Scale:Generating Multipolar Normalized Density Maps for Crowd Counting[C]//2019 IEEE International Conference on Computer Vision(ICCV),2019:00847.

[30]Cao X,Wang Z,Zhao Y,et al.Scale Aggregation Network for Accurate and Efficient Crowd Counting[J].Lecture Notes in Computer Science,2018:757-773.

[31]Jiang X,Xiao Z,Zhang B,et al.Crowd Counting and Density Estimation by Trellis Encoder-Decoder Networks[C]//2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition(CVPR),2019:00629.

[32]Liu Y,Shi M,Zhao Q,et al.Point in,Box Out:Beyond Counting Persons in Crowds[C]//2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition(CVPR),2019:00663.

[33]Jiang X,Zhang L,Zhang T,et al.Density-Aware Multi-Task Learning for Crowd Counting[J].IEEE Transactions on Multimedia,2020:2980945.

[34]Liu L,Jiang J,Jia W,et al.DENet:AUniversal Network for Counting Crowd with Varying Densities and Scales[J].IEEE Transactions on Multimedia,2020:2992979.

[35]Krizhevsky Alex,Sutskever I,Hinton G.ImageNet Classification with Deep Convolutional Neural Networks[J].Communications of the ACM,2017,60(6):84-90.

Claims

1. a population counting method based on improved normalized deformable convolution is characterized by comprising the following steps: constructing a normalized deformable convolution neural network; and constraining the position of the characteristic graph sampling point of the input image by utilizing the normalized deformable convolution neural network to obtain the accurate human head characteristic.

2. The improved normalized deformable convolution population counting based method of claim 1, wherein the normalized deformable convolution neural network consists essentially of an improved VGG-16 network; 5 layers of expansion convolution and the last layer of normalized deformable convolution are arranged before the pooling layer of the improved VGG-16 network; the normalized deformable convolution uses the following loss function to constrain the training loss:

wherein the content of the first and second substances,

which represents the total loss of training,

is the loss of density;

3. The improved normalized deformable convolution based population counting method of claim 2, wherein the normalized deformable loss is calculated by the steps of:

for the center sample point E, the loss equation is:

for a horizontal sampling point, the loss formula is:

for a vertical sampling point, the loss formula is:

for diagonal sampling points, the loss formula is:

(2) calculating normalized deformable loss:

4. the improved normalized deformable convolution based population counting method of claim 2, wherein the density loss formula is:

wherein, Y_iIs a density map of the number of real persons, P (I)_i(ii) a Φ) is the density map of the estimated population, and N is the batch size.

5. The improved normalized deformable convolution population counting based method according to any one of claims 2-4, wherein the front-end convolution layer of the improved VGG-16 network is subjected to a batch normalization operation.