CN107330390B - People counting method based on image analysis and deep learning - Google Patents


Info

Publication number
CN107330390B
CN107330390B (application CN201710492597.XA)
Authority
CN
China
Prior art keywords
image
head
window
shoulder
hog
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710492597.XA
Other languages
Chinese (zh)
Other versions
CN107330390A (en)
Inventor
黄建华
俞启尧
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Nuclear Furstate Software Technology Co ltd
Original Assignee
Shanghai Nuclear Furstate Software Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Nuclear Furstate Software Technology Co ltd filed Critical Shanghai Nuclear Furstate Software Technology Co ltd
Priority to CN201710492597.XA priority Critical patent/CN107330390B/en
Publication of CN107330390A publication Critical patent/CN107330390A/en
Application granted granted Critical
Publication of CN107330390B publication Critical patent/CN107330390B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/50 Context or environment of the image
    • G06V20/52 Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • G06V20/53 Recognition of crowd images, e.g. recognition of crowd congestion
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06M COUNTING MECHANISMS; COUNTING OF OBJECTS NOT OTHERWISE PROVIDED FOR
    • G06M11/00 Counting of objects distributed at random, e.g. on a surface
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a people counting method based on image analysis and deep learning, which comprises the following steps: A. performing pyramid model calculation on an input image to generate images at multiple resolutions and sizes; B. sliding a window over each layer of the pyramid, computing the HOG feature of the window area, classifying it with a linear SVM classifier, and judging whether the window is a head-shoulder region; C. for each head-shoulder region given by step B, extracting the corresponding image, normalizing it to a set common size, and inputting it into a deep neural network to obtain a classification output; D. performing non-maximum suppression on all head-shoulder windows output by step C to merge detections that overlap across adjacent positions and scales. The invention overcomes the defects of the prior art and achieves high people-counting performance at high speed.

Description

People counting method based on image analysis and deep learning
Technical Field
The invention relates to the technical field of image target recognition and deep learning, in particular to a people counting method based on image analysis and deep learning.
Background
Using computer vision to count people in surveillance images or videos can be widely applied in scenarios such as stampede early warning, traffic dispersion, shop pedestrian-flow assessment and attendance counting. However, existing people counting systems have large errors in crowded environments. This is because there is usually a great deal of occlusion in a crowded environment, so the features of the body below the shoulders can hardly be used reliably and efficiently. On the other hand, if only head-shoulder features are extracted and located, the relatively simple head-shoulder contour means that the features produced by traditional hand-designed extractors such as HOG, LBP and HAAR are easily confused with features of other body parts or of background textures, resulting in a large number of false detections, as described in "Histograms of oriented gradients for human detection" (N. Dalal and B. Triggs, IEEE Conference on Computer Vision and Pattern Recognition, 2005). Feature extraction based on deep learning, as described in "Rich feature hierarchies for accurate object detection and semantic segmentation" (R. Girshick, J. Donahue, T. Darrell, et al., CVPR, 2014) and "Faster R-CNN: Towards real-time object detection with region proposal networks" (S. Ren, K. He, R. Girshick, et al., NIPS, 2015), has surpassed hand-crafted features in many image analysis areas. However, because of its large amount of computation and low speed, it has not been widely applied to monitoring scenarios with high real-time requirements.
Disclosure of Invention
The invention aims to provide a people counting method based on image analysis and deep learning, which overcomes the defects of the prior art and achieves high people-counting performance at high speed.
In order to solve the technical problems, the technical scheme adopted by the invention is as follows.
A people counting method based on image analysis and deep learning comprises the following steps:
A. performing pyramid model calculation on an input image to generate images with a plurality of resolutions and sizes;
B. performing window sliding on each layer of the pyramid, calculating the HOG characteristic value of a window area, classifying through a linear SVM classifier, and judging whether the window is a head-shoulder area or not;
C. for each head and shoulder area given in the step B, extracting a corresponding image, normalizing to a set same size, and inputting the image into a deep neural network to obtain classified output;
D. performing non-maximum suppression on all head-shoulder windows output by step C to merge detections that overlap across adjacent positions and scales.
Preferably, in step A, the original image is Gaussian-smoothed and an image with the resolution reduced by 10% is generated; this process is repeated on the newly generated low-resolution image until a pyramid model with the given number of layers is generated.
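As an illustrative sketch of this pyramid construction (assuming OpenCV; the smoothing kernel, sigma and number of layers below are arbitrary choices, since the text only fixes the 10% reduction per level):

```python
import cv2

def build_pyramid(image, num_levels=8, scale=0.9):
    """Gaussian-smooth, then shrink resolution by 10%, repeated per pyramid level."""
    levels = [image]
    current = image
    for _ in range(num_levels - 1):
        smoothed = cv2.GaussianBlur(current, (5, 5), sigmaX=1.0)
        h, w = smoothed.shape[:2]
        current = cv2.resize(smoothed, (int(w * scale), int(h * scale)),
                             interpolation=cv2.INTER_AREA)
        levels.append(current)
    return levels
```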
Preferably, in step B,
target detection is carried out on each layer of the pyramid; in the detection process, a window with a fixed size of W×H slides over the image space, the HOG feature is computed on the image area under the window and is input into a linear SVM classifier, and a judgment of whether the window is a head-shoulder target is obtained; in the HOG computation, the horizontal and vertical gradients of each pixel point (x, y) are respectively
Gx(x,y)=I(x+1,y)-I(x-1,y)
Gy(x,y)=I(x,y+1)-I(x,y-1)
In the formula, I (x, y) represents the pixel value at (x, y), and the gradient amplitude and direction of the pixel point (x, y) are respectively
G(x,y) = sqrt(Gx(x,y)² + Gy(x,y)²)
α(x,y) = tan⁻¹(Gy(x,y)/Gx(x,y))
The calculation of HOG divides the window into many cells; each cell is 4×4 pixels and there is no overlap between cells. For each cell, the corresponding feature is generated by the formula
Ho(m,n) = Σ(4m≤x<4m+4, 4n≤y<4n+4) G(x,y)·Δo(x,y)/Z
Δo(x,y) = 1 if the gradient direction α(x,y) falls into orientation bin o, and 0 otherwise
where Ho(m,n) is the feature value of cell (m,n) for gradient direction bin o (0 ≤ o < 9) and Z is a normalization parameter. HOG then computes the feature of each block and concatenates them; here each block comprises 2×2 adjacent cells, and blocks may overlap. The feature of each block consists of the normalized 9-bin gradient direction histograms of its cells, forming a 36-dimensional feature; the features of all blocks together form the HOG feature, of dimension 36×(W/4−1)×(H/4−1).
Preferably, in step C, for each head and shoulder region obtained in step B, the intra-region image is extracted, enlarged or reduced to a size of 48 × 48, and sent to the deep neural network to obtain a judgment whether the intra-region image is a head and shoulder target.
Preferably, the deep neural network comprises 3 sets of convolutional layers and sampling layers, 2 fully-connected layers and 1 output layer.
Preferably, in the 3D convolution operation of a convolutional layer, for each pixel (x, y) of each output channel On of the layer,
On(x,y) = Σ(m=1..M) αm,n · Σ(i,j) Hm,n(i,j) · Im(x+i, y+j)
where Im is an input channel, M is the number of input channels, Hm,n is a two-dimensional 5×5 filter and αm,n is the channel weight; Hm,n and αm,n together form a 3-dimensional filter. By applying PCA to the 5×5 neighborhoods of Im, Im can be represented as a weighted sum of several principal components,
Im(x+i, y+j) ≈ Σ(k=1..K) βm,k(x,y) · Uk(i,j)
where (i, j) ranges over the 5×5 neighborhood of (x, y), βm,k(x,y) is the k-th PCA projection coefficient, Uk is the k-th PCA principal component and K is the number of principal components. Then
On(x,y) ≈ Σ(m=1..M) Σ(k=1..K) βm,k(x,y) · wk,m,n
where
wk,m,n = αm,n · Σ(i,j) Hm,n(i,j) · Uk(i,j)
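The decomposition above can be checked numerically. The NumPy sketch below (one output channel) assumes a single PCA basis shared by all input channels and estimated from the input patches themselves; with K equal to the full patch dimension the two computation paths agree exactly, while the K = 6 used later trades a small approximation error for speed:

```python
import numpy as np

def conv_valid(img, filt):
    """Plain 2-D 'valid' correlation with a k x k filter."""
    k = filt.shape[0]
    out = np.zeros((img.shape[0] - k + 1, img.shape[1] - k + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            out[y, x] = np.sum(img[y:y + k, x:x + k] * filt)
    return out

M, k = 3, 5                         # input channels, filter size
K = k * k                           # all 25 components -> exact check; the patent later uses K = 6
rng = np.random.default_rng(0)
I = rng.normal(size=(M, 32, 32))    # input channels I_m
Hf = rng.normal(size=(M, k, k))     # 2-D filters H_{m,n} for one output channel n
alpha = rng.normal(size=M)          # channel weights alpha_{m,n}

# PCA basis of 5x5 patches (assumption: one basis shared across channels)
patches = np.stack([I[m, y:y + k, x:x + k].ravel()
                    for m in range(M) for y in range(0, 28, 2) for x in range(0, 28, 2)])
U = np.linalg.svd(patches, full_matrices=False)[2][:K]          # K x 25, orthonormal rows

# Standard path: O_n = sum_m alpha_{m,n} (I_m convolved with H_{m,n})
O_std = sum(alpha[m] * conv_valid(I[m], Hf[m]) for m in range(M))

# Decomposed path: K small convolutions give beta_{m,k}(x,y),
# then a 1x1 combination with w_{k,m,n} = alpha_{m,n} <H_{m,n}, U_k>
w = np.einsum('m,mij,kij->km', alpha, Hf, U.reshape(K, k, k))
beta = np.stack([[conv_valid(I[m], u.reshape(k, k)) for u in U] for m in range(M)])
O_dec = np.einsum('km,mkyx->yx', w, beta)

print(np.abs(O_std - O_dec).max())  # ~1e-12: both paths give the same output channel
```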
adopt the beneficial effect that above-mentioned technical scheme brought to lie in: the deep neural network designed by the invention adopts a brand-new structure, greatly reduces model parameters and improves the operation speed. Different from the general target detector based on deep learning, the invention abandons the selective search, RPN (region protocol network) waiting for selecting the region extraction method commonly adopted by deep learning, and adopts the output of the traditional HOG detector as a candidate region, thereby having certain superiority for the head and shoulder scene and the small target scene in the crowded environment. And the artificial field depth calibration is carried out on the scene, so that the scale space search range of the HOG detection is greatly reduced. The invention has good universality and good detection performance for both crowded environment and uncongested environment; because a simpler deep neural network, PCA decomposition acceleration, HOG pre-screening mechanism and artificial depth of field calibration are adopted, the speed is higher.
Drawings
FIG. 1 is a flow chart of an embodiment of the present invention.
FIG. 2 is a block diagram of a deep neural network in accordance with one embodiment of the present invention.
Fig. 3 is a schematic view of an artificial depth of field calibration according to an embodiment of the present invention.
Detailed Description
Referring to fig. 1, one embodiment of the present invention includes the steps of:
A. performing pyramid model calculation on an input image to generate images with a plurality of resolutions and sizes;
B. performing window sliding on each layer of the pyramid, calculating the HOG characteristic value of a window area, classifying through a linear SVM classifier, and judging whether the window is a head-shoulder area or not;
C. for each head and shoulder area given in the step B, extracting a corresponding image, normalizing to a set same size, and inputting the image into a deep neural network to obtain classified output;
D. performing non-maximum suppression on all head-shoulder windows output by step C to merge detections that overlap across adjacent positions and scales (a sketch of this suppression step follows this list).
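A minimal sketch of the non-maximum suppression in step D; the (x1, y1, x2, y2) box format, the per-box confidence score from the deep network, and the 0.3 overlap threshold are assumptions made for illustration:

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.3):
    """Keep the highest-scoring box, drop neighbours that overlap it too much, repeat."""
    boxes, scores = np.asarray(boxes, float), np.asarray(scores, float)
    order = scores.argsort()[::-1]
    keep = []
    while order.size:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        xx1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        yy1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        xx2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        yy2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_r - inter)
        order = rest[iou <= iou_thresh]
    return keep                       # the person count is len(keep)
```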
In step A, Gaussian smoothing is carried out on the original image, an image with the resolution reduced by 10% is generated, and the process is repeated on the newly generated low-resolution image until a pyramid model with a given layer number is generated.
In step B,
target detection is carried out on each layer of the pyramid; in the detection process, a window with a fixed size of W×H slides over the image space, the HOG feature is computed on the image area under the window and is input into a linear SVM classifier, and a judgment of whether the window is a head-shoulder target is obtained; in the HOG computation, the horizontal and vertical gradients of each pixel point (x, y) are respectively
Gx(x,y)=I(x+1,y)-I(x-1,y)
Gy(x,y)=I(x,y+1)-I(x,y-1)
In the formula, I (x, y) represents the pixel value at (x, y), and the gradient amplitude and direction of the pixel point (x, y) are respectively
G(x,y) = sqrt(Gx(x,y)² + Gy(x,y)²)
α(x,y) = tan⁻¹(Gy(x,y)/Gx(x,y))
The calculation of HOG divides the window into many cells; each cell is 4×4 pixels and there is no overlap between cells. For each cell, the corresponding feature is generated by the formula
Ho(m,n) = Σ(4m≤x<4m+4, 4n≤y<4n+4) G(x,y)·Δo(x,y)/Z
Δo(x,y) = 1 if the gradient direction α(x,y) falls into orientation bin o, and 0 otherwise
where Ho(m,n) is the feature value of cell (m,n) for gradient direction bin o (0 ≤ o < 9) and Z is a normalization parameter. HOG then computes the feature of each block and concatenates them; here each block comprises 2×2 adjacent cells, and blocks may overlap. The feature of each block consists of the normalized 9-bin gradient direction histograms of its cells, forming a 36-dimensional feature; the features of all blocks together form the HOG feature, of dimension 36×(W/4−1)×(H/4−1).
In step C, for each head-shoulder region obtained in step B, the image inside the region is extracted, enlarged or reduced to a size of 48×48, and fed into the deep neural network to judge whether it is a head-shoulder target.
The deep neural network comprises 3 groups of convolutional layers and sampling layers, 2 full-connection layers and 1 output layer.
In the 3D convolution operation of a convolutional layer, for each pixel (x, y) of each output channel On of the layer,
On(x,y) = Σ(m=1..M) αm,n · Σ(i,j) Hm,n(i,j) · Im(x+i, y+j)
where Im is an input channel, M is the number of input channels, Hm,n is a two-dimensional 5×5 filter and αm,n is the channel weight; Hm,n and αm,n together form a 3-dimensional filter. By applying PCA to the 5×5 neighborhoods of Im, Im can be represented as a weighted sum of several principal components,
Im(x+i, y+j) ≈ Σ(k=1..K) βm,k(x,y) · Uk(i,j)
where (i, j) ranges over the 5×5 neighborhood of (x, y), βm,k(x,y) is the k-th PCA projection coefficient, Uk is the k-th PCA principal component and K is the number of principal components. Then
On(x,y) ≈ Σ(m=1..M) Σ(k=1..K) βm,k(x,y) · wk,m,n
where
wk,m,n = αm,n · Σ(i,j) Hm,n(i,j) · Uk(i,j)
referring to fig. 2, wherein C1, C3, C5 and C7 are convolutional layers, S2, S4 and S6 are sampling layers (including nonlinear activation operation), and F8 and F9 are full-link layers. All convolutional layers use a filter length of 5. The filling length of the C1 and C3 layers is 2, the filling length of the C5 layer is 1, and the C7 layer is not filled. The number of nodes of F8 is 128. The number of nodes of F9 is 2.
To further increase its speed, we decompose the 3D convolution in each convolutional layer (C1, C3, C5, C7) into a number of 2-dimensional convolutions followed by a 1×1 convolution.
Standard convolution operation: with M input channels, N output channels and 5×5 filters, 5×5×M×N multiplications are performed for each pixel position.
PCA projection: for each pixel position of each input channel, the 5×5 neighborhood is projected onto 6 principal component directions, which is equivalent to six 5×5 convolutions, so 5×5×6×M multiplications are performed for each pixel position.
1×1 convolution: for each pixel position, the resulting 6M-dimensional vector is combined by weighted summation, which is equivalent to a standard 1×1 convolution, so 6×M×N multiplications are performed for each pixel position.
Referring to fig. 3, a head near the camera and a head far from the camera are selected, and two square boxes are drawn according to their sizes. The head size at any position in the scene can then be obtained by linear interpolation from the sizes and vertical positions of these two boxes. This calibration and estimation method assumes that all people stand on the same principal plane and that the horizontal direction of the camera image is parallel to that plane in the scene; this precondition can usually be met. Estimating the head size at any position of the scene greatly reduces the scale-space search range and speeds up the image analysis.
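A sketch of that interpolation; the calibration squares are assumed to be given as (vertical center, side length) pairs, and the example coordinates are illustrative:

```python
def head_size_at(y, near_box, far_box):
    """Expected head size (pixels) at image row y, interpolated between the two calibration squares."""
    (y_near, s_near), (y_far, s_far) = near_box, far_box
    t = (y - y_far) / (y_near - y_far)
    return s_far + t * (s_near - s_far)

# e.g. a 64 px head drawn at row 600 (near) and a 24 px head at row 150 (far)
print(head_size_at(375, (600, 64), (150, 24)))   # -> 44.0 px expected head size mid-scene
```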
The embodiment has been verified in the classroom monitoring systems of several universities, achieving an average people-counting accuracy of over 89%.
The foregoing shows and describes the general principles, main features and advantages of the present invention. It will be understood by those skilled in the art that the present invention is not limited to the embodiment described above; the embodiment and the description merely illustrate the principle of the invention, and various changes and modifications may be made without departing from the spirit and scope of the invention, all of which fall within the scope of the claimed invention. The scope of the invention is defined by the appended claims and their equivalents.

Claims (1)

1. A people counting method based on image analysis and deep learning is characterized by comprising the following steps:
A. performing pyramid model calculation on the input image to generate images with multiple resolutions and sizes, specifically including,
performing Gaussian smoothing on an original image, generating an image with resolution reduced by 10%, and repeating the process on a newly generated low-resolution image until a pyramid model with a given layer number is generated;
B. performing window sliding on each layer of the pyramid, calculating HOG characteristic value of a window region, classifying by a linear SVM classifier, judging whether the window is a head-shoulder region, specifically comprising,
carrying out target detection on each layer of the pyramid; in the detection process, a window with a fixed size of W×H slides over the image space, the HOG feature is computed on the image area under the window and is input into a linear SVM classifier, and a judgment of whether the window is a head-shoulder target is obtained; in the HOG computation, the horizontal and vertical gradients of each pixel point (x, y) are respectively
Gx(x,y)=I(x+1,y)-I(x-1,y)
Gy(x,y)=I(x,y+1)-I(x,y-1)
In the formula, I (x, y) represents the pixel value at (x, y), and the gradient amplitude and direction of the pixel point (x, y) are respectively
G(x,y) = sqrt(Gx(x,y)² + Gy(x,y)²)
α(x,y) = tan⁻¹(Gy(x,y)/Gx(x,y))
The calculation of HOG divides the window into many cells; each cell is 4×4 pixels and there is no overlap between cells; for each cell, the corresponding feature is generated by the formula
Ho(m,n) = Σ(4m≤x<4m+4, 4n≤y<4n+4) G(x,y)·Δo(x,y)/Z
Δo(x,y) = 1 if the gradient direction α(x,y) falls into orientation bin o, and 0 otherwise
wherein Ho(m,n) is the feature value of cell (m,n) for gradient direction bin o (0 ≤ o < 9) and Z is a normalization parameter; HOG computes the feature of each block and concatenates them; here each block comprises 2×2 adjacent cells, and blocks may overlap; the feature of each block consists of the normalized 9-bin gradient direction histograms of its cells, forming a 36-dimensional feature; the features of all blocks form the HOG feature, of dimension 36×(W/4−1)×(H/4−1);
C. for each head-shoulder region given by step B, extracting the corresponding image, normalizing it to a set common size, and inputting it into a deep neural network to obtain a classification output; specifically, for each head-shoulder region obtained in step B, the image inside the region is extracted, enlarged or reduced to a size of 48×48 and fed into the deep neural network to obtain a judgment of whether it is a head-shoulder target; the deep neural network comprises 3 groups of convolutional layers and sampling layers, 2 fully-connected layers and 1 output layer;
in the 3D convolution operation of a convolutional layer, for each pixel (x, y) of each output channel On of the layer,
On(x,y) = Σ(m=1..M) αm,n · Σ(i,j) Hm,n(i,j) · Im(x+i, y+j)
wherein Im is an input channel, M is the number of input channels, Hm,n is a two-dimensional 5×5 filter and αm,n is the channel weight; Hm,n and αm,n together form a 3-dimensional filter; by applying PCA to the 5×5 neighborhoods of Im, Im can be represented as a weighted sum of several principal components,
Im(x+i, y+j) ≈ Σ(k=1..K) βm,k(x,y) · Uk(i,j)
wherein (i, j) ranges over the 5×5 neighborhood of (x, y), βm,k(x,y) is the k-th PCA projection coefficient, Uk is the k-th PCA principal component and K is the number of principal components; then
On(x,y) ≈ Σ(m=1..M) Σ(k=1..K) βm,k(x,y) · wk,m,n
wherein
wk,m,n = αm,n · Σ(i,j) Hm,n(i,j) · Uk(i,j);
D. performing non-maximum suppression on all head-shoulder windows output by step C to merge detections that overlap across adjacent positions and scales.
CN201710492597.XA 2017-06-26 2017-06-26 People counting method based on image analysis and deep learning Active CN107330390B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710492597.XA CN107330390B (en) 2017-06-26 2017-06-26 People counting method based on image analysis and deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710492597.XA CN107330390B (en) 2017-06-26 2017-06-26 People counting method based on image analysis and deep learning

Publications (2)

Publication Number Publication Date
CN107330390A CN107330390A (en) 2017-11-07
CN107330390B true CN107330390B (en) 2020-12-01

Family

ID=60196088

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710492597.XA Active CN107330390B (en) 2017-06-26 2017-06-26 People counting method based on image analysis and deep learning

Country Status (1)

Country Link
CN (1) CN107330390B (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109918969B (en) * 2017-12-12 2021-03-05 深圳云天励飞技术有限公司 Face detection method and device, computer device and computer readable storage medium
CN108197579B (en) * 2018-01-09 2022-05-20 杭州智诺科技股份有限公司 Method for detecting number of people in protection cabin
CN108563998A (en) * 2018-03-16 2018-09-21 新智认知数据服务有限公司 Vivo identification model training method, biopsy method and device
CN110659550A (en) * 2018-06-29 2020-01-07 比亚迪股份有限公司 Traffic sign recognition method, traffic sign recognition device, computer equipment and storage medium
CN109359577B (en) * 2018-10-08 2021-06-29 福州大学 System for detecting number of people under complex background based on machine learning
CN109472291A (en) * 2018-10-11 2019-03-15 浙江工业大学 A kind of demographics classification method based on DNN algorithm
CN109376637B (en) * 2018-10-15 2021-03-02 齐鲁工业大学 People counting system based on video monitoring image processing
CN110674704A (en) * 2019-09-05 2020-01-10 同济大学 Crowd density estimation method and device based on multi-scale expansion convolutional network
CN110929756B (en) * 2019-10-23 2022-09-06 广物智钢数据服务(广州)有限公司 Steel size and quantity identification method based on deep learning, intelligent equipment and storage medium
CN113190795A (en) * 2021-02-23 2021-07-30 深圳市大数据资源管理中心 Method, device, medium and equipment for counting actual management population data

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105160313A (en) * 2014-09-15 2015-12-16 中国科学院重庆绿色智能技术研究院 Method and apparatus for crowd behavior analysis in video monitoring
EP3173983A1 (en) * 2015-11-26 2017-05-31 Siemens Aktiengesellschaft A method and apparatus for providing automatically recommendations concerning an industrial system
CN105844234B (en) * 2016-03-21 2020-07-31 商汤集团有限公司 Method and equipment for counting people based on head and shoulder detection
CN106845406A (en) * 2017-01-20 2017-06-13 深圳英飞拓科技股份有限公司 Head and shoulder detection method and device based on multitask concatenated convolutional neutral net

Also Published As

Publication number Publication date
CN107330390A (en) 2017-11-07

Similar Documents

Publication Publication Date Title
CN107330390B (en) People counting method based on image analysis and deep learning
CN106874894B (en) Human body target detection method based on regional full convolution neural network
Xu et al. Inter/intra-category discriminative features for aerial image classification: A quality-aware selection model
CN107274419B (en) Deep learning significance detection method based on global prior and local context
CN107016357B (en) Video pedestrian detection method based on time domain convolutional neural network
CN103530599B (en) The detection method and system of a kind of real human face and picture face
CN103824070B (en) A kind of rapid pedestrian detection method based on computer vision
CN104933414B (en) A kind of living body faces detection method based on WLD-TOP
CN108960404B (en) Image-based crowd counting method and device
CN106952274B (en) Pedestrian detection and distance measuring method based on stereoscopic vision
TWI441096B (en) Motion detection method for comples scenes
CN103020985B (en) A kind of video image conspicuousness detection method based on field-quantity analysis
JP2017191501A (en) Information processing apparatus, information processing method, and program
CN108804992B (en) Crowd counting method based on deep learning
CN113536972B (en) Self-supervision cross-domain crowd counting method based on target domain pseudo label
CN105741319B (en) Improvement visual background extracting method based on blindly more new strategy and foreground model
CN110532959B (en) Real-time violent behavior detection system based on two-channel three-dimensional convolutional neural network
CN110837786B (en) Density map generation method and device based on spatial channel, electronic terminal and medium
WO2023159898A1 (en) Action recognition system, method, and apparatus, model training method and apparatus, computer device, and computer readable storage medium
Su et al. A new local-main-gradient-orientation HOG and contour differences based algorithm for object classification
Ahuja et al. A survey of recent advances in crowd density estimation using image processing
CN104200455B (en) A kind of key poses extracting method based on movement statistics signature analysis
CN106446832B (en) Video-based pedestrian real-time detection method
CN111241943B (en) Scene recognition and loopback detection method based on background target and triple loss
CN117409476A (en) Gait recognition method based on event camera

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant