CN110059667A

CN110059667A - Pedestrian counting method

Info

Publication number: CN110059667A
Application number: CN201910353398.XA
Authority: CN
Inventors: 黄良军; 张晓宁; 张亚妮; 谢福
Original assignee: Shanghai Institute of Technology
Current assignee: Shanghai Institute of Technology
Priority date: 2019-04-28
Filing date: 2019-04-28
Publication date: 2019-07-26

Abstract

The present invention provides a pedestrian counting method, comprising: S1: obtaining a surveillance video image to be detected; S2: inputting the image to be detected into an improved deep convolutional neural network, namely an improved Cascade R-CNN network, to extract features; S3: training through an RPN network and performing Softmax classification and target-region determination on the image; S4: passing through the ROIs Pooling layer and changing the IoU threshold in each regression to regress the bounding-box coordinates multiple times, the multi-regression network having a cascade structure; S5: comparing the results after each regression, selecting the best number of cascade stages, and outputting the optimal predicted head segmentation result and regression prediction result; S6: performing pedestrian detection post-processing, in which the crowd foreground segmentation prediction and the crowd density map prediction are combined by a Hadamard product; S7: outputting the final detected pedestrian count and the pedestrian density map. In this way, the invention is applicable to pedestrian counting and density detection in different settings and effectively improves the accuracy and speed of detection.

Description

Pedestrian counting method

Technical field

The present invention relates to a pedestrian counting method.

Background art

Accurately estimating the number of people in a monitored scene in real time helps the personnel concerned to issue early warnings of emergencies and to make decisions afterwards, so that people's lives and property are protected.

Existing deep-learning-based pedestrian counting methods fall into two main groups: 1) methods based on the characteristics of the network structure; 2) methods that differ in the network training process. Both groups have limitations. Methods of type 1) mostly use multi-column convolutional neural networks, which require a large number of samples and are highly complex; methods of type 2) suffer from problems such as excessively long training time, low resolution of human targets, and difficulty in distinguishing features.

Summary of the invention

The purpose of the present invention is to provide a pedestrian counting method.

To solve the above problems, the present invention provides a pedestrian counting method, comprising:

Step S1: obtaining a surveillance video image to be detected;

Step S2: inputting the surveillance video image to be detected into an improved deep convolutional neural network, namely an improved Cascade R-CNN network, to extract features;

Step S3: training the extracted features through an RPN network, and performing Softmax classification and target-region determination on the surveillance video image to be detected, to obtain a classification result C0 and a target-region determination result B0;

Step S4: based on the results of the Softmax classification and target-region determination, passing through the ROIs Pooling layer and changing the IoU threshold in each regression to regress the bounding-box coordinates multiple times, wherein the network structure of the multiple regressions is a cascade structure;

Step S5: comparing the results after each regression in the multiple regressions, selecting the best number of cascade stages, and outputting the optimal predicted head segmentation result and regression result;

Step S6: based on the optimal predicted head segmentation result and regression result, performing pedestrian detection post-processing, which includes taking the Hadamard product of the crowd foreground segmentation prediction and the crowd density map prediction;

Step S7: outputting the final predicted pedestrian count and the pedestrian density map on the basis of the pedestrian detection post-processing.
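As a non-limiting illustration, the IoU (intersection over union) measure on which the thresholds of step S4 operate can be sketched as follows. The function name `iou` is hypothetical; the (x, y, w, h) box layout with (x, y) as the top-left corner follows the detection-box convention used in this description, and the sketch is not asserted to be the exact computation of the claimed network.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x, y, w, h).

    (x, y) is the top-left corner; w and h are width and height.
    """
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    # Overlap extent along each axis, clamped at 0 for disjoint boxes.
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


# Two 2x2 boxes offset by one unit overlap in a 1x1 square.
overlap = iou((0, 0, 2, 2), (1, 1, 2, 2))
```

A value of 1 means the boxes coincide and a value of 0 means they do not overlap; raising the IoU threshold therefore demands a tighter match between a candidate box and its target.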

Further, in the above method, step S2, inputting the surveillance video image to be detected into the improved deep convolutional neural network, namely the improved Cascade R-CNN network, to extract features, comprises:

The surveillance video image to be detected first passes through five convolution stages, comprising 13 Conv layers with 3*3 kernels and 3 Maxpool layers with 2*2 kernels;

The surveillance video image that has passed through the five convolution stages is input into the improved deep convolutional neural network, namely the improved Cascade R-CNN network, to extract features.

Further, in the above method, step S4, based on the results of the Softmax classification and target-region determination, passing through the ROIs Pooling layer and changing the IoU threshold in each regression to regress the bounding-box coordinates multiple times, wherein the network structure of the multiple regressions is a cascade structure, comprises:

S41: inputting the target-region determination result B0 into ROIs Pooling with an IoU threshold of a₁, training detector H1, and predicting the classification result C1 and the target-region determination result B1;

S42: inputting the determination result B1 into ROIs Pooling with an IoU threshold of a₂ = a₁ + x₁ (0 < x < 1), training detector H2, and predicting the classification result C2 and the target-region determination result B2;

S43: inputting the determination result B2 into ROIs Pooling with an IoU threshold of a₃ = a₂ + x₂ (0 < x < 1), training detector H3, and predicting the classification result C3 and the target-region determination result B3.
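The threshold schedule of S41 to S43 can be sketched as below, purely by way of example. The function name `cascade_thresholds` and the numeric values 0.5 and 0.1 are assumptions chosen for illustration; the description only requires that each stage add an increment x with 0 < x < 1 to the previous threshold.

```python
def cascade_thresholds(a1, increments):
    """Return [a1, a2, a3, ...] where each a_(k+1) = a_k + x_k."""
    thresholds = [a1]
    for x in increments:
        if not 0 < x < 1:
            raise ValueError("each increment must satisfy 0 < x < 1")
        next_threshold = thresholds[-1] + x
        if next_threshold > 1:
            # An IoU value never exceeds 1, so a larger threshold is unusable.
            raise ValueError("IoU threshold cannot exceed 1")
        thresholds.append(next_threshold)
    return thresholds


# Illustrative values only: a1 = 0.5 and x1 = x2 = 0.1 are not from the source.
a1, a2, a3 = cascade_thresholds(0.5, [0.1, 0.1])
```

Detectors H1, H2 and H3 would then be trained with a1, a2 and a3 respectively, each stage taking the boxes produced by the previous one as input.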

Further, in the above method, step S5, comparing the results after each regression in the multiple regressions, selecting the best number of cascade stages, and outputting the optimal predicted head segmentation result and regression result, comprises:

S51: using indicators including AP, comparing the determination results B0~B3 and the classification results C0~C3, and selecting the optimal head segmentation prediction result B_n (n=1,2,3) and regression prediction result C_n (n=1,2,3);

S52: using B_n and C_n with a multi-task combined loss function to train target classification and detection-box regression.

Further, in the above method, step S6, based on the optimal predicted head segmentation result and regression result, performing pedestrian detection post-processing, which includes taking the Hadamard product of the crowd foreground segmentation prediction and the crowd density map prediction, comprises:

S61: performing loss calculation on the optimal predicted head segmentation result and regression result B_n (n=1,2,3) and C_n (n=1,2,3): using the loss function of the detection regression loss term, the ||L_loc||₁ loss function optimizes the predicted offset t = (t_x, t_y, t_w, t_h) and the target offset, wherein t denotes the detection box, x and y denote the top-left position of the detection box, and w and h denote the width and height of the detection box respectively;

S62: L_cls|obj outputs the classification loss of each detection box, with 2 classes in total; given the target detection score p^obj, the network branch first excludes regions whose score is below the threshold O_p; like L_obj, L_cls|obj outputs the probability of each position through a Softmax layer; with α = β = 1/3, the following loss function is obtained:

S63: taking the Hadamard product of the result matrices of the optimal predicted head segmentation result and regression result B_n (n=1,2,3) and C_n (n=1,2,3), that is, multiplying the corresponding elements of the two matrices:

Output = F_reg ⊙ F_seg.
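By way of example only, the element-wise product Output = F_reg ⊙ F_seg can be sketched as follows. The helper names `hadamard` and `estimated_count` are hypothetical, the matrices are small illustrative values, and reading the pedestrian count as the sum of the masked density map is an assumption commonly made for density maps rather than something this description states.

```python
def hadamard(f_reg, f_seg):
    """Element-wise product of two equally shaped 2-D lists."""
    same_rows = len(f_reg) == len(f_seg)
    if not same_rows or any(len(r) != len(s) for r, s in zip(f_reg, f_seg)):
        raise ValueError("matrices must have the same shape")
    return [[r * s for r, s in zip(row_r, row_s)]
            for row_r, row_s in zip(f_reg, f_seg)]


def estimated_count(density_map):
    """Assumed read-out: the count is the sum of all density values."""
    return sum(sum(row) for row in density_map)


# Illustrative data: a density prediction and a binary foreground mask.
f_reg = [[0.5, 0.2], [0.1, 0.4]]
f_seg = [[1, 0], [0, 1]]
output = hadamard(f_reg, f_seg)
count = estimated_count(output)
```

The mask zeroes density values that fall on background, so only regions segmented as crowd foreground contribute to the output.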

Compared with the prior art, the beneficial effects of the present invention are:

1: the invention uses a fully convolutional network, which accepts an image to be detected of any size and removes the requirement of conventional networks that all inputs have the same size;

2: the invention uses an improved Cascade R-CNN network that changes the IoU threshold over several passes, which helps guarantee sample quality during training and trains a more accurate pedestrian counting detector, alleviating the overfitting that a single or excessively high threshold causes in conventional approaches;

3: the invention adapts to pedestrian counting in different scenes and predicts the pedestrian count and density efficiently and accurately.

Brief description of the drawings

Fig. 1 is a flowchart of the pedestrian counting method based on a cascaded convolutional neural network according to an embodiment of the present invention;

Fig. 2 is a schematic diagram of the convolutional neural network structure according to an embodiment of the present invention;

Fig. 3 is a schematic diagram of the cascade-structure regression flow according to an embodiment of the present invention;

Fig. 4 is a flowchart of the multi-task training loss function calculation according to an embodiment of the present invention.

Detailed description of the embodiments

To make the above objects, features and advantages of the present invention clearer and easier to understand, the invention is described in further detail below with reference to the accompanying drawings and specific embodiments.

As shown in Fig. 1, the present invention provides a pedestrian counting method, comprising:

Step S1: obtaining a surveillance video image to be detected;

Here, by continually raising the IoU threshold in the cascaded convolutional neural network, the invention trains a high-quality detector without reducing the number of samples, shortens the training time, and effectively improves the accuracy of the results.

In a preferred embodiment of the pedestrian counting method of the invention, step S2, inputting the surveillance video image to be detected into the improved deep convolutional neural network, namely the improved Cascade R-CNN network, to extract features, comprises:

In a preferred embodiment of the pedestrian counting method of the invention, step S4, based on the results of the Softmax classification and target-region determination, passing through the ROIs Pooling layer and changing the IoU threshold in each regression to regress the bounding-box coordinates multiple times, wherein the network structure of the multiple regressions is a cascade structure, comprises:

In a preferred embodiment of the pedestrian counting method of the invention, step S5, comparing the results after each regression in the multiple regressions, selecting the best number of cascade stages, and outputting the optimal predicted head segmentation result and regression result, comprises:

S52: using B_n and C_n with a multi-task combined loss function to train target classification and detection-box regression.

In a preferred embodiment of the pedestrian counting method of the invention, step S6, based on the optimal predicted head segmentation result and regression result, performing pedestrian detection post-processing, which includes taking the Hadamard product of the crowd foreground segmentation prediction and the crowd density map prediction, comprises:

Output = F_reg ⊙ F_seg.

Here, a multi-task loss function is used to train the two tasks of head foreground segmentation and head density regression simultaneously, so as to obtain a finer head density map prediction.

Specifically, referring to Fig. 1, in one embodiment, a pedestrian counting method based on a cascaded convolutional neural network comprises the following steps:

S1: obtaining real-time surveillance video of the monitored area of any scene and splitting it into frame images;

S2: inputting the image to be detected into an improved deep convolutional neural network, namely an improved Cascade R-CNN network, to extract features;

S3: training through an RPN network and performing Softmax classification and target-region determination on the image;

S4: passing through the ROIs Pooling layer and changing the IoU threshold in each regression to regress the bounding-box coordinates multiple times, the multi-regression network having a cascade structure;

S5: comparing the results after each regression, selecting the best number of cascade stages, and outputting the optimal predicted head segmentation result and regression result;

S6: performing pedestrian detection post-processing, in which the crowd foreground segmentation prediction and the crowd density map prediction are combined by a Hadamard product;

S7: outputting the final predicted pedestrian count and the pedestrian density map.

The pedestrian counting method based on a cascaded convolutional neural network is set out in further detail below, but the invention is not limited thereto.

In step 1, real-time surveillance video of the monitored area of any scene is obtained and processed into (3,224,224) images for use in the subsequent steps.
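One possible way of bringing a decoded frame to the (3,224,224) layout of step 1 is sketched below. It assumes, for illustration only, that a frame arrives as a height x width x 3 nested list and that nearest-neighbour sampling is acceptable; the description does not specify the resizing method, and the name `to_chw` is hypothetical.

```python
def to_chw(frame, size=224):
    """Resize an H x W x 3 frame to size x size by nearest-neighbour
    sampling and reorder it to channel-first (3, size, size)."""
    height, width = len(frame), len(frame[0])
    # Source row/column index chosen for each output row/column.
    rows = [i * height // size for i in range(size)]
    cols = [j * width // size for j in range(size)]
    return [[[frame[r][c][ch] for c in cols] for r in rows]
            for ch in range(3)]


# Tiny illustrative 2 x 2 frame with distinct pixel values.
frame = [[(10, 20, 30), (40, 50, 60)],
         [(70, 80, 90), (100, 110, 120)]]
image = to_chw(frame)
shape = (len(image), len(image[0]), len(image[0][0]))
```

In practice an image library would normally perform the decoding and resizing; the pure-Python version only makes the target layout explicit.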

In step 2, the (3,224,224) image data from step 1 is input into five convolution stages and then passes through two branch convolutional layers, which output the classification result and the bounding-box regression result.

With reference to Fig. 2, the pedestrian counting method based on a cascaded convolutional neural network comprises the following steps:

S21: the (3,224,224) image data from step 1 enters the first-stage convolutional layer, which has 64 convolution kernels of size (3,3) with ReLU as the activation function. Each convolution kernel sweeps over the image and generates one new matrix, so 64 convolution kernels generate a 64-layer matrix; after the convolutional layer the image data is 64*224*224. The data then enters the pooling layer, whose stride (2,2) means that it moves 2 cells at a time horizontally and 2 cells at a time vertically. After this pooling the data becomes (64,112,112): the width and height of the matrix are halved from 224 to 112.

S22: similarly, the numbers of convolution kernels in the second, third, fourth and fifth stages become 128, 256, 512 and 1024 in turn. After the pooling of each stage, the width and height of the matrix are reduced to 1/2 of their previous values.

S23: after 13 convolutional layers and 3 pooling layers, the initial input image data becomes (512,7,7); a Flatten operation then flattens the data into a one-dimensional vector of length 512*7*7=25088.
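The shape arithmetic of S21 to S23 can be checked with the following sketch. It assumes size-preserving 3x3 convolutions (as implied by the 64*224*224 figure in S21) and one stride-2 pooling per stage (as stated in S22); the channel list (64, 128, 256, 512, 512) is an assumption chosen so that the result agrees with the (512,7,7) output given in S23, and the name `stage_output_shape` is hypothetical.

```python
def stage_output_shape(shape, out_channels, pool=2):
    """Shape after one stage: size-preserving convolutions to
    out_channels, then pooling that divides height and width by pool."""
    _, height, width = shape
    return (out_channels, height // pool, width // pool)


shape = (3, 224, 224)
# Assumed channel widths; the last value follows the (512,7,7) of S23.
for channels in (64, 128, 256, 512, 512):
    shape = stage_output_shape(shape, channels)

flattened_length = shape[0] * shape[1] * shape[2]
print(shape, flattened_length)  # (512, 7, 7) 25088
```

Halving 224 five times gives 112, 56, 28, 14 and 7, which is how a 7x7 spatial size arises from a 224x224 input.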

Further, step S3 comprises the following step: the data is input into the cascaded convolutional network to obtain the foreground segmentation result matrix and the regression result matrix.

With reference to Fig. 3, the cascade-structure regression in the pedestrian counting method based on a cascaded convolutional neural network comprises the following steps:

S41: inputting the target-region determination result B0 into ROIs Pooling with an IoU threshold of a₁, training detector H1, and predicting the classification result C1 and the target-region determination result B1;

S42: inputting B1 into ROIs Pooling with an IoU threshold of a₂ = a₁ + x₁ (0 < x < 1), training detector H2, and predicting the classification result C2 and the target-region determination result B2;

S43: inputting B2 into ROIs Pooling with an IoU threshold of a₃ = a₂ + x₂ (0 < x < 1), training detector H3, and predicting the classification result C3 and the target-region determination result B3.
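To illustrate why the threshold is raised stage by stage rather than set high from the outset, the sketch below counts how many candidate boxes would qualify as positive samples at each threshold. It assumes the usual convention that a candidate is positive when its IoU with a ground-truth box is at least the stage threshold; the IoU values, the thresholds and the name `positive_indices` are illustrative and are not taken from this description.

```python
def positive_indices(ious, threshold):
    """Indices of candidates whose IoU with ground truth meets the threshold."""
    return [i for i, value in enumerate(ious) if value >= threshold]


# Illustrative IoU values of five candidates against their ground truth.
candidate_ious = [0.45, 0.55, 0.65, 0.75, 0.90]

# Illustrative stage thresholds a1 < a2 < a3.
positives_per_stage = [len(positive_indices(candidate_ious, a))
                       for a in (0.5, 0.6, 0.7)]
```

With fixed candidates the number of positives shrinks as the threshold rises; in a cascade, each stage receives the boxes refined by the previous detector, which is intended to keep enough positives available at the higher thresholds.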

With reference to Fig. 4, the multi-task training loss function calculation in the pedestrian counting method based on a cascaded convolutional neural network comprises the following steps:

S51: using indicators such as AP, comparing B0~B3 and C0~C3, and selecting the optimal head segmentation prediction result B_n (n=1,2,3) and regression prediction result C_n (n=1,2,3);

S52: using B_n and C_n with a multi-task combined loss function to train target classification and detection-box regression.
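The stage selection of S51 can be sketched as follows, under stated assumptions. The description names AP as one indicator but does not say which AP variant is used; the sketch uses the non-interpolated form (mean of the precision at each correct detection, divided by the number of ground-truth objects). The function names and the per-stage AP values are hypothetical.

```python
def average_precision(is_true_positive, num_ground_truth):
    """Non-interpolated AP over detections sorted by descending score."""
    hits = 0
    precision_sum = 0.0
    for rank, correct in enumerate(is_true_positive, start=1):
        if correct:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / num_ground_truth if num_ground_truth else 0.0


def select_best_stage(ap_by_stage):
    """Return the cascade stage n whose AP is highest."""
    return max(ap_by_stage, key=ap_by_stage.get)


# Illustrative AP values for stages n = 1, 2, 3 (not measured results).
best_n = select_best_stage({1: 0.61, 2: 0.70, 3: 0.66})
```

The outputs B_n and C_n of the selected stage would then be the ones passed on to the loss calculation of S52.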

Further, S6 comprises the following steps:

S61: first performing loss calculation on the optimal head segmentation result and regression result B_n (n=1,2,3) and C_n (n=1,2,3): using the loss function of the detection regression loss term, the ||L_loc||₁ loss function optimizes the predicted offset t = (t_x, t_y, t_w, t_h) and the target offset, wherein t denotes the detection box, x and y denote the top-left position of the detection box, and w and h denote the width and height of the detection box.

S62: L_cls|obj outputs the classification loss of each detection box, with 2 classes in total. Given the target detection score p^obj, the network branch first excludes regions whose score is below the threshold O_p. Like L_obj, L_cls|obj outputs the probability of each position through a Softmax layer, with α = β = 1/3.

S63: taking the Hadamard product of the result matrices, that is, Output = F_reg ⊙ F_seg.

S64: checking the prediction quality of the model through the loss function; the smaller the total loss function of this method, the better the model.
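The individual loss terms named in S61 and S62 can be sketched as below, as an illustration only. The L1 term follows the ||L_loc||₁ notation of S61; using cross-entropy on the Softmax output as the classification loss is an assumption, since the description states only that a Softmax layer outputs the probabilities. How the terms are weighted into the total loss is not reproduced here, and all function names and numeric values are hypothetical.

```python
import math


def l1_localization_loss(t_pred, t_target):
    """L1 distance between predicted and target offsets (t_x, t_y, t_w, t_h)."""
    return sum(abs(p - q) for p, q in zip(t_pred, t_target))


def softmax(logits):
    """Numerically stable Softmax over a list of scores."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classification_loss(logits, label):
    """Assumed form: cross-entropy of the Softmax probability of the label."""
    return -math.log(softmax(logits)[label])


def keep_confident(scores, threshold):
    """Indices of regions whose detection score reaches the threshold."""
    return [i for i, s in enumerate(scores) if s >= threshold]


loc = l1_localization_loss((0.1, 0.2, 0.0, -0.1), (0.0, 0.0, 0.0, 0.0))
cls = classification_loss([0.0, 0.0], 0)   # two classes, equal scores
kept = keep_confident([0.9, 0.2, 0.6], 0.5)
```

Regions filtered out by the score threshold would not contribute to the classification term, which matches the order of operations described in S62.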

The beneficial effects of the present invention are:

The embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and for the same or similar parts the embodiments may be referred to one another.

Those skilled in the art will further appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To illustrate the interchangeability of hardware and software clearly, the composition and steps of each example have been described above in general terms of function. Whether these functions are implemented in hardware or software depends on the specific application and design constraints of the technical solution. Skilled practitioners may use different methods to implement the described functions for each specific application, but such implementations should not be considered beyond the scope of the present invention.

Obviously, those skilled in the art can make various modifications and variations to the invention without departing from its spirit and scope. Thus, if these modifications and variations fall within the scope of the claims of the present invention and their technical equivalents, the invention is intended to include them as well.

Claims

1. A pedestrian counting method, characterized by comprising:

Step S1: obtaining a surveillance video image to be detected;

Step S3: training the extracted features through an RPN network, and performing Softmax classification and target-region determination on the surveillance video image to be detected, to obtain a classification result C0 and a target-region determination result B0;

Step S4: based on the results of the Softmax classification and target-region determination, passing through the ROIs Pooling layer and changing the IoU threshold in each regression to regress the bounding-box coordinates multiple times, wherein the network structure of the multiple regressions is a cascade structure;

Step S5: comparing the results after each regression in the multiple regressions, selecting the best number of cascade stages, and outputting the optimal predicted head segmentation result and regression result;

Step S6: based on the optimal predicted head segmentation result and regression result, performing pedestrian detection post-processing, which includes taking the Hadamard product of the crowd foreground segmentation prediction and the crowd density map prediction;

Step S7: outputting the final predicted pedestrian count and the pedestrian density map on the basis of the pedestrian detection post-processing.

2. The pedestrian counting method according to claim 1, characterized in that step S2, inputting the surveillance video image to be detected into the improved deep convolutional neural network, namely the improved Cascade R-CNN network, to extract features, comprises:

The surveillance video image to be detected first passes through five convolution stages, comprising 13 Conv layers with 3*3 kernels and 3 Maxpool layers with 2*2 kernels;

The surveillance video image that has passed through the five convolution stages is input into the improved deep convolutional neural network, namely the improved Cascade R-CNN network, to extract features.

3. The pedestrian counting method according to claim 2, characterized in that step S4, based on the results of the Softmax classification and target-region determination, passing through the ROIs Pooling layer and changing the IoU threshold in each regression to regress the bounding-box coordinates multiple times, wherein the network structure of the multiple regressions is a cascade structure, comprises:

S41: inputting the target-region determination result B0 into ROIs Pooling with an IoU threshold of a₁, training detector H1, and predicting the classification result C1 and the target-region determination result B1;

4. The pedestrian counting method according to claim 3, characterized in that step S5, comparing the results after each regression in the multiple regressions, selecting the best number of cascade stages, and outputting the optimal predicted head segmentation result and regression result, comprises:

S51: using indicators including AP, comparing the determination results B0~B3 and the classification results C0~C3, and selecting the optimal head segmentation prediction result B_n (n=1,2,3) and regression prediction result C_n (n=1,2,3);

5. The pedestrian counting method according to claim 1, characterized in that step S6, based on the optimal predicted head segmentation result and regression result, performing pedestrian detection post-processing, which includes taking the Hadamard product of the crowd foreground segmentation prediction and the crowd density map prediction, comprises:

S61: performing loss calculation on the optimal predicted head segmentation result and regression result B_n (n=1,2,3) and C_n (n=1,2,3): using the loss function of the detection regression loss term, the ||L_loc||₁ loss function optimizes the predicted offset t = (t_x, t_y, t_w, t_h) and the target offset, wherein t denotes the detection box, x and y denote the top-left position of the detection box, and w and h denote the width and height of the detection box respectively;

S62: L_cls|obj outputs the classification loss of each detection box, with 2 classes in total; given the target detection score p^obj, the network branch first excludes regions whose score is below the threshold O_p; like L_obj, L_cls|obj outputs the probability of each position through a Softmax layer; with α = β = 1/3, the following loss function is obtained:

S63: taking the Hadamard product of the result matrices of the optimal predicted head segmentation result and regression result B_n (n=1,2,3) and C_n (n=1,2,3), that is, multiplying the corresponding elements of the two matrices:

Output = F_reg ⊙ F_seg.