CN109886085A

CN109886085A - People counting method based on deep learning target detection

Info

Publication number: CN109886085A
Application number: CN201910004771.0A
Authority: CN
Inventors: 陈友明
Original assignee: Sichuan Honghe Communication Co Ltd
Current assignee: Sichuan Honghe Communication Co Ltd
Priority date: 2019-01-03
Filing date: 2019-01-03
Publication date: 2019-06-14

Abstract

The invention discloses a people counting method based on deep learning target detection, comprising the following steps. Building the deep learning network model: a YOLO3 network with the DarkNet network as its base network is used. Processing the training data: crowd image data is collected from multiple scenes, and the training set is enlarged by image mirroring and random cropping across scales. Training the network: the parameters are trained through the loss function and gradient descent to optimize the network. To address the shortcomings of existing crowd counting, the invention counts people in specific environments with a target detection method based on a deep neural network. This solves the low accuracy of traditional feature extraction methods, and also the large error that deep learning feature regression methods show when the crowd is sparse. Detection speed is greatly improved as well: it is 4 times that of RetinaNet built on a 101-layer residual network, with comparable precision.

Description

People counting method based on deep learning target detection

Technical field

The present invention relates to a people counting method in the field of computer vision, and more particularly to a people counting method based on deep learning target detection.

Background art

With population growth and accelerating urbanization, large crowd gatherings are becoming more frequent and larger in scale, and the resulting stampede incidents are increasing as well. To make cities easier to manage, administrators have installed large numbers of cameras. Estimating crowd density and counting people accurately from surveillance video is currently one of the research hotspots in computer vision. The technology is commonly used in:

1. Highly complex venues where crowds concentrate, such as large public places like stadiums, dining halls and squares. A people counting system can estimate the crowd density or the actual number of people in a specified area, so that crowd movement can be tracked reliably and abnormal group incidents can be prevented.

2. Process-oriented venues such as airports and railway stations. With a people counting method, these places can obtain accurate data on the number and distribution of pedestrians, which provides a reliable basis for allocating service and management resources and for rational scheduling.

Traditional people counting methods fall into three general categories:

1. Pedestrian detection: computer vision techniques determine whether pedestrians are present in an image or video sequence and locate them precisely. This approach is direct: in sparse crowd scenes, each pedestrian in the video is detected and the count follows from the detections.

2. Visual feature trajectory clustering: for video surveillance, which generally works on image sequences, a KLT tracker is combined with clustering, and the number of people is estimated from the number of trajectory clusters obtained across two consecutive frames. This approach has strict requirements; brightness, crowd density and similar factors all affect it strongly.

3. Feature-based regression:

First, the crowd is separated from the image by foreground segmentation to ease subsequent feature extraction;

Then, various low-level features are extracted from the segmented foreground; common features include crowd area and perimeter, edge information, and texture features;

Finally, the extracted features are regressed to the number of people in the image. Common regression methods include linear regression and Gaussian process regression.

Direct methods are easily affected by difficulties such as occlusion under crowded conditions, whereas indirect methods start from the global features of the crowd and are able to count large crowds, so indirect methods are better suited to crowded scenes.

The shortcomings of the traditional counting methods above are as follows:

1. Pedestrian detection is usually built on boosting with background and motion features: background modeling extracts the moving foreground targets, features are extracted in the target regions, and a classifier then decides whether a region contains a pedestrian. The main problems with background modeling at present are:

(1) it must adapt to environmental changes (for example, illumination changes that alter image chroma);

(2) other objects that appear densely in the image, such as leaves or tree trunks, must be detected correctly;

(3) it is poorly suited to places where the crowd is even slightly dense.

To address speed, statistical-learning pedestrian detection with background subtraction can be used, provided the background modeling is effective enough, that is, both accurate and fast; but background modeling suffers from the defects listed above.

2. Crowd counting by visual feature trajectory clustering is affected by background modeling in the same way as pedestrian detection. The method is also only suitable for scenes with few people, for example at a bus door; if the scene contains heavy occlusion, the results are unsatisfactory.

3. The direct method in feature-based regression requires foreground segmentation, and segmentation quality directly affects the final result. Foreground segmentation is itself a difficult task, and the algorithm's performance depends heavily on it; where the crowd is dense, performance and accuracy drop sharply, which is a key factor limiting this method. The indirect method in feature-based regression first converts the original image into a crowd density map and then builds a model from the density map. It works well for large, dense crowds, but its performance declines in small and medium venues where the crowd is less dense.

Deep learning research has boomed in recent years and has achieved breakthroughs in many traditional fields. Convolutional neural networks enable end-to-end training: no foreground segmentation or manual feature extraction is needed, and high-level abstract features are obtained after multiple convolutional layers. Deep learning combines low-level features into more abstract high-level representations of attribute categories or features, in order to discover distributed feature representations of the data. Early deep learning research centered on BP neural networks, autoencoder dimensionality reduction, and sparse autoencoders. Taking the models of the ImageNet challenge as examples, the breakthrough of AlexNet in 2012, the deeper VGGNet and GoogLeNet in 2014, and ResNet in 2015 led to an overall surge in deep learning research.

The mainstream deep learning target detection networks currently use a deep residual network as the base network and achieve detection by training on bounding-box coordinates and object categories. The shortcomings of using a deep residual network as the base network are that it is slow and that the last few layers of the network are redundant, so feature extraction is not as efficient as it could be.

Summary of the invention

The object of the present invention is to solve the above problems by providing a people counting method based on deep learning target detection, which applies deep learning to crowd counting.

The present invention achieves the above object through the following technical solution:

A people counting method based on deep learning target detection, comprising the following steps:

Step 1: building the deep learning network model: a YOLO3 network with the DarkNet network as its base network is used;

Step 2: processing the training data: crowd image data is collected from multiple scenes, and the training set is enlarged by image mirroring and random cropping across scales;

Step 3: training the network: the parameters are trained through the loss function and gradient descent to optimize the network.

Preferably, in Step 1, three scales are added on top of the DarkNet base network to extract features, namely Scale1, Scale2 and Scale3. Scale1 adds several convolutional layers after the base network and then outputs box information. Scale2 upsamples from the second-to-last convolutional layer of Scale1, adds the result to the last feature map of size 16×16, and outputs box information after several more convolutions; its scale is twice that of Scale1. Scale3 is similar to Scale2 and finally outputs a feature map of size 32×32.
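The relation between the three scales can be illustrated with a small sketch. It assumes downsampling strides of 32, 16 and 8 for Scale1, Scale2 and Scale3 (the values used in commonly published YOLOv3 configurations, not stated in this text) and a 256×256 input, chosen only because it reproduces the 16×16 and 32×32 feature maps mentioned above. The function name `scale_grid_sizes` is hypothetical.

```python
def scale_grid_sizes(input_size, strides=(32, 16, 8)):
    """Return the feature-map side length at each detection scale.

    The strides are an assumption borrowed from common YOLOv3 configurations.
    """
    if any(input_size % s for s in strides):
        raise ValueError("input_size must be divisible by every stride")
    return [input_size // s for s in strides]


# Assumed 256x256 input: Scale2 and Scale3 come out as 16x16 and 32x32.
print(scale_grid_sizes(256))  # [8, 16, 32]
```

Each scale doubles the grid resolution of the previous one, which matches the statement that Scale2 is twice the scale of Scale1.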

Preferably, Step 2 specifically comprises the following steps:

Step 2.1: photograph crowds with cameras in multiple scenes to obtain images containing multiple people as input images;

Step 2.2: balance the existing sample data: first ensure that the number of training samples is similar across scenes, then process the images to the same size and resolution;

Step 2.3: expand the existing data by image mirroring, random cropping across scales, and brightness and contrast adjustment;

Step 2.4: image annotation: detecting the number of people with a deep learning target detection method requires annotating a rectangular box at each pedestrian's position in the image, each box being determined by the two corner coordinates on its diagonal. The annotation tool is labelimg; an XML file is generated for each annotated picture, and the image name and pedestrian coordinates in the XML file are extracted into a TXT file to be read during network training.
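The XML-to-TXT extraction in Step 2.4 could look roughly like the sketch below. It assumes the annotations are saved in Pascal VOC layout (`filename`, `object`, `bndbox` with `xmin`/`ymin`/`xmax`/`ymax`). The one-line-per-image TXT layout, the `person` label and the sample file name are hypothetical, since the text does not specify them.

```python
import xml.etree.ElementTree as ET


def voc_xml_to_line(xml_text, label="person"):
    """Turn one Pascal VOC annotation into 'image_name x1,y1,x2,y2 ...'."""
    root = ET.fromstring(xml_text)
    parts = [root.findtext("filename")]
    for obj in root.iter("object"):
        if obj.findtext("name") != label:
            continue  # keep only pedestrian boxes
        box = obj.find("bndbox")
        coords = [box.findtext(k) for k in ("xmin", "ymin", "xmax", "ymax")]
        parts.append(",".join(str(int(float(c))) for c in coords))
    return " ".join(parts)


# Hypothetical annotation with two pedestrians.
sample = """<annotation>
  <filename>crowd_0001.jpg</filename>
  <object><name>person</name>
    <bndbox><xmin>12</xmin><ymin>30</ymin><xmax>58</xmax><ymax>140</ymax></bndbox>
  </object>
  <object><name>person</name>
    <bndbox><xmin>70</xmin><ymin>25</ymin><xmax>110</xmax><ymax>150</ymax></bndbox>
  </object>
</annotation>"""

print(voc_xml_to_line(sample))  # crowd_0001.jpg 12,30,58,140 70,25,110,150
```

Each box is stored as the two diagonal corners, as described in Step 2.4; writing one such line per image yields the TXT file read during training.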

Further, in Step 2.1, existing datasets are also downloaded from the Internet and crowd images are downloaded with a crawler program; together with the images captured by cameras, they serve as input images.

Preferably, Step 3 specifically comprises the following steps:

Step 3.1: the input image is divided into an S × S grid; if the center of a person falls within a grid cell, that cell is responsible for detecting this person;

Step 3.2: each grid cell predicts B rectangular boxes and a score for each box; the score reflects the model's prediction of whether the cell contains a person and how likely the box is to be that person;

Step 3.3: if no person is present in the cell, the score is 0; otherwise the score is 1. Each rectangular box contains 5 predicted values, namely b_x, b_y, b_w, b_h and confidence; the coordinates (b_x, b_y) represent the center of the box, b_w and b_h are its width and height, and confidence is the probability that the box contains a person;

The learning target of the network is t, comprising t_x, t_y, t_w and t_h, computed as follows:

b_x = σ(t_x) + c_x, b_y = σ(t_y) + c_y, b_w = p_w e^{t_w}, b_h = p_h e^{t_h},

where σ is the sigmoid function, c_x and c_y are the coordinate offsets, and p_w and p_h are the side lengths of the preset rectangular box;

Step 3.4: calculation of the LOSS function, i.e. the loss function:

The specific formula of the LOSS function is as follows:

In the above formula, the first two rows are the coordinate prediction loss, the third row is the probability loss for boxes that contain a detected object, the fourth row is the probability loss for boxes that do not contain a detected object, and the last row is the class prediction probability loss. Here LOSS is the overall loss function, λ_coord is the loss function coefficient, 1_ij^obj indicates whether the j-th box in the i-th grid cell is responsible for the object, x_i is the center abscissa of the actual box, \hat{x}_i is the center abscissa of the predicted box, y_i is the center ordinate of the actual box, \hat{y}_i is the center ordinate of the predicted box, ω_i is the width of the actual box, \hat{ω}_i is the width of the predicted box, C_i is the actual class, \hat{C}_i is the predicted class, p_i(c) is the actual class probability, and \hat{p}_i(c) is the predicted class probability;

Step 3.5: the parameters are trained through the loss function and gradient descent, thereby optimizing the network.
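Steps 3.1 to 3.3 can be sketched in a few lines. The sketch assumes box centers in pixels, grid units for the decoded box, and the usual YOLOv3 width and height mapping b_w = p_w·e^{t_w}, b_h = p_h·e^{t_h}, which is assumed to be what the definitions of p_w and p_h refer to. The function names are hypothetical.

```python
import math


def responsible_cell(cx, cy, img_w, img_h, S):
    """Return (col, row) of the S x S grid cell that contains the box center."""
    col = min(int(cx / img_w * S), S - 1)
    row = min(int(cy / img_h * S), S - 1)
    return col, row


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def decode_box(t, cell, prior):
    """Map raw outputs (t_x, t_y, t_w, t_h) to (b_x, b_y, b_w, b_h) in grid units."""
    t_x, t_y, t_w, t_h = t
    c_x, c_y = cell      # offsets of the responsible cell
    p_w, p_h = prior     # side lengths of the preset box
    return (sigmoid(t_x) + c_x,
            sigmoid(t_y) + c_y,
            p_w * math.exp(t_w),   # assumed YOLOv3-style width mapping
            p_h * math.exp(t_h))   # assumed YOLOv3-style height mapping


cell = responsible_cell(100, 50, 256, 256, S=8)
print(cell)                                         # (3, 1)
print(decode_box((0.0, 0.0, 0.0, 0.0), cell, (2.0, 4.0)))  # (3.5, 1.5, 2.0, 4.0)
```

Because the sigmoid keeps the center offset between 0 and 1, a decoded center always stays inside the cell that is responsible for it.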

The beneficial effects of the present invention are:

To address the shortcomings of existing crowd counting, the invention counts people in specific environments with a target detection method based on a deep neural network. This solves the low accuracy of traditional feature extraction methods, and also the large error that deep learning feature regression methods show when the crowd is sparse. Detection speed is greatly improved as well: it is 4 times that of RetinaNet built on a 101-layer residual network, with comparable precision. The specific advantages are as follows:

1. An end-to-end deep neural network, the YOLO3 network, maps the original image input to object position and class outputs. Features are extracted from three different convolutional layers of the base network and then integrated to output box and class information, so crowd detection is more precise than with traditional feature extraction;

2. For data processing, the training set is enlarged by image mirroring, random cropping across scales, and brightness and contrast adjustment, which improves accuracy;

3. In sparse crowd scenes, the method of this patent detects the number of people more accurately and faster than feature regression methods based on deep neural networks; detection speed is 4 times that of RetinaNet built on a 101-layer residual network, with comparable precision.
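How a detector's output becomes a head count is not spelled out in the text. One plausible sketch is to keep boxes above a confidence threshold, remove overlapping duplicates with greedy non-maximum suppression, and count what remains. The thresholds, the use of non-maximum suppression and the function names are all assumptions made for illustration.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def count_people(detections, conf_thresh=0.5, iou_thresh=0.45):
    """Count boxes that survive confidence filtering and greedy NMS.

    detections: list of ((x1, y1, x2, y2), confidence) pairs.
    """
    kept = []
    for box, conf in sorted(detections, key=lambda d: d[1], reverse=True):
        if conf < conf_thresh:
            break  # remaining boxes are even less confident
        if all(iou(box, k) <= iou_thresh for k in kept):
            kept.append(box)
    return len(kept)


dets = [((0, 0, 10, 10), 0.9), ((1, 1, 11, 11), 0.8),
        ((50, 50, 60, 60), 0.7), ((80, 80, 90, 90), 0.2)]
print(count_people(dets))  # 2
```

In the example, the second box overlaps the first too strongly and the fourth is below the confidence threshold, so two people are counted.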

Brief description of the drawings

Fig. 1 is a structural schematic diagram of the YOLO3 network of the present invention;

Fig. 2 is a schematic diagram of the grid division applied to the input image during network training in the present invention.

Specific embodiment

The present invention is further described below with reference to the drawings:

The people counting method based on deep learning target detection of the present invention comprises the following steps:

Step 1: building the deep learning network model: a YOLO3 network with the existing DarkNet network as its base network is used. As shown in Fig. 1, in this step the YOLO3 network adds three scales on top of the DarkNet base network to extract features, namely Scale1, Scale2 and Scale3. Scale1 adds several convolutional layers after the base network and then outputs box information. Scale2 upsamples from the second-to-last convolutional layer of Scale1, adds the result to the last feature map of size 16×16, and outputs box information after several more convolutions; its scale is twice that of Scale1. Scale3 is similar to Scale2 and finally outputs a feature map of size 32×32.

The speed and accuracy of the DarkNet network compared with other networks are shown in Table 1:

Table 1

In Table 1, Backbone is the backbone network, and Top1 and Top5 denote accuracy. Bn Ops/s is the abbreviation of billion of operations per second, i.e. the number of operations processed per second, in units of billions per second. BFLOP/s is the abbreviation of billion floating point operations per second, i.e. the number of floating-point operations processed per second. FPS is the abbreviation of Frames Per Second, i.e. the number of frames per second.

Step 2: processing the training data: training and optimizing the network requires a large number of training samples, and the quality of the training samples largely determines the quality of the final model. By collecting crowd image data from multiple scenes instead of a single scene, the invention strengthens the generalization ability of the final model; the training set is enlarged by image mirroring and random cropping across scales.
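A minimal sketch of the two augmentations named here, mirroring and random cropping, is given below. It treats an image as a list of pixel rows and boxes as (x1, y1, x2, y2) in pixels. Updating the boxes along with the image, and keeping only boxes whose center stays inside the crop, are assumptions about details the text does not give. The function names are hypothetical.

```python
import random


def mirror(image, boxes):
    """Flip an image (list of pixel rows) left-right and mirror its boxes."""
    width = len(image[0])
    flipped = [row[::-1] for row in image]
    new_boxes = [(width - x2, y1, width - x1, y2) for x1, y1, x2, y2 in boxes]
    return flipped, new_boxes


def random_crop(image, boxes, scale, rng):
    """Crop a random window covering `scale` of each side of the image.

    Boxes whose center falls inside the window are kept and clipped to it.
    """
    h, w = len(image), len(image[0])
    ch, cw = max(1, int(h * scale)), max(1, int(w * scale))
    top, left = rng.randint(0, h - ch), rng.randint(0, w - cw)
    crop = [row[left:left + cw] for row in image[top:top + ch]]
    kept = []
    for x1, y1, x2, y2 in boxes:
        cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
        if left <= cx < left + cw and top <= cy < top + ch:
            kept.append((max(x1, left) - left, max(y1, top) - top,
                         min(x2, left + cw) - left, min(y2, top + ch) - top))
    return crop, kept


image = [[0, 1, 2, 3], [4, 5, 6, 7]]
boxes = [(0, 0, 1, 2)]
print(mirror(image, boxes))  # ([[3, 2, 1, 0], [7, 6, 5, 4]], [(3, 0, 4, 2)])
```

Calling `random_crop` with several `scale` values on the same image is one way to read "random cropping across scales".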

Step 2 specifically comprises the following steps:

Step 2.1: photograph crowds with cameras in multiple scenes to obtain images containing multiple people; to further enlarge the data volume, existing datasets are also downloaded from the Internet and crowd images are downloaded with a crawler program; all of these images serve as input images;

Step 2.2: balance the existing sample data: first ensure that the number of training samples is similar across scenes; second, because samples obtained through different channels differ in image resolution, which would keep the final training result from reaching the desired effect, process the images to the same size and resolution;
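The balancing part of Step 2.2 might be done, for instance, by randomly downsampling every scene to the size of the smallest one. This policy is only one possible reading of "similar sample counts", and the scene names and function name below are hypothetical; resizing to a common resolution is not shown.

```python
import random


def balance_scenes(samples_by_scene, rng):
    """Downsample every scene to the sample count of the smallest scene."""
    target = min(len(items) for items in samples_by_scene.values())
    return {scene: rng.sample(items, target)
            for scene, items in samples_by_scene.items()}


data = {"station": list(range(10)), "square": list(range(4))}
balanced = balance_scenes(data, random.Random(0))
print({scene: len(items) for scene, items in balanced.items()})
# {'station': 4, 'square': 4}
```

Oversampling the smaller scenes through augmentation would be an alternative that discards no data.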

Step 3 specifically comprises the following steps:

Step 3.1: as shown in Fig. 2, the input image is divided into an S × S grid; if the center of a person falls within a grid cell, that cell is responsible for detecting this person;

b_x = σ(t_x) + c_x, b_y = σ(t_y) + c_y, b_w = p_w e^{t_w}, b_h = p_h e^{t_h},

Step 3.4: calculation of the LOSS function, i.e. the loss function:

LOSS=LOSS1+LOSS2+LOSS3+LOSS4,

The specific formula of the LOSS function is as follows:

In the above formula, the first two rows are the coordinate prediction loss, the third row is the probability loss for boxes that contain a detected object, the fourth row is the probability loss for boxes that do not contain a detected object, and the last row, i.e. the fifth row, is the class prediction probability loss. Here LOSS is the overall loss function, λ_coord is the loss function coefficient, 1_ij^obj indicates whether the j-th box in the i-th grid cell is responsible for the object, x_i is the center abscissa of the actual box, \hat{x}_i is the center abscissa of the predicted box, y_i is the center ordinate of the actual box, \hat{y}_i is the center ordinate of the predicted box, ω_i is the width of the actual box, \hat{ω}_i is the width of the predicted box, C_i is the actual class, \hat{C}_i is the predicted class, p_i(c) is the actual class probability, and \hat{p}_i(c) is the predicted class probability;
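The formula itself is not reproduced in this text. The five rows and symbols described above match the sum-of-squares loss of the original YOLO formulation, so a plausible reconstruction under that assumption is given below. The height terms h_i and \hat{h}_i, the coefficient \lambda_{noobj}, the indicator \mathbb{1}_{ij}^{noobj} and the summation limits are not named in the text and are taken from that formulation.

```latex
\begin{aligned}
\mathrm{LOSS} ={}& \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj}
    \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
&+ \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj}
    \left[ \left(\sqrt{\omega_i} - \sqrt{\hat{\omega}_i}\right)^2
         + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right] \\
&+ \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj} \left( C_i - \hat{C}_i \right)^2 \\
&+ \lambda_{noobj} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{noobj} \left( C_i - \hat{C}_i \right)^2 \\
&+ \sum_{i=0}^{S^2} \mathbb{1}_{i}^{obj} \sum_{c \in classes} \left( p_i(c) - \hat{p}_i(c) \right)^2
\end{aligned}
```

Rows one and two are the coordinate loss, rows three and four the losses for boxes with and without an object, and row five the class probability loss, in the order described above.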

The above embodiment is a preferred embodiment of the present invention and does not limit its technical solution; any technical solution that can be realized on the basis of the above embodiment without creative work shall be regarded as falling within the scope of protection of this patent.

Claims

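1. A people counting method based on deep learning target detection, characterized by comprising the following steps:

Step 1: building the deep learning network model: a YOLO3 network with the DarkNet network as its base network is used;

Step 2: processing the training data: crowd image data is collected from multiple scenes, and the training set is enlarged by image mirroring and random cropping across scales;

Step 3: training the network: the parameters are trained through the loss function and gradient descent to optimize the network.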

2. The people counting method based on deep learning target detection according to claim 1, characterized in that: in Step 1, three scales are added on top of the DarkNet base network to extract features, namely Scale1, Scale2 and Scale3; Scale1 adds several convolutional layers after the base network and then outputs box information; Scale2 upsamples from the second-to-last convolutional layer of Scale1, adds the result to the last feature map of size 16×16, and outputs box information after several more convolutions, its scale being twice that of Scale1; Scale3 is similar to Scale2 and finally outputs a feature map of size 32×32.

3. The people counting method based on deep learning target detection according to claim 1, characterized in that Step 2 specifically comprises the following steps:

Step 2.1: photograph crowds with cameras in multiple scenes to obtain images containing multiple people as input images;

Step 2.2: balance the existing sample data: first ensure that the number of training samples is similar across scenes, then process the images to the same size and resolution;

Step 2.3: expand the existing data by image mirroring, random cropping across scales, and brightness and contrast adjustment;

Step 2.4: image annotation: detecting the number of people with a deep learning target detection method requires annotating a rectangular box at each pedestrian's position in the image, each box being determined by the two corner coordinates on its diagonal; the annotation tool is labelimg, an XML file is generated for each annotated picture, and the image name and pedestrian coordinates in the XML file are extracted into a TXT file to be read during network training.

4. The people counting method based on deep learning target detection according to claim 3, characterized in that: in Step 2.1, existing datasets are also downloaded from the Internet and crowd images are downloaded with a crawler program; together with the images captured by cameras, they serve as input images.

5. The people counting method based on deep learning target detection according to claim 1, characterized in that Step 3 specifically comprises the following steps:

Step 3.1: the input image is divided into an S × S grid; if the center of a person falls within a grid cell, that cell is responsible for detecting this person;

Step 3.2: each grid cell predicts B rectangular boxes and a score for each box; the score reflects the model's prediction of whether the cell contains a person and how likely the box is to be that person;

Step 3.3: if no person is present in the cell, the score is 0; otherwise the score is 1; each rectangular box contains 5 predicted values, namely b_x, b_y, b_w, b_h and confidence; the coordinates (b_x, b_y) represent the center of the box, b_w and b_h are its width and height, and confidence is the probability that the box contains a person;

Step 3.4: calculation of the LOSS function, i.e. the loss function:

The specific formula of the LOSS function is as follows:

In the above formula, the first two rows are the coordinate prediction loss, the third row is the probability loss for boxes that contain a detected object, the fourth row is the probability loss for boxes that do not contain a detected object, and the last row is the class prediction probability loss; here LOSS is the overall loss function, λ_coord is the loss function coefficient, 1_ij^obj indicates whether the j-th box in the i-th grid cell is responsible for the object, x_i is the center abscissa of the actual box, \hat{x}_i is the center abscissa of the predicted box, y_i is the center ordinate of the actual box, \hat{y}_i is the center ordinate of the predicted box, ω_i is the width of the actual box, \hat{ω}_i is the width of the predicted box, C_i is the actual class, \hat{C}_i is the predicted class, p_i(c) is the actual class probability, and \hat{p}_i(c) is the predicted class probability;