CN112380960A - Crowd counting method, device, equipment and storage medium - Google Patents
Info
- Publication number
- CN112380960A CN112380960A CN202011254152.6A CN202011254152A CN112380960A CN 112380960 A CN112380960 A CN 112380960A CN 202011254152 A CN202011254152 A CN 202011254152A CN 112380960 A CN112380960 A CN 112380960A
- Authority
- CN
- China
- Prior art keywords
- head
- shoulder detection
- frame
- image
- frames
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/25—Determination of region of interest [ROI] or a volume of interest [VOI]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/50—Context or environment of the image
- G06V20/52—Surveillance or monitoring of activities, e.g. for recognising suspicious objects
- G06V20/53—Recognition of crowd images, e.g. recognition of crowd congestion
Abstract
The application discloses a crowd counting method, device, equipment and storage medium. The method comprises: sequentially inputting each frame of image in an acquired target video into a preset head and shoulder detection model for head and shoulder detection, and outputting a head and shoulder detection frame for each frame of image; matching the head and shoulder detection frames in two consecutive frames of images, and judging two successfully matched head and shoulder detection frames to be the same target; tracking the same target in the target video to obtain tracking tracks; and counting the number of tracking tracks to obtain the crowd counting result for the target video. This solves the technical problem that existing pedestrian detection methods produce large counting errors when the crowd is dense and pedestrians seriously occlude one another.
Description
Technical Field
The present application relates to the field of image processing technologies, and in particular, to a method, an apparatus, a device, and a storage medium for people counting.
Background
After the analog and digital video monitoring eras, video monitoring systems have entered the era of intelligent video monitoring. In an intelligent video monitoring system, crowd density detection is a core task. In scenes such as parks and stations in particular, crowd images are collected through cameras and the number of people is rapidly analyzed and counted, so that an alarm can be raised for high-density crowd scenes to avoid safety accidents such as overcrowding and even stampedes.
In the prior art, people are counted by pedestrian detection. This approach suffers from large detection errors when the crowd is dense and pedestrians seriously occlude one another.
Disclosure of Invention
The application provides a crowd counting method, device, equipment and storage medium, which are used for solving the technical problem that existing pedestrian detection methods produce large detection errors when the crowd is dense and pedestrians seriously occlude one another.
In view of the above, the present application provides, in a first aspect, a crowd counting method, including:
sequentially inputting each frame of image in the acquired target video to a preset head and shoulder detection model for head and shoulder detection, and outputting a head and shoulder detection frame of each frame of image;
matching each head and shoulder detection frame in two continuous frames of the images, and judging that the two successfully matched head and shoulder detection frames are the same target;
tracking the same target in the target video to obtain a tracking track;
and calculating the number of the tracking tracks to obtain the people counting result in the target video.
Optionally, the preset head and shoulder detection model includes: a feature map reduction module and a multi-scale receptive field expansion module connected to the feature map reduction module;
correspondingly, the sequentially inputting each frame of image in the acquired target video into a preset head and shoulder detection model for head and shoulder detection and outputting a head and shoulder detection frame for each frame of image includes:
sequentially inputting each frame of image in the acquired target video into the preset head and shoulder detection model, so that the feature map reduction module performs feature extraction on the input image and reduces the size of the extracted feature map, and the multi-scale receptive field expansion module performs multi-scale processing on the reduced feature map, performs head and shoulder detection frame prediction based on the extracted multi-scale features, and outputs a head and shoulder detection frame for each frame of image.
Optionally, the feature map reduction module includes: a first convolutional layer, a second convolutional layer, a third convolutional layer and a fourth convolutional layer;
wherein the convolution kernel size of the first convolution layer is 7 × 7, and the convolution kernel size of the second convolution layer, the third convolution layer, and the fourth convolution layer is 3 × 3.
Optionally, the multi-scale receptive field expansion module includes: an Inception layer, a convolution layer and 3 prediction layers.
Optionally, the matching each of the head and shoulder detection frames in the two consecutive frames of images, and determining that the two successfully matched head and shoulder detection frames are the same target, includes:
calculating the intersection ratio between the head and shoulder detection frames in two consecutive frames of the images, and when the maximum intersection ratio is greater than a preset threshold value, successfully matching the two head and shoulder detection frames corresponding to the maximum intersection ratio;
and judging that the two head and shoulder detection frames which are successfully matched are the same target.
A second aspect of the present application provides a people counting device comprising:
the output unit is used for sequentially inputting each frame of image in the acquired target video into a preset head and shoulder detection model for head and shoulder detection and outputting a head and shoulder detection frame of each frame of image;
the matching unit is used for matching each head and shoulder detection frame in the two continuous frames of images and judging that the two successfully matched head and shoulder detection frames are the same target;
the tracking unit is used for tracking the same target in the target video to obtain a tracking track;
and the calculating unit is used for calculating the number of the tracking tracks to obtain the people counting result in the target video.
Optionally, the preset head and shoulder detection model includes: a feature map reduction module and a multi-scale receptive field expansion module connected to the feature map reduction module;
correspondingly, the output unit is specifically configured to:
sequentially inputting each frame of image in the acquired target video to a preset head and shoulder detection model, enabling the feature map reduction module to perform feature extraction on the input image and reduce the size of the extracted feature map, performing multi-scale processing on the reduced feature map by the multi-scale receptive field expansion module, performing head and shoulder detection frame prediction based on the extracted multi-scale features, and outputting a head and shoulder detection frame of each frame of image.
Optionally, the matching unit is specifically configured to:
calculating the intersection ratio between the head and shoulder detection frames in two consecutive frames of the images, and when the maximum intersection ratio is greater than a preset threshold value, successfully matching the two head and shoulder detection frames corresponding to the maximum intersection ratio;
and judging that the two head and shoulder detection frames which are successfully matched are the same target.
A third aspect of the application provides a people counting device comprising a processor and a memory;
the memory is used for storing program codes and transmitting the program codes to the processor;
the processor is configured to perform the people counting method according to any of the first aspect according to instructions in the program code.
A fourth aspect of the present application provides a computer readable storage medium for storing program code for performing the people counting method of any one of the first aspect.
According to the technical scheme, the method has the following advantages:
the application provides a crowd counting method, which comprises the following steps: sequentially inputting each frame of image in the acquired target video to a preset head and shoulder detection model for head and shoulder detection, and outputting a head and shoulder detection frame of each frame of image; matching each head and shoulder detection frame in two continuous frames of images, and judging that the two successfully matched head and shoulder detection frames are the same target; tracking the same target in the target video to obtain a tracking track; and calculating the number of the tracking tracks to obtain the people counting result in the target video.
According to the method and the device, the head and the shoulder of each frame of image in the target video are detected through the preset head and shoulder detection model, so that false detection and missing detection caused by mutual shielding of people are avoided; the method comprises the steps of matching head and shoulder detection frames in two continuous frames of images, determining the head and shoulder detection frames belonging to the same target in the two continuous frames of images, tracking the head and shoulder detection frames, and finally determining the people counting result in a target video through the number of tracking tracks.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art that other drawings can be obtained according to the drawings without inventive exercise.
Fig. 1 is a schematic flow chart of a crowd counting method according to an embodiment of the present application;
fig. 2 is a schematic structural diagram of a preset head and shoulder detection model according to an embodiment of the present disclosure;
fig. 3 is a schematic structural diagram of a crowd counting apparatus according to an embodiment of the present disclosure.
Detailed Description
The application provides a crowd counting method, device, equipment and storage medium, which are used for solving the technical problem that existing pedestrian detection methods produce large detection errors when the crowd is dense and pedestrians seriously occlude one another.
In order to make the technical solutions of the present application better understood, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
For ease of understanding, referring to fig. 1, the present application provides an embodiment of a people counting method, comprising:
and 101, sequentially inputting each frame of image in the acquired target video into a preset head and shoulder detection model for head and shoulder detection, and outputting a head and shoulder detection frame of each frame of image.
Video of a crowd-dense area is captured by a camera to obtain the target video, and the target video is split into individual frames of image.
Existing pedestrian detection networks are large and costly in computational resources. To address this, the preset head and shoulder detection model in the embodiment of the application is a lightweight neural network model consisting mainly of two parts: a feature map reduction module and a multi-scale receptive field expansion module. Each frame of image in the target video is input into the preset head and shoulder detection model in sequence, so that the feature map reduction module extracts features from the input image and reduces the size of the extracted feature map, and the multi-scale receptive field expansion module performs multi-scale processing on the reduced feature map, predicts head and shoulder detection frames based on the extracted multi-scale features, and outputs a head and shoulder detection frame for each frame of image.
Further, the preset head and shoulder detection model can be structured as shown in fig. 2. The feature map reduction module rapidly reduces the spatial size of the feature map and increases network speed. It comprises a first convolutional layer, a second convolutional layer, a third convolutional layer and a fourth convolutional layer; the convolution kernel size of the first convolutional layer Conv1 is 7 × 7, and the convolution kernel sizes of the second convolutional layer Conv2, the third convolutional layer Conv3 and the fourth convolutional layer Conv4 are 3 × 3.
The feature map reduction module first uses a convolutional layer with a 7 × 7 kernel and a stride of 4 to rapidly shrink the input image, greatly reducing the size of the feature maps processed by subsequent convolutional layers and thus the amount of computation. At the same time, the 7 × 7 kernel has a relatively large parameter count and receptive field, so the extracted features are richer, which compensates for the feature information lost through the rapid size reduction. After the first convolution, a convolutional layer with a 3 × 3 kernel and a stride of 2 further reduces the feature size. A convolutional layer with a 3 × 3 kernel and a stride of 1 follows, which on one hand slows the loss of feature information and on the other hand deepens the network so that it extracts more accurate deep features. Finally, the fourth convolutional layer Conv4 quickly reduces the feature map size to 1/16 of the input.
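The stated 1/16 overall reduction can be sanity-checked with standard convolution output-size arithmetic. The sketch below is an illustration only: the input resolution (224) and the per-layer strides and paddings (stride 4/2/1/2 with "same"-style padding) are assumptions inferred from the description, not values taken from the patent figure.

```python
def conv_out(size: int, kernel: int, stride: int, pad: int) -> int:
    """Spatial output size of a convolution: floor((size + 2*pad - kernel) / stride) + 1."""
    return (size + 2 * pad - kernel) // stride + 1

# Hypothetical input resolution; the patent does not fix one.
size = 224
# (name, kernel, stride, pad) per layer; strides chosen so Conv4's output is 1/16 of the input.
layers = [("Conv1", 7, 4, 3), ("Conv2", 3, 2, 1), ("Conv3", 3, 1, 1), ("Conv4", 3, 2, 1)]
for name, k, s, p in layers:
    size = conv_out(size, k, s, p)
    print(name, size)  # 56, 28, 28, 14

assert size == 224 // 16  # overall 1/16 spatial reduction, as the description states
```

With these assumed strides the feature map shrinks 224 → 56 → 28 → 28 → 14, matching the claimed 1/16 factor.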
This rapid feature map reduction greatly increases speed while limiting the precision loss caused by discarded feature information, so the model runs fast yet remains accurate. In addition, the network sets the numbers of convolution kernels of Conv1, Conv2, Conv3 and Conv4 to 12, 24 and 48 respectively, reducing parameter redundancy and further improving operation efficiency.
Further, referring to fig. 2, the multi-scale receptive field expansion module includes: an Inception layer, a convolutional layer (a dilated convolutional layer) and 3 prediction layers. The multi-scale receptive field expansion module enlarges the receptive field associated with the target and, in combination with multi-scale receptive fields, provides rich contextual semantic information for the head-shoulder target features. The module in the embodiment of the application uses a dilation rate suited to the distribution of head-shoulder data (a dilation rate of 3 is preferred), which greatly reduces the precision loss caused by multi-branch dilated convolutional layers; the dilated convolutional layer with a dilation rate of 3 enlarges the scale of the receptive field.
After receptive field enlargement and multi-scale feature generation, the multi-scale receptive field expansion module predicts head-shoulder targets separately on 3 different convolutional layers: the 1st prediction layer is set after the Inception layer, the second after Conv6_1, and the third after Conv9_1, with prior boxes of different scales. Because the aspect ratio of head-shoulder targets in head-shoulder detection is close to 1:1, the embodiment of the application uses prior boxes with an aspect ratio of 1:1 to regress the targets efficiently and save computation. Layered prediction combined with the multi-scale prior box design effectively improves the robustness of the detector. The loss functions of the preset head and shoulder detection model in the embodiment of the application comprise a Softmax loss and a Smooth L1 loss: the Softmax loss computes the loss on the predicted target category, and the Smooth L1 loss regresses the predicted detection boxes against the actual ones.
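The box-regression loss mentioned above can be made concrete. This is a minimal pure-Python sketch of the standard Smooth L1 definition applied to one coordinate residual; it illustrates the textbook function, not code from the patent.

```python
def smooth_l1(x: float) -> float:
    """Smooth L1 loss on a residual x: quadratic near zero (stable gradients),
    linear for large residuals (robust to outlier boxes)."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1.0 else ax - 0.5

# Residuals between a predicted and a ground-truth box coordinate.
print(smooth_l1(0.5))   # 0.125 (quadratic branch)
print(smooth_l1(2.0))   # 1.5   (linear branch)
```

In training, this is summed over the four box offsets of each matched prior box, while Softmax cross-entropy scores the head-shoulder/background classification.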
For the head and shoulder detection frame results of the three prediction layers, non-maximum suppression is used to screen the head and shoulder detection frames and output the optimal ones.
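The screening step can be sketched as greedy non-maximum suppression over scored boxes. The `(x1, y1, x2, y2)` box format and the 0.5 overlap threshold are illustrative assumptions; the patent does not specify them.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def nms(boxes, scores, thresh=0.5):
    """Keep boxes in descending score order, dropping any box that overlaps
    an already-kept box by more than thresh."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= thresh for j in keep):
            keep.append(i)
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # → [0, 2]: the second box overlaps the first heavily and is suppressed
```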
And 102, matching each head and shoulder detection frame in two continuous frames of images, and judging that the two successfully matched head and shoulder detection frames are the same target.
After detection, each head-shoulder target in the current frame corresponds to one head and shoulder detection frame, but the detector's finite precision means the results may contain missed and false detections. Therefore, to improve the accuracy of the crowd counting method, a multi-target tracking algorithm is added on top of detection to correct the detection results, yielding the tracking track of each head-shoulder target across consecutive video frames.
The embodiment of the application uses the IoU (intersection over union) between the head and shoulder detection frames of the preceding and following frames as the association criterion, directly matching all head and shoulder detection frames between the two frames without modeling the appearance of the detected targets or predicting motion trajectories.
Further, the matching process may be: calculating the intersection over union between the head and shoulder detection frames of two consecutive frames; when the maximum intersection over union exceeds a preset threshold, the two head and shoulder detection frames corresponding to that maximum are successfully matched. Specifically, the IoU between each head and shoulder detection frame in the current frame and each head and shoulder detection frame in the previous frame is calculated; for each tracked target, the detection frame in the current frame with the maximum IoU against the target's previous position is selected, and if that maximum IoU is larger than the preset threshold, the two head and shoulder detection frames corresponding to it are judged to match and to be the same target; otherwise the matching fails.
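The greedy IoU association between consecutive frames can be sketched as follows. The 0.3 matching threshold is an illustrative assumption, since the patent leaves the preset threshold unspecified.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union else 0.0

def match_frames(prev_boxes, curr_boxes, thresh=0.3):
    """For each tracked box from the previous frame, pick the unused current-frame
    box with the highest IoU; accept the pair as the same target only if that
    maximum IoU exceeds thresh."""
    matches, used = [], set()
    for i, p in enumerate(prev_boxes):
        best_j, best_iou = -1, 0.0
        for j, c in enumerate(curr_boxes):
            if j not in used:
                v = iou(p, c)
                if v > best_iou:
                    best_j, best_iou = j, v
        if best_j >= 0 and best_iou > thresh:
            matches.append((i, best_j))
            used.add(best_j)
    return matches

prev = [(0, 0, 10, 10), (50, 50, 60, 60)]
curr = [(2, 2, 12, 12), (80, 80, 90, 90)]
print(match_frames(prev, curr))  # → [(0, 0)]: only the first target is re-identified
```

Note that, as the text says, no appearance features or motion prediction are involved: overlap alone decides identity.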
And 103, tracking the same target in the target video to obtain a tracking track.
The same targets in the target video are tracked, and each target yields a tracking track (tracklet). If a tracklet fails to match, the target is considered to have left the scene. If a head and shoulder detection frame matches no tracklet, it is considered a newly appearing target and a new tracklet is created for it.
The embodiment of the application tracks the head and shoulder detection frames as follows: when the same target is detected in N consecutive frames (e.g. 3 consecutive frames), tracking of the target starts; if the target is then not detected in M consecutive frames after its last detection (e.g. 10 consecutive frames), tracking ends.
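This birth/death policy can be sketched as a small per-track state machine. N=3 and M=10 follow the preferred values given in the text; the bookkeeping fields (`hits`, `misses`, `confirmed`, `finished`) are illustrative names, not from the patent.

```python
class Track:
    """One head-shoulder target: confirmed after N consecutive detections,
    terminated after M consecutive missed frames."""
    def __init__(self, n_confirm: int = 3, m_terminate: int = 10):
        self.hits = 0          # consecutive frames in which the target was detected
        self.misses = 0        # consecutive frames since the last detection
        self.confirmed = False # tracking has started
        self.finished = False  # tracking has ended

        self.n_confirm = n_confirm
        self.m_terminate = m_terminate

    def update(self, detected: bool) -> None:
        if self.finished:
            return
        if detected:
            self.hits += 1
            self.misses = 0
            if self.hits >= self.n_confirm:
                self.confirmed = True
        else:
            self.misses += 1
            if self.misses >= self.m_terminate:
                self.finished = True

t = Track()
for _ in range(3):        # detected in 3 consecutive frames -> tracking starts
    t.update(True)
print(t.confirmed)        # True
for _ in range(10):       # missed in 10 consecutive frames -> tracking ends
    t.update(False)
print(t.finished)         # True
```

Counting the confirmed tracks created over the whole video then yields the crowd count of step 104.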
And step 104, calculating the number of the tracking tracks to obtain the people counting result in the target video.
Finally, the number of people in the target video is determined from the number of tracking tracks. Compared with a crowd counting strategy based on single-frame target detection, this tracking-plus-detection strategy further improves the precision and robustness of crowd counting.
In the embodiment of the application, the head and the shoulder of each frame of image in the target video are detected through the preset head and shoulder detection model, so that false detection and missing detection caused by mutual shielding of people are avoided; the method comprises the steps of matching head and shoulder detection frames in two continuous frames of images, determining the head and shoulder detection frames belonging to the same target in the two continuous frames of images, tracking the head and shoulder detection frames, and finally determining the people counting result in a target video through the number of tracking tracks.
The above is a crowd counting method provided by the present application, and the following is a crowd counting device provided by the embodiment of the present application.
Referring to fig. 3, an embodiment of a crowd counting apparatus provided in the present application includes:
the output unit 201 is configured to sequentially input each frame of image in the acquired target video to a preset head and shoulder detection model for head and shoulder detection, and output a head and shoulder detection frame of each frame of image;
a matching unit 202, configured to match each head and shoulder detection frame in two consecutive frames of images, and determine that two successfully matched head and shoulder detection frames are the same target;
the tracking unit 203 is configured to track the same target in the target video to obtain a tracking track;
and the calculating unit 204 is used for calculating the number of the tracking tracks to obtain the people counting result in the target video.
As a further improvement, the preset head and shoulder detection model comprises: a feature map reduction module and a multi-scale receptive field expansion module connected to the feature map reduction module;
correspondingly, the output unit 201 is specifically configured to:
and sequentially inputting each frame of image in the acquired target video into a preset head and shoulder detection model, so that a feature map reduction module performs feature extraction on the input image and reduces the size of the extracted feature map, a multi-scale receptive field expansion module performs multi-scale processing on the reduced feature map, performs head and shoulder detection frame prediction based on the extracted multi-scale features, and outputs a head and shoulder detection frame of each frame of image.
As a further improvement, the matching unit 202 is specifically configured to:
calculating the intersection ratio between the head and shoulder detection frames in two consecutive frames of images, and when the maximum intersection ratio is greater than a preset threshold value, successfully matching the two head and shoulder detection frames corresponding to the maximum intersection ratio;
and judging that the two head and shoulder detection frames which are successfully matched are the same target.
The embodiment of the application also provides crowd counting equipment, which comprises a processor and a memory;
the memory is used for storing the program codes and transmitting the program codes to the processor;
the processor is adapted to perform the people counting method of the aforementioned embodiments of the people counting method according to instructions in the program code.
An embodiment of the present application further provides a computer-readable storage medium, which is used for storing program codes, and the program codes are used for executing the crowd counting method in the aforementioned crowd counting method embodiment.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application may be substantially implemented or contributed to by the prior art, or all or part of the technical solution may be embodied in a software product, which is stored in a storage medium and includes instructions for executing all or part of the steps of the method described in the embodiments of the present application through a computer device (which may be a personal computer, a server, or a network device). And the aforementioned storage medium includes: a U disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
The above embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions in the embodiments of the present application.
Claims (10)
1. A crowd counting method, comprising:
sequentially inputting each frame of image in the acquired target video to a preset head and shoulder detection model for head and shoulder detection, and outputting a head and shoulder detection frame of each frame of image;
matching each head and shoulder detection frame in two continuous frames of the images, and judging that the two successfully matched head and shoulder detection frames are the same target;
tracking the same target in the target video to obtain a tracking track;
and calculating the number of the tracking tracks to obtain the people counting result in the target video.
2. The people counting method of claim 1, wherein the preset head and shoulder detection model comprises: a feature map reduction module and a multi-scale receptive field expansion module connected to the feature map reduction module;
correspondingly, the sequentially inputting each frame of image in the acquired target video into a preset head and shoulder detection model for head and shoulder detection and outputting the head and shoulder detection frames of each frame of image comprises:
sequentially inputting each frame of image in the acquired target video into the preset head and shoulder detection model, so that the feature map reduction module performs feature extraction on the input image and reduces the size of the extracted feature map, the multi-scale receptive field expansion module performs multi-scale processing on the reduced feature map and performs head and shoulder detection frame prediction based on the extracted multi-scale features, and the head and shoulder detection frames of each frame of image are output.
3. The people counting method of claim 2, wherein the feature map reduction module comprises: a first convolution layer, a second convolution layer, a third convolution layer and a fourth convolution layer;
wherein the convolution kernel size of the first convolution layer is 7 × 7, and the convolution kernel sizes of the second, third and fourth convolution layers are 3 × 3.
4. The people counting method of claim 2, wherein the multi-scale receptive field expansion module comprises: an Inception layer, a convolution layer and 3 prediction layers.
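As a numeric illustration of claim 3: the claim fixes the kernel sizes (one 7 × 7 convolution followed by three 3 × 3 convolutions) but not the strides or padding, so the stride-2, half-kernel-padding choices in the sketch below are assumptions made only to show how such a module could shrink the feature map before multi-scale processing.

```python
# Output-size arithmetic for a feature map reduction module of the
# claimed shape. Kernel sizes (7x7 then three 3x3) come from claim 3;
# stride 2 and half-kernel padding per layer are assumptions.

def conv_out(size, kernel, stride, padding):
    # standard convolution output-size formula
    return (size + 2 * padding - kernel) // stride + 1

def reduction_module_out(size):
    size = conv_out(size, kernel=7, stride=2, padding=3)  # first conv layer
    for _ in range(3):                                    # second to fourth layers
        size = conv_out(size, kernel=3, stride=2, padding=1)
    return size

# under these assumptions a 512x512 input is reduced to 32x32 (16x reduction)
print(reduction_module_out(512))  # -> 32
```

With these assumed strides, each layer halves the spatial size, which is consistent with the module's stated purpose of reducing the feature map before the multi-scale receptive field expansion module processes it.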
5. The people counting method of claim 1, wherein the matching the head and shoulder detection frames between two consecutive frames of the images and determining that two successfully matched head and shoulder detection frames correspond to the same target comprises:
calculating the intersection-over-union ratio between the head and shoulder detection frames in two consecutive frames of the images, and when the maximum intersection-over-union ratio is greater than a preset threshold, determining that the two head and shoulder detection frames corresponding to the maximum ratio are successfully matched;
and determining that the two successfully matched head and shoulder detection frames correspond to the same target.
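The matching-and-counting flow of claims 1 and 5 can be sketched as follows. This is a minimal illustration, not the patented implementation: the box format `(x1, y1, x2, y2)`, the greedy per-track matching order, and the 0.5 threshold are assumptions not fixed by the claims.

```python
# Sketch of IoU-based matching across consecutive frames and trajectory
# counting, per claims 1 and 5. Box format (x1, y1, x2, y2), greedy
# matching order, and the 0.5 threshold are assumptions.

def iou(a, b):
    """Intersection-over-union ("intersection ratio") of two boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

def count_people(frames, thr=0.5):
    """frames: per-frame lists of head-shoulder boxes; returns track count."""
    tracks = []  # each track holds the boxes of one target
    for boxes in frames:
        unmatched = list(boxes)
        for track in tracks:
            if not unmatched:
                break
            # match each track to the detection with the maximum IoU,
            # but only when that maximum exceeds the preset threshold
            best = max(unmatched, key=lambda b: iou(track[-1], b))
            if iou(track[-1], best) > thr:
                track.append(best)
                unmatched.remove(best)
        # detections matched to no existing track start new trajectories
        tracks.extend([b] for b in unmatched)
    return len(tracks)  # people count = number of trajectories
```

For example, two targets whose boxes drift by roughly a pixel per frame keep a high overlap with their own previous box and a zero overlap with the other's, so `count_people` returns 2 for them rather than counting each detection separately.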
6. A people counting device, comprising:
an output unit, configured to sequentially input each frame of image in an acquired target video into a preset head and shoulder detection model for head and shoulder detection, and output the head and shoulder detection frames of each frame of image;
a matching unit, configured to match the head and shoulder detection frames between two consecutive frames of the images, and determine that two successfully matched head and shoulder detection frames correspond to the same target;
a tracking unit, configured to track the same target through the target video to obtain a tracking track;
and a calculating unit, configured to count the number of tracking tracks to obtain a people counting result for the target video.
7. The people counting device of claim 6, wherein the preset head and shoulder detection model comprises: a feature map reduction module and a multi-scale receptive field expansion module connected to the feature map reduction module;
correspondingly, the output unit is specifically configured to:
sequentially input each frame of image in the acquired target video into the preset head and shoulder detection model, so that the feature map reduction module performs feature extraction on the input image and reduces the size of the extracted feature map, the multi-scale receptive field expansion module performs multi-scale processing on the reduced feature map and performs head and shoulder detection frame prediction based on the extracted multi-scale features, and the head and shoulder detection frames of each frame of image are output.
8. The people counting device of claim 6, wherein the matching unit is specifically configured to:
calculate the intersection-over-union ratio between the head and shoulder detection frames in two consecutive frames of the images, and when the maximum intersection-over-union ratio is greater than a preset threshold, determine that the two head and shoulder detection frames corresponding to the maximum ratio are successfully matched;
and determine that the two successfully matched head and shoulder detection frames correspond to the same target.
9. A people counting device, wherein the device comprises a processor and a memory;
the memory is configured to store program code and transmit the program code to the processor;
and the processor is configured to perform the people counting method according to any one of claims 1 to 5 in accordance with instructions in the program code.
10. A computer-readable storage medium, wherein the computer-readable storage medium is configured to store program code for performing the people counting method according to any one of claims 1 to 5.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011254152.6A CN112380960A (en) | 2020-11-11 | 2020-11-11 | Crowd counting method, device, equipment and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112380960A | 2021-02-19 |
Family
ID=74582675
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011254152.6A Pending CN112380960A (en) | 2020-11-11 | 2020-11-11 | Crowd counting method, device, equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112380960A (en) |
2020-11-11: Application CN202011254152.6A filed; published as CN112380960A; status Pending.
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109697499A (en) * | 2017-10-24 | 2019-04-30 | 北京京东尚科信息技术有限公司 | Pedestrian's flow funnel generation method and device, storage medium, electronic equipment |
CN109284670A (en) * | 2018-08-01 | 2019-01-29 | 清华大学 | A kind of pedestrian detection method and device based on multiple dimensioned attention mechanism |
CN110738160A (en) * | 2019-10-12 | 2020-01-31 | 成都考拉悠然科技有限公司 | human face quality evaluation method combining with human face detection |
CN111611878A (en) * | 2020-04-30 | 2020-09-01 | 杭州电子科技大学 | Method for crowd counting and future people flow prediction based on video image |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113033353A (en) * | 2021-03-11 | 2021-06-25 | 北京文安智能技术股份有限公司 | Pedestrian trajectory generation method based on overlook image, storage medium and electronic device |
CN113128430A (en) * | 2021-04-25 | 2021-07-16 | 科大讯飞股份有限公司 | Crowd gathering detection method and device, electronic equipment and storage medium |
CN113128430B (en) * | 2021-04-25 | 2024-06-04 | 科大讯飞股份有限公司 | Crowd gathering detection method, device, electronic equipment and storage medium |
CN114119648A (en) * | 2021-11-12 | 2022-03-01 | 史缔纳农业科技(广东)有限公司 | Pig counting method for fixed channel |
CN113988111A (en) * | 2021-12-03 | 2022-01-28 | 深圳佑驾创新科技有限公司 | Statistical method for pedestrian flow of public place and computer readable storage medium |
CN114463378A (en) * | 2021-12-27 | 2022-05-10 | 浙江大华技术股份有限公司 | Target tracking method, electronic device and storage medium |
CN114463378B (en) * | 2021-12-27 | 2023-02-24 | 浙江大华技术股份有限公司 | Target tracking method, electronic device and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112380960A (en) | Crowd counting method, device, equipment and storage medium | |
CN110942009B (en) | Fall detection method and system based on space-time hybrid convolutional network | |
CN109272509B (en) | Target detection method, device and equipment for continuous images and storage medium | |
Haines et al. | Background subtraction with Dirichlet processes |
CN111539290B (en) | Video motion recognition method and device, electronic equipment and storage medium | |
CN110245579B (en) | People flow density prediction method and device, computer equipment and readable medium | |
CN103929685A (en) | Video abstract generating and indexing method | |
CN111104925B (en) | Image processing method, image processing apparatus, storage medium, and electronic device | |
CN107563299B (en) | Pedestrian detection method using RecNN to fuse context information | |
CN109446967B (en) | Face detection method and system based on compressed information | |
CN110633643A (en) | Abnormal behavior detection method and system for smart community | |
CN111652181B (en) | Target tracking method and device and electronic equipment | |
CN112016461A (en) | Multi-target behavior identification method and system | |
CN109697393B (en) | Person tracking method, person tracking device, electronic device, and computer-readable medium | |
CN114926791A (en) | Method and device for detecting abnormal lane change of vehicles at intersection, storage medium and electronic equipment | |
CN111950507B (en) | Data processing and model training method, device, equipment and medium | |
CN110956097A (en) | Method and module for extracting occluded human body and method and device for scene conversion | |
CN113642442B (en) | Face detection method and device, computer readable storage medium and terminal | |
CN111383245A (en) | Video detection method, video detection device and electronic equipment | |
CN115690732A (en) | Multi-target pedestrian tracking method based on fine-grained feature extraction | |
CN112907623A (en) | Statistical method and system for moving object in fixed video stream | |
CN112966136A (en) | Face classification method and device | |
CN113554685A (en) | Method and device for detecting moving target of remote sensing satellite, electronic equipment and storage medium | |
CN104732558B (en) | moving object detection device | |
CN112598707A (en) | Real-time video stream object detection and tracking method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||