CN112541469B

CN112541469B - Crowd counting method and system based on self-adaptive classification

Info

Publication number: CN112541469B
Application number: CN202011526392.7A
Authority: CN
Inventors: 吕蕾; 韩润; 庞辰; 陈梓铭; 吕晨
Original assignee: Shandong Normal University
Current assignee: Shandong Normal University
Priority date: 2020-12-22
Filing date: 2020-12-22
Publication date: 2023-09-08
Anticipated expiration: 2040-12-22
Also published as: CN112541469A

Abstract

The invention provides a crowd counting method and system based on adaptive classification, belonging to the technical field of machine vision. A crowd image is input into a trained classification model to determine the category of the image; according to that category, the image is input into the corresponding trained counting model to obtain a crowd density map; and the density map is integrated to obtain the number of people in the image. The invention reduces the scale difference between heads across the whole image and thus the influence of scale on the counting result, counts image blocks accurately, and predicts the density map of the corresponding scale more accurately. The branches use dilated (hole) convolution, which has fewer parameters than a standard convolution kernel with a receptive field of the same size and a larger receptive field than a standard convolution kernel with the same number of parameters, so that the amount of computation is greatly reduced and the efficiency of crowd counting is improved.

Description

Crowd counting method and system based on self-adaptive classification

Technical Field

The invention relates to the technical field of machine vision, in particular to a crowd counting method and system based on self-adaptive classification.

Background

As large public events become more frequent, they bring with them many potential safety hazards. The people in a scene therefore need to be counted quickly and accurately so that command and dispatch decisions can be made rapidly and crowds can be evacuated in time.

The difficulty of crowd counting lies mainly in the camera angle: people far from the lens appear smaller than people close to it, so the heads in one image differ greatly in scale. In addition, the uneven distribution of people in the image makes accurate counting harder.

Traditional detection-based crowd counting algorithms work well for crowds in sparse scenes but poorly in highly dense scenes. Current mainstream crowd counting algorithms therefore rely on convolutional neural networks, whose structures include single-column and multi-column designs. A single-column network trains quickly but is insensitive to multi-scale information, so its estimates are poor. A multi-column network fuses the feature maps of its columns to generate the final density map. Although this structure takes multi-scale information into account, it has many parameters, produces redundant information when the feature maps are fused, and requires an excessive amount of computation.

Disclosure of Invention

The invention aims to provide a crowd counting method and system based on adaptive classification that produce no redundant information, require little computation and run quickly, so as to solve at least one of the technical problems described in the background.

In order to achieve the above purpose, the present invention adopts the following technical scheme:

In a first aspect, the invention provides a crowd counting method based on adaptive classification, comprising the following steps:

inputting a crowd image into a trained classification model and determining the category of the crowd image, wherein the categories are divided into a first category and a second category according to the scale difference between heads in the crowd image; the classification model is trained on multiple groups of data comprising first-category data and second-category data; each group of the first-category data comprises a photograph containing a crowd and a label indicating that the photograph belongs to the first category; each group of the second-category data comprises a photograph containing a crowd and a label indicating that the photograph belongs to the second category;

inputting the crowd image into the corresponding trained counting model according to its category to obtain a crowd density map, and integrating the crowd density map to obtain the number of people in the crowd image.
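As an illustration of the integration step above: on a discrete image, integrating a density map amounts to summing its pixel values. The following is a minimal sketch in plain Python; the function name `count_from_density_map` and the toy values are assumptions made for illustration and do not come from the patent.

```python
def count_from_density_map(density_map):
    """Discrete integral of a 2-D density map: the sum of all pixel values."""
    return sum(sum(row) for row in density_map)


# Toy 2 x 3 density map (hypothetical values); each person contributes
# a total mass of 1 spread over neighbouring pixels.
density = [
    [0.0, 0.5, 0.5],
    [0.25, 0.75, 0.0],
]

estimated_count = count_from_density_map(density)
print(estimated_count)
```

Because the count is a plain sum, it is unaffected by how the mass of each person is spread spatially, which is why a density map can encode both location and number.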

Preferably, the counting models are divided into a first counting model and a second counting model according to the scale difference between heads in the crowd image. The first counting model is trained on a first training set, each item of which comprises a photograph containing a crowd and a label indicating that the first counting model yields the smallest absolute difference between the crowd density predicted for the photograph and the actual crowd density in the photograph. The second counting model is trained on a second training set, each item of which comprises a photograph containing a crowd and a label indicating that the second counting model yields the smallest absolute difference between the crowd density predicted for the photograph and the actual crowd density in the photograph.

Preferably, the training of the counting model comprises:

warping a plurality of crowd images and applying Gaussian convolution to obtain the corresponding ground-truth density maps;

dividing each crowd image and a corresponding real density image into 4 image blocks with equal size;

inputting every image block in turn into a first dilated convolution network and a second dilated convolution network to obtain the corresponding predicted density maps, and obtaining the predicted number of people by integration;

calculating, for each image block, the absolute error between the predicted number of people and the actual number of people on the first dilated convolution network and on the second dilated convolution network, and taking the network with the smaller absolute error as the label of the image block; after all image blocks have been tested, dividing the image blocks into two groups, each image block carrying a label indicating the network on which its absolute error is smallest;

training each of the two resulting training image sets on the dilated convolution network corresponding to its label, so that each network fits its own data set and predicts the number of people in images of its corresponding scale.

Preferably, when the classification model is trained, the two training sets are input into the classifier of the model for training, so that the classifier learns to classify an input image block, predict the label to which the image block belongs, and send the image block to the corresponding dilated convolution network.

Preferably, warping the crowd image comprises: uniformly stretching outward the region whose distance from the lens exceeds a threshold, thereby enlarging the heads in that region; uniformly contracting inward the region whose distance from the lens is below the threshold, thereby shrinking the heads in that region; and generating a corresponding ground-truth density map.

Preferably, the first dilated convolution network and the second dilated convolution network in the counting model are pre-trained, and the loss function is defined as:

where N is the number of training samples, I_i is the i-th input image block, w denotes the network parameters, F(I_i; w) denotes the estimated density map, and G_i denotes the ground-truth density map.
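The symbol definitions above match the pixel-wise Euclidean loss commonly used for density map regression. As a hedged sketch only, assuming that form and a 1/(2N) normalization (neither of which is confirmed by the text here), the loss could be written as:

```latex
L(w) = \frac{1}{2N} \sum_{i=1}^{N} \left\lVert F(I_i; w) - G_i \right\rVert_2^2
```

Here the squared norm is taken over all pixels of the density map, so the loss penalizes the per-pixel difference between the estimated and ground-truth maps rather than only the difference in total count.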

Preferably, the classifier of the classification model is based on the VGG19 network: it uses the first 16 convolutional layers, followed by a global average pooling layer, two fully connected layers, and finally a softmax layer.

In a second aspect, the invention provides a crowd counting system based on adaptive classification, comprising:

an acquisition module, configured to acquire crowd images;

a classification and recognition module, configured to input a crowd image into the trained classification model and determine the category of the crowd image, the categories being divided into a first category and a second category according to the scale difference between heads in the crowd image; wherein

the classification model is trained on multiple groups of data comprising first-category data and second-category data; each group of the first-category data comprises a photograph containing a crowd and a label indicating that the photograph belongs to the first category; each group of the second-category data comprises a photograph containing a crowd and a label indicating that the photograph belongs to the second category;

a calculation module, configured to input the crowd image into the corresponding trained counting model according to its category to obtain a crowd density map, and to integrate the crowd density map to obtain the number of people in the crowd image.

In a third aspect, the invention provides a computer device comprising a memory and a processor, the processor and the memory being in communication with each other, the memory storing program instructions executable by the processor, the processor invoking the program instructions to perform a method as described above.

In a fourth aspect, the invention provides a computer readable storage medium storing a computer program which, when executed by a processor, implements a method as described above.

The beneficial effects of the invention are as follows: the scale difference between heads across the whole image is reduced, which reduces the influence of scale on the counting result; image blocks are counted accurately, and the density map of the corresponding scale is predicted more accurately; the branches use dilated (hole) convolution, which has fewer parameters than a standard convolution kernel with a receptive field of the same size and a larger receptive field than a standard convolution kernel with the same number of parameters, so that the amount of computation is greatly reduced and the efficiency of crowd counting is improved.

Additional aspects and advantages of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

Fig. 1 is a flowchart of a crowd counting method based on adaptive classification according to an embodiment of the invention.

Fig. 2 is a functional block diagram of a classification predictor according to an embodiment of the present invention.

Fig. 3 is a schematic diagram of dilated (hole) convolution according to an embodiment of the present invention.

Fig. 4 is a functional schematic block diagram of a branch network according to an embodiment of the present invention.

Detailed Description

Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar elements throughout or elements having like or similar functionality. The embodiments described below by way of the drawings are exemplary only and should not be construed as limiting the invention.

It will be understood by those skilled in the art that, unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.

It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.

As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless expressly stated otherwise, as understood by those skilled in the art. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, and/or groups thereof.

In the description of the present specification, a description referring to the terms "one embodiment," "some embodiments," "examples," "specific examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. In addition, those skilled in the art may combine the different embodiments or examples described in this specification, and the features of those embodiments or examples, provided they do not contradict one another.

In order that the invention may be readily understood, a further description of the invention will be rendered by reference to specific embodiments that are illustrated in the appended drawings and are not to be construed as limiting embodiments of the invention.

It will be appreciated by those skilled in the art that the drawings are merely schematic representations of examples and that the elements of the drawings are not necessarily required to practice the invention.

Example 1

The embodiment 1 of the invention provides a crowd counting system based on self-adaptive classification, which comprises:

the acquisition module is used for acquiring crowd images;

In this embodiment 1, a crowd counting method based on adaptive classification is implemented using the system described above, and includes the following steps:

In this embodiment 1, the counting models are divided into a first counting model and a second counting model according to the scale difference between heads in the crowd image. The first counting model is trained on a first training set.

Each item of the first training set comprises a photograph containing a crowd and a label indicating that the first counting model yields the smallest absolute difference between the crowd density predicted for the photograph and the actual crowd density in the photograph.

The second counting model is trained on a second training set, each item of which comprises a photograph containing a crowd and a label indicating that the second counting model yields the smallest absolute difference between the crowd density predicted for the photograph and the actual crowd density in the photograph.

In this embodiment 1, the training of the counting model includes:

warping a plurality of crowd images and applying Gaussian convolution to obtain the corresponding ground-truth density maps;

dividing each crowd image and its ground-truth density map into 4 image blocks of equal size;
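The Gaussian convolution step above can be sketched as follows. This is a minimal illustration in plain Python, assuming point annotations of head centres and a fixed Gaussian width; the function name, `sigma`, `radius` and the sample coordinates are hypothetical, and the text does not say whether a fixed or adaptive kernel is used.

```python
import math


def gaussian_density_map(height, width, head_points, sigma=1.5, radius=4):
    """Place one Gaussian per annotated head centre (row, col).

    Each Gaussian is renormalised over the part of its window that lies
    inside the image, so every head contributes a total mass of exactly 1
    and the map integrates to the number of annotated heads.
    """
    density = [[0.0] * width for _ in range(height)]
    for cy, cx in head_points:
        weights = {}
        for y in range(max(0, cy - radius), min(height, cy + radius + 1)):
            for x in range(max(0, cx - radius), min(width, cx + radius + 1)):
                dist_sq = (y - cy) ** 2 + (x - cx) ** 2
                weights[(y, x)] = math.exp(-dist_sq / (2 * sigma ** 2))
        total = sum(weights.values())
        for (y, x), weight in weights.items():
            density[y][x] += weight / total
    return density


# Three hypothetical head annotations in a 16 x 16 image, one at a corner.
heads = [(2, 3), (10, 12), (0, 0)]
density_map = gaussian_density_map(16, 16, heads)
total_count = sum(sum(row) for row in density_map)
print(round(total_count, 6))
```

Renormalising inside the image border is a design choice of this sketch: without it, heads near the edge would lose part of their mass and the integrated count would fall below the annotated count.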

When the classification model is trained, the two training sets are input into the classifier of the model for training, so that the classifier learns to classify an input image block, predict the label of the image block, and send the image block to the corresponding dilated convolution network.

Warping the crowd image comprises: uniformly stretching outward the region whose distance from the lens exceeds a threshold, thereby enlarging the heads in that region; uniformly contracting inward the region whose distance from the lens is below the threshold, thereby shrinking the heads in that region; and generating a corresponding ground-truth density map.

The first dilated convolution network and the second dilated convolution network in the counting model are pre-trained, and the loss function is defined as follows:

where N is the number of training samples, I_i is the i-th input image block, w denotes the network parameters, F(I_i; w) denotes the estimated density map, and G_i denotes the ground-truth density map.

The classifier of the classification model is based on the VGG19 network: it uses the first 16 convolutional layers, followed by a global average pooling layer, two fully connected layers, and finally a softmax layer.

Example 2

The embodiment 2 of the invention provides a crowd counting method based on self-adaptive classification, which comprises the following steps:

acquiring a crowd image, warping the image to reduce the scale difference between heads in the scene, generating a corresponding density map, and dividing the image and its density map into four non-overlapping image blocks of approximately equal size;

building a network model that mainly comprises a classification predictor and two parallel branch networks, wherein the classification predictor predicts the branch network to which an input image should be sent, and the two parallel branch networks regress density maps of different scales;

pre-training the two branch networks, then dividing the training set into two groups according to how well each branch network regresses each image, and training each branch network on its own group so that it predicts images of its corresponding scale better;

training the classification predictor by inputting the two labeled training sets into it, so that the classifier accurately predicts the branch network to which an image block should be sent, and the density map and the number of people are predicted more accurately;

inputting the crowd image to be tested into the trained model, where the classification predictor sends the image to the correct branch network, the branch network generates an estimated density map, and the final predicted number of people is obtained by integration.

Example 3

As shown in fig. 1, embodiment 3 of the present invention provides a crowd counting method based on adaptive classification, which includes the following specific steps:

Step one: A crowd image is acquired and warped: the region far from the lens is uniformly stretched outward, enlarging the heads there, and the region near the lens is uniformly contracted inward, shrinking the heads there, which effectively reduces the scale difference between heads. A corresponding ground-truth density map is then generated. The image and its density map are divided into four non-overlapping image blocks of approximately equal size, so that the crowd scene within each image block can be regarded as approximately uniform in scale, density and other respects.
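The division into four non-overlapping blocks can be sketched as below. This is a minimal illustration in plain Python on a nested-list grid; the function name `split_into_quadrants` and the toy values are assumptions. Splitting at the midpoints gives blocks of approximately equal size when a dimension is odd, and because the blocks do not overlap, the block counts add up to the count of the whole map.

```python
def split_into_quadrants(grid):
    """Split a 2-D grid into four non-overlapping blocks.

    Returns [top-left, top-right, bottom-left, bottom-right]. The same
    split would be applied to an image and to its density map so that
    each image block stays aligned with its density block.
    """
    mid_row = len(grid) // 2
    mid_col = len(grid[0]) // 2
    top, bottom = grid[:mid_row], grid[mid_row:]
    return [
        [row[:mid_col] for row in top],
        [row[mid_col:] for row in top],
        [row[:mid_col] for row in bottom],
        [row[mid_col:] for row in bottom],
    ]


# Toy 4 x 5 density map with hypothetical values.
density = [[0.1 * (r + c) for c in range(5)] for r in range(4)]
blocks = split_into_quadrants(density)

whole_count = sum(sum(row) for row in density)
block_counts = [sum(sum(row) for row in block) for block in blocks]
print(whole_count, sum(block_counts))
```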

Step two: a network model is built, and the network model mainly comprises a classification predictor and two parallel branch networks.

The structure of the classification predictor is shown in Fig. 2. It is an improvement on the VGG19 network structure: the first 16 convolutional layers are used, a global average pooling layer is added to reduce data redundancy, two fully connected layers follow, and a softmax layer comes last. The two output categories correspond to the two branch networks: for an input image, the classification predictor outputs the label of the branch network to which the image should be sent.

The structure of the branch networks is shown in Fig. 4. Dilated (hole) convolution is incorporated, and two parallel branch networks with different receptive fields are designed for regression of the density map. The principle of dilated convolution is shown in Fig. 3: it has fewer parameters than a standard convolution kernel with a receptive field of the same size, and a larger receptive field than a standard convolution kernel with the same number of parameters.
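The trade-off described above can be made concrete with a little arithmetic. A k x k kernel with dilation rate d covers the same extent as a standard kernel of size k + (k - 1)(d - 1), while keeping only k x k weights per channel pair. The sketch below is plain Python; the function names and the channel counts (64 in, 64 out) are illustrative assumptions, not values from the patent.

```python
def effective_kernel_size(kernel_size, dilation):
    """Side length of the area covered by a dilated kernel."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


def conv_weight_count(kernel_size, in_channels, out_channels):
    """Number of weights in a 2-D convolution layer (bias excluded)."""
    return kernel_size * kernel_size * in_channels * out_channels


# A 3 x 3 kernel with dilation rate 2 covers a 5 x 5 area.
dilated_extent = effective_kernel_size(3, 2)

# Weights needed by the dilated 3 x 3 layer versus a standard 5 x 5 layer
# that covers the same area (64 input and 64 output channels assumed).
dilated_weights = conv_weight_count(3, 64, 64)
standard_weights = conv_weight_count(dilated_extent, 64, 64)

print(dilated_extent, dilated_weights, standard_weights)
```

With these assumed channel counts the dilated layer needs 36864 weights against 102400 for the standard layer covering the same 5 x 5 area, which is the parameter saving the description refers to.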

In this embodiment 3, two branches are designed for two reasons. On the one hand, although warping the image greatly reduces the scale difference between heads, some difference remains, so two columns with different receptive fields are used to handle images of different scales. On the other hand, a structure with three or more columns is not adopted because, once the image has been warped and divided into quarters, the scale difference between heads is small enough that two columns meet the feature extraction requirement while avoiding a bloated network structure. In the conv a-b-c notation in the figure, a, b and c denote the convolution kernel size, the number of output channels and the dilation rate, respectively.

Step three: First, the two parallel branch networks in the network model are pre-trained, with the loss function defined as follows:

After the two branch networks have been fitted, every image block is input into each of the two branch networks in turn. Each network outputs an estimated density map, which is integrated to obtain an estimated number of people. For each image block, the absolute error between the estimated number and the actual number of people is computed on both branches, and the network with the smaller absolute error is taken as the label of that image block. Once all training image blocks have been tested, the initial training set is divided into two groups, and the image blocks in each group share the same label.
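The labelling and grouping procedure just described can be sketched as follows. This is a minimal illustration in plain Python that works on already integrated counts; the function names, the label values 1 and 2 (standing for the two branch networks), the tie-breaking rule in favour of the first network, and the sample numbers are all assumptions made for the example.

```python
def assign_branch_labels(actual_counts, net1_counts, net2_counts):
    """Label each image block with the branch whose count error is smaller.

    Label 1 stands for the first branch network, label 2 for the second.
    Ties go to the first network (an assumption; the text does not say).
    """
    labels = []
    for actual, pred1, pred2 in zip(actual_counts, net1_counts, net2_counts):
        error1 = abs(pred1 - actual)
        error2 = abs(pred2 - actual)
        labels.append(1 if error1 <= error2 else 2)
    return labels


def group_by_label(labels):
    """Return the block indices assigned to each branch network."""
    groups = {1: [], 2: []}
    for index, label in enumerate(labels):
        groups[label].append(index)
    return groups


# Hypothetical counts for four image blocks.
actual = [10, 40, 25, 7]
net1 = [12, 30, 25, 9]   # absolute errors: 2, 10, 0, 2
net2 = [9, 38, 20, 9]    # absolute errors: 1, 2, 5, 2

labels = assign_branch_labels(actual, net1, net2)
groups = group_by_label(labels)
print(labels, groups)
```

Each group would then serve as the training set for its own branch network, and the labels as the classification targets for the classifier.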

Step four: The two parallel branch networks in the network model are then trained in a targeted manner. Each of the two groups of training image blocks is used to train the branch network corresponding to its label, so that each branch network fits its own data set better and predicts images of its corresponding scale more accurately.

Step five: The two labeled training sets are input into the classifier of the model for training, so that the classifier learns to classify an input image block, predict the label to which it belongs (that is, decide whether it is input to Net1 or Net2), and send it to the correct branch network, so that the density map and the number of people are predicted more accurately.

Step six: After network training is complete, the model can be tested. When an image block is input into the model, the classifier predicts the branch network label of the block, the block is sent to the corresponding branch network for processing, a density map is output, and the final estimated number of people is obtained by integrating the density map.
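The test-time flow of step six can be sketched as a simple routing function. This is plain Python with stand-in callables in place of the trained classifier and branch networks; `toy_classifier`, `toy_net1`, `toy_net2` and their behaviour are purely hypothetical placeholders, and summing the block counts to obtain a count for a whole image is an assumption of this sketch.

```python
def count_block(block, classifier, branches):
    """Route one image block to the branch chosen by the classifier.

    Only the selected branch runs; its density map is summed to a count.
    Returns the chosen label together with the estimated count.
    """
    label = classifier(block)
    density_map = branches[label](block)
    return label, sum(sum(row) for row in density_map)


# Stand-ins for the trained models (hypothetical behaviour).
def toy_classifier(block):
    mean = sum(sum(row) for row in block) / (len(block) * len(block[0]))
    return 1 if mean < 0.5 else 2


def toy_net1(block):
    return [[0.1 * value for value in row] for row in block]


def toy_net2(block):
    return [[0.2 * value for value in row] for row in block]


branches = {1: toy_net1, 2: toy_net2}
blocks = [
    [[0.0, 0.0], [1.0, 0.0]],   # mean 0.25, routed to branch 1
    [[1.0, 1.0], [1.0, 0.0]],   # mean 0.75, routed to branch 2
]

results = [count_block(block, toy_classifier, branches) for block in blocks]
routed_labels = [label for label, _ in results]
image_count = sum(count for _, count in results)
print(routed_labels, round(image_count, 6))
```

Because only one branch is evaluated per block, the cost per block is that of the classifier plus a single branch, rather than both branches plus a fusion stage.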

Example 4

Embodiment 4 of the present invention provides a computer device, including a memory and a processor, where the processor and the memory are in communication with each other, the memory stores program instructions executable by the processor, and the processor invokes the program instructions to execute a crowd counting method based on adaptive classification, and the method includes:

Example 5

Embodiment 5 of the present invention provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements a crowd counting method based on adaptive classification, the method comprising:

In summary, in the crowd counting method and system based on an adaptive-classification dilated convolution network provided by the embodiments of the invention, the original image is warped so that regions where heads appear large are uniformly compressed and regions where heads appear small are uniformly stretched, which reduces the scale difference between heads across the whole image. The image is divided evenly into four non-overlapping parts so that the head scale within each image block can be regarded as approximately equal, which reduces the influence of scale on the counting result. The adaptive classification predictor predicts the label of each image block and sends each block to the appropriate network branch, so that the image blocks are counted accurately. The two parallel networks do not work simultaneously: after the classifier predicts the label of an image block, only the network corresponding to that label estimates the density map of the block, so the density map of the corresponding scale is predicted more accurately. The branches use dilated convolution, which has fewer parameters than a standard convolution kernel with a receptive field of the same size and a larger receptive field than a standard convolution kernel with the same number of parameters, so that the amount of computation is greatly reduced and the efficiency of crowd counting is improved.

It will be appreciated by those skilled in the art that embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

The foregoing describes only preferred embodiments of the present disclosure and is not intended to limit it; those skilled in the art may make various modifications and changes to the present disclosure. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present disclosure should be included in the protection scope of the present disclosure.

While the foregoing embodiments of the present disclosure have been described in conjunction with the accompanying drawings, it is not intended to limit the scope of the disclosure, and it should be understood that, based on the technical solutions disclosed in the present disclosure, various modifications or variations may be made by those skilled in the art without requiring any inventive effort, and are intended to be included in the scope of the present disclosure.

Claims

1. A crowd counting method based on adaptive classification, characterized by comprising the following steps:

inputting the crowd images into corresponding trained counting models according to the categories to be identified of the crowd images to obtain crowd density maps, and carrying out integral calculation on the crowd density maps to obtain the number of people in the crowd images;

training of the counting model comprises:

inputting every image block in turn into a first dilated convolution network and a second dilated convolution network to obtain the corresponding predicted density maps, and obtaining the predicted number of people by integration; the image blocks do not overlap, and the crowd scene in each image block can be regarded as approximately uniform in scale and density;

training each of the two resulting training image sets on the dilated convolution network corresponding to its label, so that each network fits its own data set and predicts the number of people in images of its corresponding scale;

when the classification model is trained, the two training sets are input into the classifier of the model for training, so that the classifier learns to classify an input image block, predict the label of the image block, and send the image block to the corresponding dilated convolution network;

2. The crowd counting method based on adaptive classification of claim 1, wherein: the counting models are divided into a first counting model and a second counting model according to the scale difference between heads in the crowd image; the first counting model is trained on a first training set, each item of which comprises a photograph containing a crowd and a label indicating that the first counting model yields the smallest absolute difference between the crowd density predicted for the photograph and the actual crowd density in the photograph; the second counting model is trained on a second training set, each item of which comprises a photograph containing a crowd and a label indicating that the second counting model yields the smallest absolute difference between the crowd density predicted for the photograph and the actual crowd density in the photograph.

3. The crowd counting method based on adaptive classification of claim 1, wherein the first dilated convolution network and the second dilated convolution network in the counting model are pre-trained with a loss function defined as:

4. The crowd counting method based on adaptive classification of claim 3, wherein the classifier of the classification model is based on the VGG19 network and uses its first 16 convolutional layers, followed by a global average pooling layer, two fully connected layers, and finally a softmax layer.

5. A crowd counting system based on adaptive classification, comprising:

the acquisition module is used for acquiring crowd images;

the calculation module is used for inputting the crowd images into the corresponding trained counting models according to the categories to be identified of the crowd images to obtain crowd density maps, and carrying out integral calculation on the crowd density maps to obtain the number of people in the crowd images;

training of the counting model comprises:

dividing each crowd image and a corresponding real density image into 4 image blocks with equal size; the image blocks are not overlapped, and crowd scenes in each image block can be approximately regarded as consistent in scale and density;

6. A computer device comprising a memory and a processor, the processor and the memory in communication with each other, the memory storing program instructions executable by the processor, characterized in that: the processor invoking the program instructions to perform the method of any of claims 1-4.

7. A computer-readable storage medium storing a computer program, characterized in that: the computer program implementing the method according to any of claims 1-4 when executed by a processor.