WO2016183766A1 - Method and apparatus for generating predictive models - Google Patents
- Publication number
- WO2016183766A1 (PCT/CN2015/079178)
- Authority
- WO
- WIPO (PCT)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/50—Context or environment of the image
- G06V20/52—Surveillance or monitoring of activities, e.g. for recognising suspicious objects
- G06V20/53—Recognition of crowd images, e.g. recognition of crowd congestion
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/44—Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
- G06V10/443—Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
- G06V10/449—Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
- G06V10/451—Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters with interaction between the filter responses, e.g. cortical complex cells
- G06V10/454—Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/10—Character recognition
- G06V30/19—Recognition using electronic means
- G06V30/192—Recognition using electronic means using simultaneous comparisons or correlations of the image signals with a plurality of references
- G06V30/194—References adjustable by an adaptive method, e.g. learning
Definitions
- the present application relates to an apparatus and a method for generating a predictive model to predict a crowd density distribution and counts of persons in image frames.
- Counting crowd pedestrians in videos draws a lot of attention because of the intense demand for it in video surveillance, and it is especially important for metropolis security.
- Crowd counting is a challenging task due to severe occlusions, scene perspective distortions and diverse crowd distributions. Since pedestrian detection and tracking have difficulty when used in crowd scenes, most state-of-the-art methods are regression based and the goal is to learn a mapping between low-level features and crowd counts. However, these works are scene-specific, i.e., a crowd counting model learned for a particular scene can only be applied to the same scene. Given an unseen scene or a changed scene layout, the model has to be re-trained with new annotations.
- Counting by global regression ignores spatial information of pedestrians.
- Lempitsky et al. introduced an object counting method through pixel-level object density map regression. Following this work, Fiaschi et al. used random forest to regress the object density and improve training efficiency.
- Another advantage of density regression based methods is that they are able to estimate object counts in any region of an image. Taking this advantage, an interactive object counting system was introduced, which visualized region counts to help users determine the relevance feedback efficiently. Rodriguez made use of density map estimation to improve head detection results. These methods are scene-specific and not applicable to cross-scene counting.
- the disclosures address the problem of crowd density and count estimation, the aim of which is to automatically estimate the density map and the number/count of people in a given surveillance video frame.
- the present application proposes a cross-scene density and count estimation system, which is capable of estimating both the density map and people counts for any target scene, even if the scene does not exist in the training set.
- an apparatus for generating a predictive model to predict the crowd density map and counts, comprising a density map creation unit, a CNN generation unit, a similar data retrieval unit and a model fine-tuning unit.
- the density map creation unit is configured to approximate the perspective map for each training scene from a training set (with pedestrian head annotations labeling the head position of every person in the region of interest (ROI)) to create, based on the labels and the perspective map, ground-truth density maps and counts on the training set;
- the density map represents the crowd distribution for every frame and the integration over the density map is equal to the total number of pedestrians.
- the CNN generation unit is configured to construct and initialize a crowd convolutional neural network (CNN), and to train the CNN by inputting crowd patches sampled from the training set, together with their corresponding ground-truth density maps and counts, to the CNN.
- the similar data retrieval unit is configured to receive sample frames from the target scene and the samples from training set with the ground-truth density maps and counts created by the unit, and to retrieve the similar data from the training set for each target scene to overcome the scene gaps.
- the model fine-tuning unit is configured to receive the retrieved similar data and construct a second CNN, wherein the second CNN is initialized by using the trained first CNN, and the unit is further configured to fine-tune the initialized second CNN with the similar data to make the second CNN capable of predicting the density map and the pedestrian count in the region of interest of the video frame to be detected.
- method for generating a predictive model to predict a crowd density distribution and counts of persons in image frames comprising:
- an apparatus for generating a predictive model to predict a crowd density distribution and counts of persons in image frames comprising:
- a CNN training unit for training a CNN by inputting one or more crowd patches from frames in a training set, each of the crowd patches having a predetermined ground-truth density distribution and counts of persons in the inputted crowd patches;
- a similar data retrieval unit for sampling frames from a target scene image set, receiving the training images from the training set that have the determined ground-truth density distributions and counts, and retrieving similar image data from the received training frames for each of the sampled target image frames to overcome scene gaps between the target scene image set and the training images;
- a model fine-tuning unit for fine-tuning the CNN by inputting the similar image data to the CNN so as to determine a predictive model for predicting the crowd density map and counts of persons in image frames.
- a system for generating a predictive model to predict a crowd density distribution and counts of persons in image frames comprising:
- a processor electrically coupled to the memory to execute the executable components to perform operations of the system, wherein, the executable components comprise:
- a CNN training component for training a CNN by inputting one or more crowd patches from frames in a training set, each of the crowd patches having a predetermined ground-truth density distribution and counts of persons in the inputted crowd patches;
- a similar data retrieval component for sampling frames from a target scene image set, receiving the training images from the training set that have the determined ground-truth density distributions and counts, and retrieving similar image data from the received training frames for each of the sampled target image frames to overcome scene gaps between the target scene image set and the training images;
- a model fine-tuning component for fine-tuning the CNN by inputting the similar image data to the CNN so as to determine a predictive model for predicting the crowd density map and counts of persons in image frames.
- The multi-task system can estimate both crowd density maps and counts together.
- the count number can be calculated through integration over the density map.
- the two related tasks can also help each other to obtain a better solution for our training model.
- Fig. 1 is a schematic diagram illustrating a block view of an apparatus 1000 for generating a predictive model to predict the crowd density map and counts according to one embodiment of the present application.
- Fig. 2 is a schematic diagram illustrating a flow chart for the apparatus 1000 generating a predictive model to predict a crowd density distribution and counts of persons in image frames according to one embodiment of the present application.
- Fig. 3 is a schematic diagram illustrating a flow process chart for the density map creation unit 10 according to one embodiment of the present application.
- Fig. 4 is a schematic diagram illustrating a flow process chart for the CNN training unit according to one embodiment of the present application.
- Fig. 5 is a schematic diagram illustrating overview of the crowd CNN model with switchable objectives according to one embodiment of the present application.
- Fig. 6 is a schematic diagram illustrating a flow chart for similar data retrieval according to another embodiment of the present application.
- Fig. 7 is a schematic diagram illustrating a system for generating a predictive model according to one embodiment of the present application, in which the functions of the present invention are carried out by the software.
- Fig. 1 is a schematic diagram illustrating a block view of an apparatus 1000 for generating a predictive model to predict the crowd density map and counts according to one embodiment of the present application.
- the apparatus 1000 may comprise a density map creation unit 10, a CNN generation unit 20, a similar data retrieval unit 30 and a model fine-tuning unit 40.
- Fig. 2 is a general schematic diagram illustrating a flow process chart 2000 for the apparatus 1000 according to one embodiment of the present application.
- the ground-truth density map creation unit 10 operates to select image patches from one or more training image frames in the training set; and determine the ground-truth crowd distribution in the selected patches and the ground-truth total counts of pedestrians in the selected patches.
- the CNN training unit 20 operates to train a CNN by inputting one or more crowd patches from frames in a training set, each of the crowd patches having a predetermined ground-truth density distribution and counts of persons in the inputted crowd patches.
- the similar data retrieval unit 30 operates to sample frames from a target scene image set, to receive the training images from the training set that have the determined ground-truth density distributions and counts, and to retrieve similar image data from the received training frames for each of the sampled target image frames to overcome scene gaps between the target scene image set and the training images.
- the model fine-tuning unit 40 operates to fine-tune the CNN by inputting the similar image data to the CNN so as to determine a predictive model for predicting the crowd density map and counts of persons in image frames. The cooperation of the density map creation unit 10, the CNN generation unit 20, the similar data retrieval unit 30 and the model fine-tuning unit 40 will be discussed below in detail.
- the initial input to the apparatus 1000 is a training set containing a number of video frames captured from various surveillance cameras with pedestrian head labels.
- the density map creation unit 10 operates to output a density map and counts for each of the video frames based on the input training set.
- Fig. 3 is a schematic diagram illustrating a flow process chart for the density map creation unit 10 according to one embodiment of the present application.
- the density map creation unit 10 operates to approximate a perspective map/distribution for each training scene/frame from the training set.
- the pedestrian heads are labeled to indicate the head position of each person in the region of interest of each training frame. With the labeled head positions, the pedestrians' spatial locations and human body shapes can be located in each frame.
- the ground-truth density map/distribution is created based on the pedestrians' spatial locations, human body shapes and the perspective distortion of the images, so as to determine a ground-truth density for the pedestrians/crowds in each frame and to estimate counts of persons in the crowds in each frame of the training set.
- the ground-truth density map/distribution represents a crowd distribution in each frame, and the integration over the density map/distribution is equal to the total number of pedestrians.
- the main objective for the crowd CNN model to be discussed later is to learn a mapping F: X → D, where X is the set of low-level features extracted from training images and D is the crowd density map/distribution of the image.
- the density map/distribution is created based on the pedestrians' spatial locations, human body shapes and the perspective distortion of the images. Patches randomly selected from the training images are treated as training samples, and the density maps/distributions of the corresponding patches are treated as the ground truth for the crowd CNN model, which will be further discussed later.
- the total crowd number in a selected training patch is calculated through integration over the density map/distribution. Note that the total number is generally a decimal, not an integer.
- the density map regression ground truth has been defined as a sum of Gaussian kernels centered on the locations of objects.
- This kind of density map/distribution is suitable for characterizing the density distribution of circle-like objects such as cells and bacteria.
- this assumption may fail when it comes to the pedestrian crowd, where cameras generally do not have a bird's-eye view.
- An example of pedestrians in an ordinary surveillance camera shows three visible characteristics: 1) pedestrian images in the surveillance videos have different scales due to perspective distortion; 2) the shapes of pedestrians are more similar to ellipses than circles; 3) due to severe occlusions, heads and shoulders are the main cues to judge whether there exists a pedestrian at each position.
- the body parts of pedestrians are not reliable for human annotation. Taking these characteristics into account, the crowd density map/distribution is created by the combination of several distributions with perspective normalization.
- Perspective normalization is necessary to estimate the pedestrian scales. For each scene, several adult pedestrians are randomly selected and then labeled from head to toe. Assuming that the mean height of adults is 175 cm (for example), the perspective map M can be approximated through a linear regression.
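As an illustrative sketch (not part of the patent text), the perspective-map approximation could be implemented as a linear regression over a few head-to-toe labels. The 1.75 m mean adult height follows the example above; fitting pixels-per-meter as a linear function of the image row is an assumption about the regression's exact form:

```python
import numpy as np

def approximate_perspective_map(head_rows, toe_rows, frame_height, mean_height_m=1.75):
    """Approximate a per-row perspective map M(y) in pixels per meter.

    head_rows/toe_rows: image-row coordinates of a few labeled adults'
    heads and toes. Assuming each labeled adult is ~1.75 m tall, the
    pixels-per-meter value at a person's position is
    (toe_row - head_row) / mean_height_m; it is then fitted as a linear
    function of the toe row (a hypothetical simplification of the
    patent's linear regression).
    """
    head_rows = np.asarray(head_rows, dtype=float)
    toe_rows = np.asarray(toe_rows, dtype=float)
    ppm = (toe_rows - head_rows) / mean_height_m   # pixels per meter per sample
    slope, intercept = np.polyfit(toe_rows, ppm, deg=1)  # linear regression
    rows = np.arange(frame_height)
    return slope * rows + intercept                # M(y) for every image row
```

The resulting map lets later steps scale patch sizes and body-kernel widths with the row of each labeled head.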
- the crowd density map/distribution is created by rule of:
- D (p) = Σ_P 1/‖Z‖ (N_h (p; P_h, σ_h) + N_b (p; P_b, Σ)), summed over all labeled pedestrians P    (1)
- the crowd density distribution kernel contains two terms, a normalized 2D Gaussian kernel N_h as a head part and a bivariate normal distribution N_b as a body part.
- P_b is the position of the pedestrian body, estimated by the head position and the perspective value.
- the whole distribution is normalized by Z so that the kernel of each pedestrian integrates to 1.
- a body shape density distribution or kernel (hereinafter, “kernel”) as described in formula (1) is determined for each labeled person. All body shape kernels for all the labeled persons are combined (overlapped) to form the ground-truth density map/distribution for each frame. The larger the values at locations in the ground-truth density map/distribution, the higher the crowd density at those locations.
- since the normalized value for each body shape kernel is equal to 1, the count of the persons in the crowds is equal to the sum of all the values of the body shape kernels in the ground-truth density map/distribution.
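The per-person kernels and the integral-equals-count property can be sketched as follows. The head/body kernel widths and the body offset below the head are illustrative assumptions, not values taken from the patent; only the structure (head kernel plus elongated body kernel, normalized to sum to 1 per person) follows the description above:

```python
import numpy as np

def gaussian2d(shape, center, sigma_y, sigma_x):
    """Unnormalized anisotropic 2D Gaussian evaluated on an image grid."""
    ys, xs = np.mgrid[0:shape[0], 0:shape[1]]
    cy, cx = center
    return np.exp(-((ys - cy) ** 2 / (2 * sigma_y ** 2)
                    + (xs - cx) ** 2 / (2 * sigma_x ** 2)))

def ground_truth_density(shape, head_positions, perspective):
    """Sketch of the ground-truth density map: for each labeled head,
    combine a head kernel and an elongated body kernel placed below the
    head, with offsets and widths scaled by the perspective value at
    that row (the 0.1/0.35/0.15/0.5 factors are illustrative)."""
    density = np.zeros(shape, dtype=float)
    for (hy, hx) in head_positions:
        ppm = perspective[int(hy)]            # pixels per meter at the head row
        head = gaussian2d(shape, (hy, hx), 0.1 * ppm, 0.1 * ppm)
        body_center = (hy + 0.5 * ppm, hx)    # body roughly half a person below
        body = gaussian2d(shape, body_center, 0.35 * ppm, 0.15 * ppm)
        kernel = head + body
        density += kernel / kernel.sum()      # each person contributes exactly 1
    return density
```

Because each kernel is normalized to sum to 1, summing (integrating) the map recovers the pedestrian count, which is how the ground-truth counts are derived.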
- the CNN generation unit 20 is configured to construct and initialize a first crowd convolutional neural network (CNN) .
- the generation unit 20 operates to retrieve/sample crowd patches from the frames in the training set, and to obtain the corresponding ground-truth density maps and numbers of persons in the sampled crowd patches as determined by the unit 10. The generation unit 20 then inputs the crowd patches sampled from the training set, with their corresponding ground-truth density maps/distributions and numbers of persons as target objectives, into the CNN, so as to train the CNN.
- CNN: first crowd convolutional neural network
- Fig. 4 is a schematic diagram illustrating a flow chart for the process 4000 for generating and training the CNN according to one embodiment of the present application.
- the process 4000 samples one or more crowd patches from the frames in the training set, and obtains the corresponding ground-truth density distributions/maps and numbers of persons/crowd in the sampled crowd patches.
- the input is the image patches cropped from training images.
- the size of each patch at different locations is chosen according to the perspective value of its center pixel.
- each patch may be set to cover a 3-meter by 3-meter square in the actual scene.
- the patches are warped to 72 by 72 pixels (for example) as the input of the crowd CNN model generated in step s402 below.
- in step s402, the process 4000 randomly initializes, based on a Gaussian random distribution, a crowd convolutional neural network.
- An overview of the crowd CNN model with switchable objectives is shown in Figure 5.
- the crowd CNN model 500 contains three convolution layers (conv1-conv3) and three fully connected layers (fc4, fc5 and fc6 or fc7).
- conv1 has 32 filters of size 7×7×3;
- conv2 has 32 filters of size 7×7×32;
- the last convolution layer (conv3) has 64 filters of size 5×5×32.
- max pooling layers with a 2×2 kernel size are used after conv1 and conv2.
- a rectified linear unit (ReLU), which is not shown in Fig. 5, is the activation function applied after every convolutional layer and fully connected layer. It shall be appreciated that the numbers of filters and layers are described herein only as examples for purposes of illustration; the present application is not limited to these specific numbers, and other numbers would be acceptable.
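As a quick sanity check of the spatial sizes (a sketch, not patent text): with 'same'-style padding for the convolutions (an assumption, since the padding is not stated) and stride-2 pooling after conv1 and conv2, a 72×72 input produces the 18×18 density-map output described for this model:

```python
def feature_map_size(size, layers):
    """Track the spatial size of a square feature map through a stack of
    conv/pool layers; each layer is (name, kernel, stride, padding)."""
    for _name, k, s, p in layers:
        size = (size + 2 * p - k) // s + 1
    return size

# Assumed hyper-parameters reproducing the 72 -> 18 downsampling:
layers = [
    ("conv1", 7, 1, 3),   # 32 filters, 7x7, 'same' padding
    ("pool1", 2, 2, 0),   # 2x2 max pooling, stride 2
    ("conv2", 7, 1, 3),   # 32 filters, 7x7, 'same' padding
    ("pool2", 2, 2, 0),   # 2x2 max pooling, stride 2
    ("conv3", 5, 1, 2),   # 64 filters, 5x5, 'same' padding
]
```

Only the two pooling layers change the spatial size under these assumptions: 72 → 36 → 18, matching the 18×18 ground-truth maps used for regression.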
- the process 4000 learns a mapping from the crowd patches to the density maps/distributions, for example, by using mini-batch gradient descent and back-propagation until the density maps/distributions converge to the ground-truth density/distribution as created by the ground-truth density map creation unit 10.
- the process 4000 then switches the objective and learns the mapping from the crowd patches to the counts until the learned counts converge to the counts estimated by the ground-truth density map creation unit 10.
- the steps s403-s405 will be discussed in detail.
- the main task for the crowd CNN model 500 is to estimate the crowd density map/distribution of the input patches.
- the output density map/distribution is down-sampled to 18×18. Therefore, the ground-truth density map/distribution is also down-sampled to 18×18. Since the density map/distribution contains rich local and detailed information, the CNN model 500 can benefit from learning to predict the density map/distribution and can obtain a better representation of crowd patches.
- the total count regression of the input patch is treated as the secondary task, which is calculated by integrating over the density map patch. The two tasks alternately assist each other to obtain a better solution.
- the two loss functions are defined by rule of:
- L_D (Θ) = (1/2N) Σ_{i=1}^{N} ‖F_d (X_i; Θ) − D_i‖²
- L_Y (Θ) = (1/2N) Σ_{i=1}^{N} (F_y (X_i; Θ) − Y_i)²
- L_D is the loss between the estimated density map F_d (X_i; Θ) (the output of fc6) and the ground-truth density map D_i.
- L_Y is the loss between the estimated crowd number F_y (X_i; Θ) (the output of fc7) and the ground-truth number Y_i. Euclidean distance is adopted in these two objective losses. The losses are minimized using mini-batch gradient descent and back-propagation.
- the switchable training procedure is summarized in Algorithm 1.
- L_D is set as the first objective loss to minimize, since the density map/distribution can introduce more spatial information into the CNN model; density map/distribution estimation requires the model 500 to learn a general representation for the crowds.
- the model 500 switches to minimize the objective of global count regression.
- Count regression is an easier task and its learning converges faster than the task of density map/distribution regression.
- the two objective losses should be normalized to the similar or same scales; otherwise the objective with the larger scale would be dominant in the training process.
- the scale weight of density loss can be set to 10
- the scale weight of count loss can be set to 1.
- the training loss converged after about 6 switch iterations.
- the proposed switching learning approach can achieve better performance than the widely used multi-task learning approach.
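The switchable learning procedure can be sketched with a toy linear model standing in for the crowd CNN (an illustration, not the patent's training code; the alternation schedule and learning rate are assumptions, while the 10:1 density/count loss scaling follows the weights given above):

```python
import numpy as np

def switchable_training(X, D, Y, epochs_per_switch=50, switches=6, lr=0.01,
                        w_density=10.0, w_count=1.0):
    """Toy sketch of switchable learning: alternate between minimizing a
    density loss L_D and a count loss L_Y (both Euclidean), with the
    density loss scaled by 10 and the count loss by 1. A linear model
    with two output heads stands in for the CNN (fc6/fc7)."""
    n, f = X.shape
    m = D.shape[1]
    Wd = np.zeros((f, m))          # density head (stands in for fc6)
    wy = np.zeros(f)               # count head (stands in for fc7)
    for it in range(switches):
        minimize_density = (it % 2 == 0)   # L_D is minimized first
        for _ in range(epochs_per_switch):
            if minimize_density:
                grad = X.T @ (X @ Wd - D) / n      # gradient of L_D
                Wd -= lr * w_density * grad
            else:
                grad = X.T @ (X @ wy - Y) / n      # gradient of L_Y
                wy -= lr * w_count * grad
    return Wd, wy
```

Scaling the two losses to comparable magnitudes before alternating mirrors the normalization point made above: without it, the larger-scale objective would dominate training.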
- the similar data retrieval unit 30 is configured to receive sample frames from the target scene, to receive the samples from the training set with the ground-truth density maps/distributions and counts created by the unit 10, and then to obtain the similar data from the training set for each target scene to overcome the scene gaps.
- the crowd CNN model 500 is pre-trained based on all training scene data through the proposed switchable learning process.
- each query crowd scene has its unique scene properties, such as different view angles, scales and different density distributions. These properties significantly change the appearance of crowd patches and affect the performance of the crowd CNN model 500.
- a nonparametric fine-tuning scheme is designed to adapt the pre-trained CNN model 500 to unseen target scenes.
- the retrieval task consists of two steps, candidate scenes retrieval and local patch retrieval.
- Candidate scene retrieval (step 601) .
- the view angle and the scale of a scene are the main factors affecting the appearance of crowd.
- the perspective map/distribution can indicate both the view angle and the scale.
- each input patch is normalized into the same scale, which covers a 3-meter by 3-meter square (for example) in the actual scene according to the perspective map/distribution. Therefore, the first step of the nonparametric fine-tuning method focuses on retrieving training scenes that have similar perspective maps/distributions with the target scene from all the training scenes. Those retrieved scenes are called candidate fine-tuning scenes.
- a perspective descriptor is designed to represent the view angle of each scene.
- the top (for example, 20) perspective-map-similar scenes are retrieved from the whole training dataset.
- the images in the retrieved scenes are treated as the candidate scenes for local patch retrieval.
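The candidate-scene retrieval step might be sketched as a nearest-neighbor search over perspective descriptors. Here the descriptor is assumed to be a fixed-length vector summarizing a scene's perspective map, and Euclidean distance is an assumed similarity measure (the patent does not specify one):

```python
import numpy as np

def retrieve_candidate_scenes(target_descriptor, scene_descriptors, top_k=20):
    """Rank training scenes by distance between their perspective
    descriptors and the target scene's descriptor, keeping the top_k
    most similar (20 in the example above) as candidate fine-tuning
    scenes."""
    target = np.asarray(target_descriptor, dtype=float)
    descs = np.asarray(scene_descriptors, dtype=float)
    dists = np.linalg.norm(descs - target, axis=1)  # Euclidean distance
    return np.argsort(dists)[:top_k]                # indices of candidates
```

The returned indices select which training scenes contribute patches to the second, local retrieval step.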
- the second step is to select similar patches, which have similar density distributions with those in the test scene, from candidate scenes.
- the crowd density distribution also affects the appearance pattern of crowds. A higher-density crowd has more severe occlusions, and only heads and shoulders can be observed; on the contrary, in a sparse crowd, pedestrians appear with entire body shapes. Therefore, the similar data retrieval unit 30 is configured to predict the density distribution of the target scene and retrieve similar patches that match the predicted target density distribution from the candidate scenes. For example, for a crowd scene with high densities, denser patches should be retrieved to fine-tune the pre-trained model to fit the target scene.
- y_i is the integrated count of the estimated density map/distribution for sample i.
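The local patch retrieval step can be sketched as matching the integrated candidate-patch counts y_i against the counts predicted for the target scene (a simplified stand-in for the density-distribution matching described above; the per-target neighbor count is an illustrative parameter):

```python
import numpy as np

def retrieve_similar_patches(target_counts, candidate_counts, per_target=3):
    """For each target-scene patch, pick the candidate patches whose
    integrated density counts y_i are closest to the count predicted for
    that target patch, so the retrieved set matches the target scene's
    density distribution."""
    candidates = np.asarray(candidate_counts, dtype=float)
    picked = set()
    for t in target_counts:
        diffs = np.abs(candidates - t)
        picked.update(np.argsort(diffs)[:per_target].tolist())
    return sorted(picked)
```

The union of patches selected for all target patches forms the similar data passed on to the model fine-tuning unit.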
- the model fine-tuning unit 40 is configured to receive the retrieved similar data and to fine-tune the pre-trained CNN 500 with the similar data to make the CNN 500 capable of predicting the density map/distribution and the pedestrian count in the region of interest of the video frame to be detected.
- the fine-tuned crowd CNN model achieves better performance for the target scene.
- the fine-tuning unit 40 samples the similar patches obtained from the unit 30 and inputs the obtained similar patches into the pre-trained CNN to fine-tune it, for example, by using mini-batch gradient descent and back-propagation until the density maps/distributions converge to the ground-truth density/distribution as created by the ground-truth density map creation unit 10. The fine-tuning unit 40 then switches the objective and learns the mapping from the crowd patches to the counts until the learned counts converge to the counts estimated by the ground-truth density map creation unit 10. Finally, it is determined whether the estimated density map/distribution and the count converge to the ground truth; if not, the above steps are repeated.
- the fine-tuned predictive model generated by the model fine-tuning unit 40 may receive video frames to be detected and a region of interest, and then predict an estimated density map and the pedestrian count in the region of interest.
- the present invention may be embodied as a system, method or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects, which may all generally be referred to herein as a “unit, ” “circuit, ” “module” or “system. ”
- ICs: integrated circuits
- Fig. 7 illustrates a system 7000 for generating a predictive model to predict a crowd density distribution and counts of persons in image frames.
- the system 7000 comprises a memory 3001 that stores executable components and a processor 3002, electrically coupled to the memory 3001 to execute the executable components to perform operations of the system 7000.
- the executable components may comprise: a ground-truth density map creation component 701, a CNN training component 702, a similar data retrieval component 703, and a model fine-tuning component 704.
- the ground-truth density map creation component 701 is configured for selecting image patches from one or more training image frames in the training set; and determining the ground-truth crowd distribution in the selected patches and the ground-truth total counts of pedestrians in the selected patches.
- the CNN training component 702 is configured for training a CNN by inputting one or more crowd patches from frames in a training set, each of the crowd patches having a predetermined ground-truth density distribution and counts of persons in the inputted crowd patches.
- the similar data retrieval component 703 is configured for sampling frames from a target scene image set, receiving the training images from the training set that have the determined ground-truth density distributions and counts, and retrieving similar image data from the received training frames for each of the sampled target image frames to overcome scene gaps between the target scene image set and the training images.
- the model fine-tuning component 704 is configured for fine-tuning the CNN by inputting the similar image data to the CNN 500 so as to determine a predictive model for predicting the crowd density map and counts of persons in image frames.
Landscapes
- Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Multimedia (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Computation (AREA)
- Artificial Intelligence (AREA)
- Databases & Information Systems (AREA)
- Health & Medical Sciences (AREA)
- Life Sciences & Earth Sciences (AREA)
- Biodiversity & Conservation Biology (AREA)
- Biomedical Technology (AREA)
- General Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- Image Analysis (AREA)
Abstract
Disclosed is a method for generating a predictive model to predict a crowd density distribution and counts of persons in image frames, comprising: training a CNN by inputting one or more crowd patches from frames in a training set, each of the crowd patches having a predetermined ground-truth density distribution and count of persons; sampling frames from a target scene image set and receiving the training images from the training set that have the determined ground-truth density distributions and counts; retrieving similar image data from the received training frames for each of the sampled target image frames to overcome scene gaps between the target scene image set and the training images; and fine-tuning the CNN by inputting the similar image data to the CNN so as to determine a predictive model for predicting the crowd density map and counts of persons in image frames.
Description
The present application relates to an apparatus and a method for generating a predictive model to predict a crowd density distribution and counts of persons in image frames.
Counting crowd pedestrians in videos draws a lot of attention because of the intense demand for it in video surveillance, and it is especially important for metropolis security. Crowd counting is a challenging task due to severe occlusions, scene perspective distortions and diverse crowd distributions. Since pedestrian detection and tracking have difficulty when used in crowd scenes, most state-of-the-art methods are regression based and the goal is to learn a mapping between low-level features and crowd counts. However, these works are scene-specific, i.e., a crowd counting model learned for a particular scene can only be applied to the same scene. Given an unseen scene or a changed scene layout, the model has to be re-trained with new annotations.
Many works have been proposed to count the pedestrians by detection or trajectory-clustering. But for the crowd counting problem, these methods are limited by severe occlusions between people. A number of methods tried to predict global counts by using regressors trained with low-level features. These approaches are more suitable for crowded environments and are computationally more efficient.
Counting by global regression ignores spatial information of pedestrians. Lempitsky et al. introduced an object counting method through pixel-level object density map regression. Following this work, Fiaschi et al. used random forest to regress the object density and improve training efficiency. Besides considering spatial information, another advantage of density regression based methods is that they are able to estimate object counts in any region of an image. Taking this advantage, an interactive object counting system was introduced, which visualized region counts to help users determine the relevance feedback efficiently. Rodriguez made use of density map estimation to improve head detection results. These methods are scene-specific and not applicable to cross-scene counting.
Many works introduced deep learning into various surveillance applications, such as person re-identification, pedestrian detection, tracking, crowd behavior analysis and crowd segmentation. Their success benefits from discriminative power of deep models. Sermanet et al. showed that the features extracted from deep models are more effective than hand-crafted feature for many applications. However, deep models have not yet been explored for crowd counting.
As many large-scale, well-labeled datasets have been published, nonparametric, data-driven approaches have been proposed. Such approaches scale up easily because they do not require training. They transfer labels from the training images to the test image by retrieving the most similar training images and matching them with the test image. Liu et al. proposed a nonparametric image parsing method that looks for a dense deformation field between images.
Summary
The present disclosure addresses the problem of crowd density and count estimation, the aim of which is to automatically estimate the density map and the number/count of people in a given surveillance video frame.
The present application proposes a cross-scene density and count estimation system, which is capable of estimating both the density map and the people count for any target scene, even if the scene does not appear in the training set.
In one aspect, disclosed is an apparatus for generating a predictive model to predict the crowd density map and counts, comprising a density map creation unit, a CNN generation unit, a similar data retrieval unit and a model fine-tuning unit. The density map creation unit is configured to approximate the perspective map for each training scene from a training set (with pedestrian head annotations that label the head position of every person in the region of interest (ROI)) and to create, based on the labels and the perspective map, ground-truth density maps and counts on the training set. The density map represents the crowd distribution for every frame, and the integration over the density map is equal to the total number of pedestrians. The CNN generation unit is configured to construct and initialize a crowd convolutional neural network (CNN), and to train the CNN by inputting crowd patches sampled from the training set together with their corresponding ground-truth density maps and counts. The similar data retrieval unit is configured to receive sample frames from the target scene and the samples from the training set with the ground-truth density maps and counts created by the density map creation unit, and to retrieve similar data from the training set for each target scene so as to overcome the scene gaps. The model fine-tuning unit is configured to receive the retrieved similar data and construct a second CNN, wherein the second CNN is initialized with the trained first CNN, and the unit is further configured to fine-tune the initialized second CNN with the similar data so that the second CNN is capable of predicting the density map and the pedestrian count in the region of interest for a video frame to be detected.
In an aspect of the present application, disclosed is a method for generating a predictive model to predict a crowd density distribution and counts of persons in image frames, comprising:
training a CNN by inputting one or more crowd patches from frames in a training set, each of the crowd patches having a predetermined ground-truth density distribution and counts of persons in the inputted crowd patches;
sampling frames from a target scene image set and receiving the training images from the training set that have the determined ground-truth density distribution and counts, and
retrieving similar image data from the received training frames for each of the sampled target image frames to overcome scene gaps between the target scene image set and the training images; and
fine-tuning the CNN by inputting the similar image data to the CNN so as to determine a predictive model for predicting the crowd density map and counts of persons in image frames.
In a further aspect of the present application, disclosed is an apparatus for generating a predictive model to predict a crowd density distribution and counts of persons in image frames, comprising:
a CNN training unit for training a CNN by inputting one or more crowd patches from frames in a training set, each of the crowd patches having a predetermined ground-truth density distribution and counts of persons in the inputted crowd patches;
a similar data retrieval unit for sampling frames from a target scene image set, receiving the training images from the training set that have the determined ground-truth density distribution and counts, and retrieving similar image data from the received training frames for each of the sampled target image frames to overcome scene gaps between the target scene image set and the training images; and
a model fine-tuning unit for fine-tuning the CNN by inputting the similar image data to the CNN so as to determine a predictive model for predicting the crowd density map and counts of persons in image frames.
In a further aspect of the present application, disclosed is a system for generating a predictive model to predict a crowd density distribution and counts of persons in image frames, comprising:
a memory that stores executable components; and
a processor electrically coupled to the memory to execute the executable components to perform operations of the system, wherein, the executable components comprise:
a CNN training component for training a CNN by inputting one or more crowd patches from frames in a training set, each of the crowd patches having a predetermined ground-truth density distribution and counts of persons in the inputted crowd patches;
a similar data retrieval component for sampling frames from a target scene image set, receiving the training images from the training set that have the determined ground-truth density distribution and counts, and retrieving similar image data from the received training frames for each of the sampled target image frames to overcome scene gaps between the target scene image set and the training images; and
a model fine-tuning component for fine-tuning the CNN by inputting the similar image data to the CNN so as to determine a predictive model for predicting the crowd density map and counts of persons in image frames.
According to the claimed solutions, there would be at least one of the following advantages:
·Multi-task system – it can estimate both crowd density maps and counts together. The count can be calculated through integration over the density map. The two related tasks can also help each other to obtain a better solution for the training model.
·Cross-scene capacity –the target scene requires no extra pedestrian labels in the framework for cross-scene counting.
·No crowd segmentation needed – it does not rely on crowd foreground segmentation preprocessing. Whether or not the crowd is moving, the crowd texture is captured by the model and the system can obtain a reasonable estimation result.
The following description and the annexed drawings set forth certain illustrative aspects of the disclosure. These aspects are indicative, however, of but a few of the various ways in which the principles of the disclosure may be employed. Other aspects of the disclosure will become apparent from the following detailed description of the disclosure when considered in conjunction with the drawings.
Brief Description of the Drawing
Exemplary non-limiting embodiments of the present invention are described below with reference to the attached drawings. The drawings are illustrative and generally not to an exact scale. The same or similar elements on different figures are referenced with the same reference numbers.
Fig. 1 is a schematic diagram illustrating a block view of an apparatus 1000 for generating a predictive model to predict the crowd density map and counts according to one embodiment of the present application.
Fig. 2 is a schematic diagram illustrating a flow chart for the apparatus 1000 generating a predictive model to predict a crowd density distribution and counts of persons in image frames according to one embodiment of the present application.
Fig. 3 is a schematic diagram illustrating a flow process chart for the density map creation unit 10 according to one embodiment of the present application.
Fig. 4 is a schematic diagram illustrating a flow process chart for the CNN training unit according to one embodiment of the present application.
Fig. 5 is a schematic diagram illustrating overview of the crowd CNN model with switchable objectives according to one embodiment of the present application.
Fig. 6 is a schematic diagram illustrating a flow chart for similar data retrieval according to another embodiment of the present application.
Fig. 7 is a schematic diagram illustrating a system for generating a predictive model according to one embodiment of the present application, in which the functions of the present invention are carried out by the software.
Detailed Description
Reference will now be made in detail to some specific embodiments of the invention including the best modes contemplated by the inventors for carrying out the invention. Examples of these specific embodiments are illustrated in the accompanying
drawings. While the invention is described in conjunction with these specific embodiments, it will be understood that it is not intended to limit the invention to the described embodiments. On the contrary, it is intended to cover alternatives, modifications, and equivalents as may be included within the spirit and scope of the invention as defined by the appended claims. In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention. The present invention may be practiced without some or all of these specific details. In other instances, well-known process operations have not been described in detail in order not to unnecessarily obscure the present invention.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used herein, the singular forms "a" , "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising, " when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
Fig. 1 is a schematic diagram illustrating a block view of an apparatus 1000 for generating a predictive model to predict the crowd density map and counts according to one embodiment of the present application. As shown, the apparatus 1000 may comprise a density map creation unit 10, a CNN generation unit 20, a similar data retrieval unit 30 and a model fine-tuning unit 40.
Fig. 2 is a general schematic diagram illustrating a flow process chart 2000 for the apparatus 1000 according to one embodiment of the present application. At step s201, the ground-truth density map creation unit 10 operates to select image patches from one or more training image frames in the training set; and determine the ground-truth crowd distribution in the selected patches and the ground-truth total counts of pedestrians in the selected patches. At step s202, the CNN training unit 20 operates to train a CNN by inputting one or more crowd patches from frames in a training set, each of the crowd
patches having a predetermined ground-truth density distribution and counts of persons in the inputted crowd patches. At step s203, the similar data retrieval unit 30 operates to sample frames from a target scene image set, to receive the training images from the training set that have the determined ground-truth density distribution and counts, and to retrieve similar image data from the received training frames for each of the sampled target image frames to overcome scene gaps between the target scene image set and the training images. At step s204, the model fine-tuning unit 40 operates to fine-tune the CNN by inputting the similar image data to the CNN so as to determine a predictive model for predicting the crowd density map and counts of persons in image frames. The cooperation of the density map creation unit 10, the CNN generation unit 20, the similar data retrieval unit 30 and the model fine-tuning unit 40 will be discussed below in detail.
1) Density Map Creation Unit 10
The initial input to the apparatus 1000 (i.e., the density map creation unit 10) is a training set containing a number of video frames captured from various surveillance cameras with pedestrian head labels. The density map creation unit 10 operates to output a density map and counts for each of the video frames based on the input training set.
Fig. 3 is a schematic diagram illustrating a flow process chart for the density map creation unit 10 according to one embodiment of the present application. At step s301, the density map creation unit 10 operates to approximate a perspective map/distribution for each training scene/frame from the training set. The pedestrian heads are labeled to indicate the head position of each person in the region of interest of each training frame. With the labeled head positions, the pedestrians' spatial locations and human body shapes are located in each frame. At step s302, the ground-truth density map/distribution is created based on the pedestrians' spatial locations, human body shapes and the perspective distortion of the images, so as to determine a ground-truth density for the pedestrians/crowds in each frame and to estimate counts of persons in the crowds in each frame of the training set. In particular, the ground-truth density map/distribution represents the crowd distribution in each frame, and the integration over the density map/distribution is equal to the total number of pedestrians.
To be specific, the main objective for the crowd CNN model to be discussed later is to learn a mapping F: X → D, where X is the set of low-level features extracted from training images and D is the crowd density map/distribution of the image. Assuming that the position of each pedestrian is labeled, the density map/distribution is created based on the pedestrians' spatial locations, human body shapes and the perspective distortion of the images. Patches randomly selected from the training images are treated as training samples, and the density maps/distributions of the corresponding patches are treated as the ground truth for the crowd CNN model, which will be further discussed later. As an auxiliary objective, the total crowd number in a selected training patch is calculated through integration over the density map/distribution. Note that the total number will be a decimal, not an integer.
Conventionally, the density map regression ground truth has been defined as a sum of Gaussian kernels centered on the locations of objects. This kind of density map/distribution is suitable for characterizing the density distribution of circle-like objects such as cells and bacteria. However, this assumption may fail when it comes to pedestrian crowds, where cameras are generally not in a bird's-eye view. An example of pedestrians in an ordinary surveillance camera shows three visible characteristics: 1) pedestrian images in the surveillance videos have different scales due to perspective distortion; 2) the shapes of pedestrians are more similar to ellipses than circles; 3) due to severe occlusions, heads and shoulders are the main cues to judge whether there exists a pedestrian at each position; the body parts of pedestrians are not reliable for human annotation. Taking these characteristics into account, the crowd density map/distribution is created by the combination of several distributions with perspective normalization.
Perspective normalization is necessary to estimate the pedestrian scales. For each scene, several adult pedestrians are randomly selected and labeled from head to toe. Assuming that the mean height of adults is 175 cm (for example), the perspective map M can be approximated through a linear regression. The pixel value in the perspective map M (p) denotes the number of pixels in the image representing a certain distance (for example, one meter) at that location in the actual scene. If one pedestrian is labeled with H pixels, the perspective map value at the center position of the pedestrian is M (p) = H/1.75; the perspective map is then interpolated linearly along the vertical and horizontal directions, respectively, to obtain the whole perspective map. After the perspective map/distribution and the center positions of the pedestrian heads Ph in the region of interest (ROI) are obtained, the crowd density map/distribution is created by rule of:

D (p) = Σ P∈Ph 1/‖Z‖ (Nh (p; Ph, σh) + Nb (p; Pb, Σ) )   (1)
The crowd density distribution kernel contains two terms: a normalized 2D Gaussian kernel Nh as a head part and a bivariate normal distribution Nb as a body part. Here Pb is the position of the pedestrian body, estimated from the head position and the perspective value. To best represent the pedestrian contour, the variance is set as σh = 0.2M (p) for the term Nh, and σx = 0.2M (p), σy = 0.5M (p) for the term Nb. To ensure that the integration of all density values in a density map/distribution equals the total crowd number in the original image, the whole distribution is normalized by Z.
In short, for each person with a labeled head position, a body shape density distribution or kernel (hereinafter, "kernel") as described in formula (1) is determined. All body shape kernels for all the labeled persons are combined (overlapped) to form the ground-truth density map/distribution for each frame. The bigger the values at locations in the ground-truth density map/distribution are, the higher the crowd density is at those locations. In addition, since the normalized value of each body shape kernel is equal to 1, the count of persons in the crowd equals the sum of all the values of the body shape kernels in the ground-truth density map/distribution.
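The kernel construction described above can be illustrated with a small numpy sketch. This is not part of the disclosure: the body offset of 0.5·M(p) below the head and the isotropic head variance are illustrative assumptions, with the variances following the 0.2M(p)/0.5M(p) rule stated above; each person's combined head-plus-body kernel is renormalized (the role of Z) so that the density map integrates to the pedestrian count.

```python
import numpy as np

def gaussian2d(shape, center, sx, sy):
    """Anisotropic 2D Gaussian on an image grid, normalized to integrate to 1."""
    ys, xs = np.mgrid[0:shape[0], 0:shape[1]]
    g = np.exp(-((xs - center[1]) ** 2 / (2.0 * sx ** 2)
                 + (ys - center[0]) ** 2 / (2.0 * sy ** 2)))
    return g / g.sum()

def density_map(shape, heads, M):
    """Ground-truth density map from head labels and perspective map M.

    Each person contributes a head Gaussian Nh plus an elongated body
    Gaussian Nb centered below the head (assumed offset 0.5*M(p)); the
    pair is renormalized so each person adds exactly 1 to the integral.
    """
    D = np.zeros(shape)
    for (hy, hx) in heads:
        s = 0.2 * M[hy, hx]                                    # sigma_h = 0.2 M(p)
        body_y = min(shape[0] - 1, int(hy + 0.5 * M[hy, hx]))  # hypothetical offset
        kernel = gaussian2d(shape, (hy, hx), s, s) \
               + gaussian2d(shape, (body_y, hx), s, 2.5 * s)   # sigma_y = 0.5 M(p)
        D += kernel / kernel.sum()                             # normalization by Z
    return D
```

Summing the resulting map recovers the (possibly fractional, for patches) person count, as noted in the text.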
2) CNN Generation Unit 20
The CNN generation unit 20 is configured to construct and initialize a first crowd convolutional neural network (CNN) . The generation unit 20 operates to
retrieve/sample crowd patches from the frames in the training set, and obtain the corresponding ground-truth density maps and numbers of persons in the sampled crowd patches as determined by the unit 10. Then, the generation unit 20 inputs the crowd patches sampled from the training set and their corresponding ground-truth density maps/distributions and numbers of persons as target objectives into the CNN, so as to train the CNN.
Fig. 4 is a schematic diagram illustrating a flow chart for the process 4000 for generating and training the CNN according to one embodiment of the present application.
As shown, in step s401, the process 4000 samples one or more crowd patches from the frames in the training set, and obtains the corresponding ground-truth density distributions/maps and numbers of persons/crowd in the sampled crowd patches. The input is the image patches cropped from training images. In order to obtain pedestrians at similar scales, the size of each patch at different locations is chosen according to the perspective value of its center pixel. In an example, each patch may be set to cover a 3-meter by 3-meter square in the actual scene. The patches are then warped to 72 (for example) pixels by 72 (for example) pixels as the input of the crowd CNN model generated in step s402 below.
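The perspective-driven cropping of step s401 can be sketched as follows. This is an illustrative sketch only: the function name and the nearest-neighbour warp are assumptions (any standard image resize would do); the 3-meter coverage and 72-pixel output follow the example in the text.

```python
import numpy as np

def crop_and_warp(frame, M, center, meters=3.0, out=72):
    """Crop a patch covering `meters` x `meters` in the scene, warp to out x out.

    The patch side length in pixels is taken from the perspective value
    M at the patch's center pixel (pixels per meter at that image row).
    """
    cy, cx = center
    half = int(M[cy, cx] * meters) // 2               # half side length in pixels
    y0, y1 = max(0, cy - half), min(frame.shape[0], cy + half)
    x0, x1 = max(0, cx - half), min(frame.shape[1], cx + half)
    patch = frame[y0:y1, x0:x1]
    # nearest-neighbour warp to the fixed CNN input size
    iy = (np.arange(out) * patch.shape[0] / out).astype(int)
    ix = (np.arange(out) * patch.shape[1] / out).astype(int)
    return patch[np.ix_(iy, ix)]
```

Because the crop size shrinks with distance from the camera, pedestrians in every warped patch appear at roughly the same scale.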
In step s402, the process 4000 initializes randomly, based on Gaussian random distribution, a crowd convolutional neural network. An overview of the crowd CNN model with switchable objectives is shown in Figure 5.
As shown, the crowd CNN model 500 contains 3 convolution layers (conv1–conv3) and three fully connected layers (fc4, fc5 and fc6 or fc7). Conv1 has 32 7×7×3 filters, conv2 has 32 7×7×32 filters and the last convolution layer has 64 5×5×32 filters. Max pooling layers with a 2×2 kernel size are used after conv1 and conv2. A rectified linear unit (ReLU), which is not shown in Fig. 5, is the activation function applied after every convolutional layer and fully connected layer. It shall be appreciated that the numbers of filters and the number of layers are described herein merely as an example for the purpose of illustration; the present application is not limited to these specific numbers, and other numbers would be acceptable.
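The spatial sizes through this architecture can be traced with a short sketch. One assumption is made explicit here: since the text states the output density map is 18×18 (exactly two 2× down-samplings of the 72×72 input), the convolutions are taken to use size-preserving ("same") padding.

```python
def trace_shapes(size=72):
    """Trace spatial sizes through the crowd CNN layers described above.

    Assumes 'same' padding for the convolutions, so only the two 2x2
    max-pooling layers change the spatial size: 72 -> 36 -> 18.
    """
    layers = [("conv1 7x7x3 -> 32 maps", "conv"),
              ("pool1 2x2",              "pool"),
              ("conv2 7x7x32 -> 32 maps", "conv"),
              ("pool2 2x2",              "pool"),
              ("conv3 5x5x32 -> 64 maps", "conv")]
    for name, kind in layers:
        if kind == "pool":
            size //= 2
        print(f"{name}: {size}x{size}")
    return size
```

Running this prints the per-layer sizes and returns 18, matching the 18×18 density map output discussed below.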
At step s403, the process 4000 learns a mapping from the crowd patches to the density maps/distributions, for example by using mini-batch gradient descent and back-propagation, until the density maps/distributions converge to the ground-truth density/distribution as created by the ground-truth density map creation unit 10. At step s404, the process 4000 switches the objective and learns the mapping from the crowd patches to the counts until the learned counts converge to the counts estimated by the ground-truth density map creation unit 10. At step s405, it is determined whether the estimated density map/distribution and the count converge to the ground truth; if not, steps s403-s405 are repeated. Hereinafter, the steps s403-s405 will be discussed in detail.
In the embodiment of the present application, an iterative switching process is introduced in the crowd CNN model 500 to alternately optimize the density map/distribution estimation task and the count estimation task. The main task for the crowd CNN model 500 is to estimate the crowd density map/distribution of the input patches. In the embodiment shown in Fig. 5, because two pooling layers exist in the CNN model 500, the output density map/distribution is down-sampled to 18×18. Therefore, the ground truth density map/distribution is also down-sampled to 18×18. Since the density map/distribution contains rich and abundant local and detailed information, the CNN model 500 can benefit from learning to predict the density map/distribution and can obtain a better representation of crowd patches. The total count regression of the input patch is treated as the secondary task, which is calculated by integrating over the density map patch. The two tasks alternately assist each other to obtain a better solution. The two loss functions are defined by rule of:

LD (θ) = 1/2N Σ i=1..N ‖Fd (Xi; θ) - Di‖ 2   (2)

LY (θ) = 1/2N Σ i=1..N (Fy (Xi; θ) - Yi) 2   (3)
where θ is the set of parameters of the CNN model and N is the number of training samples. LD is the loss between estimated density map Fd (Xi; θ) (the output of fc6) and the ground truth density map Di. Similarly, LY is the loss between the estimated crowd number Fy (Xi; θ) (the output of fc7) and the ground truth number Yi. Euclidean distance is adopted in these two objective losses. The loss is minimized using mini-batch gradient descent and back propagation.
The switchable training procedure is summarized in Algorithm 1. LD is set as the first objective loss to minimize, since the density map/distribution introduces more spatial information to the CNN model, and density map/distribution estimation requires the model 500 to learn a general representation for crowds. After the first objective converges, the model 500 switches to minimize the objective of global count regression. Count regression is an easier task, and its learning converges faster than the task of density map/distribution regression. Note that the two objective losses should be normalized to similar or the same scales; otherwise the objective with the larger scale would dominate the training process. In one embodiment of the present application, the scale weight of the density loss can be set to 10, and the scale weight of the count loss can be set to 1. The training loss converges after about 6 switch iterations. The proposed switching learning approach can achieve better performance than the widely used multi-task learning approach.
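The two Euclidean losses and their 10:1 scale weighting can be sketched in plain numpy as below. This is a sketch under stated assumptions, not the disclosed training code: the convergence-driven alternation of Algorithm 1 is reduced to a `phase` flag, and the 1/2N Euclidean form follows equations (2) and (3).

```python
import numpy as np

def density_loss(D_pred, D_true):
    """L_D: mean Euclidean loss between predicted and ground-truth density maps."""
    return 0.5 * np.mean([np.sum((p - t) ** 2) for p, t in zip(D_pred, D_true)])

def count_loss(y_pred, y_true):
    """L_Y: mean Euclidean loss between predicted and ground-truth counts."""
    return 0.5 * np.mean((np.asarray(y_pred) - np.asarray(y_true)) ** 2)

def switchable_objective(D_pred, D_true, y_pred, y_true, phase,
                         w_density=10.0, w_count=1.0):
    """Return the loss of the currently active objective.

    The scale weights (10 for density, 1 for count) follow the embodiment
    described above, keeping the two objectives on comparable scales.
    """
    if phase == "density":
        return w_density * density_loss(D_pred, D_true)
    return w_count * count_loss(y_pred, y_true)
```

In a full training loop, `phase` would flip from "density" to "count" and back each time the active objective converges, for roughly six switch iterations.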
3) Similar Data Retrieval Unit 30
The similar data retrieval unit 30 is configured to receive sample frames from the target scene, and receive the samples from training set with the ground-truth density maps/distributions and counts created by the unit 10, and then to obtain the similar data from the training set for each target scene to overcome the scene gaps.
The crowd CNN model 500 is pre-trained based on all training scene data through the proposed switchable learning process. However, each query crowd scene has its unique scene properties, such as different view angles, scales and different density distributions. These properties significantly change the appearance of crowd patches and affect the performance of the crowd CNN model 500. In order to bridge the distribution gap between the training and test scenes, a nonparametric fine-tuning scheme is designed to adapt the pre-trained CNN model 500 to unseen target scenes.
Given a target video from an unseen scene, samples with similar properties from the training scenes are retrieved and added to the training data to fine-tune the crowd CNN model 500. The retrieval task consists of two steps: candidate scene retrieval and local patch retrieval.
Candidate scene retrieval (step 601). The view angle and the scale of a scene are the main factors affecting the appearance of the crowd. The perspective map/distribution can indicate both the view angle and the scale. To overcome the scale gap between different scenes, each input patch is normalized to the same scale, covering a 3-meter by 3-meter square (for example) in the actual scene according to the perspective map/distribution. Therefore, the first step of the nonparametric fine-tuning method focuses on retrieving, from all the training scenes, those training scenes that have perspective maps/distributions similar to the target scene. The retrieved scenes are called candidate fine-tuning scenes. A perspective descriptor is designed to represent the view angle of each scene. Since the perspective map/distribution is linearly fitted along the y axis, its vertical gradient ΔMy = M (y) - M (y-1) may be used as the perspective descriptor. Based on this descriptor, for a target unseen scene, the top (for example, 20) perspective-map-similar scenes are retrieved from the whole training dataset. The images in the retrieved scenes are treated as the candidate scenes for local patch retrieval.
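The descriptor comparison above can be sketched as follows. The Euclidean distance between descriptors is an assumption (the text does not fix the metric); function names are illustrative.

```python
import numpy as np

def perspective_descriptor(M):
    """Vertical gradient of the perspective map, dMy = M(y) - M(y-1).

    Since M is fitted linearly along y, the gradient is (nearly) constant
    per row; averaging over columns gives one value per image row.
    """
    col = M.mean(axis=1)
    return np.diff(col)

def retrieve_candidate_scenes(target_M, train_Ms, top_k=20):
    """Rank training scenes by descriptor distance to the target; return top_k indices."""
    d_t = perspective_descriptor(target_M)
    dists = [np.linalg.norm(perspective_descriptor(M) - d_t) for M in train_Ms]
    return np.argsort(dists)[:top_k]
```

A scene with nearly the same perspective slope as the target ranks first, which is exactly the behaviour the candidate scene retrieval step relies on.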
Local patch retrieval (step 602). The second step is to select, from the candidate scenes, patches whose density distributions are similar to those in the test scene. Besides the view angle and the scale, the crowd density distribution also affects the appearance pattern of crowds. Higher-density crowds have more severe occlusions, and only heads and shoulders can be observed; on the contrary, in sparse crowds, pedestrians appear with their entire body shapes. Therefore, the similar data retrieval unit 30 is configured to predict the density distribution of the target scene and retrieve, from the candidate scenes, similar patches that match the predicted target density distribution. For example, for a crowd scene with high densities, denser patches should be retrieved to fine-tune the pre-trained model to fit the target scene.
With the pre-trained CNN model 500 as trained in the unit 20, we can roughly predict the density and the total count for every patch of the target image. It is assumed that patches with similar density map/distribution have similar output through the pre-trained model 500. Based on the prediction result, we compute a histogram of the density distribution for the target scene. Each bin is calculated by rule of:
where yi is the integrated count of the estimated density map/distribution for sample i.
Since there rarely exist scenes where more than 20 pedestrians stand in a 3-meter by 3-meter square, when yi > 20 the patch is assigned to the sixth bin, i.e. ci = 6. The density distribution of the target scene can be obtained from Equation (4). Then, patches are randomly selected from the retrieved training scenes, and the number of patches with different densities is controlled to match the density distribution of the target scene. In this way, the proposed fine-tuning method retrieves patches with similar view angles, scales and density distributions.
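The histogram-matching selection can be sketched as below. Note the exact bin rule of Equation (4) is not reproduced in the source text; this sketch assumes equal-width bins of 4 pedestrians, keeping only the stated constraint that counts above 20 collapse into bin 6.

```python
import numpy as np

def density_bin(y, bin_width=4, max_bin=6):
    """Map a patch's predicted count y to a density histogram bin (1..6).

    Hypothetical binning: the source only states that y > 20 maps to bin 6;
    equal-width bins of `bin_width` are an assumption for the lower bins.
    """
    if y > 20:
        return max_bin
    return min(max_bin - 1, int(y // bin_width) + 1)

def match_distribution(candidate_counts, target_hist, n_samples, rng=None):
    """Pick candidate patch indices so bin frequencies match target_hist.

    target_hist[b-1] is the target fraction of patches falling in bin b.
    """
    rng = rng or np.random.default_rng(0)
    bins = np.array([density_bin(y) for y in candidate_counts])
    chosen = []
    for b, frac in enumerate(target_hist, start=1):
        pool = np.flatnonzero(bins == b)
        take = min(len(pool), int(round(frac * n_samples)))
        if take > 0:
            chosen.extend(rng.choice(pool, size=take, replace=False))
    return chosen
```

Sampling per bin in proportion to the target histogram is what controls "the number of patches with different densities" mentioned above.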
4) Model Fine-Tuning Unit 40
The model fine-tuning unit 40 is configured to receive the retrieved similar data and to fine-tune the pre-trained CNN 500 with the similar data so that the CNN 500 is capable of predicting the density map/distribution and the pedestrian count in the region of interest for the video frame to be detected. The fine-tuned crowd CNN model achieves better performance for the target scene.
In one embodiment of the present application, the fine-tuning unit 40 samples the similar patches obtained from the unit 30 and inputs the obtained similar patches into the pre-trained CNN to fine-tune it, for example by using mini-batch gradient descent and back-propagation, until the density maps/distributions converge to the ground-truth density/distribution as created by the ground-truth density map creation unit 10. The fine-tuning unit 40 then switches the objective and learns the mapping from the crowd patches to the counts until the learned counts converge to the counts estimated by the ground-truth density map creation unit 10. At last, it is determined whether the estimated density map/distribution and the count converge to the ground truth; if not, the above steps are repeated.
The fine-tuned predictive model generated by the model fine-tuning unit 40 may receive video frames to be detected and a region of interest, and then predict an estimated density map and the pedestrian count in the region of interest.
As will be appreciated by one skilled in the art, the present invention may be embodied as a system, method or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment or an embodiment combining software and hardware aspects, which may all generally be referred to herein as a "unit", "circuit", "module" or "system". Much of the inventive functionality and many of the inventive principles, when implemented, are best supported with or in integrated circuits (ICs), such as a digital signal processor and software therefor, or application specific ICs. It is expected that one of ordinary skill, notwithstanding possibly significant effort and many design choices motivated by, for example, available time, current technology, and economic considerations, when guided by the concepts and principles disclosed herein, will be readily capable of generating ICs with minimal experimentation. Therefore, in the interest of brevity and minimization of any risk of obscuring the principles and concepts according to the present invention, further discussion of such software and ICs, if any, will be limited to the essentials with respect to the principles and concepts used by the preferred embodiments.
In addition, the present invention may take the form of an entirely software embodiment (including firmware, resident software, micro-code, etc.) or an embodiment combining software. Furthermore, the present invention may take the form of a computer program product embodied in any tangible medium of expression having computer-usable program code embodied in the medium. Fig. 7 illustrates a system 7000 for generating a predictive model to predict a crowd density distribution and counts of persons in image frames. The system 7000 comprises a memory 3001 that stores executable components and a processor 3002, electrically coupled to the memory 3001 to execute the executable components to perform operations of the system 7000. The executable components may comprise: a ground-truth density map creation component 701, a CNN training component 702, a similar data retrieval component 703, and a model fine-tuning component 704.
The ground-truth density map creation component 701 is configured for selecting image patches from one or more training image frames in the training set; and determining the ground-truth crowd distribution in the selected patches and the ground-truth total counts of pedestrians in the selected patches. The CNN training component 702 is configured for training a CNN by inputting one or more crowd patches from frames in a training set, each of the crowd patches having a predetermined ground-truth density distribution and counts of persons in the inputted crowd patches.
The similar data retrieval component 703 is configured for sampling frames from a target scene image set, receiving the training images from the training set that have the determined ground-truth density distribution and counts, and retrieving similar image data from the received training frames for each of the sampled target image frames to overcome scene gaps between the target scene image set and the training images.
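One way to illustrate the density-based part of this retrieval (the description also matches perspective distributions, which is omitted here) is to summarize each patch's density map as a histogram and rank candidate training patches by histogram distance. The metric and the names below are assumptions made for the sketch:

```python
import numpy as np

def density_histogram(density_map, bins=8, max_density=1.0):
    """Summarize a patch's crowd density map as a normalized histogram."""
    hist, _ = np.histogram(density_map, bins=bins, range=(0.0, max_density))
    total = hist.sum()
    return hist / total if total else hist.astype(float)

def retrieve_similar_patches(target_map, candidate_maps, top_k=2):
    """Rank candidates by L1 distance between density histograms and
    return the indices of the top_k closest patches."""
    target_hist = density_histogram(target_map)
    dists = [np.abs(density_histogram(c) - target_hist).sum()
             for c in candidate_maps]
    return list(np.argsort(dists)[:top_k])
```

The retrieved patches stand in for the "similar image data" that bridges the scene gap: only training patches whose crowd density statistics resemble the target scene are fed to the fine-tuning stage.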
The model fine-tuning component 704 is configured for fine-tuning the CNN by inputting the similar image data to the CNN 500 so as to determine a predictive model for predicting the crowd density map and counts of persons in image frames.
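Both the pre-training and the fine-tuning steps alternate two objectives: first drive the estimated density distribution toward the ground truth, then additionally match the total count. The exact loss functions are not specified in the text, so the squared-error forms below are an assumption:

```python
import numpy as np

def density_loss(pred, gt):
    """Pixel-wise mean squared error between predicted and
    ground-truth density maps (distribution objective)."""
    return float(((pred - gt) ** 2).mean())

def count_loss(pred, gt):
    """Squared error between predicted and ground-truth total counts,
    i.e. the integrals of the two maps (count objective)."""
    return float((pred.sum() - gt.sum()) ** 2)

def combined_objective(pred, gt, use_count_term):
    """Stage 1 optimizes the distribution term alone; stage 2 adds the
    count term, mirroring the two 'updating parameters until ...' steps."""
    loss = density_loss(pred, gt)
    if use_count_term:
        loss += count_loss(pred, gt)
    return loss
```

Training to convergence then corresponds to running gradient updates until each of these terms stops decreasing, first without and then with the count term.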
The functions of the components 701 to 704 are similar to those of the units 10 to 40, respectively, and thus their detailed descriptions are omitted herein.
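To show how the four components fit together, a minimal wiring sketch (purely illustrative; the class and method names are not the patent's) treats each component as a callable and runs them in the order the description gives:

```python
class PredictiveModelSystem:
    """Illustrative wiring of components 701-704: ground-truth creation,
    CNN pre-training, similar-data retrieval, then fine-tuning."""

    def __init__(self, create_ground_truth, train_cnn, retrieve_similar, fine_tune):
        self.create_ground_truth = create_ground_truth  # component 701
        self.train_cnn = train_cnn                      # component 702
        self.retrieve_similar = retrieve_similar        # component 703
        self.fine_tune = fine_tune                      # component 704

    def generate(self, training_set, target_scene_set):
        # Each stage consumes the previous stage's output.
        ground_truth = self.create_ground_truth(training_set)
        cnn = self.train_cnn(training_set, ground_truth)
        similar_data = self.retrieve_similar(target_scene_set, training_set, ground_truth)
        return self.fine_tune(cnn, similar_data)
```

The point of the sketch is the data flow: the fine-tuned model depends on retrieval output, which in turn depends on the ground-truth maps, matching the dependency order of units 10 to 40.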
Although the preferred examples of the present invention have been described, those skilled in the art can make variations or modifications to these examples upon learning of the basic inventive concept. The appended claims are intended to be construed as covering the preferred examples and all variations or modifications falling within the scope of the present invention.
Claims (22)
- A method for generating a predictive model to predict a crowd density distribution and counts of persons in image frames, comprising:
  training a CNN by inputting one or more crowd patches from frames in a training set, each of the crowd patches having a predetermined ground-truth density distribution and counts of persons in the inputted crowd patches;
  sampling frames from a target scene image set and receiving the training images from the training set that have the determined ground-truth density distribution and counts;
  retrieving similar image data from the received training frames for each of the sampled target image frames to overcome scene gaps between the target scene image set and the training images; and
  fine-tuning the CNN by inputting the similar image data to the CNN so as to determine a predictive model for predicting the crowd density map and counts of persons in image frames.
- The method of claim 1, further comprising:
  selecting image patches from one or more training image frames in the training set; and
  determining the ground-truth crowd distribution in the selected patches and the ground-truth total counts of pedestrians in the selected patches.
- The method of claim 2, wherein the determining further comprises:
  identifying each person with a labeled head position in each frame;
  determining a body shape kernel of each identified person; and
  combining all determined kernels to form the ground-truth density map/distribution for each frame, wherein the counts for the persons in the crowds are equal to the sum of all the values for the body shape kernels in the ground-truth density map/distribution.
- The method according to claim 1, wherein the training further comprises:
  initializing the CNN randomly based on a Gaussian distribution;
  sampling patches from the training images;
  estimating, by the CNN, crowd distribution in the sampled patches and a total number of pedestrians in the sampled patches;
  updating parameters for the CNN until the estimated distribution converges to the ground-truth distribution; and
  further updating parameters for the CNN until the estimated number converges to the determined ground-truth number so as to obtain a pre-trained CNN.
- The method of claim 4, wherein the retrieving similar data further comprises:
  retrieving, from the training image frames, candidate fine-tuning frame data that has a perspective distribution similar to that of the target image frames; and
  selecting, from candidate scenes, similar patches which have density distributions similar to those of the target image frames.
- The method of claim 5, wherein the fine-tuning further comprises:
  sampling patches from the similar patches;
  estimating, by the pre-trained CNN, crowd distribution in the sampled patches and a total number of pedestrians in the sampled patches;
  updating parameters for the CNN until the estimated distribution converges to the ground-truth distribution; and
  further updating parameters for the pre-trained CNN until the estimated number converges to the determined ground-truth number so as to obtain a fine-tuned CNN.
- The method of any one of claims 1-6, wherein a total count of the persons in the image frames is determined by integration over the determined ground-truth density distribution.
- The method of any one of claims 1-6, wherein the crowd distribution is created based on persons' spatial location in each image frame, human body shape in each image frame, and perspective distortion of images.
- An apparatus for generating a predictive model to predict a crowd density distribution and counts of persons in image frames, comprising:
  a CNN training unit (20) for training a CNN by inputting one or more crowd patches from frames in a training set, each of the crowd patches having a predetermined ground-truth density distribution and counts of persons in the inputted crowd patches;
  a similar data retrieval unit (30) for sampling frames from a target scene image set, receiving the training images from the training set that have the determined ground-truth density distribution and counts, and retrieving similar image data from the received training frames for each of the sampled target image frames to overcome scene gaps between the target scene image set and the training images; and
  a model fine-tuning unit (40) for fine-tuning the CNN by inputting the similar image data to the CNN so as to determine a predictive model for predicting the crowd density map and counts of persons in image frames.
- The apparatus according to claim 9, further comprising:
  a ground-truth density map creation unit (10) for selecting image patches from one or more training image frames in the training set, and determining the ground-truth crowd distribution in the selected patches and the ground-truth total counts of pedestrians in the selected patches.
- The apparatus according to claim 10, wherein the ground-truth density map creation unit (10) is configured to determine the ground-truth crowd distribution in the selected patches and the ground-truth total counts of pedestrians in the selected patches by:
  identifying each person with a labeled head position in each frame;
  determining a body shape kernel of each identified person; and
  combining all determined kernels to form the ground-truth density map/distribution for each frame, wherein the counts for the persons in the crowds are equal to the sum of all the values for the body shape kernels in the ground-truth density map/distribution.
- The apparatus according to claim 9, wherein the CNN training unit (20) trains the CNN by:
  initializing the CNN randomly based on a Gaussian distribution;
  sampling patches from the training images;
  estimating, by the CNN, crowd distribution in the sampled patches and a total number of pedestrians in the sampled patches;
  updating parameters for the CNN until the estimated distribution converges to the ground-truth distribution; and
  further updating parameters for the CNN until the estimated number converges to the determined ground-truth number so as to obtain a pre-trained CNN.
- The apparatus of claim 12, wherein the similar data retrieval unit (30) is configured to:
  retrieve, from the training image frames, candidate fine-tuning frame data that has a perspective distribution similar to that of the target image frames; and
  select, from candidate scenes, similar patches which have density distributions similar to those of the target image frames.
- The apparatus of claim 13, wherein the fine-tuning unit is further configured for:
  sampling patches from the similar patches;
  estimating, by the pre-trained CNN, crowd distribution in the sampled patches and a total number of pedestrians in the sampled patches;
  updating parameters for the CNN until the estimated distribution converges to the ground-truth distribution; and
  further updating parameters for the pre-trained CNN until the estimated number converges to the determined ground-truth number to obtain a fine-tuned CNN.
- The apparatus of any one of claims 9-14, wherein a total count of the persons in the image frames is determined by integration over the determined ground-truth density distribution.
- The apparatus of any one of claims 9-14, wherein the crowd distribution is created based on persons' spatial location in each image frame, human body shape in each image frame, and perspective distortion of images.
- A system for generating a predictive model to predict a crowd density distribution and counts of persons in image frames, comprising:
  a memory that stores executable components; and
  a processor electrically coupled to the memory to execute the executable components to perform operations of the system, wherein the executable components comprise:
  a CNN training component for training a CNN by inputting one or more crowd patches from frames in a training set, each of the crowd patches having a predetermined ground-truth density distribution and counts of persons in the inputted crowd patches;
  a similar data retrieval component for sampling frames from a target scene image set, receiving the training images from the training set that have the determined ground-truth density distribution and counts, and retrieving similar image data from the received training frames for each of the sampled target image frames to overcome scene gaps between the target scene image set and the training images; and
  a model fine-tuning component for fine-tuning the CNN by inputting the similar image data to the CNN so as to determine a predictive model for predicting the crowd density map and counts of persons in image frames.
- The system according to claim 17, further comprising:
  a ground-truth density map creation component for selecting image patches from one or more training image frames in the training set, and determining the ground-truth crowd distribution in the selected patches and the ground-truth total counts of pedestrians in the selected patches.
- The system according to claim 18, wherein the ground-truth density map creation component is configured to determine the ground-truth crowd distribution in the selected patches and the ground-truth total counts of pedestrians in the selected patches by:
  identifying each person with a labeled head position in each frame;
  determining a body shape kernel of each identified person; and
  combining all determined kernels to form the ground-truth density map/distribution for each frame, wherein the counts for the persons in the crowds are equal to the sum of all the values for the body shape kernels in the ground-truth density map/distribution.
- The system according to claim 17, wherein the CNN training component trains the CNN by:
  initializing the CNN randomly based on a Gaussian distribution;
  sampling patches from the training images;
  estimating, by the CNN, crowd distribution in the sampled patches and a total number of pedestrians in the sampled patches;
  updating parameters for the CNN until the estimated distribution converges to the ground-truth distribution; and
  further updating parameters for the CNN until the estimated number converges to the determined ground-truth number to obtain a pre-trained CNN.
- The system of claim 20, wherein the similar data retrieval component is configured to:
  retrieve, from the training image frames, candidate fine-tuning frame data that has a perspective distribution similar to that of the target image frames; and
  select, from candidate scenes, similar patches which have density distributions similar to those of the target image frames.
- The system of claim 21, wherein the fine-tuning component is further configured for:
  sampling patches from the similar patches;
  estimating, by the pre-trained CNN, crowd distribution in the sampled patches and a total number of pedestrians in the sampled patches;
  updating parameters for the pre-trained CNN until the estimated distribution converges to the ground-truth distribution; and
  further updating parameters for the CNN until the estimated number converges to the determined ground-truth number to obtain a fine-tuned CNN.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/CN2015/079178 WO2016183766A1 (en) | 2015-05-18 | 2015-05-18 | Method and apparatus for generating predictive models |
CN201580080145.XA CN107624189B (en) | 2015-05-18 | 2015-05-18 | Method and apparatus for generating a predictive model |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/CN2015/079178 WO2016183766A1 (en) | 2015-05-18 | 2015-05-18 | Method and apparatus for generating predictive models |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2016183766A1 true WO2016183766A1 (en) | 2016-11-24 |
Family
ID=57319199
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2015/079178 WO2016183766A1 (en) | 2015-05-18 | 2015-05-18 | Method and apparatus for generating predictive models |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN107624189B (en) |
WO (1) | WO2016183766A1 (en) |
Families Citing this family (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109815936B (en) * | 2019-02-21 | 2023-08-22 | 深圳市商汤科技有限公司 | Target object analysis method and device, computer equipment and storage medium |
CN110197502B (en) * | 2019-06-06 | 2021-01-22 | 山东工商学院 | Multi-target tracking method and system based on identity re-identification |
CN111340801A (en) * | 2020-03-24 | 2020-06-26 | 新希望六和股份有限公司 | Livestock checking method, device, equipment and storage medium |
CN112990530B (en) * | 2020-12-23 | 2023-12-26 | 北京软通智慧科技有限公司 | Regional population quantity prediction method, regional population quantity prediction device, electronic equipment and storage medium |
CN118155142A (en) * | 2024-05-09 | 2024-06-07 | 浙江大华技术股份有限公司 | Object density recognition method and event recognition method |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20090222388A1 (en) * | 2007-11-16 | 2009-09-03 | Wei Hua | Method of and system for hierarchical human/crowd behavior detection |
CN104077613A (en) * | 2014-07-16 | 2014-10-01 | 电子科技大学 | Crowd density estimation method based on cascaded multilevel convolution neural network |
CN104268524A (en) * | 2014-09-24 | 2015-01-07 | 朱毅 | Convolutional neural network image recognition method based on dynamic adjustment of training targets |
CN104573744A (en) * | 2015-01-19 | 2015-04-29 | 上海交通大学 | Fine granularity classification recognition method and object part location and feature extraction method thereof |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7991193B2 (en) * | 2007-07-30 | 2011-08-02 | International Business Machines Corporation | Automated learning for people counting systems |
CN103971100A (en) * | 2014-05-21 | 2014-08-06 | 国家电网公司 | Video-based camouflage and peeping behavior detection method for automated teller machine |
2015
- 2015-05-18: WO PCT/CN2015/079178 — WO2016183766A1, Application Filing (active)
- 2015-05-18: CN 201580080145.XA — CN107624189B (active)
Cited By (62)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20200053515A1 (en) * | 2015-11-04 | 2020-02-13 | XAD INC. (dba GROUNDTRUTH) | Systems and Methods for Discovering Lookalike Mobile Devices |
US11683655B2 (en) | 2015-11-04 | 2023-06-20 | xAd, Inc. | Systems and methods for predicting mobile device locations using processed mobile device signals |
US10880682B2 (en) | 2015-11-04 | 2020-12-29 | xAd, Inc. | Systems and methods for creating and using geo-blocks for location-based information service |
US10715962B2 (en) * | 2015-11-04 | 2020-07-14 | Xad Inc. | Systems and methods for predicting lookalike mobile devices |
CN107566781A (en) * | 2016-06-30 | 2018-01-09 | 北京旷视科技有限公司 | Video frequency monitoring method and video monitoring equipment |
CN107566781B (en) * | 2016-06-30 | 2019-06-21 | 北京旷视科技有限公司 | Video monitoring method and video monitoring equipment |
CN106997459A (en) * | 2017-04-28 | 2017-08-01 | 成都艾联科创科技有限公司 | A kind of demographic method split based on neutral net and image congruencing and system |
CN106997459B (en) * | 2017-04-28 | 2020-06-26 | 成都艾联科创科技有限公司 | People counting method and system based on neural network and image superposition segmentation |
CN108875456A (en) * | 2017-05-12 | 2018-11-23 | 北京旷视科技有限公司 | Object detection method, object detecting device and computer readable storage medium |
CN107330364A (en) * | 2017-05-27 | 2017-11-07 | 上海交通大学 | A kind of people counting method and system based on cGAN networks |
CN107330364B (en) * | 2017-05-27 | 2019-12-03 | 上海交通大学 | A kind of people counting method and system based on cGAN network |
CN107563349A (en) * | 2017-09-21 | 2018-01-09 | 电子科技大学 | A kind of Population size estimation method based on VGGNet |
CN107657226A (en) * | 2017-09-22 | 2018-02-02 | 电子科技大学 | A kind of Population size estimation method based on deep learning |
CN107609597B (en) * | 2017-09-26 | 2020-10-13 | 嘉世达电梯有限公司 | Elevator car number detection system and detection method thereof |
CN107609597A (en) * | 2017-09-26 | 2018-01-19 | 嘉世达电梯有限公司 | A kind of number of people in lift car detecting system and its detection method |
US11270441B2 (en) | 2017-11-01 | 2022-03-08 | Nokia Technologies Oy | Depth-aware object counting |
WO2019084854A1 (en) * | 2017-11-01 | 2019-05-09 | Nokia Technologies Oy | Depth-aware object counting |
CN111295689B (en) * | 2017-11-01 | 2023-10-03 | 诺基亚技术有限公司 | Depth aware object counting |
CN111295689A (en) * | 2017-11-01 | 2020-06-16 | 诺基亚技术有限公司 | Depth aware object counting |
CN107977025A (en) * | 2017-11-07 | 2018-05-01 | 中国农业大学 | A kind of regulator control system and method for industrialized aquiculture dissolved oxygen |
CN108154089B (en) * | 2017-12-11 | 2021-07-30 | 中山大学 | Size-adaptive-based crowd counting method for head detection and density map |
CN108154089A (en) * | 2017-12-11 | 2018-06-12 | 中山大学 | A kind of people counting method of head detection and density map based on dimension self-adaption |
CN108615027B (en) * | 2018-05-11 | 2021-10-08 | 常州大学 | Method for counting video crowd based on long-term and short-term memory-weighted neural network |
CN108615027A (en) * | 2018-05-11 | 2018-10-02 | 常州大学 | A method of video crowd is counted based on shot and long term memory-Weighted Neural Network |
US11302104B2 (en) | 2018-07-02 | 2022-04-12 | Baidu Online Network Technology (Beijing) Co., Ltd. | Method, apparatus, device, and storage medium for predicting the number of people of dense crowd |
EP3534300A3 (en) * | 2018-07-02 | 2019-12-18 | Baidu Online Network Technology (Beijing) Co., Ltd. | Method, apparatus, device, and storage medium for predicting the number of people of dense crowd |
CN109117791A (en) * | 2018-08-14 | 2019-01-01 | 中国电子科技集团公司第三十八研究所 | A kind of crowd density drawing generating method based on expansion convolution |
US11134359B2 (en) | 2018-08-17 | 2021-09-28 | xAd, Inc. | Systems and methods for calibrated location prediction |
US11172324B2 (en) | 2018-08-17 | 2021-11-09 | xAd, Inc. | Systems and methods for predicting targeted location events |
US10939233B2 (en) | 2018-08-17 | 2021-03-02 | xAd, Inc. | System and method for real-time prediction of mobile device locations |
CN109635634B (en) * | 2018-10-29 | 2023-03-31 | 西北大学 | Pedestrian re-identification data enhancement method based on random linear interpolation |
CN109635634A (en) * | 2018-10-29 | 2019-04-16 | 西北大学 | A kind of pedestrian based on stochastic linear interpolation identifies data enhancement methods again |
CN109447008A (en) * | 2018-11-02 | 2019-03-08 | 中山大学 | Population analysis method based on attention mechanism and deformable convolutional neural networks |
CN109409318A (en) * | 2018-11-07 | 2019-03-01 | 四川大学 | Training method, statistical method, device and the storage medium of statistical model |
CN111191667B (en) * | 2018-11-15 | 2023-08-18 | 天津大学青岛海洋技术研究院 | Crowd counting method based on multiscale generation countermeasure network |
CN111191667A (en) * | 2018-11-15 | 2020-05-22 | 天津大学青岛海洋技术研究院 | Crowd counting method for generating confrontation network based on multiple scales |
CN111291587A (en) * | 2018-12-06 | 2020-06-16 | 深圳光启空间技术有限公司 | Pedestrian detection method based on dense crowd, storage medium and processor |
CN110826496B (en) * | 2019-11-07 | 2023-04-07 | 腾讯科技(深圳)有限公司 | Crowd density estimation method, device, equipment and storage medium |
CN110826496A (en) * | 2019-11-07 | 2020-02-21 | 腾讯科技(深圳)有限公司 | Crowd density estimation method, device, equipment and storage medium |
US11106904B2 (en) * | 2019-11-20 | 2021-08-31 | Omron Corporation | Methods and systems for forecasting crowd dynamics |
CN110942015B (en) * | 2019-11-22 | 2023-04-07 | 上海应用技术大学 | Crowd density estimation method |
CN110942015A (en) * | 2019-11-22 | 2020-03-31 | 上海应用技术大学 | Crowd density estimation method |
CN111062275A (en) * | 2019-12-02 | 2020-04-24 | 汇纳科技股份有限公司 | Multi-level supervision crowd counting method, device, medium and electronic equipment |
CN111178235A (en) * | 2019-12-27 | 2020-05-19 | 卓尔智联(武汉)研究院有限公司 | Target quantity determination method, device, equipment and storage medium |
CN111274900A (en) * | 2020-01-15 | 2020-06-12 | 北京航空航天大学 | Empty-base crowd counting method based on bottom layer feature extraction |
CN111626141A (en) * | 2020-04-30 | 2020-09-04 | 上海交通大学 | Crowd counting model establishing method based on generated image, counting method and system |
CN111626141B (en) * | 2020-04-30 | 2023-06-02 | 上海交通大学 | Crowd counting model building method, counting method and system based on generated image |
CN111652168A (en) * | 2020-06-09 | 2020-09-11 | 腾讯科技(深圳)有限公司 | Group detection method, device and equipment based on artificial intelligence and storage medium |
CN111652168B (en) * | 2020-06-09 | 2023-09-08 | 腾讯科技(深圳)有限公司 | Group detection method, device, equipment and storage medium based on artificial intelligence |
CN112001274B (en) * | 2020-08-06 | 2023-11-17 | 腾讯科技(深圳)有限公司 | Crowd density determining method, device, storage medium and processor |
CN112001274A (en) * | 2020-08-06 | 2020-11-27 | 腾讯科技(深圳)有限公司 | Crowd density determination method, device, storage medium and processor |
CN111898578A (en) * | 2020-08-10 | 2020-11-06 | 腾讯科技(深圳)有限公司 | Crowd density acquisition method and device, electronic equipment and computer program |
CN111898578B (en) * | 2020-08-10 | 2023-09-19 | 腾讯科技(深圳)有限公司 | Crowd density acquisition method and device and electronic equipment |
CN113822111A (en) * | 2021-01-19 | 2021-12-21 | 北京京东振世信息技术有限公司 | Crowd detection model training method and device and crowd counting method and device |
CN113822111B (en) * | 2021-01-19 | 2024-05-24 | 北京京东振世信息技术有限公司 | Crowd detection model training method and device and crowd counting method and device |
CN112801018B (en) * | 2021-02-07 | 2023-07-07 | 广州大学 | Cross-scene target automatic identification and tracking method and application |
CN112801018A (en) * | 2021-02-07 | 2021-05-14 | 广州大学 | Cross-scene target automatic identification and tracking method and application |
CN113033342A (en) * | 2021-03-10 | 2021-06-25 | 西北工业大学 | Crowd scene pedestrian target detection and counting method based on density estimation |
CN113269224B (en) * | 2021-03-24 | 2023-10-31 | 华南理工大学 | Scene image classification method, system and storage medium |
CN113269224A (en) * | 2021-03-24 | 2021-08-17 | 华南理工大学 | Scene image classification method, system and storage medium |
CN113920391A (en) * | 2021-09-17 | 2022-01-11 | 北京理工大学 | Target counting method based on generated scale self-adaptive true value graph |
CN115293465A (en) * | 2022-10-09 | 2022-11-04 | 枫树谷(成都)科技有限责任公司 | Crowd density prediction method and system |
Also Published As
Publication number | Publication date |
---|---|
CN107624189B (en) | 2020-11-20 |
CN107624189A (en) | 2018-01-23 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2016183766A1 (en) | Method and apparatus for generating predictive models | |
Hossain et al. | Crowd counting using scale-aware attention networks | |
Pal et al. | Deep learning in multi-object detection and tracking: state of the art | |
Liu et al. | Exploiting unlabeled data in cnns by self-supervised learning to rank | |
Ming et al. | Deep learning-based person re-identification methods: A survey and outlook of recent works | |
Yang et al. | Sanet: Scene agnostic network for camera localization | |
US11836931B2 (en) | Target detection method, apparatus and device for continuous images, and storage medium | |
US20180114071A1 (en) | Method for analysing media content | |
CN110717411A (en) | Pedestrian re-identification method based on deep layer feature fusion | |
Fooladgar et al. | Multi-modal attention-based fusion model for semantic segmentation of RGB-depth images | |
Heo et al. | Monocular depth estimation using whole strip masking and reliability-based refinement | |
CN108875456B (en) | Object detection method, object detection apparatus, and computer-readable storage medium | |
EP3249610B1 (en) | A method, an apparatus and a computer program product for video object segmentation | |
Porikli et al. | Object tracking in low-frame-rate video | |
Li et al. | Toward a comprehensive face detector in the wild | |
CN112070044A (en) | Video object classification method and device | |
CN115240121B (en) | Joint modeling method and device for enhancing local features of pedestrians | |
CN112084952A (en) | Video point location tracking method based on self-supervision training | |
Aldhaheri et al. | MACC Net: Multi-task attention crowd counting network | |
Dahirou et al. | Motion Detection and Object Detection: Yolo (You Only Look Once) | |
Delibasoglu et al. | Motion detection in moving camera videos using background modeling and FlowNet | |
Zhang et al. | Visual Object Tracking via Cascaded RPN Fusion and Coordinate Attention. | |
Li et al. | TFMFT: Transformer-based multiple fish tracking | |
Liu et al. | [Retracted] Mean Shift Fusion Color Histogram Algorithm for Nonrigid Complex Target Tracking in Sports Video | |
Thangaraj et al. | A competent frame work for efficient object detection, tracking and classification |
Legal Events
Date | Code | Title | Description
---|---|---|---
| 121 | EP: the EPO has been informed by WIPO that EP was designated in this application | Ref document number: 15892145; Country of ref document: EP; Kind code of ref document: A1
| NENP | Non-entry into the national phase | Ref country code: DE
| 122 | EP: PCT application non-entry in European phase | Ref document number: 15892145; Country of ref document: EP; Kind code of ref document: A1