CN111191524A - Sports people counting method - Google Patents


Info

Publication number
CN111191524A
CN111191524A (application number CN201911275692.XA)
Authority
CN
China
Prior art keywords
effective feature points
class
similarity
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911275692.XA
Other languages
Chinese (zh)
Inventor
文国坤 (Wen Guokun)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
HUADI COMPUTER GROUP CO Ltd
Original Assignee
HUADI COMPUTER GROUP CO Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by HUADI COMPUTER GROUP CO Ltd filed Critical HUADI COMPUTER GROUP CO Ltd
Priority to CN201911275692.XA priority Critical patent/CN111191524A/en
Publication of CN111191524A publication Critical patent/CN111191524A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/50Context or environment of the image
    • G06V20/52Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • G06V20/53Recognition of crowd images, e.g. recognition of crowd congestion
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • G06F18/231Hierarchical techniques, i.e. dividing or merging pattern sets so as to obtain a dendrogram
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/243Classification techniques relating to the number of classes
    • G06F18/24323Tree-organised classifiers
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a moving-crowd counting method, which comprises the following steps: acquiring a foreground image of the current frame image, detecting all feature points in the current frame image using the foreground image as a mask, and extracting the effective feature points; tracking each effective feature point through the image sequence to obtain its trajectory; spatially clustering all effective feature points to form a plurality of classes of effective feature points; calculating the similarity between all effective feature point trajectories within each class, and forming new classes after eliminating all isolated effective feature points in each class; fusing two new classes into a class pair based on the similarity and the fusion reliability between any two of the new classes; and constructing a minimum spanning tree for each class pair after fusion and for each new class that cannot be fused, and judging whether each minimum spanning tree belongs to a moving human body based on the number of effective feature points it contains. The method improves the accuracy of moving-crowd counting.

Description

Sports people counting method
Technical Field
The invention relates to the technical field of computer vision, and in particular to a moving-crowd counting method.
Background
At present, computer vision technology is developing rapidly, and researchers in many countries have proposed people-counting methods. However, because human behavior is complex, environments are changeable, and people easily occlude one another, satisfactory results have remained elusive. Some methods achieve high accuracy when only a few targets are present, but their detection performance drops sharply once the crowd becomes dense. Current people-counting methods fall mainly into methods based on human appearance models and methods based on human motion models. In certain specific environments the two methods have their respective advantages, and combining them gives better results, but the combination is computationally time-consuming, costly, and complex to analyze.
Therefore, a moving-crowd counting method with higher accuracy, applicable to surveillance scenes with high crowd density, is needed.
Disclosure of Invention
The object of the invention is to provide a moving-crowd counting method that improves counting accuracy in surveillance scenes with high crowd density.
To achieve the above object, the present invention provides a moving-crowd counting method, comprising:
step 1: acquiring a foreground image of a current frame image in an image, detecting all feature points in the current frame image by taking the foreground image as a mask and extracting effective feature points;
step 2: respectively tracking each effective characteristic point in continuous multi-frame images before and after the current frame image, and recording the position of each effective characteristic point in each frame image to obtain the track of each effective characteristic point;
step 3: calculating the spatial distance between every two trajectories, and performing spatial clustering on all effective feature points based on the spatial distance to form a plurality of classes of effective feature points, wherein each class comprises a plurality of effective feature points;
step 4: calculating the similarity between all effective feature point trajectories in each class, and forming new classes after eliminating all isolated effective feature points in each class according to the similarity;
step 5: calculating the similarity and the fusion reliability between any two new classes, and fusing two new classes into a class pair based on the similarity and the fusion reliability;
step 6: and constructing each class pair after the fusion is finished and each new class which cannot be fused into a minimum spanning tree respectively, judging whether the minimum spanning tree belongs to a moving human body or not based on the number of effective feature points contained in the minimum spanning tree, and obtaining the number of the moving people in the image according to the number of the minimum spanning trees belonging to the moving human body.
Optionally, a valid (effective) feature point is a detected feature point whose matching error in subsequent images does not exceed one pixel and which satisfies the following discriminant:
W(Df, 2) = W(Df+1, 1) = Df+2;
where the function W(Df, n) denotes tracking the feature points obtained in the current frame image Df and returning their position n frames later; Df is the current frame image, Df+1 the next frame image, and Df+2 the frame after next.
Optionally, the step 2 includes:
respectively tracking each effective feature point, using a hierarchical (pyramidal) optical flow method, through the continuous multi-frame images before and after the current frame image, recording the position of each effective feature point in each frame image, and acquiring the trajectories {X1, X2, ..., Xm} of all the effective feature points, wherein m is the number of effective feature points.
Optionally, the step 3 further includes:
if a reliable position of an effective feature point is not tracked in a certain frame image, its position in that frame is obtained by linear interpolation using the last known reliable velocity, and tracking then continues.
Optionally, calculating the spatial distance between every two trajectories comprises:
computing the Euclidean distance between the two effective feature points in each frame image, and then selecting the maximum value to represent the spatial distance between the two trajectories;
the spatial distance being calculated according to the following formula:
dist(Xr, Xs) = max(dist(Xir, Xis));
where dist(Xr, Xs) denotes the spatial distance between the two effective feature point trajectories Xr and Xs, dist(Xir, Xis) denotes the Euclidean distance between the two trajectories in the i-th frame image, and i ranges over the frames in which the trajectories are tracked.
Optionally, the spatially clustering all the valid feature points based on the spatial distance to form multiple classes of valid feature points includes:
performing spatial clustering by a maximum-tree clustering method to form a plurality of classes of effective feature points, wherein the number of classes is 3 to 5 times the maximum expected number of persons in the scene of the current frame image, and is at most 1/2 of the number of effective feature points.
Optionally, the similarity between all valid feature point trajectories in each class is calculated by the following formula:
Q(Xu, Xv) = 1/(1 + Var(dist(Xu, Xv)));
where Q(Xu, Xv) denotes the similarity of the trajectories Xu and Xv of any two effective feature points, Var(dist(Xu, Xv)) denotes the variance of the per-frame distance between the two trajectories, u ∈ {1, ..., m}, v ∈ {1, ..., m}, and m is the number of effective feature points in the class.
Optionally, the method for determining the isolated valid feature point is:
if the similarity between the trajectory of an effective feature point and the trajectories of the other effective feature points in its class is smaller than a preset similarity, that effective feature point is an isolated effective feature point.
Optionally, the step 5 comprises:
calculating the similarity and the fusion reliability by the following formulas, respectively:
P(Ci, Cj) = average(Q(Xi, Xj)), {Xi ∈ Ci, Xj ∈ Cj};
V(Ci, Cj) = Var(Q(Xi, Xj)), {Xi ∈ Ci, Xj ∈ Cj};
where P(Ci, Cj) denotes the similarity between any two new classes Ci and Cj, V(Ci, Cj) denotes the fusion reliability of the two classes, Q(Xi, Xj) denotes the similarity between any effective feature point trajectory Xi in Ci and any effective feature point trajectory Xj in Cj, average denotes the mean value, Var denotes the variance, i ∈ {1, ..., k}, j ∈ {1, ..., k}, and k is the number of new classes;
and when P(Ci, Cj) > 0.3, V(Ci, Cj) < 0.12, and the distance between the centers of Ci and Cj is less than a preset multiple of the estimated visual scale of a human body in the image, Ci and Cj are fused into a class pair by a greedy algorithm.
Optionally, the determining whether the minimum spanning tree belongs to a moving human body based on the number of effective feature points included in the minimum spanning tree includes:
when a minimum spanning tree contains at least 3 effective feature points and these points are not collinear, the minimum spanning tree is considered to be one moving human body.
The invention has the beneficial effects that:
the invention provides a framework for counting human bodies by twice clustering by utilizing independent motion information of objects, tracks by detecting reliable effective characteristic points on moving objects to obtain characteristic motion tracks, tracks by detecting reliable effective characteristic points on the moving objects to obtain the motion tracks of effective characteristics, and performs spatial clustering on the effective characteristic points to meet the requirement of distinguishing independent moving human bodies.
The apparatus of the present invention has other features and advantages which will be apparent from or are set forth in detail in the accompanying drawings and the following detailed description, which are incorporated herein, and which together serve to explain certain principles of the invention.
Drawings
The above and other objects, features and advantages of the present invention will become more apparent by describing in more detail exemplary embodiments thereof with reference to the attached drawings, in which like reference numerals generally represent like parts.
Fig. 1 shows a flow chart of the steps of the moving-crowd counting method according to the invention.
Detailed Description
First, the feature points of the two-dimensional image are tracked along the time axis to obtain their motion trajectories. All reliable feature points in the current frame image are detected and tracked to obtain their motion trajectories, and the similarity Z(Xi, Xj) between any two trajectories is then computed, yielding a similarity matrix Z(X1:N) of the feature trajectories. Assuming that each person in the surveillance scene contributes several feature trajectories, the task is to find the most likely clusters among all feature trajectories, so that each resulting feature group may represent one moving human body. Reliable and accurate clustering is critical, but enumerating all possible clusterings would be computationally prohibitive. We therefore use basic motion information to limit the number of candidate clusters, making the clustering feasible. The moving-crowd counting method of the invention is based on this idea.
The invention will be described in more detail below with reference to the accompanying drawings. While the preferred embodiments of the present invention are shown in the drawings, it should be understood that the present invention may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the invention to those skilled in the art.
Fig. 1 shows a flow chart of the steps of the moving-crowd counting method according to the invention.
Referring to fig. 1, the moving-crowd counting method according to the present invention comprises:
step 1: acquiring a foreground image of a current frame image in the image, detecting all feature points in the current frame image by taking the foreground image as a mask and extracting effective feature points;
step 2: respectively tracking each effective characteristic point in continuous multi-frame images before and after the current frame image, and recording the position of each effective characteristic point in each frame image to obtain the track of each effective characteristic point;
step 3: calculating the spatial distance between every two trajectories, and performing spatial clustering on all effective feature points based on the spatial distance to form a plurality of classes of effective feature points, wherein each class comprises a plurality of effective feature points;
step 4: calculating the similarity between all effective feature point trajectories in each class, and forming new classes after eliminating all isolated effective feature points in each class according to the similarity;
step 5: calculating the similarity and the fusion reliability between any two new classes, and fusing two new classes into a class pair based on the similarity and the fusion reliability;
step 6: and constructing each class pair after the fusion is finished and each new class which cannot be fused into a minimum spanning tree respectively, judging whether the minimum spanning tree belongs to a moving human body or not based on the number of effective feature points contained in the minimum spanning tree, and obtaining the number of the moving people in the image according to the number of the minimum spanning trees belonging to the moving human body.
Specifically, reliable effective feature points on the moving objects are detected and tracked to obtain the motion trajectories of the effective features; the effective feature points are spatially clustered; trajectory similarity and inter-class similarity are then defined and motion-consistency clustering is performed. This satisfies the requirement of distinguishing independently moving human bodies, so the method can be applied to surveillance scenes with high crowd density and achieves higher accuracy.
In this embodiment, a valid (effective) feature point is a detected feature point whose matching error in subsequent images does not exceed one pixel and which satisfies the following discriminant:
W(Df, 2) = W(Df+1, 1) = Df+2;
where the function W(Df, n) denotes tracking the feature points obtained in the current frame image Df and returning their position n frames later; Df is the current frame image, Df+1 the next frame image, and Df+2 the frame after next.
In particular, reliable features are the basis of an accurate crowd count, so selecting good features is important. A reliable feature must be trackable through successive images with high confidence. To select, from all detected features, those that can be tracked stably, we require that the matching error of an independently detected feature in subsequent images be no more than one pixel. The function W(Df, n) denotes tracking the features obtained in frame f and returning their position n frames later. The following discriminant is therefore given:
W(Df, 2) = W(Df+1, 1) = Df+2 (1)
The discriminant states that tracking a feature point two frames forward from the current frame, and tracking it one frame forward from the next frame, must both arrive at the feature point's detected position in the frame after next.
All features satisfying the above formula are selected as valid features, where Df denotes all feature points detected in the f-th frame image. During feature point detection, points in a complex background that satisfy the detection condition are unavoidably detected as feature points; such points fully conform to formula (1) and would therefore participate in subsequent processing as valid feature points. These points are in fact noise and would affect the accuracy of the subsequent crowd count. Therefore, before feature point detection, a foreground image is obtained by background subtraction and used as a mask during detection, so that the obtained feature points are essentially feature points on foreground objects and the influence of background noise is eliminated. It should be noted that the feature points of a moving human body in the monitored image are obtained by methods familiar to those skilled in the art, which are not described again here.
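The validity discriminant of formula (1) can be sketched as a forward-consistency check. The sketch below is illustrative only: `track` stands in for the hierarchical optical-flow tracker of the embodiment (here a hypothetical toy tracker with linear motion), and the names `is_valid_feature` and `linear_track` are ours, not the patent's.

```python
import numpy as np

def is_valid_feature(track, p_f, p_f1, p_f2, tol=1.0):
    """Forward-consistency check of formula (1):
    W(Df, 2) = W(Df+1, 1) = Df+2, within a one-pixel tolerance.

    track(p, n) returns the tracked position of point p after n frames;
    p_f, p_f1, p_f2 are the detected positions of the same feature
    in frames f, f+1 and f+2.
    """
    target = np.asarray(p_f2, float)                      # Df+2
    via_f  = np.asarray(track(p_f, 2), float)             # W(Df, 2)
    via_f1 = np.asarray(track(p_f1, 1), float)            # W(Df+1, 1)
    return (np.linalg.norm(via_f - target) <= tol and
            np.linalg.norm(via_f1 - target) <= tol)

# Hypothetical toy tracker: a point drifting one pixel right per frame.
def linear_track(p, n):
    return np.asarray(p, float) + np.array([n, 0.0])

print(is_valid_feature(linear_track, (10, 5), (11, 5), (12, 5)))  # True
print(is_valid_feature(linear_track, (10, 5), (11, 5), (20, 5)))  # False
```

A real implementation would replace `linear_track` with the hierarchical optical-flow tracking of step 2; the one-pixel threshold `tol` is the matching-error bound stated above.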
In this embodiment, step 2 includes:
respectively tracking each effective feature point, using a hierarchical (pyramidal) optical flow method, through the 30 consecutive frame images before the current frame image and the 30 consecutive frame images after it, recording the position of each effective feature point in each frame image, and acquiring the trajectories {X1, X2, ..., Xm} of all effective feature points, wherein m is the number of effective feature points. If a reliable position of an effective feature point is not tracked in a certain frame image, its position in that frame is obtained by linear interpolation using the last known reliable velocity, and tracking then continues.
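The gap-filling step above can be sketched with NumPy. The function name `fill_gaps` and the NaN convention for untracked frames are our assumptions; for a point moving at the known reliable velocity, linear interpolation between the surrounding reliable positions is equivalent to carrying that velocity across the gap.

```python
import numpy as np

def fill_gaps(traj):
    """Fill untracked frames (NaN rows) of a trajectory by linear
    interpolation, as the embodiment does when no reliable position
    is tracked in some frame.  traj is an (n_frames, 2) array of
    (x, y) positions."""
    traj = np.asarray(traj, dtype=float).copy()
    frames = np.arange(len(traj))
    for axis in range(traj.shape[1]):
        good = ~np.isnan(traj[:, axis])                  # reliably tracked frames
        traj[:, axis] = np.interp(frames, frames[good], traj[good, axis])
    return traj

# A point moving one pixel per frame, lost in frame 2:
t = np.array([[0.0, 0.0], [1.0, 1.0], [np.nan, np.nan], [3.0, 3.0]])
print(fill_gaps(t)[2])  # [2. 2.] — the missing frame is interpolated
```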
In this embodiment, calculating the spatial distance between every two trajectories comprises:
computing the Euclidean distance between the two effective feature points in each frame image, and then selecting the maximum value to represent the spatial distance between the two trajectories;
the spatial distance being calculated according to the following formula:
dist(Xr, Xs) = max(dist(Xir, Xis)) (2);
where dist(Xr, Xs) denotes the spatial distance between the two effective feature point trajectories Xr and Xs, dist(Xir, Xis) denotes the Euclidean distance between the two trajectories in the i-th frame image, and i ranges over the frames in which the trajectories are tracked.
In particular, feature points that are closer together are more likely to come from the same human body, while points that are very far apart are certainly not from the same human body. Therefore, to limit the number of classes (clusters) in the final clustering, an initial clustering is performed using the spatial distance information of the feature points; this is the spatial clustering. The Euclidean distance between two feature points is the most intuitive and reliable representation of spatial distance. To enhance the reliability of the clustering, the maximum distance between the two feature trajectories is used: the Euclidean distance between the two feature points is computed in each frame, and the maximum value is selected to represent the spatial distance between the two feature trajectories.
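The spatial distance of formula (2) is the maximum over the per-frame Euclidean distances. A minimal NumPy sketch (function name ours):

```python
import numpy as np

def traj_distance(Xr, Xs):
    """Spatial distance of formula (2): the maximum over all frames
    of the per-frame Euclidean distance between two trajectories,
    each given as an (n_frames, 2) array."""
    Xr, Xs = np.asarray(Xr, float), np.asarray(Xs, float)
    return np.linalg.norm(Xr - Xs, axis=1).max()

a = np.array([[0, 0], [1, 0], [2, 0]], float)
b = np.array([[0, 1], [1, 2], [2, 1]], float)
print(traj_distance(a, b))  # 2.0 — the largest per-frame separation
```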
In this embodiment, performing spatial clustering on all the effective feature points based on the spatial distance to form multiple classes of effective feature points includes:
performing spatial clustering by a maximum-tree clustering method to form a plurality of classes of effective feature points, wherein the number of classes is 3 to 5 times the maximum expected number of persons in the scene of the current frame image, and is at most 1/2 of the number of effective feature points.
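The embodiment names a maximum-tree clustering method. As an illustrative stand-in, the sketch below performs the equivalent single-linkage grouping: trajectories whose spatial distance is below a threshold are linked, and connected components become the initial classes (cutting a spanning tree at the same threshold yields the same partition). The bounds on the number of classes stated above are not enforced here, and the function name is ours.

```python
import numpy as np

def spatial_clusters(dist, thresh):
    """Single-linkage stand-in for the maximum-tree clustering step.
    dist is a symmetric (m, m) matrix of trajectory distances; points
    closer than `thresh` are linked and components become classes."""
    m = len(dist)
    parent = list(range(m))                     # union-find forest
    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]       # path compression
            a = parent[a]
        return a
    for i in range(m):
        for j in range(i + 1, m):
            if dist[i][j] < thresh:
                parent[find(i)] = find(j)       # union the two sets
    labels = [find(i) for i in range(m)]
    return [sorted(k for k in range(m) if labels[k] == r)
            for r in sorted(set(labels))]

# Points 0 and 1 are close; point 2 is far away.
d = np.array([[0, 1, 9], [1, 0, 9], [9, 9, 0]], float)
print(spatial_clusters(d, 2.0))  # [[0, 1], [2]]
```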
In this embodiment, the similarity between all effective feature point trajectories in each class is calculated by the following formula:
Q(Xu, Xv) = 1/(1 + Var(dist(Xu, Xv))) (3);
where Q(Xu, Xv) denotes the similarity of the trajectories Xu and Xv of any two effective feature points, Var(dist(Xu, Xv)) denotes the variance of the per-frame distance between the two trajectories, u ∈ {1, ..., m}, v ∈ {1, ..., m}, and m is the number of effective feature points in the class. If the similarity between the trajectory of an effective feature point and the trajectories of the other effective feature points in its class is smaller than a preset similarity, that effective feature point is an isolated effective feature point.
Specifically, if only the spatial distance between feature points is considered, the spatial clustering is relatively rough and includes some noise. A more serious case is the following: when there are many noise points among the feature points, such as background noise feature points that were not eliminated, and the distances between them are large, the noise points occupy a large part of a class, so that real feature points with completely different motion trajectories, but not far apart in space, are grouped into one class. Therefore, before the final clustering, each class is divided: isolated feature points (point clusters) in the class are separated out to form new classes. The criterion for judging that a feature point (cluster) is isolated is that its similarity (formula (3) above) with the other feature trajectories in the class is less than 0.2. The value 0.2 was obtained experimentally; a person skilled in the art can select different values for different application scenarios.
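Formula (3) and the isolation criterion can be sketched as follows. The function names and the toy trajectories are ours; the 0.2 threshold is the experimentally chosen value of the embodiment.

```python
import numpy as np

def traj_similarity(Xu, Xv):
    """Formula (3): Q = 1 / (1 + Var(d)), where d is the per-frame
    distance between the two trajectories.  Two rigidly co-moving
    points give Var(d) = 0 and hence Q = 1."""
    d = np.linalg.norm(np.asarray(Xu, float) - np.asarray(Xv, float), axis=1)
    return 1.0 / (1.0 + d.var())

def drop_isolated(trajs, q_min=0.2):
    """Split off isolated points: a trajectory whose similarity to
    every other trajectory in its class falls below q_min is removed;
    the indices of the kept trajectories are returned."""
    keep = []
    for i, Xi in enumerate(trajs):
        if any(traj_similarity(Xi, Xj) >= q_min
               for j, Xj in enumerate(trajs) if j != i):
            keep.append(i)
    return keep

a = np.array([[0, 0], [1, 0], [2, 0]], float)   # point moving right
b = a + np.array([0, 3])                        # co-moving companion
c = np.array([[0, 0], [5, 8], [10, 0]], float)  # erratic noise point
print(traj_similarity(a, b))     # 1.0 — constant separation
print(drop_isolated([a, b, c]))  # [0, 1] — the noise point is split off
```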
In this embodiment, step 5 includes:
the similarity and fusion reliability are calculated by the following formulas, respectively:
P(Ci, Cj) = average(Q(Xi, Xj)), {Xi ∈ Ci, Xj ∈ Cj} (4);
V(Ci, Cj) = Var(Q(Xi, Xj)), {Xi ∈ Ci, Xj ∈ Cj} (5);
where P(Ci, Cj) denotes the similarity between any two new classes Ci and Cj, V(Ci, Cj) denotes the fusion reliability of the two classes, Q(Xi, Xj) denotes the similarity between any effective feature point trajectory Xi in Ci and any effective feature point trajectory Xj in Cj, average denotes the mean value, Var denotes the variance, i ∈ {1, ..., k}, j ∈ {1, ..., k}, and k is the number of new classes;
when P(Ci, Cj) > 0.3, V(Ci, Cj) < 0.12, and the distance between the centers of Ci and Cj is less than 1.5 times the estimated visual scale of a human body in the image, Ci and Cj are fused into a class pair by a greedy algorithm.
Specifically, step 5 is a motion-consistency clustering: for the result of the spatial clustering, the similarity between classes (clusters) is mined from the motion information of the feature trajectories between classes, to decide whether two classes should be merged into one. Ideally, the motion trajectories of feature points from the same human body should be consistent, so accurately defining the similarity between trajectories is critical. We assume that two independent features are more likely to come from the same human body when the variation of the distance between their trajectories is smaller. We therefore define the similarity between trajectories as Q(Xu, Xv) = 1/(1 + Var(dist(Xu, Xv)));
when two feature trajectories come from the same human body and the tracking is reliable, their Q value should be 1. In practice, a moving human body is not perfectly parallel to the image plane, so the distance cannot be completely constant; but for a body that is not moving fast, the change of visual scale in the picture over the 61 consecutive frame images is small, so the definition is reliable. Similarly, the similarity between any two classes (clusters) can be obtained from the similarity between feature trajectories; for two classes Ci and Cj, the similarity between them is defined as:
P(Ci, Cj) = average(Q(Xi, Xj)), {Xi ∈ Ci, Xj ∈ Cj};
and the reliability of fusing the two is:
V(Ci, Cj) = Var(Q(Xi, Xj)), {Xi ∈ Ci, Xj ∈ Cj};
Two classes are considered candidates for fusion only when their similarity is larger than a threshold T0; V(Ci, Cj) is then examined, and the fusion of the two classes is considered reliable only when it is smaller than a threshold T1. The larger T0 and the smaller T1, the less noise in the final classification, but the less robust the clustering, so finding suitable values of T0 and T1 is important; in this embodiment T0 = 0.3 and T1 = 0.12 give good results. Meanwhile, the inter-class similarity is only meaningful when the two classes are adjacent; when they are far apart, computing their similarity is pointless. Whether two classes are adjacent is therefore judged as follows: the distance between the two class centers is compared with the estimated visual scale of a human body in the image, and if it is larger than 1.5 times that scale, the two classes are judged not to be adjacent.
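Formulas (4) and (5) and the three-part fusion test (P > T0 = 0.3, V < T1 = 0.12, center distance under 1.5 times the body scale) can be sketched as below. The function names, the toy trajectories, and the assumed scale value are ours.

```python
import numpy as np

def traj_similarity(Xu, Xv):
    """Formula (3): Q = 1 / (1 + Var(per-frame distance))."""
    d = np.linalg.norm(np.asarray(Xu, float) - np.asarray(Xv, float), axis=1)
    return 1.0 / (1.0 + d.var())

def class_pair_stats(Ci, Cj):
    """Formulas (4) and (5): inter-class similarity P is the mean,
    and fusion reliability V the variance, of Q over all cross pairs."""
    qs = [traj_similarity(Xi, Xj) for Xi in Ci for Xj in Cj]
    return float(np.mean(qs)), float(np.var(qs))

def should_fuse(Ci, Cj, scale, T0=0.3, T1=0.12, k=1.5):
    """Fusion test of the embodiment: P > T0, V < T1, and the class
    centers closer than k times the estimated visual body scale."""
    P, V = class_pair_stats(Ci, Cj)
    ci = np.mean([np.asarray(X, float).mean(axis=0) for X in Ci], axis=0)
    cj = np.mean([np.asarray(X, float).mean(axis=0) for X in Cj], axis=0)
    return bool(P > T0 and V < T1 and np.linalg.norm(ci - cj) < k * scale)

# Four feature points translating rigidly together (one body):
base = np.array([[0, 0], [1, 0], [2, 0]], float)
Ci = [base, base + np.array([2, 0])]
Cj = [base + np.array([0, 3]), base + np.array([2, 3])]
print(should_fuse(Ci, Cj, scale=10.0))  # True — rigid joint motion
```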
In this embodiment, determining whether the minimum spanning tree belongs to a moving human body based on the number of effective feature points included in the minimum spanning tree includes:
when a minimum spanning tree contains at least 3 effective feature points and these points are not collinear (not on the same straight line), the minimum spanning tree is considered to be one moving human body.
Specifically, after the similarities of all class pairs are obtained, a method for fusing the classes must also be chosen. Any greedy algorithm or coarse-to-fine fusion algorithm may be used in this embodiment, provided the algorithm satisfies the following three conditions:
1) it can decide in real time with which class pair to start the fusion;
2) its criterion for stopping the fusion is reliable, and computations are not repeated;
3) it classifies feature points shared by the edges of two human bodies reasonably and accurately.
In this embodiment, a minimum spanning tree is constructed for each class pair satisfying the conditions (P(Ci, Cj) > T0 and V(Ci, Cj) < T1); a single class (a new class that cannot be fused) is also counted as a tree. When the number of feature points contained in a tree is at least 3 and the points are not collinear, the tree is considered a human body, and the number of moving human bodies in the image is obtained from the number of minimum spanning trees belonging to moving human bodies.
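The final counting criterion (a tree with at least three non-collinear feature points counts as one human body) can be sketched as follows. The minimum-spanning-tree construction itself is omitted: each tree is represented simply by its set of feature-point positions, collinearity is tested via the rank of the centered point set, and the function names are ours.

```python
import numpy as np

def is_human(tree_points):
    """A tree counts as one moving human body when it contains at
    least three feature points that are not all collinear."""
    pts = np.asarray(tree_points, float)
    if len(pts) < 3:
        return False
    centered = pts - pts.mean(axis=0)
    return np.linalg.matrix_rank(centered) >= 2  # rank 1 => collinear

def count_people(trees):
    """Count the trees that qualify as moving human bodies."""
    return sum(bool(is_human(t)) for t in trees)

trees = [
    [(0, 0), (1, 1), (1, 0), (0, 1)],  # spread-out cluster: one person
    [(0, 0), (1, 1), (2, 2)],          # collinear points: rejected
    [(5, 5), (6, 5)],                  # too few points: rejected
]
print(count_people(trees))  # 1
```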
Having described embodiments of the present invention, the foregoing description is intended to be exemplary, not exhaustive, and not limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments.

Claims (10)

1. A method for counting a population in motion, comprising:
step 1: acquiring a foreground image of a current frame image in an image, detecting all feature points in the current frame image by taking the foreground image as a mask and extracting effective feature points;
step 2: respectively tracking each effective characteristic point in continuous multi-frame images before and after the current frame image, and recording the position of each effective characteristic point in each frame image to obtain the track of each effective characteristic point;
step 3: calculating the spatial distance between every two trajectories, and performing spatial clustering on all effective feature points based on the spatial distance to form a plurality of classes of effective feature points, wherein each class comprises a plurality of effective feature points;
step 4: calculating the similarity between all effective feature point trajectories in each class, and forming new classes after eliminating all isolated effective feature points in each class according to the similarity;
step 5: calculating the similarity and the fusion reliability between any two new classes, and fusing two new classes into a class pair based on the similarity and the fusion reliability;
step 6: and constructing each class pair after the fusion is finished and each new class which cannot be fused into a minimum spanning tree respectively, judging whether the minimum spanning tree belongs to a moving human body or not based on the number of effective feature points contained in the minimum spanning tree, and obtaining the number of the moving people in the image according to the number of the minimum spanning trees belonging to the moving human body.
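Step 1 can be sketched as follows. This is a minimal illustration in Python/NumPy, not the patent's implementation: the foreground is obtained by simple background differencing and the "feature detector" is only a gradient-strength test (a real system would use a proper background model and a corner detector such as Shi-Tomasi); the array sizes, threshold, and function names are illustrative assumptions.

```python
import numpy as np

def foreground_mask(frame, background, thresh=25):
    """Binary foreground mask by absolute background differencing
    (illustrative stand-in for a real foreground extractor)."""
    return np.abs(frame.astype(int) - background.astype(int)) > thresh

def masked_feature_points(frame, mask):
    """Keep only candidate feature points that fall inside the mask.
    Candidates here are pixels with above-average gradient strength;
    a real detector (e.g. Shi-Tomasi corners) would be used instead."""
    gy, gx = np.gradient(frame.astype(float))
    strength = gx ** 2 + gy ** 2
    ys, xs = np.nonzero(mask & (strength > strength.mean()))
    return list(zip(ys.tolist(), xs.tolist()))

# toy 8x8 frames: a bright 2x2 blob appears on a flat background
background = np.zeros((8, 8), dtype=np.uint8)
frame = background.copy()
frame[3:5, 3:5] = 200
mask = foreground_mask(frame, background)
points = masked_feature_points(frame, mask)  # the effective candidates
```

Only points inside the moving foreground survive, which is the purpose of using the foreground image as a detection mask.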
2. The method according to claim 1, wherein the effective feature points are the detected feature points whose matching error in subsequent frames is no more than one pixel and which satisfy the following discriminant:
W(Df,2)=W(Df+1,1)=Df+2;
where the function W(Df, n) tracks a feature point from frame Df and returns its position n frames later; Df is the current frame image, Df+1 is the next frame image, and Df+2 is the frame after that, so both trackings must agree on the point's position in frame Df+2.
3. The method for counting moving people according to claim 1, wherein the step 2 comprises:
tracking each effective feature point through the consecutive frames before and after the current frame image by a hierarchical (pyramidal) optical flow method, recording the position of each effective feature point in each frame, and obtaining the trajectories {X1, X2, ..., Xm} of all effective feature points, where m is the number of effective feature points.
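A sketch of the trajectory bookkeeping in claim 3. The per-frame tracking step is abstracted into `step(p, f)` (in a real implementation this would be a pyramidal Lucas-Kanade step, e.g. OpenCV's `calcOpticalFlowPyrLK`); the function names and toy tracker are illustrative assumptions:

```python
def build_trajectories(points, frame_indices, step):
    """Track every effective feature point across the consecutive
    frames and record its per-frame position, yielding the
    trajectories {X1, ..., Xm}.  step(p, f) returns the position in
    frame f+1 of a point located at p in frame f."""
    trajectories = []
    for p in points:
        traj = [p]
        for f in frame_indices[:-1]:
            p = step(p, f)
            traj.append(p)
        trajectories.append(traj)
    return trajectories

# toy constant-motion tracker: every point moves one pixel right per frame
shift_right = lambda p, f: (p[0] + 1, p[1])
trajs = build_trajectories([(0, 0)], [0, 1, 2], shift_right)
```

Each trajectory then has one recorded position per frame, which is what the later distance and similarity computations assume.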
4. The method of claim 3, wherein the step 3 further comprises:
if a reliable position of an effective feature point cannot be tracked in a certain frame, obtaining its position in that frame by linear interpolation from the last known reliable velocity, and then continuing the tracking.
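The interpolation in claim 4 can be sketched as follows; `None` marks a frame where no reliable position was tracked, and the velocity is taken from the two preceding reliable positions (one hedged reading of "the known reliable speed"):

```python
def fill_lost_positions(traj):
    """Replace a missing position (None) with a linear estimate from
    the last reliable position and velocity, so tracking can resume."""
    out = list(traj)
    for i, p in enumerate(out):
        if p is None and i >= 2 and out[i - 1] is not None and out[i - 2] is not None:
            vx = out[i - 1][0] - out[i - 2][0]   # last reliable velocity
            vy = out[i - 1][1] - out[i - 2][1]
            out[i] = (out[i - 1][0] + vx, out[i - 1][1] + vy)
    return out
```

For a point moving one pixel right per frame, a dropped middle frame is filled in consistently with its neighbours.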
5. The method of claim 1, wherein calculating the spatial distance between every two trajectories comprises:
computing the Euclidean distance between the two effective feature points in each frame image, and taking the maximum value as the spatial distance between the two trajectories;
the spatial distance is calculated according to the following formula:
dist(Xr,Xs)=max(dist(Xir,Xis));
where dist(Xr, Xs) represents the spatial distance between the two effective feature point trajectories Xr and Xs, dist(Xir, Xis) represents the Euclidean distance between Xr and Xs in the ith frame image, and the maximum is taken over all frames i covered by the trajectories.
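The distance of claim 5 in a minimal sketch, assuming the two trajectories cover the same frames:

```python
import math

def traj_distance(Xr, Xs):
    """Spatial distance between two trajectories: the maximum, over
    the frames, of the per-frame Euclidean distance."""
    return max(math.dist(a, b) for a, b in zip(Xr, Xs))

# two trajectories observed over two frames; per-frame distances are 5 and 10
d = traj_distance([(0, 0), (0, 0)], [(3, 4), (6, 8)])
```

Taking the maximum (rather than the mean) makes the distance sensitive to any frame in which the two points drift apart, which suits the clustering that follows.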
6. The method of claim 4, wherein spatially clustering all the effective feature points based on the spatial distance to form a plurality of classes of effective feature points comprises:
performing spatial clustering by a maximal-tree clustering method to form a plurality of classes of effective feature points, the number of classes being 3 to 5 times the maximum number of people expected in the scene of the current frame image and no more than 1/2 of the number of effective feature points.
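The patent does not spell out the maximal-tree clustering procedure, so the sketch below uses one common threshold-cut reading: link any two trajectories whose distance is below a cut threshold and take the connected components as classes (equivalent to cutting the long edges of a spanning tree). The union-find implementation, threshold, and toy 1-D data are illustrative assumptions:

```python
def threshold_clusters(dist, n, thresh):
    """Group n trajectories into classes: link any pair closer than
    `thresh` and return the connected components (a threshold-cut
    reading of maximal-tree clustering; union-find keeps it simple)."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i in range(n):
        for j in range(i + 1, n):
            if dist(i, j) < thresh:
                parent[find(i)] = find(j)
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())

# four 1-D "trajectories" at positions 0, 1, 10, 11: two tight pairs
pos = [0, 1, 10, 11]
clusters = threshold_clusters(lambda i, j: abs(pos[i] - pos[j]), 4, thresh=2)
```

The cut threshold would be chosen so the resulting class count lands in the 3x-5x range claim 6 prescribes.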
7. The method of claim 1, wherein the similarity between the trajectories of all effective feature points in each class is calculated by the following formula:
Q(Xu,Xv)=1/(1+Var(dist(Xu,Xv)));
where Q(Xu, Xv) represents the similarity between the trajectories Xu and Xv of any two effective feature points, Var(dist(Xu, Xv)) represents the variance of the per-frame spatial distance between Xu and Xv, u ∈ (1, ..., m), v ∈ (1, ..., m), and m is the number of effective feature points in the class.
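The intuition behind claim 7's formula: two points on the same rigid body keep a nearly constant mutual distance as they move, so the variance of their per-frame distance is near zero and Q approaches 1. A minimal sketch (population variance assumed):

```python
import math

def similarity(Xu, Xv):
    """Q(Xu, Xv) = 1 / (1 + Var(dist(Xu, Xv))), using the population
    variance of the per-frame Euclidean distance."""
    d = [math.dist(a, b) for a, b in zip(Xu, Xv)]
    mean = sum(d) / len(d)
    var = sum((x - mean) ** 2 for x in d) / len(d)
    return 1.0 / (1.0 + var)

# parallel trajectories (constant distance 3) vs. diverging ones
q_same = similarity([(0, 0), (1, 0), (2, 0)], [(0, 3), (1, 3), (2, 3)])
q_diverge = similarity([(0, 0), (1, 0), (2, 0)], [(0, 3), (1, 4), (2, 5)])
```

Points moving together score exactly 1; points drifting apart score strictly less.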
8. The method for counting moving people according to claim 1, wherein an isolated effective feature point is determined as follows:
if the similarity between the trajectory of an effective feature point and the trajectory of each of the other effective feature points in its class is smaller than a preset similarity, the effective feature point is an isolated effective feature point.
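A sketch of the isolation test of claim 8: a trajectory is kept only if it is similar enough to at least one other trajectory in its class. The similarity function is passed in as `sim`, and the toy label-based similarity below is an illustrative assumption:

```python
def drop_isolated(trajs, sim, q_min):
    """Remove isolated trajectories: keep Xu only if its similarity to
    at least one other trajectory in the class reaches q_min."""
    keep = []
    for u, Xu in enumerate(trajs):
        if any(sim(Xu, Xv) >= q_min
               for v, Xv in enumerate(trajs) if v != u):
            keep.append(Xu)
    return keep

# toy class: "A"-tagged trajectories move together, "B" is a straggler
toy_sim = lambda a, b: 1.0 if a[0] == b[0] else 0.1
kept = drop_isolated([("A", 1), ("A", 2), ("B", 3)], toy_sim, q_min=0.5)
```

The surviving trajectories of each class form the "new class" that step 5 then tries to fuse with its neighbours.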
9. The method for counting moving people according to claim 1, wherein the step 5 comprises:
calculating the similarity and the fusion reliability by the following formulas, respectively:
P(Ci,Cj)=average(Q(Xi,Xj)),{Xi∈Ci,Xj∈Cj};
V(Ci,Cj)=Var(Q(Xi,Xj)),{Xi∈Ci,Xj∈Cj};
where P(Ci, Cj) represents the similarity between any two new classes Ci and Cj, V(Ci, Cj) represents the fusion reliability of Ci and Cj, Q(Xi, Xj) represents the similarity between any effective feature point trajectory Xi in Ci and any effective feature point trajectory Xj in Cj, average denotes the mean, Var denotes the variance, i ∈ (1, ..., k), j ∈ (1, ..., k), and k is the number of new classes;
when P(Ci, Cj) > 0.3, V(Ci, Cj) < 0.12, and the distance between the centers of Ci and Cj is smaller than a preset multiple of the estimated visual scale of a human body in the image, fusing Ci and Cj into a class pair by a greedy algorithm.
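The fusion scores of claim 9, sketched with the class geometry abstracted away: `sim` is the trajectory similarity Q, and the centre-distance condition is reduced to a boolean `centers_close` flag (an illustrative simplification of "the centre distance is less than a preset multiple of the human body's visual scale"):

```python
def fuse_scores(Ci, Cj, sim):
    """P(Ci,Cj): mean of Q over all cross-class trajectory pairs;
    V(Ci,Cj): variance of the same values (the fusion reliability)."""
    q = [sim(Xi, Xj) for Xi in Ci for Xj in Cj]
    P = sum(q) / len(q)
    V = sum((x - P) ** 2 for x in q) / len(q)
    return P, V

def can_fuse(Ci, Cj, sim, centers_close, t_p=0.3, t_v=0.12):
    """Claim 9's fusion test: P > 0.3, V < 0.12, and class centres
    close enough (here precomputed into a flag)."""
    P, V = fuse_scores(Ci, Cj, sim)
    return P > t_p and V < t_v and centers_close

# toy classes whose every cross-class pair has similarity 0.5
const_sim = lambda a, b: 0.5
ok = can_fuse([1, 2], [3, 4], const_sim, centers_close=True)
```

A low variance V means the cross-class similarities agree with each other, which is why it serves as a reliability score rather than a distance.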
10. The method of claim 9, wherein judging whether a minimum spanning tree belongs to a moving human body based on the number of effective feature points it contains comprises:
when a minimum spanning tree contains at least 3 effective feature points that are not all collinear, the minimum spanning tree is considered a moving human body.
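The counting rule of claim 10 can be sketched as follows; each tree is represented simply by the list of its feature-point coordinates, and collinearity is tested with an exact cross-product check (integer coordinates assumed for the toy data):

```python
def count_people(trees):
    """Count one moving human body per spanning tree containing at
    least 3 feature points that are not all collinear."""
    def all_collinear(pts):
        (x0, y0), (x1, y1) = pts[0], pts[1]
        # cross product of (p1 - p0) and (p - p0) is zero iff collinear
        return all((x1 - x0) * (y - y0) == (y1 - y0) * (x - x0)
                   for x, y in pts[2:])
    return sum(1 for t in trees if len(t) >= 3 and not all_collinear(t))

# a genuine 2-D cluster, a degenerate (collinear) one, and a 2-point tree
trees = [[(0, 0), (1, 0), (0, 1)], [(0, 0), (1, 1), (2, 2)], [(0, 0), (1, 0)]]
```

The collinearity requirement filters out degenerate clusters, e.g. feature points strung along a shadow edge, that are unlikely to be a human body.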
CN201911275692.XA 2019-12-12 2019-12-12 Sports people counting method Pending CN111191524A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911275692.XA CN111191524A (en) 2019-12-12 2019-12-12 Sports people counting method

Publications (1)

Publication Number Publication Date
CN111191524A true CN111191524A (en) 2020-05-22

Family

ID=70707315

Country Status (1)

Country Link
CN (1) CN111191524A (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101976353A (en) * 2010-10-28 2011-02-16 北京智安邦科技有限公司 Statistical method and device of low density crowd

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
韩廷卯 (Han Tingmao): "A crowd counting method based on independent motion" *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022095818A1 (en) * 2020-11-04 2022-05-12 Huawei Technologies Co., Ltd. Methods and systems for crowd motion summarization via tracklet based human localization
US11348338B2 (en) 2020-11-04 2022-05-31 Huawei Technologies Co., Ltd. Methods and systems for crowd motion summarization via tracklet based human localization

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
Application publication date: 20200522