CN111160101B - Video personnel tracking and counting method based on artificial intelligence - Google Patents

Video personnel tracking and counting method based on artificial intelligence

Info

Publication number
CN111160101B
CN111160101B (application CN201911200873.6A)
Authority
CN
China
Prior art keywords
pedestrian
pedestrians
video
samples
matching
Prior art date
Legal status
Active
Application number
CN201911200873.6A
Other languages
Chinese (zh)
Other versions
CN111160101A (en)
Inventor
邹建红
高元荣
陈雯珊
王辉
陈哲
张兴
王宇奇
陈彬
陈凡千
孙建锋
Current Assignee
Fujian Nebula Big Data Application Service Co ltd
Original Assignee
Fujian Nebula Big Data Application Service Co ltd
Priority date
Filing date
Publication date
Application filed by Fujian Nebula Big Data Application Service Co., Ltd.
Priority to CN201911200873.6A
Publication of CN111160101A
Application granted
Publication of CN111160101B
Legal status: Active
Anticipated expiration


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/50 - Context or environment of the image
    • G06V20/52 - Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • G06V20/53 - Recognition of crowd images, e.g. recognition of crowd congestion
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G06F18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/20 - Image preprocessing
    • G06V10/26 - Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V10/267 - Segmentation of patterns in the image field by performing operations on regions, e.g. growing, shrinking or watersheds
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74 - Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/75 - Organisation of the matching processes, e.g. simultaneous or sequential comparisons of image or video features; Coarse-fine approaches, e.g. multi-scale approaches; using context analysis; Selection of dictionaries
    • G06V10/754 - Organisation of the matching processes involving a deformation of the sample pattern or of the reference pattern; Elastic matching

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an artificial-intelligence-based video person tracking and counting method. The method combines learned features extracted by a convolutional neural network with hand-crafted features obtained by geometric calculation, performs multi-target matching between frames of the video image sequence with a tracker whose network parameters can be updated online, and computes the person-count increment from the change of the inside/outside-of-door label of the same pedestrian in adjacent frames. A set of features learned from massive public video data sets with a sparse autoencoder serves as the filters of the convolutional neural network, which improves the efficiency of online updating of the network. Common person-occlusion patterns are also considered, and the counting errors caused by occlusion are compensated. The method is robust, runs in real time, achieves relatively high accuracy and strong occlusion resistance, is suitable for people counting on large-scale video data, and can be integrated into a video surveillance software system.

Description

Video personnel tracking and counting method based on artificial intelligence
[ technical field ]
The invention belongs to the technical field of intelligent video monitoring and analysis, and particularly relates to a video personnel tracking and counting method based on artificial intelligence.
[ background of the invention ]
Generally, there are two approaches to detecting and counting the number of people inside a building. The first accumulates and sums the numbers of people detected in the surveillance videos of every floor and every area of the building and takes the sum as the total number of people in the building. This approach requires full video-surveillance coverage of the building, and because the count from each camera carries a certain error, the summed error is large. The second subtracts the accumulated number of people detected leaving from the accumulated number detected entering in the surveillance videos of all building entrances and exits to obtain the total number of people in the building. This approach involves far fewer network cameras, its accumulated error is relatively small, and its feasibility is good.
In fact, the second approach, which analyzes the surveillance videos at the entrances and exits of public buildings in real time to produce passenger-flow statistics, is a technical solution that has received much attention and has gradually been put into use in recent years. It generally requires a network camera with a vertically downward, top-view angle installed above each entrance of the building to capture video of people entering and leaving; the heads of passers-by are then detected and counted by an intelligent front end or a back end to obtain the passenger flow. In many cases, however, building owners do not want to deploy an additional network video surveillance system dedicated to passenger-flow statistics; they prefer to add a video-analysis software module on top of the security surveillance system already deployed, since this both simplifies system deployment and avoids extra hardware cost.
However, to obtain a large monitoring range, the network cameras of a security video surveillance system are usually mounted near the ceiling and look obliquely down at the monitored area at a certain angle. In this scenario, people cannot be detected and counted simply by detecting their heads. In surveillance video captured from a vertically downward, top-view angle, head features are simple and consistent and mutual occlusion rarely occurs, so the video-analysis algorithm is relatively simple. In surveillance video observed from an oblique viewing angle, head features are complex and pedestrians are often occluded by, or overlap with, other people, which greatly increases the difficulty of video analysis.
[ summary of the invention ]
The technical problem the invention aims to solve is to provide an artificial-intelligence-based video person tracking and counting method that selects appropriate features and establishes a reasonable pedestrian-occlusion model to effectively improve the accuracy of person detection and counting, uses a pedestrian tracking-matching algorithm to achieve continuous and robust tracking, and meets the real-time, long-duration counting requirements of video surveillance.
The invention is realized by the following technical scheme:
a video personnel tracking and counting method based on artificial intelligence comprises the following steps:
step 1: initializing a video frame number n =1, and segmenting an nth frame video object to obtain a pedestrian connected domain set
P^(n) = {p_1^(n), ..., p_k^(n)}; calculate the feature vector v_j^(n) and the motion vector m_j^(n) of the jth pedestrian, and set its longest untracked-match count λ_j^(n);
the feature vector and the motion vector of a pedestrian are calculated as follows:
the feature vector of the jth pedestrian is v_j = (x_j, y_j, S_j), where (x_j, y_j) is the centroid coordinate of p_j and S_j is the area of p_j:
(centroid and area formula image in the source, computed from the binary silhouette f_j(x, y) over the bounding rectangle of p_j)
where y_h is the height of the surveillance video image, N_j and M_j are the numbers of pixels of the circumscribed rectangle of p_j in the length and width directions, and f_j(x, y) is the binary image of p_j:
(binary-image formula image in the source; f_j(x, y) equals 1 on pixels belonging to p_j and 0 elsewhere)
the motion vector of the jth pedestrian is m_j = (l_j, λ_j), where l_j = l(p_j) is the inside/outside-of-door label of the pedestrian, with l_j = 0 denoting the inside of the door (in the building) and l_j = 1 denoting the outside of the door (outside the building); λ_j = λ(p_j) is the longest untracked-match count of the jth pedestrian;
Step 2: dividing the (n + 1) th frame video object
P^(n+1) = {p_1^(n+1), ..., p_k^(n+1)}; calculate v_j^(n+1) and m_j^(n+1), j = 1, ..., k;
Step 3: search P^(n) for the pedestrians that tracking-match those in P^(n+1):
for each pedestrian p_i^(n+1) in P^(n+1), i = 1, ..., k, find its tracking-matched pedestrian in P^(n);
if the matching is successful, calculate the people-count increment in: (1) if the pedestrian moved from inside the building to outside the building, the increment is -1; (2) if the pedestrian moved from outside the building to inside the building, the increment is 1; (3) if the pedestrian stayed inside the building, the increment is 0; (4) if the pedestrian stayed outside the building, the increment is 0;
every successfully matched p_i has its longest untracked-match count λ_i reset to zero;
if the matching is successful, it is also necessary to check whether p_i^(n+1) satisfies the judgment condition of merged occlusion; if it does, the detected increment in must be compensated;
if the matching fails, it is necessary to judge whether p_i^(n+1) is a pedestrian that was occluded in the nth frame; if p_i^(n+1) satisfies the judgment condition of distributed occlusion, compensate in; otherwise, regard p_i^(n+1) as a pedestrian newly appearing in the monitored area and set λ_i = 0;
Step 4: check the pedestrians in P^(n) that were not successfully matched to P^(n+1), add them to P^(n+1), and increase their longest untracked-match count by 1; if such a pedestrian obtains a match in the (n + 2)th frame, it is judged that intermittent occlusion has occurred and in changes accordingly; otherwise the longest untracked-match count is increased by 1 again, and once it reaches the threshold the pedestrian is judged to have left the monitored area.
If the pedestrian satisfies the judgment condition of convergent occlusion, compensate in;
and 5: rejecting pedestrians and misdetected pedestrians who have left the monitored area, for P (n+1) Checking whether the longest untracked matching frequency of each pedestrian exceeds a threshold value;
if the pedestrian is larger than the threshold value, the pedestrian is considered to leave the monitoring area and should be abandoned;
otherwise, the pedestrian is considered to be temporarily shielded and should be reserved;
meanwhile, whether the area of the pedestrian exceeds the range is checked, if the area of the pedestrian is not within the range, the pedestrian is considered to be detected wrongly and should be discarded;
updating P (n+1)
Step 6: let n = n + 1 and return to step 2 until the analysis of the whole video image sequence is completed.
Further, the tracking matching in step 3 specifically includes the following steps:
step 31: initializing a video frame number n =1, tracker T (W);
step 32: handle
the centroid (x_i, y_i) of p_i^(1) by translating it to 16 new coordinate positions: with the centroid as the center, move it, in each of the 8 neighbourhood directions, to the pixel whose D_8 (chessboard) distance from the centroid equals d, where d = 5 and d = 10; together with p_i^(1), 17 samples of the ith class (all labeled i) are obtained;
(sample-coordinate table image in the source)
Step 33: form the sample set C^(1) from the obtained samples and train the tracker T(W), determining its parameters as W_1;
Step 34: detect the (n + 1)th frame to obtain P^(n+1), and set C^(n+1) = C^(n);
Step 35: input each p_j ∈ P^(n+1) into the tracker T(W_n) and obtain its output; compare the maximum output value o_m with an upper threshold σ_1 and a lower threshold σ_2 (σ_1 ≥ σ_2):
(1) If o_m is less than the lower threshold σ_2, p_j is considered a pedestrian newly appearing in the (n + 1)th frame and the tracking match fails. Translate the centroid of p_j to the 16 coordinate positions defined by the 8 neighbourhood directions and the D_8 distances d = 5, 10; together with p_j this gives 17 samples in total, which are added to C^(n+1) as a new class of samples;
(2) If o_m is greater than the upper threshold σ_1, p_m ∈ P^(n) and p_j ∈ P^(n+1) are considered highly matched;
(3) If o_m is greater than the lower threshold σ_2 but less than the upper threshold σ_1, p_m ∈ P^(n) and p_j ∈ P^(n+1) are considered matched. Translate the centroid of p_j to the 16 coordinate positions as above; together with p_j this gives 17 samples in total, which are added to the sample set labeled m. If the number of samples labeled m then exceeds the per-class sample-pool capacity V, the 17 samples labeled m that entered the pool first are removed;
Step 36: update the sample set and remove the samples of pedestrians that have left the monitored area or were falsely detected; the update covers 3 cases:
(1) For a newly appearing pedestrian, create a new pedestrian class.
(2) For a pedestrian whose appearance has changed, collect and add new samples. If, while adding samples, the number of samples exceeds the per-class sample-pool capacity V, the sample set is updated by a first-in-first-out rule and the samples that entered the pool earliest are replaced by the newly added ones. V = 34 was determined experimentally.
(3) For pedestrians that have left the monitored area or were falsely detected, remove the samples of the classes they belong to. After updating, the new sample set C^(n+1) is obtained;
Step 37: update the parameters of the tracker T(W): train T(W) with C^(n+1) and determine the parameters as W_{n+1}; when training T(W), the initial value of the network parameters is W_n.
Further, the tracker comprises a filter, a convolutional neural network, a discriminant classifier, and online parameter updating;
after moving-object segmentation of the nth frame image, a pedestrian set containing the moving targets is obtained; each pedestrian's rectangular region is resized to 50 × 110 and input into the convolutional neural network;
the convolutional neural network feeds the extracted features into the discriminant classifier, which outputs the tracking-result vector and gives the probability that each pedestrian in the current frame belongs to each class;
if the tracking result indicates a newly appearing pedestrian, a pedestrian whose appearance features have changed, a pedestrian that has left the monitored area, or a false detection, the sample set is updated, the hidden layer and the classifier are retrained, and new network parameters are determined; pedestrian tracking of the (n + 1)th frame then begins.
Furthermore, the filter in the tracker is a set of features pre-trained by a sparse autoencoder; it is obtained by training on a massive unsupervised auxiliary training set, so the filter has good generality and completeness; the feature pre-training is an offline process, and the trained features are not updated while the target-tracking algorithm is executed.
Further, the convolution kernels used by the convolutional neural network in the tracker form a filter composed of 100 pre-trained features of size 10 × 10.
Further, the mathematical model of the discriminant classifier in the tracker is the SoftMax function.
The invention has the following advantages: the method is robust, runs in real time, achieves relatively high accuracy and strong occlusion resistance, and can meet the requirement of long-duration, uninterrupted operation of video surveillance. It is suitable for people counting on large-scale video data and can be integrated into a video surveillance software system.
[ description of the drawings ]
The invention is further described below with reference to embodiments and the accompanying drawings.
FIG. 1 is a flow chart of a video people tracking and counting method based on artificial intelligence of the present invention.
FIG. 2 is a schematic view of the same side type occlusion of the present invention.
FIG. 3 is a schematic diagram of distributed occlusion according to the present invention.
FIG. 4 is a schematic diagram of the convergent occlusion of the present invention.
FIG. 5 is a schematic view of the intermittent occlusion of the present invention.
FIG. 6 is a schematic diagram of a merged occlusion of the present invention.
FIG. 7 is a table of occlusion mode determination and compensation according to the present invention.
FIG. 8 is a flow chart of the trace matching algorithm of the present invention.
FIG. 9 is a block diagram of a convolutional neural network-based tracker of the present invention.
Fig. 10 is a network structure diagram of the sparse autoencoder of the present invention.
FIG. 11 is a sparse self-encoder training result visualization diagram of the present invention.
[ detailed description ]
FIG. 1 shows the artificial-intelligence-based video person tracking and counting method of the invention. The method calculates the person-count increment from the change of the inside/outside-of-door label of the same pedestrian in adjacent frames, matches the multiple pedestrians of adjacent frames with a tracker based on convolutional-neural-network feature extraction and online parameter updating, and detects common occlusion patterns to compensate the person-count increment. The method comprises the following steps:
step 1: initializing a video frame number n =1, and segmenting an nth frame video object to obtain a pedestrian connected domain set
P^(n) = {p_1^(n), ..., p_k^(n)}; calculate the feature vector v_j^(n) and the motion vector m_j^(n) of the jth pedestrian, and set its longest untracked-match count λ_j^(n).
The calculation of the feature vector and the motion vector of a pedestrian is described below, with an illustrative code sketch after the definitions.
The feature vector of the jth pedestrian is v_j = (x_j, y_j, S_j), where (x_j, y_j) is the centroid coordinate of p_j and S_j is the area of p_j:
(centroid and area formula image in the source, computed from the binary silhouette f_j(x, y) over the bounding rectangle of p_j)
where y_h is the height of the surveillance video image, N_j and M_j are the numbers of pixels of the circumscribed rectangle of p_j in the length and width directions, and f_j(x, y) is the binary image of p_j:
(binary-image formula image in the source; f_j(x, y) equals 1 on pixels belonging to p_j and 0 elsewhere)
The motion vector of the jth pedestrian is m_j = (l_j, λ_j), where l_j = l(p_j) is the inside/outside-of-door label of the pedestrian, with l_j = 0 denoting the inside of the door (in the building) and l_j = 1 denoting the outside of the door (outside the building); λ_j = λ(p_j) is the longest untracked-match count of the jth pedestrian.
Step 2: dividing the (n + 1) th frame video object
P^(n+1) = {p_1^(n+1), ..., p_k^(n+1)}; calculate v_j^(n+1) and m_j^(n+1), j = 1, ..., k.
Step 3: search P^(n) for the pedestrians that tracking-match those in P^(n+1):
for each pedestrian p_i^(n+1) in P^(n+1), i = 1, ..., k, find its tracking-matched pedestrian in P^(n).
If the matching is successful, calculate the people-count increment in: (1) if the pedestrian moved from inside the building to outside the building, the increment is -1; (2) if the pedestrian moved from outside the building to inside the building, the increment is 1; (3) if the pedestrian stayed inside the building, the increment is 0; and (4) if the pedestrian stayed outside the building, the increment is 0.
In either case, every successfully matched p_i has its longest untracked-match count λ_i cleared. If the matching is successful, it is also necessary to check whether p_i^(n+1) satisfies the judgment condition of merged occlusion; if it does, the detected increment in is compensated.
If the matching fails, it is necessary to judge whether p_i^(n+1) is a pedestrian that was occluded in the nth frame. If p_i^(n+1) satisfies the judgment condition of distributed occlusion, compensate in; otherwise, regard p_i^(n+1) as a pedestrian newly appearing in the monitored area and set λ_i = 0.
Step 4: check the pedestrians in P^(n) that were not successfully matched to P^(n+1), add them to P^(n+1), and increase their longest untracked-match count by 1. If such a pedestrian obtains a match in the (n + 2)th frame, it is judged that intermittent occlusion has occurred and in changes accordingly; otherwise the longest untracked-match count is increased by 1 again, and once it reaches the threshold the pedestrian is judged to have left the monitored area.
If the pedestrian satisfies the judgment condition of convergent occlusion, compensate in.
Step 5: remove pedestrians that have left the monitored area and falsely detected pedestrians. For each pedestrian in P^(n+1), check whether its longest untracked-match count exceeds the threshold: if it does, the pedestrian is considered to have left the monitored area and is discarded; otherwise the pedestrian is considered temporarily occluded and is retained. At the same time, check whether the pedestrian's area is out of range; if the area is not within the valid range, the detection is considered false and the pedestrian is discarded. Then update P^(n+1).
And 6: let n = n +1 and jump to step 2 until the analysis of the entire sequence of video images is completed.
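To make the bookkeeping of Steps 3 to 6 concrete, the following Python sketch (an illustration, not the patented implementation) shows how the people-count increment in and the untracked-match counts could be updated for one frame; the dictionary layout and the single compensation term standing in for the occlusion formulas of FIG. 7 are assumptions.

```python
# Minimal sketch of the per-frame counting logic of Steps 3-6 (assumptions noted above).
def count_increment(prev_label: int, curr_label: int) -> int:
    """l = 0 means inside the building, l = 1 means outside (Step 3)."""
    if prev_label == 0 and curr_label == 1:
        return -1            # moved from inside to outside the building
    if prev_label == 1 and curr_label == 0:
        return 1             # moved from outside to inside the building
    return 0                 # stayed inside, or stayed outside

def update_count(total: int, matches, unmatched_prev, compensation: int = 0) -> int:
    """matches: (prev, curr) pedestrian dicts with an 'l' label and a 'lam' untracked-match count;
    unmatched_prev: pedestrians from P(n) with no match, carried into P(n+1) as possibly occluded."""
    for prev, curr in matches:
        total += count_increment(prev["l"], curr["l"])
        curr["lam"] = 0                       # Step 3: reset on a successful match
    for ped in unmatched_prev:
        ped["lam"] += 1                       # Step 4: one more frame without a match
    return total + compensation               # occlusion compensation per FIG. 7

# Example: one pedestrian walks in, one previously seen pedestrian is momentarily unmatched.
total = update_count(10, [({"l": 1}, {"l": 0, "lam": 3})], [{"l": 0, "lam": 0}])
print(total)   # 11
```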
FIGS. 2, 3, 4, 5 and 6 are schematic diagrams of same-side occlusion, distributed occlusion, convergent occlusion, intermittent occlusion and merged occlusion, respectively. FIG. 7 lists the judgment conditions and the person-count error-compensation formulas for these five common occlusion patterns.
The anti-occlusion design is as follows: a pedestrian that appeared in the previous frame but does not appear in the current frame is by default regarded as occluded; it is added to the pedestrian set of the current frame, its occlusion is recorded with the longest untracked-match count, and it still takes part in the matching of the next frame's pedestrian set. If such a pedestrian is detected again within the next few frames, its longest untracked-match count is reset; otherwise it is considered not occluded but to have left the monitored area. Thus, if the longest untracked-match count λ_i of a pedestrian p_i exceeds a threshold λ_0, the pedestrian is considered to have left the monitored area (including walking into the building interior from the inner side of the door and walking away from the outer side of the door); if λ_i does not exceed λ_0 and is nonzero, the pedestrian is considered occluded in the nth frame. The relationship between λ_i and the state of p_i is: if λ_i = 0, p_i is inside the monitored area and successfully detected; if 0 < λ_i < λ_0, p_i is inside the monitored area but occluded; if λ_i ≥ λ_0, p_i has left the monitored area.
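The following Python sketch restates this three-way state rule; it is illustrative only, and the concrete value of the threshold λ_0 (set in Table 1) is not reproduced here.

```python
# Sketch of the lambda-based state rule described above (lambda_0 is an assumed parameter).
from enum import Enum

class PedestrianState(Enum):
    DETECTED = "inside the monitored area and successfully detected"
    OCCLUDED = "inside the monitored area but occluded"
    LEFT = "has left the monitored area"

def classify(lam: int, lam_0: int) -> PedestrianState:
    """Map the longest untracked-match count lambda_i to the pedestrian state."""
    if lam == 0:
        return PedestrianState.DETECTED
    if lam < lam_0:
        return PedestrianState.OCCLUDED
    return PedestrianState.LEFT

assert classify(0, 4) is PedestrianState.DETECTED
assert classify(2, 4) is PedestrianState.OCCLUDED
assert classify(4, 4) is PedestrianState.LEFT
```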
FIG. 8 is a flow chart of the tracking-matching algorithm used in Step 3 of the method of the invention. How the individual steps of the tracking-matching algorithm are implemented is detailed below:
step 31: initializing video frame number n =1, tracker T (W).
Step 32: handle
the centroid (x_i, y_i) of p_i^(1) by translating it to 16 new coordinate positions: with the centroid as the center, move it, in each of the 8 neighbourhood directions, to the pixel whose D_8 (chessboard) distance from the centroid equals d, where d = 5 and d = 10. Together with p_i^(1), a total of 17 samples of the ith class (all labeled i) are obtained, as illustrated by the sketch below.
(sample-coordinate table image in the source)
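A minimal Python sketch of this sample generation follows; it only computes the 17 centroid positions, and cropping the corresponding image patches is left out. The function and variable names are illustrative assumptions.

```python
# Sketch of the Step 32 sample generation: the centroid is shifted in the 8 neighbourhood
# directions to points at chessboard (D_8) distance d = 5 and d = 10, giving 16 shifted
# positions plus the original one, i.e. 17 samples per pedestrian class.
from itertools import product

DIRECTIONS = [(dx, dy) for dx, dy in product((-1, 0, 1), repeat=2) if (dx, dy) != (0, 0)]

def sample_centroids(x: float, y: float, distances=(5, 10)):
    """Return the 17 centroid positions (original first) used to build one class of samples."""
    points = [(x, y)]
    for d in distances:
        for dx, dy in DIRECTIONS:
            points.append((x + dx * d, y + dy * d))   # D_8 distance from (x, y) is exactly d
    return points

centroids = sample_centroids(120.0, 80.0)
assert len(centroids) == 17
```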
Step 33: form the sample set C^(1) from the obtained samples and train the tracker T(W), determining its parameters as W_1.
Step 34: detect the (n + 1)th frame to obtain P^(n+1), and set C^(n+1) = C^(n).
Step 35: input each p_j ∈ P^(n+1) into the tracker T(W_n) and obtain its output. Compare the maximum output value o_m with an upper threshold σ_1 and a lower threshold σ_2 (σ_1 ≥ σ_2); a sketch of this decision rule follows the three cases below:
(1) If o_m is less than the lower threshold σ_2, p_j is considered a pedestrian newly appearing in the (n + 1)th frame and the tracking match fails. Translate the centroid of p_j to the 16 coordinate positions defined by the 8 neighbourhood directions and the D_8 distances d = 5, 10; together with p_j this gives 17 samples in total, which are added to C^(n+1) as a new class of samples.
(2) If o_m is greater than the upper threshold σ_1, p_m ∈ P^(n) and p_j ∈ P^(n+1) are considered highly matched.
(3) If o_m is greater than the lower threshold σ_2 but less than the upper threshold σ_1, p_m ∈ P^(n) and p_j ∈ P^(n+1) are considered matched. Translate the centroid of p_j to the 16 coordinate positions as above; together with p_j a total of 17 samples is obtained and added to the sample set labeled m. If the number of samples labeled m then exceeds the per-class sample-pool capacity V, the 17 samples labeled m that entered the pool first are removed.
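The sketch below expresses the Step 35 decision rule in Python; the concrete threshold values are set in Table 1 of the original and are replaced here by assumed numbers.

```python
# Sketch of the Step 35 matching decision with upper/lower thresholds sigma_1 >= sigma_2.
def match_decision(outputs, sigma_1: float = 0.8, sigma_2: float = 0.5):
    """outputs: classifier probabilities of p_j over the existing pedestrian classes.
    Returns ('new', None) to start a new class, or ('high'/'weak', m) for a match with class m."""
    m = max(range(len(outputs)), key=outputs.__getitem__)   # arg max over classes
    o_m = outputs[m]
    if o_m < sigma_2:
        return "new", None     # tracking match failed: newly appearing pedestrian
    if o_m > sigma_1:
        return "high", m       # p_m in P(n) and p_j in P(n+1) are highly matched
    return "weak", m           # matched, but fresh samples of class m should be collected

print(match_decision([0.10, 0.75, 0.15]))   # ('weak', 1) with the assumed thresholds
```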
Step 36: update the sample set, removing the samples of pedestrians that have left the monitored area or were falsely detected. The update covers 3 cases (a sketch of the per-class sample pool follows this list):
(1) For a newly appearing pedestrian, create a new pedestrian class.
(2) For a pedestrian whose appearance has changed, collect and add new samples. If, while adding samples, the number of samples exceeds the per-class sample-pool capacity V, the sample set is updated by a first-in-first-out rule and the samples that entered the pool earliest are replaced by the newly added ones. V = 34 was determined experimentally.
(3) For pedestrians that have left the monitored area or were falsely detected, remove the samples of the classes they belong to.
After updating, the new sample set C^(n+1) is obtained.
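A per-class sample pool with the first-in-first-out rule can be sketched in Python as follows; the deque-based layout and the string labels are illustrative assumptions, and only the capacity V = 34 comes from the text above.

```python
# Sketch of the Step 36 per-class FIFO sample pool with capacity V = 34.
from collections import defaultdict, deque

V = 34  # per-class sample-pool capacity determined experimentally

class SamplePool:
    def __init__(self, capacity: int = V):
        # deque(maxlen=...) evicts the oldest samples automatically (first in, first out)
        self.pools = defaultdict(lambda: deque(maxlen=capacity))

    def add(self, label, samples):
        self.pools[label].extend(samples)     # oldest entries are dropped once the pool is full

    def remove_class(self, label):
        self.pools.pop(label, None)           # pedestrian left the area or was a false detection

pool = SamplePool()
pool.add("pedestrian_3", [f"patch_{i}" for i in range(40)])
assert len(pool.pools["pedestrian_3"]) == V   # only the 34 most recent samples are kept
```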
Step 37: update the parameters of the tracker T(W). Train T(W) with C^(n+1) and determine the parameters as W_{n+1}; when training T(W), the initial value of the network parameters is W_n.
FIG. 9 shows the structure of the tracker used in the tracking-matching algorithm. The tracker T(W) mainly comprises a filter, a convolutional neural network, a discriminant classifier, and an online parameter-update module.
After moving-object segmentation of the nth frame image, a pedestrian set containing the moving targets is obtained; each pedestrian's rectangular region is resized to 50 × 110 and input into the convolutional neural network. The convolutional neural network feeds the extracted features into the discriminant classifier, which outputs the tracking-result vector giving the probability that each pedestrian in the current frame belongs to each class. If the tracking result indicates a newly appearing pedestrian, a pedestrian whose appearance features have changed, a pedestrian that has left the monitored area, or a false detection, the sample set is updated, the hidden layer and the classifier are retrained, and new network parameters are determined; pedestrian tracking of the (n + 1)th frame then begins.
The design and training methods of the filter, the convolutional neural network, the discriminant classifier, etc. are described below.
1. Filter
The filter is a set of features pre-trained by a sparse autoencoder and used as convolution kernels. It is obtained by training on a massive unsupervised auxiliary training set, so the feature set has good generality and completeness. Feature pre-training is an offline process, and the trained features are not updated while the target-tracking algorithm is executed. FIG. 10 shows the network structure of the sparse autoencoder. L_1 is the input layer; for a 10 × 10 image the input is x = [x_1, x_2, ..., x_100]. L_2 is the hidden layer, containing 100 hidden neurons. L_3 is the output layer, which outputs h_{W,b}(x). Let W_ij^(l) be the connection weight between the jth unit of layer l and the ith unit of layer l + 1, and b_i^(l) the bias term of the ith unit of layer l + 1. The parameters of the sparse autoencoder are (W, b) = (W^(1), b^(1), W^(2), b^(2)), where W^(l) (l = 1, 2) is the 100 × 100 matrix with elements W_ij^(l), and b^(l) (l = 1, 2) is the 100-dimensional vector with elements b_i^(l).
The training process of the sparse autoencoder is as follows: (1) set the gradients of the weights and bias terms to 0, and use random values drawn from the normal distribution N(0, 0.01²) as the initial values of the network parameters (W, b); (2) compute the partial derivatives: calculate and accumulate them with the back-propagation algorithm; (3) update the weight parameters; (4) repeat steps (1)-(3) until convergence.
One million pictures are randomly selected from the public Tiny Images Dataset, which contains a large number of pictures of real-life objects, pedestrians, backgrounds and the like, as auxiliary unsupervised training data, and the parameters (W, b) are calculated and determined from them. A minimal training sketch is given below.
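The following NumPy sketch shows, under simplifying assumptions, how such an autoencoder can be trained with the N(0, 0.01²) initialisation and back-propagation described above; the sparsity penalty, the learning rate and the random-patch stand-in for the Tiny Images data are all assumptions of the sketch.

```python
# Minimal sketch of the 100-100-100 sparse autoencoder described above (sparsity penalty omitted).
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid = 100, 100                      # 10x10 patches, 100 hidden neurons

# (1) initialise parameters (W, b) from N(0, 0.01^2)
W1 = rng.normal(0.0, 0.01, (n_hid, n_in));  b1 = np.zeros(n_hid)
W2 = rng.normal(0.0, 0.01, (n_in, n_hid));  b2 = np.zeros(n_in)

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def step(x, lr=0.1):
    """One back-propagation step on a batch x of shape (batch, 100)."""
    global W1, b1, W2, b2
    a1 = sigmoid(x @ W1.T + b1)             # hidden activations (layer L_2)
    a2 = sigmoid(a1 @ W2.T + b2)            # reconstruction h_{W,b}(x) (layer L_3)
    # (2) partial derivatives of the squared reconstruction error via backprop
    d2 = (a2 - x) * a2 * (1 - a2)
    d1 = (d2 @ W2) * a1 * (1 - a1)
    # (3) gradient-descent update of the weights and biases
    W2 -= lr * d2.T @ a1 / len(x);  b2 -= lr * d2.mean(0)
    W1 -= lr * d1.T @ x / len(x);   b1 -= lr * d1.mean(0)
    return float(np.mean((a2 - x) ** 2))

patches = rng.random((256, n_in))           # stand-in for normalised 10x10 image patches
for epoch in range(5):                      # (4) repeat until convergence
    loss = step(patches)
print("reconstruction MSE:", loss)
```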
If the input g (a 100-dimensional vector) satisfies the norm constraint ||g||² = g_1² + g_2² + ... + g_100² ≤ 1, then the input that maximally excites the ith unit of the hidden layer has components g_j^(i) = W_ij^(1) / sqrt( Σ_{j=1}^{100} (W_ij^(1))² ), j = 1, ..., 100.
Setting each unit i (i = 1, 2, ..., 100) of the hidden layer in turn to its maximum excitation and computing the corresponding g^(i) yields 100 input images of size 10 × 10, as shown in FIG. 11. These 100 images can be regarded as the "bases" of the training sample set, and any given image sample can be approximately represented by a combination of these bases. Using these bases as the convolution kernels of the convolutional neural network allows the features of an input picture to be extracted effectively.
2. Convolutional neural network
The convolution kernels form a filter consisting of 100 pre-trained features of size 10 × 10. The filter extracts features of the input image. The stride of the filter is set to 5, and each filter convolves the input image to produce a feature map of size 9 × 21. Average pooling is then performed over each 3 × 3 region of the feature map, yielding a feature map of size 3 × 7. All 2100 nodes of the pooled feature maps are input into a neural network (the hidden layer) containing 350 nodes, which reduces the dimensionality while extracting higher-level features for the classifier to judge. A sketch of the resulting network is given below.
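The layer sizes above can be checked with the following PyTorch sketch; loading the sparse-autoencoder filters into the convolution weights, the sigmoid activations and the number of pedestrian classes are assumptions of the sketch rather than details given in the text.

```python
# Sketch of the tracker's network: a 50x110 pedestrian patch, 100 fixed 10x10 filters with
# stride 5 (9x21 feature maps), 3x3 average pooling (3x7), a 350-unit hidden layer, SoftMax output.
import torch
import torch.nn as nn

class TrackerNet(nn.Module):
    def __init__(self, n_classes: int):
        super().__init__()
        self.conv = nn.Conv2d(1, 100, kernel_size=10, stride=5)  # pre-trained filters (assumed loaded)
        self.conv.weight.requires_grad_(False)                   # filters are not updated online
        self.pool = nn.AvgPool2d(3)                              # average pooling over 3x3 regions
        self.hidden = nn.Linear(100 * 7 * 3, 350)                # 2100 -> 350 hidden units
        self.classifier = nn.Linear(350, n_classes)              # SoftMax discriminant classifier

    def forward(self, x):                                        # x: (batch, 1, 110, 50)
        f = self.pool(torch.sigmoid(self.conv(x)))
        h = torch.sigmoid(self.hidden(f.flatten(1)))
        return torch.softmax(self.classifier(h), dim=1)          # class probabilities

probs = TrackerNet(n_classes=8)(torch.rand(1, 1, 110, 50))
print(probs.shape)   # torch.Size([1, 8])
```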
3. Discriminant classifier
The mathematical model of the discriminant classifier is the SoftMax function. The minimum value of the cost function of the SoftMax regression algorithm can be solved by a gradient descent method, and a unique optimal solution is obtained.
4. Training of hidden layer and discriminant classifier cascade network
When the parameters need to be updated, the tracker is retrained. The filter parameters are not updated; only the parameters of the hidden layer and the discriminant classifier are. The network formed by cascading the hidden layer and the discriminant classifier is trained as a whole by gradient descent. The training algorithm is: (1) perform a feed-forward pass, computing the feature maps after convolution and pooling, the hidden-layer weighted sums, the activation vector, and the classification probability vector; (2) compute the residuals; (3) compute the partial derivatives; (4) update the parameters; (5) repeat steps (1)-(4) until convergence.
The relevant parameter settings of the method of the invention are shown in Table 1.
TABLE 1 Parameter settings
(table image in the source; values not reproduced)
The method of the invention was compared with the IVT (Incremental Visual Tracking), SCM (Sparse Collaborative Model) and MIL (Multiple Instance Learning) methods on a self-built data set of building-entrance surveillance videos; the performance is shown in Table 2. The results show that the tracking accuracy of the method is close to that of the other algorithms, its execution efficiency is slightly higher, and it is better than the other algorithms in robustness and counting accuracy.
TABLE 2 Performance comparison of people-count increment detection methods based on motion tracking
(table image in the source; values not reproduced)
The method of the invention designs occlusion-pattern detection and compensation for the common occlusion patterns and therefore has strong occlusion resistance. The convolution filter is trained offline in advance, and the simple convolutional-neural-network structure, hidden layer and regression layer only need to be retrained when the parameters are updated online, so the method can meet the requirement of long-duration, uninterrupted operation of video surveillance. The method is robust, runs in real time and achieves relatively high accuracy; it is suitable for people counting on large-scale video data and can be integrated into a video surveillance software system.
The above description is only an example of the preferred embodiment of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (5)

1. A video personnel tracking and counting method based on artificial intelligence is characterized in that: the method comprises the following steps:
step 1: initializing a video frame number n =1, and segmenting an nth frame video object to obtain a pedestrian connected domain set
P^(n) = {p_1^(n), ..., p_k^(n)}; calculating the feature vector v_j^(n) and the motion vector m_j^(n) of the jth pedestrian, and setting its longest untracked-match count λ_j^(n);
the feature vector and the motion vector of a pedestrian are calculated as follows:
the feature vector of the jth pedestrian is v_j = (x_j, y_j, S_j), where (x_j, y_j) is the centroid coordinate of p_j and S_j is the area of p_j:
(centroid and area formula image in the source, computed from the binary silhouette f_j(x, y) over the bounding rectangle of p_j)
where y_h is the height of the surveillance video image, N_j and M_j are the numbers of pixels of the circumscribed rectangle of p_j in the length and width directions, and f_j(x, y) is the binary image of p_j:
(binary-image formula image in the source; f_j(x, y) equals 1 on pixels belonging to p_j and 0 elsewhere)
the motion vector of the jth pedestrian is m_j = (l_j, λ_j), where l_j = l(p_j) is the inside/outside-of-door label of the pedestrian, with l_j = 0 denoting the inside of the door, i.e. in the building, and l_j = 1 denoting the outside of the door, i.e. outside the building; λ_j = λ(p_j) is the longest untracked-match count of the jth pedestrian;
Step 2: dividing the (n + 1) th frame video object
P^(n+1) = {p_1^(n+1), ..., p_k^(n+1)}; calculating v_j^(n+1) and m_j^(n+1), j = 1, ..., k;
step 3: searching P^(n) for the pedestrians that tracking-match those in P^(n+1):
for each pedestrian p_i^(n+1) in P^(n+1), i = 1, ..., k, finding its tracking-matched pedestrian in P^(n);
if the matching is successful, calculating the people-count increment in: (1) if the pedestrian moved from inside the building to outside the building, the increment is -1; (2) if the pedestrian moved from outside the building to inside the building, the increment is 1; (3) if the pedestrian stayed inside the building, the increment is 0; (4) if the pedestrian stayed outside the building, the increment is 0;
every successfully matched p_i has its longest untracked-match count λ_i reset;
if the matching is successful, it is also necessary to check whether p_i^(n+1) satisfies the judgment condition of merged occlusion; if it does, the detected increment in must be compensated;
if the matching fails, it is necessary to judge whether p_i^(n+1) is a pedestrian that was occluded in the nth frame; if p_i^(n+1) satisfies the judgment condition of distributed occlusion, compensating in; otherwise, regarding p_i^(n+1) as a pedestrian newly appearing in the monitored area and setting λ_i = 0;
step 4: checking the pedestrians in P^(n) that were not successfully matched to P^(n+1), adding them to P^(n+1), and increasing their longest untracked-match count by 1; if such a pedestrian obtains a match in the (n + 2)th frame, judging that intermittent occlusion has occurred and changing in accordingly; otherwise increasing the longest untracked-match count by 1 again, and once it reaches the threshold, judging that the pedestrian has left the monitored area;
if the pedestrian satisfies the judgment condition of convergent occlusion, compensating in;
and 5: rejecting pedestrians who have left the monitored area and misdetected pedestrians, for P (n+1) Checking whether the longest untracked matching frequency of each pedestrian exceeds a threshold value;
if the pedestrian is larger than the threshold value, the pedestrian is considered to leave the monitoring area and should be abandoned;
otherwise, the pedestrian is considered to be temporarily shielded and should be reserved;
meanwhile, whether the area of the pedestrian exceeds the range is checked, if the area of the pedestrian is not within the range, the pedestrian is considered to be detected wrongly and should be discarded;
updating P (n+1)
Step 6: letting n = n +1, and skipping to step 2 until the analysis of the whole video image sequence is completed;
the tracking matching in the step 3 specifically comprises the following steps:
step 31: initializing a video frame number n =1, tracker T (W);
step 32: handle
the centroid (x_i, y_i) of p_i^(1) by translating it to 16 new coordinate positions: with the centroid as the center, moving it, in each of the 8 neighbourhood directions, to the pixel whose D_8 (chessboard) distance from the centroid equals d, where d = 5 and d = 10; together with p_i^(1), 17 samples of the ith class, all labeled i, are obtained;
(sample-coordinate table image in the source)
step 33: forming the sample set C^(1) from the obtained samples and training the tracker T(W), determining its parameters as W_1;
step 34: detecting the (n + 1)th frame to obtain P^(n+1), and setting C^(n+1) = C^(n);
step 35: inputting each p_j ∈ P^(n+1) into the tracker T(W_n) and obtaining its output; comparing the maximum output value o_m with an upper threshold σ_1 and a lower threshold σ_2, where σ_1 ≥ σ_2:
(1) if o_m is less than the lower threshold σ_2, p_j is considered a pedestrian newly appearing in the (n + 1)th frame and the tracking match fails; translating the centroid of p_j to the 16 coordinate positions defined by the 8 neighbourhood directions and the D_8 distances d = 5, 10; together with p_j, 17 samples in total are obtained and added to C^(n+1) as a new class of samples;
(2) if o_m is greater than the upper threshold σ_1, p_m ∈ P^(n) and p_j ∈ P^(n+1) are considered highly matched;
(3) if o_m is greater than the lower threshold σ_2 but less than the upper threshold σ_1, p_m ∈ P^(n) and p_j ∈ P^(n+1) are considered matched; translating the centroid of p_j to the 16 coordinate positions as above; together with p_j, 17 samples in total are obtained and added to the sample set labeled m; if the number of samples labeled m then exceeds the per-class sample-pool capacity V, the 17 samples labeled m that entered the pool first are removed;
step 36: updating the sample set and removing the samples of pedestrians that have left the monitored area or were falsely detected, the update covering 3 cases:
(1) for a newly appearing pedestrian, creating a new pedestrian class;
(2) for a pedestrian whose appearance features have changed, collecting and adding new samples; if, while adding samples, the number of samples exceeds the per-class sample-pool capacity V, updating the sample set by a first-in-first-out rule, i.e. replacing the samples that entered the pool earliest with the newly added ones, V = 34 being determined experimentally;
(3) for pedestrians that have left the monitored area or were falsely detected, removing the samples of the classes they belong to;
after updating, the new sample set C^(n+1) is obtained;
step 37: updating the parameters of the tracker T(W): training T(W) with C^(n+1) and determining the parameters as W_{n+1}; when training T(W), the initial value of the network parameters is W_n.
2. The artificial intelligence based video personnel tracking and counting method according to claim 1, wherein the tracker comprises: a filter, a convolutional neural network, a discriminant classifier, and online parameter updating;
after moving-object segmentation of the nth frame image, a pedestrian set containing the moving targets is obtained; each pedestrian's rectangular region is resized to 50 × 110 and input into the convolutional neural network;
the convolutional neural network feeds the extracted features into the discriminant classifier, which outputs the tracking-result vector and gives the probability that each pedestrian in the current frame belongs to each class;
if the tracking result indicates a newly appearing pedestrian, a pedestrian whose appearance features have changed, a pedestrian that has left the monitored area, or a false detection, the sample set is updated, the hidden layer and the classifier are retrained, and new network parameters are determined; pedestrian tracking of the (n + 1)th frame then begins.
3. The artificial intelligence based video personnel tracking and counting method according to claim 2, characterized in that: the filter in the tracker is a set of features pre-trained by a sparse autoencoder, obtained by training on a massive unsupervised auxiliary training set, and has good generality and completeness.
4. The artificial intelligence based video personnel tracking and counting method according to claim 3, wherein: the convolution kernels used by the convolutional neural network in the tracker form a filter composed of 100 pre-trained features of size 10 × 10.
5. The artificial intelligence based video personnel tracking and counting method according to claim 2, characterized in that: and a mathematical model of a discriminant classifier in the tracker adopts a SoftMax function.
CN201911200873.6A 2019-11-29 2019-11-29 Video personnel tracking and counting method based on artificial intelligence Active CN111160101B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911200873.6A CN111160101B (en) 2019-11-29 2019-11-29 Video personnel tracking and counting method based on artificial intelligence

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911200873.6A CN111160101B (en) 2019-11-29 2019-11-29 Video personnel tracking and counting method based on artificial intelligence

Publications (2)

Publication Number Publication Date
CN111160101A CN111160101A (en) 2020-05-15
CN111160101B (en) 2023-04-18

Family

ID=70556257

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911200873.6A Active CN111160101B (en) 2019-11-29 2019-11-29 Video personnel tracking and counting method based on artificial intelligence

Country Status (1)

Country Link
CN (1) CN111160101B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112906590A (en) * 2021-03-02 2021-06-04 东北农业大学 FairMOT-based multi-target tracking pedestrian flow monitoring method

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2013189464A2 (en) * 2012-11-28 2013-12-27 中兴通讯股份有限公司 Pedestrian tracking and counting method and device for near-front top-view monitoring video
CN104112282A (en) * 2014-07-14 2014-10-22 华中科技大学 A method for tracking a plurality of moving objects in a monitor video based on on-line study
CN105224912A (en) * 2015-08-31 2016-01-06 电子科技大学 Based on the video pedestrian detection and tracking method of movable information and Track association
CN105989615A (en) * 2015-03-04 2016-10-05 江苏慧眼数据科技股份有限公司 Pedestrian tracking method based on multi-feature fusion
CN109146921A (en) * 2018-07-02 2019-01-04 华中科技大学 A kind of pedestrian target tracking based on deep learning


Also Published As

Publication number Publication date
CN111160101A (en) 2020-05-15


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant