CN114067240A - Pedestrian single-target tracking method based on online updating strategy and fusing pedestrian characteristics - Google Patents

Pedestrian single-target tracking method based on online updating strategy and fusing pedestrian characteristics

Info

Publication number
CN114067240A
CN114067240A
Authority
CN
China
Prior art keywords
target
frame
test
response
pedestrian
Prior art date
Legal status
Pending
Application number
CN202111294661.6A
Other languages
Chinese (zh)
Inventor
薛彦兵
丁明远
袁立明
蔡靖
温显斌
Current Assignee
Tianjin University of Technology
Original Assignee
Tianjin University of Technology
Priority date
Filing date
Publication date
Application filed by Tianjin University of Technology
Priority to CN202111294661.6A
Publication of CN114067240A


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/20 Analysis of motion
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/70 Determining position or orientation of objects or cameras
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20081 Training; Learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

A pedestrian single-target tracking method with an online updating strategy that fuses pedestrian characteristics. Given only the target state in an initial frame, the pedestrian tracking problem in subsequent frames is decomposed into a classification task and a regression task: the classification task classifies image regions into foreground and background with a classification filter so as to predict the rough position of the target in the image, and the regression task estimates the target state, usually represented by a bounding box, from the rough position obtained by the classification task and a set of candidate bounding boxes. The rough target position is corrected by combining characteristics inherent to pedestrian motion, different tracking states are defined according to the complexity of the current scene, and different online updating strategies are applied to the classification filter in each state to strengthen the discriminative ability of the classifier. The method thereby improves pedestrian single-target tracking performance, achieves a higher tracking success rate, and has practical value.

Description

Pedestrian single-target tracking method based on online updating strategy and fusing pedestrian characteristics
[ technical field ]
The invention relates to the fields of pattern recognition, image processing, computer vision and the like, in particular to an online updating strategy pedestrian single-target tracking method integrating pedestrian characteristics.
[ background of the invention ]
The visual tracking technology is an important subject in the field of computer vision (a branch of artificial intelligence), has important research significance, and has attracted much attention in recent years. Pedestrian target tracking is the key to intelligent applications such as pedestrian behavior analysis.
In the prior art, methods for tracking a single pedestrian target are mainly divided into two types:
1. Detect the video frame by frame with a pedestrian recognition algorithm and then link the detected pedestrian target boxes into a target trajectory. However, detection quality and detection time trade off against each other: the more complex the detection algorithm, the better the system extracts image features and the better its detection results, but the deeper the network, the more parameters it has and the longer its detection time, so real-time pedestrian target tracking becomes impossible. Conversely, the weaker the representational capacity of the network, the lower the detection accuracy and the more easily the pedestrian is lost, so the algorithm is hard to apply to real scenes; applying this method in practice therefore requires increasing the computing power and configuration of the system. The advantage of this scheme is that deep semantic features of the pedestrian target can be extracted and the recognition ability is strong. Its disadvantage is that detection is performed frame by frame without using video context information, so the detection frame rate of the system cannot be raised, and during real-time video target detection the video may exhibit motion blur and similar conditions, which can cause local tracking failures and reduce the tracking efficiency of the system.
2. Use a tracking algorithm, with the pedestrian target either framed manually in the first frame or detected by a recognition algorithm and then tracked by the tracking algorithm; this scheme can track the target over a short time. Its advantage is that the tracker has a simple structure and can run in real time. Its disadvantage is that during tracking the pedestrian target is prone to deformation, occlusion or illumination change, the target is easily lost, and once tracking fails the target cannot be found again, so the algorithm fails.
Therefore, focusing on pedestrians, single targets and short-term tracking, the present application provides a solution to the problems existing trackers face in pedestrian single-target tracking. Given only the target state in an initial frame, the tracking problem is decomposed, on the basis of a tracking algorithm, into a classification task and a regression task: the classification task robustly provides the rough position of the target in the image by classifying image regions into foreground and background, and the regression task estimates the target state, usually represented by a bounding box, so that the tracker can track the single pedestrian target in subsequent frames of the video sequence.
[ summary of the invention ]
The invention aims to provide a pedestrian single-target tracking method with an online updating strategy that fuses pedestrian characteristics, which overcomes the shortcomings of the prior art, achieves high single-target pedestrian tracking accuracy at real-time speed, and has practical value.
The technical scheme of the invention is as follows: an online updating strategy pedestrian single-target tracking method fused with pedestrian characteristics is characterized by comprising the following steps:
step 1: classification tasks in the tracking process:
(1.1) Extract the features of the reference frame and of each frame of the video sequence through a feature extraction network, namely: select the first frame of the video sequence as the reference frame, select the target to be tracked by manually marking its target bounding box, and then recognize and detect this selected target of the reference frame in each subsequent frame of the video sequence, i.e. each test frame, so that the target to be tracked is tracked throughout the video sequence and pedestrian single-target tracking is realized;
The feature extraction network is built on ResNet50. ResNet50 consists of 4 residual blocks (Residual Block) connected in series, named Block1, Block2, Block3 and Block4 and containing 50 convolution operations in total; this is known technology. Two convolutional layers are connected after ResNet50 to form the backbone network for feature extraction, which extracts the features of the current reference frame or test frame image.
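A minimal sketch of such a backbone, assuming PyTorch and torchvision's ResNet50; the layer names, channel sizes and the two extra convolutional layers are illustrative assumptions, since the text only specifies "ResNet50 plus two convolutional layers".

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

class Backbone(nn.Module):
    def __init__(self):
        super().__init__()
        trunk = resnet50(weights=None)
        # Keep everything up to Block4 (layer4); drop the classification head.
        self.trunk = nn.Sequential(*list(trunk.children())[:-2])
        # Two additional convolutional layers for the tracking features (assumed sizes).
        self.extra = nn.Sequential(
            nn.Conv2d(2048, 512, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(512, 512, kernel_size=3, padding=1),
        )

    def forward(self, x):
        return self.extra(self.trunk(x))

# Usage: features of a reference or test frame crop.
feat = Backbone()(torch.randn(1, 3, 288, 288))  # spatial size depends on the input crop
```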
(1.2) Using the features extracted in step (1.1) from the selected reference frame and its manually marked target bounding box, train a classification filter online through a model predictor; the filter distinguishes the target foreground from the background in subsequent frames and predicts the target position;
The model predictor consists of an initializer module and an optimizer module. The initializer module provides an effective initial estimate of the classification filter using only the appearance of the target to be tracked. The optimizer module optimizes the initially estimated classification filter init_filter to finally obtain the optimized classification filter new_filter, which performs foreground/background classification on subsequent frames of the tracked video sequence and predicts the center position of the target to be tracked for rough localization.
The classification filter new_filter is obtained from the reference frame information through online training of the model predictor, specifically by the following steps:
(1.2.1) Using the reference frame information of the currently tracked video sequence, including the reference frame image and the manually specified target annotation bounding box, apply flipping, mirroring, blurring and rotation data-enhancement operations to the reference frame; collect the resulting images into an image set that has undergone data enhancement, and obtain the correspondingly transformed annotation bounding boxes at the same time;
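A minimal sketch of the data enhancement in step (1.2.1), assuming PIL images and torchvision transforms; the blur kernel, rotation angle and other parameters are not specified in the text and are chosen here only for illustration.

```python
from torchvision import transforms

def augment_reference(ref_img):
    """Return the set of augmented reference-frame images (flip, mirror, blur, rotate)."""
    ops = {
        "flip": transforms.RandomVerticalFlip(p=1.0),
        "mirror": transforms.RandomHorizontalFlip(p=1.0),
        "blur": transforms.GaussianBlur(kernel_size=5, sigma=2.0),
        "rotate": transforms.RandomRotation(degrees=(15, 15)),
    }
    return {name: op(ref_img) for name, op in ops.items()}

# The annotation bounding box must be transformed with the same geometric operations
# (flip/mirror/rotate) so that each augmented image keeps a consistent label.
```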
(1.2.2) Use the feature extraction network of step (1.1) to extract features from the image set processed in step (1.2.1), obtaining a group of training sample features, and send them to the model initializer module for the initial estimate of the classification filter, giving the initially estimated classification filter init_filter, as shown in fig. 2. The initializer module performs this initial estimate by applying a PrRoI Pooling operation to the training sample features and extracting the features inside the annotation bounding box; the resulting target features are output as the initially estimated classification filter init_filter.
(1.2.3) Send the initially estimated classification filter init_filter obtained in step (1.2.2), together with the training sample features extracted by the feature extraction network of step (1.1) from the image set processed in step (1.2.1), into the optimizer module, and obtain through optimization the optimized classification filter new_filter used for foreground/background classification of the target in subsequent test frames. The specific implementation can be described by the following steps:
① Calculate the reference frame response map s_ref:
Convolve the initially estimated classification filter init_filter from step (1.2.2) with the training sample features to obtain the response map s_ref, i.e. the response map of the reference frame, as shown in equation (1):
s_ref = x_ref * f_init    (1)
where x_ref denotes the features of the reference frame, i.e. the training sample features; * denotes the convolution calculation; f_init is the initially estimated classification filter init_filter;
② Calculate the difference r(s, c) between the reference frame response map s_ref and the reference frame label y_hn^ref:
The reference frame response map s_ref computed in step ① is represented as a 19 × 19 two-dimensional matrix. The pedestrian target annotation bounding box of the reference frame is scaled onto the 19 × 19 two-dimensional matrix to obtain the position c of the annotated pedestrian target in the 19 × 19 matrix;
y_hn(t) = Σ_{k=0}^{N-1} φ_k^y · ρ_k(||t − c||)    (2)
m_c(t) = Σ_{k=0}^{N-1} φ_k^m · ρ_k(||t − c||)    (3)
In equations (2) and (3), t denotes each position of the 19 × 19 two-dimensional matrix and c denotes the target position; φ_k denotes the spatial coefficients obtained from training; ρ_k is a distance calculation function determined by equations (4-1) and (4-2):
(equations (4-1) and (4-2): piecewise definition of the basis functions ρ_k over the distance ||t − c||)
Here y_hn denotes the real label of the current frame response map; m_c denotes the mask parameter, used to determine the region of the current target in the 19 × 19 two-dimensional matrix, with m_c ≈ 1 in the region corresponding to the target and m_c ≈ 0 in the background region; the label y_hn and the mask parameter m_c are both represented as 19 × 19 two-dimensional matrices;
The reference frame label y_hn^ref and mask parameter m_c^ref are obtained from the reference frame through equations (2) and (3), and the difference r(s, c) between the response map s_ref obtained in step ① and the reference frame label y_hn^ref is calculated as shown in equation (5):
r(s, c) = v_c · (m_c · s + (1 − m_c) · max(0, s) − y_hn)    (5)
where s is the reference frame response map s_ref; v_c is a spatial weight; m_c is the reference frame mask parameter m_c^ref; y_hn is the reference frame label y_hn^ref;
③ Add regularization to the difference r(s, c) obtained in step ② to obtain L(f) as shown in equation (6), which serves as the loss between the reference frame response map s_ref and the reference frame label y_hn^ref and is back-propagated to optimize the classification filter;
L(f) = (1 / |S_train|) Σ_{(x_j, c_j) ∈ S_train} ||r(x_j * f, c_j)||² + ||λf||²    (6)
where S_train = {(x_j, c_j)} is the set of training samples, x_j being the training sample features extracted by the feature extraction network and c_j the center coordinates of the annotated target of the sample, i.e. the coordinates of the center point of the annotation bounding box; * denotes the convolution calculation; λ is a regularization factor; f is the classification filter being optimized.
The optimized classification filter f is obtained by back-propagation, using the steepest gradient descent method shown in equation (7):
f^(i+1) = f^(i) − α · ∇L(f^(i))    (7)
where f^(i) denotes the classification filter after the i-th optimization step; α denotes the learning rate; ∇ denotes the gradient calculation;
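A minimal sketch of the online filter optimization of step (1.2.3), assuming PyTorch. The residual follows equation (5), the loss follows equation (6), and the update follows the steepest-descent rule of equation (7); tensor shapes, the fixed learning rate and the helper names are illustrative assumptions rather than the exact implementation.

```python
import torch
import torch.nn.functional as F

def optimize_filter(init_filter, samples, labels, masks, v_c, lam=0.01, alpha=0.1, iters=5):
    """samples: (N, C, H, W) training features; labels/masks: (N, 1, 19, 19); v_c: spatial weight."""
    f = init_filter.detach().clone().requires_grad_(True)  # init_filter: (1, C, kH, kW)
    for _ in range(iters):
        s = F.conv2d(samples, f, padding=f.shape[-1] // 2)             # response maps, eq. (1)/(8)
        r = v_c * (masks * s + (1 - masks) * s.clamp(min=0) - labels)  # residual, eq. (5)
        loss = (r ** 2).mean() + (lam * f ** 2).sum()                  # regularized loss, eq. (6)
        grad, = torch.autograd.grad(loss, f)
        f = (f - alpha * grad).detach().requires_grad_(True)           # steepest descent, eq. (7)
    return f.detach()  # new_filter
```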
(1.3) Take the subsequent frames of the sequence currently being tracked as test frames and perform feature extraction, i.e. extract the features of the test frame with the feature extraction network to obtain the test frame features x_test;
(1.4) Convolve the classification filter new_filter optimized in step (1.2) with the test frame features obtained in step (1.3), as shown in equation (8):
s_test = x_test * f_new    (8)
where s_test is the test frame response map; x_test denotes the features of the test frame; * denotes the convolution calculation; f_new is the optimized classification filter new_filter;
(1.5) From the test frame response map s_test obtained in step (1.4), judge the tracking state of the current test frame; the tracking state is divided into a normal state, an uncertain state and a not-found state. The normal state means the scene of the test frame is simple and the target to be tracked and the background can be separated directly by the classification task; the uncertain state means the scene of the current test frame is complex and affected by distractors and background, so the target to be tracked and the background are difficult to identify accurately; the not-found state means the scene of the current test frame is complex and the target is occluded, or the target to be tracked and the background cannot be distinguished. The specific implementation comprises the following steps:
(1.5.1) Judge the tracking state of the current test frame from the test frame response map s_test obtained in step (1.4); the tracking state is divided into a normal state, an uncertain state and a not-found state:
① If the highest response in the test frame response map s_test occurs only at the target center, then only the target to be tracked is present or the target and the background have a clear boundary; the tracking state of the current test frame is the normal state, meaning the scene of the test frame is simple and the target to be tracked and the background can be separated directly by the classification task;
② The target region in the test frame response map s_test is the region framed around the position of the highest response score using the target bounding box size of the previous frame. If the response scores inside the target region are relatively disordered and the highest response score also drops noticeably, the target to be tracked is in the uncertain state, meaning the target is blending into the background or a distractor is close to the target;
③ If the response scores inside the target region of the test frame response map s_test are even more disordered than in the uncertain state and the highest response score drops even more markedly than in the uncertain state, the target to be tracked is severely occluded and is in the not-found state;
(1.5.2) Calculate the response-score variance of the target region in the test frame response map s_test and the highest response score of the test frame, to determine the state of the target to be tracked in the current search region:
(1.5.2.1) Compute, in a sliding manner, the mean σ_mean of the target-region response-score variances of the m frames preceding the test frame. The target region is the region framed around the position with the highest response score in the test frame response map s_test, using the target bounding box size of the previous frame. The target-region response-score variance σ of each of the m frames preceding the test frame is recorded, as shown in equation (9):
σ = (1 / (w · h)) Σ_i (score_i − score_mean)²    (9)
where score_i is the score of each position in the corresponding target region of the test frame response map, score_mean is the mean over the positions of the corresponding target region in the test frame response map s_test, and w · h is the width times the height of the corresponding target region in s_test;
(1.5.2.2) Then compute the mean according to equation (10), i.e. the response-score variance mean σ_mean of the m frames:
σ_mean = (1 / m) Σ_{j=1}^{m} σ_j    (10)
where σ_j denotes the target-region response-score variance of each of the m frames;
(1.5.2.3) At the same time, the highest response score of the test frame is the maximum of the response scores in the test frame response map s_test. The highest response score max_score of each of the m frames preceding the test frame is recorded, and the mean max_score_mean of the highest response scores of the m frames is calculated using equation (11):
max_score_mean = (1 / m) Σ_{j=1}^{m} max_score_j    (11)
where max_score_j denotes the highest response score of each frame;
(1.5.2.4) According to the normal, uncertain and not-found conditions stated in step (1.5.1), combine the target-region response-score variance mean σ_mean of the m frames preceding the test frame obtained in step (1.5.2.2) with the highest-response-score mean max_score_mean of the m frames preceding the test frame obtained in step (1.5.2.3) to judge the tracking state of the test frame:
If condition (12) is satisfied, the tracking state of the test frame is the uncertain state:
(condition (12): a comparison of the current frame's target-region response-score variance and highest response score against σ_mean and max_score_mean involving the scale factor k_1)
If condition (13) is satisfied, the tracking state of the test frame is the not-found state:
(condition (13): the corresponding, stricter comparison involving the scale factor k_2)
All other cases are regarded as the normal state;
where σ_mean is the response-score variance mean of the target region over the m frames preceding the current test frame, max_score_mean is the mean of the highest response scores of the test frame response maps s_test over the m frames preceding the test frame, and k_1, k_2 are scale factors.
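A minimal sketch of the tracking-state decision of step (1.5), assuming NumPy. Equations (9)-(11) are implemented directly; because conditions (12) and (13) are only rendered as images in the source, the thresholding below (current highest response score against k1/k2 times its sliding mean, variance against its sliding mean) is an assumed form, not the exact formulas.

```python
import numpy as np

def target_region_stats(response, center, box_wh):
    """Variance and peak of the response scores inside the target region (eq. (9))."""
    w, h = box_wh
    x, y = center
    region = response[max(0, y - h // 2): y + h // 2 + 1,
                      max(0, x - w // 2): x + w // 2 + 1]
    return region.var(), response.max()

def tracking_state(sigma, max_score, sigma_hist, max_hist, k1=0.75, k2=0.5):
    """Classify the current frame as 'normal', 'uncertain' or 'not_found'."""
    sigma_mean = np.mean(sigma_hist)        # eq. (10), sliding mean over the last m frames
    max_score_mean = np.mean(max_hist)      # eq. (11)
    if sigma > sigma_mean and max_score < k2 * max_score_mean:
        return "not_found"                  # assumed form of condition (13)
    if sigma > sigma_mean and max_score < k1 * max_score_mean:
        return "uncertain"                  # assumed form of condition (12)
    return "normal"
```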
Step 2: regression tasks in the tracking process:
(2.1) Predict the target center position from the test frame response map s_test obtained in the classification task of step (1.4);
(2.1.1) From the test frame response map s_test obtained in step (1.4), take the position coordinate corresponding to the maximum response score in s_test as the first response point; if the tracking state of the test frame is the normal tracking state and no distractor is encountered, the first response point is taken as the predicted target center position;
(2.1.2) The target region in the test frame response map s_test is the region framed around the position of the maximum response score in s_test using the target bounding box size of the previous frame; the area of s_test outside this framed region lies outside the target region, and the position corresponding to the highest response score outside the target region is regarded as the second response point;
(2.1.3) When the highest response score of the second response point is greater than 0.5 times the highest response score of the first response point, the second response point is regarded as a target look-alike (distractor) in the background;
(2.1.4) Let the current first response point position be c_1[x_1, y_1], the second response point position be c_2[x_2, y_2], and the position in the test frame response map s_test of the target center point of the frame preceding the current test frame be c_0[x_0, y_0], i.e. the center point of the target region in the response map. The positional offsets of c_1 and c_2 relative to c_0 are given by equations (14) and (15) respectively:
d_1 = sqrt((x_1 − x_0)² + (y_1 − y_0)²)    (14)
d_2 = sqrt((x_2 − x_0)² + (y_2 − y_0)²)    (15)
(2.1.5) Judge the real position of the current target:
Depending on which value ranges the offsets calculated from equations (14) and (15) fall into, different response points are returned as the predicted target center position, as shown in equations (16) and (17):
c_1, if (d_1 > Ω & d_2 < Ω) | (d_1 > Ω & d_2 > Ω) | (d_1 < d_2 & d_1 < Ω & d_2 < Ω)    (16)
c_2, if (d_1 < Ω & d_2 > Ω) | (d_1 > d_2 & d_1 < Ω & d_2 < Ω)    (17)
In equations (16) and (17), Ω denotes the threshold of the value range;
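A minimal sketch of the center-point selection of steps (2.1.1)-(2.1.5), assuming NumPy. The conditions reproduce equations (16) and (17); the Euclidean form of the offsets d1, d2 follows equations (14)-(15) as reconstructed above, and omega stands for the threshold Ω.

```python
import numpy as np

def select_center(c1, c2, c0, second_score, first_score, omega):
    """Return the predicted target center from the first/second response points."""
    if second_score <= 0.5 * first_score:
        return c1                     # no credible distractor: keep the first response point
    d1 = float(np.hypot(c1[0] - c0[0], c1[1] - c0[1]))   # eq. (14)
    d2 = float(np.hypot(c2[0] - c0[0], c2[1] - c0[1]))   # eq. (15)
    if (d1 > omega and d2 < omega) or (d1 > omega and d2 > omega) or \
       (d1 < d2 and d1 < omega and d2 < omega):
        return c1                     # eq. (16)
    if (d1 < omega and d2 > omega) or (d1 > d2 and d1 < omega and d2 < omega):
        return c2                     # eq. (17)
    return c1
```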
(2.2) Correct the target center position predicted in step (2.1) by combining the characteristics inherent to the motion of a pedestrian target, to obtain the final predicted target center position, namely: according to the characteristics inherent to pedestrian motion, in the normal state tracking is reliable, the pedestrian target undergoes no drastic scale change during motion, and the pedestrian moves relatively smoothly; therefore the offset of the target center in the normal tracking state is recorded, the relative offset of the target center between consecutive frames over the last v frames is recorded in a sliding-window manner, and the mean of these offsets is taken as the predicted offset, with v in the range 12 to 18. If the tracking state is the uncertain state, the predicted target center position obtained in step (2.1) is corrected by the predicted offset to finally obtain the final predicted target center position, as sketched below.
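A minimal sketch of the center correction of step (2.2), assuming NumPy. The sliding-window length v follows the description above; how exactly the predicted offset is applied in the uncertain state is not fully specified, so applying it to the previous center is an assumption, and the names are illustrative.

```python
import numpy as np
from collections import deque

class CenterCorrector:
    def __init__(self, v=16):
        self.offsets = deque(maxlen=v)   # per-frame center offsets observed in the normal state

    def record_normal(self, prev_center, cur_center):
        self.offsets.append(np.subtract(cur_center, prev_center))

    def correct(self, state, predicted_center, prev_center):
        if state == "uncertain" and self.offsets:
            predicted_offset = np.mean(self.offsets, axis=0)     # mean offset over the window
            return np.asarray(prev_center, dtype=float) + predicted_offset
        return np.asarray(predicted_center, dtype=float)
```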
(2.3) Using the final predicted target center position obtained in step (2.2) and the target bounding box of the previous frame (if the current frame is the second frame, the target bounding box annotated in the first frame serves as the initial candidate bounding box; for later frames, the bounding box predicted in step (2.4) serves as the initial candidate bounding box), and combining the target bounding box annotated in the reference frame as the reference bounding box, generate a set of candidate bounding boxes around the target center position finally predicted in step (2.2). The specific implementation process is as follows:
(2.3.1) Take the final predicted target center position obtained in step (2.2) and the target bounding box of the previous frame as the initial candidate bounding box (the box annotated in the first frame if the current frame is the second frame, otherwise the box predicted in step (2.4)), and randomly generate a candidate bounding boxes of different scales around the final predicted target center position to form the initial candidate bounding box set;
(2.3.2) Because, according to the characteristics inherent to pedestrian motion, the scale of a pedestrian target changes relatively smoothly during motion, combine the manually annotated target bounding box of the reference frame from step (1.1) as the reference candidate bounding box, and randomly generate b reference candidate bounding boxes of different scales around the final predicted target center position to form the reference candidate bounding box set;
(2.3.3) Fuse the initial candidate bounding box set obtained in step (2.3.1) and the reference candidate bounding box set obtained in step (2.3.2) to obtain a + b candidate bounding boxes as the candidate bounding box set;
The value of a in step (2.3.1) ranges from 7 to 15; the value of b in step (2.3.2) ranges from 3 to 7.
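A minimal sketch of the candidate bounding box generation of step (2.3), assuming NumPy and boxes in (cx, cy, w, h) form. The scale-jitter amplitude and the uniform sampling are illustrative assumptions; the text only requires a boxes derived from the previous frame's box and b boxes derived from the reference-frame box, all centered on the final predicted target center.

```python
import numpy as np

def generate_candidates(center, prev_box, ref_box, a=10, b=4, scale_jitter=0.1, rng=None):
    rng = rng or np.random.default_rng()
    cx, cy = center

    def jitter(box, n):
        w, h = box[2], box[3]
        scales = rng.uniform(1 - scale_jitter, 1 + scale_jitter, size=(n, 2))
        return np.column_stack([np.full(n, cx), np.full(n, cy),
                                w * scales[:, 0], h * scales[:, 1]])

    initial_set = jitter(prev_box, a)     # a boxes around the previous-frame box size
    reference_set = jitter(ref_box, b)    # b boxes around the reference-frame box size
    return np.vstack([initial_set, reference_set])   # a + b candidate boxes
```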
(2.4) Send the a + b candidate bounding boxes obtained in step (2.3) and the test frame features obtained in step (1.3) to the bounding box prediction module for target bounding box prediction;
the process of target bounding box prediction in step (2.4) comprises the following steps:
(2.4.1) Because the designated tracked target is unknown before annotation, the bounding box prediction must incorporate the target information annotated in the reference frame, which therefore has to be extracted first, namely: the reference frame is fed into the ResNet50 feature extraction network of step (1.1) and processed by the four residual blocks in turn; the reference frame features output by Block3 (layer1) and Block4 (layer2) are extracted, passed through convolution and PrPooling, fused, and then passed through a fully connected layer to obtain a modulation vector, which serves as the annotated pedestrian target information of the reference frame;
(2.4.2) For the test frame, as shown in fig. 4, the test frame features obtained with the ResNet50 feature extraction network of step (1.1) are passed through two convolutional layers, and a PrPooling operation is applied to each of the a + b bounding box regions to extract their internal features; combined with the modulation vector, i.e. the annotated pedestrian target information of the reference frame, the intersection over union IoU (Intersection over Union) of each of the a + b bounding boxes is predicted through a fully connected layer. The gradient of each bounding box is then computed from its predicted IoU, and the a + b candidate bounding boxes are optimized separately to obtain optimized candidate bounding boxes as a new optimized candidate bounding box set;
(2.4.3) Repeat step (2.4.2) for iterative optimization; after 5 iterations, take the coordinate average of the three candidate bounding boxes with the largest predicted IoU as the coordinates of the predicted bounding box, i.e. the final predicted target bounding box.
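A minimal sketch of the IoU-guided box refinement of steps (2.4.2)-(2.4.3), assuming PyTorch. Here `iou_predictor` stands in for the bounding box prediction module (modulation vector, PrPooling and fully connected layers); its signature, the step size and the size-scaled gradient are assumptions. Only the refinement loop itself follows the text: maximize the predicted IoU by gradient ascent for 5 iterations, then average the top-3 boxes.

```python
import torch

def refine_boxes(iou_predictor, test_feat, modulation, boxes, iters=5, step=1.0):
    """boxes: (a+b, 4) tensor of (cx, cy, w, h) candidates."""
    boxes = boxes.clone()
    for _ in range(iters):
        boxes.requires_grad_(True)
        iou = iou_predictor(test_feat, modulation, boxes)   # predicted IoU per candidate
        grad, = torch.autograd.grad(iou.sum(), boxes)
        # Gradient ascent on the predicted IoU, scaled by box size (an assumed refinement detail).
        scale = boxes[:, 2:].repeat(1, 2).detach()
        boxes = boxes.detach() + step * grad * scale
    iou = iou_predictor(test_feat, modulation, boxes)
    top3 = torch.topk(iou.squeeze(-1), k=3).indices
    return boxes[top3].mean(dim=0)        # final predicted target bounding box
```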
(2.5) Update the classification filter new_filter according to the tracking state of the current test frame judged in step (1.5). During motion a pedestrian may be occluded, may cross paths with other pedestrians, or may blend into the background; by judging the tracking state of the current target before optimizing and updating the classification filter new_filter, the introduction of background or distractor information can be avoided. new_filter is optimized and updated according to the tracking state of the test frame as follows:
(2.5.1) When the tracking state of the current test frame is judged to be the normal tracking state, the classification filter new_filter is optimized and updated every n frames, or whenever a distractor is encountered; n ranges from 15 to 25;
The classification filter new_filter is optimized and updated using the test frame response map s_test obtained in step (1.4) and the test frame label y_hn^test, where the test frame response map s_test is expressed as a 19 × 19 two-dimensional matrix:
(i) The bounding box predicted in step (2.4.3) is scaled onto the 19 × 19 two-dimensional matrix, giving the target position c of the test frame pedestrian target on the 19 × 19 two-dimensional matrix;
(ii) Following the calculation of the target mask parameter m_c and label y_hn in step (1.2.3), compute the target mask parameter m_c^test and label y_hn^test of the test frame;
(iii) Compute the residual r(s, c) between the test frame response map s_test and the test frame label y_hn^test using equation (5); in this case, in equation (5), s is the test frame response map s_test, v_c is a spatial weight, m_c is the test frame target mask parameter m_c^test, and y_hn is the test frame label y_hn^test;
(iv) Following the operations of step (1.2.3), obtain the loss between the test frame response map s_test and the test frame label y_hn^test, and use it to optimize and update the classification filter new_filter, so that a newly optimized classification filter new_filter is obtained for foreground/background classification of the target in subsequent test frames.
(2.5.2) When the tracking state of the current test frame is judged to be the uncertain state, the classification filter new_filter is optimized and updated;
(2.5.3) When the tracking state of the current test frame is judged to be the not-found state, the classification filter new_filter is not optimized or updated.
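A minimal sketch of the state-dependent update decision of step (2.5). The function only decides whether the classifier should be re-optimized this frame; the earlier `optimize_filter` sketch would then be called with the new training sample. The distractor flag and frame-counter handling are illustrative.

```python
def should_update(state, frames_since_update, distractor_found, n=20):
    if state == "normal":
        return frames_since_update >= n or distractor_found   # step (2.5.1)
    if state == "uncertain":
        return True                                           # step (2.5.2)
    return False                                              # step (2.5.3): not-found, skip update
```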
(2.6) Repeat steps (1.3) to (2.5) to complete the target recognition and detection of every frame of the whole tracked video sequence, finally realizing pedestrian single-target tracking. The specific process is as follows:
(2.6.1) Acquire the information of the designated target pedestrian selected in the reference frame of the tracked video sequence through steps (1.1) to (1.2), and use the resulting new_filter for foreground/background classification of the target in subsequent test frames;
(2.6.2) Repeat steps (1.3) to (2.5) for each subsequent test frame until the last frame, thereby completing the recognition and detection of the designated pedestrian target in every frame of the whole video sequence, and finally realizing the tracking of the single designated pedestrian target of the reference frame over the whole video sequence, as in the sketch below.
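A minimal sketch of the overall per-frame loop of step (2.6), tying together the pieces sketched above. All helper names on `backbone` and `predictor` are illustrative stand-ins for the earlier sketches, not an exact implementation of the method.

```python
def track_sequence(frames, ref_box, backbone, predictor):
    new_filter = predictor.train_online(frames[0], ref_box)           # steps (1.1)-(1.2)
    prev_box, results = ref_box, [ref_box]
    for frame in frames[1:]:                                          # steps (1.3)-(2.5) per frame
        feat = backbone(frame)                                        # (1.3)
        response = predictor.classify(feat, new_filter)               # (1.4): s_test = x_test * f_new
        state = predictor.judge_state(response)                       # (1.5)
        center = predictor.predict_center(response, state, prev_box)  # (2.1)-(2.2)
        candidates = predictor.generate_candidates(center, prev_box, ref_box)   # (2.3)
        prev_box = predictor.refine_boxes(feat, candidates)           # (2.4)
        new_filter = predictor.maybe_update(new_filter, feat, prev_box, state)  # (2.5)
        results.append(prev_box)
    return results
```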
The invention has the advantages that: the invention designs a discriminative pedestrian single-target tracking method that fuses pedestrian characteristics with a new online updating strategy, and mainly studies the application of a discriminative model based on an online updating strategy to pedestrian single-target tracking. Online training is the core of the discriminative model: a classification filter is first obtained by prediction from the information of the first frame, and whenever the update conditions are met during subsequent tracking, the predicted result is added to the training sample set as a new training sample and the classification filter is optimized on the new sample set. However, because the current tracking condition is uncertain during updating, an update scheme that simply updates every few frames may introduce excessive background or distractor information, making tracking inaccurate or even causing drift, and the predicted target center may also be inaccurate in different states. To solve these problems, the invention convolves the classification filter with the test frame features to obtain a response map, judges the tracking state of the test frame from it, and adjusts the update strategy accordingly. During motion a pedestrian may be occluded, may move across other pedestrians, or may blend into the background. Existing techniques cannot determine the position of the current target when the pedestrian target is occluded, so drift may occur in the subsequent tracking; drift may also occur when a pedestrian target separates from other pedestrians after crossing paths with them. To solve these occlusion and drift problems, the target center predicted for the test frame is adjusted by combining pedestrian characteristics, making the tracking more accurate. Compared with other methods, the method achieves higher tracking accuracy on single pedestrian targets at real-time speed, combining accuracy and speed, and has practical value.
[ description of the drawings ]
Fig. 1 is a system framework schematic diagram of an online updating strategy pedestrian single-target tracking method integrating pedestrian characteristics according to the present invention.
Fig. 2 is a structural diagram of an initializer module of an online updating strategy pedestrian single-target tracking method integrating pedestrian characteristics according to the invention.
FIG. 3 is a structural diagram of an optimizer module of the online update strategy pedestrian single-target tracking method integrating pedestrian characteristics according to the present invention.
Fig. 4 is a schematic structural diagram of a bounding box prediction module of an online updating strategy pedestrian single-target tracking method integrating pedestrian characteristics according to the present invention.
[ detailed description of embodiments ]
As shown in fig. 1, an online updating strategy pedestrian single-target tracking method fusing pedestrian characteristics is characterized by comprising the following steps:
step 1: classification tasks in the tracking process:
(1.1) Extract the features of the reference frame and of each frame of the video sequence through a feature extraction network, namely: select the first frame of the video sequence as the reference frame, select the target to be tracked by manually marking its target bounding box, and then recognize and detect this selected target of the reference frame in each subsequent frame of the video sequence, i.e. each test frame, so that the target to be tracked is tracked throughout the video sequence and pedestrian single-target tracking is realized;
The feature extraction network in step (1.1) is built on ResNet50. ResNet50 consists of 4 residual blocks (Residual Block) connected in series, named Block1, Block2, Block3 and Block4 and containing 50 convolution operations in total; this is known technology. Two convolutional layers are connected after ResNet50 to form the backbone network for feature extraction, shown in FIG. 1, which extracts the features of the current reference frame or test frame image.
(1.2) Using the features extracted in step (1.1) from the selected reference frame and its manually marked target bounding box, train a classification filter online through a model predictor; the filter distinguishes the target foreground from the background in subsequent frames and predicts the target position;
The model predictor in step (1.2) consists of an initializer module and an optimizer module, as shown in fig. 1. The initializer module, shown in fig. 2, provides an effective initial estimate of the classification filter using only the appearance of the target to be tracked. The optimizer module, shown in fig. 3, optimizes the initially estimated classification filter init_filter to finally obtain the optimized classification filter new_filter, which performs foreground/background classification on subsequent frames of the tracked video sequence and predicts the center position of the target to be tracked for rough localization.
In the step (1.2), the classification filter new _ filter is obtained by utilizing the information of the reference frame through model predictor online training, and the method specifically comprises the following steps:
(1.2.1) Using the reference frame information of the currently tracked video sequence, including the reference frame image and the manually specified target annotation bounding box, apply flipping, mirroring, blurring and rotation data-enhancement operations to the reference frame; collect the resulting images into an image set that has undergone data enhancement, and obtain the correspondingly transformed annotation bounding boxes at the same time;
(1.2.2) Use the feature extraction network of step (1.1) to extract features from the image set processed in step (1.2.1), obtaining a group of training sample features, and send them to the model initializer module for the initial estimate of the classification filter, giving the initially estimated classification filter init_filter, as shown in fig. 2;
The initial estimate of the classification filter by the model initializer module in step (1.2.2) is obtained by applying a PrRoI Pooling operation to the training sample features and extracting the features inside the annotation bounding box; the resulting target features are output as the initially estimated classification filter init_filter.
(1.2.3) Send the initially estimated classification filter init_filter obtained in step (1.2.2), together with the training sample features extracted by the feature extraction network of step (1.1) from the image set processed in step (1.2.1), into the optimizer module, and obtain through optimization the optimized classification filter new_filter used for foreground/background classification of the target in subsequent test frames.
The specific implementation process of step (1.2.3), as shown in fig. 3, can be described by the following steps:
① Calculate the reference frame response map s_ref:
Convolve the initially estimated classification filter init_filter from step (1.2.2) with the training sample features to obtain the response map s_ref, i.e. the response map of the reference frame, as shown in equation (1):
s_ref = x_ref * f_init    (1)
where x_ref denotes the features of the reference frame, i.e. the training sample features; * denotes the convolution calculation; f_init is the initially estimated classification filter init_filter;
② Calculate the difference r(s, c) between the reference frame response map s_ref and the reference frame label y_hn^ref:
The reference frame response map s_ref computed in step ① is represented as a 19 × 19 two-dimensional matrix. The pedestrian target annotation bounding box of the reference frame is scaled onto the 19 × 19 two-dimensional matrix to obtain the position c of the annotated pedestrian target in the 19 × 19 matrix;
y_hn(t) = Σ_{k=0}^{N-1} φ_k^y · ρ_k(||t − c||)    (2)
m_c(t) = Σ_{k=0}^{N-1} φ_k^m · ρ_k(||t − c||)    (3)
In equations (2) and (3), t denotes each position of the 19 × 19 two-dimensional matrix and c denotes the target position; φ_k denotes the spatial coefficients obtained from training; ρ_k is a distance calculation function determined by equations (4-1) and (4-2):
(equations (4-1) and (4-2): piecewise definition of the basis functions ρ_k over the distance ||t − c||)
In this embodiment, N is 10. In a 19 × 19 two-dimensional matrix there is a largest possible distance between t and c, so setting N to 10, i.e. k up to 9, lets the last basis function stand for all positions far from the target center, which can therefore be handled in the same way. The value of τ ranges from 0.45 to 0.55; experiments show the result is best when τ is 0.5, at which the background and the region corresponding to the target are best distinguished.
Here y_hn denotes the real label of the current frame response map; m_c denotes the mask parameter, used to determine the region of the current target in the 19 × 19 two-dimensional matrix, with m_c ≈ 1 in the region corresponding to the target and m_c ≈ 0 in the background region; the label y_hn and the mask parameter m_c are both represented as 19 × 19 two-dimensional matrices;
The reference frame label y_hn^ref and mask parameter m_c^ref are obtained from the reference frame through equations (2) and (3), and the difference r(s, c) between the response map s_ref obtained in step ① and the reference frame label y_hn^ref is calculated as shown in equation (5):
r(s, c) = v_c · (m_c · s + (1 − m_c) · max(0, s) − y_hn)    (5)
where s is the reference frame response map s_ref; v_c is a spatial weight; m_c is the reference frame mask parameter m_c^ref; y_hn is the reference frame label y_hn^ref;
③ Add regularization to the difference r(s, c) obtained in step ② to obtain L(f) as shown in equation (6), which serves as the loss between the reference frame response map s_ref and the reference frame label y_hn^ref and is back-propagated to optimize the classification filter;
L(f) = (1 / |S_train|) Σ_{(x_j, c_j) ∈ S_train} ||r(x_j * f, c_j)||² + ||λf||²    (6)
where S_train = {(x_j, c_j)} is the set of training samples, x_j being the training sample features extracted by the feature extraction network and c_j the center coordinates of the annotated target of the sample, i.e. the coordinates of the center point of the annotation bounding box; * denotes the convolution calculation; λ is a regularization factor; f is the classification filter being optimized.
The optimized classification filter f in step ③ is obtained by back-propagation, using the steepest gradient descent method shown in equation (7):
f^(i+1) = f^(i) − α · ∇L(f^(i))    (7)
where f^(i) denotes the classification filter after the i-th optimization step; α denotes the learning rate; ∇ denotes the gradient calculation;
In this embodiment, the optimized classification filter f is obtained after five iterations of optimization, and the optimized classification filter new_filter is used for foreground/background classification of the target in subsequent test frames.
(1.3) Take the subsequent frames of the sequence currently being tracked as test frames and perform feature extraction, i.e. extract the features of the test frame with the feature extraction network to obtain the test frame features x_test;
(1.4) Convolve the classification filter new_filter optimized in step (1.2) with the test frame features obtained in step (1.3), as shown in equation (8):
s_test = x_test * f_new    (8)
where s_test is the test frame response map; x_test denotes the features of the test frame; * denotes the convolution calculation; f_new is the optimized classification filter new_filter;
(1.5) From the test frame response map s_test obtained in step (1.4), judge the tracking state of the current test frame; the tracking state is divided into a normal state, an uncertain state and a not-found state. The normal state means the scene of the test frame is simple and the target to be tracked and the background can be separated directly by the classification task; the uncertain state means the scene of the current test frame is complex and affected by distractors and background, so the target to be tracked and the background are difficult to identify accurately; the not-found state means the scene of the current test frame is complex and the target is occluded, or the target to be tracked and the background cannot be distinguished;
The specific implementation of step (1.5) comprises the following steps:
(1.5.1) Judge the tracking state of the current test frame from the test frame response map s_test obtained in step (1.4); the tracking state is divided into a normal state, an uncertain state and a not-found state:
① If the highest response in the test frame response map s_test occurs only at the target center, then only the target to be tracked is present or the target and the background have a clear boundary; the tracking state of the current test frame is the normal state, meaning the scene of the test frame is simple and the target to be tracked and the background can be separated directly by the classification task;
② The target region in the test frame response map s_test is the region framed around the position of the highest response score using the target bounding box size of the previous frame. If the response scores inside the target region are relatively disordered and the highest response score also drops noticeably, the target to be tracked is in the uncertain state, meaning the target is blending into the background or a distractor is close to the target;
③ If the response scores inside the target region of the test frame response map s_test are even more disordered than in the uncertain state and the highest response score drops even more markedly than in the uncertain state, the target to be tracked is severely occluded and is in the not-found state;
(1.5.2) Calculate the response-score variance of the target region in the test frame response map s_test and the highest response score of the test frame, to determine the state of the target to be tracked in the current search region:
(1.5.2.1) Compute, in a sliding manner, the mean σ_mean of the target-region response-score variances of the m frames preceding the test frame. The target region is the region framed around the position with the highest response score in the test frame response map s_test, using the target bounding box size of the previous frame. The target-region response-score variance σ of each of the m frames preceding the test frame is recorded, as shown in equation (9):
σ = (1 / (w · h)) Σ_i (score_i − score_mean)²    (9)
where score_i is the score of each position in the corresponding target region of the test frame response map, score_mean is the mean over the positions of the corresponding target region in the test frame response map s_test, and w · h is the width times the height of the corresponding target region in s_test;
(1.5.2.2) Then compute the mean according to equation (10), i.e. the response-score variance mean σ_mean of the m frames:
σ_mean = (1 / m) Σ_{j=1}^{m} σ_j    (10)
where σ_j denotes the target-region response-score variance of each of the m frames;
(1.5.2.3) At the same time, the highest response score of the test frame is the maximum of the response scores in the test frame response map s_test. The highest response score max_score of each of the m frames preceding the test frame is recorded, and the mean max_score_mean of the highest response scores of the m frames is calculated using equation (11):
max_score_mean = (1 / m) Σ_{j=1}^{m} max_score_j    (11)
where max_score_j denotes the highest response score of each frame;
(1.5.2.4) According to the normal, uncertain and not-found conditions stated in step (1.5.1), combine the target-region response-score variance mean σ_mean of the m frames preceding the test frame obtained in step (1.5.2.2) with the highest-response-score mean max_score_mean of the m frames preceding the test frame obtained in step (1.5.2.3) to judge the tracking state of the test frame:
If condition (12) is satisfied, the tracking state of the test frame is the uncertain state:
(condition (12): a comparison of the current frame's target-region response-score variance and highest response score against σ_mean and max_score_mean involving the scale factor k_1)
If condition (13) is satisfied, the tracking state of the test frame is the not-found state:
(condition (13): the corresponding, stricter comparison involving the scale factor k_2)
All other cases are regarded as the normal state;
where σ_mean is the response-score variance mean of the target region over the m frames preceding the current test frame, max_score_mean is the mean of the highest response scores of the test frame response maps s_test over the m frames preceding the test frame, and k_1, k_2 are scale factors.
In this embodiment, m is 25, k_1 = 0.75 and k_2 = 0.5. σ_mean is taken as the mean of the target-region response-score variances of the 25 frames preceding the test frame, and max_score_mean as the mean of the highest response scores of the test frame response maps s_test of the 25 frames preceding the test frame; if fewer than 25 frames precede the test frame, the calculation uses all frames available before the test frame. The value of 25 frames and the scale factors k_1, k_2 are the best choices obtained by experiment.
Step 2: regression tasks in the tracking process:
(2.1) Predict the target center position from the test frame response map s_test obtained in the classification task of step (1.4);
Step (2.1) specifically comprises the following steps:
(2.1.1) From the test frame response map s_test obtained in step (1.4), take the position coordinate corresponding to the maximum response score in s_test as the first response point; if the tracking state of the test frame is the normal tracking state and no distractor is encountered, the first response point is taken as the predicted target center position;
(2.1.2) The target region in the test frame response map s_test is the region framed around the position of the maximum response score in s_test using the target bounding box size of the previous frame; the area of s_test outside this framed region lies outside the target region, and the position corresponding to the highest response score outside the target region is regarded as the second response point;
(2.1.3) When the highest response score of the second response point is greater than 0.5 times the highest response score of the first response point, the second response point is regarded as a target look-alike (distractor) in the background;
(2.1.4) Let the current first response point position be c_1[x_1, y_1], the second response point position be c_2[x_2, y_2], and the position in the test frame response map s_test of the target center point of the frame preceding the current test frame be c_0[x_0, y_0], i.e. the center point of the target region in the response map. The positional offsets of c_1 and c_2 relative to c_0 are given by equations (14) and (15) respectively:
d_1 = sqrt((x_1 − x_0)² + (y_1 − y_0)²)    (14)
d_2 = sqrt((x_2 − x_0)² + (y_2 − y_0)²)    (15)
(2.1.5) Judge the real position of the current target:
Depending on which value ranges the offsets calculated from equations (14) and (15) fall into, different response points are returned as the predicted target center position, as shown in equations (16) and (17):
c_1, if (d_1 > Ω & d_2 < Ω) | (d_1 > Ω & d_2 > Ω) | (d_1 < d_2 & d_1 < Ω & d_2 < Ω)    (16)
c_2, if (d_1 < Ω & d_2 > Ω) | (d_1 > d_2 & d_1 < Ω & d_2 < Ω)    (17)
In equations (16) and (17), Ω denotes the threshold of the value range;
(2.2) correcting (2.1) the predicted target center position by combining the inherent characteristics of the pedestrian target in the motion process to obtain the final predicted target center position;
The step (2.2) specifically comprises the following: according to the inherent characteristics of a pedestrian target during motion, target tracking is normal in the normal state, the pedestrian target does not undergo violent scale changes while moving, and the pedestrian moves relatively smoothly; therefore the offset of the target center in the normal tracking state is recorded: the relative offset of the target center between consecutive frames over the last v frames is recorded in a sliding-window manner, and the mean of these offsets is taken as the predicted offset, with v ranging from 12 to 18. If the tracking state is the uncertain state, the predicted target center position obtained in step (2.1) is corrected by the predicted offset, finally giving the final predicted target center position;
In this embodiment, v = 16, which is the optimal choice obtained from experimental summary.
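The center-offset correction of step (2.2) can be sketched as follows. Since the text does not spell out the exact correction formula, applying the mean recorded offset to the previous center in the uncertain state is an assumption made for illustration; v = 16 follows this embodiment.

```python
from collections import deque
import numpy as np

class CenterOffsetCorrector:
    """Records center offsets in the normal state and reuses them when uncertain."""

    def __init__(self, v=16):
        self.offsets = deque(maxlen=v)   # center offsets recorded in the normal state
        self.prev_center = None

    def correct(self, predicted_center, state):
        predicted_center = np.asarray(predicted_center, dtype=float)
        if state == "uncertain" and self.offsets and self.prev_center is not None:
            predicted_offset = np.mean(self.offsets, axis=0)
            center = self.prev_center + predicted_offset      # assumed correction rule
        else:
            center = predicted_center

        if state == "normal" and self.prev_center is not None:
            self.offsets.append(center - self.prev_center)    # sliding-window recording
        self.prev_center = center
        return center
```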
(2.3) According to the final predicted target center position obtained in step (2.2) and the target bounding box of the previous frame, if the current frame is the second frame, the target bounding box annotated in the first frame is taken as the initial candidate bounding box; if it is a later frame, the bounding box predicted in step (2.4) is taken as the initial candidate bounding box; combined with the target bounding box annotated in the reference frame as the reference bounding box, a candidate bounding-box set is generated around the target center position finally predicted in step (2.2);
the specific implementation process of the step (2.3) is as follows:
(2.3.1) The final predicted target center position is obtained according to step (2.2), and the target bounding box of the previous frame is taken as the initial candidate bounding box; if the current frame is the second frame, the target bounding box annotated in the first frame is taken as the initial candidate bounding box, and if it is a later frame, the bounding box predicted in step (2.4) is taken as the initial candidate bounding box; a candidate bounding boxes of different proportions are then randomly generated around the final predicted target center position to form the initial candidate bounding-box set; in this embodiment, a = 10;
(2.3.2) According to the inherent characteristics of the pedestrian target during motion, the scale of the pedestrian target changes relatively smoothly; combined with the manually annotated target bounding box of the reference frame in step (1.1) as the reference candidate bounding box, b reference candidate bounding boxes of different proportions are randomly generated around the finally predicted target center position to form the reference candidate bounding-box set; in this embodiment, b = 4;
(2.3.3) fusing the initial candidate bounding box set obtained in the step (2.3.1) and the reference candidate bounding box set obtained in the step (2.3.2) to obtain a + b candidate bounding boxes serving as candidate bounding box sets;
The value range of a in step (2.3.1) is 7 to 15, and the value range of b in step (2.3.2) is 3 to 7; in this embodiment, a = 10 and b = 4 are the parameters obtained through experimental optimization.
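A possible sketch of the candidate-box generation in step (2.3), with a = 10 and b = 4, is shown below; the (cx, cy, w, h) box convention and the jitter range used to produce boxes of different proportions are assumptions made only for illustration.

```python
import numpy as np

def generate_candidates(center, init_box, ref_box, a=10, b=4, rng=None):
    """Builds the a + b candidate set of step (2.3); boxes are (cx, cy, w, h)."""
    rng = rng or np.random.default_rng()
    cx, cy = center

    def jitter(box, n):
        _, _, w, h = box
        scales = rng.uniform(0.9, 1.1, size=(n, 2))   # assumed mild scale/aspect jitter
        return np.array([[cx, cy, w * sw, h * sh] for sw, sh in scales])

    init_set = jitter(init_box, a)    # a boxes of different proportions, step (2.3.1)
    ref_set = jitter(ref_box, b)      # b reference-based boxes, step (2.3.2)
    return np.concatenate([init_set, ref_set], axis=0)   # a + b candidates, step (2.3.3)
```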
(2.4) sending the a + b optimized candidate bounding box sets obtained in the step (2.3) and the characteristics of the test frame obtained in the step (1.3) to a bounding box prediction module for target bounding box prediction;
the target bounding box prediction process in the step (2.4) comprises the following steps:
(2.4.1) Since the specified tracked target is not known before annotation, the target information annotated in the reference frame must be combined when predicting the bounding box, and therefore this information is first extracted, namely: as shown in fig. 4, the feature extraction network ResNet50 of step (1.1) is used to perform feature extraction on the reference frame; the reference frame is sent into the ResNet50 feature extraction network and processed by the four residual blocks in sequence, the reference-frame features output by Block3 (layer 1) and Block4 (layer 2) are extracted, convolved, pooled by PrPooling and fused, and then passed through a fully connected layer to obtain a modulation vector, which serves as the pedestrian target information annotated in the reference frame;
(2.4.2) For the test frame, as shown in the test-frame branch of fig. 4, the test-frame features obtained by the ResNet50 feature extraction network of step (1.1) are passed through two convolutional layers, PrPooling is then performed on each of the a + b (14 in this embodiment) bounding-box regions to extract their internal features, the modulation vector, i.e., the information of the pedestrian target annotated in the reference frame, is combined, and the intersection-over-union ratio IoU of each of the a + b (14 in this embodiment) bounding boxes is then predicted through a fully connected layer; the bounding-box gradients are calculated from the IoU, and the a + b (14 in this embodiment) candidate bounding boxes are each optimized to obtain the optimized candidate bounding boxes, which serve as the new optimized candidate bounding-box set;
(2.4.3) Step (2.4.2) is repeated for iterative optimization; after 5 iterations, the average of the coordinates of the three candidate bounding boxes with the largest IoU is taken as the coordinates of the predicted bounding box, i.e., the final predicted target bounding box.
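The iterative IoU-guided refinement of step (2.4) might be organized as in the following sketch; the iou_net prediction head, its signature and the step length are assumptions, and only the overall flow (predict the IoU of each candidate, ascend its gradient for 5 iterations, then average the three best boxes) follows the description above.

```python
import torch

def refine_boxes(iou_net, test_feat, modulation, boxes, steps=5, lr=1.0):
    """Gradient-ascent refinement of candidate boxes on a predicted IoU score."""
    boxes = boxes.clone().requires_grad_(True)       # (a + b, 4) candidate boxes
    for _ in range(steps):                           # 5 iterations as in (2.4.3)
        iou = iou_net(test_feat, modulation, boxes)  # predicted IoU per candidate box
        iou.sum().backward()
        with torch.no_grad():
            boxes += lr * boxes.grad                 # move each box to increase its IoU
            boxes.grad.zero_()
    with torch.no_grad():
        iou = iou_net(test_feat, modulation, boxes)
        top3 = iou.topk(3).indices
        return boxes[top3].mean(dim=0).detach()      # average of the 3 best boxes
```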
(2.5) The classification filter new_filter is updated according to the tracking state of the current test frame judged in step (1.5); a pedestrian may be occluded, cross paths with other pedestrians, or blend into the background during motion, so the classification filter new_filter is optimized and updated according to the judged tracking state of the current target, which prevents background or distractor information from being introduced;
in the step (2.5), the new _ filter is optimized and updated according to the tracking state of the test frame as follows:
(2.5.1) when the tracking state of the current test frame is determined to be a normal tracking state, optimizing and updating the new _ filter of the classification filter every n frames or when an interfering object is encountered;
The value range of n in step (2.5.1) is 15 to 25. In this embodiment, the classification filter is updated every 20 frames: if the filter is updated too frequently, the speed drops noticeably and more background information is introduced, affecting the performance of the final overall model; since the video changes dynamically, if the filter is not updated for a long time, its classification ability deteriorates, which also affects the performance of the final overall model. The interval of 20 frames is the best choice obtained from experimental summary.
(2.5.2) when the tracking state of the current test frame is determined to be an uncertain state, optimizing and updating the new _ filter of the classification filter;
(2.5.3) when the tracking state of the current test frame is determined as the state can not be found, the new _ filter of the classification filter is not updated optimally.
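The state-dependent update policy of steps (2.5.1) to (2.5.3) can be summarized in a small helper; the function and argument names are illustrative, with n = 20 as in this embodiment.

```python
def should_update_filter(state, frame_idx, last_update_idx, distractor_found, n=20):
    """State-dependent update rule for the classification filter new_filter."""
    if state == "not_found":
        return False           # (2.5.3): do not update when the target is lost
    if state == "uncertain":
        return True            # (2.5.2): update in complex, uncertain scenes
    # (2.5.1) normal state: update every n frames, or when a distractor appears
    return distractor_found or (frame_idx - last_update_idx >= n)
```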
In step (2.5.1), the optimization update of the classification filter new_filter is implemented using the test-frame response map s_test obtained in step (1.4) and the test-frame label y_hn^test, where the test-frame response map s_test is expressed as a 19 × 19 two-dimensional matrix:
(i) the boundary frame predicted in the step (2.4.3) is scaled on a 19 × 19 two-dimensional matrix, and then a target position c of the pedestrian target of the test frame on the 19 × 19 two-dimensional matrix can be obtained;
(ii) Following the calculation of the target mask parameter m_c and the label y_hn in step (1.2.3), the target mask parameter m_c^test and the label y_hn^test of the test frame are calculated;
(iii) The residual r(s, c) between the test-frame response map s_test and the test-frame label y_hn^test is calculated using equation (2); in this case, in equation (2), s is the test-frame response map s_test, v_c is the spatial weight, m_c is the test-frame target mask parameter m_c^test, and y_hn is the test-frame label y_hn^test;
(iv) Following the operation of step (1.2.3), the loss between the test-frame response map s_test and the test-frame label y_hn^test is obtained, and the classification filter new_filter is optimized and updated according to this loss, so as to obtain a newly optimized classification filter new_filter for foreground/background classification of the target in subsequent test frames.
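For the test-frame update, the residual and loss computation might look like the following sketch; it reuses the hinge-like residual form r(s, c) = v_c·(m_c·s + (1 − m_c)·max(0, s) − y_hn) stated for the reference frame, and treating the test-frame quantities the same way, as well as the regularization weight lam, are assumptions.

```python
import torch

def residual(s, m_c, y_hn, v_c):
    """Hinge-like residual between a response map s and its label y_hn."""
    return v_c * (m_c * s + (1 - m_c) * torch.clamp(s, min=0) - y_hn)

def filter_update_loss(s_test, m_c_test, y_hn_test, v_c, f, lam=0.05):
    """Squared residual on the test frame plus filter regularization (lam is assumed)."""
    r = residual(s_test, m_c_test, y_hn_test, v_c)
    return (r ** 2).sum() + lam * (f ** 2).sum()
```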
And (2.6) completing target identification detection of each frame in the whole tracked video sequence by repeating the steps (1.3) to (2.5), and finally realizing single-target tracking of the pedestrians.
The specific process of the step (2.6) is as follows:
(2.6.1) acquiring information of a specified target pedestrian selected from the reference frame of the tracked video sequence through the steps (1.1) to (1.2), and using the obtained new _ filter for classification of a foreground and a background of the target of a subsequent test frame;
(2.6.2) repeating the steps (1.3) to (2.5) for each frame in the subsequent test frames until the last frame, thereby completing the identification and detection of the specified pedestrian target for each frame in the whole video sequence, and finally realizing the tracking of the single specified pedestrian target of the reference frame on the whole video sequence.
In the embodiment, an online updating strategy pedestrian single-target tracking method fusing pedestrian characteristics is constructed by using Python language and PyTorch framework.
The implementation mainly involves a classification task and a regression task: a new updating strategy is adopted in the classification task, the target center position is corrected in the regression task, and candidate bounding boxes generated from the first-frame bounding box are newly added as an innovation point.
The video sequence is taken as input, and the bounding box of the pedestrian target to be tracked is manually annotated in the first frame as the pedestrian target object to be tracked in subsequent frames. Data enhancement is performed on the first-frame image, i.e., a training image set is obtained after operations such as flipping, mirroring and shifting, and the corresponding manually annotated pedestrian-target bounding boxes are obtained according to the enhancement mode to form the training sample set. Feature extraction is performed on the training sample set with the backbone network, which consists of a ResNet50 network pre-trained on the ImageNet dataset and a two-layer convolutional network trained offline, giving the training-sample feature set. The training-sample feature set is taken as input and sent to the model predictor to predict the classification filter: the initializer module of the model predictor can effectively provide an initial estimate of the classification filter using only the target appearance, i.e., the manually annotated bounding box, together with a PrPooling operation; the initially estimated classification filter init_filter is then sent to the optimizer module, which optimizes it by the steepest gradient descent method. Through experimental summary, 5 optimization iterations are performed, i.e., i = 5 in formula (7), and the learning rate α is set to 0.6, which gives better results, finally yielding the optimized classification filter new_filter. The target object annotated on the reference frame is then identified and detected in the subsequent frames of the current video sequence, realizing tracking of the single specified pedestrian target over the current video sequence. During tracking of a subsequent frame, the frame is taken as the test frame; following the twin-network (Siamese) idea, the test frame shares the same backbone network as the reference frame for feature extraction, the extracted features are convolved with the optimized classification filter new_filter to obtain the response map, the predicted target center position is obtained according to step (2.1), and it is corrected according to the current tracking state to obtain the final predicted target center position.
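The filter optimization loop described above (5 iterations of gradient descent with learning rate α = 0.6, formula (7)) might be sketched as follows; classification_loss is a placeholder for the computation of L(f) from the training-sample features and labels.

```python
import torch

def optimize_filter(init_filter, classification_loss, iters=5, alpha=0.6):
    """Gradient-descent refinement of the classification filter, formula (7)."""
    f = init_filter.clone().requires_grad_(True)
    for _ in range(iters):
        loss = classification_loss(f)        # L(f): residuals plus regularization
        grad, = torch.autograd.grad(loss, f)
        with torch.no_grad():
            f = f - alpha * grad             # f^(i) = f^(i-1) - alpha * grad L(f^(i-1))
        f.requires_grad_(True)
    return f.detach()                        # optimized classification filter new_filter
```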
Using the target bounding-box size of the previous frame (or, if the current test frame is the second frame, the size of the manually annotated reference-frame bounding box) as the initial candidate bounding box, and combined with the size of the manually annotated reference-frame bounding box, 14 candidate bounding boxes are randomly generated at the final predicted target center position; the IoU of the 14 candidate bounding boxes is computed through the regression branch, the 14 candidate boxes are optimized directly by gradient descent, and finally the average of the coordinates of the three candidate boxes with the largest IoU is taken as the final predicted bounding box, completing target identification and detection for the current test frame. Each subsequent test frame of the current video sequence is then identified and detected in turn until the last frame, thereby completing identification and detection of the specified pedestrian target in every frame of the video sequence and finally realizing tracking of the single specified pedestrian target over the whole video sequence.
Table 1 shows the comparison of tracking performance between this method and other methods on a pedestrian video-sequence dataset; the method reaches 30 FPS on a GTX 1650, achieving real-time tracking. The method balances precision and speed and has certain practical value.
TABLE 1
Method         Ours    ATOM    DIMP    SiamCAR    SiamDW    SiamRPN++    Ocean
Success rate   0.740   0.696   0.681   0.655      0.650     0.633        0.617

Claims (10)

1. An online updating strategy pedestrian single-target tracking method fused with pedestrian characteristics is characterized by comprising the following steps:
step 1: classification tasks in the tracking process:
(1.1) extracting the features of the reference frame in the video sequence and each frame in the video sequence through a feature extraction network, namely: selecting a first frame of a video sequence as a reference frame, selecting a target to be tracked in a mode of manually marking a target boundary frame, and identifying and detecting the selected target to be tracked of the reference frame in each frame, namely a test frame, in a subsequent video sequence, so that the target to be tracked in the video sequence is tracked, and the pedestrian single-target tracking process can be realized;
(1.2) using the features of the reference frame extracted in step (1.1) and its manually annotated target bounding box, training a classification filter online through a model predictor, for distinguishing the foreground, i.e., the target to be tracked, from the background in subsequent frames and predicting the target position;
(1.3) taking the subsequent frame in the current sequence to be tracked as the test frame and performing feature extraction, namely performing feature extraction on the test frame with the feature extraction network to obtain the test-frame feature x_test;
(1.4) performing convolution processing using the classification filter new_filter optimized in step (1.2) and the test-frame features obtained in step (1.3), as shown in formula (8):
s_test = x_test ∗ f_new   (8)
where s_test is the test-frame response map; x_test represents the features of the test frame; ∗ represents the convolution calculation; f_new is the optimized classification filter new_filter;
(1.5) judging the tracking state of the current test frame according to the test-frame response map s_test obtained in step (1.4), the tracking state being divided into a normal state, an uncertain state and a not-found state; the normal state represents a simple test-frame scene in which the target to be tracked and the background can be simply distinguished through the classification task; the uncertain state represents that the scene of the current test frame is complex and affected by interfering objects and the background, so that it is difficult to accurately distinguish the target to be tracked from the background; the not-found state represents that the scene of the current test frame is complex and the target is occluded, or the target to be tracked cannot be distinguished from the background;
step 2: regression tasks in the tracking process:
(2.1) predicting the target center position from the test-frame response map s_test obtained in the classification task of step (1.4);
(2.2) correcting (2.1) the predicted target center position by combining the inherent characteristics of the pedestrian target in the motion process to obtain the final predicted target center position;
(2.3) according to the final predicted target center position obtained in the step (2.2) and the target boundary frame of the previous frame, if the current frame is the second frame, taking the target boundary frame marked by the first frame as an initial candidate boundary frame, if the current frame is the subsequent frame, taking the boundary frame predicted in the step (2.4) as an initial candidate boundary frame, and combining the target boundary frame marked by the reference frame as a reference boundary frame to generate a candidate boundary frame set around the final predicted target center position obtained in the step (2.2);
(2.4) sending the a + b optimized candidate bounding box sets obtained in the step (2.3) and the characteristics of the test frame obtained in the step (1.3) to a bounding box prediction module for target bounding box prediction;
(2.5) updating the classification filter new_filter according to the tracking state of the current test frame judged in step (1.5); a pedestrian may be occluded, cross paths with other pedestrians, or blend into the background during motion, and optimizing and updating the classification filter new_filter according to the judged tracking state of the current target prevents background or distractor information from being introduced;
and (2.6) completing target identification detection of each frame in the whole tracked video sequence by repeating the steps (1.3) to (2.5), and finally realizing single-target tracking of the pedestrians.
2. The pedestrian single-target tracking method of the online updating strategy fused with the pedestrian characteristics according to claim 1, characterized in that the feature extraction network in the step (1.1) is a ResNet50 module structure; the ResNet50 module structure is formed by connecting 4 residual blocks in series, wherein the names of the 4 residual blocks are Block1, Block2, Block3 and Block 4; and the ResNet50 module structure is connected with two convolutional layers, so that a backbone network for feature extraction is formed, and the backbone network is used for extracting the features of the current reference frame or test frame image.
3. The pedestrian single-target tracking method of the online updating strategy fused with the pedestrian characteristics as claimed in claim 1, wherein the model predictor in the step (1.2) is composed of an initializer module and an optimizer module; the initializer module can effectively provide an initial estimate of the classification filter using only the appearance of the target to be tracked; the optimizer module is used for optimizing the initially estimated classification filter init_filter to finally obtain the optimized classification filter new_filter, which performs target foreground and background classification on subsequent frames of the tracked video sequence and predicts the center position of the target to be tracked for rough positioning.
4. The method for tracking the pedestrian single target by the online updating strategy fused with the pedestrian characteristics as claimed in claim 1, wherein in the step (1.2), the classification filter new _ filter is obtained by online training of a model predictor by using the information of the reference frame, and the method specifically comprises the following steps:
(1.2.1) respectively carrying out turning, mirroring, blurring and rotating data enhancement operations on the reference frame by utilizing the reference frame information of the current tracked video sequence, including the reference frame image information and a manually-specified target annotation bounding box, obtaining images with respective operation effects, forming a set by the images to serve as an image set subjected to data enhancement processing, and simultaneously obtaining the corresponding annotation bounding box after the data enhancement;
(1.2.2) extracting the characteristics of the image set processed in the step (1.2.1) by using the characteristic extraction network in the step (1.1) to obtain a group of training sample characteristics, and sending the training sample characteristics to a model initialization module for initial estimation of a classification filter to obtain an initial estimated classification filter init _ filter;
(1.2.3) using the initially estimated classification filter init_filter obtained in step (1.2.2) together with the group of training-sample features obtained by performing feature extraction on the image set processed in step (1.2.1) with the feature extraction network of step (1.1), sending them into the optimizer module and performing optimization to obtain the optimized classification filter new_filter for foreground/background classification of the target in subsequent test frames.
5. The method for tracking the pedestrian single target by the online updating strategy fused with the pedestrian characteristics as claimed in claim 4, wherein the initial estimation of the classification filter by the model initialization module in step (1.2.2) is to extract the features within the annotation bounding box by performing a PrRoI Pooling operation on the training-sample features; the obtained features are the target features and are used as the output, namely the initially estimated classification filter init_filter;
the specific implementation process of the step (1.2.3) can be described by the following steps:
① Calculating the reference-frame response map s_ref:
The classification filter init_filter initially estimated in step (1.2.2) is convolved with the training-sample features to obtain the response map s_ref, i.e., the response map of the reference frame, as shown in equation (1):
s_ref = x_ref ∗ f_init   (1)
where x_ref represents the features of the reference frame, namely the training-sample features; ∗ represents the convolution calculation; f_init is the initially estimated classification filter init_filter;
② Calculating the difference r(s, c) between the reference-frame response map s_ref and the reference-frame label y_hn^ref:
According to the reference-frame response map s_ref calculated in step (1.2.2), represented in the form of a 19 × 19 two-dimensional matrix, the reference-frame pedestrian-target annotation bounding box is scaled onto the 19 × 19 two-dimensional matrix to obtain the position c of the annotated pedestrian target in the 19 × 19 matrix;
[Formulas (2) and (3): definitions of the label y_hn and the mask parameter m_c over the 19 × 19 matrix; formula images not reproduced in the text]
in formula (2) and formula (3), t represents each position of a 19 × 19 two-dimensional matrix, and c represents a target position;
v_c represents the spatial coefficients obtained from training; ρ_k is a distance calculation function determined by equations (4-1) and (4-2):
[Formulas (4-1) and (4-2): definition of the distance function ρ_k; formula images not reproduced in the text]
where y_hn represents the real label of the current frame's response map; m_c represents the mask parameter, used to determine the area of the current target in the 19 × 19 two-dimensional matrix, with m_c ≈ 1 in the area corresponding to the target and m_c ≈ 0 in the background area; the label y_hn and the mask parameter m_c are each represented by a 19 × 19 two-dimensional matrix;
The reference-frame label y_hn^ref and mask parameter m_c^ref are obtained from the reference frame through formulas (2) and (3), and the difference r(s, c) between the reference-frame response map s_ref obtained in step ① and the reference-frame label y_hn^ref is calculated, as shown in equation (5):
r(s, c) = v_c · (m_c · s + (1 − m_c) · max(0, s) − y_hn)   (5)
where s is the reference-frame response map s_ref; v_c is the spatial weight; m_c is the reference-frame mask parameter m_c^ref; y_hn is the reference-frame label y_hn^ref;
③ Regularization is added to the difference r(s, c) obtained in step ② to obtain L(f), shown as formula (6); L(f) is back-propagated as the loss between the reference-frame response map s_ref and the reference-frame label y_hn^ref, so as to optimize the classification filter:
L(f) = Σ_{(x_j, c_j) ∈ S_train} ‖r(x_j ∗ f, c_j)‖² + λ‖f‖²   (6)
where S_train = {(x_j, c_j)} is the set of training samples, in which x_j are the training-sample features extracted by the feature extraction network and c_j is the center coordinate of the sample's annotated target, i.e., the center point of the annotation bounding box; ∗ represents the convolution calculation; λ is a regularization factor; f is the classification filter being optimized.
6. The method for tracking the pedestrian single target by the online updating strategy fused with the pedestrian characteristics according to claim 5, wherein the classification filter f in step ③ is optimized by back propagation, adopting the steepest gradient descent method shown in formula (7):
f^(i) = f^(i−1) − α · ∇L(f^(i−1))   (7)
where f^(i) represents the classification filter after the i-th optimization; α represents the learning rate; ∇ represents the gradient calculation.
7. The pedestrian single-target tracking method based on the online updating strategy fused with the pedestrian characteristics according to claim 1, wherein the specific implementation manner of the step (1.5) is composed of the following steps:
(1.5.1) The tracking state of the current test frame is judged according to the test-frame response map s_test obtained in step (1.4), the tracking state being divided into the normal state, the uncertain state and the not-found state:
① If only the target center gives the highest response in the test-frame response map s_test, this indicates that at this moment only the target to be tracked is present, or that the target to be tracked has a clear boundary with the background; that is, the tracking state of the current test frame is the normal state, representing a simple test-frame scene in which the target to be tracked and the background can be simply distinguished through the classification task;
② The target area in the test-frame response map s_test is the area framed around the position of the highest response score using the target bounding-box size of the previous frame; if the response scores within the target area are relatively disordered and the highest response score is also clearly reduced, this represents that the target to be tracked is confused with the background or that an interfering object is close to the target to be tracked, i.e., the tracking state is the uncertain state;
③ If the response scores within the target area of the test-frame response map s_test are even more disordered than in the uncertain state, and at the same time the highest response score is reduced more obviously than in the uncertain state, the target to be tracked is severely occluded and the tracking state is the not-found state;
(1.5.2) The response-score variance of the target area in the test-frame response map s_test and the highest response score of the test frame are calculated to determine the state of the target to be tracked in the current search area:
(1.5.2.1) The mean σ̄_m of the target-area response-score variances over the m frames before the test frame is calculated in a sliding manner; the target area is the area framed in the test-frame response map s_test around the position of the highest response score using the target bounding-box size of the previous frame, and the variance σ of the target-area response scores is recorded for each of the m frames before the test frame, as shown in formula (9):
σ = (1 / (w·h)) · Σ_i (score_i − score_mean)²   (9)
where score_i is the score at each position of the corresponding target area in the test-frame response map; score_mean is the mean value over the positions of the corresponding target area in the test-frame response map s_test; w·h is the width × height size of the corresponding target area in s_test;
(1.5.2.2) The mean σ̄_m of the response-score variances of the m frames is calculated according to formula (10):
σ̄_m = (1/m) · Σ_{j=1}^{m} σ_j   (10)
where σ_j represents the target-area response-score variance of each of the m frames;
(1.5.2.3) Meanwhile, the highest response score of the test frame is the maximum value of the response scores in the test-frame response map s_test; the highest response score max_score of each of the m frames before the test frame is recorded, and the mean max̄_m of the highest response scores of the m frames is calculated using equation (11):
max̄_m = (1/m) · Σ_{j=1}^{m} max_score_j   (11)
where max_score_j represents the highest response score of each frame;
(1.5.2.4) According to the normal-state, uncertain-state and not-found-state conditions stated in step (1.5.1), the tracking state of the test frame is judged by combining the mean σ̄_m of the target-area response-score variances of the m frames before the test frame obtained in step (1.5.2.2) and the mean max̄_m of the highest response scores of the m frames before the test frame obtained in step (1.5.2.3):
if formula (12) is satisfied, the tracking state of the test frame is the uncertain state:
[Formula (12): the uncertain-state condition, comparing the test frame's target-area response-score variance and highest response score with σ̄_m, max̄_m and the scale factors k_1, k_2; formula image not reproduced in the text]
if formula (13) is satisfied, the tracking state of the test frame is the not-found state:
[Formula (13): the not-found condition, comparing the test frame's target-area response-score variance and highest response score with σ̄_m, max̄_m and the scale factors k_1, k_2; formula image not reproduced in the text]
the other cases are regarded as normal states;
where σ̄_m is the mean of the target-area response-score variances over the m frames before the current test frame; max̄_m is the mean of the highest response scores of the test-frame response map s_test over the m frames before the test frame; k_1, k_2 are scale factors.
8. The pedestrian single-target tracking method of the online updating strategy fused with the pedestrian characteristics according to claim 1, wherein the step (2.1) specifically refers to:
(2.1.1) In the test-frame response map s_test obtained in step (1.4), the position coordinate corresponding to the maximum response score is taken as the first response point; if the tracking state of the test frame is the normal tracking state and no interfering object is encountered, the first response point is taken as the predicted target center position;
(2.1.2) The target area in the test-frame response map s_test is the area framed around the position of the maximum response score using the target bounding-box size of the previous frame; the part of s_test outside this framed area is outside the target area, and the position corresponding to the highest response score outside the target area is taken as the second response point;
(2.1.3) When the highest response score of the second response point is greater than 0.5 times the highest response score of the first response point, the second response point is considered to be a target-like distractor in the background;
(2.1.4) Let the current first response point be at position c_1 = [x_1, y_1], the second response point at c_2 = [x_2, y_2], and the center point of the previous frame's target in the test-frame response map s_test at c_0 = [x_0, y_0], i.e., the center point of the target area in the response map; the positional offsets of c_1 and c_2 relative to c_0 are given by equations (14) and (15) respectively:
d_1 = √((x_1 − x_0)² + (y_1 − y_0)²)   (14)
d_2 = √((x_2 − x_0)² + (y_2 − y_0)²)   (15)
(2.1.5) judging the real position of the current target:
if the offset calculated according to the equations (14) and (15) is in different value ranges, returning different response point positions as predicted target center positions, as shown in equations (16) and (17): in the formula, Ω represents a threshold value in the value range.
c_1, (d_1 > Ω & d_2 < Ω) | (d_1 > Ω & d_2 > Ω) | (d_1 < d_2 & d_1 < Ω & d_2 < Ω)   (16)
c_2, (d_1 < Ω & d_2 > Ω) | (d_1 > d_2 & d_1 < Ω & d_2 < Ω)   (17)
9. The pedestrian single-target tracking method of the online updating strategy fused with the pedestrian characteristics according to claim 1, wherein the step (2.2) specifically refers to: according to the inherent characteristics of the pedestrian target during motion, target tracking is normal in the normal state, the pedestrian target does not undergo violent scale changes while moving, and the pedestrian moves relatively smoothly; therefore the offset of the target center in the normal tracking state is recorded: the relative offset of the target center between consecutive frames over the last v frames is recorded in a sliding-window manner, and the mean of these offsets is taken as the predicted offset, with v ranging from 12 to 18; if the tracking state is the uncertain state, the predicted target center position obtained in step (2.1) is corrected by the predicted offset, finally giving the final predicted target center position.
10. The pedestrian single-target tracking method of the online updating strategy fused with the pedestrian characteristics according to claim 1, wherein the step (2.3) is implemented by the following specific steps:
(2.3.1) The final predicted target center position is obtained according to step (2.2), and the target bounding box of the previous frame is taken as the initial candidate bounding box; if the current frame is the second frame, the target bounding box annotated in the first frame is taken as the initial candidate bounding box, and if it is a later frame, the bounding box predicted in step (2.4) is taken as the initial candidate bounding box; a candidate bounding boxes of different proportions are then randomly generated around the final predicted target center position to form the initial candidate bounding-box set, where the value range of a is 7 to 15;
(2.3.2) According to the inherent characteristics of the pedestrian target during motion, the scale of the pedestrian target changes relatively smoothly; combined with the manually annotated target bounding box of the reference frame in step (1.1) as the reference candidate bounding box, b reference candidate bounding boxes of different proportions are randomly generated around the finally predicted target center position to form the reference candidate bounding-box set, where the value range of b is 3 to 7;
(2.3.3) fusing the initial candidate bounding box set obtained in the step (2.3.1) and the reference candidate bounding box set obtained in the step (2.3.2) to obtain a + b candidate bounding boxes serving as candidate bounding box sets;
the target bounding box prediction process in the step (2.4) comprises the following steps:
(2.4.1) Since the specified tracked target is not known before annotation, the target information annotated in the reference frame must be combined when predicting the bounding box, and therefore this information is first extracted, namely: the feature extraction network ResNet50 of step (1.1) is used to perform feature extraction on the reference frame; the reference frame is sent into the ResNet50 feature extraction network and processed by the 4 residual blocks in sequence, the reference-frame features output by Block3 (layer 1) and Block4 (layer 2) are extracted, convolved, pooled by PrPooling and fused, and then passed through a fully connected layer to obtain a modulation vector, which serves as the pedestrian target information annotated in the reference frame;
(2.4.2) For the test frame, the test-frame features obtained by the ResNet50 feature extraction network of step (1.1) are passed through two convolutional layers, PrPooling is then performed on each of the a + b bounding-box regions to extract their internal features, the modulation vector, i.e., the information of the pedestrian target annotated in the reference frame, is combined, and the intersection-over-union ratio IoU of each of the a + b bounding boxes is then predicted through a fully connected layer; the bounding-box gradients are calculated from the IoU, and the a + b candidate bounding boxes are each optimized to obtain the optimized candidate bounding boxes, which serve as the new optimized candidate bounding-box set;
(2.4.3) Step (2.4.2) is repeated for iterative optimization; after 5 iterations, the average of the coordinates of the three candidate bounding boxes with the largest IoU is taken as the coordinates of the predicted bounding box, i.e., the final predicted target bounding box;
in the step (2.5), the new _ filter is optimized and updated according to the tracking state of the test frame as follows:
(2.5.1) when the tracking state of the current test frame is determined to be a normal tracking state, optimizing and updating the new _ filter of the classification filter every n frames or when an interfering object is encountered; wherein the value range of n is 15-25;
The optimization update of the classification filter new_filter is implemented using the test-frame response map s_test obtained in step (1.4) and the test-frame label y_hn^test, where the test-frame response map s_test is expressed as a 19 × 19 two-dimensional matrix:
(i) the boundary frame predicted in the step (2.4.3) is scaled on a 19 × 19 two-dimensional matrix, and then a target position c of the pedestrian target of the test frame on the 19 × 19 two-dimensional matrix can be obtained;
(ii) Following the calculation of the target mask parameter m_c and the label y_hn in step (1.2.3), the target mask parameter m_c^test and the label y_hn^test of the test frame are calculated;
(iii) The residual r(s, c) between the test-frame response map s_test and the test-frame label y_hn^test is calculated using equation (2); in this case, in equation (2), s is the test-frame response map s_test, v_c is the spatial weight, m_c is the test-frame target mask parameter m_c^test, and y_hn is the test-frame label y_hn^test;
(iv) Following the operation of step (1.2.3), the loss between the test-frame response map s_test and the test-frame label y_hn^test is obtained, and the classification filter new_filter is optimized and updated according to this loss, so as to obtain a newly optimized classification filter new_filter for foreground/background classification of the target in subsequent test frames;
(2.5.2) when the tracking state of the current test frame is determined to be an uncertain state, optimizing and updating the new _ filter of the classification filter;
(2.5.3) when the tracking state of the current test frame is determined to be the not-found state, the classification filter new_filter is not optimized or updated;
the specific process of the step (2.6) is as follows:
(2.6.1) acquiring information of a specified target pedestrian selected from the reference frame of the tracked video sequence through the steps (1.1) to (1.2), and using the obtained new _ filter for classification of a foreground and a background of the target of a subsequent test frame;
(2.6.2) repeating the steps (1.3) to (2.5) for each frame in the subsequent test frames until the last frame, thereby completing the identification and detection of the specified pedestrian target for each frame in the whole video sequence, and finally realizing the tracking of the single specified pedestrian target of the reference frame on the whole video sequence.
CN202111294661.6A 2021-11-03 2021-11-03 Pedestrian single-target tracking method based on online updating strategy and fusing pedestrian characteristics Pending CN114067240A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111294661.6A CN114067240A (en) 2021-11-03 2021-11-03 Pedestrian single-target tracking method based on online updating strategy and fusing pedestrian characteristics

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111294661.6A CN114067240A (en) 2021-11-03 2021-11-03 Pedestrian single-target tracking method based on online updating strategy and fusing pedestrian characteristics

Publications (1)

Publication Number Publication Date
CN114067240A true CN114067240A (en) 2022-02-18

Family

ID=80273653

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111294661.6A Pending CN114067240A (en) 2021-11-03 2021-11-03 Pedestrian single-target tracking method based on online updating strategy and fusing pedestrian characteristics

Country Status (1)

Country Link
CN (1) CN114067240A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115471787A (en) * 2022-08-09 2022-12-13 东莞先知大数据有限公司 Construction site object stacking detection method and device and storage medium
CN115471787B (en) * 2022-08-09 2023-06-06 东莞先知大数据有限公司 Method and device for detecting stacking of objects on site and storage medium

Similar Documents

Publication Publication Date Title
CN110427839B (en) Video target detection method based on multi-layer feature fusion
CN110097568B (en) Video object detection and segmentation method based on space-time dual-branch network
CN107424171B (en) Block-based anti-occlusion target tracking method
CN112184752A (en) Video target tracking method based on pyramid convolution
CN110175649B (en) Rapid multi-scale estimation target tracking method for re-detection
CN107633226B (en) Human body motion tracking feature processing method
CN110276264B (en) Crowd density estimation method based on foreground segmentation graph
CN110942471B (en) Long-term target tracking method based on space-time constraint
CN111882586B (en) Multi-actor target tracking method oriented to theater environment
CN112052802B (en) Machine vision-based front vehicle behavior recognition method
CN111582349B (en) Improved target tracking algorithm based on YOLOv3 and kernel correlation filtering
CN112651998B (en) Human body tracking algorithm based on attention mechanism and double-flow multi-domain convolutional neural network
CN110032952B (en) Road boundary point detection method based on deep learning
CN112308921B (en) Combined optimization dynamic SLAM method based on semantics and geometry
CN110555868A (en) method for detecting small moving target under complex ground background
CN110310305B (en) Target tracking method and device based on BSSD detection and Kalman filtering
CN113592894B (en) Image segmentation method based on boundary box and co-occurrence feature prediction
CN113362341B (en) Air-ground infrared target tracking data set labeling method based on super-pixel structure constraint
CN113888461A (en) Method, system and equipment for detecting defects of hardware parts based on deep learning
CN111754545A (en) Dual-filter video multi-target tracking method based on IOU matching
CN112329784A (en) Correlation filtering tracking method based on space-time perception and multimodal response
CN113920170A (en) Pedestrian trajectory prediction method and system combining scene context and pedestrian social relationship and storage medium
CN111429485B (en) Cross-modal filtering tracking method based on self-adaptive regularization and high-reliability updating
CN113052184A (en) Target detection method based on two-stage local feature alignment
CN115359407A (en) Multi-vehicle tracking method in video

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination