CN114067240A - Pedestrian single-target tracking method based on online updating strategy and fusing pedestrian characteristics - Google Patents

Pedestrian single-target tracking method based on online updating strategy and fusing pedestrian characteristics

Info

Publication number
CN114067240A
CN114067240A
Authority
CN
China
Prior art keywords
target
frame
test
response
pedestrian
Prior art date
Legal status
Pending
Application number
CN202111294661.6A
Other languages
Chinese (zh)
Inventor
薛彦兵
丁明远
袁立明
蔡靖
温显斌
Current Assignee
Tianjin University of Technology
Original Assignee
Tianjin University of Technology
Priority date
Filing date
Publication date
Application filed by Tianjin University of Technology
Priority to CN202111294661.6A
Publication of CN114067240A


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/20 Analysis of motion
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/70 Determining position or orientation of objects or cameras
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20081 Training; Learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

A pedestrian single-target tracking method with an online updating strategy that fuses pedestrian characteristics. Given only the target state in an initial frame, the pedestrian tracking problem in subsequent frames is decomposed into a classification task and a regression task: the classification task classifies image regions into foreground and background with a classification filter so as to predict the rough position of the target in the image, and the regression task estimates the target state, usually represented by a bounding box, from the rough position obtained by the classification task and a set of candidate bounding boxes. The rough target position is corrected by combining characteristics inherent to pedestrian motion, different tracking states are defined according to the complexity of the current scene, and different online updating strategies are applied to the classification filter in each state to strengthen the discriminative ability of the classifier. The method thereby improves pedestrian single-target tracking performance, achieves a higher tracking success rate, and has practical value.

Description

Pedestrian single-target tracking method based on online updating strategy and fusing pedestrian characteristics
[ technical field ]
The invention relates to the fields of pattern recognition, image processing, computer vision and the like, in particular to an online updating strategy pedestrian single-target tracking method integrating pedestrian characteristics.
[ background of the invention ]
The visual tracking technology is an important subject in the field of computer vision (a branch of artificial intelligence), has important research significance, and has attracted much attention in recent years. Pedestrian target tracking is the key to intelligent applications such as pedestrian behavior analysis.
In the prior art, methods for tracking a single pedestrian target are mainly divided into two types:
1. Detect the video frame by frame with a pedestrian recognition algorithm and then link the detected pedestrian target boxes into a target trajectory. However, detection quality and detection time trade off against each other: the more complex the detection algorithm, the better the system extracts image features and the better its detection results, but the deeper the network, the more parameters it has and the longer its detection time, so real-time pedestrian target tracking becomes impossible. Conversely, the weaker the representational capacity of the network, the lower the detection accuracy and the more easily the pedestrian is lost, so the algorithm is hard to apply to real scenes; applying this method in practice therefore requires increasing the computing power and configuration of the system. The advantage of this scheme is that deep semantic features of the pedestrian target can be extracted and the recognition ability is strong. Its disadvantage is that detection is performed frame by frame without using video context information, so the detection frame rate of the system cannot be raised, and during real-time video target detection the video may exhibit motion blur and similar conditions, which can cause local tracking failures and reduce the tracking efficiency of the system.
2. Use a tracking algorithm, with the pedestrian target either framed manually in the first frame or detected by a recognition algorithm and then tracked by the tracking algorithm; this scheme can track the target over a short time. Its advantage is that the tracker has a simple structure and can run in real time. Its disadvantage is that during tracking the pedestrian target is prone to deformation, occlusion or illumination change, the target is easily lost, and once tracking fails the target cannot be found again, so the algorithm fails.
Therefore, focusing on pedestrians, single targets and short-term tracking, the present application provides a solution to the problems existing trackers face in pedestrian single-target tracking. Given only the target state in an initial frame, the tracking problem is decomposed, on the basis of a tracking algorithm, into a classification task and a regression task: the classification task robustly provides the rough position of the target in the image by classifying image regions into foreground and background, and the regression task estimates the target state, usually represented by a bounding box, so that the tracker can track the single pedestrian target in subsequent frames of the video sequence.
[ summary of the invention ]
The invention aims to provide a pedestrian single-target tracking method with an online updating strategy that fuses pedestrian characteristics, which overcomes the shortcomings of the prior art, achieves high single-target pedestrian tracking accuracy at real-time speed, and has practical value.
The technical scheme of the invention is as follows: an online updating strategy pedestrian single-target tracking method fused with pedestrian characteristics is characterized by comprising the following steps:
step 1: classification tasks in the tracking process:
(1.1) Extract the features of the reference frame and of each frame of the video sequence through a feature extraction network, namely: select the first frame of the video sequence as the reference frame, select the target to be tracked by manually marking its target bounding box, and then recognize and detect this selected target of the reference frame in each subsequent frame of the video sequence, i.e. each test frame, so that the target to be tracked is tracked throughout the video sequence and pedestrian single-target tracking is realized;
The feature extraction network is built on ResNet50. ResNet50 consists of 4 residual blocks (Residual Block) connected in series, named Block1, Block2, Block3 and Block4 and containing 50 convolution operations in total; this is known technology. Two convolutional layers are connected after ResNet50 to form the backbone network for feature extraction, which extracts the features of the current reference frame or test frame image.
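A minimal sketch of such a backbone, assuming PyTorch and torchvision's ResNet50; the layer names, channel sizes and the two extra convolutional layers are illustrative assumptions, since the text only specifies "ResNet50 plus two convolutional layers".

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

class Backbone(nn.Module):
    def __init__(self):
        super().__init__()
        trunk = resnet50(weights=None)
        # Keep everything up to Block4 (layer4); drop the classification head.
        self.trunk = nn.Sequential(*list(trunk.children())[:-2])
        # Two additional convolutional layers for the tracking features (assumed sizes).
        self.extra = nn.Sequential(
            nn.Conv2d(2048, 512, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(512, 512, kernel_size=3, padding=1),
        )

    def forward(self, x):
        return self.extra(self.trunk(x))

# Usage: features of a reference or test frame crop.
feat = Backbone()(torch.randn(1, 3, 288, 288))  # spatial size depends on the input crop
```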
(1.2) Using the features extracted in step (1.1) from the selected reference frame and its manually marked target bounding box, train a classification filter online through a model predictor; the filter distinguishes the target foreground from the background in subsequent frames and predicts the target position;
The model predictor consists of an initializer module and an optimizer module. The initializer module provides an effective initial estimate of the classification filter using only the appearance of the target to be tracked. The optimizer module optimizes the initially estimated classification filter init_filter to finally obtain the optimized classification filter new_filter, which performs foreground/background classification on subsequent frames of the tracked video sequence and predicts the center position of the target to be tracked for rough localization.
The classification filter new_filter is obtained from the reference frame information through online training of the model predictor, specifically by the following steps:
(1.2.1) Using the reference frame information of the currently tracked video sequence, including the reference frame image and the manually specified target annotation bounding box, apply flipping, mirroring, blurring and rotation data-enhancement operations to the reference frame; collect the resulting images into an image set that has undergone data enhancement, and obtain the correspondingly transformed annotation bounding boxes at the same time;
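A minimal sketch of the data enhancement in step (1.2.1), assuming PIL images and torchvision transforms; the blur kernel, rotation angle and other parameters are not specified in the text and are chosen here only for illustration.

```python
from torchvision import transforms

def augment_reference(ref_img):
    """Return the set of augmented reference-frame images (flip, mirror, blur, rotate)."""
    ops = {
        "flip": transforms.RandomVerticalFlip(p=1.0),
        "mirror": transforms.RandomHorizontalFlip(p=1.0),
        "blur": transforms.GaussianBlur(kernel_size=5, sigma=2.0),
        "rotate": transforms.RandomRotation(degrees=(15, 15)),
    }
    return {name: op(ref_img) for name, op in ops.items()}

# The annotation bounding box must be transformed with the same geometric operations
# (flip/mirror/rotate) so that each augmented image keeps a consistent label.
```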
(1.2.2) Use the feature extraction network of step (1.1) to extract features from the image set processed in step (1.2.1), obtaining a group of training sample features, and send them to the model initializer module for the initial estimate of the classification filter, giving the initially estimated classification filter init_filter, as shown in fig. 2. The initializer module performs this initial estimate by applying a PrRoI Pooling operation to the training sample features and extracting the features inside the annotation bounding box; the resulting target features are output as the initially estimated classification filter init_filter.
(1.2.3) Send the initially estimated classification filter init_filter obtained in step (1.2.2), together with the training sample features extracted by the feature extraction network of step (1.1) from the image set processed in step (1.2.1), into the optimizer module, and obtain through optimization the optimized classification filter new_filter used for foreground/background classification of the target in subsequent test frames. The specific implementation can be described by the following steps:
① Calculate the reference frame response map s_ref:
Convolve the initially estimated classification filter init_filter from step (1.2.2) with the training sample features to obtain the response map s_ref, i.e. the response map of the reference frame, as shown in equation (1):
s_ref = x_ref * f_init    (1)
where x_ref denotes the features of the reference frame, i.e. the training sample features; * denotes the convolution calculation; f_init is the initially estimated classification filter init_filter;
② Calculate the difference r(s, c) between the reference frame response map s_ref and the reference frame label y_hn^ref:
The reference frame response map s_ref computed in step ① is represented as a 19 × 19 two-dimensional matrix. The pedestrian target annotation bounding box of the reference frame is scaled onto the 19 × 19 two-dimensional matrix to obtain the position c of the annotated pedestrian target in the 19 × 19 matrix;
y_hn(t) = Σ_{k=0}^{N-1} φ_k^y · ρ_k(||t − c||)    (2)
m_c(t) = Σ_{k=0}^{N-1} φ_k^m · ρ_k(||t − c||)    (3)
In equations (2) and (3), t denotes each position of the 19 × 19 two-dimensional matrix and c denotes the target position; φ_k denotes the spatial coefficients obtained from training; ρ_k is a distance calculation function determined by equations (4-1) and (4-2):
(equations (4-1) and (4-2): piecewise definition of the basis functions ρ_k over the distance ||t − c||)
Here y_hn denotes the real label of the current frame response map; m_c denotes the mask parameter, used to determine the region of the current target in the 19 × 19 two-dimensional matrix, with m_c ≈ 1 in the region corresponding to the target and m_c ≈ 0 in the background region; the label y_hn and the mask parameter m_c are both represented as 19 × 19 two-dimensional matrices;
The reference frame label y_hn^ref and mask parameter m_c^ref are obtained from the reference frame through equations (2) and (3), and the difference r(s, c) between the response map s_ref obtained in step ① and the reference frame label y_hn^ref is calculated as shown in equation (5):
r(s, c) = v_c · (m_c · s + (1 − m_c) · max(0, s) − y_hn)    (5)
where s is the reference frame response map s_ref; v_c is a spatial weight; m_c is the reference frame mask parameter m_c^ref; y_hn is the reference frame label y_hn^ref;
③ Add regularization to the difference r(s, c) obtained in step ② to obtain L(f) as shown in equation (6), which serves as the loss between the reference frame response map s_ref and the reference frame label y_hn^ref and is back-propagated to optimize the classification filter;
L(f) = (1 / |S_train|) Σ_{(x_j, c_j) ∈ S_train} ||r(x_j * f, c_j)||² + ||λf||²    (6)
where S_train = {(x_j, c_j)} is the set of training samples, x_j being the training sample features extracted by the feature extraction network and c_j the center coordinates of the annotated target of the sample, i.e. the coordinates of the center point of the annotation bounding box; * denotes the convolution calculation; λ is a regularization factor; f is the classification filter being optimized.
The optimized classification filter f is obtained by back-propagation, using the steepest gradient descent method shown in equation (7):
f^(i+1) = f^(i) − α · ∇L(f^(i))    (7)
where f^(i) denotes the classification filter after the i-th optimization step; α denotes the learning rate; ∇ denotes the gradient calculation;
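A minimal sketch of the online filter optimization of step (1.2.3), assuming PyTorch. The residual follows equation (5), the loss follows equation (6), and the update follows the steepest-descent rule of equation (7); tensor shapes, the fixed learning rate and the helper names are illustrative assumptions rather than the exact implementation.

```python
import torch
import torch.nn.functional as F

def optimize_filter(init_filter, samples, labels, masks, v_c, lam=0.01, alpha=0.1, iters=5):
    """samples: (N, C, H, W) training features; labels/masks: (N, 1, 19, 19); v_c: spatial weight."""
    f = init_filter.detach().clone().requires_grad_(True)  # init_filter: (1, C, kH, kW)
    for _ in range(iters):
        s = F.conv2d(samples, f, padding=f.shape[-1] // 2)             # response maps, eq. (1)/(8)
        r = v_c * (masks * s + (1 - masks) * s.clamp(min=0) - labels)  # residual, eq. (5)
        loss = (r ** 2).mean() + (lam * f ** 2).sum()                  # regularized loss, eq. (6)
        grad, = torch.autograd.grad(loss, f)
        f = (f - alpha * grad).detach().requires_grad_(True)           # steepest descent, eq. (7)
    return f.detach()  # new_filter
```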
(1.3) Take the subsequent frames of the sequence currently being tracked as test frames and perform feature extraction, i.e. extract the features of the test frame with the feature extraction network to obtain the test frame features x_test;
(1.4) Convolve the classification filter new_filter optimized in step (1.2) with the test frame features obtained in step (1.3), as shown in equation (8):
s_test = x_test * f_new    (8)
where s_test is the test frame response map; x_test denotes the features of the test frame; * denotes the convolution calculation; f_new is the optimized classification filter new_filter;
(1.5) From the test frame response map s_test obtained in step (1.4), judge the tracking state of the current test frame; the tracking state is divided into a normal state, an uncertain state and a not-found state. The normal state means the scene of the test frame is simple and the target to be tracked and the background can be separated directly by the classification task; the uncertain state means the scene of the current test frame is complex and affected by distractors and background, so the target to be tracked and the background are difficult to identify accurately; the not-found state means the scene of the current test frame is complex and the target is occluded, or the target to be tracked and the background cannot be distinguished. The specific implementation comprises the following steps:
(1.5.1) Judge the tracking state of the current test frame from the test frame response map s_test obtained in step (1.4); the tracking state is divided into a normal state, an uncertain state and a not-found state:
① If the highest response in the test frame response map s_test occurs only at the target center, then only the target to be tracked is present or the target and the background have a clear boundary; the tracking state of the current test frame is the normal state, meaning the scene of the test frame is simple and the target to be tracked and the background can be separated directly by the classification task;
② The target region in the test frame response map s_test is the region framed around the position of the highest response score using the target bounding box size of the previous frame. If the response scores inside the target region are relatively disordered and the highest response score also drops noticeably, the target to be tracked is in the uncertain state, meaning the target is blending into the background or a distractor is close to the target;
③ If the response scores inside the target region of the test frame response map s_test are even more disordered than in the uncertain state and the highest response score drops even more markedly than in the uncertain state, the target to be tracked is severely occluded and is in the not-found state;
(1.5.2) Calculate the response-score variance of the target region in the test frame response map s_test and the highest response score of the test frame, to determine the state of the target to be tracked in the current search region:
(1.5.2.1) Compute, in a sliding manner, the mean σ_mean of the target-region response-score variances of the m frames preceding the test frame. The target region is the region framed around the position with the highest response score in the test frame response map s_test, using the target bounding box size of the previous frame. The target-region response-score variance σ of each of the m frames preceding the test frame is recorded, as shown in equation (9):
σ = (1 / (w · h)) Σ_i (score_i − score_mean)²    (9)
where score_i is the score of each position in the corresponding target region of the test frame response map, score_mean is the mean over the positions of the corresponding target region in the test frame response map s_test, and w · h is the width times the height of the corresponding target region in s_test;
(1.5.2.2) Then compute the mean according to equation (10), i.e. the response-score variance mean σ_mean of the m frames:
σ_mean = (1 / m) Σ_{j=1}^{m} σ_j    (10)
where σ_j denotes the target-region response-score variance of each of the m frames;
(1.5.2.3) At the same time, the highest response score of the test frame is the maximum of the response scores in the test frame response map s_test. The highest response score max_score of each of the m frames preceding the test frame is recorded, and the mean max_score_mean of the highest response scores of the m frames is calculated using equation (11):
max_score_mean = (1 / m) Σ_{j=1}^{m} max_score_j    (11)
where max_score_j denotes the highest response score of each frame;
(1.5.2.4) According to the normal, uncertain and not-found conditions stated in step (1.5.1), combine the target-region response-score variance mean σ_mean of the m frames preceding the test frame obtained in step (1.5.2.2) with the highest-response-score mean max_score_mean of the m frames preceding the test frame obtained in step (1.5.2.3) to judge the tracking state of the test frame:
If condition (12) is satisfied, the tracking state of the test frame is the uncertain state:
(condition (12): a comparison of the current frame's target-region response-score variance and highest response score against σ_mean and max_score_mean involving the scale factor k_1)
If condition (13) is satisfied, the tracking state of the test frame is the not-found state:
(condition (13): the corresponding, stricter comparison involving the scale factor k_2)
All other cases are regarded as the normal state;
where σ_mean is the response-score variance mean of the target region over the m frames preceding the current test frame, max_score_mean is the mean of the highest response scores of the test frame response maps s_test over the m frames preceding the test frame, and k_1, k_2 are scale factors.
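A minimal sketch of the tracking-state decision of step (1.5), assuming NumPy. Equations (9)-(11) are implemented directly; because conditions (12) and (13) are only rendered as images in the source, the thresholding below (current highest response score against k1/k2 times its sliding mean, variance against its sliding mean) is an assumed form, not the exact formulas.

```python
import numpy as np

def target_region_stats(response, center, box_wh):
    """Variance and peak of the response scores inside the target region (eq. (9))."""
    w, h = box_wh
    x, y = center
    region = response[max(0, y - h // 2): y + h // 2 + 1,
                      max(0, x - w // 2): x + w // 2 + 1]
    return region.var(), response.max()

def tracking_state(sigma, max_score, sigma_hist, max_hist, k1=0.75, k2=0.5):
    """Classify the current frame as 'normal', 'uncertain' or 'not_found'."""
    sigma_mean = np.mean(sigma_hist)        # eq. (10), sliding mean over the last m frames
    max_score_mean = np.mean(max_hist)      # eq. (11)
    if sigma > sigma_mean and max_score < k2 * max_score_mean:
        return "not_found"                  # assumed form of condition (13)
    if sigma > sigma_mean and max_score < k1 * max_score_mean:
        return "uncertain"                  # assumed form of condition (12)
    return "normal"
```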
Step 2: regression tasks in the tracking process:
(2.1) Predict the target center position from the test frame response map s_test obtained in the classification task of step (1.4);
(2.1.1) From the test frame response map s_test obtained in step (1.4), take the position coordinate corresponding to the maximum response score in s_test as the first response point; if the tracking state of the test frame is the normal tracking state and no distractor is encountered, the first response point is taken as the predicted target center position;
(2.1.2) The target region in the test frame response map s_test is the region framed around the position of the maximum response score in s_test using the target bounding box size of the previous frame; the area of s_test outside this framed region lies outside the target region, and the position corresponding to the highest response score outside the target region is regarded as the second response point;
(2.1.3) When the highest response score of the second response point is greater than 0.5 times the highest response score of the first response point, the second response point is regarded as a target look-alike (distractor) in the background;
(2.1.4) Let the current first response point position be c_1[x_1, y_1], the second response point position be c_2[x_2, y_2], and the position in the test frame response map s_test of the target center point of the frame preceding the current test frame be c_0[x_0, y_0], i.e. the center point of the target region in the response map. The positional offsets of c_1 and c_2 relative to c_0 are given by equations (14) and (15) respectively:
d_1 = sqrt((x_1 − x_0)² + (y_1 − y_0)²)    (14)
d_2 = sqrt((x_2 − x_0)² + (y_2 − y_0)²)    (15)
(2.1.5) Judge the real position of the current target:
Depending on which value ranges the offsets calculated from equations (14) and (15) fall into, different response points are returned as the predicted target center position, as shown in equations (16) and (17):
c_1, if (d_1 > Ω & d_2 < Ω) | (d_1 > Ω & d_2 > Ω) | (d_1 < d_2 & d_1 < Ω & d_2 < Ω)    (16)
c_2, if (d_1 < Ω & d_2 > Ω) | (d_1 > d_2 & d_1 < Ω & d_2 < Ω)    (17)
In equations (16) and (17), Ω denotes the threshold of the value range;
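A minimal sketch of the center-point selection of steps (2.1.1)-(2.1.5), assuming NumPy. The conditions reproduce equations (16) and (17); the Euclidean form of the offsets d1, d2 follows equations (14)-(15) as reconstructed above, and omega stands for the threshold Ω.

```python
import numpy as np

def select_center(c1, c2, c0, second_score, first_score, omega):
    """Return the predicted target center from the first/second response points."""
    if second_score <= 0.5 * first_score:
        return c1                     # no credible distractor: keep the first response point
    d1 = float(np.hypot(c1[0] - c0[0], c1[1] - c0[1]))   # eq. (14)
    d2 = float(np.hypot(c2[0] - c0[0], c2[1] - c0[1]))   # eq. (15)
    if (d1 > omega and d2 < omega) or (d1 > omega and d2 > omega) or \
       (d1 < d2 and d1 < omega and d2 < omega):
        return c1                     # eq. (16)
    if (d1 < omega and d2 > omega) or (d1 > d2 and d1 < omega and d2 < omega):
        return c2                     # eq. (17)
    return c1
```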
(2.2) Correct the target center position predicted in step (2.1) by combining the characteristics inherent to the motion of a pedestrian target, to obtain the final predicted target center position, namely: according to the characteristics inherent to pedestrian motion, in the normal state tracking is reliable, the pedestrian target undergoes no drastic scale change during motion, and the pedestrian moves relatively smoothly; therefore the offset of the target center in the normal tracking state is recorded, the relative offset of the target center between consecutive frames over the last v frames is recorded in a sliding-window manner, and the mean of these offsets is taken as the predicted offset, with v in the range 12 to 18. If the tracking state is the uncertain state, the predicted target center position obtained in step (2.1) is corrected by the predicted offset to finally obtain the final predicted target center position, as sketched below.
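A minimal sketch of the center correction of step (2.2), assuming NumPy. The sliding-window length v follows the description above; how exactly the predicted offset is applied in the uncertain state is not fully specified, so applying it to the previous center is an assumption, and the names are illustrative.

```python
import numpy as np
from collections import deque

class CenterCorrector:
    def __init__(self, v=16):
        self.offsets = deque(maxlen=v)   # per-frame center offsets observed in the normal state

    def record_normal(self, prev_center, cur_center):
        self.offsets.append(np.subtract(cur_center, prev_center))

    def correct(self, state, predicted_center, prev_center):
        if state == "uncertain" and self.offsets:
            predicted_offset = np.mean(self.offsets, axis=0)     # mean offset over the window
            return np.asarray(prev_center, dtype=float) + predicted_offset
        return np.asarray(predicted_center, dtype=float)
```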
(2.3) Using the final predicted target center position obtained in step (2.2) and the target bounding box of the previous frame (if the current frame is the second frame, the target bounding box annotated in the first frame serves as the initial candidate bounding box; for later frames, the bounding box predicted in step (2.4) serves as the initial candidate bounding box), and combining the target bounding box annotated in the reference frame as the reference bounding box, generate a set of candidate bounding boxes around the target center position finally predicted in step (2.2). The specific implementation process is as follows:
(2.3.1) Take the final predicted target center position obtained in step (2.2) and the target bounding box of the previous frame as the initial candidate bounding box (the box annotated in the first frame if the current frame is the second frame, otherwise the box predicted in step (2.4)), and randomly generate a candidate bounding boxes of different scales around the final predicted target center position to form the initial candidate bounding box set;
(2.3.2) Because, according to the characteristics inherent to pedestrian motion, the scale of a pedestrian target changes relatively smoothly during motion, combine the manually annotated target bounding box of the reference frame from step (1.1) as the reference candidate bounding box, and randomly generate b reference candidate bounding boxes of different scales around the final predicted target center position to form the reference candidate bounding box set;
(2.3.3) Fuse the initial candidate bounding box set obtained in step (2.3.1) and the reference candidate bounding box set obtained in step (2.3.2) to obtain a + b candidate bounding boxes as the candidate bounding box set;
The value of a in step (2.3.1) ranges from 7 to 15; the value of b in step (2.3.2) ranges from 3 to 7.
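A minimal sketch of the candidate bounding box generation of step (2.3), assuming NumPy and boxes in (cx, cy, w, h) form. The scale-jitter amplitude and the uniform sampling are illustrative assumptions; the text only requires a boxes derived from the previous frame's box and b boxes derived from the reference-frame box, all centered on the final predicted target center.

```python
import numpy as np

def generate_candidates(center, prev_box, ref_box, a=10, b=4, scale_jitter=0.1, rng=None):
    rng = rng or np.random.default_rng()
    cx, cy = center

    def jitter(box, n):
        w, h = box[2], box[3]
        scales = rng.uniform(1 - scale_jitter, 1 + scale_jitter, size=(n, 2))
        return np.column_stack([np.full(n, cx), np.full(n, cy),
                                w * scales[:, 0], h * scales[:, 1]])

    initial_set = jitter(prev_box, a)     # a boxes around the previous-frame box size
    reference_set = jitter(ref_box, b)    # b boxes around the reference-frame box size
    return np.vstack([initial_set, reference_set])   # a + b candidate boxes
```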
(2.4) Send the a + b candidate bounding boxes obtained in step (2.3) and the test frame features obtained in step (1.3) to the bounding box prediction module for target bounding box prediction;
the process of target bounding box prediction in step (2.4) comprises the following steps:
(2.4.1) Because the designated tracked target is unknown before annotation, the bounding box prediction must incorporate the target information annotated in the reference frame, which therefore has to be extracted first, namely: the reference frame is fed into the ResNet50 feature extraction network of step (1.1) and processed by the four residual blocks in turn; the reference frame features output by Block3 (layer1) and Block4 (layer2) are extracted, passed through convolution and PrPooling, fused, and then passed through a fully connected layer to obtain a modulation vector, which serves as the annotated pedestrian target information of the reference frame;
(2.4.2) For the test frame, as shown in fig. 4, the test frame features obtained with the ResNet50 feature extraction network of step (1.1) are passed through two convolutional layers, and a PrPooling operation is applied to each of the a + b bounding box regions to extract their internal features; combined with the modulation vector, i.e. the annotated pedestrian target information of the reference frame, the intersection over union IoU (Intersection over Union) of each of the a + b bounding boxes is predicted through a fully connected layer. The gradient of each bounding box is then computed from its predicted IoU, and the a + b candidate bounding boxes are optimized separately to obtain optimized candidate bounding boxes as a new optimized candidate bounding box set;
(2.4.3) Repeat step (2.4.2) for iterative optimization; after 5 iterations, take the coordinate average of the three candidate bounding boxes with the largest predicted IoU as the coordinates of the predicted bounding box, i.e. the final predicted target bounding box.
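A minimal sketch of the IoU-guided box refinement of steps (2.4.2)-(2.4.3), assuming PyTorch. Here `iou_predictor` stands in for the bounding box prediction module (modulation vector, PrPooling and fully connected layers); its signature, the step size and the size-scaled gradient are assumptions. Only the refinement loop itself follows the text: maximize the predicted IoU by gradient ascent for 5 iterations, then average the top-3 boxes.

```python
import torch

def refine_boxes(iou_predictor, test_feat, modulation, boxes, iters=5, step=1.0):
    """boxes: (a+b, 4) tensor of (cx, cy, w, h) candidates."""
    boxes = boxes.clone()
    for _ in range(iters):
        boxes.requires_grad_(True)
        iou = iou_predictor(test_feat, modulation, boxes)   # predicted IoU per candidate
        grad, = torch.autograd.grad(iou.sum(), boxes)
        # Gradient ascent on the predicted IoU, scaled by box size (an assumed refinement detail).
        scale = boxes[:, 2:].repeat(1, 2).detach()
        boxes = boxes.detach() + step * grad * scale
    iou = iou_predictor(test_feat, modulation, boxes)
    top3 = torch.topk(iou.squeeze(-1), k=3).indices
    return boxes[top3].mean(dim=0)        # final predicted target bounding box
```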
(2.5) Update the classification filter new_filter according to the tracking state of the current test frame judged in step (1.5). During motion a pedestrian may be occluded, may cross paths with other pedestrians, or may blend into the background; by judging the tracking state of the current target before optimizing and updating the classification filter new_filter, the introduction of background or distractor information can be avoided. new_filter is optimized and updated according to the tracking state of the test frame as follows:
(2.5.1) When the tracking state of the current test frame is judged to be the normal tracking state, the classification filter new_filter is optimized and updated every n frames, or whenever a distractor is encountered; n ranges from 15 to 25;
The classification filter new_filter is optimized and updated using the test frame response map s_test obtained in step (1.4) and the test frame label y_hn^test, where the test frame response map s_test is expressed as a 19 × 19 two-dimensional matrix:
(i) The bounding box predicted in step (2.4.3) is scaled onto the 19 × 19 two-dimensional matrix, giving the target position c of the test frame pedestrian target on the 19 × 19 two-dimensional matrix;
(ii) Following the calculation of the target mask parameter m_c and label y_hn in step (1.2.3), compute the target mask parameter m_c^test and label y_hn^test of the test frame;
(iii) Compute the residual r(s, c) between the test frame response map s_test and the test frame label y_hn^test using equation (5); in this case, in equation (5), s is the test frame response map s_test, v_c is a spatial weight, m_c is the test frame target mask parameter m_c^test, and y_hn is the test frame label y_hn^test;
(iv) Following the operations of step (1.2.3), obtain the loss between the test frame response map s_test and the test frame label y_hn^test, and use it to optimize and update the classification filter new_filter, so that a newly optimized classification filter new_filter is obtained for foreground/background classification of the target in subsequent test frames.
(2.5.2) When the tracking state of the current test frame is judged to be the uncertain state, the classification filter new_filter is optimized and updated;
(2.5.3) When the tracking state of the current test frame is judged to be the not-found state, the classification filter new_filter is not optimized or updated.
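A minimal sketch of the state-dependent update decision of step (2.5). The function only decides whether the classifier should be re-optimized this frame; the earlier `optimize_filter` sketch would then be called with the new training sample. The distractor flag and frame-counter handling are illustrative.

```python
def should_update(state, frames_since_update, distractor_found, n=20):
    if state == "normal":
        return frames_since_update >= n or distractor_found   # step (2.5.1)
    if state == "uncertain":
        return True                                           # step (2.5.2)
    return False                                              # step (2.5.3): not-found, skip update
```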
(2.6) Repeat steps (1.3) to (2.5) to complete the target recognition and detection of every frame of the whole tracked video sequence, finally realizing pedestrian single-target tracking. The specific process is as follows:
(2.6.1) Acquire the information of the designated target pedestrian selected in the reference frame of the tracked video sequence through steps (1.1) to (1.2), and use the resulting new_filter for foreground/background classification of the target in subsequent test frames;
(2.6.2) Repeat steps (1.3) to (2.5) for each subsequent test frame until the last frame, thereby completing the recognition and detection of the designated pedestrian target in every frame of the whole video sequence, and finally realizing the tracking of the single designated pedestrian target of the reference frame over the whole video sequence, as in the sketch below.
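A minimal sketch of the overall per-frame loop of step (2.6), tying together the pieces sketched above. All helper names on `backbone` and `predictor` are illustrative stand-ins for the earlier sketches, not an exact implementation of the method.

```python
def track_sequence(frames, ref_box, backbone, predictor):
    new_filter = predictor.train_online(frames[0], ref_box)           # steps (1.1)-(1.2)
    prev_box, results = ref_box, [ref_box]
    for frame in frames[1:]:                                          # steps (1.3)-(2.5) per frame
        feat = backbone(frame)                                        # (1.3)
        response = predictor.classify(feat, new_filter)               # (1.4): s_test = x_test * f_new
        state = predictor.judge_state(response)                       # (1.5)
        center = predictor.predict_center(response, state, prev_box)  # (2.1)-(2.2)
        candidates = predictor.generate_candidates(center, prev_box, ref_box)   # (2.3)
        prev_box = predictor.refine_boxes(feat, candidates)           # (2.4)
        new_filter = predictor.maybe_update(new_filter, feat, prev_box, state)  # (2.5)
        results.append(prev_box)
    return results
```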
The invention has the advantages that: the invention designs a discriminative pedestrian single-target tracking method that fuses pedestrian characteristics with a new online updating strategy, and mainly studies the application of a discriminative model based on an online updating strategy to pedestrian single-target tracking. Online training is the core of the discriminative model: a classification filter is first obtained by prediction from the information of the first frame, and whenever the update conditions are met during subsequent tracking, the predicted result is added to the training sample set as a new training sample and the classification filter is optimized on the new sample set. However, because the current tracking condition is uncertain during updating, an update scheme that simply updates every few frames may introduce excessive background or distractor information, making tracking inaccurate or even causing drift, and the predicted target center may also be inaccurate in different states. To solve these problems, the invention convolves the classification filter with the test frame features to obtain a response map, judges the tracking state of the test frame from it, and adjusts the update strategy accordingly. During motion a pedestrian may be occluded, may move across other pedestrians, or may blend into the background. Existing techniques cannot determine the position of the current target when the pedestrian target is occluded, so drift may occur in the subsequent tracking; drift may also occur when a pedestrian target separates from other pedestrians after crossing paths with them. To solve these occlusion and drift problems, the target center predicted for the test frame is adjusted by combining pedestrian characteristics, making the tracking more accurate. Compared with other methods, the method achieves higher tracking accuracy on single pedestrian targets at real-time speed, combining accuracy and speed, and has practical value.
[ description of the drawings ]
Fig. 1 is a system framework schematic diagram of an online updating strategy pedestrian single-target tracking method integrating pedestrian characteristics according to the present invention.
Fig. 2 is a structural diagram of an initializer module of an online updating strategy pedestrian single-target tracking method integrating pedestrian characteristics according to the invention.
FIG. 3 is a structural diagram of an optimizer module of the online update strategy pedestrian single-target tracking method integrating pedestrian characteristics according to the present invention.
Fig. 4 is a schematic structural diagram of a bounding box prediction module of an online updating strategy pedestrian single-target tracking method integrating pedestrian characteristics according to the present invention.
[ detailed description of embodiments ]
As shown in fig. 1, an online updating strategy pedestrian single-target tracking method fusing pedestrian characteristics is characterized by comprising the following steps:
step 1: classification tasks in the tracking process:
(1.1) Extract the features of the reference frame and of each frame of the video sequence through a feature extraction network, namely: select the first frame of the video sequence as the reference frame, select the target to be tracked by manually marking its target bounding box, and then recognize and detect this selected target of the reference frame in each subsequent frame of the video sequence, i.e. each test frame, so that the target to be tracked is tracked throughout the video sequence and pedestrian single-target tracking is realized;
The feature extraction network in step (1.1) is built on ResNet50. ResNet50 consists of 4 residual blocks (Residual Block) connected in series, named Block1, Block2, Block3 and Block4 and containing 50 convolution operations in total; this is known technology. Two convolutional layers are connected after ResNet50 to form the backbone network for feature extraction, shown in FIG. 1, which extracts the features of the current reference frame or test frame image.
(1.2) Using the features extracted in step (1.1) from the selected reference frame and its manually marked target bounding box, train a classification filter online through a model predictor; the filter distinguishes the target foreground from the background in subsequent frames and predicts the target position;
The model predictor in step (1.2) consists of an initializer module and an optimizer module, as shown in fig. 1. The initializer module, shown in fig. 2, provides an effective initial estimate of the classification filter using only the appearance of the target to be tracked. The optimizer module, shown in fig. 3, optimizes the initially estimated classification filter init_filter to finally obtain the optimized classification filter new_filter, which performs foreground/background classification on subsequent frames of the tracked video sequence and predicts the center position of the target to be tracked for rough localization.
In the step (1.2), the classification filter new _ filter is obtained by utilizing the information of the reference frame through model predictor online training, and the method specifically comprises the following steps:
(1.2.1) Using the reference frame information of the currently tracked video sequence, including the reference frame image and the manually specified target annotation bounding box, apply flipping, mirroring, blurring and rotation data-enhancement operations to the reference frame; collect the resulting images into an image set that has undergone data enhancement, and obtain the correspondingly transformed annotation bounding boxes at the same time;
(1.2.2) Use the feature extraction network of step (1.1) to extract features from the image set processed in step (1.2.1), obtaining a group of training sample features, and send them to the model initializer module for the initial estimate of the classification filter, giving the initially estimated classification filter init_filter, as shown in fig. 2;
The initial estimate of the classification filter by the model initializer module in step (1.2.2) is obtained by applying a PrRoI Pooling operation to the training sample features and extracting the features inside the annotation bounding box; the resulting target features are output as the initially estimated classification filter init_filter.
(1.2.3) Send the initially estimated classification filter init_filter obtained in step (1.2.2), together with the training sample features extracted by the feature extraction network of step (1.1) from the image set processed in step (1.2.1), into the optimizer module, and obtain through optimization the optimized classification filter new_filter used for foreground/background classification of the target in subsequent test frames.
The specific implementation process of step (1.2.3), as shown in fig. 3, can be described by the following steps:
① Calculate the reference frame response map s_ref:
Convolve the initially estimated classification filter init_filter from step (1.2.2) with the training sample features to obtain the response map s_ref, i.e. the response map of the reference frame, as shown in equation (1):
s_ref = x_ref * f_init    (1)
where x_ref denotes the features of the reference frame, i.e. the training sample features; * denotes the convolution calculation; f_init is the initially estimated classification filter init_filter;
② Calculate the difference r(s, c) between the reference frame response map s_ref and the reference frame label y_hn^ref:
The reference frame response map s_ref computed in step ① is represented as a 19 × 19 two-dimensional matrix. The pedestrian target annotation bounding box of the reference frame is scaled onto the 19 × 19 two-dimensional matrix to obtain the position c of the annotated pedestrian target in the 19 × 19 matrix;
y_hn(t) = Σ_{k=0}^{N-1} φ_k^y · ρ_k(||t − c||)    (2)
m_c(t) = Σ_{k=0}^{N-1} φ_k^m · ρ_k(||t − c||)    (3)
In equations (2) and (3), t denotes each position of the 19 × 19 two-dimensional matrix and c denotes the target position; φ_k denotes the spatial coefficients obtained from training; ρ_k is a distance calculation function determined by equations (4-1) and (4-2):
(equations (4-1) and (4-2): piecewise definition of the basis functions ρ_k over the distance ||t − c||)
In this embodiment, N is 10. In a 19 × 19 two-dimensional matrix there is a largest possible distance between t and c, so setting N to 10, i.e. k up to 9, lets the last basis function stand for all positions far from the target center, which can therefore be handled in the same way. The value of τ ranges from 0.45 to 0.55; experiments show the result is best when τ is 0.5, at which the background and the region corresponding to the target are best distinguished.
Here y_hn denotes the real label of the current frame response map; m_c denotes the mask parameter, used to determine the region of the current target in the 19 × 19 two-dimensional matrix, with m_c ≈ 1 in the region corresponding to the target and m_c ≈ 0 in the background region; the label y_hn and the mask parameter m_c are both represented as 19 × 19 two-dimensional matrices;
The reference frame label y_hn^ref and mask parameter m_c^ref are obtained from the reference frame through equations (2) and (3), and the difference r(s, c) between the response map s_ref obtained in step ① and the reference frame label y_hn^ref is calculated as shown in equation (5):
r(s, c) = v_c · (m_c · s + (1 − m_c) · max(0, s) − y_hn)    (5)
where s is the reference frame response map s_ref; v_c is a spatial weight; m_c is the reference frame mask parameter m_c^ref; y_hn is the reference frame label y_hn^ref;
③ Add regularization to the difference r(s, c) obtained in step ② to obtain L(f) as shown in equation (6), which serves as the loss between the reference frame response map s_ref and the reference frame label y_hn^ref and is back-propagated to optimize the classification filter;
L(f) = (1 / |S_train|) Σ_{(x_j, c_j) ∈ S_train} ||r(x_j * f, c_j)||² + ||λf||²    (6)
where S_train = {(x_j, c_j)} is the set of training samples, x_j being the training sample features extracted by the feature extraction network and c_j the center coordinates of the annotated target of the sample, i.e. the coordinates of the center point of the annotation bounding box; * denotes the convolution calculation; λ is a regularization factor; f is the classification filter being optimized.
The optimized classification filter f in step ③ is obtained by back-propagation, using the steepest gradient descent method shown in equation (7):
f^(i+1) = f^(i) − α · ∇L(f^(i))    (7)
where f^(i) denotes the classification filter after the i-th optimization step; α denotes the learning rate; ∇ denotes the gradient calculation;
In this embodiment, the optimized classification filter f is obtained after five iterations of optimization, and the optimized classification filter new_filter is used for foreground/background classification of the target in subsequent test frames.
(1.3) Take the subsequent frames of the sequence currently being tracked as test frames and perform feature extraction, i.e. extract the features of the test frame with the feature extraction network to obtain the test frame features x_test;
(1.4) Convolve the classification filter new_filter optimized in step (1.2) with the test frame features obtained in step (1.3), as shown in equation (8):
s_test = x_test * f_new    (8)
where s_test is the test frame response map; x_test denotes the features of the test frame; * denotes the convolution calculation; f_new is the optimized classification filter new_filter;
(1.5) From the test frame response map s_test obtained in step (1.4), judge the tracking state of the current test frame; the tracking state is divided into a normal state, an uncertain state and a not-found state. The normal state means the scene of the test frame is simple and the target to be tracked and the background can be separated directly by the classification task; the uncertain state means the scene of the current test frame is complex and affected by distractors and background, so the target to be tracked and the background are difficult to identify accurately; the not-found state means the scene of the current test frame is complex and the target is occluded, or the target to be tracked and the background cannot be distinguished;
The specific implementation of step (1.5) comprises the following steps:
(1.5.1) Judge the tracking state of the current test frame from the test frame response map s_test obtained in step (1.4); the tracking state is divided into a normal state, an uncertain state and a not-found state:
① If the highest response in the test frame response map s_test occurs only at the target center, then only the target to be tracked is present or the target and the background have a clear boundary; the tracking state of the current test frame is the normal state, meaning the scene of the test frame is simple and the target to be tracked and the background can be separated directly by the classification task;
② The target region in the test frame response map s_test is the region framed around the position of the highest response score using the target bounding box size of the previous frame. If the response scores inside the target region are relatively disordered and the highest response score also drops noticeably, the target to be tracked is in the uncertain state, meaning the target is blending into the background or a distractor is close to the target;
③ If the response scores inside the target region of the test frame response map s_test are even more disordered than in the uncertain state and the highest response score drops even more markedly than in the uncertain state, the target to be tracked is severely occluded and is in the not-found state;
(1.5.2) Calculate the response-score variance of the target region in the test frame response map s_test and the highest response score of the test frame, to determine the state of the target to be tracked in the current search region:
(1.5.2.1) Compute, in a sliding manner, the mean σ_mean of the target-region response-score variances of the m frames preceding the test frame. The target region is the region framed around the position with the highest response score in the test frame response map s_test, using the target bounding box size of the previous frame. The target-region response-score variance σ of each of the m frames preceding the test frame is recorded, as shown in equation (9):
σ = (1 / (w · h)) Σ_i (score_i − score_mean)²    (9)
where score_i is the score of each position in the corresponding target region of the test frame response map, score_mean is the mean over the positions of the corresponding target region in the test frame response map s_test, and w · h is the width times the height of the corresponding target region in s_test;
(1.5.2.2) Then compute the mean according to equation (10), i.e. the response-score variance mean σ_mean of the m frames:
σ_mean = (1 / m) Σ_{j=1}^{m} σ_j    (10)
where σ_j denotes the target-region response-score variance of each of the m frames;
(1.5.2.3) At the same time, the highest response score of the test frame is the maximum of the response scores in the test frame response map s_test. The highest response score max_score of each of the m frames preceding the test frame is recorded, and the mean max_score_mean of the highest response scores of the m frames is calculated using equation (11):
max_score_mean = (1 / m) Σ_{j=1}^{m} max_score_j    (11)
where max_score_j denotes the highest response score of each frame;
(1.5.2.4) According to the normal, uncertain and not-found conditions stated in step (1.5.1), combine the target-region response-score variance mean σ_mean of the m frames preceding the test frame obtained in step (1.5.2.2) with the highest-response-score mean max_score_mean of the m frames preceding the test frame obtained in step (1.5.2.3) to judge the tracking state of the test frame:
If condition (12) is satisfied, the tracking state of the test frame is the uncertain state:
(condition (12): a comparison of the current frame's target-region response-score variance and highest response score against σ_mean and max_score_mean involving the scale factor k_1)
If condition (13) is satisfied, the tracking state of the test frame is the not-found state:
(condition (13): the corresponding, stricter comparison involving the scale factor k_2)
All other cases are regarded as the normal state;
where σ_mean is the response-score variance mean of the target region over the m frames preceding the current test frame, max_score_mean is the mean of the highest response scores of the test frame response maps s_test over the m frames preceding the test frame, and k_1, k_2 are scale factors.
In this embodiment, m is 25, k_1 = 0.75 and k_2 = 0.5. σ_mean is taken as the mean of the target-region response-score variances of the 25 frames preceding the test frame, and max_score_mean as the mean of the highest response scores of the test frame response maps s_test of the 25 frames preceding the test frame; if fewer than 25 frames precede the test frame, the calculation uses all frames available before the test frame. The value of 25 frames and the scale factors k_1, k_2 are the best choices obtained by experiment.
Step 2: regression tasks in the tracking process:
(2.1) Predict the target center position from the test frame response map s_test obtained in the classification task of step (1.4);
Step (2.1) specifically comprises the following steps:
(2.1.1) From the test frame response map s_test obtained in step (1.4), take the position coordinate corresponding to the maximum response score in s_test as the first response point; if the tracking state of the test frame is the normal tracking state and no distractor is encountered, the first response point is taken as the predicted target center position;
(2.1.2) The target region in the test frame response map s_test is the region framed around the position of the maximum response score in s_test using the target bounding box size of the previous frame; the area of s_test outside this framed region lies outside the target region, and the position corresponding to the highest response score outside the target region is regarded as the second response point;
(2.1.3) When the highest response score of the second response point is greater than 0.5 times the highest response score of the first response point, the second response point is regarded as a target look-alike (distractor) in the background;
(2.1.4) Let the current first response point position be c_1[x_1, y_1], the second response point position be c_2[x_2, y_2], and the position in the test frame response map s_test of the target center point of the frame preceding the current test frame be c_0[x_0, y_0], i.e. the center point of the target region in the response map. The positional offsets of c_1 and c_2 relative to c_0 are given by equations (14) and (15) respectively:
d_1 = sqrt((x_1 − x_0)² + (y_1 − y_0)²)    (14)
d_2 = sqrt((x_2 − x_0)² + (y_2 − y_0)²)    (15)
(2.1.5) Judge the real position of the current target:
Depending on which value ranges the offsets calculated from equations (14) and (15) fall into, different response points are returned as the predicted target center position, as shown in equations (16) and (17):
c_1, if (d_1 > Ω & d_2 < Ω) | (d_1 > Ω & d_2 > Ω) | (d_1 < d_2 & d_1 < Ω & d_2 < Ω)    (16)
c_2, if (d_1 < Ω & d_2 > Ω) | (d_1 > d_2 & d_1 < Ω & d_2 < Ω)    (17)
In equations (16) and (17), Ω denotes the threshold of the value range;
(2.2) correcting (2.1) the predicted target center position by combining the inherent characteristics of the pedestrian target in the motion process to obtain the final predicted target center position;
The step (2.2) specifically comprises the following: according to the inherent characteristics of a pedestrian target during motion, target tracking is normal in the normal state, the pedestrian target does not undergo violent scale changes while moving, and the pedestrian moves relatively smoothly; therefore the offset of the target center in the normal tracking state is recorded: the relative offset of the target center between consecutive frames over the last v frames is recorded in a sliding-window manner, and the mean of these offsets is taken as the predicted offset, with v ranging from 12 to 18. If the tracking state is the uncertain state, the predicted target center position obtained in step (2.1) is corrected by the predicted offset, finally giving the final predicted target center position;
In this embodiment, v = 16, which is the optimal choice obtained from experimental summary.
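The center-offset correction of step (2.2) can be sketched as follows. Since the text does not spell out the exact correction formula, applying the mean recorded offset to the previous center in the uncertain state is an assumption made for illustration; v = 16 follows this embodiment.

```python
from collections import deque
import numpy as np

class CenterOffsetCorrector:
    """Records center offsets in the normal state and reuses them when uncertain."""

    def __init__(self, v=16):
        self.offsets = deque(maxlen=v)   # center offsets recorded in the normal state
        self.prev_center = None

    def correct(self, predicted_center, state):
        predicted_center = np.asarray(predicted_center, dtype=float)
        if state == "uncertain" and self.offsets and self.prev_center is not None:
            predicted_offset = np.mean(self.offsets, axis=0)
            center = self.prev_center + predicted_offset      # assumed correction rule
        else:
            center = predicted_center

        if state == "normal" and self.prev_center is not None:
            self.offsets.append(center - self.prev_center)    # sliding-window recording
        self.prev_center = center
        return center
```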
(2.3) According to the final predicted target center position obtained in step (2.2) and the target bounding box of the previous frame, if the current frame is the second frame, the target bounding box annotated in the first frame is taken as the initial candidate bounding box; if it is a later frame, the bounding box predicted in step (2.4) is taken as the initial candidate bounding box; combined with the target bounding box annotated in the reference frame as the reference bounding box, a candidate bounding-box set is generated around the target center position finally predicted in step (2.2);
the specific implementation process of the step (2.3) is as follows:
(2.3.1) The final predicted target center position is obtained according to step (2.2), and the target bounding box of the previous frame is taken as the initial candidate bounding box; if the current frame is the second frame, the target bounding box annotated in the first frame is taken as the initial candidate bounding box, and if it is a later frame, the bounding box predicted in step (2.4) is taken as the initial candidate bounding box; a candidate bounding boxes of different proportions are then randomly generated around the final predicted target center position to form the initial candidate bounding-box set; in this embodiment, a = 10;
(2.3.2) According to the inherent characteristics of the pedestrian target during motion, the scale of the pedestrian target changes relatively smoothly; combined with the manually annotated target bounding box of the reference frame in step (1.1) as the reference candidate bounding box, b reference candidate bounding boxes of different proportions are randomly generated around the finally predicted target center position to form the reference candidate bounding-box set; in this embodiment, b = 4;
(2.3.3) fusing the initial candidate bounding box set obtained in the step (2.3.1) and the reference candidate bounding box set obtained in the step (2.3.2) to obtain a + b candidate bounding boxes serving as candidate bounding box sets;
The value range of a in step (2.3.1) is 7 to 15, and the value range of b in step (2.3.2) is 3 to 7; in this embodiment, a = 10 and b = 4 are the parameters obtained through experimental optimization.
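A possible sketch of the candidate-box generation in step (2.3), with a = 10 and b = 4, is shown below; the (cx, cy, w, h) box convention and the jitter range used to produce boxes of different proportions are assumptions made only for illustration.

```python
import numpy as np

def generate_candidates(center, init_box, ref_box, a=10, b=4, rng=None):
    """Builds the a + b candidate set of step (2.3); boxes are (cx, cy, w, h)."""
    rng = rng or np.random.default_rng()
    cx, cy = center

    def jitter(box, n):
        _, _, w, h = box
        scales = rng.uniform(0.9, 1.1, size=(n, 2))   # assumed mild scale/aspect jitter
        return np.array([[cx, cy, w * sw, h * sh] for sw, sh in scales])

    init_set = jitter(init_box, a)    # a boxes of different proportions, step (2.3.1)
    ref_set = jitter(ref_box, b)      # b reference-based boxes, step (2.3.2)
    return np.concatenate([init_set, ref_set], axis=0)   # a + b candidates, step (2.3.3)
```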
(2.4) sending the a + b optimized candidate bounding box sets obtained in the step (2.3) and the characteristics of the test frame obtained in the step (1.3) to a bounding box prediction module for target bounding box prediction;
the target bounding box prediction process in the step (2.4) comprises the following steps:
(2.4.1) Since the specified tracked target is not known before annotation, the target information annotated in the reference frame must be combined when predicting the bounding box, and therefore this information is first extracted, namely: as shown in fig. 4, the feature extraction network ResNet50 of step (1.1) is used to perform feature extraction on the reference frame; the reference frame is sent into the ResNet50 feature extraction network and processed by the four residual blocks in sequence, the reference-frame features output by Block3 (layer 1) and Block4 (layer 2) are extracted, convolved, pooled by PrPooling and fused, and then passed through a fully connected layer to obtain a modulation vector, which serves as the pedestrian target information annotated in the reference frame;
(2.4.2) For the test frame, as shown in the test-frame branch of fig. 4, the test-frame features obtained by the ResNet50 feature extraction network of step (1.1) are passed through two convolutional layers, PrPooling is then performed on each of the a + b (14 in this embodiment) bounding-box regions to extract their internal features, the modulation vector, i.e., the information of the pedestrian target annotated in the reference frame, is combined, and the intersection-over-union ratio IoU of each of the a + b (14 in this embodiment) bounding boxes is then predicted through a fully connected layer; the bounding-box gradients are calculated from the IoU, and the a + b (14 in this embodiment) candidate bounding boxes are each optimized to obtain the optimized candidate bounding boxes, which serve as the new optimized candidate bounding-box set;
(2.4.3) Step (2.4.2) is repeated for iterative optimization; after 5 iterations, the average of the coordinates of the three candidate bounding boxes with the largest IoU is taken as the coordinates of the predicted bounding box, i.e., the final predicted target bounding box.
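The iterative IoU-guided refinement of step (2.4) might be organized as in the following sketch; the iou_net prediction head, its signature and the step length are assumptions, and only the overall flow (predict the IoU of each candidate, ascend its gradient for 5 iterations, then average the three best boxes) follows the description above.

```python
import torch

def refine_boxes(iou_net, test_feat, modulation, boxes, steps=5, lr=1.0):
    """Gradient-ascent refinement of candidate boxes on a predicted IoU score."""
    boxes = boxes.clone().requires_grad_(True)       # (a + b, 4) candidate boxes
    for _ in range(steps):                           # 5 iterations as in (2.4.3)
        iou = iou_net(test_feat, modulation, boxes)  # predicted IoU per candidate box
        iou.sum().backward()
        with torch.no_grad():
            boxes += lr * boxes.grad                 # move each box to increase its IoU
            boxes.grad.zero_()
    with torch.no_grad():
        iou = iou_net(test_feat, modulation, boxes)
        top3 = iou.topk(3).indices
        return boxes[top3].mean(dim=0).detach()      # average of the 3 best boxes
```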
(2.5) The classification filter new_filter is updated according to the tracking state of the current test frame judged in step (1.5); a pedestrian may be occluded, cross paths with other pedestrians, or blend into the background during motion, so the classification filter new_filter is optimized and updated according to the judged tracking state of the current target, which prevents background or distractor information from being introduced;
in the step (2.5), the new _ filter is optimized and updated according to the tracking state of the test frame as follows:
(2.5.1) when the tracking state of the current test frame is determined to be a normal tracking state, optimizing and updating the new _ filter of the classification filter every n frames or when an interfering object is encountered;
The value range of n in step (2.5.1) is 15 to 25. In this embodiment, the classification filter is updated every 20 frames: if the filter is updated too frequently, the speed drops noticeably and more background information is introduced, affecting the performance of the final overall model; since the video changes dynamically, if the filter is not updated for a long time, its classification ability deteriorates, which also affects the performance of the final overall model. The interval of 20 frames is the best choice obtained from experimental summary.
(2.5.2) when the tracking state of the current test frame is determined to be an uncertain state, optimizing and updating the new _ filter of the classification filter;
(2.5.3) when the tracking state of the current test frame is determined as the state can not be found, the new _ filter of the classification filter is not updated optimally.
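The state-dependent update policy of steps (2.5.1) to (2.5.3) can be summarized in a small helper; the function and argument names are illustrative, with n = 20 as in this embodiment.

```python
def should_update_filter(state, frame_idx, last_update_idx, distractor_found, n=20):
    """State-dependent update rule for the classification filter new_filter."""
    if state == "not_found":
        return False           # (2.5.3): do not update when the target is lost
    if state == "uncertain":
        return True            # (2.5.2): update in complex, uncertain scenes
    # (2.5.1) normal state: update every n frames, or when a distractor appears
    return distractor_found or (frame_idx - last_update_idx >= n)
```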
In step (2.5.1), the optimization update of the classification filter new_filter is implemented using the test-frame response map s_test obtained in step (1.4) and the test-frame label y_hn^test, where the test-frame response map s_test is expressed as a 19 × 19 two-dimensional matrix:
(i) the boundary frame predicted in the step (2.4.3) is scaled on a 19 × 19 two-dimensional matrix, and then a target position c of the pedestrian target of the test frame on the 19 × 19 two-dimensional matrix can be obtained;
(ii) Following the calculation of the target mask parameter m_c and the label y_hn in step (1.2.3), the target mask parameter m_c^test and the label y_hn^test of the test frame are calculated;
(iii) The residual r(s, c) between the test-frame response map s_test and the test-frame label y_hn^test is calculated using equation (2); in this case, in equation (2), s is the test-frame response map s_test, v_c is the spatial weight, m_c is the test-frame target mask parameter m_c^test, and y_hn is the test-frame label y_hn^test;
(iv) Following the operation of step (1.2.3), the loss between the test-frame response map s_test and the test-frame label y_hn^test is obtained, and the classification filter new_filter is optimized and updated according to this loss, so as to obtain a newly optimized classification filter new_filter for foreground/background classification of the target in subsequent test frames.
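For the test-frame update, the residual and loss computation might look like the following sketch; it reuses the hinge-like residual form r(s, c) = v_c·(m_c·s + (1 − m_c)·max(0, s) − y_hn) stated for the reference frame, and treating the test-frame quantities the same way, as well as the regularization weight lam, are assumptions.

```python
import torch

def residual(s, m_c, y_hn, v_c):
    """Hinge-like residual between a response map s and its label y_hn."""
    return v_c * (m_c * s + (1 - m_c) * torch.clamp(s, min=0) - y_hn)

def filter_update_loss(s_test, m_c_test, y_hn_test, v_c, f, lam=0.05):
    """Squared residual on the test frame plus filter regularization (lam is assumed)."""
    r = residual(s_test, m_c_test, y_hn_test, v_c)
    return (r ** 2).sum() + lam * (f ** 2).sum()
```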
And (2.6) completing target identification detection of each frame in the whole tracked video sequence by repeating the steps (1.3) to (2.5), and finally realizing single-target tracking of the pedestrians.
The specific process of the step (2.6) is as follows:
(2.6.1) acquiring information of a specified target pedestrian selected from the reference frame of the tracked video sequence through the steps (1.1) to (1.2), and using the obtained new _ filter for classification of a foreground and a background of the target of a subsequent test frame;
(2.6.2) repeating the steps (1.3) to (2.5) for each frame in the subsequent test frames until the last frame, thereby completing the identification and detection of the specified pedestrian target for each frame in the whole video sequence, and finally realizing the tracking of the single specified pedestrian target of the reference frame on the whole video sequence.
In the embodiment, an online updating strategy pedestrian single-target tracking method fusing pedestrian characteristics is constructed by using Python language and PyTorch framework.
The implementation mainly involves a classification task and a regression task: a new updating strategy is adopted in the classification task, the target center position is corrected in the regression task, and candidate bounding boxes generated from the first-frame bounding box are newly added as an innovation point.
The video sequence is taken as input, and the bounding box of the pedestrian target to be tracked is manually annotated in the first frame as the pedestrian target object to be tracked in subsequent frames. Data enhancement is performed on the first-frame image, i.e., a training image set is obtained after operations such as flipping, mirroring and shifting, and the corresponding manually annotated pedestrian-target bounding boxes are obtained according to the enhancement mode to form the training sample set. Feature extraction is performed on the training sample set with the backbone network, which consists of a ResNet50 network pre-trained on the ImageNet dataset and a two-layer convolutional network trained offline, giving the training-sample feature set. The training-sample feature set is taken as input and sent to the model predictor to predict the classification filter: the initializer module of the model predictor can effectively provide an initial estimate of the classification filter using only the target appearance, i.e., the manually annotated bounding box, together with a PrPooling operation; the initially estimated classification filter init_filter is then sent to the optimizer module, which optimizes it by the steepest gradient descent method. Through experimental summary, 5 optimization iterations are performed, i.e., i = 5 in formula (7), and the learning rate α is set to 0.6, which gives better results, finally yielding the optimized classification filter new_filter. The target object annotated on the reference frame is then identified and detected in the subsequent frames of the current video sequence, realizing tracking of the single specified pedestrian target over the current video sequence. During tracking of a subsequent frame, the frame is taken as the test frame; following the twin-network (Siamese) idea, the test frame shares the same backbone network as the reference frame for feature extraction, the extracted features are convolved with the optimized classification filter new_filter to obtain the response map, the predicted target center position is obtained according to step (2.1), and it is corrected according to the current tracking state to obtain the final predicted target center position.
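The filter optimization loop described above (5 iterations of gradient descent with learning rate α = 0.6, formula (7)) might be sketched as follows; classification_loss is a placeholder for the computation of L(f) from the training-sample features and labels.

```python
import torch

def optimize_filter(init_filter, classification_loss, iters=5, alpha=0.6):
    """Gradient-descent refinement of the classification filter, formula (7)."""
    f = init_filter.clone().requires_grad_(True)
    for _ in range(iters):
        loss = classification_loss(f)        # L(f): residuals plus regularization
        grad, = torch.autograd.grad(loss, f)
        with torch.no_grad():
            f = f - alpha * grad             # f^(i) = f^(i-1) - alpha * grad L(f^(i-1))
        f.requires_grad_(True)
    return f.detach()                        # optimized classification filter new_filter
```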
Using the target bounding-box size of the previous frame (or, if the current test frame is the second frame, the size of the manually annotated reference-frame bounding box) as the initial candidate bounding box, and combined with the size of the manually annotated reference-frame bounding box, 14 candidate bounding boxes are randomly generated at the final predicted target center position; the IoU of the 14 candidate bounding boxes is computed through the regression branch, the 14 candidate boxes are optimized directly by gradient descent, and finally the average of the coordinates of the three candidate boxes with the largest IoU is taken as the final predicted bounding box, completing target identification and detection for the current test frame. Each subsequent test frame of the current video sequence is then identified and detected in turn until the last frame, thereby completing identification and detection of the specified pedestrian target in every frame of the video sequence and finally realizing tracking of the single specified pedestrian target over the whole video sequence.
Table 1 shows the comparison of tracking performance between this method and other methods on a pedestrian video-sequence dataset; the method reaches 30 FPS on a GTX 1650, achieving real-time tracking. The method balances precision and speed and has certain practical value.
TABLE 1
Method         Ours    ATOM    DIMP    SiamCAR    SiamDW    SiamRPN++    Ocean
Success rate   0.740   0.696   0.681   0.655      0.650     0.633        0.617

Claims (10)

1. An online updating strategy pedestrian single-target tracking method fused with pedestrian characteristics is characterized by comprising the following steps:
step 1: classification tasks in the tracking process:
(1.1) extracting the features of the reference frame in the video sequence and each frame in the video sequence through a feature extraction network, namely: selecting a first frame of a video sequence as a reference frame, selecting a target to be tracked in a mode of manually marking a target boundary frame, and identifying and detecting the selected target to be tracked of the reference frame in each frame, namely a test frame, in a subsequent video sequence, so that the target to be tracked in the video sequence is tracked, and the pedestrian single-target tracking process can be realized;
(1.2) using the features of the reference frame extracted in step (1.1) and its manually annotated target bounding box, training a classification filter online through a model predictor, for distinguishing the foreground, i.e., the target to be tracked, from the background in subsequent frames and predicting the target position;
(1.3) taking the subsequent frame in the current sequence to be tracked as the test frame and performing feature extraction, namely performing feature extraction on the test frame with the feature extraction network to obtain the test-frame feature x_test;
(1.4) performing convolution processing using the classification filter new_filter optimized in step (1.2) and the test-frame features obtained in step (1.3), as shown in formula (8):
s_test = x_test ∗ f_new   (8)
where s_test is the test-frame response map; x_test represents the features of the test frame; ∗ represents the convolution calculation; f_new is the optimized classification filter new_filter;
(1.5) judging the tracking state of the current test frame according to the test-frame response map s_test obtained in step (1.4), the tracking state being divided into a normal state, an uncertain state and a not-found state; the normal state represents a simple test-frame scene in which the target to be tracked and the background can be simply distinguished through the classification task; the uncertain state represents that the scene of the current test frame is complex and affected by interfering objects and the background, so that it is difficult to accurately distinguish the target to be tracked from the background; the not-found state represents that the scene of the current test frame is complex and the target is occluded, or the target to be tracked cannot be distinguished from the background;
step 2: regression tasks in the tracking process:
(2.1) predicting the target center position from the test-frame response map s_test obtained in the classification task of step (1.4);
(2.2) correcting (2.1) the predicted target center position by combining the inherent characteristics of the pedestrian target in the motion process to obtain the final predicted target center position;
(2.3) according to the final predicted target center position obtained in the step (2.2) and the target boundary frame of the previous frame, if the current frame is the second frame, taking the target boundary frame marked by the first frame as an initial candidate boundary frame, if the current frame is the subsequent frame, taking the boundary frame predicted in the step (2.4) as an initial candidate boundary frame, and combining the target boundary frame marked by the reference frame as a reference boundary frame to generate a candidate boundary frame set around the final predicted target center position obtained in the step (2.2);
(2.4) sending the a + b optimized candidate bounding box sets obtained in the step (2.3) and the characteristics of the test frame obtained in the step (1.3) to a bounding box prediction module for target bounding box prediction;
(2.5) updating the classification filter new_filter according to the tracking state of the current test frame judged in step (1.5); a pedestrian may be occluded, cross paths with other pedestrians, or blend into the background during motion, and optimizing and updating the classification filter new_filter according to the judged tracking state of the current target prevents background or distractor information from being introduced;
and (2.6) completing target identification detection of each frame in the whole tracked video sequence by repeating the steps (1.3) to (2.5), and finally realizing single-target tracking of the pedestrians.
2. The pedestrian single-target tracking method of the online updating strategy fused with the pedestrian characteristics according to claim 1, characterized in that the feature extraction network in the step (1.1) is a ResNet50 module structure; the ResNet50 module structure is formed by connecting 4 residual blocks in series, wherein the names of the 4 residual blocks are Block1, Block2, Block3 and Block 4; and the ResNet50 module structure is connected with two convolutional layers, so that a backbone network for feature extraction is formed, and the backbone network is used for extracting the features of the current reference frame or test frame image.
3. The pedestrian single-target tracking method of the online updating strategy fused with the pedestrian characteristics as claimed in claim 1, wherein the model predictor in the step (1.2) is composed of an initializer module and an optimizer module; the initializer module can effectively provide an initial estimate of the classification filter using only the appearance of the target to be tracked; the optimizer module is used for optimizing the initially estimated classification filter init_filter to finally obtain the optimized classification filter new_filter, which performs target foreground and background classification on subsequent frames of the tracked video sequence and predicts the center position of the target to be tracked for rough positioning.
4. The method for tracking the pedestrian single target by the online updating strategy fused with the pedestrian characteristics as claimed in claim 1, wherein in the step (1.2), the classification filter new _ filter is obtained by online training of a model predictor by using the information of the reference frame, and the method specifically comprises the following steps:
(1.2.1) respectively carrying out turning, mirroring, blurring and rotating data enhancement operations on the reference frame by utilizing the reference frame information of the current tracked video sequence, including the reference frame image information and a manually-specified target annotation bounding box, obtaining images with respective operation effects, forming a set by the images to serve as an image set subjected to data enhancement processing, and simultaneously obtaining the corresponding annotation bounding box after the data enhancement;
(1.2.2) extracting the characteristics of the image set processed in the step (1.2.1) by using the characteristic extraction network in the step (1.1) to obtain a group of training sample characteristics, and sending the training sample characteristics to a model initialization module for initial estimation of a classification filter to obtain an initial estimated classification filter init _ filter;
(1.2.3) using the initially estimated classification filter init_filter obtained in step (1.2.2) together with the group of training-sample features obtained by performing feature extraction on the image set processed in step (1.2.1) with the feature extraction network of step (1.1), sending them into the optimizer module and performing optimization to obtain the optimized classification filter new_filter for foreground/background classification of the target in subsequent test frames.
5. The method for tracking the pedestrian single target by the online updating strategy fused with the pedestrian characteristics as claimed in claim 4, wherein the initial estimation of the classification filter by the model initialization module in step (1.2.2) is to extract the features within the annotation bounding box by performing a PrRoI Pooling operation on the training-sample features; the obtained features are the target features and are used as the output, namely the initially estimated classification filter init_filter;
the specific implementation process of the step (1.2.3) can be described by the following steps:
① Calculating the reference-frame response map s_ref:
The classification filter init_filter initially estimated in step (1.2.2) is convolved with the training-sample features to obtain the response map s_ref, i.e., the response map of the reference frame, as shown in equation (1):
s_ref = x_ref ∗ f_init   (1)
where x_ref represents the features of the reference frame, namely the training-sample features; ∗ represents the convolution calculation; f_init is the initially estimated classification filter init_filter;
② Calculating the difference r(s, c) between the reference-frame response map s_ref and the reference-frame label y_hn^ref:
According to the reference-frame response map s_ref calculated in step (1.2.2), represented in the form of a 19 × 19 two-dimensional matrix, the reference-frame pedestrian-target annotation bounding box is scaled onto the 19 × 19 two-dimensional matrix to obtain the position c of the annotated pedestrian target in the 19 × 19 matrix;
[Formulas (2) and (3): definitions of the label y_hn and the mask parameter m_c over the 19 × 19 matrix; formula images not reproduced in the text]
in formula (2) and formula (3), t represents each position of a 19 × 19 two-dimensional matrix, and c represents a target position;
v_c represents the spatial coefficients obtained from training; ρ_k is a distance calculation function determined by equations (4-1) and (4-2):
[Formulas (4-1) and (4-2): definition of the distance function ρ_k; formula images not reproduced in the text]
where y_hn represents the real label of the current frame's response map; m_c represents the mask parameter, used to determine the area of the current target in the 19 × 19 two-dimensional matrix, with m_c ≈ 1 in the area corresponding to the target and m_c ≈ 0 in the background area; the label y_hn and the mask parameter m_c are each represented by a 19 × 19 two-dimensional matrix;
The reference-frame label y_hn^ref and mask parameter m_c^ref are obtained from the reference frame through formulas (2) and (3), and the difference r(s, c) between the reference-frame response map s_ref obtained in step ① and the reference-frame label y_hn^ref is calculated, as shown in equation (5):
r(s, c) = v_c · (m_c · s + (1 − m_c) · max(0, s) − y_hn)   (5)
where s is the reference-frame response map s_ref; v_c is the spatial weight; m_c is the reference-frame mask parameter m_c^ref; y_hn is the reference-frame label y_hn^ref;
③ Regularization is added to the difference r(s, c) obtained in step ② to obtain L(f), shown as formula (6); L(f) is back-propagated as the loss between the reference-frame response map s_ref and the reference-frame label y_hn^ref, so as to optimize the classification filter:
L(f) = Σ_{(x_j, c_j) ∈ S_train} ‖r(x_j ∗ f, c_j)‖² + λ‖f‖²   (6)
where S_train = {(x_j, c_j)} is the set of training samples, in which x_j are the training-sample features extracted by the feature extraction network and c_j is the center coordinate of the sample's annotated target, i.e., the center point of the annotation bounding box; ∗ represents the convolution calculation; λ is a regularization factor; f is the classification filter being optimized.
6. The method for tracking the pedestrian single target by the online updating strategy fused with the pedestrian characteristics according to claim 5, wherein the classification filter f in step ③ is optimized by back propagation, adopting the steepest gradient descent method shown in formula (7):
f^(i) = f^(i−1) − α · ∇L(f^(i−1))   (7)
where f^(i) represents the classification filter after the i-th optimization; α represents the learning rate; ∇ represents the gradient calculation.
7. The pedestrian single-target tracking method based on the online updating strategy fused with the pedestrian characteristics according to claim 1, wherein the specific implementation manner of the step (1.5) is composed of the following steps:
(1.5.1) The tracking state of the current test frame is judged according to the test-frame response map s_test obtained in step (1.4), the tracking state being divided into the normal state, the uncertain state and the not-found state:
① If only the target center gives the highest response in the test-frame response map s_test, this indicates that at this moment only the target to be tracked is present, or that the target to be tracked has a clear boundary with the background; that is, the tracking state of the current test frame is the normal state, representing a simple test-frame scene in which the target to be tracked and the background can be simply distinguished through the classification task;
② The target area in the test-frame response map s_test is the area framed around the position of the highest response score using the target bounding-box size of the previous frame; if the response scores within the target area are relatively disordered and the highest response score is also clearly reduced, this represents that the target to be tracked is confused with the background or that an interfering object is close to the target to be tracked, i.e., the tracking state is the uncertain state;
③ If the response scores within the target area of the test-frame response map s_test are even more disordered than in the uncertain state, and at the same time the highest response score is reduced more obviously than in the uncertain state, the target to be tracked is severely occluded and the tracking state is the not-found state;
(1.5.2) The response-score variance of the target area in the test-frame response map s_test and the highest response score of the test frame are calculated to determine the state of the target to be tracked in the current search area:
(1.5.2.1) The mean σ̄_m of the target-area response-score variances over the m frames before the test frame is calculated in a sliding manner; the target area is the area framed in the test-frame response map s_test around the position of the highest response score using the target bounding-box size of the previous frame, and the variance σ of the target-area response scores is recorded for each of the m frames before the test frame, as shown in formula (9):
σ = (1 / (w·h)) · Σ_i (score_i − score_mean)²   (9)
where score_i is the score at each position of the corresponding target area in the test-frame response map; score_mean is the mean value over the positions of the corresponding target area in the test-frame response map s_test; w·h is the width × height size of the corresponding target area in s_test;
(1.5.2.2) The mean σ̄_m of the response-score variances of the m frames is calculated according to formula (10):
σ̄_m = (1/m) · Σ_{j=1}^{m} σ_j   (10)
where σ_j represents the target-area response-score variance of each of the m frames;
(1.5.2.3) Meanwhile, the highest response score of the test frame is the maximum value of the response scores in the test-frame response map s_test; the highest response score max_score of each of the m frames before the test frame is recorded, and the mean max̄_m of the highest response scores of the m frames is calculated using equation (11):
max̄_m = (1/m) · Σ_{j=1}^{m} max_score_j   (11)
where max_score_j represents the highest response score of each frame;
(1.5.2.4) According to the normal-state, uncertain-state and not-found-state conditions stated in step (1.5.1), the tracking state of the test frame is judged by combining the mean σ̄_m of the target-area response-score variances of the m frames before the test frame obtained in step (1.5.2.2) and the mean max̄_m of the highest response scores of the m frames before the test frame obtained in step (1.5.2.3):
if formula (12) is satisfied, the tracking state of the test frame is the uncertain state:
[Formula (12): the uncertain-state condition, comparing the test frame's target-area response-score variance and highest response score with σ̄_m, max̄_m and the scale factors k_1, k_2; formula image not reproduced in the text]
if formula (13) is satisfied, the tracking state of the test frame is the not-found state:
[Formula (13): the not-found condition, comparing the test frame's target-area response-score variance and highest response score with σ̄_m, max̄_m and the scale factors k_1, k_2; formula image not reproduced in the text]
the other cases are regarded as normal states;
where σ̄_m is the mean of the target-area response-score variances over the m frames before the current test frame; max̄_m is the mean of the highest response scores of the test-frame response map s_test over the m frames before the test frame; k_1, k_2 are scale factors.
8. The pedestrian single-target tracking method of the online updating strategy fused with the pedestrian characteristics according to claim 1, wherein the step (2.1) specifically refers to:
(2.1.1) In the test-frame response map s_test obtained in step (1.4), the position coordinate corresponding to the maximum response score is taken as the first response point; if the tracking state of the test frame is the normal tracking state and no interfering object is encountered, the first response point is taken as the predicted target center position;
(2.1.2) The target area in the test-frame response map s_test is the area framed around the position of the maximum response score using the target bounding-box size of the previous frame; the part of s_test outside this framed area is outside the target area, and the position corresponding to the highest response score outside the target area is taken as the second response point;
(2.1.3) When the highest response score of the second response point is greater than 0.5 times the highest response score of the first response point, the second response point is considered to be a target-like distractor in the background;
(2.1.4) Let the current first response point be at position c_1 = [x_1, y_1], the second response point at c_2 = [x_2, y_2], and the center point of the previous frame's target in the test-frame response map s_test at c_0 = [x_0, y_0], i.e., the center point of the target area in the response map; the positional offsets of c_1 and c_2 relative to c_0 are given by equations (14) and (15) respectively:
d_1 = √((x_1 − x_0)² + (y_1 − y_0)²)   (14)
d_2 = √((x_2 − x_0)² + (y_2 − y_0)²)   (15)
(2.1.5) judging the real position of the current target:
if the offset calculated according to the equations (14) and (15) is in different value ranges, returning different response point positions as predicted target center positions, as shown in equations (16) and (17): in the formula, Ω represents a threshold value in the value range.
c_1, (d_1 > Ω & d_2 < Ω) | (d_1 > Ω & d_2 > Ω) | (d_1 < d_2 & d_1 < Ω & d_2 < Ω)   (16)
c_2, (d_1 < Ω & d_2 > Ω) | (d_1 > d_2 & d_1 < Ω & d_2 < Ω)   (17)
9. The pedestrian single-target tracking method of the online updating strategy fused with the pedestrian characteristics according to claim 1, wherein the step (2.2) specifically refers to: according to the inherent characteristics of the pedestrian target during motion, target tracking is normal in the normal state, the pedestrian target does not undergo violent scale changes while moving, and the pedestrian moves relatively smoothly; therefore the offset of the target center in the normal tracking state is recorded: the relative offset of the target center between consecutive frames over the last v frames is recorded in a sliding-window manner, and the mean of these offsets is taken as the predicted offset, with v ranging from 12 to 18; if the tracking state is the uncertain state, the predicted target center position obtained in step (2.1) is corrected by the predicted offset, finally giving the final predicted target center position.
10. The pedestrian single-target tracking method of the online updating strategy fused with the pedestrian characteristics according to claim 1, wherein the step (2.3) is implemented by the following specific steps:
(2.3.1) The final predicted target center position is obtained according to step (2.2), and the target bounding box of the previous frame is taken as the initial candidate bounding box; if the current frame is the second frame, the target bounding box annotated in the first frame is taken as the initial candidate bounding box, and if it is a later frame, the bounding box predicted in step (2.4) is taken as the initial candidate bounding box; a candidate bounding boxes of different proportions are then randomly generated around the final predicted target center position to form the initial candidate bounding-box set, where the value range of a is 7 to 15;
(2.3.2) According to the inherent characteristics of the pedestrian target during motion, the scale of the pedestrian target changes relatively smoothly; combined with the manually annotated target bounding box of the reference frame in step (1.1) as the reference candidate bounding box, b reference candidate bounding boxes of different proportions are randomly generated around the finally predicted target center position to form the reference candidate bounding-box set, where the value range of b is 3 to 7;
(2.3.3) fusing the initial candidate bounding box set obtained in the step (2.3.1) and the reference candidate bounding box set obtained in the step (2.3.2) to obtain a + b candidate bounding boxes serving as candidate bounding box sets;
the target bounding box prediction process in the step (2.4) comprises the following steps:
(2.4.1) Since the specified tracked target is not known before annotation, the target information annotated in the reference frame must be combined when predicting the bounding box, and therefore this information is first extracted, namely: the feature extraction network ResNet50 of step (1.1) is used to perform feature extraction on the reference frame; the reference frame is sent into the ResNet50 feature extraction network and processed by the 4 residual blocks in sequence, the reference-frame features output by Block3 (layer 1) and Block4 (layer 2) are extracted, convolved, pooled by PrPooling and fused, and then passed through a fully connected layer to obtain a modulation vector, which serves as the pedestrian target information annotated in the reference frame;
(2.4.2) For the test frame, the test-frame features obtained by the ResNet50 feature extraction network of step (1.1) are passed through two convolutional layers, PrPooling is then performed on each of the a + b bounding-box regions to extract their internal features, the modulation vector, i.e., the information of the pedestrian target annotated in the reference frame, is combined, and the intersection-over-union ratio IoU of each of the a + b bounding boxes is then predicted through a fully connected layer; the bounding-box gradients are calculated from the IoU, and the a + b candidate bounding boxes are each optimized to obtain the optimized candidate bounding boxes, which serve as the new optimized candidate bounding-box set;
(2.4.3) Step (2.4.2) is repeated for iterative optimization; after 5 iterations, the average of the coordinates of the three candidate bounding boxes with the largest IoU is taken as the coordinates of the predicted bounding box, i.e., the final predicted target bounding box;
in the step (2.5), the new _ filter is optimized and updated according to the tracking state of the test frame as follows:
(2.5.1) when the tracking state of the current test frame is determined to be a normal tracking state, optimizing and updating the new _ filter of the classification filter every n frames or when an interfering object is encountered; wherein the value range of n is 15-25;
The optimization update of the classification filter new_filter is implemented using the test-frame response map s_test obtained in step (1.4) and the test-frame label y_hn^test, where the test-frame response map s_test is expressed as a 19 × 19 two-dimensional matrix:
(i) the boundary frame predicted in the step (2.4.3) is scaled on a 19 × 19 two-dimensional matrix, and then a target position c of the pedestrian target of the test frame on the 19 × 19 two-dimensional matrix can be obtained;
(ii) Following the calculation of the target mask parameter m_c and the label y_hn in step (1.2.3), the target mask parameter m_c^test and the label y_hn^test of the test frame are calculated;
(iii) The residual r(s, c) between the test-frame response map s_test and the test-frame label y_hn^test is calculated using equation (2); in this case, in equation (2), s is the test-frame response map s_test, v_c is the spatial weight, m_c is the test-frame target mask parameter m_c^test, and y_hn is the test-frame label y_hn^test;
(iv) Following the operation of step (1.2.3), the loss between the test-frame response map s_test and the test-frame label y_hn^test is obtained, and the classification filter new_filter is optimized and updated according to this loss, so as to obtain a newly optimized classification filter new_filter for foreground/background classification of the target in subsequent test frames;
(2.5.2) when the tracking state of the current test frame is determined to be an uncertain state, optimizing and updating the new _ filter of the classification filter;
(2.5.3) when the tracking state of the current test frame is determined to be the not-found state, the classification filter new_filter is not optimized or updated;
the specific process of the step (2.6) is as follows:
(2.6.1) acquiring information of a specified target pedestrian selected from the reference frame of the tracked video sequence through the steps (1.1) to (1.2), and using the obtained new _ filter for classification of a foreground and a background of the target of a subsequent test frame;
(2.6.2) repeating the steps (1.3) to (2.5) for each frame in the subsequent test frames until the last frame, thereby completing the identification and detection of the specified pedestrian target for each frame in the whole video sequence, and finally realizing the tracking of the single specified pedestrian target of the reference frame on the whole video sequence.
CN202111294661.6A 2021-11-03 2021-11-03 Pedestrian single-target tracking method based on online updating strategy and fusing pedestrian characteristics Pending CN114067240A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111294661.6A CN114067240A (en) 2021-11-03 2021-11-03 Pedestrian single-target tracking method based on online updating strategy and fusing pedestrian characteristics

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111294661.6A CN114067240A (en) 2021-11-03 2021-11-03 Pedestrian single-target tracking method based on online updating strategy and fusing pedestrian characteristics

Publications (1)

Publication Number Publication Date
CN114067240A true CN114067240A (en) 2022-02-18

Family

ID=80273653

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111294661.6A Pending CN114067240A (en) 2021-11-03 2021-11-03 Pedestrian single-target tracking method based on online updating strategy and fusing pedestrian characteristics

Country Status (1)

Country Link
CN (1) CN114067240A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115471787A (en) * 2022-08-09 2022-12-13 东莞先知大数据有限公司 Construction site object stacking detection method and device and storage medium
CN115471787B (en) * 2022-08-09 2023-06-06 东莞先知大数据有限公司 Method and device for detecting stacking of objects on site and storage medium

Similar Documents

Publication Publication Date Title
CN110427839B (en) Video target detection method based on multi-layer feature fusion
CN110097568B (en) Video object detection and segmentation method based on space-time dual-branch network
CN107424171B (en) Block-based anti-occlusion target tracking method
CN112184752A (en) Video target tracking method based on pyramid convolution
CN110175649B (en) Rapid multi-scale estimation target tracking method for re-detection
CN107633226B (en) Human body motion tracking feature processing method
CN110276264B (en) Crowd density estimation method based on foreground segmentation graph
CN110942471B (en) Long-term target tracking method based on space-time constraint
CN111882586B (en) Multi-actor target tracking method oriented to theater environment
CN112052802B (en) Machine vision-based front vehicle behavior recognition method
CN111582349B (en) Improved target tracking algorithm based on YOLOv3 and kernel correlation filtering
CN112651998B (en) Human body tracking algorithm based on attention mechanism and double-flow multi-domain convolutional neural network
CN110032952B (en) Road boundary point detection method based on deep learning
CN112308921B (en) Combined optimization dynamic SLAM method based on semantics and geometry
CN110555868A (en) method for detecting small moving target under complex ground background
CN110310305B (en) Target tracking method and device based on BSSD detection and Kalman filtering
CN113592894B (en) Image segmentation method based on boundary box and co-occurrence feature prediction
CN113362341B (en) Air-ground infrared target tracking data set labeling method based on super-pixel structure constraint
CN113888461A (en) Method, system and equipment for detecting defects of hardware parts based on deep learning
CN111754545A (en) Dual-filter video multi-target tracking method based on IOU matching
CN112329784A (en) Correlation filtering tracking method based on space-time perception and multimodal response
CN113920170A (en) Pedestrian trajectory prediction method and system combining scene context and pedestrian social relationship and storage medium
CN111429485B (en) Cross-modal filtering tracking method based on self-adaptive regularization and high-reliability updating
CN113052184A (en) Target detection method based on two-stage local feature alignment
CN115359407A (en) Multi-vehicle tracking method in video

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination