CN106952288B - Robust tracking method under long-term occlusion based on convolutional features and global-search detection - Google Patents
Robust tracking method under long-term occlusion based on convolutional features and global-search detection Download PDF
- Publication number
- CN106952288B CN106952288B CN201710204379.1A CN201710204379A CN106952288B CN 106952288 B CN106952288 B CN 106952288B CN 201710204379 A CN201710204379 A CN 201710204379A CN 106952288 B CN106952288 B CN 106952288B
- Authority
- CN
- China
- Prior art keywords
- target
- scale
- init
- feature
- tracking
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Fee Related
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/10—Image acquisition modality
- G06T2207/10016—Video; Image sequence
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20048—Transform domain processing
- G06T2207/20056—Discrete and fast Fourier transform, [DFT, FFT]
Landscapes
- Image Analysis (AREA)
Abstract
The present invention relates to a robust tracking method under long-term occlusion based on convolutional features and global-search detection. By using convolutional features and a multi-scale correlation filtering method in the tracking module, the feature representation ability of the tracked target's appearance model is enhanced, so that the tracking result is highly robust to factors such as illumination changes, target scale changes, and target rotation. Furthermore, by introducing a global-search detection mechanism, when long-term occlusion of the target causes tracking failure, the detection module can re-detect the target and recover the tracking module from the error, so that tracking can continue over long periods even when the target's appearance changes.
Description
Technical field
The invention belongs to the field of computer vision and relates to a target tracking method, in particular to a robust tracking method under long-term occlusion based on convolutional features and global-search detection.
Background technique
The main task of target tracking is to obtain the position and motion information of a specific target in a video sequence; it is widely applied in fields such as video surveillance and human-computer interaction. During tracking, factors such as illumination changes, complex backgrounds, and target rotation or scaling all increase the difficulty of the target tracking problem; in particular, when the target undergoes long-term occlusion, tracking failure becomes much more likely.
The method proposed in "Tracking-learning-detection, IEEE Transactions on Pattern Analysis and Machine Intelligence, 2012, 34(7): 1409-1422" (known as TLD) was the first to combine a traditional tracking algorithm with a detection algorithm, using the detection results to correct the tracking results and improving the reliability and robustness of the system. Its tracking algorithm is based on optical flow, and its detection algorithm generates a large number of detection windows; each detection window must be accepted by three detectors before it can become a final detection result. For the occlusion problem, TLD provides an effective solution and can perform long-term tracking of the target. However, TLD uses shallow hand-crafted features, which limits its ability to characterize the target, and the design of its detection algorithm is rather complex, so there is room for improvement.
Summary of the invention
Technical problems to be solved
To avoid the shortcomings of the prior art, the present invention proposes a robust tracking method under long-term occlusion based on convolutional features and global-search detection. It solves the problem that, during the tracking of a moving target in video, factors such as long-term occlusion or the target moving out of the field of view cause the appearance model to drift, which easily leads to tracking failure.
Technical solution
A robust tracking method under long-term occlusion based on convolutional features and global-search detection, characterized in that the steps are as follows:
Step 1 reads the first frame of the video and the initial position information [x, y, w, h] of the target, where x, y are the horizontal and vertical coordinates of the target center and w, h are the width and height of the target. The coordinate point corresponding to (x, y) is denoted P; the initial target region of size w × h centered on P is denoted Rinit; the scale of the target is denoted scale and is initialized to 1.
Step 2 determines, centered on P, a region Rbkg containing both target and background information; the size of Rbkg is M × N, with M = 2w, N = 2h. Using VGGNet-19 as the CNN model, the convolutional feature map ztarget_init is extracted from Rbkg at the fifth convolutional layer (conv5-4). The target model of the tracking module is then constructed from ztarget_init, with t ∈ {1, 2, ..., T}, where T is the number of CNN channels; the calculation is as follows:
Here the capitalized variables are the frequency-domain representations of the corresponding lower-case variables; the Gaussian filtering template is a two-dimensional Gaussian function of the independent variables m and n, with m ∈ {1, 2, ..., M}, n ∈ {1, 2, ..., N}; σtarget is the bandwidth of the Gaussian kernel; ⊙ denotes element-wise multiplication; an overbar denotes the complex conjugate; and λ1 is an adjustment parameter (introduced to avoid a zero denominator), set to 0.0001.
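The filter-construction formula itself appears as an image in the original and is lost in this extraction. The symbols described around it (a Gaussian label with bandwidth σtarget, element-wise frequency-domain products, complex conjugation, and the regularizer λ1 in the denominator) match the standard ridge-regression correlation-filter form, so the sketch below assumes that form; `build_filter`, its argument names, and the exact centering of the Gaussian are illustrative assumptions, not the patent's own formula.

```python
import numpy as np

def build_filter(z, sigma, lam=1e-4):
    # z: feature maps of shape (T, M, N), one slice per CNN channel.
    T, M, N = z.shape
    # Gaussian label centered on the patch; sigma plays the role of sigma_target.
    m = np.arange(1, M + 1).reshape(-1, 1)
    n = np.arange(1, N + 1).reshape(1, -1)
    g = np.exp(-(((m - M / 2) ** 2 + (n - N / 2) ** 2)) / (2.0 * sigma ** 2))
    G = np.fft.fft2(g)                    # label in the frequency domain
    Z = np.fft.fft2(z, axes=(1, 2))       # per-channel FFT of the features
    # Ridge-regression filter: channel-shared denominator plus lambda_1 (= lam).
    denom = np.sum(Z * np.conj(Z), axis=0).real + lam
    W = np.conj(G)[None, :, :] * Z / denom[None, :, :]
    return W

# Toy usage: T = 3 channels on an 8x8 search window.
W = build_filter(np.random.rand(3, 8, 8), sigma=2.0)
```

The denominator is summed over all T channels before λ1 is added, so a single regularized energy term normalizes every per-channel filter, which is the usual multi-channel variant of this construction.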
Step 3 extracts, centered on P, image sub-blocks at S different scales, with S set to 33. The size of each sub-block is w × h × s, where the variable s is the scale factor of the sub-block, s ∈ [0.7, 1.4]. The HOG features of each sub-block are then extracted separately; after concatenation they become an S-dimensional HOG feature vector, named here the scale feature vector and denoted zscale_init. The scale model Wscale of the tracking module is then constructed from zscale_init; the calculation is similar to that of step 2 (with the scale feature vector replacing the convolutional feature map), specifically as follows:
Here s' is the independent variable of the Gaussian function, s' ∈ {1, 2, ..., S}; σscale is the bandwidth of the Gaussian kernel; and λ2 is an adjustment parameter, set to 0.0001.
Step 4 extracts a gray-level feature from the initial target region Rinit; the resulting gray-level feature representation is a two-dimensional matrix, named here the target appearance representation matrix and denoted Ak, where the subscript k is the current frame number (k = 1 initially). The filtering model D of the detection module is then initialized to A1, i.e. D = A1, and the history set of target representation matrices Ahis is initialized. Ahis stores the target appearance representation matrices of the current frame and of every previous frame, i.e. Ahis = {A1, A2, ..., Ak}; initially Ahis = {A1}.
Step 5 reads the next frame; still centered on P, the scaled target search region of size Rbkg × scale is extracted. The convolutional features of the target search region are then extracted with the CNN of step 2 and resampled to the size of Rbkg by bilinear interpolation, giving the convolutional feature map ztarget_cur of the current frame; the target model is then used to compute the target confidence map ftarget, as follows:
Here the inverse Fourier transform is applied. Finally, the coordinates of P are updated: (x, y) is set to the coordinates corresponding to the maximum response value in ftarget:
Step 6 extracts, centered on P, image sub-blocks at S different scales and then extracts the HOG features of each sub-block separately; after concatenation this yields the current frame's scale feature vector zscale_cur (computed in the same way as zscale_init in step 3). The scale model Wscale is then used to compute the scale confidence map:
Finally, the scale of the target is updated, as follows:
At this point the output of the tracking module in the current (k-th) frame is available: the image sub-block TPatchk centered on the point P with coordinates (x, y) and of size Rinit × scale. In addition, the maximum response value in the already-computed ftarget is abbreviated TPeakk, i.e. TPeakk = ftarget(x, y).
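The scale estimate of steps 3 and 6 is a one-dimensional analogue of the position filter: the filter response over the S scale candidates is computed and the candidate with the largest response wins. A sketch under that assumption — `best_scale` and the uniform spacing of the candidate factors are illustrative, since the patent gives only the range s ∈ [0.7, 1.4]:

```python
import numpy as np

def best_scale(w_scale, z_scale_cur, scale_factors):
    # 1-D correlation-filter response over the S scale candidates.
    F = np.fft.fft(z_scale_cur)
    f_scale = np.real(np.fft.ifft(w_scale * np.conj(F)))
    return float(scale_factors[int(np.argmax(f_scale))])

S = 33
factors = np.linspace(0.7, 1.4, S)   # s in [0.7, 1.4], S = 33 as in step 3
scale = best_scale(np.ones(S, dtype=complex), np.random.rand(S), factors)
```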
In step 7 the detection module convolves the filtering model D with the entire image of the current frame in a global-search manner, computing the similarity between D and every position of the current frame. The j highest response values are taken (j is set to 10), and, centered on the location point corresponding to each of these j values, j image sub-blocks of size Rinit × scale are extracted. With these j sub-blocks as elements, an image sub-block set DPatchesk is generated: the output of the detection module in the k-th frame.
In step 8, for each image sub-block in the set DPatchesk output by the detection module, its pixel overlap ratio with the tracking module's output TPatchk is computed, yielding j values, of which the highest is retained. If this highest overlap is below the threshold (set to 0.05), the target is judged to be completely occluded; the learning rate β of the tracking module's model update must then be suppressed, and the method goes to step 9. Otherwise the update uses the initial learning rate βinit (set to 0.02) and the method goes to step 10. β is calculated as follows:
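The formula for β is an image in the original; only the two regimes survive in the text (suppress below the threshold, βinit otherwise). The sketch below assumes the simplest reading — β drops to 0 under full occlusion — and uses intersection-over-union for the "pixel overlap ratio", which the patent does not spell out; both choices are assumptions:

```python
def overlap_ratio(a, b):
    # Boxes given as (x, y, w, h) with (x, y) the top-left corner.
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    iw = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    ih = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def learning_rate(best_overlap, beta_init=0.02, thresh=0.05):
    # Suppress the model update entirely under full occlusion (assumed beta = 0).
    return beta_init if best_overlap >= thresh else 0.0
```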
Step 9 extracts, centered on each image sub-block in DPatchesk, j target search regions of size Rbkg × scale; for each region, the convolutional feature map is extracted and the target confidence map is computed by the method of step 5, yielding the maximum response value of each of the j regions. These j responses are compared and the largest is denoted DPeakk. If DPeakk is greater than TPeakk, the coordinates of P are updated again: (x, y) is set to the coordinates corresponding to DPeakk, and the target scale feature vector and the target scale are recomputed (by the method of step 6).
In step 10 the optimal position center of the target in the current frame is determined to be P, and the optimal scale is determined to be scale. The new target region Rnew is marked in the image: the rectangle centered on P whose width and height are w × scale and h × scale, respectively. In addition, the already-computed convolutional feature map of the optimal target position center P is abbreviated ztarget; likewise, the scale feature vector of the optimal target scale is abbreviated zscale.
Step 11 uses ztarget and zscale, together with the target model and the scale model Wscale of the tracking module built in the previous frame, to update the models by weighted sums, as follows:
Here β is the learning rate obtained in step 8.
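The update equations of step 11 are images in the original; "weighted sum" with learning rate β almost certainly denotes the usual linear interpolation between the old and new models, which is the form sketched here (an assumption, since the exact formula is lost):

```python
import numpy as np

def weighted_update(model_old, model_new, beta):
    # model = (1 - beta) * model_old + beta * model_new
    return (1.0 - beta) * model_old + beta * model_new

updated = weighted_update(np.array([1.0, 1.0]), np.array([3.0, 5.0]), 0.5)
```

With β = 0 (full occlusion, step 8) the old model is kept unchanged, which is exactly the drift-suppression behavior the description claims.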
Step 12 extracts the gray-level feature of the new target region Rnew to obtain the target appearance representation matrix Ak of the current frame, and adds Ak to the history set Ahis. If the number of elements in Ahis exceeds c (set to 20), c elements are randomly selected from Ahis and stacked into a three-dimensional matrix Ck, each slice Ck(:, :, i) being one element of Ahis (i.e. a two-dimensional matrix Ak); otherwise all elements of Ahis are used to generate Ck. Ck is then averaged to obtain a two-dimensional matrix, which becomes the detection module's new filtering model D, computed as follows:
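Step 12's averaging can be sketched directly; `rebuild_detection_filter` is a hypothetical name, and the stacking axis is a free implementation choice, but the sample-then-average logic follows the text:

```python
import random
import numpy as np

def rebuild_detection_filter(a_his, c=20, rng=random):
    # Sample at most c stored appearance matrices, stack them into the
    # 3-D matrix C_k, and average the slices to get the new 2-D model D.
    pool = rng.sample(a_his, c) if len(a_his) > c else list(a_his)
    return np.mean(np.stack(pool, axis=2), axis=2)

D = rebuild_detection_filter([np.ones((4, 4)), 3.0 * np.ones((4, 4))])
```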
Step 13 checks whether all image frames in the video have been processed; the algorithm ends if so, otherwise the method goes to step 5 and continues.
Beneficial effect
The robust tracking method under long-term occlusion based on convolutional features and global-search detection proposed by the present invention comprises a tracking module and a detection module that cooperate during tracking. The tracking module mainly uses the convolutional features of the target extracted by a convolutional neural network (Convolutional Neural Network, CNN) to construct a robust target model, builds a scale model from histogram of oriented gradients (Histogram of Oriented Gradient, HOG) features, and determines the target's position center and scale with a correlation filtering method. The detection module extracts gray-level features to build a target filtering model, which is used in a global-search manner to quickly detect the target in the entire image and to judge the occurrence of occlusion: once the target is completely occluded (or other factors cause a sharp change in target appearance), the detection module corrects the target position using the detection result and suppresses the model update of the tracking module, preventing the introduction of unnecessary noise that would cause model drift and tracking failure.
Superiority: by using convolutional features and a multi-scale correlation filtering method in the tracking module, the feature representation ability of the tracked target's appearance model is enhanced, making the tracking result highly robust to factors such as illumination changes, target scale changes, and target rotation; furthermore, the introduced global-search detection mechanism allows the detection module to re-detect the target when long-term occlusion causes tracking failure, recovering the tracking module from the error, so that tracking can continue over long periods even when the target's appearance changes.
Detailed description of the invention
Fig. 1: flow chart of the robust tracking method under long-term occlusion based on convolutional features and global-search detection
Specific embodiment
Now, in conjunction with the embodiment and the accompanying drawing, the invention is further described: the specific embodiment carries out steps 1 to 13 of the technical solution above, with the parameter settings given there (S = 33, j = 10, occlusion threshold 0.05, βinit = 0.02, c = 20, λ1 = λ2 = 0.0001), following the flow shown in Fig. 1.
Claims (6)
1. A robust tracking method under long-term occlusion based on convolutional features and global-search detection, characterized in that the steps are as follows:
Step 1: read the first frame of the video and the initial position information [x, y, w, h] of the target, where x, y are the horizontal and vertical coordinates of the target center and w, h are the width and height of the target; the coordinate point corresponding to (x, y) is denoted P; the initial target region of size w × h centered on P is denoted Rinit; the scale of the target is denoted scale and is initialized to 1;
Step 2: centered on P, determine a region Rbkg containing target and background information; the size of Rbkg is M × N, with M = 2w, N = 2h; using VGGNet-19 as the CNN model, extract the convolutional feature map ztarget_init from Rbkg at the fifth convolutional layer, i.e. the conv5-4 layer; then construct the target model of the tracking module from ztarget_init, t ∈ {1, 2, ..., T}, T being the number of CNN channels, calculated as follows:
where the capitalized variables are the frequency-domain representations of the corresponding lower-case variables; the Gaussian filtering template has independent variables m and n, m ∈ {1, 2, ..., M}, n ∈ {1, 2, ..., N}; σtarget is the bandwidth of the Gaussian kernel; ⊙ denotes element-wise multiplication; an overbar denotes the complex conjugate; and λ1 is an adjustment parameter;
Step 3: centered on P, extract image sub-blocks at S different scales, with S set to 33; the size of each sub-block is w × h × s, the variable s being the scale factor of the sub-block, s ∈ [0.7, 1.4]; then extract the HOG features of each sub-block separately; after concatenation they become an S-dimensional HOG feature vector, named the scale feature vector and denoted zscale_init; then construct the scale model Wscale of the tracking module from zscale_init, calculated as follows:
where s' is the independent variable of the Gaussian function, s' ∈ {1, 2, ..., S}; σscale is the bandwidth of the Gaussian kernel; and λ2 is an adjustment parameter;
Step 4: extract a gray-level feature from the initial target region Rinit to obtain a two-dimensional matrix of the gray-level feature representation, named the target appearance representation matrix and denoted Ak, the subscript k being the current frame number, k = 1 initially; then initialize the filtering model D of the detection module to A1, i.e. D = A1, and initialize the history set of target representation matrices Ahis; Ahis stores the target appearance representation matrices of the current frame and of every previous frame, i.e. Ahis = {A1, A2, ..., Ak}, with Ahis = {A1} initially;
Step 5: read the next frame; still centered on P, extract the scaled target search region of size Rbkg × scale; then extract the convolutional features of the target search region with the CNN of step 2, resample them to the size of Rbkg by bilinear interpolation to obtain the convolutional feature map ztarget_cur of the current frame, and use the target model to compute the target confidence map ftarget, calculated as follows:
where the inverse Fourier transform is applied; finally, update the coordinates of P: (x, y) is set to the coordinates corresponding to the maximum response value in ftarget:
Step 6: centered on P, extract image sub-blocks at S different scales, then extract the HOG features of each sub-block separately; after concatenation this yields the scale feature vector zscale_cur of the current frame, computed in the same way as zscale_init in step 3; then use the scale model Wscale to compute the scale confidence map:
finally, update the scale of the target, calculated as follows:
the output of the tracking module in the current, k-th, frame is obtained: the image sub-block TPatchk centered on the point P with coordinates (x, y) and of size Rinit × scale; in addition, the maximum response value in the computed ftarget is abbreviated TPeakk, i.e. TPeakk = ftarget(x, y);
Step 7: the detection module convolves the filtering model D with the entire image of the current frame in a global-search manner, computing the similarity between D and every position of the current frame; the j highest response values are taken and, centered on the location point corresponding to each of the j values, j image sub-blocks of size Rinit × scale are extracted; with the j sub-blocks as elements, an image sub-block set DPatchesk is generated, i.e. the output of the detection module in the k-th frame;
Step 8: compute the pixel overlap ratio between each image sub-block in the set DPatchesk output by the detection module and the tracking module's output TPatchk, obtaining j values, the highest of which is retained; if it is below the threshold, the target is judged to be completely occluded, the learning rate β of the tracking module's model update is suppressed, and the method goes to step 9; otherwise the update uses the initial learning rate βinit and the method goes to step 10;
β is calculated as follows:
Step 9: centered on each image sub-block in DPatchesk, extract j target search regions of size Rbkg × scale; for each region, extract the convolutional feature map and compute the target confidence map by the method of step 5, obtaining the maximum response value of each of the j regions; the largest of the j responses is denoted DPeakk; if DPeakk is greater than TPeakk, the coordinates of P are updated again, (x, y) being set to the coordinates corresponding to DPeakk, and the target scale feature vector and the target scale scale are recomputed by the method of step 6;
Step 10: the optimal position center of the target in the current frame is determined to be P, and the optimal scale is determined to be scale; the new target region Rnew is marked in the image: the rectangle centered on P with width w × scale and height h × scale; in addition, the computed convolutional feature map of the optimal target position center P is abbreviated ztarget; likewise, the scale feature vector of the optimal target scale scale is abbreviated zscale;
Step 11: using ztarget, zscale and the target model and scale model Wscale of the tracking module built in the previous frame, update the models respectively by weighted sums, calculated as follows:
Wscale=Wscale_new;
Step 12: extract the gray-level feature of the new target region Rnew to obtain the target appearance representation matrix Ak of the current frame, and add Ak to the history set Ahis; if the number of elements in Ahis exceeds c, randomly select c elements from Ahis to generate a three-dimensional matrix Ck, each slice Ck(:, :, i) being one element of Ahis, i.e. a two-dimensional matrix Ak; otherwise generate Ck from all elements of Ahis; then average Ck to obtain a two-dimensional matrix, which serves as the detection module's new filtering model D, calculated as follows:
Step 13: the algorithm ends if all image frames in the video have been processed; otherwise go to step 5 and continue.
2. according to claim 1 based on convolution feature and global search detect it is long when block robust tracking method, it is special
Sign is: the adjusting parameter λ1And λ2It is set as 0.0001.
3. according to claim 1 based on convolution feature and global search detect it is long when block robust tracking method, it is special
Sign is: the j value is set as 10.
4. according to claim 1 based on convolution feature and global search detect it is long when block robust tracking method, it is special
Sign is: the threshold valueIt is set as 0.05.
5. according to claim 1 based on convolution feature and global search detect it is long when block robust tracking method, it is special
Sign is: the initial learning rate βinitIt is set as 0.02.
6. The long-term occlusion-robust tracking method based on convolutional features and global-search detection according to claim 1, characterized in that: c is set to 20.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710204379.1A CN106952288B (en) | 2017-03-31 | 2017-03-31 | Long-term occlusion-robust tracking method based on convolutional features and global-search detection |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106952288A CN106952288A (en) | 2017-07-14 |
CN106952288B true CN106952288B (en) | 2019-09-24 |
Family
ID=59475259
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710204379.1A Expired - Fee Related CN106952288B (en) | 2017-03-31 | 2017-03-31 | Long-term occlusion-robust tracking method based on convolutional features and global-search detection |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106952288B (en) |
Families Citing this family (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107452022A (en) * | 2017-07-20 | 2017-12-08 | 西安电子科技大学 | A kind of video target tracking method |
CN107644430A (en) * | 2017-07-27 | 2018-01-30 | 孙战里 | Target following based on self-adaptive features fusion |
CN107491742B (en) * | 2017-07-28 | 2020-10-23 | 西安因诺航空科技有限公司 | Long-term stable target tracking method for unmanned aerial vehicle |
CN108734151B (en) * | 2018-06-14 | 2020-04-14 | 厦门大学 | Robust long-range target tracking method based on correlation filtering and depth twin network |
CN110276782B (en) * | 2018-07-09 | 2022-03-11 | 西北工业大学 | Hyperspectral target tracking method combining spatial spectral features and related filtering |
CN109271865B (en) * | 2018-08-17 | 2021-11-09 | 西安电子科技大学 | Moving target tracking method based on scattering transformation multilayer correlation filtering |
CN109308469B (en) * | 2018-09-21 | 2019-12-10 | 北京字节跳动网络技术有限公司 | Method and apparatus for generating information |
CN109410249B (en) * | 2018-11-13 | 2021-09-28 | 深圳龙岗智能视听研究院 | Self-adaptive target tracking method combining depth characteristic and hand-drawn characteristic |
CN109596649A (en) * | 2018-11-29 | 2019-04-09 | 昆明理工大学 | A kind of method and device that host element concentration is influenced based on convolutional network coupling microalloy element |
CN109754424B (en) * | 2018-12-17 | 2022-11-04 | 西北工业大学 | Correlation filtering tracking algorithm based on fusion characteristics and self-adaptive updating strategy |
CN109740448B (en) * | 2018-12-17 | 2022-05-10 | 西北工业大学 | Aerial video target robust tracking method based on relevant filtering and image segmentation |
CN111260687B (en) * | 2020-01-10 | 2022-09-27 | 西北工业大学 | Aerial video target tracking method based on semantic perception network and related filtering |
CN111652910B (en) * | 2020-05-22 | 2023-04-11 | 重庆理工大学 | Target tracking algorithm based on object space relationship |
CN112762841A (en) * | 2020-12-30 | 2021-05-07 | 天津大学 | Bridge dynamic displacement monitoring system and method based on multi-resolution depth features |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105631895A (en) * | 2015-12-18 | 2016-06-01 | 重庆大学 | Temporal-spatial context video target tracking method combining particle filtering |
CN105741316A (en) * | 2016-01-20 | 2016-07-06 | 西北工业大学 | Robust target tracking method based on deep learning and multi-scale correlation filtering |
CN106326924A (en) * | 2016-08-23 | 2017-01-11 | 武汉大学 | Object tracking method and object tracking system based on local classification |
Non-Patent Citations (3)
Title |
---|
CNNTracker: Online discriminative object tracking via deep convolutional neural network;Yan Chen等;《Applied Soft Computing》;20160131;第38卷;第1088-1098页 * |
Hierarchical convolutional features for visual tracking;Chao Ma等;《2015 IEEE International Conference on Computer》;20151213;第3074-3082页 * |
Tracking Human-like Natural Motion Using Deep Recurrent Neural Networks;Youngbin Park等;《arXiv: Computer Vision and Pattern Recognition》;20160415;第1-8页 * |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106952288B (en) | Long-term occlusion-robust tracking method based on convolutional features and global-search detection | |
CN109800628B (en) | Network structure for enhancing detection performance of SSD small-target pedestrians and detection method | |
Yuan et al. | Robust visual tracking with correlation filters and metric learning | |
CN104574347B (en) | Geometric positioning accuracy evaluation method for in-orbit satellite imagery based on multi-source remote sensing data | |
CN110119728A (en) | Remote sensing images cloud detection method of optic based on Multiscale Fusion semantic segmentation network | |
CN111795704A (en) | Method and device for constructing visual point cloud map | |
CN107316316A (en) | Target tracking method based on adaptive fusion of multiple features and kernelized correlation filtering | |
CN106570893A (en) | Rapid stable visual tracking method based on correlation filtering | |
CN103886325B (en) | Cyclic matrix video tracking method with partition | |
CN104517095B (en) | Head segmentation method based on depth images | |
CN112967341B (en) | Indoor visual positioning method, system, equipment and storage medium based on live-action image | |
CN112232134B (en) | Human body posture estimation method based on hourglass network and attention mechanism | |
CN111882546A (en) | Weak supervised learning-based three-branch convolutional network fabric defect detection method | |
Li et al. | Deformable dictionary learning for SAR image change detection | |
CN105469428B (en) | Small-target detection method based on morphological filtering and SVD | |
CN115527050A (en) | Image feature matching method, computer device and readable storage medium | |
Gao et al. | Occluded person re-identification based on feature fusion and sparse reconstruction | |
Xie et al. | A method of small face detection based on CNN | |
CN110276782A (en) | Hyperspectral target tracking method combining spatial-spectral features and correlation filtering | |
CN113096104B (en) | Training method and device of target segmentation model and target segmentation method and device | |
CN116385915A (en) | Water surface floater target detection and tracking method based on space-time information fusion | |
CN115082854A (en) | Pedestrian searching method oriented to security monitoring video | |
CN113112522A (en) | Twin network target tracking method based on deformable convolution and template updating | |
CN112862002A (en) | Training method of multi-scale target detection model, target detection method and device | |
Luo et al. | Real-time pedestrian detection method based on improved YOLOv3 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee | ||
Granted publication date: 20190924 |