CN109993775A - Single-object tracking method based on feature compensation
Single-object tracking method based on feature compensation
- Publication number
- CN109993775A CN109993775A CN201910258571.8A CN201910258571A CN109993775A CN 109993775 A CN109993775 A CN 109993775A CN 201910258571 A CN201910258571 A CN 201910258571A CN 109993775 A CN109993775 A CN 109993775A
- Authority
- CN
- China
- Prior art keywords
- target
- pixel
- frame
- histograms
- image
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/20—Analysis of motion
- G06T7/207—Analysis of motion for motion estimation over a hierarchy of resolutions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/20—Analysis of motion
- G06T7/246—Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/10—Image acquisition modality
- G06T2207/10016—Video; Image sequence
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20021—Dividing image into blocks, subimages or windows
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20076—Probabilistic image processing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20084—Artificial neural networks [ANN]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20112—Image segmentation details
- G06T2207/20132—Image cropping
Landscapes
- Engineering & Computer Science (AREA)
- Multimedia (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Image Analysis (AREA)
Abstract
The invention discloses a video target tracking method based on feature compensation among a posterior pixel colour histogram, a histogram of oriented gradients (HOG), and convolutional neural network features: simple features are used in simple scenes to guarantee real-time performance, while more complex features are used in complex scenes to guarantee accuracy. By combining the two features of the posterior pixel histogram and the histogram of oriented gradients, the resulting response map adapts well to relatively simple video scenes. A classifier is trained to judge when the response obtained by fusing the former two features is unreliable. Finally, according to the classifier's decision, the method chooses whether to switch in a slower but more robust convolutional-neural-network tracker, which corrects a drifting prediction or recovers a lost target. The invention improves the precision with which the target's size and position are estimated in video, and it adapts well to long-term tracking tasks, reaching the level required by practical applications.
Description
Technical field
The invention belongs to the technical field of single-object tracking in computer vision, and in particular relates to a single-object tracking method based on feature compensation.
Background technique
In computer vision, tracking has always been a core problem, widely applied in video surveillance, human-computer interaction, robot visual perception, military guidance, and many other areas. In single-object tracking, the position and size of the target are manually annotated with a rectangle in the first frame of the video; the tracker must then keep following that manually annotated object with a rectangle in every subsequent frame. The related task of object detection scans and searches the whole frame of a still image or dynamic video for targets; in short, detection is concerned with localisation and classification, whereas tracking is concerned with locking onto a particular person or object in real time, regardless of what it is tracking. Because tracking must run in real time, whole-frame search is computationally far too expensive and clearly unsuitable for this scenario; fortunately the tracked object is continuous in time and space, so the search range can be greatly reduced. However, precisely because of this continuity, complex scenes containing illumination changes, appearance deformation, fast motion, occlusion, and background clutter force most trackers to continually update their own model during the tracking task. Once the model learns background information, errors arise, accumulate continuously, and eventually cause the target to be lost.
At present, the vast majority of mainstream tracking algorithms perform short-term tracking, and they mainly suffer from the following defects:
(1) Poor robustness
The model cannot recover after losing the target. Such algorithms focus on the precision of the target's position and size, lack robustness, cannot adapt to long-term tracking tasks, and therefore cannot really be used well in practical scenes.
(2) Slow speed
Both end-to-end neural-network trackers and trackers that combine deep convolutional feature maps with correlation filtering achieve high accuracy but spend a high computation time, so they are seldom applied in practical scenes. Other traditional correlation-filter trackers reach very high speed but perform poorly in accuracy and robustness.
(3) Error accumulation
Because of the various disturbances in video scenes, the model cannot correctly locate the target in every frame, so template updates inevitably learn background or other interference information. This error accumulates continuously, and the process is irreversible.
To overcome the above defects and be well applied in practical scenes, the entry point of this invention remains long-term tracking: improving robustness as much as possible, and adapting to long tracking tasks while keeping the speed real-time.
Summary of the invention
The purpose of the present invention is to provide a video target tracking method based on feature compensation among a posterior pixel colour histogram, histogram of oriented gradients, and convolutional neural network features, so as to construct a tracker that is robust and fast: while ensuring accuracy and improving robustness, it fully guarantees a high frame rate. The invention improves the precision with which the target's size and position are estimated in video, and it adapts well to long-term tracking tasks, reaching the level required by practical applications.
The technical scheme adopted by the invention is a feature-compensation video target tracking method based on feature fusion, comprising the following steps:
S1, establish the colour-histogram branch of the target tracking model:
S11, call the OpenCV toolkit and, before the tracking task starts, crop out a target sub-image E containing background information, based on the manually annotated target image;
S12, separate the target sub-image (with background information) into a foreground region and a background region in a fixed proportion of the target size. Meanwhile, compress the pixel scale so that pixel values lie in the integer range 0-32, and use a foreground mask and a background mask of identical size to compute, for each pixel value, the pixel ratio in the corresponding foreground and background regions, i.e. the foreground pixel ratio ρ(O) and the background pixel ratio ρ(B):
ρ(O) = N(O)/|O|; (1-1)
ρ(B) = N(B)/|B|; (1-2)
where O denotes the image region of the foreground O and B the image region of the background B; N(O) is the number of non-zero pixel values in the image region of foreground O, and N(B) the number of non-zero pixel values in the image region of background B; |O| is the total number of pixels in the image region of foreground O, and |B| the total number in the image region of background B. From formulas (1-1) and (1-2), the weight β_t of the current-frame posterior pixel colour histogram template is calculated:
β_t = ρ_t(O)/(ρ_t(O) + ρ_t(B) + λ); (2)
where t denotes the current frame and λ is a hyperparameter;
S13, in the next video frame, within the image range centred on the previous frame's target centre as the search region, crop a sub-image e as in S12 and compress its pixel scale to obtain ψ. From formulas (1-1), (1-2) and (2), the weight β_{t-1} of the previous frame's posterior pixel colour histogram template is obtained, and the colour-histogram response f_hist is finally computed with an integral image:
f_hist(ψ_t) = (1/|H|) Σ_{u∈H} β_{t-1}^T ψ_t[u]; (3)
where ψ is the M-channel pixel-compressed sub-image, defined on the cropped picture e of the current frame; ψ_t is the current frame's M-channel pixel-compressed sub-image; H denotes the integer grid corresponding to each pixel of the picture; u denotes a cell of the grid H; ψ[u] is the corresponding pixel on ψ; and superscript T is matrix transposition;
S14, each time the tracking of a frame is completed, update the weight β_t of the posterior pixel histogram template at the predicted position of the current frame, i.e. update the foreground pixel ratio ρ(O) and the background pixel ratio ρ(B) separately, obtaining the updated current-frame ratios ρ_t(O) and ρ_t(B):
ρ_t(O) = (1 - η_hist) ρ_{t-1}(O) + η_hist ρ′_t(O)
ρ_t(B) = (1 - η_hist) ρ_{t-1}(B) + η_hist ρ′_t(B); (4)
where ρ′_t(O) is the pixel ratio in the foreground region of the current frame, ρ′_t(B) the pixel ratio in the background region of the current frame, ρ_{t-1}(O) and ρ_{t-1}(B) the corresponding ratios of the previous frame, and η_hist the update weight of the pixel ratios;
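The colour-histogram scoring of S13 can be sketched as follows. This is a minimal single-channel illustration, not the patented implementation: `hist_response` is a hypothetical helper that looks up the per-bin weight β for every pixel of the search patch and then averages it over every target-sized window with an integral image.

```python
import numpy as np

def hist_response(search_gray, beta, target_h, target_w, n_bins=32):
    """Per-pixel foreground likelihood via histogram lookup, then a box
    filter (integral image) the size of the target gives the response map."""
    # Compress the pixel scale: 256 gray levels -> n_bins bins.
    bins = (search_gray.astype(np.int64) * n_bins) // 256
    per_pixel = beta[bins]                       # lookup: weight of each pixel's bin
    # Integral image (zero-padded first row/column) for O(1) box sums.
    ii = np.pad(per_pixel, ((1, 0), (1, 0))).cumsum(0).cumsum(1)
    H, W = per_pixel.shape
    h, w = target_h, target_w
    resp = (ii[h:H + 1, w:W + 1] - ii[h:H + 1, :W - w + 1]
            - ii[:H - h + 1, w:W + 1] + ii[:H - h + 1, :W - w + 1]) / (h * w)
    return resp                                  # one score per candidate window
```

The argmax of `resp` gives the candidate window whose pixels look most like the foreground histogram.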
S2, establish the histogram-of-oriented-gradients branch of the target tracking model:
S21, on the target image to be tracked selected in S11, crop another target-area sub-image E′ of different size but containing the same target with background information, extract its K-channel three-dimensional histogram-of-oriented-gradients (HOG) feature φ_k, and multiply it by the cosine window function from the OpenCV package; the numerator Â^i_t and denominator B̂_t of the HOG feature template are then computed in the frequency domain:
Â^i_t = ĝ* ⊙ φ̂^i; (5)
B̂_t = Σ_{i=1}^{K} φ̂^{i*} ⊙ φ̂^i; (6)
so that the template of channel i is ĥ^i = Â^i_t/(B̂_t + λ). Here the hatted variables are defined in the frequency domain and obtained by the discrete Fourier transform; u denotes a cell in the grid Γ, and Γ denotes the integer grid corresponding to φ_k; superscript i indexes each of the K channels; ĝ* is the conjugate of the Fourier transform of a Gaussian signal; * denotes conjugation in the frequency domain and ⊙ denotes element-wise multiplication; φ̂ is the Fourier transform of each channel of the HOG feature φ_k; K is the number of channels;
S22, apply the inverse Fourier transform to the HOG template ĥ obtained in S21 to get h[u]. In the next video frame, crop a sub-image e′ within the image range centred on the previous frame's target centre, extract the HOG feature φ of the current sub-image, and compute the current frame's HOG score f_hog with a linear function:
f_hog(φ, h) = Σ_{u∈Γ} h[u]^T φ[u]; (7)
S23, after completing the tracking of each frame, update the HOG feature template at the predicted position of the current frame, i.e. obtain the updated final signals Â^i_t and B̂_t:
Â^i_t = (1 - η_hog) Â^i_{t-1} + η_hog Â′^i_t
B̂_t = (1 - η_hog) B̂_{t-1} + η_hog B̂′_t; (8)
where Â′^i_t and B̂′_t are the current-frame signals computed from formulas (5) and (6), Â^i_{t-1} and B̂_{t-1} denote the previous frame's signals, Â^i_t and B̂_t denote the updated final signals, and η_hog is the update weight of the HOG template;
S3, fuse the features and establish a classifier:
S31, fuse the colour-histogram response f_hist obtained in S13 and the HOG score f_hog obtained in S22 through a linear function f(x):
f(x) = γ_hog f_hog(x) + γ_hist f_hist(x); (9)
where γ_hog is the weight of the HOG response and γ_hist the weight of the colour-histogram response; the coordinate of the point at which f(x) attains its maximum is the centre coordinate of the target;
S32, train a classifier from f and f_hog: select a batch of video sequences and, through the feature fusion of S31, output f and f_hog for each frame. Let the input of the data set be X = [max(f_hog); max(f)] and the output label be h′_θ, the ground truth of the data set: h′_θ is the integer 0 or 1, where 0 means the model's tracking box has drifted off the target and 1 means it has not. Let the logistic regression function h_θ denote the output of the classifier:
h_θ(X) = 1/(1 + e^{-θ^T X}); (10)
Divide the data into a training set and a validation set in the ratio 7:3. On the training set, use the cross-entropy loss function and gradient descent; after the iterations converge, the parameter θ of the logistic regression model in formula (10) is obtained. Then fine-tune the hyperparameters on the validation set: for several candidate values, compute the classification accuracy and select the value with the highest accuracy as the final hyperparameter, so that the classifier achieves good results on the validation set;
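The classifier of S32 is plain logistic regression trained with cross-entropy and gradient descent, which can be sketched as follows (the two-column input [max(f_hog); max(f)] is synthetic here, and the function names are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic(X, y, lr=0.5, iters=2000):
    """Batch gradient descent on the cross-entropy loss for formula (10).
    X: (n, 2) inputs [max(f_hog), max(f)]; y: 0/1 labels (1 = on target)."""
    Xb = np.hstack([np.ones((len(X), 1)), X])    # bias column
    theta = np.zeros(Xb.shape[1])
    for _ in range(iters):
        p = sigmoid(Xb @ theta)
        theta -= lr * Xb.T @ (p - y) / len(y)    # gradient of cross-entropy
    return theta

def classify(theta, x):
    """Classifier output h_theta; > 0.5 means the fused result is trusted."""
    return sigmoid(np.array([1.0, *x]) @ theta)
```

In use, frames with high fused responses would be labelled 1 and drifting frames 0, then θ is fit on the 7:3 split described above.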
S4, judge whether the convolutional-neural-network tracker needs to be switched in:
S41, input f and f_hog into the classifier obtained in S32 and take its output. From the continuous output of the classifier (i.e. formula (10)), 0.5 is selected as the threshold. When the output is greater than 0.5, the result of the fusion model is credible and there is no need to switch in the convolutional-neural-network tracker; when the output is less than 0.5, the result of the fusion model is not trusted, and the convolutional-neural-network tracker must be switched in;
S42, when the target response score predicted by the convolutional-neural-network tracker for the current frame is high, reuse S14 and S23 to update the posterior pixel histogram template and the HOG template respectively; then proceed to the tracking task of the next frame, until all video frames are completed.
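The switching logic of S41-S42 amounts to a per-frame decision. The sketch below assumes hypothetical tracker objects with `predict`/`update_templates` methods; it illustrates the control flow, not the patented code:

```python
def track_frame(frame, fusion_tracker, cnn_tracker, theta, classify, tau=0.5):
    """One frame of the compensation scheme: trust the fast fused model
    unless the classifier says its response is unreliable, then fall back
    to the CNN tracker (e.g. DaSiamRPN)."""
    box, f_max, fhog_max = fusion_tracker.predict(frame)
    if classify(theta, [fhog_max, f_max]) > tau:
        fusion_tracker.update_templates(frame, box)   # formulas (4) and (8)
        return box
    box, score = cnn_tracker.predict(frame)           # slower, more robust
    if score > cnn_tracker.confident_threshold:
        fusion_tracker.update_templates(frame, box)   # update only when confident
    return box
```

This mirrors the text: the templates are updated after every trusted frame, but after a CNN fallback only when the CNN's own response score is high, so tracking failures do not pollute the templates.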
The beneficial effects of the present invention are:
(1) The invention fuses multiple features and combines their characteristics, so it copes well with video scenes containing illumination change, object deformation, motion blur, and occlusion; in simple scenes it completes the tracking task quickly with simple features, and in complex scenes it switches to more robust features to reduce the influence of interference information.
(2) The invention adds a self-checking classifier, making the model more intelligent when switching features and updating feature templates; it suppresses the learning of invalid information and reduces error accumulation. Meanwhile, the classifier is relatively simple and does not need much computational overhead.
(3) The neural-network tracker selected by the invention does not update its template, so it does not learn interference information and performs well under target occlusion.
Detailed description of the invention
To explain the embodiments of the invention or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are only some embodiments of the invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic diagram of the foreground and background masks.
Fig. 2 is a schematic diagram of the posterior pixel histogram and its response map.
Fig. 3 shows the histogram of oriented gradients and its response map.
Fig. 4 is a schematic diagram of the single-object tracking algorithm based on feature compensation.
Fig. 5 is the accuracy-robustness distribution of each algorithm under the reset mechanism.
Specific embodiment
The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the drawings in the embodiments. Obviously, the described embodiments are only a part of the embodiments of the invention, not all of them. All other embodiments obtained by those of ordinary skill in the art from the embodiments of the invention without creative effort fall within the protection scope of the invention.
In the field of target tracking, the main difficulties are deformation, illumination change, fast motion, background clutter, in-plane rotation, scale change, occlusion, and leaving the field of view.
The detailed process is as follows:
S1, establish the colour-histogram branch of the target tracking model:
S11, since this method is built for the single-object tracking scenario, before the tracking task starts the OpenCV toolkit is called, the target to be tracked is selected with a manually annotated rectangle, and a target sub-image E with background information is cropped out. The model then distinguishes the target according to the characteristics of the selected target and its background, completing the subsequent tracking task. Under this scenario, whichever feature the model is based on, an initial feature template is generated, in the model's own way, from the image in the target box selected on the first frame of the video; it is used to match candidate regions of subsequent frames and thus predict the position and size of the target.
S12, on the basis of the first frame in which the target to be tracked has been framed, this colour-histogram model separates the target sub-image E (with background information) into a foreground region and a background region in a fixed proportion of the target size. Because the pixel value range is 0-255, computing with raw pixel values would cost a great deal of time, so the pixel scale is compressed; the scale chosen here is 8, i.e. computation is done in the integer range 0-32, which greatly raises the model's speed. With the help of a foreground mask of identical size (as shown in Fig. 1-a, a single-channel image whose white target region has value 1 and whose black background region has value 0) and a background mask (as shown in Fig. 1-b, a single-channel image whose black target region has value 0 and whose white background region has value 1), the pixel ratio of each pixel value in the two regions is computed, i.e. the foreground pixel ratio ρ(O) and the background pixel ratio ρ(B):
ρ(O) = N(O)/|O|; (1-1)
ρ(B) = N(B)/|B|; (1-2)
where O denotes the image region of the foreground O and B the image region of the background B; N(O) is the number of non-zero pixel values in the image region of foreground O, and N(B) the number in the image region of background B; |O| is the total number of pixels in the image region of foreground O, and |B| the total number in the image region of background B. Once the foreground and background pixel ratios have been obtained, the weight β_t of the current-frame posterior pixel colour histogram template can be computed:
β_t = ρ_t(O)/(ρ_t(O) + ρ_t(B) + λ); (2)
where t denotes the current frame and λ is a hyperparameter.
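Under the usual reading of formulas (1-1), (1-2) and (2) — per-bin pixel counts normalised by region size, then a regularised posterior weight — the computation can be sketched as follows (a Staple-style interpretation; `pixel_ratios` and `beta_weights` are illustrative names):

```python
import numpy as np

def pixel_ratios(patch_gray, fg_mask, n_bins=32):
    """Per-bin foreground/background pixel ratios (formulas (1-1), (1-2))
    after compressing 256 gray levels into n_bins bins (scale 8)."""
    bins = (patch_gray.astype(np.int64) * n_bins) // 256
    rho_o = np.bincount(bins[fg_mask], minlength=n_bins) / max(fg_mask.sum(), 1)
    rho_b = np.bincount(bins[~fg_mask], minlength=n_bins) / max((~fg_mask).sum(), 1)
    return rho_o, rho_b

def beta_weights(rho_o, rho_b, lam=1e-3):
    """Posterior per-bin weight beta_t = rho(O) / (rho(O) + rho(B) + lam),
    formula (2); lam is the hyperparameter regularising empty bins."""
    return rho_o / (rho_o + rho_b + lam)
```

Bins that occur mostly in the foreground get weights near 1, bins that occur mostly in the background get weights near 0.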
S13, after the posterior pixel histogram template has been established, in the next video frame a sub-image e is likewise cropped within the image range centred on the previous frame's target centre, and its pixel scale is compressed to obtain ψ. From formulas (1-1) and (1-2), the weight β_{t-1} of the previous frame's posterior pixel colour histogram template is obtained, as shown in Fig. 2-a, Fig. 2-b and Fig. 2-c, and the colour-histogram response f_hist is finally computed with an integral image:
f_hist(ψ_t) = (1/|H|) Σ_{u∈H} β_{t-1}^T ψ_t[u]; (3)
where ψ is the M-channel pixel-compressed sub-image, defined on the cropped picture e of the current frame; ψ_t is the current frame's M-channel pixel-compressed sub-image; H denotes the integer grid corresponding to each pixel of the picture; u denotes a cell of the grid H; ψ[u] is the corresponding pixel on ψ; and superscript T denotes matrix transposition.
S14, during online tracking, the scene in the video is constantly changing, subtly or drastically; for the colour-histogram feature, disturbances such as illumination change and motion blur are especially serious. Therefore, to better adapt to the many changes in the video scene, each time the tracking of a frame is completed, the weight β_t of the posterior pixel histogram template must be updated at the predicted position of the current frame, i.e. the foreground and background pixel ratios ρ(O) and ρ(B) are updated separately:
ρ_t(O) = (1 - η_hist) ρ_{t-1}(O) + η_hist ρ′_t(O)
ρ_t(B) = (1 - η_hist) ρ_{t-1}(B) + η_hist ρ′_t(B); (4)
where ρ′_t(O) is the pixel ratio in the foreground region of the current frame, ρ′_t(B) the pixel ratio in the background region of the current frame, ρ_{t-1}(O) and ρ_{t-1}(B) the corresponding ratios of the previous frame, and η_hist the update weight of the pixel ratios;
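Both template updates — formula (4) here and formula (8) in the HOG branch — are the same linear interpolation between the old template and the current-frame statistics, e.g.:

```python
import numpy as np

def ema_update(prev, current, eta):
    """Linear-interpolation (exponential moving average) template update
    used in formulas (4) and (8): new = (1 - eta) * prev + eta * current."""
    return (1.0 - eta) * prev + eta * current
```

A small eta makes the template change slowly, which damps the error accumulation discussed in the background section; repeated updates toward a stable value converge to it geometrically.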
S2, establish the histogram-of-oriented-gradients branch of the target tracking model:
S21, on the basis of the first frame in which the tracking target has been manually framed, a target-area sub-image E′ with background information is cropped out, and its K-channel three-dimensional histogram-of-oriented-gradients feature φ_k is extracted, as shown in Fig. 3-a and Fig. 3-b; it is multiplied by the cosine window function in the OpenCV package to suppress the influence of the periphery of the sub-image. The numerator and denominator of the HOG feature template are then computed:
Â^i_t = ĝ* ⊙ φ̂^i; (5)
B̂_t = Σ_{i=1}^{K} φ̂^{i*} ⊙ φ̂^i; (6)
and the template of channel i is ĥ^i = Â^i_t/(B̂_t + λ). Here the hatted variables are defined in the frequency domain and obtained through the discrete Fourier transform: the correlation-filter model contains cross-correlation operations, which would cost a very high computation time, so after the variables are Fourier-transformed, convolution in the time domain is converted into element-wise products in the frequency domain, greatly reducing the computation time. u denotes a cell in the grid Γ, and Γ denotes the integer grid corresponding to φ_k; superscript i indexes each of the K channels; ĝ* is the conjugate of the Fourier transform of a Gaussian signal; * denotes conjugation in the frequency domain and ⊙ denotes element-wise multiplication; φ̂ is the Fourier transform of each channel of φ_k; K is the number of channels.
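The speed argument above rests on the correlation theorem: circular cross-correlation in the spatial domain equals an element-wise product in the frequency domain. A one-dimensional check, assuming nothing beyond NumPy's FFT:

```python
import numpy as np

def circ_xcorr_spatial(a, b):
    """Brute-force circular cross-correlation: c[d] = sum_n a[n] * b[(n+d) % N]."""
    n = len(a)
    return np.array([sum(a[i] * b[(i + d) % n] for i in range(n))
                     for d in range(n)])

def circ_xcorr_fft(a, b):
    """Same result via the frequency domain: conj(A) element-wise times B."""
    return np.real(np.fft.ifft(np.conj(np.fft.fft(a)) * np.fft.fft(b)))
```

The spatial version is O(N^2) while the FFT version is O(N log N), which is why the templates in formulas (5)-(6) are kept in the Fourier domain.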
S22, after the HOG template ĥ is established, the inverse Fourier transform is applied to obtain h[u]. In the next video frame, a search sub-image e′ is cropped within the image range centred on the previous frame's target centre, the HOG feature φ of the current sub-image is extracted, and the current frame's HOG score f_hog can then be computed with a linear function, with the effect shown in Fig. 3-c:
f_hog(φ, h) = Σ_{u∈Γ} h[u]^T φ[u]; (7)
S23, in the online tracking stage, the histogram of oriented gradients is likewise disturbed as the target changes in the scene, and it is especially affected by object deformation. Therefore, after the tracking of each frame is completed, the HOG feature template must also be updated at the predicted position of the current frame:
Â^i_t = (1 - η_hog) Â^i_{t-1} + η_hog Â′^i_t
B̂_t = (1 - η_hog) B̂_{t-1} + η_hog B̂′_t; (8)
where Â′^i_t and B̂′_t are the current-frame signals computed from formulas (5) and (6), Â^i_{t-1} and B̂_{t-1} denote the previous frame's signals, Â^i_t and B̂_t denote the updated final signals, and η_hog is the update weight of the HOG template.
S3, fuse the features and establish a classifier:
S31, the colour-histogram feature is strongly affected by disturbances such as illumination change and motion blur in the scene, while the HOG feature is strongly affected by disturbances such as target deformation and fast motion. Fusing the two features therefore reduces the interference of these factors to some extent and improves the accuracy and robustness of the tracking model, so that during tracking it can predict the target's position and size more accurately and is less likely to lose the target. Here the colour-histogram response f_hist obtained in S13 and the HOG score f_hog obtained in S22 are fused through a linear function f(x):
f(x) = γ_hog f_hog(x) + γ_hist f_hist(x); (9)
where γ_hog is the weight of the HOG response and γ_hist the weight of the colour-histogram response; the coordinate of the point at which f(x) attains its maximum is the centre coordinate of the target.
S32, although the two fused features perform well in most scenes, for some of the more complex video scenes, such as background clutter, occlusion, and leaving the field of view, there is still considerable room for performance improvement. Therefore a more robust, better-performing neural-network tracker is added to improve the model. Considering that neural networks run slowly and current general-purpose hardware cannot meet the real-time requirement, the neural-network tracker is used only when the fusion model of the first two features cannot complete the tracking of the current frame well; this maximises the performance of the overall model. The key to meeting this demand is letting the fusion model know when it needs to switch to the neural-network tracker. By analysing how the response scores f, f_hist and f_hog change in different scenes (the bare symbols denote the mappings; the input (x) is omitted), it can be seen that when the target undergoes large deformation or occlusion, f and f_hog fluctuate strongly, so a classifier can be trained on these two values to serve as the flag for switching trackers.
A batch of video sequences is selected and, through the feature fusion of S31, f and f_hog are output for each frame. Let the input of the data set be X = [max(f_hog); max(f)] and the output label be h′_θ, the ground truth of the data set: h′_θ is the integer 0 or 1, where 0 means the model's tracking box has drifted off the target and 1 means it has not. Object detection has the concept of Intersection-over-Union (IoU), the overlap rate between the predicted image box and the ground-truth box; it is used here as the criterion for measuring whether the tracker has drifted off the target. Experiments with multiple values showed that 0.35 is an appropriate cut-off, i.e. h′_θ = 1 when IoU > 0.35, and h′_θ = 0 when IoU < 0.35. Let the logistic regression function h_θ denote the output of the classifier:
h_θ(X) = 1/(1 + e^{-θ^T X}); (10)
The data are divided into a training set and a validation set in the ratio 7:3. On the training set, the cross-entropy loss function and gradient descent are used; after the iterations converge, the weight θ in formula (10) is obtained. The hyperparameters are then fine-tuned on the validation set: for several candidate values, the classification accuracy is computed and the value with the highest accuracy is selected as the final hyperparameter, so that the classifier achieves good results on the validation set.
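The IoU labelling rule used to build the training set (h′ = 1 when IoU > 0.35) can be sketched for axis-aligned (x, y, w, h) boxes:

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two (x, y, w, h) boxes."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    ix = max(0, min(ax + aw, bx + bw) - max(ax, bx))   # overlap width
    iy = max(0, min(ay + ah, by + bh) - max(ay, by))   # overlap height
    inter = ix * iy
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def label(pred_box, gt_box, thresh=0.35):
    """Ground-truth label h' for the classifier: 1 = still on target."""
    return 1 if iou(pred_box, gt_box) > thresh else 0
```

Note that a half-overlapping box already falls to IoU = 1/3, just below the 0.35 cut-off, which matches the intent of treating such frames as drift.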
S4, judge whether the convolutional-neural-network tracker needs to be switched in:
S41, after the classifier model has been trained and fine-tuned in the previous step, it can judge, during the tracking stage, whether the fusion model of colour-histogram and HOG features can still adapt to the current video scene, and hence whether the neural-network tracker needs to be switched in. During tracking, the f_hog and f obtained from formulas (7) and (9) serve as the input of the classifier, i.e. formula (10), and its output serves as the switching flag. When tuning on the validation data in the previous step, a suitable threshold of 0.5 was selected: when the output is greater than this threshold, the result of the fusion model is trusted and no switch is made; when the output is less than this threshold, the result of the fusion model is not trusted, and the neural-network tracker is switched in. The neural-network tracker selected here is DaSiamRPN, which combines ideas from object detection with the RPN (Region Proposal Network) structure. It copes well with complex scenes, fits the target's size after deformation more accurately, and does not need to update the target template online, so there is no situation in which accumulated error pollutes the template.
S42, during the online tracking stage, the posterior pixel-histogram template and the histogram-of-oriented-gradients template must be updated with formulas (4) and (8) respectively, to adapt to scene changes in the video. Likewise, after the switched-in DaSiamRPN tracker completes the tracking task of the current frame, these two formulas are still used for the update. However, because DaSiamRPN can also fail, the templates are updated only when the target response score it predicts for the current frame is high. The method then proceeds to the tracking task of the next frame, until all video frames are completed. The whole tracking flow is shown in Figure 4.
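The confidence-gated update of S42 can be sketched as follows. The learning rate and the score threshold are illustrative values, since the text only states that the update runs when the response score is "higher":

```python
def maybe_update(template, new_estimate, response_score,
                 eta=0.02, min_score=0.9):
    """Blend the stored template toward the current frame's estimate,
    but only when the tracker's response score is high; otherwise keep
    the template frozen so a low-confidence frame cannot pollute it.
    The values of eta and min_score are illustrative assumptions."""
    if response_score < min_score:
        return template                       # low confidence: no update
    return [(1 - eta) * t + eta * n
            for t, n in zip(template, new_estimate)]

kept = maybe_update([1.0], [0.0], response_score=0.5)      # frozen
blended = maybe_update([1.0], [0.0], response_score=0.95)  # updated
```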
Embodiment
To assess the performance of the invention, experiments must be run on a test set of video sequences. The evaluation method, data set, and evaluation system of the VOT (Visual Object Tracking) challenge are adopted here. The data set contains 60 video sequences covering scenes such as occlusion, illumination change, target motion, scale change, camera motion, and the target leaving the field of view; several of these attributes may occur within one sequence, and the attributes differ from frame to frame, which allows a more accurate evaluation of the model. Before VOT was proposed, the popular evaluation protocol initialized the tracker on the first frame of a sequence and then let it run to the last frame. A tracker, however, may lose the target (fail) in the first few frames because of one or two of these factors, so the final evaluation would use only a very small part of the sequence, which is wasteful. VOT therefore proposes that the evaluation system should detect a failure whenever the tracker loses the target and reinitialize the tracker five frames after the failure, so that the data set is fully used.
First, consider the experiment scores under the reset mechanism, shown in Table 1:
Table 1. Scores of different algorithms under the reset mechanism
In Table 1, the A-R rank denotes the ranking index of accuracy (Accuracy) and robustness (Robustness). Overlap corresponds to accuracy: it is the overlap ratio between the target predicted by the tracker and the manually annotated ground-truth target, and a larger Overlap means a more accurate prediction. Failure evaluates tracking stability: the smaller the value, the better the stability. Compared with seven other tracking methods, this method ranks first in accuracy and third in stability. The scoring trend of all the algorithms in the table can also be seen more intuitively in Figure 5. In a real scene, however, there is no reset after a tracking failure, so an evaluation system without resets is clearly more relevant to practice; its experiment scores are given in Table 2:
Table 2. Scores of different algorithms without the reset mechanism
In Table 2, AUC (Area Under Curve, the area enclosed by the curve and the coordinate axes) is an index of algorithm quality: the larger the value, the better the algorithm's performance. The speed index FPS (Frames Per Second) is likewise better when larger. It can be seen that without the reset mechanism, i.e. when the scoring system does not relocate the target after a tracking failure, this method reaches the highest accuracy relative to the other seven methods, and is the fastest of the three algorithms ranked highest in accuracy. Tested on a machine configured with CPU: Intel Core i7-6700 and GPU: GeForce GT 730, the fastest speed obtained by the SiamFC method in the table is only 3 FPS, whereas this method reaches up to 30 FPS. Compared with the other methods, it therefore keeps its original speed while achieving higher accuracy, and is better suited to real scenes.
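For reference, the Overlap (accuracy) measure used by the VOT evaluation above is the intersection-over-union of the predicted and ground-truth boxes, which can be computed as:

```python
def overlap(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes given as
    (x, y, w, h); this is the Overlap accuracy measure of the VOT
    evaluation (1.0 = identical boxes, 0.0 = disjoint)."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    iw = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))  # intersection width
    ih = max(0.0, min(ay + ah, by + bh) - max(ay, by))  # intersection height
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0
```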
The above is merely a preferred embodiment of the present invention and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.
Claims (1)
1. A single-target tracking method based on feature compensation, characterized by comprising the following steps:
S1, establishing the color-histogram-feature branch of the target tracking model:
S11, before the target tracking task starts, calling the OpenCV toolkit on the basis of the manually annotated target image to cut out a target sub-image E that carries background information;
S12, separating the target sub-image with background information into a foreground region and a background region in a certain proportion according to the size of the target; meanwhile, scale-compressing the pixel values into the integer range 0-32 and, relying on foreground and background masks of identical size, computing for each pixel value the pixel ratio within the corresponding foreground and background regions, namely the foreground pixel ratio ρ(O) and the background pixel ratio ρ(B), whose expressions are:
ρ (O)=N (O)/| O |; (1-1)
ρ (B)=N (B)/| B |; (1-2)
wherein the symbol O denotes the image region of the foreground O and the symbol B denotes the image region of the background B; N(O) denotes the number of non-zero pixel values in the image region of the foreground O and N(B) the number in the image region of the background B; |O| denotes the total number of pixel values in the image region of the foreground O and |B| the total number in the image region of the background B; based on formulas (1-1) and (1-2), the weight β_t of the current frame's posterior pixel color-histogram template is calculated:
wherein t denotes the current frame and λ is a hyperparameter;
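As an illustration of S12 (not part of the claim language), the per-bin pixel ratios of formulas (1-1) and (1-2) can be sketched for a single-channel image as follows. The 32-level scale compression follows the claim, while the exact per-bin counting is an assumption based on the surrounding text:

```python
import numpy as np

def pixel_ratios(patch, fg_mask, bins=32):
    """Per-bin foreground/background pixel ratios in the spirit of
    formulas (1-1) and (1-2): compress pixel values to `bins` levels,
    count them under each mask, and divide by the region sizes |O|, |B|.
    Single-channel sketch; the claim's M-channel version repeats this
    per channel."""
    q = (patch.astype(np.int64) * bins) // 256    # scale compression to 0..bins-1
    fg = q[fg_mask]                               # pixels inside the foreground O
    bg = q[~fg_mask]                              # pixels inside the background B
    rho_o = np.bincount(fg, minlength=bins) / max(fg.size, 1)
    rho_b = np.bincount(bg, minlength=bins) / max(bg.size, 1)
    return rho_o, rho_b

patch = np.full((4, 4), 128, dtype=np.uint8)      # uniform mid-gray patch
mask = np.zeros((4, 4), dtype=bool)
mask[:2] = True                                   # top half = foreground
rho_o, rho_b = pixel_ratios(patch, mask)          # all mass lands in bin 16
```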
S13, in the next video frame, within the image range whose search-region center is the target center of the previous frame, cutting out a sub-image e as in S12 and scale-compressing its pixels to obtain ψ; obtaining, according to formulas (1-1), (1-2) and formula (2), the weight β_{t-1} of the previous frame's posterior pixel color-histogram template; and finally obtaining the color-histogram response f_hist with the integral-image formula:
wherein ψ is the M-channel pixel-compressed sub-image, defined on the cut image e of the current frame; ψ_t is the M-channel pixel-compressed sub-image of the current frame; H represents the integer range corresponding to each pixel of the image; u represents each corresponding cell in the grid H; ψ[u] is the corresponding pixel on ψ; and the superscript T denotes the matrix transpose;
S14, each time the tracking task of a frame is completed, at the predicted position of the current frame, updating the weight β_t of the posterior pixel-histogram template, that is, updating the foreground pixel ratio ρ(O) and the background pixel ratio ρ(B) separately, to obtain the updated pixel ratio ρ_t(O) of the current frame's foreground O and the updated pixel ratio ρ_t(B) of the current frame's background B:
ρ_t(O) = (1 - η_hist) ρ_{t-1}(O) + η_hist ρ'_t(O)
ρ_t(B) = (1 - η_hist) ρ_{t-1}(B) + η_hist ρ'_t(B);    (4)
wherein ρ'_t(O) is the pixel ratio in the image region of the current frame's foreground O, ρ'_t(B) is the pixel ratio in the image region of the current frame's background B, ρ_{t-1}(O) is the pixel ratio in the image region of the previous frame's foreground O, and ρ_{t-1}(B) is that of the previous frame's background B; η_hist is the weight of the pixel-ratio update;
S2, establishing the histogram-of-oriented-gradients-feature branch of the target tracking model:
S21, selecting with a rectangular box on the target image to be tracked in S11, cutting out another target-region sub-image E' that differs in size but again carries background information, and extracting its K-channel three-dimensional histogram-of-oriented-gradients feature φ_k; after multiplication by the cosine window function in the OpenCV package, the template of the histogram-of-oriented-gradients feature is calculated:
wherein the template is a variable defined in the frequency domain and obtained by the discrete Fourier transform; u represents each corresponding cell in the grid Γ, and Γ represents the integer range corresponding to each cell of φ_k; the superscript i denotes each of the K channels; the conjugated term is the conjugate of the Gaussian label signal after the Fourier transform; * denotes conjugation in the frequency domain and e denotes element-wise multiplication; the transformed term is each channel element of the histogram-of-oriented-gradients feature φ_k obtained by the Fourier transform; and K is the number of channels;
S22, applying the inverse Fourier transform to the histogram-of-oriented-gradients template obtained in S21 to obtain h[u]; in the next video frame, within the image range whose search-region center is the target center of the previous frame, cutting out a sub-image e' and extracting the histogram-of-oriented-gradients feature φ of the current sub-image; the current frame's histogram-of-oriented-gradients score f_hog is then calculated with a linear function:
f_hog(φ, h) = Σ_{u∈Γ} h[u]^T φ[u];    (7)
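Formula (7) is an inner product between the template h and the feature φ over all cells and channels; for a single candidate position it can be sketched as follows (illustrative, not claim language; in practice the score is evaluated densely over translations to get a response map):

```python
import numpy as np

def hog_score(h, phi):
    """Formula (7): sum over every cell u (and channel) of the inner
    product h[u]^T phi[u].  With h and phi stored as arrays of shape
    (cells_y, cells_x, channels) this is a single element-wise sum."""
    return float(np.sum(h * phi))

h = np.ones((2, 2, 3))            # toy template
phi = np.full((2, 2, 3), 0.5)     # toy HOG feature of the search patch
score = hog_score(h, phi)         # 12 terms of 0.5 each
```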
S23, after the tracking task of each frame is completed, at the predicted position of the current frame, updating the template of the histogram-of-oriented-gradients feature, that is, obtaining the updated final signals:
wherein the current-frame signals are computed separately from formula (6), the previous-frame signals are carried over from the last update, and blending the two gives the updated final signals; η_hog is the weight of the histogram-of-oriented-gradients template update;
S3, fusing the features and establishing a classifier:
S31, fusing the color-histogram response f_hist obtained in S13 and the histogram-of-oriented-gradients score f_hog obtained in S22 through a defined linear function f(x), i.e.:
f(x) = γ_hog f_hog(x) + γ_hist f_hist(x);    (9)
wherein γ_hog is the weight of the histogram-of-oriented-gradients response and γ_hist is the weight of the color-histogram response; the coordinate of the point at which f(x) takes its maximum value is the center coordinate of the target;
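The fusion of formula (9) and the subsequent peak search can be sketched as follows; the γ values here are illustrative assumptions, since the claim does not fix them numerically:

```python
import numpy as np

def fuse_and_locate(f_hog, f_hist, gamma_hog=0.7, gamma_hist=0.3):
    """Formula (9): weighted sum of the two response maps; the target
    centre is the coordinate of the fused map's maximum."""
    f = gamma_hog * f_hog + gamma_hist * f_hist
    idx = np.unravel_index(np.argmax(f), f.shape)
    return tuple(int(v) for v in idx)

f_hog = np.zeros((3, 3)); f_hog[1, 2] = 1.0    # HOG peak at (1, 2)
f_hist = np.zeros((3, 3)); f_hist[2, 0] = 1.0  # colour peak at (2, 0)
peak = fuse_and_locate(f_hog, f_hist)          # the heavier HOG weight wins
```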
S32, training a classifier with f and f_hog: selecting a batch of video sequences and, through the feature fusion in S31, outputting f and f_hog respectively; letting the input of the data set be X = [max(f_hog); max(f)] and the output label be h'_θ, the ground-truth value of the data set, where h'_θ is the integer 0 or 1, 0 indicating that the model's tracking box has deviated from the target and 1 indicating that it has not; letting the logistic regression function h_θ denote the output of the classifier:
dividing the data into a training set and a validation set in the ratio 7:3; on the training set, obtaining the parameter θ of the logistic regression model in formula (10) through the cross-entropy loss function and the gradient-descent algorithm after the iterative computation converges; then fine-tuning the hyperparameters with the validation-set data: setting the parameters to different values, computing the proportion of correct results under each value, and selecting the value with the highest accuracy as the final parameter value, so that the classifier achieves good classification results on the validation set;
S4, judging whether the convolutional-neural-network tracker needs to be engaged:
S41, inputting f and f_hog into the classifier obtained in S32 to obtain an output; among the continuous values output by the classifier in S32, 0.5 is selected as the threshold; when the output is greater than 0.5, the result of the fusion model is credible and there is no need to switch to the convolutional-neural-network tracker; when the output is less than 0.5, the result of the fusion model is rejected and the method switches to the convolutional-neural-network tracker;
S42, when the target response score predicted by the convolutional-neural-network tracker for the current frame is high, reusing S14 and S23 to update the posterior pixel-histogram template and the histogram-of-oriented-gradients template respectively; then entering the tracking task of the next frame, until all video frames are completed.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910258571.8A CN109993775B (en) | 2019-04-01 | 2019-04-01 | Single target tracking method based on characteristic compensation |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109993775A true CN109993775A (en) | 2019-07-09 |
CN109993775B CN109993775B (en) | 2023-03-21 |
Family
ID=67132176
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910258571.8A Active CN109993775B (en) | 2019-04-01 | 2019-04-01 | Single target tracking method based on characteristic compensation |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109993775B (en) |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110490148A (en) * | 2019-08-22 | 2019-11-22 | 四川自由健信息科技有限公司 | A kind of recognition methods for behavior of fighting |
CN110647836A (en) * | 2019-09-18 | 2020-01-03 | 中国科学院光电技术研究所 | Robust single-target tracking method based on deep learning |
CN110675423A (en) * | 2019-08-29 | 2020-01-10 | 电子科技大学 | Unmanned aerial vehicle tracking method based on twin neural network and attention model |
CN110738149A (en) * | 2019-09-29 | 2020-01-31 | 深圳市优必选科技股份有限公司 | Target tracking method, terminal and storage medium |
CN111046796A (en) * | 2019-12-12 | 2020-04-21 | 哈尔滨拓博科技有限公司 | Low-cost space gesture control method and system based on double-camera depth information |
CN111260686A (en) * | 2020-01-09 | 2020-06-09 | 滨州学院 | Target tracking method and system for anti-shielding multi-feature fusion of self-adaptive cosine window |
CN112991395A (en) * | 2021-04-28 | 2021-06-18 | 山东工商学院 | Vision tracking method based on foreground condition probability optimization scale and angle |
CN115063449A (en) * | 2022-07-06 | 2022-09-16 | 西北工业大学 | Hyperspectral video-oriented three-channel video output method for target tracking |
Citations (19)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPH0795385A (en) * | 1993-09-21 | 1995-04-07 | Dainippon Printing Co Ltd | Method and device for clipping picture |
EP0951182A1 (en) * | 1998-04-14 | 1999-10-20 | THOMSON multimedia S.A. | Method for detecting static areas in a sequence of video pictures |
EP1126414A2 (en) * | 2000-02-08 | 2001-08-22 | The University Of Washington | Video object tracking using a hierarchy of deformable templates |
WO2010001364A2 (en) * | 2008-07-04 | 2010-01-07 | Aselsan Elektronik Sanayi Ve Ticaret Anonim Sirketi | Complex wavelet tracker |
DE102009038364A1 (en) * | 2009-08-23 | 2011-02-24 | Friedrich-Alexander-Universität Erlangen-Nürnberg | Method and system for automatic object recognition and subsequent object tracking according to the object shape |
CN102750708A (en) * | 2012-05-11 | 2012-10-24 | 天津大学 | Affine motion target tracing algorithm based on fast robust feature matching |
US20130083192A1 (en) * | 2011-09-30 | 2013-04-04 | Siemens Industry, Inc. | Methods and System for Stabilizing Live Video in the Presence of Long-Term Image Drift |
CN103426178A (en) * | 2012-05-17 | 2013-12-04 | 深圳中兴力维技术有限公司 | Target tracking method and system based on mean shift in complex scene |
CN103793926A (en) * | 2014-02-27 | 2014-05-14 | 西安电子科技大学 | Target tracking method based on sample reselecting |
CN104299247A (en) * | 2014-10-15 | 2015-01-21 | 云南大学 | Video object tracking method based on self-adaptive measurement matrix |
CN104361611A (en) * | 2014-11-18 | 2015-02-18 | 南京信息工程大学 | Group sparsity robust PCA-based moving object detecting method |
US20150146022A1 (en) * | 2013-11-25 | 2015-05-28 | Canon Kabushiki Kaisha | Rapid shake detection using a cascade of quad-tree motion detectors |
WO2017088050A1 (en) * | 2015-11-26 | 2017-06-01 | Sportlogiq Inc. | Systems and methods for object tracking and localization in videos with adaptive image representation |
WO2017132830A1 (en) * | 2016-02-02 | 2017-08-10 | Xiaogang Wang | Methods and systems for cnn network adaption and object online tracking |
WO2017143589A1 (en) * | 2016-02-26 | 2017-08-31 | SZ DJI Technology Co., Ltd. | Systems and methods for visual target tracking |
CN108346159A (en) * | 2018-01-28 | 2018-07-31 | 北京工业大学 | A kind of visual target tracking method based on tracking-study-detection |
CN108447078A (en) * | 2018-02-28 | 2018-08-24 | 长沙师范学院 | The interference of view-based access control model conspicuousness perceives track algorithm |
US20180372499A1 (en) * | 2017-06-25 | 2018-12-27 | Invensense, Inc. | Method and apparatus for characterizing platform motion |
CN109360223A (en) * | 2018-09-14 | 2019-02-19 | 天津大学 | A kind of method for tracking target of quick spatial regularization |
Non-Patent Citations (4)
Title |
---|
Dai Fengzhi et al.: "A survey of research progress in deep-learning-based video tracking", Computer Engineering and Applications *
Li Jie et al.: "Template matching tracking algorithm based on particle swarm optimization", Journal of Computer Applications *
Wu Xing et al.: "Research on robust feature recognition and precise path tracking for vision-guided AGVs", Transactions of the Chinese Society for Agricultural Machinery *
Lu Weijian et al.: "A robust moving-target tracking method based on multiple templates", Transducer and Microsystem Technologies *
Also Published As
Publication number | Publication date |
---|---|
CN109993775B (en) | 2023-03-21 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109993775A (en) | Monotrack method based on feature compensation | |
CN111797716A (en) | Single target tracking method based on Siamese network | |
CN104902267B (en) | No-reference image quality evaluation method based on gradient information | |
CN106228528B (en) | A kind of multi-focus image fusing method based on decision diagram and rarefaction representation | |
CN106709936A (en) | Single target tracking method based on convolution neural network | |
CN108182388A (en) | A kind of motion target tracking method based on image | |
CN110443763B (en) | Convolutional neural network-based image shadow removing method | |
CN108573222A (en) | The pedestrian image occlusion detection method for generating network is fought based on cycle | |
CN108198201A (en) | A kind of multi-object tracking method, terminal device and storage medium | |
CN102034247B (en) | Motion capture method for binocular vision image based on background modeling | |
CN105357519B (en) | Quality objective evaluation method for three-dimensional image without reference based on self-similarity characteristic | |
CN104992403B (en) | Mixed operation operator image redirection method based on visual similarity measurement | |
CN108460790A (en) | A kind of visual tracking method based on consistency fallout predictor model | |
CN110322445A (en) | A kind of semantic segmentation method based on maximization prediction and impairment correlations function between label | |
CN109886356A (en) | A kind of target tracking method based on three branch's neural networks | |
Wang et al. | Background extraction based on joint gaussian conditional random fields | |
CN109711267A (en) | A kind of pedestrian identifies again, pedestrian movement's orbit generation method and device | |
CN106791822A (en) | It is a kind of based on single binocular feature learning without refer to stereo image quality evaluation method | |
CN104902268A (en) | Non-reference three-dimensional image objective quality evaluation method based on local ternary pattern | |
CN109840905A (en) | Power equipment rusty stain detection method and system | |
CN112818849A (en) | Crowd density detection algorithm based on context attention convolutional neural network of counterstudy | |
CN110866473B (en) | Target object tracking detection method and device, storage medium and electronic device | |
Luo et al. | Bi-GANs-ST for perceptual image super-resolution | |
Liu et al. | Spatio-temporal interactive laws feature correlation method to video quality assessment | |
Da et al. | Perceptual quality assessment of nighttime video |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||