CN108734109B - Visual target tracking method and system for image sequence - Google Patents

Visual target tracking method and system for image sequence

Info

Publication number
CN108734109B
CN108734109B (application CN201810373435.9A)
Authority
CN
China
Prior art keywords
target
convolution
size
regression model
network
Prior art date
Legal status
Active
Application number
CN201810373435.9A
Other languages
Chinese (zh)
Other versions
CN108734109A (en)
Inventor
刘李漫
刘佳
Current Assignee
Hangzhou Tuke Intelligent Information Technology Co.,Ltd.
Original Assignee
South Central University for Nationalities
Priority date
Filing date
Publication date
Application filed by South Central University for Nationalities filed Critical South Central University for Nationalities
Priority to CN201810373435.9A
Publication of CN108734109A
Application granted
Publication of CN108734109B


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00: Scenes; Scene-specific elements
    • G06V20/40: Scenes; Scene-specific elements in video content
    • G06V20/41: Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06V20/42: Higher-level, semantic clustering, classification or understanding of video scenes of sport video content
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a visual target tracking method and system oriented to an image sequence. The method comprises the following steps: training a convolution regression model for target tracking using a given initialization image and the rectangular frame of the target to be tracked; predicting the position of the target with the trained convolution regression model; further predicting the size of the target on the basis of the position prediction result; and updating the convolution regression model according to the tracked position and size of the target. The invention involves whole-target regression model training, target-texture regression model training, target position prediction, target size prediction and tracking-model updating, can overcome the interference of various environmental factors in the tracking scene, achieves accurate prediction of the target position and size, and has considerable commercial value and research significance.

Description

Visual target tracking method and system for image sequence
Technical Field
The invention relates to the technical field of computer vision, and in particular to a visual target tracking method and system for an image sequence.
Background
In the field of computer vision, video information generally needs to be recognized and analyzed automatically by intelligent algorithms so as to realize intelligent control of equipment. A target tracking algorithm based on a visual image sequence can make full use of existing single-image target detection algorithms, quickly and reliably track the motion trajectory of a target in a video, and provide technical support for video understanding and analysis.
With the rapid expansion of industrial production, the degree of automation and intelligence in production processes must also keep improving. For example, video surveillance systems need intelligent algorithms to automatically identify and detect abnormal events occurring in video. A visual target tracking algorithm can automatically track each target in the video and obtain its motion trajectory, providing a key technical means for analyzing and understanding abnormal events. However, conventional visual target tracking algorithms have the following defects:
(1) The target size cannot be predicted well. In particular, when the target deforms significantly, a traditional tracking algorithm cannot predict the target size accurately, so the target is lost in subsequent tracking and no reliable low-level information can be provided for video analysis and understanding.
(2) The target cannot be tracked accurately and reliably under the interference of various environmental factors.
In view of this, it is urgent to improve existing visual target tracking algorithms and provide one that overcomes the interference of various environmental factors and accurately predicts the position and size of the target.
Disclosure of Invention
The technical problem the invention aims to solve is that existing visual target tracking algorithms cannot predict the position and size of the target, are easily disturbed by the environment during tracking, and therefore cannot track the target accurately and reliably.
To solve the above problem, the technical solution adopted by the invention is a visual target tracking method for an image sequence, comprising the following steps:
training a convolution regression model for target tracking using a given initialization image and the rectangular frame of the target to be tracked;
predicting the position of the target with the trained convolution regression model;
further predicting the size of the target on the basis of the position prediction result;
and updating the convolution regression model according to the tracked position and size of the target.
In the above scheme, the method for training the convolution regression model comprises the following steps:
step 10, constructing a feature extraction network for expressing the target's morphological characteristics; the network can be implemented with any feature extraction method that expresses target information;
step 11, extracting the features of the current input image with the feature extraction network of step 10;
step 12, constructing a whole-target convolution regression model implemented as a single convolution layer, wherein the convolution kernel size matches the target size in feature space, the layer has one output channel, and its output is used to predict the target position;
step 13, generating the corresponding training label map from the features extracted in step 11, wherein the label map follows a two-dimensional Gaussian function whose peak corresponds to the true target position, and the single convolution layer of step 12 is iteratively optimized with a gradient descent algorithm;
step 14, constructing a target-texture convolution regression model implemented as a single convolution layer, wherein the convolution kernel size matches the target size in feature space, the layer has one output channel, and its output is used to predict the target foreground;
step 15, generating the corresponding training label map from the features extracted in step 11, wherein a rectangular frame marks the target foreground in the label map, and the single convolution layer of step 14 is iteratively optimized with a gradient descent algorithm;
and step 16, the initial training of the convolution regression model is complete.
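The training phase can be made concrete with a short sketch. The following is a minimal, illustrative PyTorch implementation of the single-layer regression models of steps 12 and 14 and their gradient-descent training (steps 13 and 15); the function names, learning rate and iteration count are assumptions and not values taken from the patent, and `feats` is assumed to be a feature tensor of shape (1, C, H, W) from the step-10 network with a label map of the same spatial size.

import torch
import torch.nn as nn

def build_regressor(feat_channels: int, kh: int, kw: int) -> nn.Conv2d:
    """Single-convolution-layer regression model (steps 12 and 14): the
    kernel size (kh, kw) matches the target size in feature space, and
    the single output channel is read directly as the position (or
    foreground) prediction map. Odd kh/kw are assumed so that the
    padding keeps the output the same size as the input."""
    return nn.Conv2d(feat_channels, 1, kernel_size=(kh, kw),
                     padding=(kh // 2, kw // 2), bias=False)

def train_regressor(reg: nn.Conv2d, feats: torch.Tensor,
                    label_map: torch.Tensor,
                    steps: int = 100, lr: float = 1e-6) -> None:
    """Iterative gradient-descent optimization of the single convolution
    layer against a training label map (steps 13 and 15)."""
    opt = torch.optim.SGD(reg.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(steps):
        opt.zero_grad()
        loss = loss_fn(reg(feats), label_map)
        loss.backward()
        opt.step()

A whole-target model and a target-texture model would each be built and trained this way, differing only in the label map they regress to (a Gaussian peak versus a foreground rectangle).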
In the above scheme, the target position prediction method specifically comprises the following steps:
step 20, extracting the features of the current input image with the feature extraction network constructed in step 10, in preparation for subsequent tracking;
step 21, feeding the image features obtained in step 20 into the whole-target convolution regression network obtained in step 12, and computing the target position prediction map H(x_t, y_t) based on the whole-target regression model;
step 22, feeding the image features obtained in step 20 into the target-texture convolution regression network obtained in step 14, and computing the target foreground prediction map T(x_t, y_t) based on the target-texture regression model;
step 23, applying a mean filter, whose template size matches the target size, to the foreground prediction map obtained in step 22, yielding the foreground-based target position prediction map F(x_t, y_t);
step 24, superposing the two position prediction maps obtained in steps 21 and 23 into the final target position prediction map, and predicting the target position as the index of its maximum value:

(x_t, y_t) = argmax_{(x, y)} [ H(x, y) + F(x, y) ]
In the above scheme, the foreground-based target position prediction map F(x_t, y_t) of step 23 is computed as:

F(x_t, y_t) = (1 / (w_{t-1} * h_{t-1})) * Σ_{(i,j) ∈ R(x_t, y_t, w_{t-1}, h_{t-1})} T(i, j)

where w_{t-1} and h_{t-1} denote the target size obtained when tracking the previous frame, R(x_t, y_t, w_{t-1}, h_{t-1}) denotes the rectangular frame centered at coordinates (x_t, y_t) with size (w_{t-1}, h_{t-1}), and T(i, j) is the value of the target foreground prediction map of the target-texture regression model at pixel (i, j) inside that rectangular frame.
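A compact sketch of steps 21-24 follows, under the same assumptions as above (PyTorch; `whole_reg` and `texture_reg` are the two trained regressors, `feats` the current-frame features, and the previous target size is expressed in feature-map coordinates; forcing odd template sizes for same-size mean filtering is an implementation choice, not from the patent):

import torch
import torch.nn.functional as nnf

def predict_position(whole_reg, texture_reg, feats, w_prev: int, h_prev: int):
    """Fuse the whole-target response H and the mean-filtered foreground
    response F, then take the argmax index as the predicted position."""
    H = whole_reg(feats)                       # H(x, y), step 21
    T = texture_reg(feats)                     # T(x, y), step 22
    kh, kw = h_prev | 1, w_prev | 1            # force odd template sizes
    # Step 23: mean filtering with a template of the previous target size
    # computes F(x, y) = mean of T over R(x, y, w_prev, h_prev).
    F = nnf.avg_pool2d(T, kernel_size=(kh, kw), stride=1,
                       padding=(kh // 2, kw // 2))
    response = H + F                           # step 24: superposition
    idx = int(torch.argmax(response))          # flattened argmax index
    y, x = divmod(idx, response.shape[-1])
    return x, y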
In the above scheme, the target size prediction method comprises the following steps:
step 30, extracting the features of the current input image with the feature extraction network constructed in step 10, in preparation for subsequent tracking;
step 31, feeding the image features obtained in step 30 into the target-texture convolution regression network obtained in step 14, and computing the target foreground prediction map T(x_t, y_t) based on the target-texture regression model;
step 32, given the current target position (x_t, y_t) and the target size (w_{t-1}, h_{t-1}) known from the previous frame, computing the posterior probability that the current target size is (w_t, h_t);
step 33, repeating the computation of step 32 for each candidate target size, and selecting the size with the maximum posterior probability as the final target size prediction;
step 34, the target size prediction is finished.
In the above scheme, the posterior probability of step 32 that the target size is (w_t, h_t) is computed as:

P(w_t, h_t | O, x_t, y_t, w_{t-1}, h_{t-1}) = P(O | x_t, y_t, w_t, h_t) * P(w_t, h_t | w_{t-1}, h_{t-1})

where P(O | x_t, y_t, w_t, h_t) is the probability that the target O has position-and-size state (x_t, y_t, w_t, h_t), and P(w_t, h_t | w_{t-1}, h_{t-1}) is the state transition probability of the target size between two adjacent frames:

[Equation: explicit form of the size transition probability P(w_t, h_t | w_{t-1}, h_{t-1})]

P(O | x_t, y_t, w_t, h_t) = A(w_t, h_t) - B(w_t, h_t)

where A(w_t, h_t) is the average target foreground probability inside the candidate rectangular frame (x_t, y_t, w_t, h_t), and B(w_t, h_t) is the average target foreground probability of the background area surrounding that frame.
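As a sketch, the likelihood term A(w_t, h_t) - B(w_t, h_t) can be evaluated directly on the foreground prediction map T. The margin defining the "surrounding background area" is an assumption, since the patent does not fix its extent:

import torch

def size_likelihood(T: torch.Tensor, x: int, y: int, w: int, h: int,
                    margin: int = 4) -> float:
    """P(O | x, y, w, h) = A(w, h) - B(w, h): mean foreground probability
    inside the candidate box minus the mean over a surrounding ring."""
    fg = T[0, 0]                                   # (H, W) foreground map
    Hf, Wf = fg.shape
    x0, y0 = max(x - w // 2, 0), max(y - h // 2, 0)
    x1, y1 = min(x + w // 2 + 1, Wf), min(y + h // 2 + 1, Hf)
    inner = fg[y0:y1, x0:x1]
    X0, Y0 = max(x0 - margin, 0), max(y0 - margin, 0)
    X1, Y1 = min(x1 + margin, Wf), min(y1 + margin, Hf)
    outer = fg[Y0:Y1, X0:X1]
    a = float(inner.mean())                            # A(w, h)
    ring = float(outer.sum() - inner.sum())
    b = ring / max(outer.numel() - inner.numel(), 1)   # B(w, h)
    return a - b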
In the above scheme, updating the convolution regression model comprises the following steps:
step 40, generating the label map for training the whole-target convolution regression model from the predicted target position, and updating the parameters of the single convolution layer of step 12 by gradient descent;
step 41, generating the label map for training the target-texture convolution regression model from the predicted target size, and updating the parameters of the single convolution layer of step 14 by gradient descent;
and step 42, the update of the convolution regression model is complete.
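A sketch of this update step, reusing the `train_regressor` routine above; `make_gaussian_label` and `make_box_label` are the label generators of steps 13 and 15 (sketched in the detailed description below), and the number of update iterations is an assumption:

def update_models(whole_reg, texture_reg, feats, x, y, w, h,
                  steps: int = 2) -> None:
    """Steps 40-41: regenerate both label maps at the newly tracked
    position/size and take a few gradient steps on each conv layer."""
    shape = feats.shape[-2:]                       # feature-map (H, W)
    train_regressor(whole_reg, feats,
                    make_gaussian_label(shape, x, y), steps=steps)
    train_regressor(texture_reg, feats,
                    make_box_label(shape, x, y, w, h), steps=steps)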
The invention also provides a visual target tracking system for an image sequence, comprising:
a training module for training a convolution regression model for target tracking;
a target position prediction module for predicting the position of the target with the trained convolution regression model;
a target size prediction module for further predicting the size of the target on the basis of the position prediction result;
and an update module for updating the convolution regression model according to the tracked position and size of the target.
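The four modules might compose as follows. This is an illustrative skeleton only: the class and method names are assumptions, and the helper functions are the sketches given elsewhere in this description.

class ImageSequenceTracker:
    """Training, position prediction, size prediction and update modules
    wrapped around a shared feature extraction network."""

    def __init__(self, feature_net, whole_reg, texture_reg, init_box):
        self.feature_net = feature_net     # step-10 feature extractor
        self.whole_reg = whole_reg         # trained whole-target model
        self.texture_reg = texture_reg     # trained target-texture model
        self.box = init_box                # (x, y, w, h) from the last frame

    def track(self, frame):
        feats = self.feature_net(frame)
        x, y = predict_position(self.whole_reg, self.texture_reg,
                                feats, self.box[2], self.box[3])
        w, h = predict_size(self.texture_reg(feats), x, y,
                            self.box[2], self.box[3])
        update_models(self.whole_reg, self.texture_reg, feats, x, y, w, h)
        self.box = (x, y, w, h)
        return self.box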
Compared with the prior art, the invention involves whole-target regression model training, target-texture regression model training, target position prediction, target size prediction and tracking-model updating; it can overcome the interference of various environmental factors in the tracking scene, achieves accurate prediction of the target position and size, and has considerable commercial value and research significance.
Drawings
FIG. 1 is a schematic flow chart of the present invention;
FIG. 2 is an input initial frame image of the present invention;
FIG. 3 is a schematic diagram of the training process of the convolution regression model of the present invention;
FIG. 4 is a schematic diagram of the whole-target regression model of the present invention;
FIG. 5 is a schematic diagram of the target-texture regression model of the present invention;
FIG. 6 is a schematic diagram of the target position prediction process of the present invention;
FIG. 7 is a target position prediction map based on the whole-target regression model of the present invention;
FIG. 8 is a target foreground prediction map based on the target-texture regression model of the present invention;
FIG. 9 is a target position prediction map based on the target-texture regression model of the present invention;
FIG. 10 is a flowchart of the target size prediction process of the present invention.
Detailed Description
The invention provides a visual target tracking method for an image sequence, which can overcome the interference of various environmental factors in the tracking scene and accurately predict the position and size of the target, and which has considerable commercial value and research significance. The invention is described in detail below with reference to the drawings and specific embodiments.
As shown in FIG. 1 and FIG. 2, the image-sequence-oriented visual target tracking method of the present invention may specifically include the following steps:
training a convolution regression model for target tracking using a given initialization image and the rectangular frame of the target to be tracked;
predicting the position of the target with the trained convolution regression model;
further predicting the size of the target on the basis of the position prediction result;
and finally, updating the convolution regression model according to the tracked position and size of the target.
Correspondingly, the invention also provides a visual target tracking system for an image sequence, comprising a training module, a target position prediction module, a target size prediction module and an update module.
The training module trains a convolution regression model for target tracking;
the target position prediction module predicts the position of the target with the trained convolution regression model;
the target size prediction module further predicts the size of the target on the basis of the position prediction result;
and the update module updates the convolution regression model according to the tracked position and size of the target.
The invention involves whole-target regression model training, target-texture regression model training, target position prediction, target size prediction and tracking-model updating; it can overcome the interference of various environmental factors in the tracking scene, achieves accurate prediction of the target position and size, and has considerable commercial value and research significance.
As shown in FIG. 3, the method for training the convolution regression model specifically comprises the following steps:
constructing a feature extraction network for expressing the target's morphological characteristics; the network can be implemented with any feature extraction method that expresses target information;
extracting the features of the current input image with the feature extraction network;
constructing a whole-target convolution regression model implemented as a single convolution layer, wherein the convolution kernel size matches the target size in feature space, the layer has one output channel, and its output is used to predict the target position;
generating the corresponding training label map from the extracted features, wherein the label map follows a two-dimensional Gaussian function whose peak corresponds to the true target position, and iteratively optimizing the single convolution layer with a gradient descent algorithm;
constructing a target-texture convolution regression model implemented as a single convolution layer, wherein the convolution kernel size matches the target size in feature space, the layer has one output channel, and its output is used to predict the target foreground;
generating the corresponding training label map from the extracted features, wherein a rectangular frame marks the target foreground in the label map, and iteratively optimizing the single convolution layer with a gradient descent algorithm;
and completing the initial training of the convolution regression model; the two label maps are sketched below.
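The two label maps can be sketched as follows, in feature-space coordinates. The Gaussian bandwidth `sigma` is an assumed value: the patent only states that the label follows a two-dimensional Gaussian peaked at the true position, while the texture label marks the foreground rectangle.

import torch

def make_gaussian_label(shape, cx: int, cy: int, sigma: float = 2.0):
    """Two-dimensional Gaussian label map whose peak sits at the true
    target position (whole-target regression model)."""
    Hf, Wf = shape
    ys = torch.arange(Hf, dtype=torch.float32).view(-1, 1)
    xs = torch.arange(Wf, dtype=torch.float32).view(1, -1)
    g = torch.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))
    return g.view(1, 1, Hf, Wf)

def make_box_label(shape, cx: int, cy: int, w: int, h: int):
    """Binary label map with a rectangular frame marking the target
    foreground (target-texture regression model)."""
    Hf, Wf = shape
    lbl = torch.zeros(1, 1, Hf, Wf)
    y0, y1 = max(cy - h // 2, 0), min(cy + h // 2 + 1, Hf)
    x0, x1 = max(cx - w // 2, 0), min(cx + w // 2 + 1, Wf)
    lbl[..., y0:y1, x0:x1] = 1.0
    return lbl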
As shown in FIG. 4 to FIG. 9, the target position prediction method specifically comprises the following steps:
extracting the features of the current input image with the constructed feature extraction network, in preparation for subsequent tracking;
feeding the obtained image features into the whole-target convolution regression network, and computing the target position prediction map H(x_t, y_t) based on the whole-target regression model;
feeding the obtained image features into the target-texture convolution regression network, and computing the target foreground prediction map T(x_t, y_t) based on the target-texture regression model;
applying a mean filter, whose template size matches the target size, to the foreground prediction map, yielding the foreground-based target position prediction map F(x_t, y_t), computed as:

F(x_t, y_t) = (1 / (w_{t-1} * h_{t-1})) * Σ_{(i,j) ∈ R(x_t, y_t, w_{t-1}, h_{t-1})} T(i, j)

where w_{t-1} and h_{t-1} denote the target size obtained when tracking the previous frame, R(x_t, y_t, w_{t-1}, h_{t-1}) denotes the rectangular frame centered at (x_t, y_t) with size (w_{t-1}, h_{t-1}), and (i, j) ranges over the pixels inside that rectangular frame;
and superposing the two target position prediction maps H(x_t, y_t) and F(x_t, y_t) into the final target position prediction map, predicting the target position as the index of its maximum value:

(x_t, y_t) = argmax_{(x, y)} [ H(x, y) + F(x, y) ]
As shown in FIG. 10, the target size prediction method comprises the following steps:
extracting the features of the current input image with the constructed feature extraction network, in preparation for subsequent tracking;
feeding the obtained image features into the target-texture convolution regression network, and computing the target foreground prediction map T(x_t, y_t) based on the target-texture regression model;
given the current target position (x_t, y_t) and the target size (w_{t-1}, h_{t-1}) known from the previous frame, computing the posterior probability P(w_t, h_t | O, x_t, y_t, w_{t-1}, h_{t-1}) that the current target size is (w_t, h_t), according to:

P(w_t, h_t | O, x_t, y_t, w_{t-1}, h_{t-1}) = P(O | x_t, y_t, w_t, h_t) * P(w_t, h_t | w_{t-1}, h_{t-1})

where P(O | x_t, y_t, w_t, h_t) is the probability that the target O has position-and-size state (x_t, y_t, w_t, h_t), and P(w_t, h_t | w_{t-1}, h_{t-1}) is the state transition probability of the target size between two adjacent frames:

[Equation: explicit form of the size transition probability P(w_t, h_t | w_{t-1}, h_{t-1})]

P(O | x_t, y_t, w_t, h_t) = A(w_t, h_t) - B(w_t, h_t)

where A(w_t, h_t) is the average target foreground probability inside the candidate rectangular frame (x_t, y_t, w_t, h_t), and B(w_t, h_t) is the average target foreground probability of the background area surrounding that frame;
repeating this computation for each candidate target size, and selecting the size with the maximum posterior probability as the final target size prediction (a candidate search in this style is sketched below);
the target size prediction ends.
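The repeated posterior evaluation over candidate sizes can be sketched as below. Because the explicit form of the transition probability is not recoverable from this publication, an isotropic Gaussian around the previous size is assumed here, and the candidate offsets and bandwidth are likewise assumptions:

import math

def predict_size(T, x: int, y: int, prev_w: int, prev_h: int,
                 deltas=(-2, -1, 0, 1, 2), sigma: float = 1.0):
    """Scan candidate sizes around the previous one and keep the size
    maximizing likelihood * transition prior (steps 32-33)."""
    best_post, best_wh = -float("inf"), (prev_w, prev_h)
    for dw in deltas:
        for dh in deltas:
            w, h = prev_w + dw, prev_h + dh
            if w < 1 or h < 1:
                continue
            # Assumed Gaussian transition prior P(w_t, h_t | w_{t-1}, h_{t-1}).
            prior = math.exp(-(dw * dw + dh * dh) / (2 * sigma * sigma))
            # Clamp the A - B likelihood at zero so a weak prior cannot
            # promote a strongly negative candidate (sketch-level choice).
            post = max(size_likelihood(T, x, y, w, h), 0.0) * prior
            if post > best_post:
                best_post, best_wh = post, (w, h)
    return best_wh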
Updating the convolution regression model mainly comprises the following steps:
generating the label map for training the whole-target convolution regression model from the predicted target position, and updating the parameters of the single convolution layer by gradient descent;
generating the label map for training the target-texture convolution regression model from the predicted target size, and updating the parameters of the single convolution layer by gradient descent;
and completing the update of the convolution regression model.
The method takes a continuous video image sequence as input; once the rectangular frame of the target to be tracked is given, continuous tracking of the target is achieved through whole-target regression model training, target-texture regression model training, target position prediction, target size prediction and tracking-model updating. The method can track the target accurately when it rotates or is occluded, solves the difficulty traditional visual tracking algorithms have in predicting the target size, and can predict the size accurately when the target deforms. The method also features high tracking accuracy, fast running speed and insensitivity to background interference, and has very broad application prospects in industrial control, automated production and similar settings. A usage sketch of the complete pipeline follows.
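The sketch below runs the whole pipeline on a video file, assuming OpenCV for frame I/O. The file name, the initial box, the preprocessing and the names `feature_net`, `whole_reg` and `texture_reg` are placeholders referring to the illustrative helpers from the preceding sections, and the coordinates are kept in feature-map scale; a real implementation would map them back to image coordinates.

import cv2
import torch

def to_tensor(frame):
    # HxWxC uint8 BGR frame -> 1xCxHxW float tensor in [0, 1]
    return torch.from_numpy(frame).permute(2, 0, 1).unsqueeze(0).float() / 255.0

cap = cv2.VideoCapture("sequence.avi")
ok, frame0 = cap.read()
init_box = (120, 80, 40, 60)      # given target rectangle (x, y, w, h)

# feature_net, whole_reg and texture_reg are assumed to have been built
# and initially trained on frame0 as in the training sketch above.
tracker = ImageSequenceTracker(lambda f: feature_net(to_tensor(f)),
                               whole_reg, texture_reg, init_box)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    x, y, w, h = tracker.track(frame)
    cv2.rectangle(frame, (x - w // 2, y - h // 2),
                  (x + w // 2, y + h // 2), (0, 255, 0), 2)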
The present invention is not limited to the above preferred embodiments; any structural change made under the teaching of the present invention that is identical or similar to the technical solution of the present invention falls within the scope of protection of the present invention.

Claims (5)

1. A visual target tracking method for an image sequence, characterized by comprising the following steps:
training a convolution regression model for target tracking using a given initialization image and the rectangular frame of the target to be tracked;
predicting the position of the target with the trained convolution regression model;
further predicting the size of the target on the basis of the position prediction result;
updating the convolution regression model according to the tracked position and size of the target;
the method for training the convolution regression model comprising the following steps:
step 10, constructing a feature extraction network for expressing the target's morphological characteristics, the network being implementable with any feature extraction method that expresses target information;
step 11, extracting the features of the current input image with the feature extraction network of step 10;
step 12, constructing a whole-target convolution regression model implemented as a single convolution layer, wherein the convolution kernel size matches the target size in feature space, the layer has one output channel, and its output is used to predict the target position;
step 13, generating the corresponding training label map from the features extracted in step 11, wherein the label map follows a two-dimensional Gaussian function whose peak corresponds to the true target position, and the single convolution layer of step 12 is iteratively optimized with a gradient descent algorithm;
step 14, constructing a target-texture convolution regression model implemented as a single convolution layer, wherein the convolution kernel size matches the target size in feature space, the layer has one output channel, and its output is used to predict the target foreground;
step 15, generating the corresponding training label map from the features extracted in step 11, wherein a rectangular frame marks the target foreground in the label map, and the single convolution layer of step 14 is iteratively optimized with a gradient descent algorithm;
step 16, completing the initial training of the convolution regression model;
the target position prediction method specifically comprising the following steps:
step 20, extracting the features of the current input image with the feature extraction network constructed in step 10, in preparation for subsequent tracking;
step 21, feeding the image features obtained in step 20 into the whole-target convolution regression network obtained in step 12, and computing the target position prediction map H(x_t, y_t) based on the whole-target regression model;
step 22, feeding the image features obtained in step 20 into the target-texture convolution regression network obtained in step 14, and computing the target foreground prediction map T(x_t, y_t) based on the target-texture regression model;
step 23, applying a mean filter, whose template size matches the target size, to the foreground prediction map obtained in step 22, yielding the foreground-based target position prediction map F(x_t, y_t);
step 24, superposing the two position prediction maps obtained in steps 21 and 23 into the final target position prediction map, and predicting the target position as the index of its maximum value:

(x_t, y_t) = argmax_{(x, y)} [ H(x, y) + F(x, y) ]
the target size prediction method comprising the following steps:
step 30, extracting the features of the current input image with the feature extraction network constructed in step 10, in preparation for subsequent tracking;
step 31, feeding the image features obtained in step 30 into the target-texture convolution regression network obtained in step 14, and computing the target foreground prediction map T(x_t, y_t) based on the target-texture regression model;
step 32, given the current target position (x_t, y_t) and the target size (w_{t-1}, h_{t-1}) known from the previous frame, computing the posterior probability that the current target size is (w_t, h_t);
step 33, repeating the computation of step 32 for each candidate target size, and selecting the size with the maximum posterior probability as the final target size prediction;
step 34, the target size prediction is finished.
2. The visual target tracking method for an image sequence according to claim 1, characterized in that the foreground-based target position prediction map F(x_t, y_t) of step 23 is computed as:

F(x_t, y_t) = (1 / (w_{t-1} * h_{t-1})) * Σ_{(i,j) ∈ R(x_t, y_t, w_{t-1}, h_{t-1})} T(i, j)

where w_{t-1} and h_{t-1} denote the target size obtained when tracking the previous frame, R(x_t, y_t, w_{t-1}, h_{t-1}) denotes the rectangular frame centered at (x_t, y_t) with size (w_{t-1}, h_{t-1}), and (i, j) ranges over the pixels inside that rectangular frame.
3. The visual target tracking method for an image sequence according to claim 1, characterized in that the posterior probability of step 32 that the target size is (w_t, h_t) is computed as:

P(w_t, h_t | O, x_t, y_t, w_{t-1}, h_{t-1}) = P(O | x_t, y_t, w_t, h_t) * P(w_t, h_t | w_{t-1}, h_{t-1})

where P(O | x_t, y_t, w_t, h_t) is the probability that the target O has position-and-size state (x_t, y_t, w_t, h_t),

[Equation: explicit form of the size transition probability P(w_t, h_t | w_{t-1}, h_{t-1})]

P(O | x_t, y_t, w_t, h_t) = A(w_t, h_t) - B(w_t, h_t)

where A(w_t, h_t) is the average target foreground probability inside the candidate rectangular frame (x_t, y_t, w_t, h_t), and B(w_t, h_t) is the average target foreground probability of the background area surrounding that frame.
4. The visual target tracking method for an image sequence according to claim 1, characterized in that updating the convolution regression model comprises the following steps:
step 40, generating the label map for training the whole-target convolution regression model from the predicted target position, and updating the parameters of the single convolution layer of step 12 by gradient descent;
step 41, generating the label map for training the target-texture convolution regression model from the predicted target size, and updating the parameters of the single convolution layer of step 14 by gradient descent;
step 42, completing the update of the convolution regression model.
5. A visual target tracking system for an image sequence, characterized by comprising:
a training module for training a convolution regression model for target tracking;
a target position prediction module for predicting the position of the target with the trained convolution regression model;
a target size prediction module for further predicting the size of the target on the basis of the position prediction result;
and an update module for updating the convolution regression model according to the tracked position and size of the target;
the method for training the convolution regression model comprising the following steps:
step 10, constructing a feature extraction network for expressing the target's morphological characteristics, the network being implementable with any feature extraction method that expresses target information;
step 11, extracting the features of the current input image with the feature extraction network of step 10;
step 12, constructing a whole-target convolution regression model implemented as a single convolution layer, wherein the convolution kernel size matches the target size in feature space, the layer has one output channel, and its output is used to predict the target position;
step 13, generating the corresponding training label map from the features extracted in step 11, wherein the label map follows a two-dimensional Gaussian function whose peak corresponds to the true target position, and the single convolution layer of step 12 is iteratively optimized with a gradient descent algorithm;
step 14, constructing a target-texture convolution regression model implemented as a single convolution layer, wherein the convolution kernel size matches the target size in feature space, the layer has one output channel, and its output is used to predict the target foreground;
step 15, generating the corresponding training label map from the features extracted in step 11, wherein a rectangular frame marks the target foreground in the label map, and the single convolution layer of step 14 is iteratively optimized with a gradient descent algorithm;
step 16, completing the initial training of the convolution regression model;
the target position prediction method specifically comprising the following steps:
step 20, extracting the features of the current input image with the feature extraction network constructed in step 10, in preparation for subsequent tracking;
step 21, feeding the image features obtained in step 20 into the whole-target convolution regression network obtained in step 12, and computing the target position prediction map H(x_t, y_t) based on the whole-target regression model;
step 22, feeding the image features obtained in step 20 into the target-texture convolution regression network obtained in step 14, and computing the target foreground prediction map T(x_t, y_t) based on the target-texture regression model;
step 23, applying a mean filter, whose template size matches the target size, to the foreground prediction map obtained in step 22, yielding the foreground-based target position prediction map F(x_t, y_t);
step 24, superposing the two position prediction maps obtained in steps 21 and 23 into the final target position prediction map, and predicting the target position as the index of its maximum value:

(x_t, y_t) = argmax_{(x, y)} [ H(x, y) + F(x, y) ]

the target size prediction method comprising the following steps:
step 30, extracting the features of the current input image with the feature extraction network constructed in step 10, in preparation for subsequent tracking;
step 31, feeding the image features obtained in step 30 into the target-texture convolution regression network obtained in step 14, and computing the target foreground prediction map T(x_t, y_t) based on the target-texture regression model;
step 32, given the current target position (x_t, y_t) and the target size (w_{t-1}, h_{t-1}) known from the previous frame, computing the posterior probability that the current target size is (w_t, h_t);
step 33, repeating the computation of step 32 for each candidate target size, and selecting the size with the maximum posterior probability as the final target size prediction;
step 34, the target size prediction is finished.
Application CN201810373435.9A, filed 2018-04-24, priority date 2018-04-24: Visual target tracking method and system for image sequence. Granted as CN108734109B (Active).

Priority Applications (1)

Application Number: CN201810373435.9A; Priority Date: 2018-04-24; Filing Date: 2018-04-24; Title: Visual target tracking method and system for image sequence

Publications (2)

CN108734109A, published 2018-11-02
CN108734109B, granted 2020-11-17

Family

ID: 63939209

Family Applications (1)

CN201810373435.9A (granted): Visual target tracking method and system for image sequence

Country Status (1)

CN: CN108734109B

Families Citing this family (5)

* Cited by examiner, † Cited by third party

CN109829936B * (priority 2019-01-29, published 2021-12-24), 青岛海信网络科技股份有限公司: Target tracking method and device
CN110110787A (priority 2019-05-06, published 2019-08-09), 腾讯科技(深圳)有限公司: Location acquiring method, device, computer equipment and the storage medium of target
CN112465859A (priority 2019-09-06, published 2021-03-09), 顺丰科技有限公司: Method, device, equipment and storage medium for detecting fast moving object
CN111027586A (priority 2019-11-04, published 2020-04-17), 天津大学: Target tracking method based on novel response map fusion
CN112378397B * (priority 2020-11-02, published 2023-10-10), 中国兵器工业计算机应用技术研究所: Unmanned aerial vehicle target tracking method and device and unmanned aerial vehicle

Citations (7)

* Cited by examiner, † Cited by third party

CN102881012A * (priority 2012-09-04, published 2013-01-16), 上海交通大学: Vision target tracking method aiming at target scale change
CN103106667A * (priority 2013-02-01, published 2013-05-15), 山东科技大学: Motion target tracing method towards shielding and scene change
CN103632382A * (priority 2013-12-19, published 2014-03-12), 中国矿业大学(北京): Compressive sensing-based real-time multi-scale target tracking method
CN105321188A * (priority 2014-08-04, published 2016-02-10), 江南大学: Foreground probability based target tracking method
EP3229206A1 * (priority 2016-04-04, published 2017-10-11), Xerox Corporation: Deep data association for online multi-class multi-object tracking
US20180032817A1 * (priority 2016-07-27, published 2018-02-01), Conduent Business Services, LLC: System and method for detecting potential mugging event via trajectory-based analysis
CN107403175A * (priority 2017-09-21, published 2017-11-28), 昆明理工大学: Visual tracking method and Visual Tracking System under a kind of movement background


Non-Patent Citations (5)

* Cited by examiner, † Cited by third party

Chao Ma et al., "Adaptive Correlation Filters with Long-Term and Short-Term Memory for Object Tracking", International Journal of Computer Vision, 2018-03-16. *
Kai Chen et al., "Convolutional Regression for Visual Tracking", arXiv, 2016-11-15. *
Yibing Song et al., "CREST: Convolutional Residual Learning for Visual Tracking", 2017 IEEE International Conference on Computer Vision, 2017, pp. 2574-2581. *
Qi Fei et al., "A Survey of Mean-Shift-Based Visual Target Tracking Methods" (基于均值漂移的视觉目标跟踪方法综述), Computer Engineering (计算机工程), Vol. 33, No. 21, November 2007. *
Martin Danelljan et al., "Accurate Scale Estimation for Robust Visual Tracking", BMVC 2014, pp. 1-11. *

Also Published As

CN108734109A, published 2018-11-02

Similar Documents

CN108734109B (en) Visual target tracking method and system for image sequence
CN110097568B (en) Video object detection and segmentation method based on space-time dual-branch network
CN107194559B (en) Workflow identification method based on three-dimensional convolutional neural network
CN107369166B (en) Target tracking method and system based on multi-resolution neural network
CN110287826B (en) Video target detection method based on attention mechanism
CN110232330B (en) Pedestrian re-identification method based on video detection
Rout A survey on object detection and tracking algorithms
JP2008538832A (en) Estimating 3D road layout from video sequences by tracking pedestrians
CN111340881B (en) Direct method visual positioning method based on semantic segmentation in dynamic scene
CN111199556A (en) Indoor pedestrian detection and tracking method based on camera
CN111886600A (en) Device and method for instance level segmentation of image
CN106952293A (en) A kind of method for tracking target based on nonparametric on-line talking
Mayr et al. Self-supervised learning of the drivable area for autonomous vehicles
CN106023249A (en) Moving object detection method based on local binary similarity pattern
Doulamis Coupled multi-object tracking and labeling for vehicle trajectory estimation and matching
CN105809718A (en) Object tracking method with minimum trajectory entropy
CN112149612A (en) Marine organism recognition system and recognition method based on deep neural network
CN114943840A (en) Training method of machine learning model, image processing method and electronic equipment
Roy et al. A comprehensive survey on computer vision based approaches for moving object detection
KR101690050B1 (en) Intelligent video security system
CN110688512A (en) Pedestrian image search algorithm based on PTGAN region gap and depth neural network
Zhang et al. An optical flow based moving objects detection algorithm for the UAV
CN116385493A (en) Multi-moving-object detection and track prediction method in field environment
He et al. Building extraction based on U-net and conditional random fields
Li et al. Fast visual tracking using motion saliency in video

Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant
TR01: Transfer of patent right

Effective date of registration: 2023-10-31

Address after: No. 548, 5th Floor, Building 10, No. 28 Linping Avenue, Donghu Street, Linping District, Hangzhou City, Zhejiang Province
Patentee after: Hangzhou Tuke Intelligent Information Technology Co.,Ltd.

Address before: No. 182, Minzu Avenue, Hongshan District, Wuhan City, Hubei Province
Patentee before: South Central University for Nationalities