CN109191491B - Target tracking method and system of full convolution twin network based on multi-layer feature fusion - Google Patents

Target tracking method and system of full convolution twin network based on multi-layer feature fusion

Info

Publication number
CN109191491B
Authority
CN
China
Prior art keywords
image
target
frame
score
map
Prior art date
Legal status
Expired - Fee Related
Application number
CN201810878152.XA
Other languages
Chinese (zh)
Other versions
CN109191491A (en
Inventor
邹腊梅
陈婷
李鹏
张松伟
李长峰
熊紫华
李晓光
杨卫东
Current Assignee
Huazhong University of Science and Technology
Original Assignee
Huazhong University of Science and Technology
Priority date: 2018-08-03
Filing date: 2018-08-03
Publication date: 2020-09-08
Application filed by Huazhong University of Science and Technology
Priority to CN201810878152.XA
Publication of CN109191491A
Application granted
Publication of CN109191491B

Classifications

    • G06T7/223: Analysis of motion using block-matching
    • G06N3/045: Combinations of networks
    • G06T2207/10016: Video; Image sequence
    • G06T2207/20081: Training; Learning
    • G06T2207/20084: Artificial neural networks [ANN]


Abstract

The invention discloses a target tracking method and a target tracking system based on a convolution twin network with multi-layer feature fusion. The method comprises the following steps: according to the target position and size in each image, cropping the target template images and search area images of all images in an image sequence training set, the image pairs formed by the target template images and the search area images constituting a training data set; constructing a convolution twin network based on multi-layer feature fusion; training the network on the training data set to obtain a trained convolution twin network based on multi-layer feature fusion; and performing target tracking with the trained network. In the process of tracking the target, the score maps of different layers are fused, so that high-layer semantic features and low-layer detail features are combined to better distinguish interference from similar targets, preventing target drift and target loss during tracking.

Description

Target tracking method and system of full convolution twin network based on multi-layer feature fusion
Technical Field
The invention belongs to the intersection of digital image processing, deep learning and pattern recognition, and particularly relates to a target tracking method and a target tracking system based on a convolution twin network with multi-layer feature fusion.
Background
Target tracking occupies a very important position in computer vision. However, because of the complexity of natural scenes, the sensitivity of targets to illumination changes, the real-time and robustness requirements of tracking, and factors such as occlusion, posture change and scale change, tracking remains a difficult problem. Traditional target tracking methods cannot extract rich features from the target, so the target cannot be strictly distinguished from the background, tracking drift occurs easily, and the target cannot be tracked for a long time. With the rise of deep learning, a general convolutional neural network can effectively extract rich target features, but it has too many parameters: if online tracking is required, the real-time requirement cannot be met, which limits its practical engineering value.
Owing to improvements in hardware performance and the popularization of high-performance computing devices such as GPUs (graphics processing units), real-time tracking is no longer an insurmountable problem, and an effective target appearance model becomes of great importance in the tracking process. The essence of target tracking is similarity measurement. Because of its special structure, the twin convolution network has a natural advantage in similarity measurement, and its convolutional structure can extract rich features for target tracking. A network based purely on twin convolution adopts offline training and online tracking; although the real-time requirement can be met on high-performance computing equipment, the full convolution twin network uses only the semantic information extracted by the high layers of the convolutional network during tracking and cannot distinguish well, in complex scenes, a background similar to the target, which causes tracking drift and target loss.
Disclosure of Invention
In view of the above defects of the prior art, the invention aims to solve the technical problems of tracking drift and target loss caused by interference from similar backgrounds in the prior art.
In order to achieve the above object, in a first aspect, an embodiment of the present invention provides a target tracking method based on a multilayer feature fusion convolutional twin network, where the method includes the following steps:
(1) according to the target position and size of the image, cutting out target template images and search area images of all images in an image sequence training set, wherein an image pair formed by the target template images and the search area images forms a training data set;
(2) constructing a convolution twin network based on multilayer feature fusion, wherein the convolution twin network based on multilayer feature fusion comprises two identical branch convolution networks: a first branch convolution network used for obtaining the feature maps of the search area image and a second branch convolution network used for obtaining the feature maps of the target template image; the two branch networks are joined at the feature maps of designated layers, and cross-correlation is performed between the feature maps of the target template image and the corresponding-layer feature maps of the search area image to obtain the corresponding score maps;
(3) training the multilayer feature fusion-based convolution twin network based on the training data set to obtain a well-trained multilayer feature fusion-based convolution twin network;
(4) calculating a score map of each image in the image sequence to be detected by using the trained convolution twin network based on multi-layer feature fusion, and tracking the target based on the score map.
Specifically, the step (1) comprises the following steps. The target template image is cropped as follows: with the target rectangular frame centered on the target area and the center of the target area taken as the target position, p pixels are added on each of the four sides of the target rectangular frame; if the enlarged frame exceeds the image boundary, the exceeding part is filled with the image mean pixel; finally the cropped target image block is scaled to 127 × 127. The search area image is cropped as follows: with the target area as the center, 2p pixels are added on each of the four sides of the target rectangular frame; if the enlarged frame exceeds the image boundary, the exceeding part is filled with the image mean pixel; finally the cropped search area image block is scaled to 255 × 255. Here p = (w + h)/4, where w is the width of the target rectangular frame in pixels and h is the height of the target rectangular frame in pixels.
Specifically, the step (2) includes: the search area image is input into the first branch convolution network; Conv1 produces the first-layer feature map SFM1, Pool1 and Conv2 then produce the second-layer feature map SFM2, and Pool2, Conv3, Conv4 and Conv5 finally produce the third-layer feature map SFM3. The target template image is input into the second branch convolution network; Conv1 produces the first-layer feature map GFM1, Pool1 and Conv2 then produce the second-layer feature map GFM2, and Pool2, Conv3, Conv4 and Conv5 finally produce the third-layer feature map GFM3. Cross-correlation between the corresponding layers of the target template feature maps and the search area image feature maps gives three score maps SM1, SM2 and SM3:
SMi = GFMi * SFMi
where i = 1, 2, 3 and * denotes the cross-correlation operation.
Specifically, the joint loss function L(y, v) constructed in step (3) is calculated as follows:
L(y, v) = α1 L1(y, v1) + α2 L2(y, v2) + α3 L3(y, v3)
Li(y, vi) = (1/|Di|) Σu∈Di l(y[u], vi[u])
l(y[u], vi[u]) = log(1 + exp(-y[u] · vi[u]))
y[u] = +1 if ki · ||u - ci|| ≤ Ri, and y[u] = -1 otherwise
where Li is the loss function of score map SMi, l(y[u], vi[u]) is the logarithmic loss of each point in score map SMi, αi is the weight of score map SMi with 0 < α1 < α2 < α3 ≤ 1, Di is the set of points of score map SMi, u is a point in the score map, y[u] is the true label of point u, ci is the center point of score map SMi, Ri is the radius for score map SMi, ki is the stride of score map SMi, vi[u] is the value of score map SMi at u, || · || denotes the Euclidean distance, and i = 1, 2, 3.
Specifically, the step (4) includes:
1) cutting out the target template image of the 1st frame image according to the target position and size in the 1st frame image of the image sequence to be detected, inputting it into the second branch convolution network of the trained multilayer feature fusion convolution twin network to obtain the feature map M1 of the target template image, and setting t = 2;
2) cutting out the search area image of the t-th frame image according to the target position and size in the (t-1)-th frame image of the image sequence to be detected, and inputting it into the first branch convolution network of the trained multilayer feature fusion convolution twin network to obtain the search area image feature maps of the t-th frame image;
3) performing cross-correlation between the target template feature maps of the (t-1)-th frame and the corresponding layers of the search area image feature maps of the t-th frame to obtain three score maps of the target in the search area image of the t-th frame, and then fusing the score maps by linear weighting to obtain the final score map of the t-th frame;
4) calculating the target position in the t-th frame image from the final score map of the t-th frame;
5) cutting out the target template image of the t-th frame image according to the target position and size in the t-th frame image, inputting it into the second branch convolution network of the trained multilayer feature fusion convolution twin network, recording the obtained feature map of the target template image as Mt, and updating the target template feature map used for the t-th frame as M̃t = (1 - η) · M̃t-1 + η · Mt (with M̃1 = M1), where η is a smoothing factor;
6) setting t = t + 1 and repeating steps 2)-5) until t = N, at which point the target tracking of the image sequence to be detected ends, where N is the total number of frames of the image sequence to be detected.
In order to achieve the above object, in a second aspect, an embodiment of the present invention provides a target tracking system based on a multilayer feature fusion convolution twin network, where the system includes:
the cutting module is used for cutting out target template images and search area images of all images in the image sequence training set according to the target positions and sizes in the images, wherein the image pairs formed by the target template images and the search area images form a training data set;
the convolution twin network module based on multi-layer feature fusion, which comprises two identical branch convolution networks: a first branch convolution network used for obtaining the feature maps of the search area image and a second branch convolution network used for obtaining the feature maps of the target template image, wherein the two branch networks are joined at the feature maps of designated layers, and cross-correlation is performed between the feature maps of the target template image and the corresponding-layer feature maps of the search area image to obtain the corresponding score maps;
the training module is used for training the multilayer feature fusion-based convolution twin network based on the training data set to obtain a trained multilayer feature fusion-based convolution twin network;
and the target tracking module is used for calculating a score map of an image in the image sequence to be detected by using the trained convolution twin network based on the multilayer feature fusion and tracking the target based on the score map.
Specifically, the target template image is cropped as follows: with the target rectangular frame centered on the target area and the center of the target area taken as the target position, p pixels are added on each of the four sides of the target rectangular frame; if the enlarged frame exceeds the image boundary, the exceeding part is filled with the image mean pixel; finally the cropped target image block is scaled to 127 × 127. The search area image is cropped as follows: with the target area as the center, 2p pixels are added on each of the four sides of the target rectangular frame; if the enlarged frame exceeds the image boundary, the exceeding part is filled with the image mean pixel; finally the cropped search area image block is scaled to 255 × 255. Here p = (w + h)/4, where w is the width of the target rectangular frame in pixels and h is the height of the target rectangular frame in pixels.
Specifically, in the convolution twin network based on multi-layer feature fusion: the search area image is input into the first branch convolution network; Conv1 produces the first-layer feature map SFM1, Pool1 and Conv2 then produce the second-layer feature map SFM2, and Pool2, Conv3, Conv4 and Conv5 finally produce the third-layer feature map SFM3. The target template image is input into the second branch convolution network; Conv1 produces the first-layer feature map GFM1, Pool1 and Conv2 then produce the second-layer feature map GFM2, and Pool2, Conv3, Conv4 and Conv5 finally produce the third-layer feature map GFM3. Cross-correlation between the corresponding layers of the target template feature maps and the search area image feature maps gives three score maps SM1, SM2 and SM3:
SMi = GFMi * SFMi
where i = 1, 2, 3 and * denotes the cross-correlation operation.
Specifically, the joint loss function L(y, v) constructed in the training module is calculated as follows:
L(y, v) = α1 L1(y, v1) + α2 L2(y, v2) + α3 L3(y, v3)
Li(y, vi) = (1/|Di|) Σu∈Di l(y[u], vi[u])
l(y[u], vi[u]) = log(1 + exp(-y[u] · vi[u]))
y[u] = +1 if ki · ||u - ci|| ≤ Ri, and y[u] = -1 otherwise
where Li is the loss function of score map SMi, l(y[u], vi[u]) is the logarithmic loss of each point in score map SMi, αi is the weight of score map SMi with 0 < α1 < α2 < α3 ≤ 1, Di is the set of points of score map SMi, u is a point in the score map, y[u] is the true label of point u, ci is the center point of score map SMi, Ri is the radius for score map SMi, ki is the stride of score map SMi, vi[u] is the value of score map SMi at u, || · || denotes the Euclidean distance, and i = 1, 2, 3.
Specifically, the target tracking module performs target tracking through the following steps:
1) cutting out the target template image of the 1st frame image according to the target position and size in the 1st frame image of the image sequence to be detected, inputting it into the second branch convolution network of the trained multilayer feature fusion convolution twin network to obtain the feature map M1 of the target template image, and setting t = 2;
2) cutting out the search area image of the t-th frame image according to the target position and size in the (t-1)-th frame image of the image sequence to be detected, and inputting it into the first branch convolution network of the trained multilayer feature fusion convolution twin network to obtain the search area image feature maps of the t-th frame image;
3) performing cross-correlation between the target template feature maps of the (t-1)-th frame and the corresponding layers of the search area image feature maps of the t-th frame to obtain three score maps of the target in the search area image of the t-th frame, and then fusing the score maps by linear weighting to obtain the final score map of the t-th frame;
4) calculating the target position in the t-th frame image from the final score map of the t-th frame;
5) cutting out the target template image of the t-th frame image according to the target position and size in the t-th frame image, inputting it into the second branch convolution network of the trained multilayer feature fusion convolution twin network, recording the obtained feature map of the target template image as Mt, and updating the target template feature map used for the t-th frame as M̃t = (1 - η) · M̃t-1 + η · Mt (with M̃1 = M1), where η is a smoothing factor;
6) setting t = t + 1 and repeating steps 2)-5) until t = N, at which point the target tracking of the image sequence to be detected ends, where N is the total number of frames of the image sequence to be detected.
Generally, compared with the prior art, the above technical solution conceived by the present invention has the following beneficial effects:
(1) In the process of tracking the target, the score maps of different layers are fused; by combining high-layer semantic features and low-layer detail features, interference from similar targets can be better distinguished, which prevents target drift and target loss during tracking.
(2) The invention performs supervised training with the fused score maps obtained by cross-correlating the multi-layer feature maps and designs a new joint loss function; the joint loss function accounts for the different contributions of the score maps of different layers by assigning them different weights, which prevents gradient vanishing and accelerates convergence.
Drawings
FIG. 1 is a flowchart of a target tracking method based on a multilayer feature fusion convolution twin network according to an embodiment of the present invention;
FIG. 2 is an exemplary diagram of a target template image and a search area image provided by an embodiment of the invention;
FIG. 3 is a schematic diagram of a convolutional twin network structure based on multi-layer feature fusion according to an embodiment of the present invention;
FIGS. 4(a), 4(b) and 4(c) are the 36th, 102nd and 136th frame images, respectively, of a first video sequence tracked with the method of the present invention according to an embodiment of the present invention;
FIGS. 5(a), 5(b) and 5(c) are the 14th, 24th and 470th frame images, respectively, of a second video sequence tracked with the method of the present invention according to an embodiment of the present invention;
FIGS. 6(a), 6(b) and 6(c) are the 39th, 61st and 85th frame images, respectively, of a third video sequence tracked with the method of the present invention according to an embodiment of the present invention;
FIGS. 7(a), 7(b) and 7(c) are the 23rd, 239th and 257th frame images, respectively, of a fourth video sequence tracked with the method of the present invention according to an embodiment of the present invention;
FIGS. 8(a), 8(b) and 8(c) are the 14th, 52nd and 98th frame images, respectively, of a fifth video sequence tracked with the method of the present invention according to an embodiment of the present invention;
FIGS. 9(a), 9(b) and 9(c) are the 23rd, 37th and 63rd frame images, respectively, of a sixth video sequence tracked with the method of the present invention according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
Fig. 1 is a flowchart of a target tracking method based on a multilayer feature fusion convolution twin network according to an embodiment of the present invention. As shown in fig. 1, the method comprises the steps of:
(1) according to the target position and size of the image, target template images and search area images of all images in the image sequence training set are cut out, and an image pair formed by the target template images and the search area images forms a training data set.
The image sequence training set consists of image pairs, each formed by an image and a label map that marks the target position and size in the corresponding image. The target template image and search area image centered on the target area are cropped from the image using the label map. The training data set of this example contains 40,000 pairs of training images.
The target template image is cropped as follows. The target rectangular frame is centered on the target area, and the center of the target area represents the target position. Adding p pixels on each of the four sides of the target rectangular frame gives a target template image block of size (w + 2p) × (h + 2p), where p = (w + h)/4, w is the width of the target rectangular frame in pixels, and h is the height of the target rectangular frame in pixels. If the enlarged frame exceeds the image boundary, the exceeding part is filled with the image mean pixel. Finally, the cropped target image block is scaled to 127 × 127.
The search area image is cropped as follows. With the target area as the center, 2p pixels are added on each of the four sides of the target rectangular frame, giving a search area image block of size (w + 4p) × (h + 4p), where p = (w + h)/4. If the enlarged frame exceeds the image boundary, the exceeding part is filled with the image mean pixel. Finally, the cropped search area image block is scaled to 255 × 255.
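As a minimal illustration of this cropping rule (a sketch only, not the patent's implementation; the helper name crop_patch, the use of OpenCV/NumPy and the assumption of a colour H × W × 3 image are mine):

```python
import cv2
import numpy as np

def crop_patch(image, cx, cy, w, h, pad_factor, out_size):
    """Crop a patch centred on (cx, cy), enlarged by pad_factor*p on each side,
    pad with the image mean where the patch leaves the frame, and rescale."""
    p = (w + h) / 4.0
    pad = pad_factor * p                              # p for the template, 2p for the search region
    half_w, half_h = (w + 2 * pad) / 2.0, (h + 2 * pad) / 2.0
    x1, y1 = int(round(cx - half_w)), int(round(cy - half_h))
    x2, y2 = int(round(cx + half_w)), int(round(cy + half_h))

    mean_pix = image.mean(axis=(0, 1))                # per-channel mean used as filler
    H, W = image.shape[:2]
    patch = np.empty((y2 - y1, x2 - x1, image.shape[2]), dtype=image.dtype)
    patch[:] = mean_pix
    sx1, sy1 = max(x1, 0), max(y1, 0)                 # overlap with the actual image
    sx2, sy2 = min(x2, W), min(y2, H)
    patch[sy1 - y1:sy2 - y1, sx1 - x1:sx2 - x1] = image[sy1:sy2, sx1:sx2]
    return cv2.resize(patch, (out_size, out_size))

# template: +p on each side, rescaled to 127 x 127; search region: +2p, rescaled to 255 x 255
# template = crop_patch(frame, cx, cy, w, h, pad_factor=1, out_size=127)
# search   = crop_patch(frame, cx, cy, w, h, pad_factor=2, out_size=255)
```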
Fig. 2 is an exemplary diagram of target template images and search area images according to an embodiment of the present invention. As shown in fig. 2, the first row shows target template images and the second row shows the corresponding search area images.
(2) A convolution twin network based on multi-layer feature fusion is constructed.
Fig. 3 is a schematic diagram of the structure of the convolution twin network based on multi-layer feature fusion according to an embodiment of the present invention. As shown in fig. 3, the convolution twin network based on multi-layer feature fusion includes two identical branch convolution networks: the first branch convolution network is used to obtain the feature maps of the search area image, and the second branch convolution network is used to obtain the feature maps of the target template image.
The two branch networks have the same structure and parameters; each branch comprises, connected in sequence, a first convolutional layer Conv1, a first pooling layer Pool1, a second convolutional layer Conv2, a second pooling layer Pool2, a third convolutional layer Conv3, a fourth convolutional layer Conv4 and a fifth convolutional layer Conv5. The specific parameters are as follows: Conv1 has an 11 × 11 kernel, stride 2 and 48 channels; Pool1 has a 3 × 3 kernel, stride 2 and 48 channels; Conv2 has a 5 × 5 kernel, stride 1 and 128 channels; Pool2 has a 3 × 3 kernel, stride 2 and 128 channels; Conv3, Conv4 and Conv5 all have 3 × 3 kernels and stride 1, with 192 channels for Conv3 and Conv4 and 128 channels for Conv5.
The search area image is input into the first branch convolution network: Conv1 produces the first-layer feature map SFM1 of size 123 × 123 × 48; Pool1 and Conv2 then produce the second-layer feature map SFM2 of size 57 × 57 × 128; finally Pool2, Conv3, Conv4 and Conv5 produce the third-layer feature map SFM3 of size 22 × 22 × 128.
The target template image is input into the second branch convolution network: Conv1 produces the first-layer feature map GFM1 of size 59 × 59 × 48; Pool1 and Conv2 then produce the second-layer feature map GFM2 of size 25 × 25 × 128; finally Pool2, Conv3, Conv4 and Conv5 produce the third-layer feature map GFM3 of size 6 × 6 × 128.
The two branch networks are joined at the feature maps of the designated layers, and cross-correlation is performed between the feature maps of the target template image and the corresponding-layer feature maps of the search area image to obtain the corresponding score maps.
Cross-correlation between the corresponding layers of the target template feature maps and the search area image feature maps gives three score maps SM1, SM2 and SM3, of sizes 65 × 65, 33 × 33 and 17 × 17 respectively, according to the formula SMi = GFMi * SFMi, where i = 1, 2, 3 and * denotes the cross-correlation operation.
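With the layer hyper-parameters given above, one branch of the twin network and the per-layer cross-correlation SMi = GFMi * SFMi can be sketched as follows. PyTorch is an assumption (the patent names no framework), and the activations between layers are likewise assumed, since the text does not specify them:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Branch(nn.Module):
    """One branch of the twin network; both branches share structure and weights."""
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 48, 11, stride=2)
        self.pool1 = nn.MaxPool2d(3, stride=2)
        self.conv2 = nn.Conv2d(48, 128, 5, stride=1)
        self.pool2 = nn.MaxPool2d(3, stride=2)   # stride 2 reproduces the 22x22 / 6x6 sizes
        self.conv3 = nn.Conv2d(128, 192, 3, stride=1)
        self.conv4 = nn.Conv2d(192, 192, 3, stride=1)
        self.conv5 = nn.Conv2d(192, 128, 3, stride=1)

    def forward(self, x):
        f1 = F.relu(self.conv1(x))                   # SFM1 / GFM1
        f2 = F.relu(self.conv2(self.pool1(f1)))      # SFM2 / GFM2
        f3 = self.conv5(F.relu(self.conv4(F.relu(self.conv3(self.pool2(f2))))))  # SFM3 / GFM3
        return f1, f2, f3

def xcorr(gfm, sfm):
    """SMi = GFMi * SFMi: correlate the search feature map with the template
    feature map used as the convolution kernel (single-sample case)."""
    return F.conv2d(sfm, gfm)

branch = Branch()
sfm = branch(torch.randn(1, 3, 255, 255))             # 123x123x48, 57x57x128, 22x22x128
gfm = branch(torch.randn(1, 3, 127, 127))             # 59x59x48, 25x25x128, 6x6x128
score_maps = [xcorr(g, s) for g, s in zip(gfm, sfm)]   # 65x65, 33x33, 17x17
```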
(3) The convolution twin network based on multi-layer feature fusion is trained on the training data set to obtain the trained convolution twin network based on multi-layer feature fusion.
A joint loss function is then constructed. Each point u ∈ D in a score map has a true label y[u] ∈ {+1, -1}. Since the target lies at the center of the score map, the center of the score map is taken as the center of a circle, and an element of the score map is regarded as a positive sample if it lies within radius R of that center (taking the stride k of the network into account), and as a negative sample otherwise:
y[u] = +1 if k · ||u - c|| ≤ R, and y[u] = -1 otherwise
where c is the center point of the score map and || · || denotes the Euclidean distance.
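A short sketch of how this ±1 label map can be generated for one score map; the radius value in the usage comment is only an example, since the text does not fix R here:

```python
import numpy as np

def make_label_map(size, stride, radius):
    """y[u] = +1 where stride * ||u - c|| <= radius, else -1 (c = map centre)."""
    c = (size - 1) / 2.0
    ys, xs = np.mgrid[0:size, 0:size]
    dist = np.sqrt((xs - c) ** 2 + (ys - c) ** 2)
    return np.where(stride * dist <= radius, 1.0, -1.0)

# e.g. the 17x17 score map SM3 with network stride k3 = 8 and an assumed radius of 16 pixels
# y3 = make_label_map(17, stride=8, radius=16)
```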
The loss function used in training is based on a logarithmic loss function; the overall loss of a single score map is the average of the losses of all its points. The joint loss function L(y, v) constructed by the invention is:
L(y, v) = α1 L1(y, v1) + α2 L2(y, v2) + α3 L3(y, v3)
Li(y, vi) = (1/|Di|) Σu∈Di l(y[u], vi[u])
l(y[u], vi[u]) = log(1 + exp(-y[u] · vi[u]))
y[u] = +1 if ki · ||u - ci|| ≤ Ri, and y[u] = -1 otherwise
where Li is the loss function of score map SMi, l(y[u], vi[u]) is the logarithmic loss of each point in score map SMi, αi is the weight of score map SMi with 0 < α1 < α2 < α3 ≤ 1, Di is the set of points of score map SMi, u is a point in the score map, ci is the center point of score map SMi, Ri is the radius for score map SMi, ki is the stride of score map SMi, vi[u] is the value of score map SMi at u, || · || denotes the Euclidean distance, and i = 1, 2, 3.
Specifically, α1, α2 and α3 are taken as 0.3, 0.6 and 1 respectively, and for score map SM1, score map SM2 and score map SM3 the corresponding values of the stride k are 2, 4 and 8 respectively.
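Under these definitions the per-map loss is the mean logarithmic loss over all points and the joint loss is the α-weighted sum of the three per-map losses; a minimal sketch (PyTorch assumed), with labels yi taking values in {+1, -1} and vi the raw score maps:

```python
import torch

def map_loss(y, v):
    """Mean logarithmic loss over one score map; labels y take values +1 / -1."""
    return torch.log1p(torch.exp(-y * v)).mean()

def joint_loss(labels, scores, alphas=(0.3, 0.6, 1.0)):
    """L(y, v) = alpha1*L1 + alpha2*L2 + alpha3*L3 over the three score maps."""
    return sum(a * map_loss(y, v) for a, y, v in zip(alphas, labels, scores))

# example with the three score-map sizes used in this embodiment
scores = [torch.randn(1, 1, s, s) for s in (65, 33, 17)]
labels = [torch.randint(0, 2, (1, 1, s, s)).float() * 2 - 1 for s in (65, 33, 17)]
loss = joint_loss(labels, scores)
```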
With minimization of the joint loss function as the objective function, the network parameters W of the multi-layer feature fusion convolution twin network are learned by the back-propagation algorithm.
This embodiment trains for 40 epochs, with 5000 iterations per epoch and 8 pairs of training images per iteration. During training, as the network parameters converge, the learning rate of the stochastic gradient descent method is decreased in sequence from 10⁻² to 10⁻⁵; that is, after every 10 epochs the learning rate decreases by a factor of 10.
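This schedule (40 epochs of 5000 iterations with 8 image pairs each, and a tenfold learning-rate drop every 10 epochs from 10⁻² to 10⁻⁵) maps onto a standard optimizer/scheduler pairing; a sketch with a stand-in parameter, since the network and data pipeline are defined elsewhere:

```python
import torch

params = [torch.nn.Parameter(torch.zeros(1))]   # stand-in for the twin network's parameters
optimizer = torch.optim.SGD(params, lr=1e-2)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)

for epoch in range(40):
    for _ in range(5000):
        # sample 8 training image pairs, run the forward pass, compute the joint loss,
        # then: optimizer.zero_grad(); loss.backward(); optimizer.step()
        pass
    scheduler.step()   # learning rate: 1e-2 -> 1e-3 -> 1e-4 -> 1e-5 over the 40 epochs
```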
(4) The score maps of the images in the image sequence to be detected are calculated using the trained convolution twin network based on multi-layer feature fusion, and the target is tracked based on the score maps.
1) According to the target position and size in the 1st frame image of the image sequence to be detected, the target template image of the 1st frame image is cropped and input into the second branch convolution network of the trained multilayer feature fusion convolution twin network to obtain the feature map M1 of the target template image; t is set to 2.
The target position and target size in the initial frame image of the image sequence to be detected are known, and the target template image of the 1st frame image is cropped according to the target position and size in the 1st frame image.
2) According to the target position and size in the (t-1)-th frame image of the image sequence to be detected, the search area image of the t-th frame image is cropped and input into the first branch convolution network of the trained multilayer feature fusion convolution twin network to obtain the search area image feature maps of the t-th frame image.
For example, the search area image of the 2nd frame image is cropped according to the known target position and size in the 1st frame image.
3) Cross-correlation is performed between the target template feature maps of the (t-1)-th frame and the corresponding-layer search area image feature maps of the t-th frame to obtain three score maps of the target in the search area image of the t-th frame; the score maps are then fused by linear weighting to obtain the final score map of the t-th frame.
SM3, of size 17 × 17, is upsampled by bicubic interpolation to a score map of size 65 × 65, denoted SM3↑, and SM2, of size 33 × 33, is upsampled by bicubic interpolation to a score map of size 65 × 65, denoted SM2↑. The final score map SM123 is then calculated as
SM123 = w1 · SM1 + w2 · SM2↑ + w3 · SM3↑
where SM2↑ and SM3↑ are the score maps obtained after upsampling SM2 and SM3. In this embodiment w1 = 2¹, w2 = 2² and w3 = 2³ are used.
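A sketch of this fusion step, reading the weights as w1 = 2¹, w2 = 2², w3 = 2³ and omitting any normalisation by the weight sum (which would not change the location of the maximum):

```python
import torch
import torch.nn.functional as F

def fuse_score_maps(sm1, sm2, sm3, weights=(2.0, 4.0, 8.0)):
    """Bicubic-upsample SM2 and SM3 to 65x65 and fuse the three maps by linear weighting."""
    up = lambda sm: F.interpolate(sm, size=(65, 65), mode="bicubic", align_corners=False)
    w1, w2, w3 = weights
    return w1 * sm1 + w2 * up(sm2) + w3 * up(sm3)

sm1 = torch.randn(1, 1, 65, 65)
sm2 = torch.randn(1, 1, 33, 33)
sm3 = torch.randn(1, 1, 17, 17)
sm123 = fuse_score_maps(sm1, sm2, sm3)   # final 65x65 score map of the t-th frame
```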
4) The target position of the target in the t-th frame image is calculated from the final score map of the t-th frame.
After the three score maps have been superposed according to their weights to give the final score map SM123, SM123 is upsampled by bicubic interpolation to 255 × 255, and the position of the maximum-score point in this map is recorded as the position pt.
In order to make the tracking process more continuous, the position pt is linearly interpolated with the previous position estimate to determine the target position p̂t of the target in the t-th frame image:
p̂t = (1 - γ) · p̂t-1 + γ · pt
where γ is a smoothing factor. In this example γ is taken to be 0.35.
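A sketch of this step: the fused map is upsampled to 255 × 255, the peak is located, and the new estimate is blended with the previous one. Mapping the peak to a displacement from the search-region centre and the (1 - γ)·old + γ·new form of the interpolation are assumptions, since the original equations are reproduced here only from their surrounding description:

```python
import torch
import torch.nn.functional as F

def locate_peak(sm123):
    """Upsample the fused 65x65 score map to 255x255 and return the peak's offset
    (dx, dy) from the centre of the search region, in search-region pixels."""
    up = F.interpolate(sm123, size=(255, 255), mode="bicubic", align_corners=False)
    idx = int(torch.argmax(up[0, 0]))
    row, col = divmod(idx, 255)
    return col - 127, row - 127

def smooth_position(prev_pos, peak_pos, gamma=0.35):
    """Linear interpolation of the position estimate with smoothing factor gamma."""
    return tuple((1.0 - gamma) * p + gamma * q for p, q in zip(prev_pos, peak_pos))
```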
5) According to the target position and size in the t-th frame image, the target template image of the t-th frame image is cropped and input into the second branch convolution network of the trained multilayer feature fusion convolution twin network, and the obtained feature map of the target template image is recorded as Mt. The target template feature map used for the t-th frame is then updated as
M̃t = (1 - η) · M̃t-1 + η · Mt
where M̃t-1 is the smoothed template feature map of the previous frame (M̃1 = M1) and η is a smoothing factor. In this example η is taken to be 0.01.
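The template update in step 5) has the same exponential-smoothing form; a sketch, again assuming the (1 - η)·old + η·new convention with η = 0.01:

```python
def update_template(smoothed_gfm, new_gfm, eta=0.01):
    """Blend the running template feature maps (one per layer) with the newly
    extracted ones; smoothed_gfm starts as the first-frame feature maps M1."""
    return [(1.0 - eta) * old + eta * new for old, new in zip(smoothed_gfm, new_gfm)]

# per frame: smoothed_gfm = update_template(smoothed_gfm, branch(template_patch_t))
```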
6) t is set to t + 1, and steps 2)-5) are repeated until t = N, at which point the target tracking of the image sequence to be detected ends, where N is the total number of frames of the image sequence to be detected.
FIG. 4(a), FIG. 4(b) and FIG. 4(c) are the 36th, 102nd and 136th frame images, respectively, of a first video sequence tracked with the method of the present invention according to an embodiment of the present invention. They show that the target tracking method provided by the invention can effectively track a target undergoing rapid motion, posture change, occlusion and similar-background interference.
FIG. 5(a), FIG. 5(b) and FIG. 5(c) are the 14th, 24th and 470th frame images, respectively, of a second video sequence tracked with the method of the present invention according to an embodiment of the present invention. They show that the target tracking method provided by the invention can effectively track a target undergoing posture change, occlusion and similar-background interference.
FIG. 6(a), FIG. 6(b) and FIG. 6(c) are the 39th, 61st and 85th frame images, respectively, of a third video sequence tracked with the method of the present invention according to an embodiment of the present invention. They show that the target tracking method provided by the invention can effectively track a target undergoing posture change, occlusion and motion blur.
FIG. 7(a), FIG. 7(b) and FIG. 7(c) are the 23rd, 239th and 257th frame images, respectively, of a fourth video sequence tracked with the method of the present invention according to an embodiment of the present invention. They show that the target tracking method provided by the invention can effectively track a target undergoing illumination change and occlusion.
FIG. 8(a), FIG. 8(b) and FIG. 8(c) are the 14th, 52nd and 98th frame images, respectively, of a fifth video sequence tracked with the method of the present invention according to an embodiment of the present invention. They show that the target tracking method provided by the invention can effectively track a target undergoing posture change and similar-background interference.
FIG. 9(a), FIG. 9(b) and FIG. 9(c) are the 23rd, 37th and 63rd frame images, respectively, of a sixth video sequence tracked with the method of the present invention according to an embodiment of the present invention. They show that the target tracking method provided by the invention can effectively track a target undergoing illumination change.
The above description is only for the preferred embodiment of the present application, but the scope of the present application is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present application should be covered within the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (8)

1. A target tracking method based on a convolution twin network with multilayer feature fusion, characterized by comprising the following steps:
(1) according to the target position and size of the image, cutting out target template images and search area images of all images in an image sequence training set, wherein an image pair formed by the target template images and the search area images forms a training data set;
(2) constructing a convolution twin network based on multilayer feature fusion, wherein the convolution twin network based on multilayer feature fusion comprises two identical branch convolution networks: a first branch convolution network used for obtaining the feature maps of the search area image and a second branch convolution network used for obtaining the feature maps of the target template image; the two branch networks are joined at the feature maps of designated layers, and cross-correlation is performed between the feature maps of the target template image and the corresponding-layer feature maps of the search area image to obtain the corresponding score maps;
(3) training the multilayer feature fusion-based convolution twin network based on the training data set to obtain a well-trained multilayer feature fusion-based convolution twin network;
(4) calculating a score map of an image in an image sequence to be detected by using a trained convolution twin network based on multi-layer feature fusion, and tracking a target based on the score map;
the step (2) comprises the following steps:
inputting the search area image into the first branch convolution network, obtaining a first-layer feature map SFM1 through Conv1, then a second-layer feature map SFM2 through the Pool1 and Conv2 layers, and finally a third-layer feature map SFM3 through Pool2, Conv3, Conv4 and Conv5;
inputting the target template image into the second branch convolution network, obtaining a first-layer feature map GFM1 through Conv1, then a second-layer feature map GFM2 through Pool1 and Conv2, and finally a third-layer feature map GFM3 through Pool2, Conv3, Conv4 and Conv5;
performing cross-correlation between the corresponding layers of the target template feature maps and the search area image feature maps to obtain three corresponding score maps SM1, SM2 and SM3 according to the formula
SMi = GFMi * SFMi
wherein i = 1, 2, 3 and * denotes the cross-correlation operation.
2. The target tracking method of claim 1, wherein step (1) comprises:
the target template image is cropped as follows: with the target rectangular frame centered on the target area and the center of the target area taken as the target position, p pixels are added on each of the four sides of the target rectangular frame; if the enlarged frame exceeds the image boundary, the exceeding part is filled with the image mean pixel; finally the cropped target image block is scaled to 127 × 127;
the search area image is cropped as follows: with the target area as the center, 2p pixels are added on each of the four sides of the target rectangular frame; if the enlarged frame exceeds the image boundary, the exceeding part is filled with the image mean pixel; finally the cropped search area image block is scaled to 255 × 255;
wherein p = (w + h)/4, w is the width of the target rectangular frame in pixels, and h is the height of the target rectangular frame in pixels.
3. The target tracking method of claim 1, wherein the joint loss function L (y, v) constructed in step (3) is calculated as follows:
L(y, v) = α1 L1(y, v1) + α2 L2(y, v2) + α3 L3(y, v3)
Li(y, vi) = (1/|Di|) Σu∈Di l(y[u], vi[u])
l(y[u], vi[u]) = log(1 + exp(-y[u] · vi[u]))
y[u] = +1 if ki · ||u - ci|| ≤ Ri, and y[u] = -1 otherwise
wherein Li is the loss function of score map SMi, y[u] is the true label of point u in the score map, l(y[u], vi[u]) is the logarithmic loss of each point in score map SMi, αi is the weight of score map SMi with 0 < α1 < α2 < α3 ≤ 1, Di is the set of points of score map SMi, u is a point in the score map, ci is the center point of score map SMi, Ri is the radius for score map SMi, ki is the stride of score map SMi, vi[u] is the value of score map SMi at u, || · || denotes the Euclidean distance, and i = 1, 2, 3.
4. The target tracking method of claim 1, wherein step (4) comprises:
1) cutting out the target template image of the 1st frame image according to the target position and size in the 1st frame image of the image sequence to be detected, inputting it into the second branch convolution network of the trained multilayer feature fusion convolution twin network to obtain the feature map M1 of the target template image, and setting t = 2;
2) cutting out the search area image of the t-th frame image according to the target position and size in the (t-1)-th frame image of the image sequence to be detected, and inputting it into the first branch convolution network of the trained multilayer feature fusion convolution twin network to obtain the search area image feature maps of the t-th frame image;
3) performing cross-correlation between the target template feature maps of the (t-1)-th frame and the corresponding layers of the search area image feature maps of the t-th frame to obtain three score maps of the target in the search area image of the t-th frame, and then fusing the score maps by linear weighting to obtain the final score map of the t-th frame;
4) calculating the target position in the t-th frame image from the final score map of the t-th frame;
5) cutting out the target template image of the t-th frame image according to the target position and size in the t-th frame image, inputting it into the second branch convolution network of the trained multilayer feature fusion convolution twin network, recording the obtained feature map of the target template image as Mt, and updating the target template feature map used for the t-th frame as M̃t = (1 - η) · M̃t-1 + η · Mt (with M̃1 = M1), where η is a smoothing factor;
6) setting t = t + 1 and repeating steps 2)-5) until t = N, at which point the target tracking of the image sequence to be detected ends, where N is the total number of frames of the image sequence to be detected.
5. A target tracking system based on a convolution twin network with multi-layer feature fusion, characterized by comprising:
the cutting module is used for cutting out target template images and search area images of all images in the image sequence training set according to the target positions and sizes in the images, wherein the image pairs formed by the target template images and the search area images form a training data set;
the convolution twin network module based on multi-layer feature fusion, which comprises two identical branch convolution networks: a first branch convolution network used for obtaining the feature maps of the search area image and a second branch convolution network used for obtaining the feature maps of the target template image, wherein the two branch networks are joined at the feature maps of designated layers, and cross-correlation is performed between the feature maps of the target template image and the corresponding-layer feature maps of the search area image to obtain the corresponding score maps;
the training module is used for training the multilayer feature fusion-based convolution twin network based on the training data set to obtain a trained multilayer feature fusion-based convolution twin network;
the target tracking module is used for calculating a score map of an image in an image sequence to be detected by using a trained convolution twin network based on multi-layer feature fusion, and tracking a target based on the score map;
the multilayer feature fusion based convolution twin network comprises:
inputting the search area image into the first branch convolution network, obtaining a first-layer feature map SFM1 through Conv1, then a second-layer feature map SFM2 through the Pool1 and Conv2 layers, and finally a third-layer feature map SFM3 through Pool2, Conv3, Conv4 and Conv5;
inputting the target template image into the second branch convolution network, obtaining a first-layer feature map GFM1 through Conv1, then a second-layer feature map GFM2 through Pool1 and Conv2, and finally a third-layer feature map GFM3 through Pool2, Conv3, Conv4 and Conv5;
performing cross-correlation between the corresponding layers of the target template feature maps and the search area image feature maps to obtain three corresponding score maps SM1, SM2 and SM3 according to the formula
SMi = GFMi * SFMi
wherein i = 1, 2, 3 and * denotes the cross-correlation operation.
6. The object tracking system of claim 5,
the target template image is cropped as follows: with the target rectangular frame centered on the target area and the center of the target area taken as the target position, p pixels are added on each of the four sides of the target rectangular frame; if the enlarged frame exceeds the image boundary, the exceeding part is filled with the image mean pixel; finally the cropped target image block is scaled to 127 × 127;
the search area image is cropped as follows: with the target area as the center, 2p pixels are added on each of the four sides of the target rectangular frame; if the enlarged frame exceeds the image boundary, the exceeding part is filled with the image mean pixel; finally the cropped search area image block is scaled to 255 × 255;
wherein p = (w + h)/4, w is the width of the target rectangular frame in pixels, and h is the height of the target rectangular frame in pixels.
7. The target tracking system of claim 5, wherein the joint loss function L (y, v) constructed in the training module is calculated as follows:
L(y, v) = α1 L1(y, v1) + α2 L2(y, v2) + α3 L3(y, v3)
Li(y, vi) = (1/|Di|) Σu∈Di l(y[u], vi[u])
l(y[u], vi[u]) = log(1 + exp(-y[u] · vi[u]))
y[u] = +1 if ki · ||u - ci|| ≤ Ri, and y[u] = -1 otherwise
wherein Li is the loss function of score map SMi, y[u] is the true label of point u in the score map, l(y[u], vi[u]) is the logarithmic loss of each point in score map SMi, αi is the weight of score map SMi with 0 < α1 < α2 < α3 ≤ 1, Di is the set of points of score map SMi, u is a point in the score map, ci is the center point of score map SMi, Ri is the radius for score map SMi, ki is the stride of score map SMi, vi[u] is the value of score map SMi at u, || · || denotes the Euclidean distance, and i = 1, 2, 3.
8. The target tracking system of claim 5, wherein the target tracking module performs target tracking by:
1) cutting out the target template image of the 1st frame image according to the target position and size in the 1st frame image of the image sequence to be detected, inputting it into the second branch convolution network of the trained multilayer feature fusion convolution twin network to obtain the feature map M1 of the target template image, and setting t = 2;
2) cutting out the search area image of the t-th frame image according to the target position and size in the (t-1)-th frame image of the image sequence to be detected, and inputting it into the first branch convolution network of the trained multilayer feature fusion convolution twin network to obtain the search area image feature maps of the t-th frame image;
3) performing cross-correlation between the target template feature maps of the (t-1)-th frame and the corresponding layers of the search area image feature maps of the t-th frame to obtain three score maps of the target in the search area image of the t-th frame, and then fusing the score maps by linear weighting to obtain the final score map of the t-th frame;
4) calculating the target position in the t-th frame image from the final score map of the t-th frame;
5) cutting out the target template image of the t-th frame image according to the target position and size in the t-th frame image, inputting it into the second branch convolution network of the trained multilayer feature fusion convolution twin network, recording the obtained feature map of the target template image as Mt, and updating the target template feature map used for the t-th frame as M̃t = (1 - η) · M̃t-1 + η · Mt (with M̃1 = M1), where η is a smoothing factor;
6) setting t = t + 1 and repeating steps 2)-5) until t = N, at which point the target tracking of the image sequence to be detected ends, where N is the total number of frames of the image sequence to be detected.
CN201810878152.XA 2018-08-03 2018-08-03 Target tracking method and system of full convolution twin network based on multi-layer feature fusion Expired - Fee Related CN109191491B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810878152.XA CN109191491B (en) 2018-08-03 2018-08-03 Target tracking method and system of full convolution twin network based on multi-layer feature fusion


Publications (2)

Publication Number Publication Date
CN109191491A CN109191491A (en) 2019-01-11
CN109191491B true CN109191491B (en) 2020-09-08

Family

ID=64920067

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810878152.XA Expired - Fee Related CN109191491B (en) 2018-08-03 2018-08-03 Target tracking method and system of full convolution twin network based on multi-layer feature fusion

Country Status (1)

Country Link
CN (1) CN109191491B (en)

Families Citing this family (62)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110070562A (en) * 2019-04-02 2019-07-30 西北工业大学 A kind of context-sensitive depth targets tracking
CN110021052B (en) * 2019-04-11 2023-05-30 北京百度网讯科技有限公司 Method and apparatus for generating fundus image generation model
CN110246155B (en) * 2019-05-17 2021-05-18 华中科技大学 Anti-occlusion target tracking method and system based on model alternation
CN110210551B (en) * 2019-05-28 2021-07-30 北京工业大学 Visual target tracking method based on adaptive subject sensitivity
CN110222641B (en) * 2019-06-06 2022-04-19 北京百度网讯科技有限公司 Method and apparatus for recognizing image
CN110378938A (en) * 2019-06-24 2019-10-25 杭州电子科技大学 A kind of monotrack method based on residual error Recurrent networks
CN110443827B (en) * 2019-07-22 2022-12-20 浙江大学 Unmanned aerial vehicle video single-target long-term tracking method based on improved twin network
CN110570458B (en) * 2019-08-12 2022-02-01 武汉大学 Target tracking method based on internal cutting and multi-layer characteristic information fusion
CN110473231B (en) * 2019-08-20 2024-02-06 南京航空航天大学 Target tracking method of twin full convolution network with prejudging type learning updating strategy
CN110516745B (en) * 2019-08-28 2022-05-24 北京达佳互联信息技术有限公司 Training method and device of image recognition model and electronic equipment
CN110480128A (en) * 2019-08-28 2019-11-22 华南理工大学 A kind of real-time welding seam tracking method of six degree of freedom welding robot line laser
CN110675423A (en) * 2019-08-29 2020-01-10 电子科技大学 Unmanned aerial vehicle tracking method based on twin neural network and attention model
CN110580713A (en) * 2019-08-30 2019-12-17 武汉大学 Satellite video target tracking method based on full convolution twin network and track prediction
CN112446900B (en) * 2019-09-03 2024-05-17 中国科学院长春光学精密机械与物理研究所 Twin neural network target tracking method and system
CN110807793B (en) * 2019-09-29 2022-04-22 南京大学 Target tracking method based on twin network
CN110728697B (en) * 2019-09-30 2023-06-13 华中光电技术研究所(中国船舶重工集团有限公司第七一七研究所) Infrared dim target detection tracking method based on convolutional neural network
CN110782480B (en) * 2019-10-15 2023-08-04 哈尔滨工程大学 Infrared pedestrian tracking method based on online template prediction
CN110796679B (en) * 2019-10-30 2023-04-07 电子科技大学 Target tracking method for aerial image
CN111105031B (en) * 2019-11-11 2023-10-17 北京地平线机器人技术研发有限公司 Network structure searching method and device, storage medium and electronic equipment
CN111161311A (en) * 2019-12-09 2020-05-15 中车工业研究院有限公司 Visual multi-target tracking method and device based on deep learning
CN111179307A (en) * 2019-12-16 2020-05-19 浙江工业大学 Visual target tracking method for full-volume integral and regression twin network structure
CN112837344B (en) * 2019-12-18 2024-03-29 沈阳理工大学 Target tracking method for generating twin network based on condition countermeasure
CN110992404B (en) * 2019-12-23 2023-09-19 驭势科技(浙江)有限公司 Target tracking method, device and system and storage medium
CN111161317A (en) * 2019-12-30 2020-05-15 北京工业大学 Single-target tracking method based on multiple networks
CN111091582A (en) * 2019-12-31 2020-05-01 北京理工大学重庆创新中心 Single-vision target tracking algorithm and system based on deep neural network
CN111260688A (en) * 2020-01-13 2020-06-09 深圳大学 Twin double-path target tracking method
CN111260682B (en) * 2020-02-10 2023-11-17 深圳市铂岩科技有限公司 Target object tracking method and device, storage medium and electronic equipment
CN111462175B (en) * 2020-03-11 2023-02-10 华南理工大学 Space-time convolution twin matching network target tracking method, device, medium and equipment
CN111340850A (en) * 2020-03-20 2020-06-26 军事科学院***工程研究院***总体研究所 Ground target tracking method of unmanned aerial vehicle based on twin network and central logic loss
CN111415373A (en) * 2020-03-20 2020-07-14 北京以萨技术股份有限公司 Target tracking and segmenting method, system and medium based on twin convolutional network
CN111415318B (en) * 2020-03-20 2023-06-13 山东大学 Unsupervised related filtering target tracking method and system based on jigsaw task
CN111489361B (en) * 2020-03-30 2023-10-27 中南大学 Real-time visual target tracking method based on deep feature aggregation of twin network
CN113538507B (en) * 2020-04-15 2023-11-17 南京大学 Single-target tracking method based on full convolution network online training
CN111583345B (en) * 2020-05-09 2022-09-27 吉林大学 Method, device and equipment for acquiring camera parameters and storage medium
CN111639551B (en) * 2020-05-12 2022-04-01 华中科技大学 Online multi-target tracking method and system based on twin network and long-short term clues
CN111582214B (en) * 2020-05-15 2023-05-12 中国科学院自动化研究所 Method, system and device for analyzing behavior of cage animal based on twin network
CN111753667B (en) * 2020-05-27 2024-05-14 江苏大学 Intelligent automobile single-target tracking method based on twin network
CN113805240B (en) * 2020-05-28 2023-06-27 同方威视技术股份有限公司 Vehicle inspection method and system
CN111797716B (en) * 2020-06-16 2022-05-03 电子科技大学 Single target tracking method based on Siamese network
CN111754546A (en) * 2020-06-18 2020-10-09 重庆邮电大学 Target tracking method, system and storage medium based on multi-feature map fusion
CN111950493B (en) * 2020-08-20 2024-03-08 华北电力大学 Image recognition method, device, terminal equipment and readable storage medium
CN112184752A (en) * 2020-09-08 2021-01-05 北京工业大学 Video target tracking method based on pyramid convolution
CN112734726B (en) * 2020-09-29 2024-02-02 首都医科大学附属北京天坛医院 Angiography typing method, angiography typing device and angiography typing equipment
CN112183675B (en) * 2020-11-10 2023-09-26 武汉工程大学 Tracking method for low-resolution target based on twin network
CN112418203B (en) * 2020-11-11 2022-08-30 南京邮电大学 Robustness RGB-T tracking method based on bilinear convergence four-stream network
CN112330718B (en) * 2020-11-12 2022-08-23 重庆邮电大学 CNN-based three-level information fusion visual target tracking method
CN112381788B (en) * 2020-11-13 2022-11-22 北京工商大学 Part surface defect increment detection method based on double-branch matching network
CN112330719B (en) * 2020-12-02 2024-02-27 东北大学 Deep learning target tracking method based on feature map segmentation and self-adaptive fusion
CN112836606A (en) * 2021-01-25 2021-05-25 合肥工业大学 Aerial photography target tracking method fusing target significance and online learning interference factor
CN112785626A (en) * 2021-01-27 2021-05-11 安徽大学 Twin network small target tracking method based on multi-scale feature fusion
CN112884037B (en) * 2021-02-09 2022-10-21 中国科学院光电技术研究所 Target tracking method based on template updating and anchor-frame-free mode
CN113192124B (en) * 2021-03-15 2024-07-02 大连海事大学 Image target positioning method based on twin network
CN113052874B (en) * 2021-03-18 2022-01-25 上海商汤智能科技有限公司 Target tracking method and device, electronic equipment and storage medium
CN113379788B (en) * 2021-06-29 2024-03-29 西安理工大学 Target tracking stability method based on triplet network
CN113808166B (en) * 2021-09-15 2023-04-18 西安电子科技大学 Single-target tracking method based on clustering difference and depth twin convolutional neural network
CN113870312B (en) * 2021-09-30 2023-09-22 四川大学 Single target tracking method based on twin network
CN114155274B (en) * 2021-11-09 2024-05-24 中国海洋大学 Target tracking method and device based on global scalable twin network
WO2023112581A1 (en) * 2021-12-14 2023-06-22 富士フイルム株式会社 Inference device
CN114429491B (en) * 2022-04-07 2022-07-08 之江实验室 Pulse neural network target tracking method and system based on event camera
CN114820709B (en) * 2022-05-05 2024-03-08 郑州大学 Single-target tracking method, device, equipment and medium based on improved UNet network
CN115393406B (en) * 2022-08-17 2024-05-10 中船智控科技(武汉)有限公司 Image registration method based on twin convolution network
CN115330876B (en) * 2022-09-15 2023-04-07 中国人民解放军国防科技大学 Target template graph matching and positioning method based on twin network and central position estimation

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10204299B2 (en) * 2015-11-04 2019-02-12 Nec Corporation Unsupervised matching in fine-grained datasets for single-view object reconstruction
CN107818575A (en) * 2017-10-27 2018-03-20 深圳市唯特视科技有限公司 A kind of visual object tracking based on layering convolution
CN107992826A (en) * 2017-12-01 2018-05-04 广州优亿信息科技有限公司 A kind of people stream detecting method based on the twin network of depth
CN108320297A (en) * 2018-03-09 2018-07-24 湖北工业大学 A kind of video object method for real time tracking and system
CN108257158A (en) * 2018-03-27 2018-07-06 福州大学 A kind of target prediction and tracking based on Recognition with Recurrent Neural Network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Hierarchical Convolutional Features for Visual Tracking; Chao Ma et al.; 2015 IEEE International Conference on Computer Vision; 2015-12-31; pp. 3074-3082 *

Also Published As

Publication number Publication date
CN109191491A (en) 2019-01-11

Similar Documents

Publication Publication Date Title
CN109191491B (en) Target tracking method and system of full convolution twin network based on multi-layer feature fusion
CN111354017B (en) Target tracking method based on twin neural network and parallel attention module
CN107274433B (en) Target tracking method and device based on deep learning and storage medium
CN106127684B (en) Image super-resolution Enhancement Method based on forward-backward recutrnce convolutional neural networks
WO2023273136A1 (en) Target object representation point estimation-based visual tracking method
CN111311666B (en) Monocular vision odometer method integrating edge features and deep learning
CN108038420B (en) Human behavior recognition method based on depth video
WO2020108362A1 (en) Body posture detection method, apparatus and device, and storage medium
CN111161317A (en) Single-target tracking method based on multiple networks
CN112184752A (en) Video target tracking method based on pyramid convolution
CN109360156A (en) Single image rain removing method based on the image block for generating confrontation network
CN107730536B (en) High-speed correlation filtering object tracking method based on depth features
CN112288627B (en) Recognition-oriented low-resolution face image super-resolution method
CN112232134B (en) Human body posture estimation method based on hourglass network and attention mechanism
CN113706581B (en) Target tracking method based on residual channel attention and multi-level classification regression
CN110705344A (en) Crowd counting model based on deep learning and implementation method thereof
CN110942476A (en) Improved three-dimensional point cloud registration method and system based on two-dimensional image guidance and readable storage medium
CN111476133B (en) Unmanned driving-oriented foreground and background codec network target extraction method
CN111815665A (en) Single image crowd counting method based on depth information and scale perception information
CN115147456B (en) Target tracking method based on time sequence self-adaptive convolution and attention mechanism
CN111882581A (en) Multi-target tracking method for depth feature association
CN108629301A (en) A kind of human motion recognition method based on moving boundaries dense sampling and movement gradient histogram
CN113592900A (en) Target tracking method and system based on attention mechanism and global reasoning
CN116128763A (en) Aircraft skin damage image enhancement method based on deep neural network fusion
CN109492524B (en) Intra-structure relevance network for visual tracking

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee
Granted publication date: 20200908
Termination date: 20210803