CN113888595A - Twin network single-target visual tracking method based on difficult sample mining - Google Patents
Twin network single-target visual tracking method based on difficult sample mining
- Publication number
- CN113888595A (application CN202111152770.4A)
- Authority
- CN
- China
- Prior art keywords
- image
- target
- sample
- difficult
- training
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/20—Analysis of motion
- G06T7/246—Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/10—Image acquisition modality
- G06T2207/10016—Video; Image sequence
Abstract
The invention discloses a twin network single-target tracking method based on difficult sample mining, which comprises the steps of constructing a training set, constructing a convolutional twin network based on difficult sample mining, and so on. The method introduces difficult sample mining into target tracking: difficult negative samples are mined during training and used as training data to update the network parameters, and the difficult-sample triplet loss is selected as the loss function and continuously optimized. By optimizing this loss, the model continuously mines difficult negative samples during training, so that the network is fully trained, similar targets are better distinguished, the model learns features with distinguishing capability, and the target tracking effect is improved.
Description
Technical Field
The invention belongs to the technical field of computer vision, relates to an image processing technology, and particularly relates to a twin network single-target tracking method based on difficult sample mining.
Background
Single-target visual tracking is one of the more popular research topics in computer vision. It has wide applications in intelligent video surveillance, robot visual navigation, medical diagnosis, and the localization and tracking of underwater organisms, and it has broad development prospects. Visual target tracking means that, given a video sequence, the target to be tracked is specified in the first frame and its initial position is calibrated, and the position and size of the target are then predicted in subsequent frames so as to track it accurately.
Early classical algorithms all performed their processing in the time domain; these algorithms involve complex computations, and the large amount of operations results in poor real-time tracking. Algorithms based on correlation filtering appeared later; compared with the earlier algorithms, the introduction of correlation filtering converts the target tracking computation into the frequency domain, which greatly reduces the amount of computation and greatly improves speed. With the development of deep learning, researchers have introduced deep learning techniques into target tracking and proposed a series of methods that achieve good results.
In recent years, methods that perform target tracking based on a twin network have received unprecedented attention. Existing methods adopt a convolutional neural network to extract features for target modeling. In the target tracking process, offline training on the tracked target is one of the keys to the performance of the tracking model, and the selection of training data during offline training is particularly important. Existing twin network-based methods only use the target area and directly correlate the features extracted from the target area with the features of a test frame image; they have poor robustness, cannot handle complex scenes such as similar objects, and have insufficient discrimination capability. When existing methods perform target tracking, an object is usually labeled positive when its coordinate distance to the exemplar is smaller than a threshold and negative otherwise, and a logistic loss maximizes the similarity score of positive sample pairs and minimizes the similarity score of negative sample pairs.
Disclosure of Invention
Aiming at the defects in the prior art, the invention provides a twin network single-target tracking method based on difficult sample mining. Difficult sample mining is introduced into the target tracking method: difficult negative samples are mined during training and used as training data to update the network parameters, and the difficult-sample triplet loss is selected as the loss function and continuously optimized. By optimizing this loss, the model continuously mines difficult negative samples during training, so that the network is fully trained, similar targets are better distinguished, and the model learns features with distinguishing capability.
In order to solve the technical problems, the invention adopts the technical scheme that:
a twin network single target tracking method based on difficult sample mining comprises the following steps:
step (1), constructing a training set: cutting out a target template image Z and a search area image X of all images in an image sequence training set according to the target position and the size of the images, dividing the search area image X into a positive example image P and a negative example image N, forming a pair of positive sample pairs by the image Z and the image P, forming a pair of negative sample pairs by the image Z and the image N, and forming a training data set by (Z, P, N) triples formed by the target template image Z, the positive example image P and the negative example image N;
step (2), a convolution twin network based on difficult sample mining is constructed, wherein the network comprises three branches, and the three branches share the weight of the feature extraction network; the three branches are respectively used for obtaining a feature map of a target template image, a feature map of a search area positive sample image and a feature map of a negative sample image, wherein in feature extraction, a difficult sample is defined, and difficult sample mining is introduced to learn features with distinguishing capability;
step (3), performing a cross-correlation operation on the target template image feature map obtained in step (2) and the search area image feature map to obtain a response map, wherein the position with the highest value in the response map is taken as the most similar position of the target object, and the response map is expanded to the size of the original image, thereby determining the position of the target on the image to be searched;
step (4), training a twin network based on difficult sample mining based on the training set in the step (1) to obtain a twin network with training convergence;
and (5) performing online target tracking by using the trained twin network.
Further, the operation of step (1) includes cropping a target area template image and cropping a search area image; the target template image cropping method is as follows: the target frame of the template image in target tracking is known; a square area is cut out with the tracked target as the center, the center position of the target area representing the target position; q pixels are added on each of the four sides of the target frame, and finally the cropped target image block is scaled in size; the search area image cropping method is as follows: with the target area as the center, 2q pixels are added on each of the four sides of the target frame, and the cropped search area image block is then scaled in size; where q = (w + h)/4, w is the width of the target frame, and h is the height of the target frame.
Further, the feature extraction networks of different branches of the twin network in the step (2) are all adjusted ResNet-50, and features of the input image are extracted through the ResNet-50.
Further, the positive sample pairs are image pairs with similar visual features and high reference contrast, and the negative sample pairs are image pairs with similar visual features and low reference contrast; the difficult samples in the dataset are defined as:
P = {(i, j) | S_v(x_i, x_j) ≥ α, S_c(y_i, y_j) ≥ β}
N = {(m, n) | S_v(x_m, x_n) ≥ α, S_c(y_m, y_n) < β}
where S_v denotes the visual feature similarity, S_c denotes the reference contrast similarity, α denotes the threshold on the visual feature similarity, and β denotes the threshold on the reference contrast similarity;
when selecting pictures from the training set for training, for each picture a most dissimilar positive sample and a most similar negative sample are selected to form a triplet, and the difficult-sample triplet loss is calculated; the difficult-sample triplet loss is defined as:
L_hard = Σ_{i=1}^{M} Σ_{a=1}^{N} ( max d_{A,P} − min d_{A,N} + θ )_+
where M denotes the M targets selected from each sample, N denotes the N random images of each target, (z)_+ denotes max(z, 0), with z = max d_{A,P} − min d_{A,N} + θ; θ is a threshold parameter set according to actual needs, representing the margin between the positive and negative sample similarities; d_{A,P} denotes the distance between the template sample and a positive sample, and d_{A,N} denotes the distance between the template sample and a negative sample;
through LhardOptimizing loss, continuously mining positive sample pairs and difficult negative samples by the model in the training process, and learning the characteristics with distinguishing capability.
Further, step (3) operates as follows: after feature extraction, features from different layers are fused; the lower-layer features carry more target position information and the higher-layer features carry more semantic information, so the higher-layer features are first up-sampled and then fused with the lower-layer features, iteratively generating, for each branch, a feature map in which the multi-layer features are fused; the target template image feature map is cross-correlated with the search area positive sample image feature map and the negative sample image feature map respectively to obtain response maps, and the response maps are expanded to the size of the original image, thereby determining the position of the target on the image to be searched.
Further, the specific operation of step (4) is as follows:
1) training with the initial positive and negative samples, so that Z is pulled close to P and pushed away from N, obtaining a trained classifier;
2) classifying the samples by using a trained classifier, putting the misclassified samples into a negative sample subset as difficult negative samples, and continuing to train the classifier;
3) repeating the above steps until the performance of the classifier no longer improves.
Further, the online tracking process in step (5) includes the following steps:
1) reading the first frame of the video sequence to be tracked, obtaining its bounding box information, cutting out the target template image Z of the first frame according to the method for cutting the target template image in step (1), inputting Z into the template branch of the twin network converged by the training in step (4), extracting and fusing the multi-layer features of the template image, and then setting t = 2;
2) reading the t-th frame of the video to be tracked, cutting out the search area image of the t-th frame according to the target position determined in the (t-1)-th frame and the method for cutting the search area image in step (1), inputting the cut t-th frame search area image into the search branch of the twin network converged by the training in step (4), and extracting the features of the t-th frame search image;
3) performing cross-correlation operation on the characteristic diagram obtained in the step 1) after multi-layer fusion and the characteristic diagram obtained in the step 2);
4) setting t = t + 1 and judging whether t ≤ T, where T is the total number of frames of the video sequence to be tracked; if t ≤ T, executing steps 2)-3), otherwise ending the tracking process of the video sequence.
Compared with the prior art, the invention has the advantages that:
aiming at the problem that the existing twin network target tracking method does not consider the effect of a difficult sample on a model, the twin network target tracking method based on difficult sample mining is designed, the difficult sample mining is introduced into a target tracking twin network structure, a difficult negative sample is mined as training data in the training process, the difficult sample triple loss is selected as a loss function, the loss function is continuously optimized, the model is made to learn the characteristic with distinguishing capability, and the target tracking effect is better.
Specifically, in the training process, the initial positive and negative samples are used for training; the trained classifier is then used to classify the samples, the misclassified samples are put into the negative sample subset as difficult negative samples, and training continues; this is repeated until the performance of the classifier no longer improves. Unlike traditional triplet training, whose samples are mostly simple and easily distinguishable, the method selects difficult-sample triplets and uses the difficult samples to update the network parameters during training: for each picture, the most dissimilar positive sample and the most similar negative sample are selected to compute the difficult triplet loss. By optimizing this loss, the model continuously mines difficult negative samples during training, so that the network is fully trained, similar targets are better distinguished, problems such as local change and background interference in the picture are handled, and the generalization capability is stronger.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
FIG. 1 is a schematic overall flow diagram of the present invention;
FIG. 2 is a schematic diagram of a difficult sample mining strategy according to the present invention;
FIG. 3 is a graph illustrating the tracking effect of target tracking on a first video sequence using the method of the present invention;
FIG. 4 is a graph illustrating the tracking effect of target tracking on a second video sequence using the method of the present invention.
Detailed Description
The invention is further described with reference to the following figures and specific embodiments.
With reference to the overall flow of the present invention shown in fig. 1, a twin network single target tracking method based on difficult sample mining includes the following steps:
and (1) constructing a training set.
According to the target position and size of the image, cutting out a target template image Z and a search area image X of all images in the image sequence training set, dividing the search area image X into a positive example image P and a negative example image N, forming a pair of positive sample pairs by the image Z and the image P, forming a pair of negative sample pairs by the image Z and the image N, and forming a training data set by (Z, P, N) triples formed by the target template image Z, the positive example image P and the negative example image N.
Specifically, the operation of step (1) includes cropping a target area template image and cropping a search area image. The target template image cropping method is as follows: the target frame of the template image in target tracking is known; a square area is cut out with the tracked target as the center, the center position of the target area representing the target position; q pixels are added on each of the four sides of the target frame, and finally the cropped target image block is scaled to 127×127. The search area image cropping method is as follows: with the target area as the center, 2q pixels are added on each of the four sides of the target frame, and the cropped search area image block is then scaled to 255×255, where q = (w + h)/4, w is the width of the target frame, and h is the height of the target frame.
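The cropping geometry above can be sketched as follows. This is an illustrative sketch: the patent states only that a square area is cut out, so taking the square's side as the geometric mean of the padded width and height is an assumption, and all names are hypothetical.

```python
import numpy as np

def crop_sizes(w, h):
    # Margin q = (w + h) / 4 as defined in step (1).
    q = (w + h) / 4.0
    # Template crop: the target box padded by q pixels on every side.
    # Using the geometric mean of the padded width and height as the
    # square's side is an assumption, not stated in the patent.
    template_side = np.sqrt((w + 2 * q) * (h + 2 * q))
    # Search crop: the target box padded by 2q pixels on every side.
    search_side = np.sqrt((w + 4 * q) * (h + 4 * q))
    return q, template_side, search_side

q, tpl, srch = crop_sizes(64.0, 32.0)
# q = 24.0; the cropped patches are then rescaled to 127x127 (template)
# and 255x255 (search area)
```

Since the search crop pads by twice the margin, the search patch always covers a strictly larger region than the template patch before both are rescaled.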
And (2) constructing a convolution twin network based on difficult sample mining to obtain characteristic diagrams of different branches.
The network comprises three branches and the three branches share the weight of the feature extraction network; the three branches are respectively used for obtaining a feature map of a target template image, a feature map of a search area positive sample image and a feature map of a negative sample image, wherein in feature extraction, a difficult sample is defined, and difficult sample mining is introduced to learn features with distinguishing capability.
Specifically, in step (2), the feature extraction networks of the different branches of the twin network are all an adjusted ResNet-50, and feature extraction is performed on the input image through ResNet-50.
Difficult sample mining is introduced to learn features with discriminative power. In conjunction with the difficult sample mining strategy of the present invention shown in fig. 2, in particular, the present invention considers obtaining valid pairs of difficult samples in terms of both visual feature similarity and reference contrast similarity. Image pairs possessing similar visual features and high reference contrast are defined as positive sample pairs, and image pairs possessing similar visual features and low reference contrast are defined as negative sample pairs.
The difficult samples in the dataset are defined as:
P = {(i, j) | S_v(x_i, x_j) ≥ α, S_c(y_i, y_j) ≥ β}
N = {(m, n) | S_v(x_m, x_n) ≥ α, S_c(y_m, y_n) < β}
where S_v denotes the visual feature similarity, S_c denotes the reference contrast similarity, α denotes the threshold on the visual feature similarity, and β denotes the threshold on the reference contrast similarity.
The traditional triplet samples three pictures from the training data, which is simple, but most of the resulting samples are easily distinguishable sample pairs; if a large fraction of the training sample pairs are simple pairs, the network cannot learn good features. Therefore, when the method selects pictures from the training set for training, for each picture a most dissimilar positive sample and a most similar negative sample are selected to form a triplet, and the difficult-sample triplet loss is calculated.
The difficult-sample triplet loss is defined as:
L_hard = Σ_{i=1}^{M} Σ_{a=1}^{N} ( max d_{A,P} − min d_{A,N} + θ )_+
where M denotes the M targets selected from each sample, N denotes the N random images of each target, (z)_+ denotes max(z, 0), with z = max d_{A,P} − min d_{A,N} + θ; θ is a threshold parameter set according to actual needs, representing the margin between the positive and negative sample similarities; d_{A,P} denotes the distance between the template sample and a positive sample, and d_{A,N} denotes the distance between the template sample and a negative sample.
By optimizing the loss L_hard, the model continuously mines positive sample pairs and difficult negative samples during training, and learns features with distinguishing capability.
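A minimal sketch of the difficult-sample triplet loss described above: the array shapes and the averaging over targets are assumptions, since the patent does not fix the normalization.

```python
import numpy as np

def hard_triplet_loss(d_pos, d_neg, theta=0.5):
    # d_pos[i, :]: distances from template (anchor) i to its N positives;
    # d_neg[i, :]: distances from template i to its N negatives.
    hardest_pos = d_pos.max(axis=1)   # max d_{A,P}: most dissimilar positive
    hardest_neg = d_neg.min(axis=1)   # min d_{A,N}: most similar negative
    z = hardest_pos - hardest_neg + theta
    # (z)_+ = max(z, 0), averaged over the M targets (normalization assumed)
    return np.maximum(z, 0.0).mean()

d_pos = np.array([[0.2, 0.4], [0.1, 0.3]])  # M=2 targets, N=2 positives each
d_neg = np.array([[0.9, 0.7], [0.8, 0.6]])
loss = hard_triplet_loss(d_pos, d_neg, theta=0.5)
# loss = mean(max(0.4-0.7+0.5, 0), max(0.3-0.6+0.5, 0)) = 0.2
```

When every hardest negative is farther than its hardest positive by more than the margin θ, the hinge term is zero and the loss vanishes, so only hard triplets contribute gradients.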
And (3) performing a cross-correlation operation on the target template image feature map obtained in step (2) and the search area image feature map to obtain a response map, wherein the position with the highest value in the response map is regarded as the most similar position of the target object, so that the position of the target is determined.
Specifically, step (3) operates as follows: after feature extraction, features from different layers are fused. The lower-layer features carry more target position information and the higher-layer features carry more semantic information, so the higher-layer features are first up-sampled and then fused with the lower-layer features, iteratively generating, for each branch, a feature map in which the multi-layer features are fused. The target template image feature map is cross-correlated with the search area positive sample image feature map and the negative sample image feature map respectively to obtain response maps. The response maps are expanded to the size of the original image, thereby determining the position of the target on the image to be searched.
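The cross-correlation of step (3) can be illustrated with a naive sliding-window sketch. Feature extraction, multi-layer fusion, and the upscaling of the response map are omitted; the shapes and names are assumptions for illustration only.

```python
import numpy as np

def cross_correlate(search_feat, template_feat):
    # Slide the template feature map over the search-region feature map;
    # each response value is the inner product over all channels.
    c, hs, ws = search_feat.shape
    _, ht, wt = template_feat.shape
    resp = np.empty((hs - ht + 1, ws - wt + 1))
    for i in range(resp.shape[0]):
        for j in range(resp.shape[1]):
            window = search_feat[:, i:i + ht, j:j + wt]
            resp[i, j] = float(np.sum(window * template_feat))
    return resp

rng = np.random.default_rng(0)
search = rng.standard_normal((4, 8, 8))   # C x H x W search-region features
template = search[:, 2:5, 3:6].copy()     # plant the template at offset (2, 3)
resp = cross_correlate(search, template)
# resp[2, 3] equals ||template||^2, the self-match response
```

The argmax of the response map gives the most similar location in feature-map coordinates; in the method above, the map is then expanded to the original image size to read off the target position.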
And (4) training a twin network based on difficult sample mining based on the training set in the step (1) to obtain a twin network with training convergence.
Specifically, the specific operation of step (4) is as follows:
1) training with the initial positive and negative samples, so that Z is pulled close to P and pushed away from N, obtaining a trained classifier;
2) classifying the samples by using a trained classifier, putting the misclassified samples into a negative sample subset as difficult negative samples, and continuing to train the classifier;
3) Repeating the above steps until the performance of the classifier no longer improves.
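One round of the mining loop in step (4) can be sketched as follows, with the classifier reduced to a vector of similarity scores. This is a hypothetical simplification; any real classifier and decision threshold could stand in.

```python
import numpy as np

def mine_hard_negatives(scores, labels, threshold=0.0):
    # A sample is a hard negative when the current classifier scores a
    # true negative (label -1) above the decision threshold, i.e. it is
    # misclassified; these indices join the negative subset for retraining.
    misclassified = (labels == -1) & (scores > threshold)
    return np.flatnonzero(misclassified)

scores = np.array([0.9, 0.4, -0.3, 0.6, -0.8])   # classifier outputs
labels = np.array([+1, -1, -1, -1, +1])          # ground truth
hard = mine_hard_negatives(scores, labels)
# hard -> indices 1 and 3: negatives the classifier mistook for positives
```

Repeating this step between training rounds grows the negative subset with exactly the samples the current classifier gets wrong, which is what drives the "until performance no longer improves" stopping rule.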
And (5) performing online target tracking by using the trained twin network.
Specifically, the online tracking process in step (5) includes the steps of:
1) reading the first frame of the video sequence to be tracked, obtaining its bounding box information, cutting out the target template image Z of the first frame according to the method for cutting the target template image in step (1), inputting Z into the template branch of the twin network converged by the training in step (4), extracting and fusing the multi-layer features of the template image, and then setting t = 2.
2) Reading the t-th frame of the video to be tracked, cutting out the search area image of the t-th frame according to the target position determined in the (t-1)-th frame and the method for cutting the search area image in step (1), inputting the cut t-th frame search area image into the search branch of the twin network converged by the training in step (4), and extracting the features of the t-th frame search image.
3) Performing cross-correlation operation on the characteristic diagram obtained in the step 1) after multi-layer fusion and the characteristic diagram obtained in the step 2).
4) Setting t = t + 1 and judging whether t ≤ T, where T is the total number of frames of the video sequence to be tracked; if so, executing steps 2)-3), otherwise ending the tracking process of the video sequence.
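The frame loop of the online tracking procedure can be sketched as follows. The three callables stand in for the trained template branch, search branch, and correlation step; all names and the toy 1-D "feature maps" are assumptions for illustration.

```python
import numpy as np

def track(frames_total, extract_template, extract_search, correlate):
    # Step 1): template features from the first frame; t starts at 2.
    template_feat = extract_template(1)
    positions = {1: None}  # the first-frame position is given, not predicted
    t = 2
    while t <= frames_total:             # step 4): loop while t <= T
        search_feat = extract_search(t)  # step 2): crop + search-branch features
        response = correlate(template_feat, search_feat)  # step 3)
        positions[t] = int(np.argmax(response))  # peak of the response map
        t += 1
    return positions

# Toy run: 1-D "feature maps"; the response peak stays at index 1.
feats = {t: np.array([0.1 * t, 1.0, 0.2]) for t in range(1, 6)}
pos = track(5, lambda t: feats[t], lambda t: feats[t], lambda z, x: z * x)
```

Note that the template features are computed once from the first frame and reused for every subsequent frame, which is what makes this family of trackers fast online.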
Fig. 3 shows the tracking effect of target tracking on the first video sequence using the method of the present invention. It can be seen that the target tracking method provided by the invention can effectively track the target under interference from similar backgrounds.
Fig. 4 shows the tracking effect of target tracking on the second video sequence using the method of the present invention. It can be seen that the target tracking method provided by the invention can effectively track the target under posture change and rapid movement.
In conclusion, the method introduces difficult sample mining into the target tracking twin network structure and designs a difficult triplet loss, so the network can be fully trained, the distinguishing capability of the classifier is enhanced, similar targets can be better distinguished, problems such as local change and background interference in the image are handled, and the learned model has stronger generalization capability.
It is understood that the above description is not intended to limit the present invention, and the present invention is not limited to the above examples, and those skilled in the art should understand that they can make various changes, modifications, additions and substitutions within the spirit and scope of the present invention.
Claims (7)
1. A twin network single target tracking method based on difficult sample mining is characterized by comprising the following steps:
step (1), constructing a training set: cutting out a target template image Z and a search area image X of all images in an image sequence training set according to the target position and the size of the images, dividing the search area image X into a positive example image P and a negative example image N, forming a pair of positive sample pairs by the image Z and the image P, forming a pair of negative sample pairs by the image Z and the image N, and forming a training data set by (Z, P, N) triples formed by the target template image Z, the positive example image P and the negative example image N;
step (2), a convolution twin network based on difficult sample mining is constructed, wherein the network comprises three branches, and the three branches share the weight of the feature extraction network; the three branches are respectively used for obtaining a feature map of a target template image, a feature map of a search area positive sample image and a feature map of a negative sample image, wherein in feature extraction, a difficult sample is defined, and difficult sample mining is introduced to learn features with distinguishing capability;
step (3), performing a cross-correlation operation on the target template image feature map obtained in step (2) and the search area image feature map to obtain a response map, wherein the position with the highest value in the response map is taken as the most similar position of the target object, and the response map is expanded to the size of the original image, thereby determining the position of the target on the image to be searched;
step (4), training a twin network based on difficult sample mining based on the training set in the step (1) to obtain a twin network with training convergence;
and (5) performing online target tracking by using the trained twin network.
2. The twin network single target tracking method based on difficult sample mining according to claim 1, wherein the operation of step (1) includes cropping a target area template image and cropping a search area image; the target template image cropping method is as follows: the target frame of the template image in target tracking is known; a square area is cut out with the tracked target as the center, the center position of the target area representing the target position; q pixels are added on each of the four sides of the target frame, and finally the cropped target image block is scaled in size; the search area image cropping method is as follows: with the target area as the center, 2q pixels are added on each of the four sides of the target frame, and the cropped search area image block is then scaled in size; where q = (w + h)/4, w is the width of the target frame, and h is the height of the target frame.
3. The twin network single-target tracking method based on difficult sample mining as claimed in claim 1, wherein the feature extraction networks of different branches of the twin network in step (2) are adjusted ResNet-50, and the input image is feature extracted through ResNet-50.
4. The twin network single target tracking method based on difficult sample mining as claimed in claim 1, wherein the positive sample pairs are image pairs with similar visual features and high reference contrast, and the negative sample pairs are image pairs with similar visual features and low reference contrast; the difficult samples in the dataset are defined as:
P = {(i, j) | S_v(x_i, x_j) ≥ α, S_c(y_i, y_j) ≥ β}
N = {(m, n) | S_v(x_m, x_n) ≥ α, S_c(y_m, y_n) < β}
wherein S_v denotes the visual feature similarity, S_c denotes the reference contrast similarity, α is the threshold on visual feature similarity, and β is the threshold on reference contrast similarity;
when selecting pictures from the training set for training, for each picture the most dissimilar positive sample and the most similar negative sample are selected to form a triplet, and the difficult-sample triplet loss is calculated; the difficult-sample triplet loss is defined as:
L_hard = (1/(M·N)) Σ_{i=1}^{M} Σ_{j=1}^{N} (max d_{A,P} − min d_{A,N} + θ)_+
wherein M denotes the number of targets selected from each sample, N denotes the number of random images of each target, (z)_+ denotes max(z, 0), z = max d_{A,P} − min d_{A,N} + θ, θ is a margin parameter set according to actual needs that represents the required gap between positive and negative sample similarities, d_{A,P} denotes the distance between the template sample and a positive sample, and d_{A,N} denotes the distance between the template sample and a negative sample;
by optimizing the loss L_hard, the model continuously mines difficult positive and negative sample pairs during training and learns features with discriminative ability.
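The batch-hard form of the loss above can be sketched in a few lines: for each anchor, take the farthest positive (max d_{A,P}) and nearest negative (min d_{A,N}), then apply the hinge (z)_+ with margin θ. The embeddings, labels, and θ = 0.3 below are illustrative, not values from the patent.

```python
import numpy as np

def hard_triplet_loss(emb, labels, theta=0.3):
    """emb: (B, D) embeddings; labels: (B,) target ids; theta: margin."""
    # Pairwise Euclidean distance matrix.
    d = np.linalg.norm(emb[:, None, :] - emb[None, :, :], axis=-1)
    same = labels[:, None] == labels[None, :]
    idx = np.arange(len(emb))
    losses = []
    for a in range(len(emb)):
        pos = d[a][same[a] & (idx != a)]   # distances to positives of this anchor
        neg = d[a][~same[a]]               # distances to negatives
        if len(pos) == 0 or len(neg) == 0:
            continue
        z = pos.max() - neg.min() + theta  # hardest positive vs hardest negative
        losses.append(max(z, 0.0))
    return float(np.mean(losses))

# Two well-separated clusters: every hinge term is clipped to zero.
emb = np.array([[0.0], [0.1], [5.0], [5.1]])
labels = np.array([0, 0, 1, 1])
loss = hard_triplet_loss(emb, labels)
```

When the clusters overlap, the hardest positive exceeds the nearest negative and the loss becomes positive, which is what drives the mining.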
5. The twin network single-target tracking method based on difficult sample mining according to claim 1 or 4, wherein step (3) operates as follows: after feature extraction, features from different layers are fused; the lower-layer features carry more target position information while the higher-layer features carry more semantic information, so the higher-layer features are up-sampled and then fused with the lower-layer features, iteratively producing the multi-layer fused feature map of each branch. The target template image feature map is cross-correlated with the positive sample and negative sample feature maps of the search area respectively to obtain response maps, and the response maps are upscaled to the size of the original image, thereby determining the position of the target in the image to be searched.
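The top-down fusion step can be sketched as below, assuming the patent's qualitative description only: a coarse high-level map is up-sampled and combined with the finer low-level map, iterating from the top of the pyramid down. Nearest-neighbour 2x up-sampling and element-wise addition are illustrative choices; the patent does not fix the fusion operator.

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbour 2x up-sampling of a (C, H, W) feature map."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def fuse_top_down(features):
    """features: list of (C, H, W) maps ordered fine -> coarse.

    Returns the fused finest-level map: each coarser map is up-sampled
    and added into the next finer one, iteratively.
    """
    fused = features[-1]
    for f in reversed(features[:-1]):
        up = upsample2x(fused)[:, :f.shape[1], :f.shape[2]]  # crop to match
        fused = f + up
    return fused

# Three pyramid levels of constant 1s: the fused fine map sums all levels.
low = np.ones((8, 16, 16))
mid = np.ones((8, 8, 8))
high = np.ones((8, 4, 4))
out = fuse_top_down([low, mid, high])
```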
6. The twin network single-target tracking method based on difficult sample mining according to claim 1, wherein the specific operation of step (4) is as follows:
1) training with the initial positive and negative samples, so that the template Z is drawn close to the positive samples P and pushed away from the negative samples N, yielding a trained classifier;
2) classifying the samples with the trained classifier, adding the misclassified samples to the negative sample subset as difficult negative samples, and continuing to train the classifier;
3) repeating steps 1) and 2) until the performance of the classifier no longer improves.
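The mining loop of steps 1)-3) can be sketched with a stand-in classifier: train, re-classify the pool, fold the misclassified negatives into the negative subset, retrain, and stop when accuracy no longer improves. The nearest-centroid classifier and synthetic 2-D data below are illustrative stand-ins for the twin network, not the patent's model.

```python
import numpy as np

rng = np.random.default_rng(1)

def train_step(pos, neg):
    """'Train' a nearest-centroid classifier: just the two class means."""
    return pos.mean(axis=0), neg.mean(axis=0)

def classify(x, mu_pos, mu_neg):
    """True where a sample is predicted positive (closer to mu_pos)."""
    return np.linalg.norm(x - mu_pos, axis=1) < np.linalg.norm(x - mu_neg, axis=1)

pos = rng.normal(2.0, 1.0, (200, 2))        # positive training samples
neg_pool = rng.normal(-2.0, 1.0, (1000, 2)) # full pool of negatives
neg = neg_pool[:100]                        # initial negative subset

best_acc = 0.0
while True:
    mu_p, mu_n = train_step(pos, neg)            # step 1): train
    pred = classify(neg_pool, mu_p, mu_n)        # step 2): re-classify the pool
    hard = neg_pool[pred]                        # negatives predicted positive = hard
    acc = 1.0 - pred.mean()                      # accuracy on the negative pool
    if acc <= best_acc:
        break                                    # step 3): stop when no improvement
    best_acc = acc
    neg = np.concatenate([neg, hard])            # grow the negative subset, retrain
```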
7. The twin network single-target tracking method based on difficult sample mining according to claim 2, wherein the online tracking process of step (5) comprises the following steps:
1) reading the first frame of the video sequence to be tracked and obtaining its bounding box information; cropping the target template image Z of the first frame according to the template cropping method of step (1); inputting Z into the template branch of the twin network trained to convergence in step (4), and extracting and fusing the multi-layer features of the template image; then setting t = 2;
2) reading the t-th frame of the video to be tracked; cropping the search area image of the t-th frame around the target position determined in frame t−1, using the search area cropping method of step (1); inputting the cropped search area image into the search branch of the converged twin network and extracting the features of the t-th frame search image;
3) performing a cross-correlation operation between the fused multi-layer feature map obtained in step 1) and the feature map obtained in step 2);
4) setting t = t + 1 and judging whether t ≤ T, where T is the total number of frames of the video sequence; if t ≤ T, repeating steps 2) and 3); otherwise, ending the tracking process for this video sequence.
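The online loop of claim 7 can be sketched on raw pixels: the template is cropped once from frame 1, and each later frame is correlated against it, with the response peak giving the new position. For brevity this toy searches the whole frame rather than a crop around the previous position, and it correlates pixels directly instead of network features; both are simplifications of the claim.

```python
import numpy as np

def track(frames, first_pos, half=2):
    """frames: list of 2-D arrays; first_pos: (row, col) target centre in frame 1."""
    r, c = first_pos
    z = frames[0][r - half:r + half + 1, c - half:c + half + 1]  # step 1): template
    positions = [first_pos]
    for t in range(1, len(frames)):                              # t = 2 .. T
        frame = frames[t]
        H, W = frame.shape
        h, w = z.shape
        resp = np.zeros((H - h + 1, W - w + 1))                  # step 3): response map
        for i in range(resp.shape[0]):
            for j in range(resp.shape[1]):
                resp[i, j] = np.sum(frame[i:i + h, j:j + w] * z)
        i, j = np.unravel_index(np.argmax(resp), resp.shape)
        positions.append((i + half, j + half))                   # peak -> new centre
    return positions

# Toy sequence: a bright 5x5 patch drifting one pixel per frame on noise.
rng = np.random.default_rng(2)
patch = rng.uniform(5, 6, (5, 5))
frames = []
for t in range(4):
    f = rng.uniform(0, 1, (30, 30))
    f[8 + t:13 + t, 10 + t:15 + t] = patch
    frames.append(f)
positions = track(frames, (10, 12))   # tracked centres, one per frame
```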
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111152770.4A CN113888595B (en) | 2021-09-29 | 2021-09-29 | Twin network single-target visual tracking method based on difficult sample mining |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113888595A true CN113888595A (en) | 2022-01-04 |
CN113888595B CN113888595B (en) | 2024-05-14 |
Family
ID=79008165
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111152770.4A Active CN113888595B (en) | 2021-09-29 | 2021-09-29 | Twin network single-target visual tracking method based on difficult sample mining |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113888595B (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111340850A (en) * | 2020-03-20 | 2020-06-26 | 军事科学院***工程研究院***总体研究所 | Ground target tracking method of unmanned aerial vehicle based on twin network and central logic loss |
CN111354017A (en) * | 2020-03-04 | 2020-06-30 | 江南大学 | Target tracking method based on twin neural network and parallel attention module |
WO2020181685A1 (en) * | 2019-03-12 | 2020-09-17 | 南京邮电大学 | Vehicle-mounted video target detection method based on deep learning |
Non-Patent Citations (3)
Title |
---|
Zhang Boyan; Zhong Yong: "A single-object tracking algorithm based on diverse positive instances", Journal of Harbin Institute of Technology, no. 10, 25 September 2020 (2020-09-25) * |
Xiong Changzhen; Li Yan: "A survey of tracking algorithms based on Siamese networks", Industrial Control Computer, no. 03, 25 March 2020 (2020-03-25) * |
Ji Xiaopeng; Wei Zhiqiang: "Vehicle tracking based on contour features and extended Kalman filtering", Journal of Image and Graphics, no. 02, 16 February 2011 (2011-02-16) * |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112184752A (en) | Video target tracking method based on pyramid convolution | |
CN110674866A (en) | Method for detecting X-ray breast lesion images by using transfer learning characteristic pyramid network | |
EP1934941B1 (en) | Bi-directional tracking using trajectory segment analysis | |
CN109086777B (en) | Saliency map refining method based on global pixel characteristics | |
CN112489081B (en) | Visual target tracking method and device | |
CN112651998B (en) | Human body tracking algorithm based on attention mechanism and double-flow multi-domain convolutional neural network | |
CN111598876B (en) | Method, system and equipment for constructing thyroid nodule automatic identification model | |
CN111368759B (en) | Monocular vision-based mobile robot semantic map construction system | |
CN108595558B (en) | Image annotation method based on data equalization strategy and multi-feature fusion | |
CN112819806A (en) | Ship weld defect detection method based on deep convolutional neural network model | |
CN113592894B (en) | Image segmentation method based on boundary box and co-occurrence feature prediction | |
CN116524197B (en) | Point cloud segmentation method, device and equipment combining edge points and depth network | |
Li et al. | Dictionary optimization and constraint neighbor embedding-based dictionary mapping for superdimension reconstruction of porous media | |
JP2022082493A (en) | Pedestrian re-identification method for random shielding recovery based on noise channel | |
CN110544267B (en) | Correlation filtering tracking method for self-adaptive selection characteristics | |
CN115564801A (en) | Attention-based single target tracking method | |
CN114861761A (en) | Loop detection method based on twin network characteristics and geometric verification | |
CN114495170A (en) | Pedestrian re-identification method and system based on local self-attention inhibition | |
CN117576753A (en) | Micro-expression recognition method based on attention feature fusion of facial key points | |
CN117351078A (en) | Target size and 6D gesture estimation method based on shape priori | |
Gong et al. | Research on an improved KCF target tracking algorithm based on CNN feature extraction | |
CN113888595B (en) | Twin network single-target visual tracking method based on difficult sample mining | |
CN108765384B (en) | Significance detection method for joint manifold sequencing and improved convex hull | |
CN115294371B (en) | Complementary feature reliable description and matching method based on deep learning | |
CN116051601A (en) | Depth space-time associated video target tracking method and system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||