Background
Visual target tracking is one of the basic tasks in the field of computer vision, and is generally divided into multi-target tracking and single-target tracking according to the number of targets to be tracked. Multi-target tracking is usually restricted to a fixed set of object classes, such as vehicles or pedestrians, i.e. the class of the targets to be tracked is fixed; in general single-target tracking tasks, the category of the target to be tracked can be arbitrary.
The input of the general single-target tracking task is a continuous, online, real-time frame sequence or a section of offline cached video. The target to be tracked (commonly represented as a rectangular bounding box) only needs to be selected at the starting moment on the terminal equipment provided with the camera, or in the first frame of the offline video; the general single-target tracking algorithm then continuously, stably and efficiently gives the accurate position of the target in each subsequent frame in the form of a bounding box.
The tracking-by-detection mechanism uses a target detection mechanism to perform target tracking tasks, and is commonly used in the field of multi-target tracking. Specifically, all targets in the current frame are first detected using a target detection method, and the targets of the current frame and the previous frame are then associated using a target association strategy (such as the Hungarian matching algorithm). This mechanism is equally applicable to the field of general single-target tracking. Given a real-time input frame sequence or the first frame of a section of offline cached video, together with the corresponding bounding box of the target to be tracked, the tracking-by-detection mechanism takes the target to be tracked (resized to a uniform input size, usually with a certain proportion of surrounding environment information added) as a template image. The template is matched in turn against each sub-area of the same size within the current frame or a local region of the current frame, and the sub-area with the highest matching similarity is taken as the position of the target to be tracked.
SiamFC (Siamese Fully-Convolutional Networks) is a general single-target tracking method based on a Siamese (twin) neural network, and its main idea is based on the tracking-by-detection framework. First, one and the same feature space embedding (Feature Space Embedding) φ maps the template image z and the candidate image x into a high-dimensional feature space, denoted φ(z) and φ(x) respectively. Then, another similarity metric function g is used to compute the similarity of φ(z) and φ(x). The formalized representation is shown as formula (101):

f(z, x) = g(φ(z), φ(x))    (101)

In formula (101), z represents the template image, x represents the candidate image, φ represents the feature space embedding, φ(z) is the high-dimensional feature space representation of the template image z, φ(x) is the high-dimensional feature space representation of the candidate image x, g represents the similarity metric function, and f represents the entire general single-target tracking algorithm, SiamFC.
In the tracking process, SiamFC takes the tracking target bounding box of the previous frame as the center and adds a certain amount of neighborhood spatial information to form the search area of the current frame. The high-dimensional feature space representation φ(z) of the template image z is then compared, in the manner of a sliding window (Sliding Window), with the high-dimensional feature space representation φ(x) of the candidate image x at each spatial position within the search area, i.e. formula (101) is evaluated. All similarity measures form a confidence score map; the higher the score, the more similar the template image z and the corresponding candidate image x in the search area. The displacement of the center of the target to be tracked from the previous frame to the current frame is obtained by multiplying the offset of the maximum value in the confidence score map, relative to the center of the map, by the total stride of the feature space embedding network. For scale estimation of the target to be tracked, SiamFC scales the search area using predefined scale factors and then performs the above steps on the search areas of different scales; the scale factor corresponding to the largest of the maxima of all the confidence score maps is the scale of the final tracking bounding box. However, predefined scale factors can hardly cover the target scale variations of a real tracking scene caused by factors such as camera focal length and rapid target motion.
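The sliding-window similarity measure and the center-displacement computation described above can be sketched as follows; this is a minimal NumPy illustration with toy shapes and an illustrative stride value, not the exact configuration of SiamFC:

```python
import numpy as np

def xcorr(search, template):
    # Dense sliding-window inner product of a template feature over a search feature.
    # search: (C, Hs, Ws), template: (C, Ht, Wt) -> score map (Hs-Ht+1, Ws-Wt+1).
    C, Hs, Ws = search.shape
    _, Ht, Wt = template.shape
    score = np.empty((Hs - Ht + 1, Ws - Wt + 1))
    for i in range(score.shape[0]):
        for j in range(score.shape[1]):
            score[i, j] = np.sum(search[:, i:i + Ht, j:j + Wt] * template)
    return score

def center_displacement(score, stride):
    # Offset of the score-map maximum from the map center, scaled back to image pixels.
    r, c = np.unravel_index(np.argmax(score), score.shape)
    cy, cx = (score.shape[0] - 1) / 2, (score.shape[1] - 1) / 2
    return (r - cy) * stride, (c - cx) * stride

# Toy example: a 3x3 template embedded in a 9x9 search feature at a known position.
search = np.zeros((1, 9, 9))
template = np.ones((1, 3, 3))
search[:, 4:7, 5:8] = 1.0
score = xcorr(search, template)
dy, dx = center_displacement(score, stride=8)  # stride value is illustrative
```

The score-map peak recovers the known template position, and the offset from the map center, scaled by the stride, gives the per-frame displacement of the target center.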
To address this problem, SiamRPN draws on the target detection method Faster R-CNN on the basis of SiamFC, and introduces the RPN (Region Proposal Network) idea into the field of general single-target tracking. SiamRPN predefines several anchor boxes (Anchor Boxes) with different scales and different aspect ratios for each spatial position in the confidence score map, and uses a convolutional neural network to predict the center offset and the width-height deviation between each anchor box and the bounding box of the target to be tracked. The SiamRPN method greatly improves the tracking accuracy of Siamese-network-based general single-target tracking.
Since the feature space embedding used by SiamFC and SiamRPN is AlexNet, the number of network layers is small, the depth is shallow, and the feature representation capability is limited. In order to fully mine the input information and enrich the high-dimensional feature representation, the SiamRPN++ method builds on the SiamRPN architecture and uses ResNet-50 as the feature space embedding network. In ResNet-50, because the receptive field changes greatly with depth, the features at different depths carry richer and more varied meaning. The shallower features focus mainly on detail information such as color, shape and texture, which is very important for target localization; the deeper features focus more on the semantic information of the target, which helps when the target undergoes motion blur and large deformation. Thus, SiamRPN++ uses three RPN modules to receive the features of block2, block3 and block4 of the ResNet-50 network respectively, and finally combines the outputs of the three RPN modules by a linear weighted average as the final output. However, in an aerial video scene, the scale of the target to be tracked is usually small, the spatial resolution is low, and the discriminative features are insufficient; with offline-trained weighting parameters, SiamRPN++ can hardly enhance the discriminative features of the target adaptively.
Since the RPN modules of SiamRPN and SiamRPN++ use a large number of predefined anchor boxes, prior information such as the scales and aspect ratios of the anchor boxes is not readily available, nor can it accurately characterize an arbitrary application scenario. Therefore, on the basis of SiamFC, SiamFC++ does not rely on anchor boxes; instead, each spatial location of the confidence score map directly regresses the distances from that location to the four sides of the target bounding box. However, the SiamFC++ method only uses the deepest features of the feature space embedding network and directly discards the shallower features, so its performance on small-target tracking is poor. Although the semantic information of deep features is richer, the deepest features inevitably lose some detail information. The targets in aerial data are usually small (relative to the whole image), and detail information is very important for distinguishing small targets.
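The anchor-free regression used by SiamFC++, in which each spatial location predicts its distances to the four sides of the target bounding box, can be illustrated by a small sketch; the helper names below are hypothetical:

```python
def encode_ltrb(px, py, box):
    # Training target: distances from location (px, py) to the four sides of the box.
    x1, y1, x2, y2 = box
    return (px - x1, py - y1, x2 - px, y2 - py)

def decode_box(px, py, ltrb):
    # Invert the encoding at test time to recover the box (x1, y1, x2, y2).
    l, t, r, b = ltrb
    return (px - l, py - t, px + r, py + b)

box = (40.0, 30.0, 120.0, 90.0)       # illustrative ground-truth box
ltrb = encode_ltrb(80.0, 60.0, box)   # a spatial location inside the box
recovered = decode_box(80.0, 60.0, ltrb)
```

Because the four distances are regressed directly, no anchor-box scales or aspect ratios need to be predefined.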
Because the targets in an aerial video scene are usually small, detail information such as appearance, shape and texture is easily discarded, the target to be tracked is easily affected by noise or by distractors with features similar to its own, and the tracker drifts. In addition, based on the motion smoothness assumption (that the displacement of the target between two adjacent frames is small), the above Siamese-network-based general single-target tracking methods use a local search to determine the position of the target in the current frame: the target search area of the current frame is obtained from the tracking bounding box of the previous frame plus a certain proportion of context information, instead of directly using the whole image as the search area. Because the field of view of aerial photography is large while the resolution of the real target of interest is relatively small, the target search area is very small. Once the target to be tracked is affected by noise or a distractor (an object similar to the target to be tracked) and tracking drift occurs (the algorithm's tracking bounding box and the real bounding box of the target do not completely coincide), or the target is even lost (the overlap between the algorithm's tracking bounding box and the real bounding box is 0), the search area of the subsequent frames may no longer contain the target to be tracked, and tracking fails completely.
Disclosure of Invention
In view of the above, the invention provides a ground target tracking method applied to aerial video, which is used for improving the tracking performance of a general single target tracking method in an aerial video scene.
The invention provides a ground target tracking method applied to aerial videos, which comprises the following steps:
S1: the feature space embedded network of SiamFC ++ is changed from GoogLeNet to ResNet;
S2: embedding the search area image x into the feature space embedded network ResNet, outputting the depth features of the 2 nd block of ResNet Depth feature/>, input to the lowest layer of the first feature pyramid network, output the 3 rd block of ResNetThe depth features input to the middle layer of the first feature pyramid network and output the 4 th block of ResNetInputting to the highest level of the first feature pyramid network; after the first feature pyramid network processes the depth features input to each layer, the depth features/>, of the search area image x, are output at the lowest layer of the first feature pyramid networkOutputting depth features/>, of search area image x at a middle layer of a first feature pyramid networkDepth features/>, of search area image x are output at the highest layer of the first feature pyramid network
S3: input the template image z into the feature space embedding network ResNet with the same structure and shared parameters as in step S2; the depth feature F_2^z output by the 2nd block of ResNet is input to the lowest layer of the second feature pyramid network, which has the same structure as in step S2 but does not share parameters; the depth feature F_3^z output by the 3rd block of ResNet is input to the middle layer of the second feature pyramid network, and the depth feature F_4^z output by the 4th block of ResNet is input to the highest layer of the second feature pyramid network; after the second feature pyramid network processes the depth features input to each layer, it outputs the depth feature P_2^z of the template image z at its lowest layer, the depth feature P_3^z of the template image z at its middle layer, and the depth feature P_4^z of the template image z at its highest layer;
S4: depth characterizationAnd/>Combining and inputting the depth features into a first tracking head networkAnd/>After combination, the depth features are input into a second tracking head network, and the depth features are input into a second tracking head networkAnd/>The combination is input into a third tracking head network; the first tracking head network, the second tracking head network and the third tracking head network have the same structure but do not share parameters, and the three tracking head networks have the same structure as the SiamFC ++ tracking head network;
S5: each tracking header network receives corresponding depth features And/>Outputting a first classification confidence score graph and a target bounding box regression response graph as inputs; wherein k ε {2,3,4};
S6: select the position of the maximum value of the classification confidence scores in the first classification confidence score maps output by the three tracking head networks; the vector of the target bounding box regression response map at that position is the bounding box prediction result of the target to be tracked.
In a possible implementation manner, in the above ground target tracking method applied to aerial video provided by the present invention, step S5, in which each tracking head network receives the corresponding depth features P_k^x and P_k^z as input and outputs a first classification confidence score map and a target bounding box regression response map, specifically includes:
Each tracking head network includes a classification branch for spatial location classification and a regression branch for target bounding box regression; the combined depth features P_k^x and P_k^z are input into the classification branch and the regression branch of the corresponding tracking head network respectively;
For the classification branch, multi-layer convolution layers with the same structure but unshared parameters are used to process the depth features P_k^x and P_k^z separately, a cross-correlation operation is performed, and the result of the cross-correlation operation is passed to the classification sub-branch and the centerness sub-branch of the classification branch respectively; the classification sub-branch processes the result of the cross-correlation operation with a 1×1 convolution and outputs a second classification confidence score map; the centerness sub-branch processes the result of the cross-correlation operation with a 1×1 convolution and outputs a centerness confidence probability map over the spatial positions; in the test stage, the centerness confidence probability map is multiplied, as a weight, with the second classification confidence score map to generate the first classification confidence score map;
For the regression branch, multi-layer convolution layers with the same structure but unshared parameters are used to process the depth features P_k^x and P_k^z separately, and a cross-correlation operation is performed; the regression branch processes the result of the cross-correlation operation with a 1×1 convolution and outputs the target bounding box regression response map;
Step S5 is formally expressed as:

F_i(z, x) = ξ_i(P_{i+1}^z, P_{i+1}^x), i ∈ {1, 2, 3}

wherein φ represents the feature space embedding network ResNet, P_k^z represents the depth features of the template image z after processing by the first k blocks of the feature space embedding network ResNet and the second feature pyramid network, P_k^x represents the depth features of the search area image x after processing by the first k blocks of the feature space embedding network ResNet and the first feature pyramid network, and k represents the block index of the feature space embedding network ResNet; i represents the index of the tracking head network, i ∈ {1, 2, 3}, and ξ_i represents the i-th tracking head network; F_i denotes the mapping from input to output of the i-th group, mathematically expressed as:

F_i(z, x) = (A_i^cls, A_i^reg), A_i^cls ∈ R^{h_i × w_i}, A_i^reg ∈ R^{h_i × w_i × 4}

wherein A_i^cls and A_i^reg represent the first classification confidence score map and the target bounding box regression response map output by the i-th tracking head network respectively, and h_i and w_i represent the height and width of the output result of the i-th tracking head network respectively.
In a possible implementation manner, in the above ground target tracking method applied to aerial video provided by the present invention, step S6, in which the position of the maximum value of the classification confidence scores in the first classification confidence score maps output by the three tracking head networks is selected and the vector of the target bounding box regression response map at that position is the bounding box prediction result of the target to be tracked, specifically includes:
The position of the maximum value of the classification confidence scores in the first classification confidence score maps output by the three tracking head networks is selected, formally expressed as:

p = (p_1, p_2, p_3) = argmax_{i, r, c} A_i^cls(r, c)

wherein p represents the position of the maximum value of all classification confidence scores, p_1 ∈ {1, 2, 3} is the index of the tracking head network in which the maximum lies, and (p_2, p_3) are the row and column of the maximum within the corresponding map.

The vector of the target bounding box regression response map at this position is the bounding box prediction result of the target to be tracked, formally expressed as:

b = A_{p_1}^reg(p_2, p_3)

wherein b ∈ R^4 represents the vector at row p_2 and column p_3 of the target bounding box regression response map output by the p_1-th tracking head network.
In a possible implementation manner, in the above ground target tracking method applied to aerial video provided by the present invention, in step S2, a search area size adaptive adjustment strategy specifically includes:
The search area image is a part of each frame of image in the aerial video. During tracking, the initial size of the search area image is set to d_0, the tracking quality θ of the current frame image in the aerial video is represented by the maximum value of the classification confidence scores in the first classification confidence score maps of the three tracking head networks, and a stability threshold τ_1, a loss threshold τ_2 and a number of tracking quality grades m are set, where m = 3. The size of the search area image in the next frame image is then determined piecewise: it is kept at d_0 when θ ≥ τ_1, enlarged to μ when θ < τ_2, and, when τ_2 ≤ θ < τ_1, enlarged according to an adaptive growth strategy graded into m levels, taking the maximum of the graded size and d_0; where μ represents 3 times the maximum receptive field size of the feature space embedding network ResNet, and the function max returns the maximum element of a set.
The present invention provides a ground target tracking method applied to aerial videos, which belongs to the class of general single-target tracking methods. The feature space embedding network of SiamFC++ is changed from GoogLeNet to ResNet; the search area image x and the template image z are input into ResNet, and their depth features are extracted; two feature pyramid networks are used to strengthen the depth features of the search area image x and the template image z; the depth features are input into the tracking head networks, which output a first classification confidence score map and a target bounding box regression response map; the position of the maximum value of the classification confidence scores in the first classification confidence score maps is selected, and the vector of the target bounding box regression response map at that position is the bounding box prediction result of the target to be tracked. During depth feature extraction, the feature pyramid networks adaptively fuse the shallow and deep features of the feature space embedding network, so that the feature representation carries rich detail information such as appearance, shape and texture as well as strong semantic information. This strengthens the discriminative feature representation of small targets and avoids problems such as tracking drift and even tracking loss that the small targets resulting from the large field of view of aerial photography would otherwise cause. Moreover, a search area size adaptive adjustment strategy is proposed to enhance the tracker's resistance to the risk of tracking loss. Experimental results of multi-aspect evaluations show that the ground target tracking method applied to aerial video provided by the present invention improves the tracking performance of general single-target tracking methods in aerial video scenes.
Detailed Description
The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is apparent that the described embodiments are merely examples and are not intended to limit the present invention.
The invention provides a ground target tracking method applied to aerial videos, which comprises the following steps:
S1: the feature space embedded network of SiamFC ++ is changed from GoogLeNet to ResNet;
S2: embedding the search area image x into the feature space embedded network ResNet, outputting the depth features of the 2 nd block of ResNet The lowest layer input to the first feature pyramid network (Feature Pyramid Network, FPN) will output the 3 rd block of ResNet depth features/>Input to the middle layer of the first feature pyramid network, depth features/>, output of the 4 th block of ResNetInputting to the highest level of the first feature pyramid network; after the first feature pyramid network processes the depth features input to each layer, the depth features/>, of the search area image x, are output at the lowest layer of the first feature pyramid networkOutputting depth features/>, of search area image x at a middle layer of a first feature pyramid networkDepth features/>, of search area image x are output at the highest layer of the first feature pyramid network
S3: input the template image z into the feature space embedding network ResNet with the same structure and shared parameters as in step S2; the depth feature F_2^z output by the 2nd block of ResNet is input to the lowest layer of the second feature pyramid network, which has the same structure as in step S2 but does not share parameters; the depth feature F_3^z output by the 3rd block of ResNet is input to the middle layer of the second feature pyramid network, and the depth feature F_4^z output by the 4th block of ResNet is input to the highest layer of the second feature pyramid network; after the second feature pyramid network processes the depth features input to each layer, it outputs the depth feature P_2^z of the template image z at its lowest layer, the depth feature P_3^z at its middle layer, and the depth feature P_4^z at its highest layer;
S4: depth characterizationAnd/>Combining and inputting the depth features into a first tracking head networkAnd/>After combination, the depth features are input into a second tracking head network, and the depth features are input into a second tracking head networkAnd/>The combination is input into a third tracking head network; the first tracking head network, the second tracking head network and the third tracking head network have the same structure but do not share parameters, and the three tracking head networks have the same structure as the SiamFC ++ tracking head network;
S5: each tracking header network receives corresponding depth features And/>Outputting a first classification confidence score graph and a target bounding box regression response graph as inputs; wherein k ε {2,3,4};
S6: select the position of the maximum value of the classification confidence scores in the first classification confidence score maps output by the three tracking head networks; the vector of the target bounding box regression response map at that position is the bounding box prediction result of the target to be tracked.
The following describes in detail the implementation of the ground target tracking method applied to aerial videos provided by the present invention through two specific embodiments.
Example 1: the flow chart is shown in fig. 1, and the overall block diagram is shown in fig. 2.
The first step: the SiamFC ++ universal single-target tracking method is selected as a baseline method, and the characteristic space embedded network of SiamFC ++ is changed from GoogLeNet to ResNet.
And a second step of: the search area image x is input into the feature space embedding network ResNet, depth features of the search area image x are extracted using the feature space embedding network ResNet, and then processed using the first feature pyramid network.
Specifically, the depth feature F_2^x output by the 2nd block of ResNet is input to the lowest layer of the first feature pyramid network, the depth feature F_3^x output by the 3rd block of ResNet is input to the middle layer of the first feature pyramid network, and the depth feature F_4^x output by the 4th block of ResNet is input to the highest layer of the first feature pyramid network. As shown in fig. 3, in the first feature pyramid network, the features input to all three layers are passed through a 1×1 convolution to adjust the number of channels to 256, denoted L_2^x, L_3^x and L_4^x respectively. It is known that the deeper depth features of ResNet have stronger semantic information than the shallower ones, while the shallower depth features of ResNet have stronger detail information than the deeper ones. Therefore, the channel-adjusted depth feature L_4^x input to the highest layer of the first feature pyramid network is enlarged to twice its resolution by an up-sampling module and then fused with the channel-adjusted depth feature L_3^x input to the middle layer of the first feature pyramid network to obtain the depth feature P_3^x. Likewise, the fused depth feature P_3^x is enlarged to twice its resolution by another up-sampling module and then fused with the channel-adjusted depth feature L_2^x input to the lowest layer of the first feature pyramid network to obtain the depth feature P_2^x. Finally, the depth feature P_2^x is output at the lowest layer of the first feature pyramid network, the depth feature P_3^x is output at the middle layer, and the channel-adjusted depth feature L_4^x input to the highest layer is output at the highest layer, denoted P_4^x.
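The top-down fusion described above can be sketched in NumPy, treating each 1×1 convolution as a per-pixel channel projection and using nearest-neighbour up-sampling; the weights, shapes, and the reduced output channel count of 16 are illustrative (the method itself uses 256 channels):

```python
import numpy as np

def conv1x1(x, w):
    # x: (C_in, H, W); w: (C_out, C_in) -- a 1x1 convolution projects channels per pixel.
    return np.tensordot(w, x, axes=([1], [0]))

def upsample2x(x):
    # Nearest-neighbour up-sampling, doubling the spatial resolution.
    return x.repeat(2, axis=1).repeat(2, axis=2)

def fpn_topdown(f2, f3, f4, w2, w3, w4):
    l4 = conv1x1(f4, w4)        # lateral features with a common channel count
    l3 = conv1x1(f3, w3)
    l2 = conv1x1(f2, w2)
    p4 = l4                     # highest layer: channel-adjusted feature only
    p3 = l3 + upsample2x(p4)    # fuse the up-sampled deeper feature
    p2 = l2 + upsample2x(p3)
    return p2, p3, p4

rng = np.random.default_rng(0)
f2 = rng.normal(size=(64, 32, 32))    # block2-like feature (shapes are illustrative)
f3 = rng.normal(size=(128, 16, 16))   # block3-like feature
f4 = rng.normal(size=(256, 8, 8))     # block4-like feature
w2, w3, w4 = rng.normal(size=(16, 64)), rng.normal(size=(16, 128)), rng.normal(size=(16, 256))
p2, p3, p4 = fpn_topdown(f2, f3, f4, w2, w3, w4)
```

The outputs keep the spatial resolution of their respective input levels while sharing one channel count, so each level combines shallow detail with up-sampled deep semantics.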
And a third step of: the template image z is input to the feature space embedding network ResNet which has the same structure and shared parameters as in step S2, the depth features of the template image z are extracted using the feature space embedding network ResNet, and then the depth features of the template image z are processed using the second feature pyramid network.
Specifically, the depth feature F_2^z output by the 2nd block of ResNet is input to the lowest layer of the second feature pyramid network, which has the same structure as in step S2 but does not share parameters; the depth feature F_3^z output by the 3rd block of ResNet is input to the middle layer of the second feature pyramid network, and the depth feature F_4^z output by the 4th block of ResNet is input to the highest layer of the second feature pyramid network. As shown in fig. 3, in the second feature pyramid network, the features input to all three layers are passed through a 1×1 convolution to adjust the number of channels to 256, denoted L_2^z, L_3^z and L_4^z respectively. It is known that the deeper depth features of ResNet have stronger semantic information than the shallower ones, while the shallower depth features of ResNet have stronger detail information than the deeper ones. Therefore, the channel-adjusted depth feature L_4^z input to the highest layer of the second feature pyramid network is enlarged to twice its resolution by an up-sampling module and then fused with the channel-adjusted depth feature L_3^z input to the middle layer of the second feature pyramid network to obtain the depth feature P_3^z. Likewise, the fused depth feature P_3^z is enlarged to twice its resolution by another up-sampling module and then fused with the channel-adjusted depth feature L_2^z input to the lowest layer of the second feature pyramid network to obtain the depth feature P_2^z. Finally, the depth feature P_2^z of the template image z is output at the lowest layer of the second feature pyramid network, the depth feature P_3^z is output at the middle layer, and the channel-adjusted depth feature L_4^z input to the highest layer is output at the highest layer, denoted P_4^z.
Fourth step: depth characterizationAnd/>After combination, the depth features are input into a first tracking head (TRACKING HEAD) network, and the depth features are input into a second tracking head networkAnd/>Combining and inputting the depth features into a second tracking head networkAnd/>The combination is input into a third tracking head network; wherein the first, second and third trace header networks are identical in structure but not shared in parameters, and the three trace header networks are identical in structure to the SiamFC ++ trace header network, as shown in brackets in fig. 2.
Fifth step: each tracking header network receives corresponding depth featuresAnd/>As inputs, a first classification confidence score map and a target bounding box regression response map are output.
Specifically, each tracking head network includes a classification branch for spatial location classification and a regression branch for target bounding box regression, and the combined depth features P_k^x and P_k^z are input into the classification branch and the regression branch of the corresponding tracking head network respectively. For example, the combined depth features P_2^x and P_2^z are input into the classification branch and the regression branch of the first tracking head network respectively, the combined depth features P_3^x and P_3^z are input into the classification branch and the regression branch of the second tracking head network respectively, and the combined depth features P_4^x and P_4^z are input into the classification branch and the regression branch of the third tracking head network respectively.
For the classification branch, multi-layer convolution layers with the same structure but unshared parameters are used to process the depth features P_k^x and P_k^z separately, a cross-correlation operation is performed, and the result of the cross-correlation operation is passed to the classification sub-branch and the centerness sub-branch of the classification branch respectively; the classification sub-branch processes the result of the cross-correlation operation with a 1×1 convolution and outputs a second classification confidence score map; the centerness sub-branch processes the result of the cross-correlation operation with a 1×1 convolution and outputs a centerness confidence probability map over the spatial positions; in the test stage, the centerness confidence probability map is multiplied, as a weight, with the second classification confidence score map to generate the first classification confidence score map.
For the regression branch, multi-layer convolution layers with the same structure but unshared parameters are used to process the depth features P_k^x and P_k^z separately, and a cross-correlation operation is performed; the regression branch processes the result of the cross-correlation operation with a 1×1 convolution and outputs the target bounding box regression response map.
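The branch structure of one tracking head can be sketched as follows, assuming the cross-correlation results of the classification branch and the regression branch have already been computed; the weight shapes and activation choices (sigmoid, exp) are illustrative assumptions rather than the exact SiamFC++ configuration:

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def conv1x1(x, w):
    # x: (C, H, W); w: (C_out, C) -- per-pixel channel projection.
    return np.tensordot(w, x, axes=([1], [0]))

def tracking_head(corr_cls, corr_reg, w_cls, w_ctr, w_reg):
    cls = sigmoid(conv1x1(corr_cls, w_cls))[0]  # second classification confidence score map
    ctr = sigmoid(conv1x1(corr_cls, w_ctr))[0]  # centerness confidence probability map
    reg = np.exp(conv1x1(corr_reg, w_reg))      # (4, H, W) bounding box regression response map
    return ctr * cls, reg                       # first map = centerness-weighted scores

rng = np.random.default_rng(1)
corr_cls = rng.normal(size=(32, 17, 17))  # cross-correlation result, classification branch
corr_reg = rng.normal(size=(32, 17, 17))  # cross-correlation result, regression branch
score, reg = tracking_head(corr_cls, corr_reg,
                           rng.normal(size=(1, 32)), rng.normal(size=(1, 32)),
                           rng.normal(size=(4, 32)))
```

Multiplying the centerness map into the classification scores suppresses locations far from the target center, which is the test-stage weighting described above.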
The above procedure is formally expressed as:

F_i(z, x) = ξ_i(P_{i+1}^z, P_{i+1}^x), i ∈ {1, 2, 3}

wherein φ represents the feature space embedding network ResNet, P_k^z represents the depth features of the template image z after processing by the first k blocks of the feature space embedding network ResNet and the second feature pyramid network, P_k^x represents the depth features of the search area image x after processing by the first k blocks of the feature space embedding network ResNet and the first feature pyramid network, and k represents the block index of the feature space embedding network ResNet; i represents the index of the tracking head network, i ∈ {1, 2, 3}, and ξ_i represents the i-th tracking head network; F_i denotes the mapping from input to output of the i-th group, mathematically expressed as:

F_i(z, x) = (A_i^cls, A_i^reg), A_i^cls ∈ R^{h_i × w_i}, A_i^reg ∈ R^{h_i × w_i × 4}

wherein A_i^cls and A_i^reg represent the first classification confidence score map and the target bounding box regression response map output by the i-th tracking head network respectively, and h_i and w_i represent the height and width of the output result of the i-th tracking head network respectively.
Sixth step: the position of the maximum value of the classification confidence scores over the first classification confidence score maps output by the three tracking head networks is selected; the vector of the target bounding box regression response map at that position is the bounding box prediction result of the target to be tracked.
Specifically, the position of the maximum value of the classification confidence scores in the first classification confidence score maps output by the three tracking head networks is selected, formally expressed as:

    p = (p_1, p_2, p_3) = argmax_{i, h, w} A_i^cls(h, w)

wherein p represents the position of the maximum value of all classification confidence scores, p_1 represents the index of the tracking head network, and (p_2, p_3) represents the row and column of the maximum value within that score map.
The vector of the target bounding box regression response map at this position is the bounding box prediction result of the target to be tracked, formally expressed as:

    b = A_{p_1}^reg(p_2, p_3)

wherein b represents the vector at row p_2 and column p_3 of the target bounding box regression response map output by the p_1-th tracking head network, and p_1 ∈ {1, 2, 3}.
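The selection in the sixth step can be sketched as follows. This is an illustrative NumPy version with made-up map sizes; indices are 0-based here, whereas the formal description above uses 1-based indices.

```python
import numpy as np

def select_box(cls_maps, reg_maps):
    """Pick the global maximum over the three first classification score maps
    and read out the 4-vector of the matching regression response map."""
    best = None
    for i, cls in enumerate(cls_maps):  # i plays the role of p_1 (0-based)
        h, w = np.unravel_index(np.argmax(cls), cls.shape)
        if best is None or cls[h, w] > best[0]:
            best = (cls[h, w], i, h, w)
    _, p1, p2, p3 = best
    return (p1, p2, p3), reg_maps[p1][:, p2, p3]  # position p and box vector b

# Demo with synthetic maps: the global maximum is planted in the second head.
cls_maps = [np.zeros((5, 5)) for _ in range(3)]
cls_maps[1][2, 3] = 1.0
reg_maps = [np.arange(100, dtype=float).reshape(4, 5, 5) * (i + 1) for i in range(3)]
p, b = select_box(cls_maps, reg_maps)
```

Here `p` comes out as `(1, 2, 3)` and `b` is the 4-vector of the second head's regression map at row 2, column 3, matching the formal readout b = A_{p_1}^reg(p_2, p_3).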
In summary, the above is the specific implementation process of the ground target tracking method applied to aerial video provided by Embodiment 1 of the present invention: a general single-target tracking model framework with strong discrimination capability and small-target perception.
Embodiment 2: Embodiment 1 + search region size adaptive adjustment strategy.
It should be noted that the search area image is a part of each frame of image in the aerial video. Based on the above ground target tracking method applied to aerial video provided in Embodiment 1 of the present invention, Embodiment 2 of the present invention further proposes a search area size adaptive adjustment strategy, as shown in fig. 4. Specifically, the following manner may be adopted. During tracking, the initial size of the search area image is set to d_0; the tracking quality θ of the current frame image in the aerial video is represented by the maximum value of the classification confidence scores over the first classification confidence score maps of the three tracking head networks; and a stability threshold τ_1, a loss threshold τ_2, and a number m of tracking quality classes are set. When θ ≥ τ_1, the tracking result of the current frame image is "stable and reliable", so the size of the search area does not need to be enlarged for the next frame image. When θ < τ_2, the tracking result of the current frame image is "lost", that is, the predicted target bounding box has completely deviated from the real target bounding box, and a search area as large as possible needs to be used to ensure that the target to be tracked can be retrieved in the next frame image. When τ_2 ≤ θ < τ_1, the tracking result of the current frame image is "not completely reliable": the predicted target bounding box has undergone tracking drift of a certain magnitude, and to prevent the risk of tracking loss, the search area size for the next frame image may be enlarged by an adaptive growth strategy that increases the size in m quantized steps according to the tracking quality class into which θ falls. In summary, the size of the search area image in the next frame image can be formally expressed as:
where μ represents 3 times the maximum receptive field size of the feature space embedding network ResNet, and the function max returns the maximum value of the elements in a set.
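A minimal sketch of the three-case strategy is given below. The stable and lost cases follow the description directly; the middle "adaptive growth" step rule is an illustrative assumption (linear quantization of θ between τ_1 and τ_2 into m levels), since the exact growth formula is not reproduced here, and the value of μ is a placeholder. The default parameter values d_0 = 447, τ_1 = 0.8, τ_2 = 0.4, m = 3 are those used in the experiments below.

```python
def next_search_size(theta, d0=447.0, mu=1200.0, tau1=0.8, tau2=0.4, m=3):
    """Return the search-area size for the next frame given tracking quality theta.

    mu (the upper bound, 3x the maximum receptive field) is a placeholder value.
    """
    if theta >= tau1:
        return d0            # "stable and reliable": keep the initial size
    if theta < tau2:
        return mu            # "lost": use the largest allowed search area
    # "not completely reliable": grow in m quantized steps between d0 and mu.
    # This linear step rule is an assumption for illustration only.
    level = min(int((tau1 - theta) / (tau1 - tau2) * m), m - 1)  # 0 .. m-1
    return d0 + (level + 1) * (mu - d0) / m
```

For example, `next_search_size(0.9)` keeps the initial size, `next_search_size(0.3)` jumps to the maximum, and intermediate qualities land on one of the m intermediate sizes.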
In order to better verify the effectiveness of Embodiments 1 and 2 of the present invention, actual tests combining the two embodiments are described below. The test process corresponds to the tracking process in practical application, and the test tracking videos correspond to the frame sequence input online in real time or the offline cached video in practical application.
In the field of target tracking, tracking quality is generally evaluated using the UAV123 dataset. UAV123 is a large-scale aerial-video single-target tracking dataset comprising 123 high-definition unmanned aerial vehicle aerial videos with an average length of 915 frames each, for a total of more than 100,000 frames. The dataset covers various common categories, such as pedestrians, cars, large trucks, bicycles, ships, and buildings, and covers most scenes encountered in practical application, so evaluation indexes based on this dataset have strong generalization and universality. Specifically, the main evaluation index of this dataset for target tracking methods is the area under the curve of the success plot (Success Plot Area Under Curve, Success AUC). The Success AUC takes values in [0, 1]; the greater the value, the stronger the robustness of the evaluated target tracking method, and the higher its practical application value.
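For reference, the Success AUC metric can be sketched as follows: for each overlap threshold, compute the fraction of frames whose predicted-box/ground-truth IoU exceeds the threshold, then average over thresholds. The 21-point threshold grid below is a common convention, not necessarily the exact one used by the UAV123 toolkit.

```python
import numpy as np

def iou(a, b):
    """IoU of two boxes given as (x, y, w, h)."""
    ax2, ay2 = a[0] + a[2], a[1] + a[3]
    bx2, by2 = b[0] + b[2], b[1] + b[3]
    iw = max(0.0, min(ax2, bx2) - max(a[0], b[0]))
    ih = max(0.0, min(ay2, by2) - max(a[1], b[1]))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def success_auc(pred_boxes, gt_boxes, thresholds=np.linspace(0, 1, 21)):
    """Area under the success plot: mean over thresholds of the fraction of
    frames whose predicted/ground-truth IoU exceeds the threshold."""
    overlaps = np.array([iou(p, g) for p, g in zip(pred_boxes, gt_boxes)])
    return float(np.mean([(overlaps > t).mean() for t in thresholds]))
```

A tracker whose boxes never overlap the ground truth scores 0, while perfect boxes score close to (but, with a strict `>` comparison, not exactly) 1.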
The effectiveness of the two technical points of the present invention (technical point 1, the general single-target tracking model framework with strong discrimination capability and small-target perception, i.e. Embodiment 1; technical point 2, the search region size adaptive adjustment strategy of Embodiment 2) is verified based on the UAV123 dataset by comparing the Success AUC indexes of the existing SiamFC++ method, Embodiment 1 of the present invention, the search region size adaptive adjustment strategy alone, and Embodiment 2 of the present invention. As shown in Table 1, the Success AUC index of the SiamFC++ method is 0.631. If only technical point 1 of the present invention is used, i.e. the general single-target tracking model framework with strong discrimination capability and small-target perception (Embodiment 1 of the present invention), the Success AUC index reaches 0.660, exceeding the SiamFC++ method by 2.9 points. If only technical point 2 is used, i.e. the search region size adaptive adjustment strategy, with the initial search area size set to d_0 = 447, τ_1 = 0.8, τ_2 = 0.4, and m = 3, the Success AUC index reaches 0.646, exceeding the SiamFC++ method by 1.5 points. Combining the two technical points of the present invention (i.e. Embodiment 2 of the present invention), the Success AUC index reaches 0.672, exceeding the SiamFC++ method by 4.1 points. This fully verifies the effectiveness of both technical points of the present invention.
TABLE 1

    Method                                  Success AUC
    SiamFC++                                0.631
    Technical point 1 only (Embodiment 1)   0.660
    Technical point 2 only                  0.646
    Embodiment 2 (points 1 + 2)             0.672
The following compares, based on the UAV123 dataset, the Success AUC index of the ground target tracking method applied to aerial video provided in Embodiment 2 of the present invention with those of the existing SiamRPN, SiamRPN++, and SiamFC++ methods. As shown in Table 2, the ground target tracking method applied to aerial video provided in Embodiment 2 of the present invention achieves a Success AUC index of 0.672, exceeding the existing SiamRPN, SiamRPN++, and SiamFC++ methods, which indicates that it has stronger effectiveness and versatility.
TABLE 2
The present invention provides a ground target tracking method applied to aerial videos, belonging to the general single-target tracking methods. The feature space embedding network of SiamFC++ is changed from GoogLeNet to ResNet; the search area image x and the template image z are input into ResNet, and their depth features are extracted; two feature pyramid networks are used to strengthen the depth features of the search area image x and the template image z; the depth features are input into the tracking head networks, which output the first classification confidence score maps and the target bounding box regression response maps; and the position of the maximum value of the classification confidence scores in the first classification confidence score maps is selected, the vector of the target bounding box regression response map at that position being the bounding box prediction result of the target to be tracked. In the process of extracting depth features, the feature pyramid networks adaptively fuse the shallow and deep features of the feature space embedding network, so that the feature representation carries both rich detail information (appearance, shape, texture, etc.) and strong semantic information. This strengthens the discriminative feature representation of small targets and avoids problems such as tracking drift or even tracking loss caused by the small apparent size of targets in the large aerial field of view. Moreover, a search area size adaptive adjustment strategy is proposed to enhance the tracker's resistance to the risk of tracking loss. Experimental results evaluated from multiple aspects show that the ground target tracking method applied to aerial video provided by the present invention improves the tracking performance of general single-target tracking methods in aerial video scenes.
It will be apparent to those skilled in the art that various modifications and variations can be made to the present invention without departing from the spirit or scope of the invention. Thus, it is intended that the present invention also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.