CN114219826B - Ground target tracking method applied to aerial video - Google Patents

Ground target tracking method applied to aerial video

Info

Publication number
CN114219826B
CN114219826B (application CN202111156857.9A; published as CN114219826A)
Authority
CN
China
Prior art keywords
network
tracking
target
depth features
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111156857.9A
Other languages
Chinese (zh)
Other versions
CN114219826A (en)
Inventor
刘庆杰
扶智宏
温奇
王兵
王薇
李苓苓
董喆
罗伟儿
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beihang University
Beijing Institute of Space Research Mechanical and Electricity
Technology and Engineering Center for Space Utilization of CAS
Original Assignee
Beihang University
Beijing Institute of Space Research Mechanical and Electricity
Technology and Engineering Center for Space Utilization of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beihang University, Beijing Institute of Space Research Mechanical and Electricity, Technology and Engineering Center for Space Utilization of CAS filed Critical Beihang University
Priority to CN202111156857.9A priority Critical patent/CN114219826B/en
Publication of CN114219826A publication Critical patent/CN114219826A/en
Application granted granted Critical
Publication of CN114219826B publication Critical patent/CN114219826B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/20 Analysis of motion
    • G06T7/246 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/10 Image acquisition modality
    • G06T2207/10016 Video; Image sequence
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20016 Hierarchical, coarse-to-fine, multiscale or multiresolution image processing; Pyramid transform

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a ground target tracking method applied to aerial video, in which the SiamFC++ feature space embedding network is changed to ResNet. The search area image x and the template image z are input into ResNet to extract depth features; the depth features are input into tracking head networks, which output a classification confidence score map and a target bounding box regression response map; the position of the maximum classification confidence score is selected, and the vector of the target bounding box regression response map at that position is taken as the target bounding box prediction result. During depth feature extraction, feature pyramid networks are constructed to adaptively integrate the shallow and deep features of the feature space embedding network, so that the feature representation carries both rich detail information and strong semantic information; this strengthens the discriminative feature representation of small targets and avoids tracking drift and even tracking loss. A search area size adaptive adjustment strategy is also proposed, which enhances the tracker's resistance to the risk of tracking loss.

Description

Ground target tracking method applied to aerial video
Technical Field
The invention relates to the technical field of computer vision driven by artificial intelligence technology, in particular to a ground target tracking method applied to aerial videos.
Background
Visual target tracking is one of the basic tasks in the field of computer vision and is generally divided into multi-target tracking and single-target tracking according to the number of targets to be tracked. Multi-target tracking is usually restricted to a fixed set of object classes, such as vehicles and pedestrians, i.e., the categories of the targets to be tracked are fixed; in the general single-target tracking task, the category of the target to be tracked can be arbitrary.
The input of the general single-target tracking task is a sequence of frames input online in real time or a segment of offline cached video. The target to be tracked (commonly represented as a rectangular bounding box) only needs to be selected at the starting moment on the end device equipped with the camera, or in the first frame of the offline video, and the general single-target tracking algorithm then continuously, stably, and efficiently gives the accurate position of the target to be tracked in subsequent frames in the form of a bounding box.
The tracking-by-detection mechanism performs target tracking by means of target detection and is commonly used in multi-target tracking. Specifically, all targets in the current frame are first detected with a target detection method, and the targets of the current frame and the previous frame are then associated with a target association strategy (such as the Hungarian matching algorithm). The mechanism is equally applicable to general single-target tracking: given the first frame of a real-time input frame sequence or of an offline cached video, together with the corresponding bounding box of the target to be tracked, tracking-by-detection takes the target to be tracked (resized to a uniform input size, usually with a certain proportion of surrounding context added) as a template image, matches it in turn against every sub-region of the same size as the template image in the current frame or in a local area of the current frame, and takes the sub-region with the highest matching similarity as the position of the target to be tracked.
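As a hedged illustration of the association step in a tracking-by-detection pipeline, the sketch below matches boxes across two frames with the Hungarian algorithm using an IoU-based cost; the `iou` helper and the cost definition are assumptions made for illustration, not part of the patented method.

```python
# Illustrative sketch (not the patented method): associate detections across
# frames with the Hungarian algorithm, using IoU as the matching cost.
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def associate(prev_boxes, curr_boxes):
    """Return (prev_idx, curr_idx) pairs maximizing total IoU."""
    cost = np.array([[1.0 - iou(p, c) for c in curr_boxes] for p in prev_boxes])
    rows, cols = linear_sum_assignment(cost)      # Hungarian matching
    return list(zip(rows.tolist(), cols.tolist()))

prev = [(10, 10, 50, 60), (100, 80, 140, 150)]
curr = [(102, 84, 141, 152), (12, 11, 52, 63)]
print(associate(prev, curr))                      # [(0, 1), (1, 0)]
```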
SiamFC (Siamese Fully-Convolutional Networks) is a general single-target tracking method based on a twin (Siamese) neural network, and its main idea follows the tracking-by-detection framework. First, one shared feature space embedding (Feature Space Embedding) φ maps the template image z and the candidate image x into a high-dimensional feature space, denoted φ(z) and φ(x), respectively. Then, another similarity metric function g computes the similarity between φ(z) and φ(x). The formalized representation is shown as formula (101):

f(z, x) = g(φ(z), φ(x))    (101)

In formula (101), z denotes the template image, x denotes the candidate image, φ denotes the feature space embedding, φ(z) is the high-dimensional feature space representation of the template image z, φ(x) is the high-dimensional feature space representation of the candidate image x, g denotes the similarity metric function, and f denotes the entire general single-target tracking algorithm SiamFC.
During tracking, SiamFC takes the tracking target bounding box of the previous frame as the center and adds a certain amount of neighborhood spatial information to form the search area of the current frame. The high-dimensional feature space representation φ(z) of the template image z is then compared, in a sliding-window manner, with the high-dimensional feature space representation φ(x) of the candidate image x at every spatial position within the search area, i.e., formula (101) is evaluated. All similarity measures form a confidence score map: the higher the score, the more similar the template image z is to the corresponding candidate image x in the search area. The displacement of the center of the target to be tracked from the previous frame to the current frame is obtained by multiplying the offset of the maximum value in the confidence score map relative to the center of the map by the maximum receptive field of the feature space embedding network. For scale estimation of the target to be tracked, SiamFC scales the search area with a set of predefined factors and repeats the above steps on search areas of different scales; the scaling factor whose confidence score map attains the largest maximum is taken as the scale change of the final tracking bounding box. However, predefined scaling factors can hardly cover the target scale variations of a real tracking scene caused by factors such as camera focal length changes and rapid target motion.
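A minimal sketch of the matching step just described, assuming PyTorch feature maps: plain cross-correlation plays the role of the similarity function g, and the feature stride used to map the score-map offset back to pixels is an illustrative assumption.

```python
# Minimal sketch (assumptions: PyTorch features, cross-correlation as g,
# total feature stride of 8): score map and center displacement in a
# SiamFC-style tracker.
import torch
import torch.nn.functional as F

def score_map(feat_z, feat_x):
    """Cross-correlate template features (1,C,h,w) over search features (1,C,H,W)."""
    return F.conv2d(feat_x, feat_z)              # (1,1,H-h+1,W-w+1)

def displacement(score, stride=8):
    """Offset of the score-map maximum from its center, mapped back to pixels."""
    _, _, H, W = score.shape
    idx = int(torch.argmax(score.reshape(-1)))
    row, col = divmod(idx, W)
    dy = (row - (H - 1) / 2) * stride
    dx = (col - (W - 1) / 2) * stride
    return dx, dy

feat_z = torch.randn(1, 256, 6, 6)               # embedded template phi(z)
feat_x = torch.randn(1, 256, 22, 22)             # embedded search region phi(x)
print(displacement(score_map(feat_z, feat_x)))
```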
To address this problem, SiamRPN builds on SiamFC and borrows from the target detection method Faster R-CNN, introducing the RPN (Region Proposal Network) idea into general single-target tracking. SiamRPN predefines several anchor boxes (Anchor Boxes) with different scales and aspect ratios at each spatial position of the confidence score map, and uses a convolutional neural network to predict the center offset and width/height deviation between each anchor box and the bounding box of the target to be tracked. The SiamRPN method greatly improves the tracking accuracy of twin-network-based general single-target tracking.
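To make the anchor-box idea concrete, here is a small illustrative sketch that generates anchors of several scales and aspect ratios for one score-map location; the specific base size, scales, and ratios are assumptions, not SiamRPN's actual settings.

```python
# Illustrative anchor generation (assumed base size, scales and ratios,
# not SiamRPN's actual configuration): one set of anchors per location.
import numpy as np

def make_anchors(base_size=64, ratios=(0.5, 1.0, 2.0), scales=(0.5, 1.0, 2.0)):
    """Return an (N, 2) array of anchor (width, height) pairs."""
    anchors = []
    for r in ratios:                      # r = height / width
        for s in scales:
            area = (base_size * s) ** 2
            w = np.sqrt(area / r)
            h = w * r
            anchors.append((w, h))
    return np.array(anchors)

def place_anchors(anchors, cx, cy):
    """Center every anchor at score-map location (cx, cy), as (x1, y1, x2, y2)."""
    w, h = anchors[:, 0], anchors[:, 1]
    return np.stack([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2], axis=1)

a = make_anchors()
print(a.shape)                            # (9, 2): 3 ratios x 3 scales
print(place_anchors(a, 128, 128)[:2])
```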
Because the depth feature space embedding used by SiamFC and SiamRPN is AlexNet, which has few layers and a shallow depth, its feature representation capability is limited. To fully mine the input information and enrich the high-dimensional feature representation, the SiamRPN++ method keeps the SiamRPN architecture but adopts ResNet-50 as the depth feature space embedding network. In ResNet-50, features at different depths have clearly different characteristics because the receptive field changes greatly: shallower features focus mainly on detail information such as color, shape, and texture, which is very important for target localization, while deeper features focus more on the semantic information of the target, which helps when the target undergoes motion blur or large deformation. SiamRPN++ therefore uses three RPN modules that receive the features of block2, block3, and block4 of the ResNet-50 network, respectively, and finally combines the outputs of the three RPN modules by a linear weighted average as the final output. However, in aerial video scenes the target to be tracked is usually small, its spatial resolution is low, and its discriminative features are weak; with weighting parameters trained offline, SiamRPN++ can hardly enhance the discriminative features of the target adaptively.
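The linear weighted fusion of the three RPN outputs can be sketched as follows; modelling the offline-trained weights as softmax-normalized learnable parameters is an assumption made for illustration.

```python
# Sketch of SiamRPN++-style fusion (assumption: softmax-normalized learnable
# weights): combine the outputs of three RPN modules into one response.
import torch
import torch.nn as nn

class WeightedFusion(nn.Module):
    def __init__(self, num_branches=3):
        super().__init__()
        self.weights = nn.Parameter(torch.zeros(num_branches))  # trained offline

    def forward(self, responses):
        """responses: list of tensors of identical shape, one per RPN module."""
        w = torch.softmax(self.weights, dim=0)
        return sum(w[i] * r for i, r in enumerate(responses))

fuse = WeightedFusion()
outs = [torch.randn(1, 10, 25, 25) for _ in range(3)]
print(fuse(outs).shape)    # torch.Size([1, 10, 25, 25])
```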
Because the RPN modules of SiamRPN and SiamRPN++ use a large number of predefined anchor boxes, prior information such as the scales and aspect ratios of the anchor boxes is not easy to obtain and can hardly characterize an arbitrary application scene accurately. Therefore, building on SiamFC, SiamFC++ does not rely on anchor boxes; instead, each spatial position of the confidence score map directly regresses the distances from that position to the four sides of the target bounding box. However, the SiamFC++ method uses only the deepest features of the feature space embedding network and directly discards the shallower features, so it performs poorly on small-target tracking. Although deep features carry richer semantic information, the deepest features inevitably lose some detail information, and the targets in aerial data are usually small (relative to the whole image), for which detail information is crucial.
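A hedged sketch of the anchor-free regression just described: each score-map location predicts its distances to the four sides of the box, which are decoded back to image coordinates; the stride and the location-to-pixel convention are assumptions of the sketch, not SiamFC++'s exact ones.

```python
# Sketch of anchor-free box decoding (assumed stride/offset conventions):
# each location (i, j) predicts distances (l, t, r, b) to the box sides.
import numpy as np

def decode_boxes(ltrb, stride=8):
    """ltrb: (H, W, 4) distances to left/top/right/bottom margins."""
    H, W, _ = ltrb.shape
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    cx, cy = xs * stride, ys * stride            # image coordinates of each location
    x1 = cx - ltrb[..., 0]
    y1 = cy - ltrb[..., 1]
    x2 = cx + ltrb[..., 2]
    y2 = cy + ltrb[..., 3]
    return np.stack([x1, y1, x2, y2], axis=-1)   # (H, W, 4) boxes

pred = np.abs(np.random.randn(17, 17, 4)) * 20
print(decode_boxes(pred)[8, 8])                  # box decoded at the map center
```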
Because targets in aerial video scenes are usually small, detail information such as appearance, shape, and texture is easily discarded, and the target to be tracked is easily affected by noise or by distractors with similar features, which causes the tracker to drift. In addition, based on the motion smoothness assumption, i.e., the displacement of the target between two adjacent frames is small, the twin-network-based general single-target tracking methods above determine the position of the target in the current frame by local search: the target search area of the current frame is obtained from the tracking bounding box of the previous frame plus a certain proportion of context information, rather than using the whole image as the search area. Because the aerial field of view is large while the resolution of the real target of interest is relatively small, the target search area is very small. Once the target to be tracked is affected by noise or a distractor (an object similar to the target to be tracked) and tracking drift (the algorithm's tracking bounding box no longer coincides exactly with the real bounding box of the target) or even tracking loss (the overlap between the tracking bounding box and the real bounding box drops to 0) occurs, the search area of subsequent frames may no longer contain the target to be tracked, and tracking fails completely.
Disclosure of Invention
In view of the above, the invention provides a ground target tracking method applied to aerial video, which is used for improving the tracking performance of a general single target tracking method in an aerial video scene.
The invention provides a ground target tracking method applied to aerial videos, which comprises the following steps:
S1: The feature space embedding network of SiamFC++ is changed from GoogLeNet to ResNet;
S2: The search area image x is input into the feature space embedding network ResNet; the depth features C2(x) output by the 2nd block of ResNet are input to the lowest layer of the first feature pyramid network, the depth features C3(x) output by the 3rd block of ResNet are input to the middle layer of the first feature pyramid network, and the depth features C4(x) output by the 4th block of ResNet are input to the highest layer of the first feature pyramid network; after the first feature pyramid network processes the depth features input to each layer, it outputs the depth features P2(x) of the search area image x at its lowest layer, the depth features P3(x) of the search area image x at its middle layer, and the depth features P4(x) of the search area image x at its highest layer;
S3: The template image z is input into a feature space embedding network ResNet that has the same structure and shares parameters with that of step S2; the depth features C2(z) output by the 2nd block of ResNet are input to the lowest layer of a second feature pyramid network that has the same structure as the first feature pyramid network of step S2 but does not share parameters with it, the depth features C3(z) output by the 3rd block of ResNet are input to the middle layer of the second feature pyramid network, and the depth features C4(z) output by the 4th block of ResNet are input to the highest layer of the second feature pyramid network; after the second feature pyramid network processes the depth features input to each layer, it outputs the depth features P2(z) of the template image z at its lowest layer, the depth features P3(z) of the template image z at its middle layer, and the depth features P4(z) of the template image z at its highest layer;
S4: The depth features P2(z) and P2(x) are combined and input into a first tracking head network, the depth features P3(z) and P3(x) are combined and input into a second tracking head network, and the depth features P4(z) and P4(x) are combined and input into a third tracking head network; the first tracking head network, the second tracking head network, and the third tracking head network have the same structure but do not share parameters, and the three tracking head networks have the same structure as the SiamFC++ tracking head network;
S5: Each tracking head network receives the corresponding depth features Pk(z) and Pk(x) as inputs and outputs a first classification confidence score map and a target bounding box regression response map, where k ∈ {2, 3, 4};
S6: The position of the maximum value of the classification confidence scores in the first classification confidence score maps output by the three tracking head networks is selected, and the vector of the target bounding box regression response map at that position is the bounding box prediction result of the target to be tracked.
In a possible implementation manner, in the above ground target tracking method applied to aerial video provided by the present invention, step S5, in which each tracking head network receives the corresponding depth features Pk(z) and Pk(x) as inputs and outputs a first classification confidence score map and a target bounding box regression response map, specifically includes:
Each tracking head network includes a classification branch for spatial position classification and a regression branch for target bounding box regression; the combined depth features Pk(z) and Pk(x) are input into the classification branch and the regression branch of the corresponding tracking head network, respectively;
For the classification branch, the depth features Pk(z) and Pk(x) are first processed separately by multi-layer convolution stacks that have the same structure but do not share parameters, and a cross-correlation operation is then performed; the result of the cross-correlation operation is passed to the classification sub-branch and the centerness sub-branch of the classification branch, respectively; the classification sub-branch processes the result of the cross-correlation operation with a 1×1 convolution and outputs a second classification confidence score map; the centerness sub-branch processes the result of the cross-correlation operation with a 1×1 convolution and outputs a centerness confidence probability map for each spatial position; in the test stage, the centerness confidence probability map is multiplied, as a weight, with the second classification confidence score map to generate the first classification confidence score map;
For the regression branch, the depth features Pk(z) and Pk(x) are likewise processed separately by multi-layer convolution stacks that have the same structure but do not share parameters, and a cross-correlation operation is performed; the regression branch processes the result of the cross-correlation operation with a 1×1 convolution and outputs the target bounding box regression response map;
Step S5 is formally expressed as:

(Si, Bi) = fi(z, x) = ζi(Pk(z), Pk(x)),  k = i + 1,  i ∈ {1, 2, 3}

where ψ denotes the feature space embedding network ResNet, Pk(z) denotes the depth features of the template image z processed by the first k blocks of ψ and the second feature pyramid network, Pk(x) denotes the depth features of the search area image x processed by the first k blocks of ψ and the first feature pyramid network, and k denotes the block index of the feature space embedding network ResNet; i denotes the index of the tracking head network, i ∈ {1, 2, 3}, and ζi denotes the i-th tracking head network; fi denotes the mapping from input to output of the i-th group, mathematically expressed as:

fi : (Pk(z), Pk(x)) → (Si, Bi),  Si ∈ R^(Hi×Wi),  Bi ∈ R^(Hi×Wi×4)

where Si and Bi denote the first classification confidence score map and the target bounding box regression response map output by the i-th tracking head network, respectively, and Hi and Wi denote the height and width of the output result of the i-th tracking head network, respectively.
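For illustration, a compact PyTorch sketch of one tracking head of the kind described in step S5 is given below; the channel count, the depth of the convolution stacks, the use of depthwise cross-correlation, and the sigmoid/exp activations are assumptions of the sketch rather than the exact parameters of the method.

```python
# Sketch of one tracking head (channel count, tower depth, depthwise
# cross-correlation and activations are assumptions of this sketch):
# separate conv stacks for template/search features, cross-correlation,
# then classification, centerness and regression sub-branches.
import torch
import torch.nn as nn
import torch.nn.functional as F

def xcorr_depthwise(x, z):
    """Depthwise cross-correlation of search features x with template features z."""
    b, c, H, W = x.shape
    _, _, h, w = z.shape
    out = F.conv2d(x.reshape(1, b * c, H, W), z.reshape(b * c, 1, h, w), groups=b * c)
    return out.reshape(b, c, out.shape[2], out.shape[3])

def conv_tower(channels=256, layers=3):
    mods = []
    for _ in range(layers):
        mods += [nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True)]
    return nn.Sequential(*mods)

class TrackingHead(nn.Module):
    def __init__(self, channels=256):
        super().__init__()
        # structurally identical conv stacks, parameters not shared
        self.cls_z, self.cls_x = conv_tower(channels), conv_tower(channels)
        self.reg_z, self.reg_x = conv_tower(channels), conv_tower(channels)
        self.cls = nn.Conv2d(channels, 1, 1)   # 1x1 conv -> second classification score map
        self.ctr = nn.Conv2d(channels, 1, 1)   # 1x1 conv -> centerness probability map
        self.reg = nn.Conv2d(channels, 4, 1)   # 1x1 conv -> (l, t, r, b) regression map

    def forward(self, feat_z, feat_x):
        corr_cls = xcorr_depthwise(self.cls_x(feat_x), self.cls_z(feat_z))
        corr_reg = xcorr_depthwise(self.reg_x(feat_x), self.reg_z(feat_z))
        cls_score = torch.sigmoid(self.cls(corr_cls))     # second score map
        centerness = torch.sigmoid(self.ctr(corr_cls))    # centerness map
        bbox_map = torch.exp(self.reg(corr_reg))          # regression response map
        # at test time the centerness map weights the classification scores
        return cls_score * centerness, bbox_map           # first score map, Bi

head = TrackingHead()
s, b = head(torch.randn(1, 256, 8, 8), torch.randn(1, 256, 32, 32))
print(s.shape, b.shape)   # torch.Size([1, 1, 25, 25]) torch.Size([1, 4, 25, 25])
```

In this sketch the product of the classification and centerness maps plays the role of the first classification confidence score map Si used in step S6.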
In a possible implementation manner, in the ground target tracking method applied to aerial video provided by the present invention, step S6, selecting a position where a maximum value of classification confidence scores in a first classification confidence score map output by three tracking head networks is located, where a vector of a target bounding box regression response map at the position is a bounding box prediction result of a target to be tracked, specifically includes:
The position of the maximum value of the classification confidence scores in the first classification confidence score maps output by the three tracking head networks is selected, formally expressed as:

p = (p1, p2, p3) = arg max_(i, h, w) Si(h, w)

where p denotes the position of the maximum value among all classification confidence scores, p1 being the index of the tracking head network and (p2, p3) the spatial position within its score map;

the vector of the target bounding box regression response map at that position is the bounding box prediction result of the target to be tracked, formally expressed as:

b = Bp1(p2, p3)

where b denotes the vector at row p2 and column p3 of the target bounding box regression response map output by the p1-th tracking head network, p1 ∈ {1, 2, 3}, and b ∈ R^4.
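A minimal sketch of this selection rule, assuming NumPy arrays for the score and regression maps (shapes are illustrative):

```python
# Sketch of step S6: take the global maximum over the three first classification
# confidence score maps, then read the regression vector b at that position.
import numpy as np

def select_box(score_maps, bbox_maps):
    """score_maps: list of (H_i, W_i) arrays; bbox_maps: list of (H_i, W_i, 4) arrays."""
    best = None
    for i, s in enumerate(score_maps):
        r, c = np.unravel_index(np.argmax(s), s.shape)
        if best is None or s[r, c] > best[0]:
            best = (s[r, c], i, r, c)
    _, p1, p2, p3 = best
    return bbox_maps[p1][p2, p3], (p1, p2, p3)   # b and p = (p1, p2, p3)

scores = [np.random.rand(n, n) for n in (17, 21, 25)]
boxes = [np.random.rand(n, n, 4) for n in (17, 21, 25)]
b, p = select_box(scores, boxes)
print(p, b.shape)   # e.g. (2, 10, 7) (4,)
```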
In a possible implementation manner, in the above ground target tracking method applied to aerial video provided by the present invention, in step S2, a search area size adaptive adjustment strategy specifically includes:
The search area image is a part of each frame image of the aerial video; during tracking, the initial size of the search area image is set to d0, the tracking quality θ of the current frame image of the aerial video is represented by the maximum value of the classification confidence scores in the first classification confidence score maps of the three tracking head networks, and a stability threshold τ1, a loss threshold τ2, and a number of tracking quality levels m are set, where m = 3; the size of the search area image in the next frame image is d0 when θ ≥ τ1, μ when θ < τ2, and a size enlarged step by step according to the tracking quality level when τ2 ≤ θ < τ1,

where μ denotes 3 times the maximum receptive field size of the feature space embedding network ResNet, and the function max returns the maximum value of the elements in a set.
The invention provides a ground target tracking method applied to aerial video, which belongs to the class of general single-target tracking methods. The feature space embedding network of SiamFC++ is changed from GoogLeNet to ResNet; the search area image x and the template image z are input into ResNet to extract their depth features; two feature pyramid networks are used to strengthen the depth features of the search area image x and the template image z; the depth features are input into the tracking head networks, which output a first classification confidence score map and a target bounding box regression response map; and the position of the maximum classification confidence score in the first classification confidence score maps is selected, the vector of the target bounding box regression response map at that position being the bounding box prediction result of the target to be tracked. During depth feature extraction, the feature pyramid networks adaptively fuse the shallow and deep features of the feature space embedding network, so that the feature representation carries both rich detail information (appearance, shape, texture, etc.) and strong semantic information; this strengthens the discriminative feature representation of small targets and helps avoid the tracking drift and even tracking loss caused by the small targets that result from the large aerial field of view. Moreover, a search area size adaptive adjustment strategy is proposed to enhance the tracker's resistance to the risk of tracking loss. Experimental results evaluated from multiple aspects show that the ground target tracking method applied to aerial video provided by the invention improves the tracking performance of general single-target tracking methods in aerial video scenes.
Drawings
Fig. 1 is a flowchart of a ground target tracking method applied to aerial video provided in embodiment 1 of the present invention;
fig. 2 is an overall frame diagram of a ground target tracking method applied to aerial videos provided in embodiment 1 of the present invention;
FIG. 3 is a structural framework diagram of a feature pyramid network used in embodiment 1 of the present invention;
fig. 4 is a flowchart of a search area size adaptive adjustment strategy in embodiment 1 of the present invention.
Detailed Description
The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is apparent that the described embodiments are merely examples and are not intended to limit the present invention.
The invention provides a ground target tracking method applied to aerial videos, which comprises the following steps:
S1: The feature space embedding network of SiamFC++ is changed from GoogLeNet to ResNet;
S2: The search area image x is input into the feature space embedding network ResNet; the depth features C2(x) output by the 2nd block of ResNet are input to the lowest layer of the first feature pyramid network (Feature Pyramid Network, FPN), the depth features C3(x) output by the 3rd block of ResNet are input to the middle layer of the first feature pyramid network, and the depth features C4(x) output by the 4th block of ResNet are input to the highest layer of the first feature pyramid network; after the first feature pyramid network processes the depth features input to each layer, it outputs the depth features P2(x) of the search area image x at its lowest layer, the depth features P3(x) of the search area image x at its middle layer, and the depth features P4(x) of the search area image x at its highest layer;
S3: The template image z is input into a feature space embedding network ResNet that has the same structure and shares parameters with that of step S2; the depth features C2(z) output by the 2nd block of ResNet are input to the lowest layer of a second feature pyramid network that has the same structure as the first feature pyramid network of step S2 but does not share parameters with it, the depth features C3(z) output by the 3rd block of ResNet are input to the middle layer of the second feature pyramid network, and the depth features C4(z) output by the 4th block of ResNet are input to the highest layer of the second feature pyramid network; after the second feature pyramid network processes the depth features input to each layer, it outputs the depth features P2(z) of the template image z at its lowest layer, the depth features P3(z) of the template image z at its middle layer, and the depth features P4(z) of the template image z at its highest layer;
S4: The depth features P2(z) and P2(x) are combined and input into a first tracking head network, the depth features P3(z) and P3(x) are combined and input into a second tracking head network, and the depth features P4(z) and P4(x) are combined and input into a third tracking head network; the first tracking head network, the second tracking head network, and the third tracking head network have the same structure but do not share parameters, and the three tracking head networks have the same structure as the SiamFC++ tracking head network;
S5: Each tracking head network receives the corresponding depth features Pk(z) and Pk(x) as inputs and outputs a first classification confidence score map and a target bounding box regression response map, where k ∈ {2, 3, 4};
S6: The position of the maximum value of the classification confidence scores in the first classification confidence score maps output by the three tracking head networks is selected, and the vector of the target bounding box regression response map at that position is the bounding box prediction result of the target to be tracked.
The following describes in detail the implementation of the ground target tracking method applied to aerial videos provided by the present invention through two specific embodiments.
Example 1: the flow chart is shown in fig. 1, and the overall block diagram is shown in fig. 2.
The first step: the SiamFC++ general single-target tracking method is selected as the baseline method, and the feature space embedding network of SiamFC++ is changed from GoogLeNet to ResNet.
And a second step of: the search area image x is input into the feature space embedding network ResNet, depth features of the search area image x are extracted using the feature space embedding network ResNet, and then processed using the first feature pyramid network.
Specifically, the depth features C2(x) output by the 2nd block of ResNet are input to the lowest layer of the first feature pyramid network, the depth features C3(x) output by the 3rd block are input to the middle layer, and the depth features C4(x) output by the 4th block are input to the highest layer. As shown in Fig. 3, in the first feature pyramid network the features input to all three layers are first passed through 1×1 convolutions that adjust the number of channels to 256, denoted C2'(x), C3'(x), and C4'(x), respectively. It is known that the deeper depth features of ResNet carry stronger semantic information than the shallower ones, while the shallower depth features carry stronger detail information than the deeper ones. Therefore, the channel-adjusted depth features C4'(x) input to the highest layer of the first feature pyramid network are enlarged to twice their resolution by an upsampling module and then fused with the channel-adjusted depth features C3'(x) input to the middle layer, yielding the depth features P3(x). Likewise, the fused depth features P3(x) are enlarged to twice their resolution by another upsampling module and then fused with the channel-adjusted depth features C2'(x) input to the lowest layer, yielding the depth features P2(x). Finally, the first feature pyramid network outputs the depth features P2(x) at its lowest layer, the depth features P3(x) at its middle layer, and, at its highest layer, the channel-adjusted depth features that were input to the highest layer, denoted P4(x) = C4'(x).
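For illustration, the fusion just described can be sketched in PyTorch as follows; the choice of nearest-neighbour upsampling, element-wise addition as the fusion operation, and ResNet-50-style input channel sizes (512/1024/2048) are assumptions of the sketch, while the 256-channel 1×1 lateral convolutions follow the text.

```python
# Sketch of the feature pyramid fusion described above (assumptions:
# nearest-neighbour upsampling, element-wise addition as the fusion step,
# ResNet-50-style input channels).
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeaturePyramid(nn.Module):
    def __init__(self, in_channels=(512, 1024, 2048), out_channels=256):
        super().__init__()
        # 1x1 convolutions adjust block2/block3/block4 outputs to 256 channels
        self.lateral = nn.ModuleList([nn.Conv2d(c, out_channels, 1) for c in in_channels])

    def forward(self, c2, c3, c4):
        p4 = self.lateral[2](c4)                                      # highest layer: C4'(x)
        p3 = self.lateral[1](c3) + F.interpolate(p4, scale_factor=2)  # middle layer: P3(x)
        p2 = self.lateral[0](c2) + F.interpolate(p3, scale_factor=2)  # lowest layer: P2(x)
        return p2, p3, p4

fpn = FeaturePyramid()
c2 = torch.randn(1, 512, 32, 32)    # ResNet block2 features of x
c3 = torch.randn(1, 1024, 16, 16)   # ResNet block3 features of x
c4 = torch.randn(1, 2048, 8, 8)     # ResNet block4 features of x
p2, p3, p4 = fpn(c2, c3, c4)
print(p2.shape, p3.shape, p4.shape)
```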
And a third step of: the template image z is input to the feature space embedding network ResNet which has the same structure and shared parameters as in step S2, the depth features of the template image z are extracted using the feature space embedding network ResNet, and then the depth features of the template image z are processed using the second feature pyramid network.
Specifically, the depth features C2(z) output by the 2nd block of ResNet are input to the lowest layer of a second feature pyramid network that has the same structure as that of step S2 but does not share parameters with it, the depth features C3(z) output by the 3rd block are input to the middle layer, and the depth features C4(z) output by the 4th block are input to the highest layer. As shown in Fig. 3, in the second feature pyramid network the features input to all three layers are first passed through 1×1 convolutions that adjust the number of channels to 256, denoted C2'(z), C3'(z), and C4'(z), respectively. It is known that the deeper depth features of ResNet carry stronger semantic information than the shallower ones, while the shallower depth features carry stronger detail information than the deeper ones. Therefore, the channel-adjusted depth features C4'(z) input to the highest layer of the second feature pyramid network are enlarged to twice their resolution by an upsampling module and then fused with the channel-adjusted depth features C3'(z) input to the middle layer, yielding the depth features P3(z). Likewise, the fused depth features P3(z) are enlarged to twice their resolution by another upsampling module and then fused with the channel-adjusted depth features C2'(z) input to the lowest layer, yielding the depth features P2(z). Finally, the second feature pyramid network outputs the depth features P2(z) of the template image z at its lowest layer, the depth features P3(z) at its middle layer, and, at its highest layer, the channel-adjusted depth features that were input to the highest layer, denoted P4(z) = C4'(z).
Fourth step: the depth features P2(z) and P2(x) are combined and input into the first tracking head (Tracking Head) network, the depth features P3(z) and P3(x) are combined and input into the second tracking head network, and the depth features P4(z) and P4(x) are combined and input into the third tracking head network; the first, second, and third tracking head networks have the same structure but do not share parameters, and the three tracking head networks have the same structure as the SiamFC++ tracking head network, as shown in brackets in Fig. 2.
Fifth step: each tracking head network receives the corresponding depth features Pk(z) and Pk(x) as inputs and outputs a first classification confidence score map and a target bounding box regression response map, where k ∈ {2, 3, 4}.
Specifically, each tracking head network includes a classification branch for spatial position classification and a regression branch for target bounding box regression; the combined depth features Pk(z) and Pk(x) are input into the classification branch and the regression branch of the corresponding tracking head network, respectively. For example, the combined depth features P2(z) and P2(x) are input into the classification branch and the regression branch of the first tracking head network, the combined depth features P3(z) and P3(x) into those of the second tracking head network, and the combined depth features P4(z) and P4(x) into those of the third tracking head network.
For the classification branch, the depth features Pk(z) and Pk(x) are first processed separately by multi-layer convolution stacks that have the same structure but do not share parameters, and a cross-correlation operation is then performed; the result of the cross-correlation operation is passed to the classification sub-branch and the centerness sub-branch of the classification branch, respectively. The classification sub-branch processes the result of the cross-correlation operation with a 1×1 convolution and outputs a second classification confidence score map; the centerness sub-branch processes the result of the cross-correlation operation with a 1×1 convolution and outputs a centerness confidence probability map for each spatial position. In the test stage, the centerness confidence probability map is multiplied, as a weight, with the second classification confidence score map to generate the first classification confidence score map.
For the regression branch, the depth features Pk(z) and Pk(x) are likewise processed separately by multi-layer convolution stacks that have the same structure but do not share parameters, and a cross-correlation operation is performed; the regression branch processes the result of the cross-correlation operation with a 1×1 convolution and outputs the target bounding box regression response map.
The above procedure is formally expressed as:

(Si, Bi) = fi(z, x) = ζi(Pk(z), Pk(x)),  k = i + 1,  i ∈ {1, 2, 3}

where ψ denotes the feature space embedding network ResNet, Pk(z) denotes the depth features of the template image z processed by the first k blocks of ψ and the second feature pyramid network, Pk(x) denotes the depth features of the search area image x processed by the first k blocks of ψ and the first feature pyramid network, and k denotes the block index of the feature space embedding network ResNet; i denotes the index of the tracking head network, i ∈ {1, 2, 3}, and ζi denotes the i-th tracking head network; fi denotes the mapping from input to output of the i-th group, mathematically expressed as:

fi : (Pk(z), Pk(x)) → (Si, Bi),  Si ∈ R^(Hi×Wi),  Bi ∈ R^(Hi×Wi×4)

where Si and Bi denote the first classification confidence score map and the target bounding box regression response map output by the i-th tracking head network, respectively, and Hi and Wi denote the height and width of the output result of the i-th tracking head network, respectively.
Sixth step: the position of the maximum value of the classification confidence scores in the first classification confidence score maps output by the three tracking head networks is selected, and the vector of the target bounding box regression response map at that position is the bounding box prediction result of the target to be tracked.
Specifically, the position of the maximum value of the classification confidence scores in the first classification confidence score maps output by the three tracking head networks is selected, formally expressed as:

p = (p1, p2, p3) = arg max_(i, h, w) Si(h, w)

where p denotes the position of the maximum value among all classification confidence scores, p1 being the index of the tracking head network and (p2, p3) the spatial position within its score map.

The vector of the target bounding box regression response map at that position is the bounding box prediction result of the target to be tracked, formally expressed as:

b = Bp1(p2, p3)

where b denotes the vector at row p2 and column p3 of the target bounding box regression response map output by the p1-th tracking head network, p1 ∈ {1, 2, 3}, and b ∈ R^4.
In summary, the implementation described in Embodiment 1 of the invention constitutes a general single-target tracking model framework with strong discrimination capability and small-target perception.
Example 2: Embodiment 1 plus the search area size adaptive adjustment strategy.
It should be noted that the search area image is a part of each frame image of the aerial video. Based on the ground target tracking method applied to aerial video provided in Embodiment 1 of the present invention, Embodiment 2 of the present invention further proposes a search area size adaptive adjustment strategy, as shown in Fig. 4. Specifically, the following manner may be adopted: during tracking, the initial size of the search area image is set to d0, the tracking quality θ of the current frame image of the aerial video is represented by the maximum value of the classification confidence scores in the first classification confidence score maps of the three tracking head networks, and a stability threshold τ1, a loss threshold τ2, and a number of tracking quality levels m are set. When θ ≥ τ1, the tracking result of the current frame image is "stable and reliable", so the search area size does not need to be enlarged for the next frame image; when θ < τ2, the tracking result of the current frame image is "lost", i.e., the predicted target bounding box has completely deviated from the real target bounding box, and a search area as large as possible is needed to ensure that the target to be tracked can be retrieved in the next frame image; when τ2 ≤ θ < τ1, the tracking result of the current frame image is "not completely reliable", i.e., the predicted target bounding box has drifted by a certain amount, and to guard against the risk of tracking loss the search area size of the next frame image is enlarged step by step according to the adaptive growth strategy over the m tracking quality levels. To sum up, the size of the search area image in the next frame image is d0 when θ ≥ τ1, μ when θ < τ2, and the size given by the adaptive growth strategy when τ2 ≤ θ < τ1,

where μ denotes 3 times the maximum receptive field size of the feature space embedding network ResNet, and the function max used in the adaptive growth strategy returns the maximum value of the elements in a set.
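Since the exact formula of the adaptive growth strategy is given above only in outline, the following Python sketch implements one plausible instantiation of the three-case rule; the linear step-wise growth between d0 and μ is an assumption made for illustration, not the patented formula, while d0 = 447, τ1 = 0.8, τ2 = 0.4, and m = 3 follow the values used in the experiments below.

```python
# One plausible instantiation of the search-area size adaptation (only the
# three-case behaviour is taken from the text; the linear step-wise growth
# rule between tau2 and tau1 is an assumption of this sketch).
import math

def next_search_size(theta, mu, d0=447, tau1=0.8, tau2=0.4, m=3):
    """theta: tracking quality (max classification confidence of the current frame).
    mu: 3x the maximum receptive field of the feature space embedding network."""
    if theta >= tau1:          # "stable and reliable": keep the base size
        return d0
    if theta < tau2:           # "lost": search as large an area as possible
        return mu
    # "not completely reliable": grow over m quality levels between d0 and mu
    level = math.ceil(m * (tau1 - theta) / (tau1 - tau2))   # 1 .. m
    return max(d0, round(d0 + level * (mu - d0) / m))

for q in (0.9, 0.7, 0.55, 0.45, 0.2):
    print(q, next_search_size(q, mu=900))   # mu value here is illustrative only
```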
To better verify the effectiveness of Embodiments 1 and 2 of the present invention, the two embodiments are evaluated below with actual tests, where the test process corresponds to the tracking process in a practical application and the test tracking videos correspond to the online real-time input frame sequences or offline cached videos of a practical application.
In the field of target tracking, tracking quality is generally evaluated with the UAV123 dataset. UAV123 is a large-scale aerial-video single-target tracking dataset containing 123 high-definition unmanned aerial vehicle videos; each video has an average length of 915 frames, and the total number of frames in the dataset exceeds 100,000. The dataset covers common categories such as pedestrians, cars, large trucks, bicycles, ships, and buildings, and covers most scenes encountered in practical applications, so evaluation indices based on this dataset have strong generalization and universality. Specifically, the main evaluation index of this dataset for target tracking methods is the area under the curve of the success plot (Success Plot Area Under Curve, Success AUC). The Success AUC takes values in [0, 1]; the larger the value, the more robust the evaluated target tracking method and the higher its practical application value.
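As a rough illustration of how the Success AUC metric works (not taken from the UAV123 evaluation toolkit), the sketch below computes the success rate, i.e., the fraction of frames whose IoU with the ground truth exceeds a threshold, sweeps the threshold from 0 to 1, and averages; the threshold sampling and the synthetic IoU values are assumptions.

```python
# Sketch of the Success AUC computation (threshold sampling and the synthetic
# IoU values are assumptions): success rate = fraction of frames whose IoU
# exceeds a threshold; AUC = mean success rate over thresholds in [0, 1].
import numpy as np

def success_auc(ious, num_thresholds=21):
    """ious: per-frame IoU between tracked and ground-truth boxes, values in [0, 1]."""
    thresholds = np.linspace(0.0, 1.0, num_thresholds)
    success = [(ious > t).mean() for t in thresholds]
    return float(np.mean(success))

ious = np.clip(np.random.normal(0.65, 0.2, size=915), 0.0, 1.0)  # one 915-frame video
print(round(success_auc(ious), 3))
```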
The effectiveness of the two technical points of the present invention (technical point 1: the general single-target tracking model framework with strong discrimination capability and small-target perception, i.e., Embodiment 1; technical point 2: the search area size adaptive adjustment strategy, combined with technical point 1 in Embodiment 2) is verified on the UAV123 dataset by comparing the Success AUC of the existing SiamFC++ method, Embodiment 1 of the present invention, the search area size adaptive adjustment strategy alone, and Embodiment 2 of the present invention. As shown in Table 1, the Success AUC of the SiamFC++ method is 0.631. Using only technical point 1 of the present invention, i.e., the general single-target tracking model framework with strong discrimination capability and small-target perception (Embodiment 1), the Success AUC reaches 0.660, exceeding the SiamFC++ method by 2.9 points. Using only technical point 2, the search area size adaptive adjustment strategy, with the initial search area size set to d0 = 447, τ1 = 0.8, τ2 = 0.4, and m = 3, the Success AUC reaches 0.646, exceeding the SiamFC++ method by 1.5 points. Combining the two technical points (i.e., Embodiment 2 of the present invention), the Success AUC reaches 0.672, exceeding the SiamFC++ method by 4.1 points. This fully verifies the effectiveness of both technical points of the present invention.
TABLE 1

Method                                                          Success AUC
SiamFC++ (baseline)                                             0.631
Technical point 1 only (Embodiment 1)                           0.660
Technical point 2 only (search area size adaptive adjustment)   0.646
Technical points 1 + 2 (Embodiment 2)                           0.672
The ground target tracking method applied to aerial video provided in Embodiment 2 of the present invention is compared below with the existing SiamRPN, SiamRPN++, and SiamFC++ methods in terms of Success AUC on the UAV123 dataset. As shown in Table 2, the Success AUC of the method of Embodiment 2 is 0.672, exceeding the existing SiamRPN, SiamRPN++, and SiamFC++ methods, which indicates that the ground target tracking method applied to aerial video provided in Embodiment 2 of the present invention has stronger effectiveness and generality.
TABLE 2: Success AUC comparison of SiamRPN, SiamRPN++, SiamFC++, and the method of Embodiment 2 on the UAV123 dataset
The invention provides a ground target tracking method applied to aerial video, which belongs to the class of general single-target tracking methods. The feature space embedding network of SiamFC++ is changed from GoogLeNet to ResNet; the search area image x and the template image z are input into ResNet to extract their depth features; two feature pyramid networks are used to strengthen the depth features of the search area image x and the template image z; the depth features are input into the tracking head networks, which output a first classification confidence score map and a target bounding box regression response map; and the position of the maximum classification confidence score in the first classification confidence score maps is selected, the vector of the target bounding box regression response map at that position being the bounding box prediction result of the target to be tracked. During depth feature extraction, the feature pyramid networks adaptively fuse the shallow and deep features of the feature space embedding network, so that the feature representation carries both rich detail information (appearance, shape, texture, etc.) and strong semantic information; this strengthens the discriminative feature representation of small targets and helps avoid the tracking drift and even tracking loss caused by the small targets that result from the large aerial field of view. Moreover, a search area size adaptive adjustment strategy is proposed to enhance the tracker's resistance to the risk of tracking loss. Experimental results evaluated from multiple aspects show that the ground target tracking method applied to aerial video provided by the invention improves the tracking performance of general single-target tracking methods in aerial video scenes.
It will be apparent to those skilled in the art that various modifications and variations can be made to the present invention without departing from the spirit or scope of the invention. Thus, it is intended that the present invention also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.

Claims (3)

1. A ground target tracking method applied to aerial video, characterized by comprising the following steps:
S1: The feature space embedding network of SiamFC++ is changed from GoogLeNet to ResNet;
S2: The search area image x is input into the feature space embedding network ResNet; the depth features C2(x) output by the 2nd block of ResNet are input to the lowest layer of the first feature pyramid network, the depth features C3(x) output by the 3rd block of ResNet are input to the middle layer of the first feature pyramid network, and the depth features C4(x) output by the 4th block of ResNet are input to the highest layer of the first feature pyramid network; after the first feature pyramid network processes the depth features input to each layer, it outputs the depth features P2(x) of the search area image x at its lowest layer, the depth features P3(x) of the search area image x at its middle layer, and the depth features P4(x) of the search area image x at its highest layer;
S3: The template image z is input into a feature space embedding network ResNet that has the same structure and shares parameters with that of step S2; the depth features C2(z) output by the 2nd block of ResNet are input to the lowest layer of a second feature pyramid network that has the same structure as the first feature pyramid network of step S2 but does not share parameters with it, the depth features C3(z) output by the 3rd block of ResNet are input to the middle layer of the second feature pyramid network, and the depth features C4(z) output by the 4th block of ResNet are input to the highest layer of the second feature pyramid network; after the second feature pyramid network processes the depth features input to each layer, it outputs the depth features P2(z) of the template image z at its lowest layer, the depth features P3(z) of the template image z at its middle layer, and the depth features P4(z) of the template image z at its highest layer;
S4: The depth features P2(z) and P2(x) are combined and input into a first tracking head network, the depth features P3(z) and P3(x) are combined and input into a second tracking head network, and the depth features P4(z) and P4(x) are combined and input into a third tracking head network; the first tracking head network, the second tracking head network, and the third tracking head network have the same structure but do not share parameters, and the three tracking head networks have the same structure as the SiamFC++ tracking head network;
S5: Each tracking head network receives the corresponding depth features Pk(z) and Pk(x) as inputs and outputs a first classification confidence score map and a target bounding box regression response map, where k ∈ {2, 3, 4};
S6: The position of the maximum value of the classification confidence scores in the first classification confidence score maps output by the three tracking head networks is selected, and the vector of the target bounding box regression response map at that position is the bounding box prediction result of the target to be tracked;
Step S5, in which each tracking head network receives the corresponding depth features Pk(z) and Pk(x) as inputs and outputs a first classification confidence score map and a target bounding box regression response map, specifically includes:
Each tracking head network includes a classification branch for spatial position classification and a regression branch for target bounding box regression; the combined depth features Pk(z) and Pk(x) are input into the classification branch and the regression branch of the corresponding tracking head network, respectively;
For the classification branch, the depth features Pk(z) and Pk(x) are first processed separately by multi-layer convolution stacks that have the same structure but do not share parameters, and a cross-correlation operation is then performed; the result of the cross-correlation operation is passed to the classification sub-branch and the centerness sub-branch of the classification branch, respectively; the classification sub-branch processes the result of the cross-correlation operation with a 1×1 convolution and outputs a second classification confidence score map; the centerness sub-branch processes the result of the cross-correlation operation with a 1×1 convolution and outputs a centerness confidence probability map for each spatial position; in the test stage, the centerness confidence probability map is multiplied, as a weight, with the second classification confidence score map to generate the first classification confidence score map;
For the regression branch, the depth features Pk(z) and Pk(x) are likewise processed separately by multi-layer convolution stacks that have the same structure but do not share parameters, and a cross-correlation operation is performed; the regression branch processes the result of the cross-correlation operation with a 1×1 convolution and outputs the target bounding box regression response map;
Step S5 is formally expressed as:

(Si, Bi) = fi(z, x) = ζi(Pk(z), Pk(x)),  k = i + 1,  i ∈ {1, 2, 3}

where ψ denotes the feature space embedding network ResNet, Pk(z) denotes the depth features of the template image z processed by the first k blocks of ψ and the second feature pyramid network, Pk(x) denotes the depth features of the search area image x processed by the first k blocks of ψ and the first feature pyramid network, and k denotes the block index of the feature space embedding network ResNet; i denotes the index of the tracking head network, i ∈ {1, 2, 3}, and ζi denotes the i-th tracking head network; fi denotes the mapping from input to output of the i-th group, mathematically expressed as:

fi : (Pk(z), Pk(x)) → (Si, Bi),  Si ∈ R^(Hi×Wi),  Bi ∈ R^(Hi×Wi×4)

where Si and Bi denote the first classification confidence score map and the target bounding box regression response map output by the i-th tracking head network, respectively, and Hi and Wi denote the height and width of the output result of the i-th tracking head network, respectively.
2. The ground target tracking method applied to aerial video according to claim 1, wherein step S6 is a step of selecting a position of a maximum value of classification confidence scores in a first classification confidence score map output by three tracking head networks, and a vector of a target bounding box regression response map at the position is a bounding box prediction result of a target to be tracked, and specifically includes:
The position of the maximum value of the classification confidence scores in the first classification confidence score maps output by the three tracking head networks is selected, formally expressed as:

p = (p1, p2, p3) = arg max_(i, h, w) Si(h, w)

where p denotes the position of the maximum value among all classification confidence scores, p1 being the index of the tracking head network and (p2, p3) the spatial position within its score map;

the vector of the target bounding box regression response map at that position is the bounding box prediction result of the target to be tracked, formally expressed as:

b = Bp1(p2, p3)

where b denotes the vector at row p2 and column p3 of the target bounding box regression response map output by the p1-th tracking head network, p1 ∈ {1, 2, 3}, and b ∈ R^4.
3. The ground target tracking method applied to aerial video according to claim 1 or 2, wherein in step S2, the search area size adaptive adjustment strategy specifically comprises:
The search area image is a part of each frame image of the aerial video; during tracking, the initial size of the search area image is set to d0, the tracking quality θ of the current frame image of the aerial video is represented by the maximum value of the classification confidence scores in the first classification confidence score maps of the three tracking head networks, and a stability threshold τ1, a loss threshold τ2, and a number of tracking quality levels m are set; the size of the search area image in the next frame image is d0 when θ ≥ τ1, μ when θ < τ2, and a size enlarged step by step according to the tracking quality level when τ2 ≤ θ < τ1,

where μ denotes 3 times the maximum receptive field size of the feature space embedding network ResNet, and the function max returns the maximum value of the elements in a set.
CN202111156857.9A 2021-09-30 2021-09-30 Ground target tracking method applied to aerial video Active CN114219826B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111156857.9A CN114219826B (en) 2021-09-30 2021-09-30 Ground target tracking method applied to aerial video

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111156857.9A CN114219826B (en) 2021-09-30 2021-09-30 Ground target tracking method applied to aerial video

Publications (2)

Publication Number Publication Date
CN114219826A CN114219826A (en) 2022-03-22
CN114219826B true CN114219826B (en) 2024-06-07

Family

ID=80696037

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111156857.9A Active CN114219826B (en) 2021-09-30 2021-09-30 Ground target tracking method applied to aerial video

Country Status (1)

Country Link
CN (1) CN114219826B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114757973A (en) * 2022-04-25 2022-07-15 集美大学 Sea surface target motion tracking method, terminal equipment and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111179314A (en) * 2019-12-30 2020-05-19 北京工业大学 Target tracking method based on residual dense twin network
CN111179307A (en) * 2019-12-16 2020-05-19 浙江工业大学 Visual target tracking method for full-volume integral and regression twin network structure
CN111508000A (en) * 2020-04-14 2020-08-07 北京交通大学 Deep reinforcement learning target tracking method based on parameter space noise network

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109272530B (en) * 2018-08-08 2020-07-21 北京航空航天大学 Target tracking method and device for space-based monitoring scene

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111179307A (en) * 2019-12-16 2020-05-19 浙江工业大学 Visual target tracking method for full-volume integral and regression twin network structure
CN111179314A (en) * 2019-12-30 2020-05-19 北京工业大学 Target tracking method based on residual dense twin network
CN111508000A (en) * 2020-04-14 2020-08-07 北京交通大学 Deep reinforcement learning target tracking method based on parameter space noise network

Also Published As

Publication number Publication date
CN114219826A (en) 2022-03-22

Similar Documents

Publication Publication Date Title
Zhang et al. Deep-IRTarget: An automatic target detector in infrared imagery using dual-domain feature extraction and allocation
CN111797716B (en) Single target tracking method based on Siamese network
CN107833239B (en) Optimization matching target tracking method based on weighting model constraint
CN109377555B (en) Method for extracting and identifying three-dimensional reconstruction target features of foreground visual field of autonomous underwater robot
CN112651262B (en) Cross-modal pedestrian re-identification method based on self-adaptive pedestrian alignment
CN110766723B (en) Unmanned aerial vehicle target tracking method and system based on color histogram similarity
CN101950426A (en) Vehicle relay tracking method in multi-camera scene
CN110544269A (en) twin network infrared target tracking method based on characteristic pyramid
CN113592911B (en) Apparent enhanced depth target tracking method
Khalid et al. Bhattacharyya Coefficient in Correlation of Gray-Scale Objects.
CN113902991A (en) Twin network target tracking method based on cascade characteristic fusion
CN107609571A (en) A kind of adaptive target tracking method based on LARK features
CN111091582A (en) Single-vision target tracking algorithm and system based on deep neural network
CN114219826B (en) Ground target tracking method applied to aerial video
CN113763417B (en) Target tracking method based on twin network and residual error structure
CN112418203B (en) Robustness RGB-T tracking method based on bilinear convergence four-stream network
CN116777956A (en) Moving target screening method based on multi-scale track management
CN116665097A (en) Self-adaptive target tracking method combining context awareness
CN116777953A (en) Remote sensing image target tracking method based on multi-scale feature aggregation enhancement
CN116051601A (en) Depth space-time associated video target tracking method and system
CN109815790B (en) Gate controlled axis aggregation detection network system and method for remote sensing target detection
CN113793361A (en) Improved KCF target tracking method combined with lightweight SSD
CN116486203B (en) Single-target tracking method based on twin network and online template updating
CN115049705B (en) Target tracking method and device for multi-template network framework
Meng et al. An object tracking algorithm based on SRDCF and deformable diversity similarity

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right
TA01 Transfer of patent application right

Effective date of registration: 20221101

Address after: 100191 No. 37, Haidian District, Beijing, Xueyuan Road

Applicant after: BEIHANG University

Applicant after: TECHNOLOGY AND ENGINEERING CENTER FOR SPACE UTILIZATION, CHINESE ACADEMY OF SCIENCES

Applicant after: BEIJING INSTITUTE OF SPACE MECHANICS & ELECTRICITY

Address before: 100191 No. 37, Haidian District, Beijing, Xueyuan Road

Applicant before: BEIHANG University

Applicant before: NATIONAL DISASTER REDUCTION CENTER OF THE EMERGENCY MANAGEMENT DEPARTMENT

Applicant before: BEIJING INSTITUTE OF SPACE MECHANICS & ELECTRICITY

DD01 Delivery of document by public notice
DD01 Delivery of document by public notice

Addressee: Patent of Beijing University of Aeronautics and Astronautics Receiver: The person in charge

Document name: Notice of Conformity

GR01 Patent grant