CN115239763A - Planar target tracking method based on central point detection and graph matching - Google Patents

Planar target tracking method based on central point detection and graph matching

Info

Publication number
CN115239763A
Authority
CN
China
Prior art keywords
target
image
template
matching
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210853244.9A
Other languages
Chinese (zh)
Inventor
王涛
李坤鹏
刘贺
李浥东
郎丛妍
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Jiaotong University
Original Assignee
Beijing Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Jiaotong University filed Critical Beijing Jiaotong University
Priority to CN202210853244.9A priority Critical patent/CN115239763A/en
Publication of CN115239763A publication Critical patent/CN115239763A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/20 Analysis of motion
    • G06T7/246 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/70 Determining position or orientation of objects or cameras
    • G06T7/73 Determining position or orientation of objects or cameras using feature-based methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74 Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/761 Proximity, similarity or dissimilarity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/10 Image acquisition modality
    • G06T2207/10016 Video; Image sequence
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20081 Training; Learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a planar target tracking method based on central point detection and graph matching, which comprises the following steps: predicting the center point of the tracked target in the current frame with a central positioning network, and determining an initial target region according to the predicted center point; modeling the template image and the target region as a complete graph consisting of two subgraphs, one subgraph for each region, and predicting the matching matrix between the template image and the target region with a deep graph matching network; and estimating the geometric transformation of the target from the template image to the current image from the matching pairs identified by the matching matrix with the RANSAC algorithm, to obtain the predicted position of the tracked target. Overall, the method outperforms prior methods under scaling, rotation, perspective transformation, motion blur, partial occlusion and unconstrained scenes, with particularly large gains under partial occlusion, motion blur and unconstrained scenes.

Description

Planar target tracking method based on central point detection and graph matching
Technical Field
The invention relates to the technical field of computer vision and artificial intelligence, and in particular to a planar target tracking method based on central point detection and graph matching.
Background
With the continuous development of internet technology and computer vision, the perception and analysis of image information provide great convenience for the life of people. Today, channels for image information acquisition are diverse and wide, such as mobile phones, surveillance cameras, and so on. The influx of a great deal of visual information also urgently requires an efficient method or technology for processing the information. Among these techniques, planar target tracking plays an important role in tracking a specific target. Planar target tracking is widely used in many vision-based robot applications and related fields, such as visual SLAM (Simultaneous Localization and Mapping) and augmented reality, which provides powerful support for tracking or modeling of three-dimensional objects through geometric transformations derived from two-dimensional image analysis.
For a video sequence, the purpose of planar target tracking is, given the planar object to be tracked in the initial frame, to estimate where the target appears in subsequent frames. This problem is usually cast as estimating a 2D geometric transformation, such as an affine transformation or a perspective transformation (also called a homography). Although some excellent work has been proposed in this field, existing methods still do not perform well when the target is occluded or moves rapidly.
At present, the planar target tracking methods in the prior art can be divided into two categories, namely template-based methods and key point-based methods. The essence of the template-based approach is to solve an optimization problem that minimizes the differences between the template and the search area to track objects, such as the ESM (Efficient Second-Order Minimization) algorithm. The keypoint-based method generally models a template and a search region as two groups of keypoints, then establishes a correspondence between them, and finally estimates a 2D transformation using a geometric verification method, such as the Gracker algorithm. Compared with the template-based method, the keypoint-based method has natural advantages for partial occlusion.
The development of deep learning in recent years has injected new vitality into this field, and it has gradually become a research hotspot in planar target tracking. These deep methods are usually designed as template-based methods, which regress the position or homography of the target by fusing deep global features of the template and the search image, such as the HomographyNet and HDN algorithms. In addition, some methods are dedicated to constructing descriptors that are robust to geometric transformations to improve tracking accuracy, such as the GIFT and LISRD descriptors. However, few deep learning methods unify the acquisition and matching of keypoints into one complete framework.
At present, the prior art contains no keypoint-based deep planar target tracking method that is particularly effective for motion blur and unconstrained scenes.
Disclosure of Invention
The embodiment of the invention provides a planar target tracking method based on central point detection and graph matching, so as to effectively track a planar target in images.
In order to achieve the purpose, the invention adopts the following technical scheme.
A planar target tracking method based on center point detection and graph matching comprises the following steps:
predicting the central point of a tracking target in the current frame by using a central positioning network, and determining an initial target area according to the predicted central point;
modeling the template image and the target region as a complete graph consisting of two subgraphs, the two subgraphs corresponding to the two regions respectively, and predicting the matching matrix between the template image and the target region by using a deep graph matching network;
and estimating the geometric transformation of the target from the template image to the current image from the matching pair identified by the matching matrix by using a RANSAC algorithm to obtain the predicted position of the tracking target in the current frame.
Preferably, before predicting the central point of the tracking target in the current frame by using the central positioning network, the central positioning network is trained, and the training process includes:
step one: acquiring the training split of a public tracking image dataset; during data preprocessing, taking an area 2^2 times the size of the target in the template image as the template region and an area 5^2 times the size of the target in the search image as the search region, and scaling them to 128 × 128 and 320 × 320 respectively, i.e. the input images have the formats [C1, H1, W1] = [3, 128, 128] and [C2, H2, W2] = [3, 320, 320], where C denotes the number of channels, H denotes the height of the image and W denotes the width of the image;
step two: adopting a ResNet50 with the last stage (layer4), the last pooling layer and the FC layer removed as the backbone network to extract image features, the extracted feature dimension being 1024; adopting a 1 × 1 convolution kernel to reduce the dimension of the extracted features, the feature dimension after reduction being 256; and using a non-learnable sine-cosine code to positionally encode the element at each position in the feature map;
step three: flattening the feature vectors of the two parts and splicing them together along the spatial dimension to obtain the feature vector f ∈ R^((h1·w1 + h2·w2) × d), where h_i × w_i are the spatial sizes of the two feature maps, and sending it into an Encoder module, where d = 256; the Encoder enhances the original features and captures the correspondence between them through self-attention and cross-attention, thereby obtaining the ability to distinguish the spatial position of the target; the feature vector of the search region encoded by the Encoder, f_x ∈ R^((h2·w2) × d), is attended again together with a query vector q ∈ R^(1×d), and the position information of the target in the search region is decoded, as shown in formula (1):

f'_x = Dec(f_x, q)    (1)

the decoded information is f'_x ∈ R^((h2·w2) × d); the feature vector f'_x is reshaped by a dimension change to f ∈ R^(d × h2 × w2) and sent to a stacked fully convolutional network, which reduces the channel dimension of f to 1 and yields a probability map P ∈ R^(h2 × w2) of the predicted center-point position; the expected value of the probability map distribution is calculated in the grid coordinate space to obtain the predicted target center point, as shown in formula (2):

(ĉ_x, ĉ_y) = Σ_(x,y) (x, y) · P(x, y)    (2)
step four: training with the L1 loss as the loss function, the specific formula being shown in (3), where ĉ_i = (ĉ_x, ĉ_y) and c_i = (c_x, c_y) denote the predicted target center point and the ground-truth target center-point label respectively; AdamW is adopted as the optimizer, and the parameters of the network model are optimized according to the loss value:

L_center = (1/N) Σ_i ||ĉ_i − c_i||_1    (3)
and obtaining the trained central positioning network.
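As an illustrative sketch only, the center-point prediction head described above (a probability map followed by an expectation over grid coordinates) may be written as follows; PyTorch is assumed, and the layer sizes and the CenterHead name are illustrative assumptions rather than the reference implementation of the invention.

import torch
import torch.nn as nn

class CenterHead(nn.Module):
    """Stacked fully convolutional head that reduces the channel dimension to 1
    and returns the expected center point of the resulting probability map."""
    def __init__(self, dim=256):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(dim, dim // 2, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(dim // 2, dim // 4, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(dim // 4, 1, 3, padding=1),
        )

    def forward(self, f):                     # f: [B, 256, h, w]
        logits = self.convs(f).flatten(1)     # [B, h*w]
        prob = torch.softmax(logits, dim=1)   # probability map of the center point
        b, _, h, w = f.shape
        ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
        grid = torch.stack([xs, ys], dim=-1).reshape(-1, 2).float().to(f.device)
        center = prob @ grid                  # expectation over grid coordinates, [B, 2]
        return prob.view(b, h, w), center

During training, the returned center can be supervised with the L1 loss of formula (3) against the ground-truth center-point label.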
Preferably, the predicting a central point of a tracked target in a current frame by using a central positioning network, and determining an initial target region according to the predicted central point includes:
inputting continuous video frames into the central positioning network, where the first frame is the template frame and the region in which the object is located is called the template; the features extracted from the template region by the ResNet50 in the central positioning network are stored to avoid repeated computation; the template region is the region whose center point is the center of the template and whose width and height are 2 times the width and height of the template respectively; and the position offset of the template in the first frame is taken as the initial motion parameter;
during tracking, the currently read image is first inversely transformed with the motion parameters tracked in the previous frame to obtain a resampled image, and the position tracked at the previous moment correspondingly maps to a quadrilateral region in the resampled image; taking the center of this quadrilateral as the center point and 5 times the width and height of the template as the size, the resampled image is cropped, padded and scaled to obtain the search region; the template region and the search region are sent into the central positioning network to obtain the predicted target center-point position (c_x, c_y), and a region of the same size as the template is cut out with (c_x, c_y) as its center to serve as the located initial target region.
Preferably, before the matching matrix between the template image and the target region is predicted by using the deep graph matching network, the method further includes training the deep graph matching network, and the training process includes:
step one: obtaining a public graph matching dataset, the dataset comprising template images (P), search images (Q), the key points U^P of the template image and their descriptors V^P, the key points U^Q of the search image and their descriptors V^Q, and the correspondence (M) between the key points in the template image and the key points in the search image; during data preprocessing the images are uniformly resized to 256 × 256, i.e. the input image size is [C, H, W] = [3, 256, 256], where C denotes the number of channels, H denotes the height of the image and W denotes the width of the image;
step two: modeling the template image P and the search image Q as graphs according to the Delaunay triangulation algorithm, denoted G^P = (U^P, E^P, V^P, ε^P) and G^Q = (U^Q, E^Q, V^Q, ε^Q) respectively, where U denotes the vertex positions, E denotes the edges, V denotes the features of the vertices and ε denotes the features of the edges; according to the feature similarity between the two point sets (V^P and V^Q), cross edges are constructed and the two subgraphs are connected into one complete graph G = G^P ∪ G^Q; specifically, for any vertex v ∈ G^P, the top-k points in G^Q whose appearance is most similar to the point v are selected to build edges, the appearance similarity being computed from the vertex features;
step three: the graph first aggregates and updates the node information along all edges, as shown in formulas (4) and (5):

m_v^(t+1) = Σ_{w ∈ N(v)} M_V(h_w^t, e_{v→w}^t)    (4)

h_v^(t+1) = U_V(h_v^t, m_v^(t+1))    (5)

where N(v) denotes the neighbours of node v, M_V denotes the node-information aggregation function, h_w^t and e_{v→w}^t denote the information of node w and of edge v → w at the t-th pass, m_v^(t+1) denotes the neighbour information of node v at the (t+1)-th pass, and U_V denotes the node update function; the information of the edges connected to each node is aggregated to obtain the neighbour information, which is then fused with the information of the original node and updated as the new state of the node;

after the node states are updated, the graph updates the edge states, again in the two steps of aggregation and update, as shown in formulas (6) and (7):

m_{v→w}^(t+1) = M_E(h_v^t, h_w^t)    (6)

e_{v→w}^(t+1) = U_E(e_{v→w}^t, m_{v→w}^(t+1))    (7)

where h_v^t and h_w^t denote the feature vectors of the source node and the destination node of edge v → w at the t-th pass, M_E denotes the edge-information transfer function, e_{v→w}^t denotes the state of edge v → w at the t-th pass, m_{v→w}^(t+1) denotes the neighbour information of the edge at the (t+1)-th pass, and U_E denotes the edge update function; the information of the source node and the destination node of each edge is aggregated to obtain the neighbour information, which is then fused with the original edge information and updated as the new state of the edge;
step four: using a weighted L2 loss as the loss function, the specific formula being shown in (9), where λ is a hyper-parameter for balancing positive and negative samples, and S and M denote the predicted score matrix and the ground-truth matching matrix label respectively; Adam is adopted as the optimizer, and the parameters of the network model are optimized by back-propagation according to the loss value:

L_match = Σ_{i,j} (λ · M_ij + (1 − M_ij)) · (S_ij − M_ij)^2    (9)
preferably, the modeling the template image and the target region as a complete graph composed of two subgraphs, where the two subgraphs correspond to the two regions, respectively, and predicting a matching matrix of the template image and the target region by using a depth map matching network includes:
extracting feature points from the target region with a SuperPoint network; modeling the extracted feature points as a graph with the Delaunay triangulation algorithm; combining this graph with the subgraph corresponding to the template image and modeling them as one complete graph; sending the complete graph into the graph matching network for information transfer, aggregation and updating; after the state of the graph is updated, estimating the matching confidence of the corresponding nodes from the features on the cross edges with a linear layer to obtain a score matrix; and binarizing the score matrix with a greedy algorithm to obtain the matching matrix between the template image and the target region, whose values are 0 or 1.
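A minimal sketch of the greedy 0/1 post-processing of the score matrix mentioned above; the threshold value and the function name are assumptions for illustration.

import numpy as np

def greedy_match(score, threshold=0.5):
    """Greedily turn a score matrix into a 0/1 matching matrix:
    repeatedly take the highest remaining score and block its row and column."""
    score = score.copy()
    match = np.zeros_like(score, dtype=np.int32)
    while True:
        i, j = np.unravel_index(np.argmax(score), score.shape)
        if score[i, j] < threshold:
            break
        match[i, j] = 1
        score[i, :] = -np.inf   # each template keypoint is matched at most once
        score[:, j] = -np.inf   # each target keypoint is matched at most once
    return match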
Preferably, the estimating, by using the RANSAC algorithm, a geometric transformation of the target from the template image to the current image from the matching pair identified by the matching matrix to obtain a predicted position of the tracking target in the current frame includes:
obtaining the matching pairs between feature points of the template image and of the target region according to the matching matrix; filtering outliers from the matching pairs with the RANSAC algorithm; estimating a transformation matrix from the remaining matched feature-point pairs; and applying to the initial position of the template the geometric transformation from the template image to the current image given by this transformation matrix, to obtain the predicted position of the tracked target in the current frame;
if the number of elements whose confidence is higher than 0.9 in the score matrix is less than 4, the target is considered lost and a relocation mechanism is started: the search region is determined directly on the current frame image according to the position tracked in the previous frame, input into the central positioning network, and the subsequent steps are executed.
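The geometric verification and relocation test can be sketched with OpenCV's RANSAC-based homography estimation as follows; the reprojection threshold and the helper names are illustrative assumptions.

import cv2
import numpy as np

def estimate_pose(kpts_tpl, kpts_tgt, match, score, corners_tpl):
    """Estimate the template-to-current-image homography from matched keypoints
    and warp the template corners to get the predicted target position."""
    if (score > 0.9).sum() < 4:           # fewer than 4 confident pairs: target lost
        return None                        # caller triggers the relocation mechanism
    rows, cols = np.nonzero(match)
    if len(rows) < 4:                      # a homography needs at least 4 point pairs
        return None
    src = kpts_tpl[rows].astype(np.float32)
    dst = kpts_tgt[cols].astype(np.float32)
    H, inliers = cv2.findHomography(src, dst, cv2.RANSAC, 3.0)
    if H is None:
        return None
    pred = cv2.perspectiveTransform(corners_tpl.reshape(-1, 1, 2).astype(np.float32), H)
    return H, pred.reshape(-1, 2)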
According to the technical solution provided by the embodiments of the invention, the method improves performance under scaling, rotation, perspective transformation, motion blur, partial occlusion and unconstrained scenes, and obtains particularly large gains under partial occlusion, motion blur and unconstrained scenes.
Additional aspects and advantages of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
Fig. 1 is a schematic diagram of an implementation of a planar target tracking method based on center point detection and graph matching according to an embodiment of the present invention.
Fig. 2 is a schematic structural diagram of a central positioning network according to an embodiment of the present invention.
Fig. 3 is a schematic structural diagram of a deep graph matching network according to an embodiment of the present invention.
Fig. 4 is a flowchart of training a central positioning network according to an embodiment of the present invention.
Fig. 5 is a flowchart of training a graph matching network according to an embodiment of the present invention.

Fig. 6 is a processing flowchart of a planar target tracking method based on center point detection and graph matching according to an embodiment of the present invention.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the accompanying drawings are illustrative only for the purpose of explaining the present invention, and are not to be construed as limiting the present invention.
As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element or intervening elements may also be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or coupled. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.
It will be understood by those skilled in the art that, unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
For the convenience of understanding of the embodiments of the present invention, the following detailed description will be given by way of example with reference to the accompanying drawings, and the embodiments are not limited to the embodiments of the present invention.
The embodiment of the invention provides a planar target tracking method that is more robust to the different motion states of the target, with greatly improved performance in fast-motion and motion-blur scenes. It combines deep learning with keypoint-based tracking, proposes a planar target tracking method based on central point detection and graph matching, and unifies the acquisition and matching of keypoints into one complete framework.
The implementation principle of the planar target tracking method based on center point detection and graph matching in the embodiment of the invention is shown in FIG. 1. First, a central positioning network predicts the center point of the tracked target in the current frame, and an initial target region is determined around it; intuitively, this gives a reliable guess of the initial position of the tracked object and creates good initial conditions for the matching stage. Then, the template image and the target region are modeled as a complete graph consisting of two subgraphs, the two subgraphs corresponding to the two regions respectively. Next, the correspondence between the two point sets is established by a deep graph matching network. Finally, the homography is calculated from the matched key-point pairs by the RANSAC (Random Sample Consensus) algorithm.
The process of training the central positioning network provided by the embodiment of the invention is shown in fig. 2, and comprises the following processing procedures:
step one: the training splits of the public tracking image datasets MSCOCO2017, GOT-10K and LaSOT are obtained. During data preprocessing, an area 2^2 times the size of the target in the template image is taken as the template region, and an area 5^2 times the size of the target in the search image is taken as the search region; these are scaled to 128 × 128 and 320 × 320 respectively, i.e. the input images have the formats [C1, H1, W1] = [3, 128, 128] and [C2, H2, W2] = [3, 320, 320], where C denotes the number of channels, H denotes the height of the image and W denotes the width of the image. Several means of data augmentation are used for the search region, including horizontal flipping, brightness perturbation and center perturbation.
Step two: and (3) taking ResNet50 with the last layer4, the last firing layer and the FC layer removed as a backbone to extract the features of the image, wherein the extracted feature dimension is 1024. And then, performing dimension reduction on the extracted features by adopting a 1 × 1 convolution kernel, wherein the dimension of the features after dimension reduction is 256. Then, the element of each position in the feature map is position-coded using an unsalable sine-cosine code.
Step three: flattening the characteristic vectors of the two parts and splicing the two parts together along the spatial dimension to obtain the characteristic vectors
Figure BDA0003755485240000091
And into the Encoder module, where d =256. The Encoder of the Encoder can enhance the original characteristics through the self-attention and the cross attention and capture the corresponding relation between the original characteristics, thereby obtaining the capacity of distinguishing the space position of the target. Feature vector of search region encoded by Encode
Figure BDA0003755485240000092
The union query vector q ∈ R 1×d Attention is paid again, and the position information of the target in the search area is decoded, as shown in formula (1).
Figure BDA0003755485240000093
Here, the decoded information
Figure BDA0003755485240000094
Then, feature vector f' x Will become through dimension transformation
Figure BDA0003755485240000095
And sent to a stacked full convolution network. Through the full convolution network, the channel dimension of f is reduced to 1, and a probability graph of central point position prediction is obtained
Figure BDA0003755485240000101
We calculate the expected values of the probability map distribution in the grid coordinate space to obtain the predicted target center point, as shown in equation (2).
Figure BDA0003755485240000102
Step four: training is carried out by using l1loss as a loss function, and the specific formula is shown as (3), wherein
Figure BDA0003755485240000103
And
Figure BDA0003755485240000104
respectively representing predicted target center point and real target center point labels. AdamW is used as an optimizer, and parameters of the network model are optimized according to the loss value.
Figure BDA0003755485240000105
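A minimal training-step sketch for the central positioning network, assuming PyTorch; the model interface, learning rate and batch assembly are illustrative assumptions.

import torch

def train_step(model, optimizer, template, search, gt_center):
    """One optimization step: predict the center on the search region and
    supervise it with an L1 loss against the ground-truth center (formula (3))."""
    optimizer.zero_grad()
    _, pred_center = model(template, search)          # [B, 2] predicted (c_x, c_y)
    loss = torch.nn.functional.l1_loss(pred_center, gt_center)
    loss.backward()
    optimizer.step()
    return loss.item()

# optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-4)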
The process of training the deep graph matching network provided by the embodiment of the invention is shown in fig. 3, and comprises the following processing procedures:
step one: a public graph matching dataset is obtained, the dataset comprising template images (P), search images (Q), the key points U^P of the template image and their descriptors V^P, the key points U^Q of the search image and their descriptors V^Q, and the correspondence (M) between the key points in the template image and the key points in the search image. During data preprocessing, the size of the images is uniformly adjusted to 256 × 256, i.e. the input image size is [C, H, W] = [3, 256, 256], where C denotes the number of channels, H denotes the height of the image and W denotes the width of the image.
Step two: modeling the template image P and the search image Q into graphs according to the Delaunay triangulation algorithm, respectively expressed as
Figure BDA00037554852400001010
And
Figure BDA00037554852400001011
wherein
Figure BDA00037554852400001012
The position of the vertex is represented and,
Figure BDA00037554852400001013
the edges are represented as a function of time,
Figure BDA00037554852400001014
representing the characteristics of the vertices and epsilon the characteristics of the edges. Then according to the ratio between two point sets (
Figure BDA00037554852400001015
And
Figure BDA00037554852400001016
) The feature similarity of the two sub-graphs constructs a cross edge and connects the two sub-graphs to form a complete graph
Figure BDA00037554852400001017
Figure BDA00037554852400001018
Specifically, for any one
Figure BDA00037554852400001019
We are in
Figure BDA00037554852400001020
Selecting top-k points with the most similar appearance to the points and building edges. Appearance similarity is defined as:
Figure BDA00037554852400001021
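An illustrative sketch of the cross-edge construction between the two subgraphs; since the similarity formula is not reproduced above, a normalized inner product (cosine similarity) of the vertex descriptors is assumed here.

import torch

def build_cross_edges(desc_p, desc_q, k=5):
    """For every vertex of graph G^P, connect it to the top-k vertices of G^Q
    whose descriptors are most similar (cosine similarity assumed)."""
    p = torch.nn.functional.normalize(desc_p, dim=1)    # [Np, D]
    q = torch.nn.functional.normalize(desc_q, dim=1)    # [Nq, D]
    sim = p @ q.t()                                     # [Np, Nq] appearance similarity
    topk = sim.topk(min(k, q.shape[0]), dim=1).indices  # [Np, k]
    edges = [(i, int(j)) for i in range(p.shape[0]) for j in topk[i]]
    return edges, sim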
step three: the graph first aggregates and updates the node information along all edges, as shown in equations (4) and (5).

m_v^(t+1) = Σ_{w ∈ N(v)} M_V(h_w^t, e_{v→w}^t)    (4)

h_v^(t+1) = U_V(h_v^t, m_v^(t+1))    (5)

Here N(v) denotes the neighbours of node v, M_V denotes the node-information aggregation function, h_w^t and e_{v→w}^t denote the information of node w and of edge v → w at the t-th pass, m_v^(t+1) denotes the neighbour information of node v at the (t+1)-th pass, and U_V denotes the node update function. Simply speaking, the information of the edges connected to each node is aggregated to obtain the neighbour information, and then the neighbour information and the information of the original node are fused and updated as the new state of the node.

After the node states are updated, the graph updates the edge states, again in the two steps of aggregation and update, as shown in equations (6) and (7).

m_{v→w}^(t+1) = M_E(h_v^t, h_w^t)    (6)

e_{v→w}^(t+1) = U_E(e_{v→w}^t, m_{v→w}^(t+1))    (7)

Here h_v^t and h_w^t denote the feature vectors of the source node and the destination node of edge v → w at the t-th pass, M_E denotes the edge-information transfer function, e_{v→w}^t denotes the state of edge v → w at the t-th pass, m_{v→w}^(t+1) denotes the neighbour information of the edge at the (t+1)-th pass, and U_E denotes the edge update function. Simply speaking, the information of the source node and the destination node of each edge is aggregated to obtain the neighbour information, and then the neighbour information and the original edge information are fused and updated as the new state of the edge. The aggregation functions (M_V, M_E) and the update functions (U_V, U_E) in equations (4)-(7) are implemented by an MLP that includes a linear layer, a ReLU layer and a LayerNorm layer. After the state of the graph is updated, a linear layer is used to estimate the matching confidence of the corresponding nodes from the features on the cross edges, and a score matrix S is obtained.
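A compact sketch of one node/edge message-passing round corresponding to equations (4)-(7), with the aggregation and update functions implemented as linear + ReLU + LayerNorm MLPs as stated above; the layer sizes and class name are assumptions.

import torch
import torch.nn as nn

def mlp(in_dim, out_dim):
    # Aggregation/update functions: linear layer, ReLU and LayerNorm, as in the text.
    return nn.Sequential(nn.Linear(in_dim, out_dim), nn.ReLU(inplace=True), nn.LayerNorm(out_dim))

class MessagePassingLayer(nn.Module):
    def __init__(self, d=256):
        super().__init__()
        self.M_V, self.U_V = mlp(2 * d, d), mlp(2 * d, d)   # node aggregation / update
        self.M_E, self.U_E = mlp(2 * d, d), mlp(2 * d, d)   # edge aggregation / update

    def forward(self, h, e, edges):
        """h: [N, d] node states; e: [M, d] edge states; edges: [M, 2] LongTensor (src, dst)."""
        src, dst = edges[:, 0], edges[:, 1]
        # (4)-(5): aggregate the messages of incident edges per node, then update the node states.
        msg = self.M_V(torch.cat([h[dst], e], dim=1))        # messages along edges v -> w
        agg = torch.zeros_like(h).index_add_(0, src, msg)    # sum over the neighbours of each node
        h_new = self.U_V(torch.cat([h, agg], dim=1))
        # (6)-(7): aggregate the source/destination node states per edge, then update the edge states.
        e_msg = self.M_E(torch.cat([h[src], h[dst]], dim=1))
        e_new = self.U_E(torch.cat([e, e_msg], dim=1))
        return h_new, e_new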
Step four: using l2loss with weight as the loss function, the specific formula is shown in (9), where λ is the hyper-parameter for balancing the positive and negative samples, and is set to 50 in the experiment, and s and M represent the predicted fractional matrix and the true match matrix label, respectively. Adam is used as an optimizer, and parameters of the network model are optimized through back propagation according to the loss value.
Figure BDA00037554852400001111
Figure BDA0003755485240000121
A training flowchart of a center positioning network according to an embodiment of the present invention is shown in fig. 4, a training flowchart of a graph matching network according to an embodiment of the present invention is shown in fig. 5, a processing flowchart of a planar target tracking method based on center point detection and graph matching according to an embodiment of the present invention is shown in fig. 6, and the processing method includes the following processing procedures:
step one: data input and tracker initialization. The input to the model is a sequence of consecutive video frames, the first of which is the template frame; the region where the object is located is called the template. To avoid computing the template multiple times in the forward propagation, we store both the features extracted by the ResNet50 in the central positioning network for the template region and the subgraph of the corresponding template in the graph matching network. The template region is the region whose center point is the center of the template and whose width and height are 2 times the width and height of the template respectively. At the same time, we use the position offset of the template in the first frame as the initial motion parameter.
Step two: and (5) processing by a central positioning network. At this stage, we first perform an inverse transformation on the read image by using the motion parameters tracked in the previous frame to obtain a resampled image. The last tracked position will then also correspond to a quadrilateral area in the resampled image. The center of the quadrangle is taken as a central point, and the width and the height of the template which are 5 times are taken as sizes to cut, fill and zoom the resampled image to obtain a search area. Then, the template area and the search area are sent to a central positioning network to obtain the predicted target central point position (c) x ,c y ). Finally, with (c) x ,c y ) Cut out for the centerAnd a region with the same size as the template is used as the positioned target region.
Step three: and (5) processing by a graph matching network. In step two we obtain an initial rectangular target area. The method comprises the steps of extracting characteristic points from the images by using SuperPoint, and modeling the extracted characteristic points into a graph by using a Delaunay triangulation algorithm. And then combining and modeling the sub-images corresponding to the template images in the step one into a complete graph. And sending the constructed graph into a graph matching network for information transmission, aggregation and updating. After the state of the graph is updated, a linear layer is used for estimating the matching confidence of the corresponding nodes from the features on the crossed edges, and a fractional matrix is obtained. Finally, a greedy algorithm is used for processing the fractional matrix to obtain a matching matrix with the value of 0 or 1.
Step four: and obtaining the corresponding relation between the template image and the characteristic points in the target area by the matching matrix in the third step. We filter out outlers that bring large differences using the RANSAC algorithm and then estimate a transformation matrix using the remaining pairs of feature point matches. And performing geometric transformation on the initial position of the template by using the transformation matrix to obtain the predicted position of the target in the current frame.
Step five: and (4) loss processing. In target tracking, it is a common situation that a target loss occurs. In order to improve the tracking accuracy, a loss detection and relocation mechanism is added into a tracking framework. If the number of elements with confidence above 0.9 in the fractional matrix is less than 4, we consider that object loss has occurred because at least 4 pairs of matched feature points are needed to compute the perspective transformation. When a target loss occurs, we start the relocation mechanism. When an object is lost, the motion parameters of the previous frame are generally not trusted. Therefore, the motion parameters of the previous frame are not used for carrying out inverse transformation on the current frame image, and the search area is directly determined on the current frame image according to the tracked position of the previous frame, input into the center positioning network and carried out the subsequent steps.
The proposed planar target tracking method based on central point detection and graph matching was compared with other state-of-the-art algorithms on the public POT-210 dataset through experiments, which demonstrates the effectiveness of the method provided by the invention. The results show that the proposed method is more robust to the different motion states of the target, holds a leading advantage in handling partial occlusion, motion blur and unconstrained scenes, and achieves more accurate tracking.
Table 1: Comparison with other methods on the POT-210 dataset
In summary, the embodiment of the present invention provides a planar target tracking method that is more robust to the motion state of the target, and the performance of the proposed method in fast-motion and motion-blur scenes is greatly improved. Experimental data show that the proposed tracking method improves performance under scaling, rotation, perspective transformation, motion blur, partial occlusion and unconstrained scenes, and obtains particularly large gains under partial occlusion, motion blur and unconstrained scenes.
The invention decomposes the planar target tracking task into two steps: first an initial, coarse-grained target region of the tracked object is predicted, and then the accurate position of the target is obtained by refinement with the graph matching network, thereby improving tracking accuracy.
The central positioning network provided by the invention can first locate the initial position of the target even when the tracked target has a large position offset, so the search space of the model can be reduced effectively at a small additional computational cost, the occurrence of target loss is reduced, and the approach is more robust than directly estimating the final position of the target. The graph matching network provided by the invention models the problem representation as a graph; because the graph structure retains a certain structural invariance across consecutive frames, the stability of tracking is improved. In general, the two-stage tracking strategy provided by the invention achieves a more robust tracking effect for the different motion states of the target through the pre-localization and graph matching techniques, and obtains particularly good performance in large-scale motion and unconstrained scenes.

Those of ordinary skill in the art will understand that the figures are merely schematic representations of one embodiment, and the blocks or flows in the figures are not necessarily required to practice the present invention.
From the above description of the embodiments, it is clear to those skilled in the art that the present invention can be implemented by software plus necessary general hardware platform. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which may be stored in a storage medium, such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method according to the embodiments or some parts of the embodiments.
The embodiments in the present specification are described in a progressive manner; the same and similar parts among the embodiments may be referred to each other, and each embodiment focuses on its differences from the other embodiments. In particular, for apparatus or system embodiments, since they are substantially similar to the method embodiments, the description is relatively simple, and for relevant parts reference may be made to the partial description of the method embodiments. The above-described embodiments of the apparatus and system are merely illustrative: the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units, i.e. they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the embodiment. Those of ordinary skill in the art can understand and implement it without inventive effort.
The above description is only for the preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention. Therefore, the protection scope of the present invention should be subject to the protection scope of the claims.

Claims (6)

1. A planar target tracking method based on central point detection and graph matching is characterized by comprising the following steps:
predicting the central point of a tracking target in the current frame by using a central positioning network, and determining an initial target area according to the predicted central point;
modeling the template image and the target region as a complete graph consisting of two subgraphs, the two subgraphs corresponding to the two regions respectively, and predicting the matching matrix between the template image and the target region by using a deep graph matching network;
and estimating the geometric transformation of the target from the template image to the current image from the matching pair identified by the matching matrix by using a RANSAC algorithm to obtain the predicted position of the tracking target in the current frame.
2. The method of claim 1, wherein the predicting the center point of the tracking target in the current frame using the center positioning network further comprises training the center positioning network, the training process comprising:
step one: acquiring the training split of a public tracking image dataset; during data preprocessing, taking an area 2^2 times the size of the target in the template image as the template region and an area 5^2 times the size of the target in the search image as the search region, and scaling them to 128 × 128 and 320 × 320 respectively, i.e. the input images have the formats [C1, H1, W1] = [3, 128, 128] and [C2, H2, W2] = [3, 320, 320], where C denotes the number of channels, H denotes the height of the image and W denotes the width of the image;
step two: adopting a ResNet50 with the last stage (layer4), the last pooling layer and the FC layer removed as the backbone network to extract image features, the extracted feature dimension being 1024; adopting a 1 × 1 convolution kernel to reduce the dimension of the extracted features, the feature dimension after reduction being 256; and using a non-learnable sine-cosine code to positionally encode the element at each position in the feature map;
step three: flattening the feature vectors of the two parts and splicing them together along the spatial dimension to obtain the feature vector f ∈ R^((h1·w1 + h2·w2) × d), where h_i × w_i are the spatial sizes of the two feature maps, and sending it into an Encoder module, where d = 256; the Encoder enhances the original features and captures the correspondence between them through self-attention and cross-attention, thereby obtaining the ability to distinguish the spatial position of the target; the feature vector of the search region encoded by the Encoder, f_x ∈ R^((h2·w2) × d), is attended again together with a query vector q ∈ R^(1×d), and the position information of the target in the search region is decoded, as shown in formula (1):

f'_x = Dec(f_x, q)    (1)

the decoded information is f'_x ∈ R^((h2·w2) × d); the feature vector f'_x is reshaped by a dimension change to f ∈ R^(d × h2 × w2) and sent to a stacked fully convolutional network, which reduces the channel dimension of f to 1 and yields a probability map P ∈ R^(h2 × w2) of the predicted center-point position; the expected value of the probability map distribution is calculated in the grid coordinate space to obtain the predicted target center point, as shown in formula (2):

(ĉ_x, ĉ_y) = Σ_(x,y) (x, y) · P(x, y)    (2)
step four: training with the L1 loss as the loss function, the specific formula being shown in (3), where ĉ_i and c_i denote the predicted target center point and the ground-truth target center-point label respectively; AdamW is adopted as the optimizer, and the parameters of the network model are optimized according to the loss value:

L_center = (1/N) Σ_i ||ĉ_i − c_i||_1    (3)
and obtaining the trained central positioning network.
3. The method of claim 1, wherein predicting a center point of a tracked target in the current frame using a center location network and determining an initial target region based on the predicted center point comprises:
inputting continuous video frames into the central positioning network, wherein the first frame is the template frame and the region in which the object is located is called the template; the features extracted from the template region by the ResNet50 in the central positioning network are stored to avoid repeated computation; the template region is the region whose center point is the center of the template and whose width and height are 2 times the width and height of the template respectively; and the position offset of the template in the first frame is taken as the initial motion parameter;

during tracking, the currently read image is first inversely transformed with the motion parameters tracked in the previous frame to obtain a resampled image, and the position tracked at the previous moment correspondingly maps to a quadrilateral region in the resampled image; taking the center of this quadrilateral as the center point and 5 times the width and height of the template as the size, the resampled image is cropped, padded and scaled to obtain the search region; the template region and the search region are sent into the central positioning network to obtain the predicted target center-point position (c_x, c_y), and a region of the same size as the template is cut out with (c_x, c_y) as its center to serve as the located initial target region.
4. The method of claim 1, wherein the predicting of the matching matrix between the template image and the target region using the deep graph matching network further comprises training the deep graph matching network, the training process comprising:
step one: obtaining a public graph matching dataset, the dataset comprising template images (P), search images (Q), the key points U^P of the template image and their descriptors (V^P), the key points U^Q of the search image and their descriptors (V^Q), and the correspondence (M) between the key points in the template image and the key points in the search image; during data preprocessing the size of the images is uniformly adjusted to 256 × 256, i.e. the input image size is [C, H, W] = [3, 256, 256], where C denotes the number of channels, H denotes the height of the image and W denotes the width of the image;
step two: modeling the template image P and the search image Q as graphs according to the Delaunay triangulation algorithm, denoted G^P = (U^P, E^P, V^P, ε^P) and G^Q = (U^Q, E^Q, V^Q, ε^Q) respectively, where U denotes the vertex positions, E denotes the edges, V denotes the features of the vertices and ε denotes the features of the edges; according to the feature similarity between the two point sets (V^P and V^Q), cross edges are constructed and the two subgraphs are connected into one complete graph G = G^P ∪ G^Q; specifically, for any vertex v ∈ G^P, the top-k points in G^Q whose appearance is most similar to the point v are selected to build edges, the appearance similarity being computed from the vertex features;
step three: the graph first aggregates and updates the node information along all edges, as shown in equations (4) and (5):

m_v^(t+1) = Σ_{w ∈ N(v)} M_V(h_w^t, e_{v→w}^t)    (4)

h_v^(t+1) = U_V(h_v^t, m_v^(t+1))    (5)

where N(v) denotes the neighbours of node v, M_V denotes the node-information aggregation function, h_w^t and e_{v→w}^t denote the information of node w and of edge v → w at the t-th pass, m_v^(t+1) denotes the neighbour information of node v at the (t+1)-th pass, and U_V denotes the node update function; the information of the edges connected to each node is aggregated to obtain the neighbour information, which is then fused with the information of the original node and updated as the new state of the node;

after the node states are updated, the graph updates the edge states, again in the two steps of aggregation and update, as shown in formulas (6) and (7):

m_{v→w}^(t+1) = M_E(h_v^t, h_w^t)    (6)

e_{v→w}^(t+1) = U_E(e_{v→w}^t, m_{v→w}^(t+1))    (7)

where h_v^t and h_w^t denote the feature vectors of the source node and the destination node of edge v → w at the t-th pass, M_E denotes the edge-information transfer function, e_{v→w}^t denotes the state of edge v → w at the t-th pass, m_{v→w}^(t+1) denotes the neighbour information of the edge at the (t+1)-th pass, and U_E denotes the edge update function; the information of the source node and the destination node of each edge is aggregated to obtain the neighbour information, which is then fused with the original edge information and updated as the new state of the edge;
step four: using a weighted L2 loss as the loss function, the specific formula being shown in (9), where λ is a hyper-parameter for balancing positive and negative samples, and S and M denote the predicted score matrix and the ground-truth matching matrix label respectively; Adam is adopted as the optimizer, and the parameters of the network model are optimized by back-propagation according to the loss value:

L_match = Σ_{i,j} (λ · M_ij + (1 − M_ij)) · (S_ij − M_ij)^2    (9)
5. The method of claim 4, wherein the modeling of the template image and the target region as a complete graph consisting of two subgraphs, one subgraph for each region, and the predicting of the matching matrix between the template image and the target region using the deep graph matching network comprise:
extracting feature points from the target region with a SuperPoint network; modeling the extracted feature points as a graph with the Delaunay triangulation algorithm; combining this graph with the subgraph corresponding to the template image and modeling them as one complete graph; sending the complete graph into the graph matching network for information transfer, aggregation and updating; after the state of the graph is updated, estimating the matching confidence of the corresponding nodes from the features on the cross edges with a linear layer to obtain a score matrix; and binarizing the score matrix with a greedy algorithm to obtain the matching matrix between the template image and the target region, whose values are 0 or 1.
6. The method of claim 5, wherein estimating the geometric transformation of the target from the template image to the current image from the matching pair identified by the matching matrix using the RANSAC algorithm to obtain the predicted position of the tracked target in the current frame comprises:
obtaining the matching pairs between feature points of the template image and of the target region according to the matching matrix; filtering outliers from the matching pairs with the RANSAC algorithm; estimating a transformation matrix from the remaining matched feature-point pairs; and applying to the initial position of the template the geometric transformation from the template image to the current image given by this transformation matrix, to obtain the predicted position of the tracked target in the current frame;
if the number of elements whose confidence is higher than 0.9 in the score matrix is less than 4, the target is considered lost and a relocation mechanism is started: the search region is determined directly on the current frame image according to the position tracked in the previous frame, input into the central positioning network, and the subsequent steps are executed.
CN202210853244.9A 2022-07-20 2022-07-20 Planar target tracking method based on central point detection and graph matching Pending CN115239763A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210853244.9A CN115239763A (en) 2022-07-20 2022-07-20 Planar target tracking method based on central point detection and graph matching

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210853244.9A CN115239763A (en) 2022-07-20 2022-07-20 Planar target tracking method based on central point detection and graph matching

Publications (1)

Publication Number Publication Date
CN115239763A true CN115239763A (en) 2022-10-25

Family

ID=83674314

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210853244.9A Pending CN115239763A (en) 2022-07-20 2022-07-20 Planar target tracking method based on central point detection and graph matching

Country Status (1)

Country Link
CN (1) CN115239763A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116572264A (en) * 2023-05-22 2023-08-11 中铁九局集团电务工程有限公司 Soft mechanical arm free eye system target tracking method based on light weight model
CN116572264B (en) * 2023-05-22 2024-06-04 中铁九局集团电务工程有限公司 Soft mechanical arm free eye system target tracking method based on light weight model
CN118071831A (en) * 2024-04-10 2024-05-24 北京阿丘机器人科技有限公司 Image coarse positioning method, device and computer readable storage medium
CN118071831B (en) * 2024-04-10 2024-07-30 北京阿丘机器人科技有限公司 Image coarse positioning method, device and computer readable storage medium

Similar Documents

Publication Publication Date Title
Yamaguchi et al. Robust monocular epipolar flow estimation
CN110866953A (en) Map construction method and device, and positioning method and device
US11170202B2 (en) Apparatus and method for performing 3D estimation based on locally determined 3D information hypotheses
CN108537844B (en) Visual SLAM loop detection method fusing geometric information
Yang et al. A bundled-optimization model of multiview dense depth map synthesis for dynamic scene reconstruction
CN105409207A (en) Feature-based image set compression
Chen et al. Towards part-aware monocular 3d human pose estimation: An architecture search approach
CN111402412A (en) Data acquisition method and device, equipment and storage medium
Cao et al. Fast and robust feature tracking for 3D reconstruction
Pintore et al. Deep3dlayout: 3d reconstruction of an indoor layout from a spherical panoramic image
KR102464271B1 (en) Pose acquisition method, apparatus, electronic device, storage medium and program
CN113298871B (en) Map generation method, positioning method, system thereof, and computer-readable storage medium
CN115239763A (en) Planar target tracking method based on central point detection and graph matching
CN117011137B (en) Image stitching method, device and equipment based on RGB similarity feature matching
GB2571307A (en) 3D skeleton reconstruction from images using volumic probability data
CN112085842B (en) Depth value determining method and device, electronic equipment and storage medium
Zhang et al. Unsupervised learning of monocular depth and ego-motion with space–temporal-centroid loss
Knorr et al. A modular scheme for 2D/3D conversion of TV broadcast
CN109951705B (en) Reference frame synthesis method and device for vehicle object coding in surveillance video
Zhu et al. Video stabilization based on image registration
Cai et al. Consistent Depth Prediction for Transparent Object Reconstruction from RGB-D Camera
Li et al. FeatFlow: Learning geometric features for 3D motion estimation
Yang et al. An improved belief propagation method for dynamic collage
Porzi et al. An automatic image-to-DEM alignment approach for annotating mountains pictures on a smartphone
Zhang et al. Joint motion model for local stereo video-matching method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination