CN115239763A - Planar target tracking method based on central point detection and graph matching - Google Patents

Planar target tracking method based on central point detection and graph matching

Info

Publication number
CN115239763A
Authority
CN
China
Prior art keywords
target
image
template
matching
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210853244.9A
Other languages
Chinese (zh)
Inventor
王涛
李坤鹏
刘贺
李浥东
郎丛妍
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Jiaotong University
Original Assignee
Beijing Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Jiaotong University filed Critical Beijing Jiaotong University
Priority to CN202210853244.9A priority Critical patent/CN115239763A/en
Publication of CN115239763A publication Critical patent/CN115239763A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/20 Analysis of motion
    • G06T7/246 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/70 Determining position or orientation of objects or cameras
    • G06T7/73 Determining position or orientation of objects or cameras using feature-based methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74 Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/761 Proximity, similarity or dissimilarity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/10 Image acquisition modality
    • G06T2207/10016 Video; Image sequence
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20081 Training; Learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a planar target tracking method based on central point detection and graph matching, which comprises the following steps: predicting the center point of the tracked target in the current frame with a central positioning network, and determining an initial target region according to the predicted center point; modeling the template image and the target region as a complete graph consisting of two subgraphs, one subgraph for each region, and predicting the matching matrix between the template image and the target region with a deep graph matching network; and estimating the geometric transformation of the target from the template image to the current image from the matching pairs identified by the matching matrix with the RANSAC algorithm, to obtain the predicted position of the tracked target. Overall, the method outperforms prior methods under scaling, rotation, perspective transformation, motion blur, partial occlusion and unconstrained scenes, with particularly large gains under partial occlusion, motion blur and unconstrained scenes.

Description

Planar target tracking method based on central point detection and graph matching
Technical Field
The invention relates to the technical field of computer vision and artificial intelligence, and in particular to a planar target tracking method based on central point detection and graph matching.
Background
With the continuous development of internet technology and computer vision, the perception and analysis of image information provide great convenience for the life of people. Today, channels for image information acquisition are diverse and wide, such as mobile phones, surveillance cameras, and so on. The influx of a great deal of visual information also urgently requires an efficient method or technology for processing the information. Among these techniques, planar target tracking plays an important role in tracking a specific target. Planar target tracking is widely used in many vision-based robot applications and related fields, such as visual SLAM (Simultaneous Localization and Mapping) and augmented reality, which provides powerful support for tracking or modeling of three-dimensional objects through geometric transformations derived from two-dimensional image analysis.
For a video sequence, the purpose of planar target tracking is, given the planar object to be tracked in the initial frame, to estimate where the target appears in subsequent frames. This problem is usually cast as estimating a 2D geometric transformation, such as an affine transformation or a perspective transformation (also called a homography). Although some excellent work has been proposed in this field, existing methods still do not perform well when the target is occluded or moves rapidly.
At present, the planar target tracking methods in the prior art can be divided into two categories, namely template-based methods and key point-based methods. The essence of the template-based approach is to solve an optimization problem that minimizes the differences between the template and the search area to track objects, such as the ESM (Efficient Second-Order Minimization) algorithm. The keypoint-based method generally models a template and a search region as two groups of keypoints, then establishes a correspondence between them, and finally estimates a 2D transformation using a geometric verification method, such as the Gracker algorithm. Compared with the template-based method, the keypoint-based method has natural advantages for partial occlusion.
The development of deep learning in recent years has injected new vitality into this field, and it has gradually become a research hotspot in planar target tracking. These deep methods are usually designed as template-based methods, which regress the position or homography of the target by fusing deep global features of the template and the search image, such as the HomographyNet and HDN algorithms. In addition, some methods are dedicated to constructing descriptors that are robust to geometric transformations to improve tracking accuracy, such as the GIFT and LISRD descriptors. However, few deep learning methods unify the acquisition and matching of keypoints into one complete framework.
At present, the prior art contains no keypoint-based deep planar target tracking method that is particularly effective for motion blur and unconstrained scenes.
Disclosure of Invention
The embodiment of the invention provides a planar target tracking method based on central point detection and graph matching, so as to effectively track a planar target in images.
In order to achieve the purpose, the invention adopts the following technical scheme.
A planar target tracking method based on center point detection and graph matching comprises the following steps:
predicting the central point of a tracking target in the current frame by using a central positioning network, and determining an initial target area according to the predicted central point;
modeling the template image and the target region as a complete graph consisting of two subgraphs, the two subgraphs corresponding to the two regions respectively, and predicting the matching matrix between the template image and the target region by using a deep graph matching network;
and estimating the geometric transformation of the target from the template image to the current image from the matching pair identified by the matching matrix by using a RANSAC algorithm to obtain the predicted position of the tracking target in the current frame.
Preferably, before predicting the central point of the tracking target in the current frame by using the central positioning network, the central positioning network is trained, and the training process includes:
step one: acquiring the training split of a public tracking image dataset; during data preprocessing, taking an area 2^2 times the size of the target in the template image as the template region and an area 5^2 times the size of the target in the search image as the search region, and scaling them to 128 × 128 and 320 × 320 respectively, i.e. the input images have the formats [C1, H1, W1] = [3, 128, 128] and [C2, H2, W2] = [3, 320, 320], where C denotes the number of channels, H denotes the height of the image and W denotes the width of the image;
step two: adopting a ResNet50 with the last stage (layer4), the last pooling layer and the FC layer removed as the backbone network to extract image features, the extracted feature dimension being 1024; adopting a 1 × 1 convolution kernel to reduce the dimension of the extracted features, the feature dimension after reduction being 256; and using a non-learnable sine-cosine code to positionally encode the element at each position in the feature map;
step three: flattening the feature vectors of the two parts and splicing them together along the spatial dimension to obtain the feature vector f ∈ R^((h1·w1 + h2·w2) × d), where h_i × w_i are the spatial sizes of the two feature maps, and sending it into an Encoder module, where d = 256; the Encoder enhances the original features and captures the correspondence between them through self-attention and cross-attention, thereby obtaining the ability to distinguish the spatial position of the target; the feature vector of the search region encoded by the Encoder, f_x ∈ R^((h2·w2) × d), is attended again together with a query vector q ∈ R^(1×d), and the position information of the target in the search region is decoded, as shown in formula (1):

f'_x = Dec(f_x, q)    (1)

the decoded information is f'_x ∈ R^((h2·w2) × d); the feature vector f'_x is reshaped by a dimension change to f ∈ R^(d × h2 × w2) and sent to a stacked fully convolutional network, which reduces the channel dimension of f to 1 and yields a probability map P ∈ R^(h2 × w2) of the predicted center-point position; the expected value of the probability map distribution is calculated in the grid coordinate space to obtain the predicted target center point, as shown in formula (2):

(ĉ_x, ĉ_y) = Σ_(x,y) (x, y) · P(x, y)    (2)
step four: training with the L1 loss as the loss function, the specific formula being shown in (3), where ĉ_i = (ĉ_x, ĉ_y) and c_i = (c_x, c_y) denote the predicted target center point and the ground-truth target center-point label respectively; AdamW is adopted as the optimizer, and the parameters of the network model are optimized according to the loss value:

L_center = (1/N) Σ_i ||ĉ_i − c_i||_1    (3)
and obtaining the trained central positioning network.
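As an illustrative sketch only, the center-point prediction head described above (a probability map followed by an expectation over grid coordinates) may be written as follows; PyTorch is assumed, and the layer sizes and the CenterHead name are illustrative assumptions rather than the reference implementation of the invention.

import torch
import torch.nn as nn

class CenterHead(nn.Module):
    """Stacked fully convolutional head that reduces the channel dimension to 1
    and returns the expected center point of the resulting probability map."""
    def __init__(self, dim=256):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(dim, dim // 2, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(dim // 2, dim // 4, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(dim // 4, 1, 3, padding=1),
        )

    def forward(self, f):                     # f: [B, 256, h, w]
        logits = self.convs(f).flatten(1)     # [B, h*w]
        prob = torch.softmax(logits, dim=1)   # probability map of the center point
        b, _, h, w = f.shape
        ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
        grid = torch.stack([xs, ys], dim=-1).reshape(-1, 2).float().to(f.device)
        center = prob @ grid                  # expectation over grid coordinates, [B, 2]
        return prob.view(b, h, w), center

During training, the returned center can be supervised with the L1 loss of formula (3) against the ground-truth center-point label.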
Preferably, the predicting a central point of a tracked target in a current frame by using a central positioning network, and determining an initial target region according to the predicted central point includes:
inputting continuous video frames into the central positioning network, where the first frame is the template frame and the region in which the object is located is called the template; the features extracted from the template region by the ResNet50 in the central positioning network are stored to avoid repeated computation; the template region is the region whose center point is the center of the template and whose width and height are 2 times the width and height of the template respectively; and the position offset of the template in the first frame is taken as the initial motion parameter;
during tracking, the currently read image is first inversely transformed with the motion parameters tracked in the previous frame to obtain a resampled image, and the position tracked at the previous moment correspondingly maps to a quadrilateral region in the resampled image; taking the center of this quadrilateral as the center point and 5 times the width and height of the template as the size, the resampled image is cropped, padded and scaled to obtain the search region; the template region and the search region are sent into the central positioning network to obtain the predicted target center-point position (c_x, c_y), and a region of the same size as the template is cut out with (c_x, c_y) as its center to serve as the located initial target region.
Preferably, before the matching matrix between the template image and the target region is predicted by using the deep graph matching network, the method further includes training the deep graph matching network, and the training process includes:
step one: obtaining a public graph matching dataset, the dataset comprising template images (P), search images (Q), the key points U^P of the template image and their descriptors V^P, the key points U^Q of the search image and their descriptors V^Q, and the correspondence (M) between the key points in the template image and the key points in the search image; during data preprocessing the images are uniformly resized to 256 × 256, i.e. the input image size is [C, H, W] = [3, 256, 256], where C denotes the number of channels, H denotes the height of the image and W denotes the width of the image;
step two: modeling the template image P and the search image Q as graphs according to the Delaunay triangulation algorithm, denoted G^P = (U^P, E^P, V^P, ε^P) and G^Q = (U^Q, E^Q, V^Q, ε^Q) respectively, where U denotes the vertex positions, E denotes the edges, V denotes the features of the vertices and ε denotes the features of the edges; according to the feature similarity between the two point sets (V^P and V^Q), cross edges are constructed and the two subgraphs are connected into one complete graph G = G^P ∪ G^Q; specifically, for any vertex v ∈ G^P, the top-k points in G^Q whose appearance is most similar to the point v are selected to build edges, the appearance similarity being computed from the vertex features;
step three: the graph first aggregates and updates the node information along all edges, as shown in formulas (4) and (5):

m_v^(t+1) = Σ_{w ∈ N(v)} M_V(h_w^t, e_{v→w}^t)    (4)

h_v^(t+1) = U_V(h_v^t, m_v^(t+1))    (5)

where N(v) denotes the neighbours of node v, M_V denotes the node-information aggregation function, h_w^t and e_{v→w}^t denote the information of node w and of edge v → w at the t-th pass, m_v^(t+1) denotes the neighbour information of node v at the (t+1)-th pass, and U_V denotes the node update function; the information of the edges connected to each node is aggregated to obtain the neighbour information, which is then fused with the information of the original node and updated as the new state of the node;

after the node states are updated, the graph updates the edge states, again in the two steps of aggregation and update, as shown in formulas (6) and (7):

m_{v→w}^(t+1) = M_E(h_v^t, h_w^t)    (6)

e_{v→w}^(t+1) = U_E(e_{v→w}^t, m_{v→w}^(t+1))    (7)

where h_v^t and h_w^t denote the feature vectors of the source node and the destination node of edge v → w at the t-th pass, M_E denotes the edge-information transfer function, e_{v→w}^t denotes the state of edge v → w at the t-th pass, m_{v→w}^(t+1) denotes the neighbour information of the edge at the (t+1)-th pass, and U_E denotes the edge update function; the information of the source node and the destination node of each edge is aggregated to obtain the neighbour information, which is then fused with the original edge information and updated as the new state of the edge;
step four: using a weighted L2 loss as the loss function, the specific formula being shown in (9), where λ is a hyper-parameter for balancing positive and negative samples, and S and M denote the predicted score matrix and the ground-truth matching matrix label respectively; Adam is adopted as the optimizer, and the parameters of the network model are optimized by back-propagation according to the loss value:

L_match = Σ_{i,j} (λ · M_ij + (1 − M_ij)) · (S_ij − M_ij)^2    (9)
preferably, the modeling the template image and the target region as a complete graph composed of two subgraphs, where the two subgraphs correspond to the two regions, respectively, and predicting a matching matrix of the template image and the target region by using a depth map matching network includes:
extracting feature points from the target region with a SuperPoint network; modeling the extracted feature points as a graph with the Delaunay triangulation algorithm; combining this graph with the subgraph corresponding to the template image and modeling them as one complete graph; sending the complete graph into the graph matching network for information transfer, aggregation and updating; after the state of the graph is updated, estimating the matching confidence of the corresponding nodes from the features on the cross edges with a linear layer to obtain a score matrix; and binarizing the score matrix with a greedy algorithm to obtain the matching matrix between the template image and the target region, whose values are 0 or 1.
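A minimal sketch of the greedy 0/1 post-processing of the score matrix mentioned above; the threshold value and the function name are assumptions for illustration.

import numpy as np

def greedy_match(score, threshold=0.5):
    """Greedily turn a score matrix into a 0/1 matching matrix:
    repeatedly take the highest remaining score and block its row and column."""
    score = score.copy()
    match = np.zeros_like(score, dtype=np.int32)
    while True:
        i, j = np.unravel_index(np.argmax(score), score.shape)
        if score[i, j] < threshold:
            break
        match[i, j] = 1
        score[i, :] = -np.inf   # each template keypoint is matched at most once
        score[:, j] = -np.inf   # each target keypoint is matched at most once
    return match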
Preferably, the estimating, by using the RANSAC algorithm, a geometric transformation of the target from the template image to the current image from the matching pair identified by the matching matrix to obtain a predicted position of the tracking target in the current frame includes:
obtaining the matching pairs between feature points of the template image and of the target region according to the matching matrix; filtering outliers from the matching pairs with the RANSAC algorithm; estimating a transformation matrix from the remaining matched feature-point pairs; and applying to the initial position of the template the geometric transformation from the template image to the current image given by this transformation matrix, to obtain the predicted position of the tracked target in the current frame;
if the number of elements whose confidence is higher than 0.9 in the score matrix is less than 4, the target is considered lost and a relocation mechanism is started: the search region is determined directly on the current frame image according to the position tracked in the previous frame, input into the central positioning network, and the subsequent steps are executed.
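The geometric verification and relocation test can be sketched with OpenCV's RANSAC-based homography estimation as follows; the reprojection threshold and the helper names are illustrative assumptions.

import cv2
import numpy as np

def estimate_pose(kpts_tpl, kpts_tgt, match, score, corners_tpl):
    """Estimate the template-to-current-image homography from matched keypoints
    and warp the template corners to get the predicted target position."""
    if (score > 0.9).sum() < 4:           # fewer than 4 confident pairs: target lost
        return None                        # caller triggers the relocation mechanism
    rows, cols = np.nonzero(match)
    if len(rows) < 4:                      # a homography needs at least 4 point pairs
        return None
    src = kpts_tpl[rows].astype(np.float32)
    dst = kpts_tgt[cols].astype(np.float32)
    H, inliers = cv2.findHomography(src, dst, cv2.RANSAC, 3.0)
    if H is None:
        return None
    pred = cv2.perspectiveTransform(corners_tpl.reshape(-1, 1, 2).astype(np.float32), H)
    return H, pred.reshape(-1, 2)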
According to the technical solution provided by the embodiments of the invention, the method improves performance under scaling, rotation, perspective transformation, motion blur, partial occlusion and unconstrained scenes, and obtains particularly large gains under partial occlusion, motion blur and unconstrained scenes.
Additional aspects and advantages of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
Fig. 1 is a schematic diagram of an implementation of a planar target tracking method based on center point detection and graph matching according to an embodiment of the present invention.
Fig. 2 is a schematic structural diagram of a central positioning network according to an embodiment of the present invention.
Fig. 3 is a schematic structural diagram of a deep graph matching network according to an embodiment of the present invention.
Fig. 4 is a flowchart of training a central positioning network according to an embodiment of the present invention.
Fig. 5 is a flowchart of training a graph matching network according to an embodiment of the present invention.

Fig. 6 is a processing flowchart of a planar target tracking method based on center point detection and graph matching according to an embodiment of the present invention.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the accompanying drawings are illustrative only for the purpose of explaining the present invention, and are not to be construed as limiting the present invention.
As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element or intervening elements may also be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or coupled. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.
It will be understood by those skilled in the art that, unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
For the convenience of understanding of the embodiments of the present invention, the following detailed description will be given by way of example with reference to the accompanying drawings, and the embodiments are not limited to the embodiments of the present invention.
The embodiment of the invention provides a planar target tracking method that is more robust to the different motion states of the target, with greatly improved performance in fast-motion and motion-blur scenes. It combines deep learning with keypoint-based tracking, proposes a planar target tracking method based on central point detection and graph matching, and unifies the acquisition and matching of keypoints into one complete framework.
The implementation principle of the planar target tracking method based on center point detection and graph matching in the embodiment of the invention is shown in FIG. 1. First, a central positioning network predicts the center point of the tracked target in the current frame, and an initial target region is determined around it; intuitively, this gives a reliable guess of the initial position of the tracked object and creates good initial conditions for the matching stage. Then, the template image and the target region are modeled as a complete graph consisting of two subgraphs, the two subgraphs corresponding to the two regions respectively. Next, the correspondence between the two point sets is established by a deep graph matching network. Finally, the homography is calculated from the matched key-point pairs by the RANSAC (Random Sample Consensus) algorithm.
The process of training the central positioning network provided by the embodiment of the invention is shown in fig. 2, and comprises the following processing procedures:
step one: the training splits of the public tracking image datasets MSCOCO2017, GOT-10K and LaSOT are obtained. During data preprocessing, an area 2^2 times the size of the target in the template image is taken as the template region, and an area 5^2 times the size of the target in the search image is taken as the search region; these are scaled to 128 × 128 and 320 × 320 respectively, i.e. the input images have the formats [C1, H1, W1] = [3, 128, 128] and [C2, H2, W2] = [3, 320, 320], where C denotes the number of channels, H denotes the height of the image and W denotes the width of the image. Several means of data augmentation are used for the search region, including horizontal flipping, brightness perturbation and center perturbation.
Step two: and (3) taking ResNet50 with the last layer4, the last firing layer and the FC layer removed as a backbone to extract the features of the image, wherein the extracted feature dimension is 1024. And then, performing dimension reduction on the extracted features by adopting a 1 × 1 convolution kernel, wherein the dimension of the features after dimension reduction is 256. Then, the element of each position in the feature map is position-coded using an unsalable sine-cosine code.
Step three: flattening the characteristic vectors of the two parts and splicing the two parts together along the spatial dimension to obtain the characteristic vectors
Figure BDA0003755485240000091
And into the Encoder module, where d =256. The Encoder of the Encoder can enhance the original characteristics through the self-attention and the cross attention and capture the corresponding relation between the original characteristics, thereby obtaining the capacity of distinguishing the space position of the target. Feature vector of search region encoded by Encode
Figure BDA0003755485240000092
The union query vector q ∈ R 1×d Attention is paid again, and the position information of the target in the search area is decoded, as shown in formula (1).
Figure BDA0003755485240000093
Here, the decoded information
Figure BDA0003755485240000094
Then, feature vector f' x Will become through dimension transformation
Figure BDA0003755485240000095
And sent to a stacked full convolution network. Through the full convolution network, the channel dimension of f is reduced to 1, and a probability graph of central point position prediction is obtained
Figure BDA0003755485240000101
We calculate the expected values of the probability map distribution in the grid coordinate space to obtain the predicted target center point, as shown in equation (2).
Figure BDA0003755485240000102
Step four: training is carried out by using l1loss as a loss function, and the specific formula is shown as (3), wherein
Figure BDA0003755485240000103
And
Figure BDA0003755485240000104
respectively representing predicted target center point and real target center point labels. AdamW is used as an optimizer, and parameters of the network model are optimized according to the loss value.
Figure BDA0003755485240000105
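A minimal training-step sketch for the central positioning network, assuming PyTorch; the model interface, learning rate and batch assembly are illustrative assumptions.

import torch

def train_step(model, optimizer, template, search, gt_center):
    """One optimization step: predict the center on the search region and
    supervise it with an L1 loss against the ground-truth center (formula (3))."""
    optimizer.zero_grad()
    _, pred_center = model(template, search)          # [B, 2] predicted (c_x, c_y)
    loss = torch.nn.functional.l1_loss(pred_center, gt_center)
    loss.backward()
    optimizer.step()
    return loss.item()

# optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-4)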
The process of training the deep graph matching network provided by the embodiment of the invention is shown in fig. 3, and comprises the following processing procedures:
step one: a public graph matching dataset is obtained, the dataset comprising template images (P), search images (Q), the key points U^P of the template image and their descriptors V^P, the key points U^Q of the search image and their descriptors V^Q, and the correspondence (M) between the key points in the template image and the key points in the search image. During data preprocessing, the size of the images is uniformly adjusted to 256 × 256, i.e. the input image size is [C, H, W] = [3, 256, 256], where C denotes the number of channels, H denotes the height of the image and W denotes the width of the image.
Step two: modeling the template image P and the search image Q into graphs according to the Delaunay triangulation algorithm, respectively expressed as
Figure BDA00037554852400001010
And
Figure BDA00037554852400001011
wherein
Figure BDA00037554852400001012
The position of the vertex is represented and,
Figure BDA00037554852400001013
the edges are represented as a function of time,
Figure BDA00037554852400001014
representing the characteristics of the vertices and epsilon the characteristics of the edges. Then according to the ratio between two point sets (
Figure BDA00037554852400001015
And
Figure BDA00037554852400001016
) The feature similarity of the two sub-graphs constructs a cross edge and connects the two sub-graphs to form a complete graph
Figure BDA00037554852400001017
Figure BDA00037554852400001018
Specifically, for any one
Figure BDA00037554852400001019
We are in
Figure BDA00037554852400001020
Selecting top-k points with the most similar appearance to the points and building edges. Appearance similarity is defined as:
Figure BDA00037554852400001021
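An illustrative sketch of the cross-edge construction between the two subgraphs; since the similarity formula is not reproduced above, a normalized inner product (cosine similarity) of the vertex descriptors is assumed here.

import torch

def build_cross_edges(desc_p, desc_q, k=5):
    """For every vertex of graph G^P, connect it to the top-k vertices of G^Q
    whose descriptors are most similar (cosine similarity assumed)."""
    p = torch.nn.functional.normalize(desc_p, dim=1)    # [Np, D]
    q = torch.nn.functional.normalize(desc_q, dim=1)    # [Nq, D]
    sim = p @ q.t()                                     # [Np, Nq] appearance similarity
    topk = sim.topk(min(k, q.shape[0]), dim=1).indices  # [Np, k]
    edges = [(i, int(j)) for i in range(p.shape[0]) for j in topk[i]]
    return edges, sim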
step three: the graph first aggregates and updates the node information along all edges, as shown in equations (4) and (5).

m_v^(t+1) = Σ_{w ∈ N(v)} M_V(h_w^t, e_{v→w}^t)    (4)

h_v^(t+1) = U_V(h_v^t, m_v^(t+1))    (5)

Here N(v) denotes the neighbours of node v, M_V denotes the node-information aggregation function, h_w^t and e_{v→w}^t denote the information of node w and of edge v → w at the t-th pass, m_v^(t+1) denotes the neighbour information of node v at the (t+1)-th pass, and U_V denotes the node update function. Simply speaking, the information of the edges connected to each node is aggregated to obtain the neighbour information, and then the neighbour information and the information of the original node are fused and updated as the new state of the node.

After the node states are updated, the graph updates the edge states, again in the two steps of aggregation and update, as shown in equations (6) and (7).

m_{v→w}^(t+1) = M_E(h_v^t, h_w^t)    (6)

e_{v→w}^(t+1) = U_E(e_{v→w}^t, m_{v→w}^(t+1))    (7)

Here h_v^t and h_w^t denote the feature vectors of the source node and the destination node of edge v → w at the t-th pass, M_E denotes the edge-information transfer function, e_{v→w}^t denotes the state of edge v → w at the t-th pass, m_{v→w}^(t+1) denotes the neighbour information of the edge at the (t+1)-th pass, and U_E denotes the edge update function. Simply speaking, the information of the source node and the destination node of each edge is aggregated to obtain the neighbour information, and then the neighbour information and the original edge information are fused and updated as the new state of the edge. The aggregation functions (M_V, M_E) and the update functions (U_V, U_E) in equations (4)-(7) are implemented by an MLP that includes a linear layer, a ReLU layer and a LayerNorm layer. After the state of the graph is updated, a linear layer is used to estimate the matching confidence of the corresponding nodes from the features on the cross edges, and a score matrix S is obtained.
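A compact sketch of one node/edge message-passing round corresponding to equations (4)-(7), with the aggregation and update functions implemented as linear + ReLU + LayerNorm MLPs as stated above; the layer sizes and class name are assumptions.

import torch
import torch.nn as nn

def mlp(in_dim, out_dim):
    # Aggregation/update functions: linear layer, ReLU and LayerNorm, as in the text.
    return nn.Sequential(nn.Linear(in_dim, out_dim), nn.ReLU(inplace=True), nn.LayerNorm(out_dim))

class MessagePassingLayer(nn.Module):
    def __init__(self, d=256):
        super().__init__()
        self.M_V, self.U_V = mlp(2 * d, d), mlp(2 * d, d)   # node aggregation / update
        self.M_E, self.U_E = mlp(2 * d, d), mlp(2 * d, d)   # edge aggregation / update

    def forward(self, h, e, edges):
        """h: [N, d] node states; e: [M, d] edge states; edges: [M, 2] LongTensor (src, dst)."""
        src, dst = edges[:, 0], edges[:, 1]
        # (4)-(5): aggregate the messages of incident edges per node, then update the node states.
        msg = self.M_V(torch.cat([h[dst], e], dim=1))        # messages along edges v -> w
        agg = torch.zeros_like(h).index_add_(0, src, msg)    # sum over the neighbours of each node
        h_new = self.U_V(torch.cat([h, agg], dim=1))
        # (6)-(7): aggregate the source/destination node states per edge, then update the edge states.
        e_msg = self.M_E(torch.cat([h[src], h[dst]], dim=1))
        e_new = self.U_E(torch.cat([e, e_msg], dim=1))
        return h_new, e_new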
Step four: using l2loss with weight as the loss function, the specific formula is shown in (9), where λ is the hyper-parameter for balancing the positive and negative samples, and is set to 50 in the experiment, and s and M represent the predicted fractional matrix and the true match matrix label, respectively. Adam is used as an optimizer, and parameters of the network model are optimized through back propagation according to the loss value.
Figure BDA00037554852400001111
Figure BDA0003755485240000121
A training flowchart of a center positioning network according to an embodiment of the present invention is shown in fig. 4, a training flowchart of a graph matching network according to an embodiment of the present invention is shown in fig. 5, a processing flowchart of a planar target tracking method based on center point detection and graph matching according to an embodiment of the present invention is shown in fig. 6, and the processing method includes the following processing procedures:
step one: data input and tracker initialization. The input to the model is a sequence of consecutive video frames, the first of which is the template frame; the region where the object is located is called the template. To avoid computing the template multiple times in the forward propagation, we store both the features extracted by the ResNet50 in the central positioning network for the template region and the subgraph of the corresponding template in the graph matching network. The template region is the region whose center point is the center of the template and whose width and height are 2 times the width and height of the template respectively. At the same time, we use the position offset of the template in the first frame as the initial motion parameter.
Step two: and (5) processing by a central positioning network. At this stage, we first perform an inverse transformation on the read image by using the motion parameters tracked in the previous frame to obtain a resampled image. The last tracked position will then also correspond to a quadrilateral area in the resampled image. The center of the quadrangle is taken as a central point, and the width and the height of the template which are 5 times are taken as sizes to cut, fill and zoom the resampled image to obtain a search area. Then, the template area and the search area are sent to a central positioning network to obtain the predicted target central point position (c) x ,c y ). Finally, with (c) x ,c y ) Cut out for the centerAnd a region with the same size as the template is used as the positioned target region.
Step three: and (5) processing by a graph matching network. In step two we obtain an initial rectangular target area. The method comprises the steps of extracting characteristic points from the images by using SuperPoint, and modeling the extracted characteristic points into a graph by using a Delaunay triangulation algorithm. And then combining and modeling the sub-images corresponding to the template images in the step one into a complete graph. And sending the constructed graph into a graph matching network for information transmission, aggregation and updating. After the state of the graph is updated, a linear layer is used for estimating the matching confidence of the corresponding nodes from the features on the crossed edges, and a fractional matrix is obtained. Finally, a greedy algorithm is used for processing the fractional matrix to obtain a matching matrix with the value of 0 or 1.
Step four: and obtaining the corresponding relation between the template image and the characteristic points in the target area by the matching matrix in the third step. We filter out outlers that bring large differences using the RANSAC algorithm and then estimate a transformation matrix using the remaining pairs of feature point matches. And performing geometric transformation on the initial position of the template by using the transformation matrix to obtain the predicted position of the target in the current frame.
Step five: and (4) loss processing. In target tracking, it is a common situation that a target loss occurs. In order to improve the tracking accuracy, a loss detection and relocation mechanism is added into a tracking framework. If the number of elements with confidence above 0.9 in the fractional matrix is less than 4, we consider that object loss has occurred because at least 4 pairs of matched feature points are needed to compute the perspective transformation. When a target loss occurs, we start the relocation mechanism. When an object is lost, the motion parameters of the previous frame are generally not trusted. Therefore, the motion parameters of the previous frame are not used for carrying out inverse transformation on the current frame image, and the search area is directly determined on the current frame image according to the tracked position of the previous frame, input into the center positioning network and carried out the subsequent steps.
The proposed planar target tracking method based on central point detection and graph matching was compared with other state-of-the-art algorithms on the public POT-210 dataset through experiments, which demonstrates the effectiveness of the method provided by the invention. The results show that the proposed method is more robust to the different motion states of the target, holds a leading advantage in handling partial occlusion, motion blur and unconstrained scenes, and achieves more accurate tracking.
Table 1: Comparison with other methods on the POT-210 dataset
In summary, the embodiment of the present invention provides a planar target tracking method that is more robust to the motion state of the target, and the performance of the proposed method in fast-motion and motion-blur scenes is greatly improved. Experimental data show that the proposed tracking method improves performance under scaling, rotation, perspective transformation, motion blur, partial occlusion and unconstrained scenes, and obtains particularly large gains under partial occlusion, motion blur and unconstrained scenes.
The invention decomposes the planar target tracking task into two steps: first an initial, coarse-grained target region of the tracked object is predicted, and then the accurate position of the target is obtained by refinement with the graph matching network, thereby improving tracking accuracy.
The central positioning network provided by the invention can first locate the initial position of the target even when the tracked target has a large position offset, so the search space of the model can be reduced effectively at a small additional computational cost, the occurrence of target loss is reduced, and the approach is more robust than directly estimating the final position of the target. The graph matching network provided by the invention models the problem representation as a graph; because the graph structure retains a certain structural invariance across consecutive frames, the stability of tracking is improved. In general, the two-stage tracking strategy provided by the invention achieves a more robust tracking effect for the different motion states of the target through the pre-localization and graph matching techniques, and obtains particularly good performance in large-scale motion and unconstrained scenes.

Those of ordinary skill in the art will understand that the figures are merely schematic representations of one embodiment, and the blocks or flows in the figures are not necessarily required to practice the present invention.
From the above description of the embodiments, it is clear to those skilled in the art that the present invention can be implemented by software plus necessary general hardware platform. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which may be stored in a storage medium, such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method according to the embodiments or some parts of the embodiments.
The embodiments in the present specification are described in a progressive manner; the same and similar parts among the embodiments may be referred to each other, and each embodiment focuses on its differences from the other embodiments. In particular, for apparatus or system embodiments, since they are substantially similar to the method embodiments, the description is relatively simple, and for relevant parts reference may be made to the partial description of the method embodiments. The above-described embodiments of the apparatus and system are merely illustrative: the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units, i.e. they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the embodiment. Those of ordinary skill in the art can understand and implement it without inventive effort.
The above description is only for the preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention. Therefore, the protection scope of the present invention should be subject to the protection scope of the claims.

Claims (6)

1. A planar target tracking method based on central point detection and graph matching is characterized by comprising the following steps:
predicting the central point of a tracking target in the current frame by using a central positioning network, and determining an initial target area according to the predicted central point;
modeling the template image and the target region as a complete graph consisting of two subgraphs, the two subgraphs corresponding to the two regions respectively, and predicting the matching matrix between the template image and the target region by using a deep graph matching network;
and estimating the geometric transformation of the target from the template image to the current image from the matching pair identified by the matching matrix by using a RANSAC algorithm to obtain the predicted position of the tracking target in the current frame.
2. The method of claim 1, wherein the predicting the center point of the tracking target in the current frame using the center positioning network further comprises training the center positioning network, the training process comprising:
step one: acquiring the training split of a public tracking image dataset; during data preprocessing, taking an area 2^2 times the size of the target in the template image as the template region and an area 5^2 times the size of the target in the search image as the search region, and scaling them to 128 × 128 and 320 × 320 respectively, i.e. the input images have the formats [C1, H1, W1] = [3, 128, 128] and [C2, H2, W2] = [3, 320, 320], where C denotes the number of channels, H denotes the height of the image and W denotes the width of the image;
step two: adopting a ResNet50 with the last stage (layer4), the last pooling layer and the FC layer removed as the backbone network to extract image features, the extracted feature dimension being 1024; adopting a 1 × 1 convolution kernel to reduce the dimension of the extracted features, the feature dimension after reduction being 256; and using a non-learnable sine-cosine code to positionally encode the element at each position in the feature map;
step three: flattening the feature vectors of the two parts and splicing them together along the spatial dimension to obtain the feature vector f ∈ R^((h1·w1 + h2·w2) × d), where h_i × w_i are the spatial sizes of the two feature maps, and sending it into an Encoder module, where d = 256; the Encoder enhances the original features and captures the correspondence between them through self-attention and cross-attention, thereby obtaining the ability to distinguish the spatial position of the target; the feature vector of the search region encoded by the Encoder, f_x ∈ R^((h2·w2) × d), is attended again together with a query vector q ∈ R^(1×d), and the position information of the target in the search region is decoded, as shown in formula (1):

f'_x = Dec(f_x, q)    (1)

the decoded information is f'_x ∈ R^((h2·w2) × d); the feature vector f'_x is reshaped by a dimension change to f ∈ R^(d × h2 × w2) and sent to a stacked fully convolutional network, which reduces the channel dimension of f to 1 and yields a probability map P ∈ R^(h2 × w2) of the predicted center-point position; the expected value of the probability map distribution is calculated in the grid coordinate space to obtain the predicted target center point, as shown in formula (2):

(ĉ_x, ĉ_y) = Σ_(x,y) (x, y) · P(x, y)    (2)
step four: training with the L1 loss as the loss function, the specific formula being shown in (3), where ĉ_i and c_i denote the predicted target center point and the ground-truth target center-point label respectively; AdamW is adopted as the optimizer, and the parameters of the network model are optimized according to the loss value:

L_center = (1/N) Σ_i ||ĉ_i − c_i||_1    (3)
and obtaining the trained central positioning network.
3. The method of claim 1, wherein predicting a center point of a tracked target in the current frame using a center location network and determining an initial target region based on the predicted center point comprises:
inputting continuous video frames into the central positioning network, wherein the first frame is the template frame and the region in which the object is located is called the template; the features extracted from the template region by the ResNet50 in the central positioning network are stored to avoid repeated computation; the template region is the region whose center point is the center of the template and whose width and height are 2 times the width and height of the template respectively; and the position offset of the template in the first frame is taken as the initial motion parameter;

during tracking, the currently read image is first inversely transformed with the motion parameters tracked in the previous frame to obtain a resampled image, and the position tracked at the previous moment correspondingly maps to a quadrilateral region in the resampled image; taking the center of this quadrilateral as the center point and 5 times the width and height of the template as the size, the resampled image is cropped, padded and scaled to obtain the search region; the template region and the search region are sent into the central positioning network to obtain the predicted target center-point position (c_x, c_y), and a region of the same size as the template is cut out with (c_x, c_y) as its center to serve as the located initial target region.
4. The method of claim 1, wherein the predicting of the matching matrix between the template image and the target region using the deep graph matching network further comprises training the deep graph matching network, the training process comprising:
step one: obtaining a public graph matching dataset, the dataset comprising template images (P), search images (Q), the key points U^P of the template image and their descriptors (V^P), the key points U^Q of the search image and their descriptors (V^Q), and the correspondence (M) between the key points in the template image and the key points in the search image; during data preprocessing the size of the images is uniformly adjusted to 256 × 256, i.e. the input image size is [C, H, W] = [3, 256, 256], where C denotes the number of channels, H denotes the height of the image and W denotes the width of the image;
step two: modeling the template image P and the search image Q as graphs according to the Delaunay triangulation algorithm, denoted G^P = (U^P, E^P, V^P, ε^P) and G^Q = (U^Q, E^Q, V^Q, ε^Q) respectively, where U denotes the vertex positions, E denotes the edges, V denotes the features of the vertices and ε denotes the features of the edges; according to the feature similarity between the two point sets (V^P and V^Q), cross edges are constructed and the two subgraphs are connected into one complete graph G = G^P ∪ G^Q; specifically, for any vertex v ∈ G^P, the top-k points in G^Q whose appearance is most similar to the point v are selected to build edges, the appearance similarity being computed from the vertex features;
step three: the graph first aggregates and updates the node information along all edges, as shown in equations (4) and (5):

m_v^(t+1) = Σ_{w ∈ N(v)} M_V(h_w^t, e_{v→w}^t)    (4)

h_v^(t+1) = U_V(h_v^t, m_v^(t+1))    (5)

where N(v) denotes the neighbours of node v, M_V denotes the node-information aggregation function, h_w^t and e_{v→w}^t denote the information of node w and of edge v → w at the t-th pass, m_v^(t+1) denotes the neighbour information of node v at the (t+1)-th pass, and U_V denotes the node update function; the information of the edges connected to each node is aggregated to obtain the neighbour information, which is then fused with the information of the original node and updated as the new state of the node;

after the node states are updated, the graph updates the edge states, again in the two steps of aggregation and update, as shown in formulas (6) and (7):

m_{v→w}^(t+1) = M_E(h_v^t, h_w^t)    (6)

e_{v→w}^(t+1) = U_E(e_{v→w}^t, m_{v→w}^(t+1))    (7)

where h_v^t and h_w^t denote the feature vectors of the source node and the destination node of edge v → w at the t-th pass, M_E denotes the edge-information transfer function, e_{v→w}^t denotes the state of edge v → w at the t-th pass, m_{v→w}^(t+1) denotes the neighbour information of the edge at the (t+1)-th pass, and U_E denotes the edge update function; the information of the source node and the destination node of each edge is aggregated to obtain the neighbour information, which is then fused with the original edge information and updated as the new state of the edge;
step four: using a weighted L2 loss as the loss function, the specific formula being shown in (9), where λ is a hyper-parameter for balancing positive and negative samples, and S and M denote the predicted score matrix and the ground-truth matching matrix label respectively; Adam is adopted as the optimizer, and the parameters of the network model are optimized by back-propagation according to the loss value:

L_match = Σ_{i,j} (λ · M_ij + (1 − M_ij)) · (S_ij − M_ij)^2    (9)
5. The method of claim 4, wherein the modeling of the template image and the target region as a complete graph consisting of two subgraphs, one subgraph for each region, and the predicting of the matching matrix between the template image and the target region using the deep graph matching network comprise:
extracting feature points from the target region with a SuperPoint network; modeling the extracted feature points as a graph with the Delaunay triangulation algorithm; combining this graph with the subgraph corresponding to the template image and modeling them as one complete graph; sending the complete graph into the graph matching network for information transfer, aggregation and updating; after the state of the graph is updated, estimating the matching confidence of the corresponding nodes from the features on the cross edges with a linear layer to obtain a score matrix; and binarizing the score matrix with a greedy algorithm to obtain the matching matrix between the template image and the target region, whose values are 0 or 1.
6. The method of claim 5, wherein estimating the geometric transformation of the target from the template image to the current image from the matching pair identified by the matching matrix using the RANSAC algorithm to obtain the predicted position of the tracked target in the current frame comprises:
obtaining the matching pairs between feature points of the template image and of the target region according to the matching matrix; filtering outliers from the matching pairs with the RANSAC algorithm; estimating a transformation matrix from the remaining matched feature-point pairs; and applying to the initial position of the template the geometric transformation from the template image to the current image given by this transformation matrix, to obtain the predicted position of the tracked target in the current frame;
if the number of elements whose confidence is higher than 0.9 in the score matrix is less than 4, the target is considered lost and a relocation mechanism is started: the search region is determined directly on the current frame image according to the position tracked in the previous frame, input into the central positioning network, and the subsequent steps are executed.
CN202210853244.9A 2022-07-20 2022-07-20 Planar target tracking method based on central point detection and graph matching Pending CN115239763A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210853244.9A CN115239763A (en) 2022-07-20 2022-07-20 Planar target tracking method based on central point detection and graph matching

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210853244.9A CN115239763A (en) 2022-07-20 2022-07-20 Planar target tracking method based on central point detection and graph matching

Publications (1)

Publication Number Publication Date
CN115239763A true CN115239763A (en) 2022-10-25

Family

ID=83674314

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210853244.9A Pending CN115239763A (en) 2022-07-20 2022-07-20 Planar target tracking method based on central point detection and graph matching

Country Status (1)

Country Link
CN (1) CN115239763A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116572264A (en) * 2023-05-22 2023-08-11 中铁九局集团电务工程有限公司 Soft mechanical arm free eye system target tracking method based on light weight model
CN116572264B (en) * 2023-05-22 2024-06-04 中铁九局集团电务工程有限公司 Soft mechanical arm free eye system target tracking method based on light weight model
CN118071831A (en) * 2024-04-10 2024-05-24 北京阿丘机器人科技有限公司 Image coarse positioning method, device and computer readable storage medium
CN118071831B (en) * 2024-04-10 2024-07-30 北京阿丘机器人科技有限公司 Image coarse positioning method, device and computer readable storage medium

Similar Documents

Publication Publication Date Title
Yamaguchi et al. Robust monocular epipolar flow estimation
CN110866953A (en) Map construction method and device, and positioning method and device
US11170202B2 (en) Apparatus and method for performing 3D estimation based on locally determined 3D information hypotheses
CN108537844B (en) Visual SLAM loop detection method fusing geometric information
Yang et al. A bundled-optimization model of multiview dense depth map synthesis for dynamic scene reconstruction
CN105409207A (en) Feature-based image set compression
Chen et al. Towards part-aware monocular 3d human pose estimation: An architecture search approach
CN111402412A (en) Data acquisition method and device, equipment and storage medium
Cao et al. Fast and robust feature tracking for 3D reconstruction
Pintore et al. Deep3dlayout: 3d reconstruction of an indoor layout from a spherical panoramic image
KR102464271B1 (en) Pose acquisition method, apparatus, electronic device, storage medium and program
CN113298871B (en) Map generation method, positioning method, system thereof, and computer-readable storage medium
CN115239763A (en) Planar target tracking method based on central point detection and graph matching
CN117011137B (en) Image stitching method, device and equipment based on RGB similarity feature matching
GB2571307A (en) 3D skeleton reconstruction from images using volumic probability data
CN112085842B (en) Depth value determining method and device, electronic equipment and storage medium
Zhang et al. Unsupervised learning of monocular depth and ego-motion with space–temporal-centroid loss
Knorr et al. A modular scheme for 2D/3D conversion of TV broadcast
CN109951705B (en) Reference frame synthesis method and device for vehicle object coding in surveillance video
Zhu et al. Video stabilization based on image registration
Cai et al. Consistent Depth Prediction for Transparent Object Reconstruction from RGB-D Camera
Li et al. FeatFlow: Learning geometric features for 3D motion estimation
Yang et al. An improved belief propagation method for dynamic collage
Porzi et al. An automatic image-to-DEM alignment approach for annotating mountains pictures on a smartphone
Zhang et al. Joint motion model for local stereo video-matching method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination