CN117710688A - Target tracking method and system based on convolution and attention combination feature extraction - Google Patents


Info

Publication number
CN117710688A
CN117710688A (application CN202311697673.2A)
Authority
CN
China
Prior art keywords
feature
local
image
global
convolution
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202311697673.2A
Other languages
Chinese (zh)
Other versions
CN117710688B (en)
Inventor
王员云
孙传雨
王军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanchang Institute of Technology
Original Assignee
Nanchang Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanchang Institute of Technology filed Critical Nanchang Institute of Technology
Priority to CN202311697673.2A priority Critical patent/CN117710688B/en
Publication of CN117710688A publication Critical patent/CN117710688A/en
Application granted granted Critical
Publication of CN117710688B publication Critical patent/CN117710688B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/40: Extraction of image or video features
    • G06V10/44: Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V10/443: Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components, by matching or filtering
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/0464: Convolutional networks [CNN, ConvNet]
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/40: Extraction of image or video features
    • G06V10/42: Global feature extraction by analysis of the whole pattern, e.g. using frequency domain transformations or autocorrelation
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77: Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80: Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806: Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level, of extracted features
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82: Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00: Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07: Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Molecular Biology (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a target tracking method and system based on combined convolution and attention feature extraction, relating to the field of computer vision and image processing, and comprising the following steps: initializing a given first frame image and the search area of each subsequent frame image respectively to obtain a target template image and search images; constructing a feature extraction network whose sub-networks each comprise a convolution module, a plurality of hierarchical feature modules and a fully connected layer connected in sequence, where each hierarchical feature module comprises a global branch that extracts global information using self-attention and a local branch that extracts local information using an operation combining attention and convolution; applying the feature extraction network to the target template image and each search image to extract features, and performing a convolution operation on the extracted features to obtain the target response position of the search area in each subsequent frame image. The invention improves target tracking efficiency and accuracy through a lightweight feature extraction network combining convolution and attention.

Description

Target tracking method and system based on convolution and attention combination feature extraction
Technical Field
The invention relates to the field of computer vision and image processing, in particular to a target tracking method and system based on a lightweight feature extraction network combining convolution and attention.
Background
Target tracking is an important research direction in computer vision; its purpose is to estimate the position of an object in a video sequence given the initial state of the tracked object. Target tracking is now widely applied in vision fields such as visual localization, autonomous driving systems and smart cities. Although target tracking has achieved numerous results, the task still faces problems such as illumination change, scale change, background interference, object occlusion, and motion blur; the performance of existing methods has still not reached an ideal state, and designing a high-precision tracker remains a challenging task.
Convolutional neural networks and attention are the two main techniques in target tracking, both with superior performance in feature learning. A convolutional neural network can effectively extract local information from the relevant feature map through neighborhood convolution operations; however, its limited receptive field makes it difficult to capture global dependencies. The attention mechanism uses image blocks as the input representation and applies a weighted-average operation over the context of the input features, which can effectively capture global information, but blind similarity matching between neighboring image blocks can lead to high redundancy.
Trackers based on convolutional neural networks and trackers based on attention are two different approaches, and both achieve good tracking performance. Specifically, some trackers use convolutional neural networks as the feature extraction network to extract the features of the template branch and the search branch, and then compute similarity scores of the features through an attention-based feature fusion network to perform tracking. Other trackers use attention-based transformers as the feature extraction and feature fusion networks, respectively, to achieve high-performance tracking. However, the large models typical of attention-based transformer networks can hurt tracking performance and speed. The invention therefore provides a target tracking method and system based on combined convolution and attention feature extraction.
Disclosure of Invention
The invention aims to provide a target tracking method and system based on combined convolution and attention feature extraction, which can improve target tracking efficiency and accuracy through a lightweight feature extraction network combining convolution and attention.
In order to achieve the above object, the present invention provides the following solutions:
a target tracking method based on convolution and attention combination feature extraction, comprising:
respectively initializing a given first frame image and the search area in each subsequent frame image to obtain a target template image corresponding to the first frame image and a search image corresponding to each subsequent frame image;
constructing a feature extraction network; the feature extraction network comprises two feature extraction sub-networks with the same structure, used respectively to extract the features of the target template image and of the search images; each feature extraction sub-network comprises a convolution module, a plurality of hierarchical feature modules connected in series, and a first fully connected layer; the hierarchical feature modules are denoted in sequence as the first hierarchical feature module, the second hierarchical feature module, ..., the M-th hierarchical feature module; the input of the convolution module is the first frame image or a subsequent frame image, the output of the convolution module is the input of the first hierarchical feature module, the output of the M-th hierarchical feature module is the input of the first fully connected layer, and the output of the first fully connected layer is the feature extracted by the feature extraction network; each hierarchical feature module comprises a layer normalization branch, a global branch, a local branch and a fusion branch; the input of the layer normalization branch is the output of the convolution module; the output of the layer normalization branch is the input of both the global branch and the local branch; the global branch extracts global information using a self-attention operation; the local branch extracts local information using an operation combining attention and convolution; the fusion branch fuses the global information output by the global branch with the local information output by the local branch;
applying the corresponding feature extraction sub-network to the target template image and to each search image respectively to perform feature extraction, obtaining target template image features and search image features;
and inputting the extracted features of the target template image into a tracking model, and performing a convolution operation between the extracted features of each search image and the result output by the tracking model to obtain the target response position of the search area in each subsequent frame image.
Optionally, the feature extraction sub-network is applied to the target template image and each search image to perform feature extraction, which specifically includes:
cutting the target template image and the search image into a plurality of blocks respectively to obtain a cut target template image and a cut search image; the blocks in the cut target template image and the blocks in the cut search image overlap;
carrying out convolution processing on the image to be processed by utilizing the convolution module to obtain a feature after convolution; the image to be processed is the cut target template image or the cut search image;
inputting the features output by the (m-1)-th hierarchical feature module into the layer normalization branch of the m-th hierarchical feature module to obtain normalized features;
inputting the normalized features into the global branch of the m-th hierarchical feature module to obtain the global information;
inputting the normalized features into the local branch of the m-th hierarchical feature module to obtain the local information;
passing the global information and the local information through the fusion branch of the m-th hierarchical feature module to obtain a fused feature, where m = 1, 2, ..., M; when m = 1, the feature output by the (m-1)-th hierarchical feature module is the convolved feature;
judging whether m is equal to M; if not, setting m = m + 1 and returning to the step of inputting the features output by the (m-1)-th hierarchical feature module into the layer normalization branch of the m-th hierarchical feature module; if so, inputting the features output by the M-th hierarchical feature module into the first fully connected layer to obtain the extracted features of the target template image or of the search image; when the image to be processed is the cut target template image, the first fully connected layer outputs the extracted features of the target template image; when the image to be processed is the cut search image, the first fully connected layer outputs the extracted features of the search image.
Optionally, inputting the normalized features into the global branch of the m-th hierarchical feature module to obtain the global information specifically includes:
performing a linear transformation operation on the normalized features to obtain three feature maps, denoted respectively as the global query Q, the global key K and the global value V;
performing a downsampling operation on the global key K and the global value V;
and performing a standard attention operation on the global query Q, the downsampled key K and the downsampled value V to obtain the global information.
Optionally, inputting the normalized features into the local branch of the m-th hierarchical feature module to obtain the local information specifically includes:
performing a linear transformation operation on the normalized features to obtain three feature maps, denoted respectively as a local query Q, a local key K and a local value V;
performing a globally weight-shared depthwise convolution operation on the local query Q, the local key K and the local value V respectively to obtain a query Q local aggregation feature, a key K local aggregation feature and a value V local aggregation feature;
performing a Hadamard product operation on the query Q local aggregation feature and the key K local aggregation feature to obtain a product result;
passing the product result sequentially through a second fully connected layer, a first activation layer, a third fully connected layer and a second activation layer to obtain context-aware information;
and performing a Hadamard product operation on the context-aware information and the value V local aggregation feature to obtain the local information.
Optionally, passing the global information and the local information through the fusion branch of the m-th hierarchical feature module to obtain the fused feature specifically includes:
performing a cascade operation on the global information and the local information to obtain cascaded features;
and passing the cascaded features through a fourth fully connected layer to obtain the fused features.
The invention provides a target tracking system based on convolution and attention combination feature extraction, which comprises:
the initialization module is used for respectively initializing a given first frame image and a search area in each subsequent frame image to obtain a target template image corresponding to the first frame image and a search image corresponding to each subsequent frame image;
the feature extraction network construction module, used to construct a feature extraction network; the feature extraction network comprises two feature extraction sub-networks with the same structure, used respectively to extract the features of the target template image and of the search images; each feature extraction sub-network comprises a convolution module, a plurality of hierarchical feature modules connected in series, and a first fully connected layer; the hierarchical feature modules are denoted in sequence as the first hierarchical feature module, the second hierarchical feature module, ..., the M-th hierarchical feature module; the input of the convolution module is the first frame image or a subsequent frame image, the output of the convolution module is the input of the first hierarchical feature module, the output of the M-th hierarchical feature module is the input of the first fully connected layer, and the output of the first fully connected layer is the feature extracted by the feature extraction network; each hierarchical feature module comprises a layer normalization branch, a global branch, a local branch and a fusion branch; the input of the layer normalization branch is the output of the convolution module; the output of the layer normalization branch is the input of both the global branch and the local branch; the global branch extracts global information using a self-attention operation; the local branch extracts local information using an operation combining attention and convolution; the fusion branch fuses the global information output by the global branch with the local information output by the local branch;
the feature extraction module, used to apply the corresponding feature extraction sub-network to the target template image and to each search image respectively to perform feature extraction, obtaining target template image features and search image features;
and the target tracking module, used to input the extracted features of the target template image into a tracking model and perform a convolution operation between the extracted features of each search image and the result output by the tracking model, obtaining the target response position of the search area in each subsequent frame image.
Optionally, the feature extraction module specifically includes:
the image cutting unit is used for cutting the target template image and the search image into a plurality of blocks respectively to obtain a cut target template image and a cut search image; the blocks in the cut target template image and the blocks in the cut search image overlap;
the convolution operation unit is used for carrying out convolution processing on the image to be processed by utilizing the convolution module to obtain a feature after convolution; the image to be processed is the cut target template image or the cut search image;
the normalization unit, used to input the features output by the (m-1)-th hierarchical feature module into the layer normalization branch of the m-th hierarchical feature module to obtain normalized features;
the global feature extraction unit, used to input the normalized features into the global branch of the m-th hierarchical feature module to obtain the global information;
the local feature extraction unit, used to input the normalized features into the local branch of the m-th hierarchical feature module to obtain the local information;
the feature fusion unit, used to pass the global information and the local information through the fusion branch of the m-th hierarchical feature module to obtain fused features, where m = 1, 2, ..., M; when m = 1, the feature output by the (m-1)-th hierarchical feature module is the convolved feature;
and the judging unit, used to judge whether m is equal to M; if not, set m = m + 1 and return to the step of inputting the features output by the (m-1)-th hierarchical feature module into the layer normalization branch of the m-th hierarchical feature module; if so, input the features output by the M-th hierarchical feature module into the first fully connected layer to obtain the extracted features of the target template image or of the search image; when the image to be processed is the cut target template image, the first fully connected layer outputs the extracted features of the target template image; when the image to be processed is the cut search image, the first fully connected layer outputs the extracted features of the search image.
Optionally, the global feature extraction unit specifically includes:
the first linear transformation subunit, used to perform a linear transformation operation on the normalized features to obtain three feature maps, denoted respectively as the global query Q, the global key K and the global value V;
the downsampling subunit, used to perform a downsampling operation on the global key K and the global value V;
and the global feature extraction subunit, used to perform a standard attention operation on the global query Q, the downsampled key K and the downsampled value V to obtain the global information.
Optionally, the local feature extraction unit specifically includes:
the second linear transformation subunit, used to perform a linear transformation operation on the normalized features to obtain three feature maps, denoted respectively as a local query Q, a local key K and a local value V;
the depthwise convolution subunit, used to perform a globally weight-shared depthwise convolution operation on the local query Q, the local key K and the local value V respectively, obtaining the query Q local aggregation feature, the key K local aggregation feature and the value V local aggregation feature;
the Hadamard product subunit, used to perform a Hadamard product operation on the query Q local aggregation feature and the key K local aggregation feature to obtain a product result;
the series processing subunit, used to pass the product result sequentially through the second fully connected layer, the first activation layer, the third fully connected layer and the second activation layer to obtain context-aware information;
and the local feature extraction subunit, used to perform a Hadamard product operation on the context-aware information and the value V local aggregation feature to obtain the local information.
Optionally, the feature fusion unit specifically includes:
a cascade operation subunit, configured to perform a cascade operation on the global information and the local information to obtain cascaded features;
and a fully connected layer processing subunit, configured to pass the cascaded features through a fourth fully connected layer to obtain the fused features.
According to the specific embodiment provided by the invention, the invention discloses the following technical effects:
the invention provides a target tracking method and system based on combined convolution and attention feature extraction, which combine the fast local representation learning of convolutional neural networks with the global modeling capability of attention. The hierarchical feature module consists of a global branch and a local branch: the global branch uses self-attention to extract global information, reducing the computation required by attention and enlarging the global receptive field; the attention-style convolution in the local branch exploits globally shared weights and context-aware weights for local perception, effectively aggregating local feature information. The method performs feature extraction in a manner that combines convolution and attention, improving the efficiency and accuracy of target tracking.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions of the prior art, the drawings that are needed in the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flowchart of a target tracking method based on convolution and attention combination feature extraction according to embodiment 1 of the present invention;
FIG. 2 is a schematic diagram of the structure of a target tracking framework of the lightweight feature extraction network based on convolution and attention combination according to embodiment 1 of the present invention;
fig. 3 is a schematic diagram of a hierarchical feature module principle structure of object tracking of a lightweight feature extraction network based on combination of convolution and attention according to embodiment 1 of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The invention aims to provide a target tracking method and system based on combined convolution and attention feature extraction, which combine the advantage of fast local feature learning in convolutional neural networks with the global modeling capability of attention-based transformers, solving the problem that the limited receptive field of convolutional neural networks makes it difficult for convolution-based trackers to capture global dependencies, and also the problem that the large models of attention-based trackers hurt tracking performance and speed.
In order that the above-recited objects, features and advantages of the present invention will become more readily apparent, a more particular description of the invention will be rendered by reference to the appended drawings and appended detailed description.
Example 1
As shown in fig. 1, the present embodiment provides a target tracking method based on convolution and attention combination feature extraction, including:
s1: and initializing the given first frame image and the search area in each subsequent frame image respectively to obtain a target template image corresponding to the first frame image and a search image corresponding to each subsequent frame image.
S2: and constructing a feature extraction network.
As shown in fig. 2, the feature extraction network includes two feature extraction sub-networks with identical structures, used respectively to extract the features of the target template image and of the search images. Each feature extraction sub-network comprises a convolution module, a plurality of hierarchical feature modules connected in series, and a first fully connected layer; the hierarchical feature modules are denoted in sequence as the first hierarchical feature module, the second hierarchical feature module, ..., the M-th hierarchical feature module. The input of the convolution module is the first frame image or a subsequent frame image, the output of the convolution module is the input of the first hierarchical feature module, the output of the M-th hierarchical feature module is the input of the first fully connected layer, and the output of the first fully connected layer is the feature extracted by the feature extraction network.
Each hierarchical feature module is a lightweight module based on the combination of convolution and attention, and hierarchical features are extracted through these modules. Each hierarchical feature module comprises a layer normalization branch, a global branch, a local branch and a fusion branch; the input of the layer normalization branch is the output of the convolution module; the output of the layer normalization branch is the input of both the global branch and the local branch; the global branch extracts global information using a self-attention operation; the local branch extracts local information using an operation combining attention and convolution; the fusion branch fuses the global information output by the global branch with the local information output by the local branch.
In fig. 2, the convolution module uses 3×3 convolutions and learns the target template image features and the search image features through convolution operations, obtaining the template features and the search features respectively. The number of hierarchical feature modules is 4 (M = 4); the specific number may be set as required. The first fully connected layer is an average-pooling fully connected layer.
As shown in the hierarchical feature module schematic of fig. 3, the module includes a global branch that performs attention operations and a local branch that mixes attention and convolution, with a depthwise convolution added to the local branch.
S3: and respectively applying the characteristic extraction sub-network to the target template image and each search image to extract the characteristics so as to obtain the extraction characteristics of the target template image and the extraction characteristics of the search image.
The step S3 specifically comprises the following steps:
s31: cutting the target template image and the search image into a plurality of blocks respectively to obtain a cut target template image and a cut search image; the blocks in the cut target template image and the blocks in the cut search image overlap.
S32: carrying out 3×3 convolution processing on the image to be processed by utilizing the convolution module to obtain a feature after convolution; the image to be processed is the cut target template image or the cut search image.
Since the cut target template image and the cut search image are processed by their respective feature extraction sub-networks in the same way, steps S32 to S37 describe the feature extraction process for only one of the two images; in practice, each of the two images passes through its own feature extraction sub-network following steps S32 to S37.
S33: and inputting the features output by the m-1 th layered feature module into the layer normalization branch of the m-th layered feature module to obtain normalization features.
Step S31 to step S33:
The template image and the search image are converted into overlapping blocks, and projection mapping is performed using 3×3 convolutions (the two 3×3 convolutions in fig. 2). After each convolution, batch normalization is applied, followed by an activation function:
F_{i,j} = GELU(BN(Conv_{3×3}(x)))
x = BN(Conv_{3×3}(F_{i,j})) + x
where BN denotes batch normalization, GELU is the activation function, x is the initialized template image or search image, F_{i,j} is the template feature or search feature obtained through the convolution operation (the convolved feature), i is the horizontal pixel coordinate, and j is the vertical pixel coordinate.
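For illustration, the two formulas above might be sketched in PyTorch as follows. This is a minimal sketch, not the patent's implementation; the class name, channel width and stride are assumptions:

```python
import torch
import torch.nn as nn

class ConvStem(nn.Module):
    """Sketch of the stem: F = GELU(BN(Conv3x3(x))),
    followed by a residual refinement x = BN(Conv3x3(F)) + x."""
    def __init__(self, in_ch=3, dim=64):
        super().__init__()
        # a strided 3x3 convolution turns the image into overlapping blocks
        self.proj = nn.Sequential(
            nn.Conv2d(in_ch, dim, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm2d(dim),
            nn.GELU(),
        )
        self.conv = nn.Conv2d(dim, dim, kernel_size=3, stride=1, padding=1)
        self.bn = nn.BatchNorm2d(dim)

    def forward(self, x):                 # x: (B, 3, H, W)
        x = self.proj(x)                  # F_{i,j} = GELU(BN(Conv_3x3(x)))
        return self.bn(self.conv(x)) + x  # x = BN(Conv_3x3(F_{i,j})) + x
```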
S34: and inputting the normalized feature to the global branch of the mth hierarchical feature to obtain the global information.
The step S34 specifically includes:
s341: and performing linear transformation operation on the normalized features to obtain three feature graphs, which are respectively marked as global query Q, global key K and global value V.
S342: and performing a downsampling operation on the global key K and the global value V.
S343: and performing linear standard attention operation on the global query Q, the downsampled key K and the downsampled value V to obtain the global information.
As shown in fig. 3, in the global branch, layer normalization and a linear transformation layer (a fully connected layer, which is a kind of linear transformation layer) are first applied to obtain Q, K and V; next, K and V are downsampled; finally, standard attention is performed on Q and the downsampled K and V to extract the global feature information:
Q, K, V = FC(LN(F_{i,j}))
F_global = Attention(Q_g, Pool(K_g), Pool(V_g))
where F_global is the global feature extracted by the hierarchical module, Attention is the standard attention operation, Q_g, K_g and V_g are the query Q, key K and value V of the global branch (the subscript g denotes the global branch), Pool is the pooling layer that downsamples K_g and V_g, LN denotes layer normalization, and FC denotes the fully connected layer.
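A minimal single-head PyTorch sketch of this global branch follows; the pooling size, the single-head simplification and all names are assumptions, while the structure (linear projection to Q, K and V, pooled K and V, standard attention) follows the formulas above:

```python
import torch
import torch.nn as nn

class GlobalBranch(nn.Module):
    """Sketch of the global branch: Q, K, V = FC(LN(F)); K and V are
    downsampled by pooling before standard attention, which shrinks the
    attention matrix and the amount of computation."""
    def __init__(self, dim, pool_size=2):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim)   # the linear transformation layer
        self.pool = nn.AvgPool2d(pool_size)  # downsampling of K and V
        self.scale = dim ** -0.5

    def forward(self, x):                    # x: (B, H, W, C), already layer-normalized
        B, H, W, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # pool K and V spatially, then flatten them into token sequences
        k = self.pool(k.permute(0, 3, 1, 2)).flatten(2).transpose(1, 2)
        v = self.pool(v.permute(0, 3, 1, 2)).flatten(2).transpose(1, 2)
        q = q.reshape(B, H * W, C)
        attn = (q @ k.transpose(-2, -1)) * self.scale   # standard attention scores
        attn = attn.softmax(dim=-1)
        return (attn @ v).reshape(B, H, W, C)           # F_global
```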
S35: and inputting the normalized feature to the local branch of the mth hierarchical feature to obtain the local information.
The step S35 specifically includes:
s351: and performing linear transformation operation on the normalized features to obtain three feature graphs, which are respectively marked as a local query Q, a local key K and a local value V.
S352: and respectively executing the deep convolution operation of the global weight sharing on the local query Q, the local key K and the local value V to obtain the local aggregation characteristic of the query Q, the local aggregation characteristic of the key K and the local aggregation characteristic of the value V.
S353: and carrying out Hardmard product operation on the query Q local aggregation feature and the key K local aggregation feature to obtain a product operation result.
S354: and sequentially passing the product operation result through the second complete connection layer, the first activation layer, the second complete connection layer and the second activation layer to obtain context sensing information.
S355: and carrying out Hardmard product operation on the context awareness information and the value V local aggregation feature to obtain the local information.
As shown in fig. 3, the attention-style convolution operator is the key module enabling the high performance of the hierarchical attention model, and it includes some standard attention operations. A linear transformation is first applied to obtain Q, K and V, which then undergo a local feature aggregation process with shared weights. For V, a depthwise convolution whose weights are globally shared is used to aggregate local information:
V_s = DWConv(V)
where DWConv is the depthwise convolution and V_s is the value V after aggregating local information; since the weights of the depthwise convolution are shared, this is a shared-weight local feature aggregation of V.
After aggregating the local information of V with shared weights, Q and K are combined to generate context-aware weights. Specifically, two DWConv are first used to aggregate the local information of Q and K respectively. Then the Hadamard product of Q and K is computed, and a series of transformations is applied to obtain context-aware weights between -1 and 1. Finally, the generated weights are used to enhance the local features:
Q_l = DWConv(Q)
K_l = DWConv(K)
Attn = Tanh(FC(Swish(FC(Q_l ⊙ K_l))) / √d)
F_local = Attn ⊙ V_s
where d is the number of channels and ⊙ denotes the Hadamard product. The attention-style convolution introduces a stronger nonlinearity: Q_l is Q after local information aggregation, K_l is K after local information aggregation, and F_local is the local feature. Swish is an activation function that works better than ReLU in deep models, and Tanh is an activation function that compresses its input into the interval (-1, 1).
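A PyTorch sketch of this attention-style convolution, following the formulas above; the depthwise kernel size and all names are assumptions, and Swish is realized with torch's SiLU:

```python
import torch
import torch.nn as nn

class LocalBranch(nn.Module):
    """Sketch of the local branch: depthwise convolutions with globally
    shared weights aggregate local information of Q, K and V; the Hadamard
    product Q_l * K_l passes through FC -> Swish -> FC -> Tanh to produce
    context-aware weights in (-1, 1) that re-weight V_s."""
    def __init__(self, dim, kernel_size=3):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim)
        pad = kernel_size // 2
        self.dw_q = nn.Conv2d(dim, dim, kernel_size, padding=pad, groups=dim)
        self.dw_k = nn.Conv2d(dim, dim, kernel_size, padding=pad, groups=dim)
        self.dw_v = nn.Conv2d(dim, dim, kernel_size, padding=pad, groups=dim)
        self.fc1 = nn.Linear(dim, dim)
        self.fc2 = nn.Linear(dim, dim)
        self.swish = nn.SiLU()
        self.scale = dim ** -0.5             # 1 / sqrt(d), d = channel number

    def forward(self, x):                    # x: (B, H, W, C), already layer-normalized
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        as_map = lambda t: t.permute(0, 3, 1, 2)   # (B, C, H, W) for DWConv
        as_feat = lambda t: t.permute(0, 2, 3, 1)  # back to (B, H, W, C)
        q_l = as_feat(self.dw_q(as_map(q)))  # Q_l = DWConv(Q)
        k_l = as_feat(self.dw_k(as_map(k)))  # K_l = DWConv(K)
        v_s = as_feat(self.dw_v(as_map(v)))  # V_s = DWConv(V)
        attn = self.fc2(self.swish(self.fc1(q_l * k_l)))
        attn = torch.tanh(attn * self.scale) # context-aware weights in (-1, 1)
        return attn * v_s                    # F_local = Attn * V_s
```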
S36: the global information and the local information are subjected to the fusion branch of the mth hierarchical feature to obtain a fusion feature; m=1, 2,; when m=1, the feature output by the m-1 th hierarchical feature module is the convolved feature.
And connecting the output of the global branch characteristic with the output characteristic of the local branch in the template branch, and obtaining the template branch output characteristic of the lightweight hierarchical characteristic module based on volume and attention through a full connection layer.
And connecting the output of the global branch characteristic with the output characteristic of the local branch in the search branch, and obtaining the search branch output characteristic of the lightweight hierarchical characteristic module based on volume and attention through a full connection layer.
In step S36, the local and global features are connected using a simple method, fused with the global branch output features. Specifically, the local features and the global features in the channel dimension are connected, and the template branch layering features and the search branch layering features are output through a complete connection layer. The method comprises the following steps:
F t =Concat(F global ,F local )
F out =FC(F t )
Wherein F is t Is a feature which merges local features and global features, F out Is a hierarchical feature of the output, concat (-) represents a cascading operation. FC denotes a fully connected layer (fourth fully connected layer).
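Tying the branches together, one hierarchical feature module might be sketched as follows, reusing the GlobalBranch and LocalBranch sketches above; the absence of a residual connection mirrors the formulas, which specify only concatenation followed by a fully connected layer:

```python
import torch
import torch.nn as nn

class HierarchicalFeatureModule(nn.Module):
    """Sketch of one hierarchical feature module: layer normalization feeds
    the global and local branches; their outputs are concatenated along the
    channel dimension and fused, F_out = FC(Concat(F_global, F_local))."""
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim)        # the layer normalization branch
        self.global_branch = GlobalBranch(dim)
        self.local_branch = LocalBranch(dim)
        self.fuse = nn.Linear(2 * dim, dim)  # the fourth fully connected layer

    def forward(self, x):                    # x: (B, H, W, C)
        y = self.norm(x)
        f_t = torch.cat([self.global_branch(y), self.local_branch(y)], dim=-1)
        return self.fuse(f_t)                # F_out = FC(F_t)
```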
S37: it is determined whether M is equal to M.
If not, let m=m+1, and return to step "input the feature output by the m-1 th hierarchical feature module to the layer normalization branch of the m-th hierarchical feature module".
If yes, inputting the features output by the Mth hierarchical feature module to the first complete connection layer to obtain the target template image extraction features or the search image extraction features. When the image to be processed is the cut target template image, the first complete connection layer outputs the extracted features of the target template image; when the image to be processed is the cut search image, the first complete connection layer outputs the extracted features of the search image.
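Steps S32 to S37 can then be read as one forward pass through a sub-network like the sketch below, reusing the ConvStem and HierarchicalFeatureModule sketches above; the channel width, depth and output dimension are assumptions. Note that for the correlation in step S4 a spatial feature map is typically kept; the pooled vector here simply mirrors the "average-pooling fully connected layer" wording:

```python
import torch
import torch.nn as nn

class FeatureExtractionSubNetwork(nn.Module):
    """Sketch of one feature extraction sub-network: the convolutional stem,
    M = 4 hierarchical feature modules in series, then average pooling and
    the first fully connected layer."""
    def __init__(self, dim=64, depth=4, out_dim=256):
        super().__init__()
        self.stem = ConvStem(in_ch=3, dim=dim)
        self.stages = nn.Sequential(
            *[HierarchicalFeatureModule(dim) for _ in range(depth)])
        self.fc = nn.Linear(dim, out_dim)    # the first fully connected layer

    def forward(self, x):                    # x: (B, 3, H, W)
        x = self.stem(x).permute(0, 2, 3, 1) # to (B, H, W, C) for the modules
        x = self.stages(x)
        x = x.mean(dim=(1, 2))               # average pooling over space
        return self.fc(x)                    # extracted feature
```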
S4: and inputting the extracted features of the target template image into a tracking model, and carrying out convolution operation on the extracted features of each search image and the result output by the tracking model respectively to obtain the target response position of the search area in each subsequent frame image.
And merging the output result of the template branch output characteristics after the tracking model and the search branch output characteristics by utilizing convolution operation to obtain the response position of the target in the search area, and realizing accurate tracking under the condition of reducing the calculated amount. The tracking model is an existing model under the DIMP50 tracking framework, and the invention improves on a feature extraction network under the DIMP50 tracking framework. Wherein, DIMP: learning Discriminative Model Prediction forTracking (tracking of learning discriminant model predictions).
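The correlation step itself (not the DiMP model predictor, which is an existing model and is omitted here) might be sketched as follows, assuming spatial template and search feature maps:

```python
import torch
import torch.nn.functional as F

def locate_target(template_feat: torch.Tensor, search_feat: torch.Tensor):
    """Sketch: use the (model-refined) template feature map (1, C, Ht, Wt)
    as a convolution kernel over the search feature map (1, C, Hs, Ws);
    the peak of the response map gives the target position."""
    score = F.conv2d(search_feat, template_feat)    # cross-correlation
    h, w = score.shape[-2:]
    idx = int(score.flatten().argmax())
    return divmod(idx, w)                           # (row, col) of the peak

# usage sketch (dimensions are illustrative):
# template_feat = torch.randn(1, 256, 8, 8)
# search_feat = torch.randn(1, 256, 22, 22)
# row, col = locate_target(template_feat, search_feat)
```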
This embodiment provides a lightweight feature extraction network based on the combination of convolution and attention, combining the fast local representation learning of convolutional neural networks with the global modeling capability of attention. The corresponding template features and search features are obtained through the convolution operation and serve respectively as the inputs of the hierarchical feature modules. The hierarchical feature module consists of a global branch and a local branch: the global branch uses self-attention to extract global information, reducing the computation required by attention and enlarging the global receptive field; the attention-style convolution in the local branch exploits globally shared weights and context-aware weights for local perception, effectively aggregating local feature information. The hierarchical feature module combines the features output by the global and local branches (concatenation followed by a fully connected layer) to obtain output features that aggregate local and global context information. By using this lightweight feature extraction network, the invention can reduce the computational load of the tracker.
Example 2
The present embodiment provides a target tracking system based on convolution and attention combination feature extraction, including:
the initialization module 100 is configured to initialize a given first frame image and a search area in each subsequent frame image, respectively, to obtain a target template image corresponding to the first frame image and a search image corresponding to each subsequent frame image.
A feature extraction network construction module 200, used to construct a feature extraction network; the feature extraction network comprises two feature extraction sub-networks with the same structure, used respectively to extract the features of the target template image and of the search images; each feature extraction sub-network comprises a convolution module, a plurality of hierarchical feature modules connected in series, and a first fully connected layer; the hierarchical feature modules are denoted in sequence as the first hierarchical feature module, the second hierarchical feature module, ..., the M-th hierarchical feature module; the input of the convolution module is the first frame image or a subsequent frame image, the output of the convolution module is the input of the first hierarchical feature module, the output of the M-th hierarchical feature module is the input of the first fully connected layer, and the output of the first fully connected layer is the feature extracted by the feature extraction network; each hierarchical feature module comprises a layer normalization branch, a global branch, a local branch and a fusion branch; the input of the layer normalization branch is the output of the convolution module; the output of the layer normalization branch is the input of both the global branch and the local branch; the global branch extracts global information using a self-attention operation; the local branch extracts local information using an operation combining attention and convolution; the fusion branch fuses the global information output by the global branch with the local information output by the local branch.
And the feature extraction module 300 is configured to apply the feature extraction sub-network to the target template image and each search image to perform feature extraction, so as to obtain target template image extraction features and search image extraction features.
The feature extraction module 300 specifically includes:
an image cutting unit 301, configured to cut the target template image and the search image into a plurality of blocks, to obtain a cut target template image and a cut search image; the blocks in the cut target template image and the blocks in the cut search image overlap.
The convolution operation unit 302 is configured to perform convolution processing on the image to be processed by using the convolution module, so as to obtain a feature after convolution; the image to be processed is the cut target template image or the cut search image.
The normalization unit 303, configured to input the features output by the (m-1)-th hierarchical feature module into the layer normalization branch of the m-th hierarchical feature module to obtain normalized features.
The global feature extraction unit 304, configured to input the normalized features into the global branch of the m-th hierarchical feature module to obtain the global information.
Specifically, the global feature extraction unit 304 includes:
and the first linear transformation subunit is used for performing linear transformation operation on the normalized features to obtain three feature graphs, which are respectively marked as global query Q, global key K and global value V.
And the downsampling operation subunit is used for executing downsampling operation on the global key K and the global value V.
And the global feature extraction subunit is used for performing linear standard attention operation on the global query Q, the downsampled key K and the downsampled value V to obtain the global information.
A local feature extraction unit 305, configured to input the normalized features into the local branch of the m-th hierarchical feature module to obtain the local information.
Specifically, the local feature extraction unit 305 includes:
a second linear transformation subunit, configured to perform a linear transformation operation on the normalized features to obtain three feature maps, denoted respectively as a local query Q, a local key K and a local value V;
a depthwise convolution subunit, configured to perform a globally weight-shared depthwise convolution operation on the local query Q, the local key K and the local value V respectively, obtaining the query Q local aggregation feature, the key K local aggregation feature and the value V local aggregation feature;
a Hadamard product subunit, configured to perform a Hadamard product operation on the query Q local aggregation feature and the key K local aggregation feature to obtain a product result;
a series processing subunit, configured to pass the product result sequentially through the second fully connected layer, the first activation layer, the third fully connected layer and the second activation layer to obtain context-aware information;
and a local feature extraction subunit, configured to perform a Hadamard product operation on the context-aware information and the value V local aggregation feature to obtain the local information.
A feature fusion unit 306, configured to pass the global information and the local information through the fusion branch of the m-th hierarchical feature module to obtain fused features, where m = 1, 2, ..., M; when m = 1, the feature output by the (m-1)-th hierarchical feature module is the convolved feature.
Specifically, the feature fusion unit 306 includes:
a cascade operation subunit, configured to perform a cascade operation on the global information and the local information to obtain cascaded features;
and a fully connected layer processing subunit, configured to pass the cascaded features through a fourth fully connected layer to obtain the fused features.
A judging unit 307, configured to judge whether m is equal to M; if not, set m = m + 1 and return to the step of inputting the features output by the (m-1)-th hierarchical feature module into the layer normalization branch of the m-th hierarchical feature module; if so, input the features output by the M-th hierarchical feature module into the first fully connected layer to obtain the extracted features of the target template image or of the search image; when the image to be processed is the cut target template image, the first fully connected layer outputs the extracted features of the target template image; when the image to be processed is the cut search image, the first fully connected layer outputs the extracted features of the search image.
And the target tracking module 400 is configured to input the extracted features of the target template image to a tracking model, and perform convolution operation on each extracted feature of the search image and a result output by the tracking model, so as to obtain a target response position of the search area in each subsequent frame image.
Example 3
The present embodiment provides an electronic device including a memory for storing a computer program and a processor that runs the computer program to cause the electronic device to execute the target tracking method based on the convolution and attention combination feature extraction of embodiment 1.
Alternatively, the electronic device may be a server.
In addition, an embodiment of the present invention further provides a computer-readable storage medium storing a computer program that, when executed by a processor, implements the target tracking method based on the convolution and attention combination feature extraction of embodiment 1.
Embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
Each embodiment in this specification focuses on its differences from the other embodiments; for identical or similar parts, the embodiments may refer to one another. Since the system disclosed in an embodiment corresponds to the method disclosed in an embodiment, its description is relatively brief, and the relevant points can be found in the description of the method.
Specific examples are used herein to illustrate the principles and implementations of the present invention; the above description of the embodiments is intended only to help in understanding the method of the present invention and its core ideas. Meanwhile, those of ordinary skill in the art may, in accordance with the ideas of the present invention, make changes to the specific implementations and the scope of application. In summary, the contents of this specification should not be construed as limiting the invention.

Claims (10)

1. A target tracking method based on convolution and attention combination feature extraction, comprising:
respectively initializing a given first frame image and a search area in each subsequent frame image to obtain a target template image corresponding to the first frame image and a search image corresponding to each subsequent frame image;
constructing a feature extraction network; the feature extraction network comprises two feature extraction sub-networks with the same structure; the two feature extraction sub-networks are used respectively to extract the features of the target template image and of the search images; each feature extraction sub-network comprises a convolution module, a plurality of hierarchical feature modules connected in series, and a first fully connected layer; the hierarchical feature modules are denoted in sequence as the first hierarchical feature module, the second hierarchical feature module, ..., the M-th hierarchical feature module; the input of the convolution module is the first frame image or a subsequent frame image, the output of the convolution module is the input of the first hierarchical feature module, the output of the M-th hierarchical feature module is the input of the first fully connected layer, and the output of the first fully connected layer is the feature extracted by the feature extraction network; each hierarchical feature module comprises a layer normalization branch, a global branch, a local branch and a fusion branch; the input of the layer normalization branch is the output of the convolution module; the output of the layer normalization branch is the input of both the global branch and the local branch; the global branch extracts global information using a self-attention operation; the local branch extracts local information using an operation combining attention and convolution; the fusion branch fuses the global information output by the global branch with the local information output by the local branch;
applying the corresponding feature extraction sub-network to the target template image and to each search image respectively to perform feature extraction, obtaining target template image features and search image features;
and inputting the extracted features of the target template image into a tracking model, and performing a convolution operation between the extracted features of each search image and the result output by the tracking model to obtain the target response position of the search area in each subsequent frame image.
2. The target tracking method based on combined convolution and attention feature extraction according to claim 1, wherein applying the corresponding feature extraction sub-network to the target template image and to each search image to perform feature extraction specifically comprises:
cutting the target template image and the search image into a plurality of blocks respectively to obtain a cut target template image and a cut search image; the blocks in the cut target template image and the blocks in the cut search image overlap;
carrying out convolution processing on an image to be processed by using the convolution module to obtain a convolved feature; the image to be processed is the cut target template image or the cut search image;
inputting the feature output by the (m-1)-th hierarchical feature module into the layer normalization branch of the m-th hierarchical feature module to obtain a normalized feature;
inputting the normalized feature into the global branch of the m-th hierarchical feature module to obtain the global information;
inputting the normalized feature into the local branch of the m-th hierarchical feature module to obtain the local information;
passing the global information and the local information through the fusion branch of the m-th hierarchical feature module to obtain a fused feature; wherein m = 1, 2, …, M; when m = 1, the feature output by the (m-1)-th hierarchical feature module is the convolved feature;
judging whether m is equal to M; if not, setting m = m + 1 and returning to the step of inputting the feature output by the (m-1)-th hierarchical feature module into the layer normalization branch of the m-th hierarchical feature module; if yes, inputting the feature output by the M-th hierarchical feature module into the first fully connected layer to obtain the target template image extraction features or the search image extraction features; when the image to be processed is the cut target template image, the first fully connected layer outputs the target template image extraction features; when the image to be processed is the cut search image, the first fully connected layer outputs the search image extraction features.
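
The iterative flow of claim 2 (layer normalization first, then parallel global and local branches, then fusion, repeated for m = 1, 2, …, M) can be outlined as follows. This is a structural sketch only: the stub branches, token layout, and dimensions are assumptions, and the real global and local branches are those detailed in claims 3 and 4:

```python
# Structural sketch of one hierarchical feature module and the M-stage loop.
# nn.Identity() stands in for the global (self-attention) and local
# (attention + convolution) branches described in claims 3 and 4.
import torch
import torch.nn as nn

class HierarchicalFeatureModule(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim)            # layer normalization branch
        self.global_branch = nn.Identity()       # placeholder: claim 3
        self.local_branch = nn.Identity()        # placeholder: claim 4
        self.fuse = nn.Linear(2 * dim, dim)      # fusion branch: claim 5

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n = self.norm(x)                         # normalized feature
        g = self.global_branch(n)                # global information
        l = self.local_branch(n)                 # local information
        return self.fuse(torch.cat([g, l], dim=-1))

M, dim = 4, 96
stages = nn.Sequential(*[HierarchicalFeatureModule(dim) for _ in range(M)])
head = nn.Linear(dim, dim)                       # first fully connected layer
tokens = torch.randn(1, 196, dim)                # output of the convolution module
extracted = head(stages(tokens))                 # (1, 196, dim)
```
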
3. The target tracking method based on convolution and attention combination feature extraction according to claim 2, wherein the inputting of the normalized feature into the global branch of the m-th hierarchical feature module to obtain the global information specifically comprises:
performing a linear transformation operation on the normalized feature to obtain three feature maps, which are respectively denoted as a global query Q, a global key K and a global value V;
performing a downsampling operation on the global key K and the global value V;
and performing a linear standard attention operation on the global query Q, the downsampled key K and the downsampled value V to obtain the global information.
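
A hedged sketch of this global branch follows. The claim fixes neither the downsampling ratio nor the exact linear attention variant; the strided average pooling and the ELU-based kernel feature map below are assumptions chosen for concreteness:

```python
# Sketch under stated assumptions: one linear projection yields Q, K, V;
# K and V are downsampled along the token axis; a softmax-free (linear)
# attention then aggregates global information in O(N) time.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GlobalBranch(nn.Module):
    def __init__(self, dim: int, sr_ratio: int = 2):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim)       # linear transformation -> Q, K, V
        self.pool = nn.AvgPool1d(sr_ratio, stride=sr_ratio)  # downsampling (assumed)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        q, k, v = self.qkv(x).chunk(3, dim=-1)   # each (B, N, C)
        k = self.pool(k.transpose(1, 2)).transpose(1, 2)     # (B, N/r, C)
        v = self.pool(v.transpose(1, 2)).transpose(1, 2)
        q, k = F.elu(q) + 1, F.elu(k) + 1        # positive kernel feature map
        kv = torch.einsum('bnc,bnd->bcd', k, v)  # (B, C, C) global summary
        z = 1.0 / (torch.einsum('bnc,bc->bn', q, k.sum(1)) + 1e-6)
        out = torch.einsum('bnc,bcd,bn->bnd', q, kv, z)      # global information
        return self.proj(out)

print(GlobalBranch(64)(torch.randn(2, 196, 64)).shape)       # (2, 196, 64)
```
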
4. The target tracking method based on convolution and attention combination feature extraction according to claim 2, wherein the inputting of the normalized feature into the local branch of the m-th hierarchical feature module to obtain the local information specifically comprises:
performing a linear transformation operation on the normalized feature to obtain three feature maps, which are respectively denoted as a local query Q, a local key K and a local value V;
respectively performing a depth-wise convolution operation with globally shared weights on the local query Q, the local key K and the local value V to obtain a query Q local aggregation feature, a key K local aggregation feature and a value V local aggregation feature;
performing a Hadamard product operation on the query Q local aggregation feature and the key K local aggregation feature to obtain a product operation result;
sequentially passing the product operation result through a second fully connected layer, a first activation layer, a third fully connected layer and a second activation layer to obtain context-aware information;
and performing a Hadamard product operation on the context-aware information and the value V local aggregation feature to obtain the local information.
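
A sketch of this local branch is given below. Modelling the globally weight-shared depth-wise convolution as a single 3x3 depth-wise convolution applied identically to Q, K and V, and choosing GELU and sigmoid as the two activation layers, are assumptions; the claim leaves these unspecified:

```python
# Sketch under stated assumptions: Q, K, V share one depth-wise conv (local
# aggregation); Q*K passes through FC -> activation -> FC -> activation to
# form context-aware information, which gates V via a Hadamard product.
import torch
import torch.nn as nn

class LocalBranch(nn.Module):
    def __init__(self, dim: int, hidden: int = 128):
        super().__init__()
        self.qkv = nn.Conv2d(dim, 3 * dim, 1)    # linear transformation -> Q, K, V
        self.dwconv = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)  # shared weights
        self.fc1 = nn.Conv2d(dim, hidden, 1)     # "second fully connected layer"
        self.act1 = nn.GELU()                    # first activation layer (assumed)
        self.fc2 = nn.Conv2d(hidden, dim, 1)     # "third fully connected layer"
        self.act2 = nn.Sigmoid()                 # second activation layer (assumed)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        q, k, v = self.qkv(x).chunk(3, dim=1)    # x: (B, C, H, W) normalized feature
        q, k, v = self.dwconv(q), self.dwconv(k), self.dwconv(v)  # local aggregation
        ctx = self.act2(self.fc2(self.act1(self.fc1(q * k))))     # context-aware info
        return ctx * v                           # Hadamard with V -> local information

print(LocalBranch(64)(torch.randn(2, 64, 14, 14)).shape)          # (2, 64, 14, 14)
```
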
5. The target tracking method based on convolution and attention combination feature extraction according to claim 1, wherein the passing of the global information and the local information through the fusion branch of the m-th hierarchical feature module to obtain the fused feature specifically comprises:
performing a cascade operation on the global information and the local information to obtain a cascaded feature;
and passing the cascaded feature through a fourth fully connected layer to obtain the fused feature.
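
This fusion branch reduces to a channel-wise concatenation followed by a single fully connected layer; a minimal sketch follows (PyTorch and channel-last tokens assumed, matching the stub used in the module sketch above):

```python
# Minimal sketch: concatenate global and local information along the channel
# axis, then project back to the original dimension with one linear layer.
import torch
import torch.nn as nn

class FusionBranch(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.fc = nn.Linear(2 * dim, dim)        # "fourth fully connected layer"

    def forward(self, g: torch.Tensor, l: torch.Tensor) -> torch.Tensor:
        return self.fc(torch.cat([g, l], dim=-1))  # cascaded -> fused feature

g, l = torch.randn(2, 196, 64), torch.randn(2, 196, 64)
print(FusionBranch(64)(g, l).shape)              # (2, 196, 64)
```
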
6. A target tracking system based on convolution and attention combination feature extraction, comprising:
the initialization module is used for respectively initializing a given first frame image and a search area in each subsequent frame image to obtain a target template image corresponding to the first frame image and a search image corresponding to each subsequent frame image;
the feature extraction network construction module is used for constructing a feature extraction network; the feature extraction network comprises two feature extraction sub-networks with the same structure, which are used for extracting features from the target template image and the search image respectively; each feature extraction sub-network comprises a convolution module, a plurality of hierarchical feature modules sequentially connected in series, and a first fully connected layer; the plurality of hierarchical feature modules are sequentially denoted as a first hierarchical feature module, a second hierarchical feature module, …, and an M-th hierarchical feature module; the input of the convolution module is the first frame image or each subsequent frame image, the output of the convolution module is the input of the first hierarchical feature module, the output of the M-th hierarchical feature module is the input of the first fully connected layer, and the output of the first fully connected layer is the feature extracted by the feature extraction network; each hierarchical feature module comprises a layer normalization branch, a global branch, a local branch and a fusion branch; the input of the layer normalization branch is the output of the convolution module; the output of the layer normalization branch is fed to both the global branch and the local branch; the global branch is used for extracting global information by means of a self-attention operation; the local branch is used for extracting local information by means of an operation combining attention and convolution; and the fusion branch is used for fusing the global information output by the global branch and the local information output by the local branch;
the feature extraction module is used for inputting the target template image and each search image into their respective feature extraction sub-networks to perform feature extraction, so as to obtain target template image extraction features and search image extraction features;
and the target tracking module is used for inputting the target template image extraction features into a tracking model, and performing a convolution operation between the extraction features of each search image and the output of the tracking model, respectively, so as to obtain the target response position of the search area in each subsequent frame image.
7. The target tracking system based on convolution and attention combination feature extraction according to claim 6, wherein the feature extraction module specifically comprises:
the image cutting unit is used for cutting the target template image and the search image into a plurality of blocks respectively to obtain a cut target template image and a cut search image; the blocks in the cut target template image and the blocks in the cut search image overlap;
the convolution operation unit is used for carrying out convolution processing on an image to be processed by using the convolution module to obtain a convolved feature; the image to be processed is the cut target template image or the cut search image;
the normalization unit is used for inputting the feature output by the (m-1)-th hierarchical feature module into the layer normalization branch of the m-th hierarchical feature module to obtain a normalized feature;
the global feature extraction unit is used for inputting the normalized feature into the global branch of the m-th hierarchical feature module to obtain the global information;
the local feature extraction unit is used for inputting the normalized feature into the local branch of the m-th hierarchical feature module to obtain the local information;
the feature fusion unit is used for passing the global information and the local information through the fusion branch of the m-th hierarchical feature module to obtain a fused feature; wherein m = 1, 2, …, M; when m = 1, the feature output by the (m-1)-th hierarchical feature module is the convolved feature;
and the judging unit is used for judging whether m is equal to M; if not, setting m = m + 1 and returning to the step of inputting the feature output by the (m-1)-th hierarchical feature module into the layer normalization branch of the m-th hierarchical feature module; if yes, inputting the feature output by the M-th hierarchical feature module into the first fully connected layer to obtain the target template image extraction features or the search image extraction features; when the image to be processed is the cut target template image, the first fully connected layer outputs the target template image extraction features; when the image to be processed is the cut search image, the first fully connected layer outputs the search image extraction features.
8. The target tracking system based on convolution and attention combination feature extraction according to claim 7, wherein the global feature extraction unit specifically comprises:
the first linear transformation subunit is used for performing a linear transformation operation on the normalized feature to obtain three feature maps, which are respectively denoted as a global query Q, a global key K and a global value V;
a downsampling operation subunit, configured to perform a downsampling operation on the global key K and the global value V;
and the global feature extraction subunit is used for performing a linear standard attention operation on the global query Q, the downsampled key K and the downsampled value V to obtain the global information.
9. The target tracking system based on convolution and attention combination feature extraction according to claim 7, wherein the local feature extraction unit specifically comprises:
the second linear transformation subunit is used for performing a linear transformation operation on the normalized feature to obtain three feature maps, which are respectively denoted as a local query Q, a local key K and a local value V;
the depth-wise convolution operation subunit is used for respectively performing a depth-wise convolution operation with globally shared weights on the local query Q, the local key K and the local value V to obtain a query Q local aggregation feature, a key K local aggregation feature and a value V local aggregation feature;
the Hadamard product operation subunit is used for performing a Hadamard product operation on the query Q local aggregation feature and the key K local aggregation feature to obtain a product operation result;
the serial processing subunit is used for sequentially passing the product operation result through the second fully connected layer, the first activation layer, the third fully connected layer and the second activation layer, so as to obtain the context-aware information;
and the local feature extraction subunit is used for performing a Hadamard product operation on the context-aware information and the value V local aggregation feature to obtain the local information.
10. The target tracking system based on convolution and attention combination feature extraction according to claim 6, wherein the feature fusion unit specifically comprises:
the cascade operation subunit is used for performing a cascade operation on the global information and the local information to obtain a cascaded feature;
and the fully connected layer processing subunit is used for passing the cascaded feature through a fourth fully connected layer to obtain the fused feature.
CN202311697673.2A 2023-12-12 2023-12-12 Target tracking method and system based on convolution and attention combination feature extraction Active CN117710688B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311697673.2A 2023-12-12 2023-12-12 Target tracking method and system based on convolution and attention combination feature extraction


Publications (2)

Publication Number Publication Date
CN117710688A 2024-03-15
CN117710688B 2024-06-25

Family

ID=90147228

Family Applications (1)

Application Number Title Status
CN202311697673.2A Target tracking method and system based on convolution and attention combination feature extraction Active (granted as CN117710688B)

Country Status (1)

Country Link
CN (1) CN117710688B (en)

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210224564A1 (en) * 2020-01-16 2021-07-22 Samsung Electronics Co., Ltd. Method and apparatus for tracking target
CN114092766A (en) * 2021-11-24 2022-02-25 北京邮电大学 Robot grabbing detection method based on characteristic attention mechanism
CN114463677A (en) * 2022-01-19 2022-05-10 北京工业大学 Safety helmet wearing detection method based on global attention
CN115100235A (en) * 2022-08-18 2022-09-23 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) Target tracking method, system and storage medium
CN115272419A (en) * 2022-09-27 2022-11-01 南昌工程学院 Method and system for tracking aggregation network target based on mixed convolution and self attention
CN116030097A (en) * 2023-02-28 2023-04-28 南昌工程学院 Target tracking method and system based on dual-attention feature fusion network
US20230162522A1 (en) * 2022-07-29 2023-05-25 Nanjing University Of Posts And Telecommunications Person re-identification method of integrating global features and ladder-shaped local features and device thereof
CN116402860A (en) * 2023-04-21 2023-07-07 中国人民解放***箭军工程大学 Unmanned aerial vehicle aerial photographing target tracking method and device with enhanced attention
CN116563337A (en) * 2023-04-11 2023-08-08 武汉大学 Target tracking method based on double-attention mechanism
CN117133035A (en) * 2023-08-25 2023-11-28 华中师范大学 Facial expression recognition method and system and electronic equipment


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
XU LIANG; ZHANG JIANG; ZHANG JING; YANG YAQI: "Robust target tracking algorithm based on VGG network" (基于VGG网络的鲁棒目标跟踪算法), Computer Engineering & Science (计算机工程与科学), no. 08, 15 August 2020 (2020-08-15) *

Also Published As

Publication number Publication date
CN117710688B (en) 2024-06-25

Similar Documents

Publication Publication Date Title
CN109840500B (en) Three-dimensional human body posture information detection method and device
US11783500B2 (en) Unsupervised depth prediction neural networks
CN112712546A (en) Target tracking method based on twin neural network
CN110942512A (en) Indoor scene reconstruction method based on meta-learning
Wang et al. TF-SOD: a novel transformer framework for salient object detection
CN115239765A (en) Infrared image target tracking system and method based on multi-scale deformable attention
Ukwuoma et al. Image inpainting and classification agent training based on reinforcement learning and generative models with attention mechanism
Lin et al. Attention guided network for salient object detection in optical remote sensing images
Glegoła et al. MobileNet family tailored for Raspberry Pi
Lin et al. Efficient and high-quality monocular depth estimation via gated multi-scale network
Zhang et al. Video extrapolation in space and time
Wang et al. EMAT: Efficient feature fusion network for visual tracking via optimized multi-head attention
CN114202454A (en) Graph optimization method, system, computer program product and storage medium
CN109784295A (en) Video stream characteristics recognition methods, device, equipment and storage medium
CN117576149A (en) Single-target tracking method based on attention mechanism
CN117710688B (en) Target tracking method and system based on convolution and attention combination feature extraction
Wang et al. 3D object detection algorithm for panoramic images with multi-scale convolutional neural network
Liu et al. Siamese network with bidirectional feature pyramid for small target tracking
CN115830707A (en) Multi-view human behavior identification method based on hypergraph learning
CN115097935A (en) Hand positioning method and VR equipment
Zheng et al. Multi-task View Synthesis with Neural Radiance Fields
CN114332989A (en) Face detection method and system of multitask cascade convolution neural network
Jokela Person counter using real-time object detection and a small neural network
Chen et al. EnforceNet: Monocular camera localization in large scale indoor sparse lidar point cloud
Lyu et al. High-precision and real-time visual tracking algorithm based on the Siamese network for autonomous driving

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant