CN117710688A - Target tracking method and system based on convolution and attention combination feature extraction - Google Patents


Info

Publication number
CN117710688A
CN117710688A (application CN202311697673.2A)
Authority
CN
China
Prior art keywords
feature
local
image
global
convolution
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202311697673.2A
Other languages
Chinese (zh)
Other versions
CN117710688B (en)
Inventor
王员云
孙传雨
王军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanchang Institute of Technology
Original Assignee
Nanchang Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanchang Institute of Technology filed Critical Nanchang Institute of Technology
Priority to CN202311697673.2A priority Critical patent/CN117710688B/en
Publication of CN117710688A publication Critical patent/CN117710688A/en
Application granted granted Critical
Publication of CN117710688B publication Critical patent/CN117710688B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/40: Extraction of image or video features
    • G06V10/44: Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V10/443: Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components, by matching or filtering
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/0464: Convolutional networks [CNN, ConvNet]
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/40: Extraction of image or video features
    • G06V10/42: Global feature extraction by analysis of the whole pattern, e.g. using frequency domain transformations or autocorrelation
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77: Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80: Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806: Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level, of extracted features
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82: Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00: Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07: Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Molecular Biology (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a target tracking method and system based on combined convolution and attention feature extraction, relating to the field of computer vision and image processing, and comprising the following steps: initializing a given first frame image and the search area of each subsequent frame image respectively to obtain a target template image and search images; constructing a feature extraction network whose sub-networks each comprise a convolution module, a plurality of hierarchical feature modules and a fully connected layer connected in sequence, where each hierarchical feature module comprises a global branch that extracts global information using self-attention and a local branch that extracts local information using an operation combining attention and convolution; applying the feature extraction network to the target template image and each search image to extract features, and performing a convolution operation on the extracted features to obtain the target response position of the search area in each subsequent frame image. The invention improves target tracking efficiency and accuracy through a lightweight feature extraction network combining convolution and attention.

Description

Target tracking method and system based on convolution and attention combination feature extraction
Technical Field
The invention relates to the field of computer vision and image processing, in particular to a target tracking method and system based on a lightweight feature extraction network combining convolution and attention.
Background
Target tracking is an important research direction in computer vision; its purpose is to estimate the position of an object in a video sequence given the initial state of the tracked object. Target tracking is now widely applied in vision fields such as visual localization, autonomous driving systems and smart cities. Although target tracking has achieved numerous results, the task still faces problems such as illumination change, scale change, background interference, object occlusion, and motion blur; the performance of existing methods has still not reached an ideal state, and designing a high-precision tracker remains a challenging task.
Convolutional neural networks and attention are the two main techniques in target tracking, both with superior performance in feature learning. A convolutional neural network can effectively extract local information from the relevant feature map through neighborhood convolution operations; however, its limited receptive field makes it difficult to capture global dependencies. The attention mechanism uses image blocks as the input representation and applies a weighted-average operation over the context of the input features, which can effectively capture global information, but blind similarity matching between neighboring image blocks can lead to high redundancy.
Trackers based on convolutional neural networks and trackers based on attention are two different approaches, and both achieve good tracking performance. Specifically, some trackers use convolutional neural networks as the feature extraction network to extract the features of the template branch and the search branch, and then compute similarity scores of the features through an attention-based feature fusion network to perform tracking. Other trackers use attention-based transformers as the feature extraction and feature fusion networks, respectively, to achieve high-performance tracking. However, the large models typical of attention-based transformer networks can hurt tracking performance and speed. The invention therefore provides a target tracking method and system based on combined convolution and attention feature extraction.
Disclosure of Invention
The invention aims to provide a target tracking method and system based on combined convolution and attention feature extraction, which can improve target tracking efficiency and accuracy through a lightweight feature extraction network combining convolution and attention.
In order to achieve the above object, the present invention provides the following solutions:
a target tracking method based on convolution and attention combination feature extraction, comprising:
respectively initializing a given first frame image and the search area in each subsequent frame image to obtain a target template image corresponding to the first frame image and a search image corresponding to each subsequent frame image;
constructing a feature extraction network; the feature extraction network comprises two feature extraction sub-networks with the same structure, used respectively to extract the features of the target template image and of the search images; each feature extraction sub-network comprises a convolution module, a plurality of hierarchical feature modules connected in series, and a first fully connected layer; the hierarchical feature modules are denoted in sequence as the first hierarchical feature module, the second hierarchical feature module, ..., the M-th hierarchical feature module; the input of the convolution module is the first frame image or a subsequent frame image, the output of the convolution module is the input of the first hierarchical feature module, the output of the M-th hierarchical feature module is the input of the first fully connected layer, and the output of the first fully connected layer is the feature extracted by the feature extraction network; each hierarchical feature module comprises a layer normalization branch, a global branch, a local branch and a fusion branch; the input of the layer normalization branch is the output of the convolution module; the output of the layer normalization branch is the input of both the global branch and the local branch; the global branch extracts global information using a self-attention operation; the local branch extracts local information using an operation combining attention and convolution; the fusion branch fuses the global information output by the global branch with the local information output by the local branch;
applying the corresponding feature extraction sub-network to the target template image and to each search image respectively to perform feature extraction, obtaining target template image features and search image features;
and inputting the extracted features of the target template image into a tracking model, and performing a convolution operation between the extracted features of each search image and the result output by the tracking model to obtain the target response position of the search area in each subsequent frame image.
Optionally, the feature extraction sub-network is applied to the target template image and each search image to perform feature extraction, which specifically includes:
cutting the target template image and the search image into a plurality of blocks respectively to obtain a cut target template image and a cut search image; the blocks in the cut target template image and the blocks in the cut search image overlap;
carrying out convolution processing on the image to be processed by utilizing the convolution module to obtain a feature after convolution; the image to be processed is the cut target template image or the cut search image;
inputting the features output by the (m-1)-th hierarchical feature module into the layer normalization branch of the m-th hierarchical feature module to obtain normalized features;
inputting the normalized features into the global branch of the m-th hierarchical feature module to obtain the global information;
inputting the normalized features into the local branch of the m-th hierarchical feature module to obtain the local information;
passing the global information and the local information through the fusion branch of the m-th hierarchical feature module to obtain a fused feature, where m = 1, 2, ..., M; when m = 1, the feature output by the (m-1)-th hierarchical feature module is the convolved feature;
judging whether m is equal to M; if not, setting m = m + 1 and returning to the step of inputting the features output by the (m-1)-th hierarchical feature module into the layer normalization branch of the m-th hierarchical feature module; if so, inputting the features output by the M-th hierarchical feature module into the first fully connected layer to obtain the extracted features of the target template image or of the search image; when the image to be processed is the cut target template image, the first fully connected layer outputs the extracted features of the target template image; when the image to be processed is the cut search image, the first fully connected layer outputs the extracted features of the search image.
Optionally, inputting the normalized features into the global branch of the m-th hierarchical feature module to obtain the global information specifically includes:
performing a linear transformation operation on the normalized features to obtain three feature maps, denoted respectively as the global query Q, the global key K and the global value V;
performing a downsampling operation on the global key K and the global value V;
and performing a standard attention operation on the global query Q, the downsampled key K and the downsampled value V to obtain the global information.
Optionally, inputting the normalized features into the local branch of the m-th hierarchical feature module to obtain the local information specifically includes:
performing a linear transformation operation on the normalized features to obtain three feature maps, denoted respectively as a local query Q, a local key K and a local value V;
performing a globally weight-shared depthwise convolution operation on the local query Q, the local key K and the local value V respectively to obtain a query Q local aggregation feature, a key K local aggregation feature and a value V local aggregation feature;
performing a Hadamard product operation on the query Q local aggregation feature and the key K local aggregation feature to obtain a product result;
passing the product result sequentially through a second fully connected layer, a first activation layer, a third fully connected layer and a second activation layer to obtain context-aware information;
and performing a Hadamard product operation on the context-aware information and the value V local aggregation feature to obtain the local information.
Optionally, passing the global information and the local information through the fusion branch of the m-th hierarchical feature module to obtain the fused feature specifically includes:
performing a cascade operation on the global information and the local information to obtain cascaded features;
and passing the cascaded features through a fourth fully connected layer to obtain the fused features.
The invention provides a target tracking system based on convolution and attention combination feature extraction, which comprises:
the initialization module is used for respectively initializing a given first frame image and a search area in each subsequent frame image to obtain a target template image corresponding to the first frame image and a search image corresponding to each subsequent frame image;
the feature extraction network construction module, used to construct a feature extraction network; the feature extraction network comprises two feature extraction sub-networks with the same structure, used respectively to extract the features of the target template image and of the search images; each feature extraction sub-network comprises a convolution module, a plurality of hierarchical feature modules connected in series, and a first fully connected layer; the hierarchical feature modules are denoted in sequence as the first hierarchical feature module, the second hierarchical feature module, ..., the M-th hierarchical feature module; the input of the convolution module is the first frame image or a subsequent frame image, the output of the convolution module is the input of the first hierarchical feature module, the output of the M-th hierarchical feature module is the input of the first fully connected layer, and the output of the first fully connected layer is the feature extracted by the feature extraction network; each hierarchical feature module comprises a layer normalization branch, a global branch, a local branch and a fusion branch; the input of the layer normalization branch is the output of the convolution module; the output of the layer normalization branch is the input of both the global branch and the local branch; the global branch extracts global information using a self-attention operation; the local branch extracts local information using an operation combining attention and convolution; the fusion branch fuses the global information output by the global branch with the local information output by the local branch;
the feature extraction module, used to apply the corresponding feature extraction sub-network to the target template image and to each search image respectively to perform feature extraction, obtaining target template image features and search image features;
and the target tracking module, used to input the extracted features of the target template image into a tracking model and perform a convolution operation between the extracted features of each search image and the result output by the tracking model, obtaining the target response position of the search area in each subsequent frame image.
Optionally, the feature extraction module specifically includes:
the image cutting unit is used for cutting the target template image and the search image into a plurality of blocks respectively to obtain a cut target template image and a cut search image; the blocks in the cut target template image and the blocks in the cut search image overlap;
the convolution operation unit is used for carrying out convolution processing on the image to be processed by utilizing the convolution module to obtain a feature after convolution; the image to be processed is the cut target template image or the cut search image;
the normalization unit, used to input the features output by the (m-1)-th hierarchical feature module into the layer normalization branch of the m-th hierarchical feature module to obtain normalized features;
the global feature extraction unit, used to input the normalized features into the global branch of the m-th hierarchical feature module to obtain the global information;
the local feature extraction unit, used to input the normalized features into the local branch of the m-th hierarchical feature module to obtain the local information;
the feature fusion unit, used to pass the global information and the local information through the fusion branch of the m-th hierarchical feature module to obtain fused features, where m = 1, 2, ..., M; when m = 1, the feature output by the (m-1)-th hierarchical feature module is the convolved feature;
and the judging unit, used to judge whether m is equal to M; if not, set m = m + 1 and return to the step of inputting the features output by the (m-1)-th hierarchical feature module into the layer normalization branch of the m-th hierarchical feature module; if so, input the features output by the M-th hierarchical feature module into the first fully connected layer to obtain the extracted features of the target template image or of the search image; when the image to be processed is the cut target template image, the first fully connected layer outputs the extracted features of the target template image; when the image to be processed is the cut search image, the first fully connected layer outputs the extracted features of the search image.
Optionally, the global feature extraction unit specifically includes:
the first linear transformation subunit, used to perform a linear transformation operation on the normalized features to obtain three feature maps, denoted respectively as the global query Q, the global key K and the global value V;
the downsampling subunit, used to perform a downsampling operation on the global key K and the global value V;
and the global feature extraction subunit, used to perform a standard attention operation on the global query Q, the downsampled key K and the downsampled value V to obtain the global information.
Optionally, the local feature extraction unit specifically includes:
the second linear transformation subunit, used to perform a linear transformation operation on the normalized features to obtain three feature maps, denoted respectively as a local query Q, a local key K and a local value V;
the depthwise convolution subunit, used to perform a globally weight-shared depthwise convolution operation on the local query Q, the local key K and the local value V respectively, obtaining the query Q local aggregation feature, the key K local aggregation feature and the value V local aggregation feature;
the Hadamard product subunit, used to perform a Hadamard product operation on the query Q local aggregation feature and the key K local aggregation feature to obtain a product result;
the series processing subunit, used to pass the product result sequentially through the second fully connected layer, the first activation layer, the third fully connected layer and the second activation layer to obtain context-aware information;
and the local feature extraction subunit, used to perform a Hadamard product operation on the context-aware information and the value V local aggregation feature to obtain the local information.
Optionally, the feature fusion unit specifically includes:
a cascade operation subunit, configured to perform a cascade operation on the global information and the local information to obtain cascaded features;
and a fully connected layer processing subunit, configured to pass the cascaded features through a fourth fully connected layer to obtain the fused features.
According to the specific embodiment provided by the invention, the invention discloses the following technical effects:
the invention provides a target tracking method and system based on combined convolution and attention feature extraction, which combine the fast local representation learning of convolutional neural networks with the global modeling capability of attention. The hierarchical feature module consists of a global branch and a local branch: the global branch uses self-attention to extract global information, reducing the computation required by attention and enlarging the global receptive field; the attention-style convolution in the local branch exploits globally shared weights and context-aware weights for local perception, effectively aggregating local feature information. The method performs feature extraction in a manner that combines convolution and attention, improving the efficiency and accuracy of target tracking.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions of the prior art, the drawings that are needed in the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flowchart of a target tracking method based on convolution and attention combination feature extraction according to embodiment 1 of the present invention;
FIG. 2 is a schematic diagram of the structure of a target tracking framework of the lightweight feature extraction network based on convolution and attention combination according to embodiment 1 of the present invention;
fig. 3 is a schematic diagram of a hierarchical feature module principle structure of object tracking of a lightweight feature extraction network based on combination of convolution and attention according to embodiment 1 of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The invention aims to provide a target tracking method and system based on combined convolution and attention feature extraction, which combine the advantage of fast local feature learning in convolutional neural networks with the global modeling capability of attention-based transformers, solving the problem that the limited receptive field of convolutional neural networks makes it difficult for convolution-based trackers to capture global dependencies, and also the problem that the large models of attention-based trackers hurt tracking performance and speed.
In order that the above-recited objects, features and advantages of the present invention will become more readily apparent, a more particular description of the invention will be rendered by reference to the appended drawings and appended detailed description.
Example 1
As shown in fig. 1, the present embodiment provides a target tracking method based on convolution and attention combination feature extraction, including:
s1: and initializing the given first frame image and the search area in each subsequent frame image respectively to obtain a target template image corresponding to the first frame image and a search image corresponding to each subsequent frame image.
S2: and constructing a feature extraction network.
As shown in fig. 2, the feature extraction network includes two feature extraction sub-networks with identical structures, used respectively to extract the features of the target template image and of the search images. Each feature extraction sub-network comprises a convolution module, a plurality of hierarchical feature modules connected in series, and a first fully connected layer; the hierarchical feature modules are denoted in sequence as the first hierarchical feature module, the second hierarchical feature module, ..., the M-th hierarchical feature module. The input of the convolution module is the first frame image or a subsequent frame image, the output of the convolution module is the input of the first hierarchical feature module, the output of the M-th hierarchical feature module is the input of the first fully connected layer, and the output of the first fully connected layer is the feature extracted by the feature extraction network.
Each hierarchical feature module is a lightweight module based on the combination of convolution and attention, and hierarchical features are extracted through these modules. Each hierarchical feature module comprises a layer normalization branch, a global branch, a local branch and a fusion branch; the input of the layer normalization branch is the output of the convolution module; the output of the layer normalization branch is the input of both the global branch and the local branch; the global branch extracts global information using a self-attention operation; the local branch extracts local information using an operation combining attention and convolution; the fusion branch fuses the global information output by the global branch with the local information output by the local branch.
In fig. 2, the convolution module uses 3×3 convolutions and learns the target template image features and the search image features through convolution operations, obtaining the template features and the search features respectively. The number of hierarchical feature modules is 4 (M = 4); the specific number may be set as required. The first fully connected layer is an average-pooling fully connected layer.
As shown in the hierarchical feature module schematic of fig. 3, the module includes a global branch that performs attention operations and a local branch that mixes attention and convolution, with a depthwise convolution added to the local branch.
S3: and respectively applying the characteristic extraction sub-network to the target template image and each search image to extract the characteristics so as to obtain the extraction characteristics of the target template image and the extraction characteristics of the search image.
The step S3 specifically comprises the following steps:
s31: cutting the target template image and the search image into a plurality of blocks respectively to obtain a cut target template image and a cut search image; the blocks in the cut target template image and the blocks in the cut search image overlap.
S32: carrying out 3×3 convolution processing on the image to be processed by utilizing the convolution module to obtain a feature after convolution; the image to be processed is the cut target template image or the cut search image.
Since the cut target template image and the cut search image are processed by their respective feature extraction sub-networks in the same way, steps S32 to S37 describe the feature extraction process for only one of the two images; in practice, each of the two images passes through its own feature extraction sub-network following steps S32 to S37.
S33: and inputting the features output by the m-1 th layered feature module into the layer normalization branch of the m-th layered feature module to obtain normalization features.
Step S31 to step S33:
The template image and the search image are converted into overlapping blocks, and projection mapping is performed using 3×3 convolutions (the two 3×3 convolutions in fig. 2). After each convolution, batch normalization is applied, followed by an activation function:
F_{i,j} = GELU(BN(Conv_{3×3}(x)))
x = BN(Conv_{3×3}(F_{i,j})) + x
where BN denotes batch normalization, GELU is the activation function, x is the initialized template image or search image, F_{i,j} is the template feature or search feature obtained through the convolution operation (the convolved feature), i is the horizontal pixel coordinate, and j is the vertical pixel coordinate.
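For illustration, the two formulas above might be sketched in PyTorch as follows. This is a minimal sketch, not the patent's implementation; the class name, channel width and stride are assumptions:

```python
import torch
import torch.nn as nn

class ConvStem(nn.Module):
    """Sketch of the stem: F = GELU(BN(Conv3x3(x))),
    followed by a residual refinement x = BN(Conv3x3(F)) + x."""
    def __init__(self, in_ch=3, dim=64):
        super().__init__()
        # a strided 3x3 convolution turns the image into overlapping blocks
        self.proj = nn.Sequential(
            nn.Conv2d(in_ch, dim, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm2d(dim),
            nn.GELU(),
        )
        self.conv = nn.Conv2d(dim, dim, kernel_size=3, stride=1, padding=1)
        self.bn = nn.BatchNorm2d(dim)

    def forward(self, x):                 # x: (B, 3, H, W)
        x = self.proj(x)                  # F_{i,j} = GELU(BN(Conv_3x3(x)))
        return self.bn(self.conv(x)) + x  # x = BN(Conv_3x3(F_{i,j})) + x
```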
S34: and inputting the normalized feature to the global branch of the mth hierarchical feature to obtain the global information.
The step S34 specifically includes:
s341: and performing linear transformation operation on the normalized features to obtain three feature graphs, which are respectively marked as global query Q, global key K and global value V.
S342: and performing a downsampling operation on the global key K and the global value V.
S343: and performing linear standard attention operation on the global query Q, the downsampled key K and the downsampled value V to obtain the global information.
As shown in fig. 3, in the global branch, layer normalization and a linear transformation layer (a fully connected layer, which is a kind of linear transformation layer) are first applied to obtain Q, K and V; next, K and V are downsampled; finally, standard attention is performed on Q and the downsampled K and V to extract the global feature information:
Q, K, V = FC(LN(F_{i,j}))
F_global = Attention(Q_g, Pool(K_g), Pool(V_g))
where F_global is the global feature extracted by the hierarchical module, Attention is the standard attention operation, Q_g, K_g and V_g are the query Q, key K and value V of the global branch (the subscript g denotes the global branch), Pool is the pooling layer that downsamples K_g and V_g, LN denotes layer normalization, and FC denotes the fully connected layer.
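A minimal single-head PyTorch sketch of this global branch follows; the pooling size, the single-head simplification and all names are assumptions, while the structure (linear projection to Q, K and V, pooled K and V, standard attention) follows the formulas above:

```python
import torch
import torch.nn as nn

class GlobalBranch(nn.Module):
    """Sketch of the global branch: Q, K, V = FC(LN(F)); K and V are
    downsampled by pooling before standard attention, which shrinks the
    attention matrix and the amount of computation."""
    def __init__(self, dim, pool_size=2):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim)   # the linear transformation layer
        self.pool = nn.AvgPool2d(pool_size)  # downsampling of K and V
        self.scale = dim ** -0.5

    def forward(self, x):                    # x: (B, H, W, C), already layer-normalized
        B, H, W, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # pool K and V spatially, then flatten them into token sequences
        k = self.pool(k.permute(0, 3, 1, 2)).flatten(2).transpose(1, 2)
        v = self.pool(v.permute(0, 3, 1, 2)).flatten(2).transpose(1, 2)
        q = q.reshape(B, H * W, C)
        attn = (q @ k.transpose(-2, -1)) * self.scale   # standard attention scores
        attn = attn.softmax(dim=-1)
        return (attn @ v).reshape(B, H, W, C)           # F_global
```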
S35: and inputting the normalized feature to the local branch of the mth hierarchical feature to obtain the local information.
The step S35 specifically includes:
s351: and performing linear transformation operation on the normalized features to obtain three feature graphs, which are respectively marked as a local query Q, a local key K and a local value V.
S352: and respectively executing the deep convolution operation of the global weight sharing on the local query Q, the local key K and the local value V to obtain the local aggregation characteristic of the query Q, the local aggregation characteristic of the key K and the local aggregation characteristic of the value V.
S353: and carrying out Hardmard product operation on the query Q local aggregation feature and the key K local aggregation feature to obtain a product operation result.
S354: and sequentially passing the product operation result through the second complete connection layer, the first activation layer, the second complete connection layer and the second activation layer to obtain context sensing information.
S355: and carrying out Hardmard product operation on the context awareness information and the value V local aggregation feature to obtain the local information.
As shown in fig. 3, the attention-style convolution operator is the key module enabling the high performance of the hierarchical attention model, and it includes some standard attention operations. A linear transformation is first applied to obtain Q, K and V, which then undergo a local feature aggregation process with shared weights. For V, a depthwise convolution whose weights are globally shared is used to aggregate local information:
V_s = DWConv(V)
where DWConv is the depthwise convolution and V_s is the value V after aggregating local information; since the weights of the depthwise convolution are shared, this is a shared-weight local feature aggregation of V.
After aggregating the local information of V with shared weights, Q and K are combined to generate context-aware weights. Specifically, two DWConv are first used to aggregate the local information of Q and K respectively. Then the Hadamard product of Q and K is computed, and a series of transformations is applied to obtain context-aware weights between -1 and 1. Finally, the generated weights are used to enhance the local features:
Q_l = DWConv(Q)
K_l = DWConv(K)
Attn = Tanh(FC(Swish(FC(Q_l ⊙ K_l))) / √d)
F_local = Attn ⊙ V_s
where d is the number of channels and ⊙ denotes the Hadamard product. The attention-style convolution introduces a stronger nonlinearity: Q_l is Q after local information aggregation, K_l is K after local information aggregation, and F_local is the local feature. Swish is an activation function that works better than ReLU in deep models, and Tanh is an activation function that compresses its input into the interval (-1, 1).
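A PyTorch sketch of this attention-style convolution, following the formulas above; the depthwise kernel size and all names are assumptions, and Swish is realized with torch's SiLU:

```python
import torch
import torch.nn as nn

class LocalBranch(nn.Module):
    """Sketch of the local branch: depthwise convolutions with globally
    shared weights aggregate local information of Q, K and V; the Hadamard
    product Q_l * K_l passes through FC -> Swish -> FC -> Tanh to produce
    context-aware weights in (-1, 1) that re-weight V_s."""
    def __init__(self, dim, kernel_size=3):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim)
        pad = kernel_size // 2
        self.dw_q = nn.Conv2d(dim, dim, kernel_size, padding=pad, groups=dim)
        self.dw_k = nn.Conv2d(dim, dim, kernel_size, padding=pad, groups=dim)
        self.dw_v = nn.Conv2d(dim, dim, kernel_size, padding=pad, groups=dim)
        self.fc1 = nn.Linear(dim, dim)
        self.fc2 = nn.Linear(dim, dim)
        self.swish = nn.SiLU()
        self.scale = dim ** -0.5             # 1 / sqrt(d), d = channel number

    def forward(self, x):                    # x: (B, H, W, C), already layer-normalized
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        as_map = lambda t: t.permute(0, 3, 1, 2)   # (B, C, H, W) for DWConv
        as_feat = lambda t: t.permute(0, 2, 3, 1)  # back to (B, H, W, C)
        q_l = as_feat(self.dw_q(as_map(q)))  # Q_l = DWConv(Q)
        k_l = as_feat(self.dw_k(as_map(k)))  # K_l = DWConv(K)
        v_s = as_feat(self.dw_v(as_map(v)))  # V_s = DWConv(V)
        attn = self.fc2(self.swish(self.fc1(q_l * k_l)))
        attn = torch.tanh(attn * self.scale) # context-aware weights in (-1, 1)
        return attn * v_s                    # F_local = Attn * V_s
```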
S36: the global information and the local information are subjected to the fusion branch of the mth hierarchical feature to obtain a fusion feature; m=1, 2,; when m=1, the feature output by the m-1 th hierarchical feature module is the convolved feature.
And connecting the output of the global branch characteristic with the output characteristic of the local branch in the template branch, and obtaining the template branch output characteristic of the lightweight hierarchical characteristic module based on volume and attention through a full connection layer.
And connecting the output of the global branch characteristic with the output characteristic of the local branch in the search branch, and obtaining the search branch output characteristic of the lightweight hierarchical characteristic module based on volume and attention through a full connection layer.
In step S36, the local and global features are connected using a simple method, fused with the global branch output features. Specifically, the local features and the global features in the channel dimension are connected, and the template branch layering features and the search branch layering features are output through a complete connection layer. The method comprises the following steps:
F t =Concat(F global ,F local )
F out =FC(F t )
Wherein F is t Is a feature which merges local features and global features, F out Is a hierarchical feature of the output, concat (-) represents a cascading operation. FC denotes a fully connected layer (fourth fully connected layer).
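Tying the branches together, one hierarchical feature module might be sketched as follows, reusing the GlobalBranch and LocalBranch sketches above; the absence of a residual connection mirrors the formulas, which specify only concatenation followed by a fully connected layer:

```python
import torch
import torch.nn as nn

class HierarchicalFeatureModule(nn.Module):
    """Sketch of one hierarchical feature module: layer normalization feeds
    the global and local branches; their outputs are concatenated along the
    channel dimension and fused, F_out = FC(Concat(F_global, F_local))."""
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim)        # the layer normalization branch
        self.global_branch = GlobalBranch(dim)
        self.local_branch = LocalBranch(dim)
        self.fuse = nn.Linear(2 * dim, dim)  # the fourth fully connected layer

    def forward(self, x):                    # x: (B, H, W, C)
        y = self.norm(x)
        f_t = torch.cat([self.global_branch(y), self.local_branch(y)], dim=-1)
        return self.fuse(f_t)                # F_out = FC(F_t)
```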
S37: it is determined whether M is equal to M.
If not, let m=m+1, and return to step "input the feature output by the m-1 th hierarchical feature module to the layer normalization branch of the m-th hierarchical feature module".
If yes, inputting the features output by the Mth hierarchical feature module to the first complete connection layer to obtain the target template image extraction features or the search image extraction features. When the image to be processed is the cut target template image, the first complete connection layer outputs the extracted features of the target template image; when the image to be processed is the cut search image, the first complete connection layer outputs the extracted features of the search image.
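Steps S32 to S37 can then be read as one forward pass through a sub-network like the sketch below, reusing the ConvStem and HierarchicalFeatureModule sketches above; the channel width, depth and output dimension are assumptions. Note that for the correlation in step S4 a spatial feature map is typically kept; the pooled vector here simply mirrors the "average-pooling fully connected layer" wording:

```python
import torch
import torch.nn as nn

class FeatureExtractionSubNetwork(nn.Module):
    """Sketch of one feature extraction sub-network: the convolutional stem,
    M = 4 hierarchical feature modules in series, then average pooling and
    the first fully connected layer."""
    def __init__(self, dim=64, depth=4, out_dim=256):
        super().__init__()
        self.stem = ConvStem(in_ch=3, dim=dim)
        self.stages = nn.Sequential(
            *[HierarchicalFeatureModule(dim) for _ in range(depth)])
        self.fc = nn.Linear(dim, out_dim)    # the first fully connected layer

    def forward(self, x):                    # x: (B, 3, H, W)
        x = self.stem(x).permute(0, 2, 3, 1) # to (B, H, W, C) for the modules
        x = self.stages(x)
        x = x.mean(dim=(1, 2))               # average pooling over space
        return self.fc(x)                    # extracted feature
```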
S4: and inputting the extracted features of the target template image into a tracking model, and carrying out convolution operation on the extracted features of each search image and the result output by the tracking model respectively to obtain the target response position of the search area in each subsequent frame image.
And merging the output result of the template branch output characteristics after the tracking model and the search branch output characteristics by utilizing convolution operation to obtain the response position of the target in the search area, and realizing accurate tracking under the condition of reducing the calculated amount. The tracking model is an existing model under the DIMP50 tracking framework, and the invention improves on a feature extraction network under the DIMP50 tracking framework. Wherein, DIMP: learning Discriminative Model Prediction forTracking (tracking of learning discriminant model predictions).
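The correlation step itself (not the DiMP model predictor, which is an existing model and is omitted here) might be sketched as follows, assuming spatial template and search feature maps:

```python
import torch
import torch.nn.functional as F

def locate_target(template_feat: torch.Tensor, search_feat: torch.Tensor):
    """Sketch: use the (model-refined) template feature map (1, C, Ht, Wt)
    as a convolution kernel over the search feature map (1, C, Hs, Ws);
    the peak of the response map gives the target position."""
    score = F.conv2d(search_feat, template_feat)    # cross-correlation
    h, w = score.shape[-2:]
    idx = int(score.flatten().argmax())
    return divmod(idx, w)                           # (row, col) of the peak

# usage sketch (dimensions are illustrative):
# template_feat = torch.randn(1, 256, 8, 8)
# search_feat = torch.randn(1, 256, 22, 22)
# row, col = locate_target(template_feat, search_feat)
```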
This embodiment provides a lightweight feature extraction network based on the combination of convolution and attention, combining the fast local representation learning of convolutional neural networks with the global modeling capability of attention. The corresponding template features and search features are obtained through the convolution operation and serve respectively as the inputs of the hierarchical feature modules. The hierarchical feature module consists of a global branch and a local branch: the global branch uses self-attention to extract global information, reducing the computation required by attention and enlarging the global receptive field; the attention-style convolution in the local branch exploits globally shared weights and context-aware weights for local perception, effectively aggregating local feature information. The hierarchical feature module combines the features output by the global and local branches (concatenation followed by a fully connected layer) to obtain output features that aggregate local and global context information. By using this lightweight feature extraction network, the invention can reduce the computational load of the tracker.
Example 2
The present embodiment provides a target tracking system based on convolution and attention combination feature extraction, including:
the initialization module 100 is configured to initialize a given first frame image and a search area in each subsequent frame image, respectively, to obtain a target template image corresponding to the first frame image and a search image corresponding to each subsequent frame image.
A feature extraction network construction module 200, used to construct a feature extraction network; the feature extraction network comprises two feature extraction sub-networks with the same structure, used respectively to extract the features of the target template image and of the search images; each feature extraction sub-network comprises a convolution module, a plurality of hierarchical feature modules connected in series, and a first fully connected layer; the hierarchical feature modules are denoted in sequence as the first hierarchical feature module, the second hierarchical feature module, ..., the M-th hierarchical feature module; the input of the convolution module is the first frame image or a subsequent frame image, the output of the convolution module is the input of the first hierarchical feature module, the output of the M-th hierarchical feature module is the input of the first fully connected layer, and the output of the first fully connected layer is the feature extracted by the feature extraction network; each hierarchical feature module comprises a layer normalization branch, a global branch, a local branch and a fusion branch; the input of the layer normalization branch is the output of the convolution module; the output of the layer normalization branch is the input of both the global branch and the local branch; the global branch extracts global information using a self-attention operation; the local branch extracts local information using an operation combining attention and convolution; the fusion branch fuses the global information output by the global branch with the local information output by the local branch.
And the feature extraction module 300 is configured to apply the feature extraction sub-network to the target template image and each search image to perform feature extraction, so as to obtain target template image extraction features and search image extraction features.
The feature extraction module 300 specifically includes:
an image cutting unit 301, configured to cut the target template image and the search image into a plurality of blocks, to obtain a cut target template image and a cut search image; the blocks in the cut target template image and the blocks in the cut search image overlap.
The convolution operation unit 302 is configured to perform convolution processing on the image to be processed by using the convolution module, so as to obtain a feature after convolution; the image to be processed is the cut target template image or the cut search image.
The normalization unit 303, configured to input the features output by the (m-1)-th hierarchical feature module into the layer normalization branch of the m-th hierarchical feature module to obtain normalized features.
The global feature extraction unit 304, configured to input the normalized features into the global branch of the m-th hierarchical feature module to obtain the global information.
Specifically, the global feature extraction unit 304 includes:
and the first linear transformation subunit is used for performing linear transformation operation on the normalized features to obtain three feature graphs, which are respectively marked as global query Q, global key K and global value V.
And the downsampling operation subunit is used for executing downsampling operation on the global key K and the global value V.
And the global feature extraction subunit is used for performing linear standard attention operation on the global query Q, the downsampled key K and the downsampled value V to obtain the global information.
A local feature extraction unit 305, configured to input the normalized features into the local branch of the m-th hierarchical feature module to obtain the local information.
Specifically, the local feature extraction unit 305 includes:
a second linear transformation subunit, configured to perform a linear transformation operation on the normalized features to obtain three feature maps, denoted respectively as a local query Q, a local key K and a local value V;
a depthwise convolution subunit, configured to perform a globally weight-shared depthwise convolution operation on the local query Q, the local key K and the local value V respectively, obtaining the query Q local aggregation feature, the key K local aggregation feature and the value V local aggregation feature;
a Hadamard product subunit, configured to perform a Hadamard product operation on the query Q local aggregation feature and the key K local aggregation feature to obtain a product result;
a series processing subunit, configured to pass the product result sequentially through the second fully connected layer, the first activation layer, the third fully connected layer and the second activation layer to obtain context-aware information;
and a local feature extraction subunit, configured to perform a Hadamard product operation on the context-aware information and the value V local aggregation feature to obtain the local information.
A feature fusion unit 306, configured to pass the global information and the local information through the fusion branch of the m-th hierarchical feature module to obtain fused features, where m = 1, 2, ..., M; when m = 1, the feature output by the (m-1)-th hierarchical feature module is the convolved feature.
Specifically, the feature fusion unit 306 includes:
a cascade operation subunit, configured to perform a cascade operation on the global information and the local information to obtain cascaded features;
and a fully connected layer processing subunit, configured to pass the cascaded features through a fourth fully connected layer to obtain the fused features.
A judging unit 307, configured to judge whether m is equal to M; if not, set m = m + 1 and return to the step of inputting the features output by the (m-1)-th hierarchical feature module into the layer normalization branch of the m-th hierarchical feature module; if so, input the features output by the M-th hierarchical feature module into the first fully connected layer to obtain the extracted features of the target template image or of the search image; when the image to be processed is the cut target template image, the first fully connected layer outputs the extracted features of the target template image; when the image to be processed is the cut search image, the first fully connected layer outputs the extracted features of the search image.
And the target tracking module 400 is configured to input the extracted features of the target template image to a tracking model, and perform convolution operation on each extracted feature of the search image and a result output by the tracking model, so as to obtain a target response position of the search area in each subsequent frame image.
Example 3
The present embodiment provides an electronic device including a memory for storing a computer program and a processor that runs the computer program to cause the electronic device to execute the target tracking method based on the convolution and attention combination feature extraction of embodiment 1.
Alternatively, the electronic device may be a server.
In addition, an embodiment of the present invention further provides a computer-readable storage medium storing a computer program that, when executed by a processor, implements the target tracking method based on the convolution and attention combination feature extraction of embodiment 1.
Embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
Each embodiment in this specification focuses on its differences from the other embodiments; for identical or similar parts, the embodiments may refer to one another. Since the system disclosed in an embodiment corresponds to the method disclosed in an embodiment, its description is relatively brief, and the relevant points can be found in the description of the method.
Specific examples are used herein to illustrate the principles and implementations of the present invention; the above description of the embodiments is intended only to help in understanding the method of the present invention and its core ideas. Meanwhile, those of ordinary skill in the art may, in accordance with the ideas of the present invention, make changes to the specific implementations and the scope of application. In summary, the contents of this specification should not be construed as limiting the invention.

Claims (10)

1. A target tracking method based on convolution and attention combination feature extraction, comprising:
respectively initializing a given first frame image and a search area in each subsequent frame image to obtain a target template image corresponding to the first frame image and a search image corresponding to each subsequent frame image;
constructing a feature extraction network; the feature extraction network comprises two feature extraction sub-networks with the same structure; the two feature extraction sub-networks are used respectively to extract the features of the target template image and of the search images; each feature extraction sub-network comprises a convolution module, a plurality of hierarchical feature modules connected in series, and a first fully connected layer; the hierarchical feature modules are denoted in sequence as the first hierarchical feature module, the second hierarchical feature module, ..., the M-th hierarchical feature module; the input of the convolution module is the first frame image or a subsequent frame image, the output of the convolution module is the input of the first hierarchical feature module, the output of the M-th hierarchical feature module is the input of the first fully connected layer, and the output of the first fully connected layer is the feature extracted by the feature extraction network; each hierarchical feature module comprises a layer normalization branch, a global branch, a local branch and a fusion branch; the input of the layer normalization branch is the output of the convolution module; the output of the layer normalization branch is the input of both the global branch and the local branch; the global branch extracts global information using a self-attention operation; the local branch extracts local information using an operation combining attention and convolution; the fusion branch fuses the global information output by the global branch with the local information output by the local branch;
applying the corresponding feature extraction sub-network to the target template image and to each search image respectively to perform feature extraction, obtaining target template image features and search image features;
and inputting the extracted features of the target template image into a tracking model, and performing a convolution operation between the extracted features of each search image and the result output by the tracking model to obtain the target response position of the search area in each subsequent frame image.
2. The target tracking method based on combined convolution and attention feature extraction according to claim 1, wherein applying the corresponding feature extraction sub-network to the target template image and to each search image to perform feature extraction specifically comprises:
cutting the target template image and the search image into a plurality of blocks respectively to obtain a cut target template image and a cut search image; the blocks in the cut target template image and the blocks in the cut search image overlap;
carrying out convolution processing on an image to be processed by using the convolution module to obtain a convolved feature; the image to be processed is the cut target template image or the cut search image;
inputting the feature output by the (m-1)-th hierarchical feature module into the layer normalization branch of the m-th hierarchical feature module to obtain a normalized feature;
inputting the normalized feature into the global branch of the m-th hierarchical feature module to obtain the global information;
inputting the normalized feature into the local branch of the m-th hierarchical feature module to obtain the local information;
passing the global information and the local information through the fusion branch of the m-th hierarchical feature module to obtain a fused feature; wherein m = 1, 2, …, M; when m = 1, the feature output by the (m-1)-th hierarchical feature module is the convolved feature;
judging whether m is equal to M; if not, setting m = m + 1 and returning to the step of inputting the feature output by the (m-1)-th hierarchical feature module into the layer normalization branch of the m-th hierarchical feature module; if yes, inputting the feature output by the M-th hierarchical feature module into the first fully connected layer to obtain the target template image extraction features or the search image extraction features; when the image to be processed is the cut target template image, the first fully connected layer outputs the target template image extraction features; when the image to be processed is the cut search image, the first fully connected layer outputs the search image extraction features.
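
The iterative flow of claim 2 (layer normalization first, then parallel global and local branches, then fusion, repeated for m = 1, 2, …, M) can be outlined as follows. This is a structural sketch only: the stub branches, token layout, and dimensions are assumptions, and the real global and local branches are those detailed in claims 3 and 4:

```python
# Structural sketch of one hierarchical feature module and the M-stage loop.
# nn.Identity() stands in for the global (self-attention) and local
# (attention + convolution) branches described in claims 3 and 4.
import torch
import torch.nn as nn

class HierarchicalFeatureModule(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim)            # layer normalization branch
        self.global_branch = nn.Identity()       # placeholder: claim 3
        self.local_branch = nn.Identity()        # placeholder: claim 4
        self.fuse = nn.Linear(2 * dim, dim)      # fusion branch: claim 5

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n = self.norm(x)                         # normalized feature
        g = self.global_branch(n)                # global information
        l = self.local_branch(n)                 # local information
        return self.fuse(torch.cat([g, l], dim=-1))

M, dim = 4, 96
stages = nn.Sequential(*[HierarchicalFeatureModule(dim) for _ in range(M)])
head = nn.Linear(dim, dim)                       # first fully connected layer
tokens = torch.randn(1, 196, dim)                # output of the convolution module
extracted = head(stages(tokens))                 # (1, 196, dim)
```
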
3. The target tracking method based on convolution and attention combination feature extraction according to claim 2, wherein the inputting of the normalized feature into the global branch of the m-th hierarchical feature module to obtain the global information specifically comprises:
performing a linear transformation operation on the normalized feature to obtain three feature maps, which are respectively denoted as a global query Q, a global key K and a global value V;
performing a downsampling operation on the global key K and the global value V;
and performing a linear standard attention operation on the global query Q, the downsampled key K and the downsampled value V to obtain the global information.
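
A hedged sketch of this global branch follows. The claim fixes neither the downsampling ratio nor the exact linear attention variant; the strided average pooling and the ELU-based kernel feature map below are assumptions chosen for concreteness:

```python
# Sketch under stated assumptions: one linear projection yields Q, K, V;
# K and V are downsampled along the token axis; a softmax-free (linear)
# attention then aggregates global information in O(N) time.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GlobalBranch(nn.Module):
    def __init__(self, dim: int, sr_ratio: int = 2):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim)       # linear transformation -> Q, K, V
        self.pool = nn.AvgPool1d(sr_ratio, stride=sr_ratio)  # downsampling (assumed)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        q, k, v = self.qkv(x).chunk(3, dim=-1)   # each (B, N, C)
        k = self.pool(k.transpose(1, 2)).transpose(1, 2)     # (B, N/r, C)
        v = self.pool(v.transpose(1, 2)).transpose(1, 2)
        q, k = F.elu(q) + 1, F.elu(k) + 1        # positive kernel feature map
        kv = torch.einsum('bnc,bnd->bcd', k, v)  # (B, C, C) global summary
        z = 1.0 / (torch.einsum('bnc,bc->bn', q, k.sum(1)) + 1e-6)
        out = torch.einsum('bnc,bcd,bn->bnd', q, kv, z)      # global information
        return self.proj(out)

print(GlobalBranch(64)(torch.randn(2, 196, 64)).shape)       # (2, 196, 64)
```
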
4. The target tracking method based on convolution and attention combination feature extraction according to claim 2, wherein the inputting of the normalized feature into the local branch of the m-th hierarchical feature module to obtain the local information specifically comprises:
performing a linear transformation operation on the normalized feature to obtain three feature maps, which are respectively denoted as a local query Q, a local key K and a local value V;
respectively performing a depth-wise convolution operation with globally shared weights on the local query Q, the local key K and the local value V to obtain a query Q local aggregation feature, a key K local aggregation feature and a value V local aggregation feature;
performing a Hadamard product operation on the query Q local aggregation feature and the key K local aggregation feature to obtain a product operation result;
sequentially passing the product operation result through a second fully connected layer, a first activation layer, a third fully connected layer and a second activation layer to obtain context-aware information;
and performing a Hadamard product operation on the context-aware information and the value V local aggregation feature to obtain the local information.
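
A sketch of this local branch is given below. Modelling the globally weight-shared depth-wise convolution as a single 3x3 depth-wise convolution applied identically to Q, K and V, and choosing GELU and sigmoid as the two activation layers, are assumptions; the claim leaves these unspecified:

```python
# Sketch under stated assumptions: Q, K, V share one depth-wise conv (local
# aggregation); Q*K passes through FC -> activation -> FC -> activation to
# form context-aware information, which gates V via a Hadamard product.
import torch
import torch.nn as nn

class LocalBranch(nn.Module):
    def __init__(self, dim: int, hidden: int = 128):
        super().__init__()
        self.qkv = nn.Conv2d(dim, 3 * dim, 1)    # linear transformation -> Q, K, V
        self.dwconv = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)  # shared weights
        self.fc1 = nn.Conv2d(dim, hidden, 1)     # "second fully connected layer"
        self.act1 = nn.GELU()                    # first activation layer (assumed)
        self.fc2 = nn.Conv2d(hidden, dim, 1)     # "third fully connected layer"
        self.act2 = nn.Sigmoid()                 # second activation layer (assumed)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        q, k, v = self.qkv(x).chunk(3, dim=1)    # x: (B, C, H, W) normalized feature
        q, k, v = self.dwconv(q), self.dwconv(k), self.dwconv(v)  # local aggregation
        ctx = self.act2(self.fc2(self.act1(self.fc1(q * k))))     # context-aware info
        return ctx * v                           # Hadamard with V -> local information

print(LocalBranch(64)(torch.randn(2, 64, 14, 14)).shape)          # (2, 64, 14, 14)
```
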
5. The target tracking method based on convolution and attention combination feature extraction according to claim 1, wherein the passing of the global information and the local information through the fusion branch of the m-th hierarchical feature module to obtain the fused feature specifically comprises:
performing a cascade operation on the global information and the local information to obtain a cascaded feature;
and passing the cascaded feature through a fourth fully connected layer to obtain the fused feature.
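
This fusion branch reduces to a channel-wise concatenation followed by a single fully connected layer; a minimal sketch follows (PyTorch and channel-last tokens assumed, matching the stub used in the module sketch above):

```python
# Minimal sketch: concatenate global and local information along the channel
# axis, then project back to the original dimension with one linear layer.
import torch
import torch.nn as nn

class FusionBranch(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.fc = nn.Linear(2 * dim, dim)        # "fourth fully connected layer"

    def forward(self, g: torch.Tensor, l: torch.Tensor) -> torch.Tensor:
        return self.fc(torch.cat([g, l], dim=-1))  # cascaded -> fused feature

g, l = torch.randn(2, 196, 64), torch.randn(2, 196, 64)
print(FusionBranch(64)(g, l).shape)              # (2, 196, 64)
```
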
6. A target tracking system based on convolution and attention combination feature extraction, comprising:
the initialization module is used for respectively initializing a given first frame image and a search area in each subsequent frame image to obtain a target template image corresponding to the first frame image and a search image corresponding to each subsequent frame image;
the feature extraction network construction module is used for constructing a feature extraction network; the feature extraction network comprises two feature extraction sub-networks with the same structure, which are used for extracting features from the target template image and the search image respectively; each feature extraction sub-network comprises a convolution module, a plurality of hierarchical feature modules sequentially connected in series, and a first fully connected layer; the plurality of hierarchical feature modules are sequentially denoted as a first hierarchical feature module, a second hierarchical feature module, …, and an M-th hierarchical feature module; the input of the convolution module is the first frame image or each subsequent frame image, the output of the convolution module is the input of the first hierarchical feature module, the output of the M-th hierarchical feature module is the input of the first fully connected layer, and the output of the first fully connected layer is the feature extracted by the feature extraction network; each hierarchical feature module comprises a layer normalization branch, a global branch, a local branch and a fusion branch; the input of the layer normalization branch is the output of the convolution module; the output of the layer normalization branch is fed to both the global branch and the local branch; the global branch is used for extracting global information by means of a self-attention operation; the local branch is used for extracting local information by means of an operation combining attention and convolution; and the fusion branch is used for fusing the global information output by the global branch and the local information output by the local branch;
the feature extraction module is used for inputting the target template image and each search image into their respective feature extraction sub-networks to perform feature extraction, so as to obtain target template image extraction features and search image extraction features;
and the target tracking module is used for inputting the target template image extraction features into a tracking model, and performing a convolution operation between the extraction features of each search image and the output of the tracking model, respectively, so as to obtain the target response position of the search area in each subsequent frame image.
7. The target tracking system based on convolution and attention combination feature extraction according to claim 6, wherein the feature extraction module specifically comprises:
the image cutting unit is used for cutting the target template image and the search image into a plurality of blocks respectively to obtain a cut target template image and a cut search image; the blocks in the cut target template image and the blocks in the cut search image overlap;
the convolution operation unit is used for carrying out convolution processing on an image to be processed by using the convolution module to obtain a convolved feature; the image to be processed is the cut target template image or the cut search image;
the normalization unit is used for inputting the feature output by the (m-1)-th hierarchical feature module into the layer normalization branch of the m-th hierarchical feature module to obtain a normalized feature;
the global feature extraction unit is used for inputting the normalized feature into the global branch of the m-th hierarchical feature module to obtain the global information;
the local feature extraction unit is used for inputting the normalized feature into the local branch of the m-th hierarchical feature module to obtain the local information;
the feature fusion unit is used for passing the global information and the local information through the fusion branch of the m-th hierarchical feature module to obtain a fused feature; wherein m = 1, 2, …, M; when m = 1, the feature output by the (m-1)-th hierarchical feature module is the convolved feature;
and the judging unit is used for judging whether m is equal to M; if not, setting m = m + 1 and returning to the step of inputting the feature output by the (m-1)-th hierarchical feature module into the layer normalization branch of the m-th hierarchical feature module; if yes, inputting the feature output by the M-th hierarchical feature module into the first fully connected layer to obtain the target template image extraction features or the search image extraction features; when the image to be processed is the cut target template image, the first fully connected layer outputs the target template image extraction features; when the image to be processed is the cut search image, the first fully connected layer outputs the search image extraction features.
8. The target tracking system based on convolution and attention combination feature extraction according to claim 7, wherein the global feature extraction unit specifically comprises:
the first linear transformation subunit is used for performing a linear transformation operation on the normalized feature to obtain three feature maps, which are respectively denoted as a global query Q, a global key K and a global value V;
a downsampling operation subunit, configured to perform a downsampling operation on the global key K and the global value V;
and the global feature extraction subunit is used for performing a linear standard attention operation on the global query Q, the downsampled key K and the downsampled value V to obtain the global information.
9. The target tracking system based on convolution and attention combination feature extraction according to claim 7, wherein the local feature extraction unit specifically comprises:
the second linear transformation subunit is used for performing a linear transformation operation on the normalized feature to obtain three feature maps, which are respectively denoted as a local query Q, a local key K and a local value V;
the depth-wise convolution operation subunit is used for respectively performing a depth-wise convolution operation with globally shared weights on the local query Q, the local key K and the local value V to obtain a query Q local aggregation feature, a key K local aggregation feature and a value V local aggregation feature;
the Hadamard product operation subunit is used for performing a Hadamard product operation on the query Q local aggregation feature and the key K local aggregation feature to obtain a product operation result;
the serial processing subunit is used for sequentially passing the product operation result through the second fully connected layer, the first activation layer, the third fully connected layer and the second activation layer, so as to obtain the context-aware information;
and the local feature extraction subunit is used for performing a Hadamard product operation on the context-aware information and the value V local aggregation feature to obtain the local information.
10. The target tracking system based on convolution and attention combination feature extraction according to claim 6, wherein the feature fusion unit specifically comprises:
the cascade operation subunit is used for performing a cascade operation on the global information and the local information to obtain a cascaded feature;
and the fully connected layer processing subunit is used for passing the cascaded feature through a fourth fully connected layer to obtain the fused feature.
CN202311697673.2A 2023-12-12 2023-12-12 Target tracking method and system based on convolution and attention combination feature extraction Active CN117710688B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311697673.2A 2023-12-12 2023-12-12 Target tracking method and system based on convolution and attention combination feature extraction


Publications (2)

Publication Number Publication Date
CN117710688A 2024-03-15
CN117710688B 2024-06-25

Family

ID=90147228

Family Applications (1)

Application Number Title Status
CN202311697673.2A Target tracking method and system based on convolution and attention combination feature extraction Active (granted as CN117710688B)

Country Status (1)

Country Link
CN (1) CN117710688B (en)

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210224564A1 (en) * 2020-01-16 2021-07-22 Samsung Electronics Co., Ltd. Method and apparatus for tracking target
CN114092766A (en) * 2021-11-24 2022-02-25 北京邮电大学 Robot grabbing detection method based on characteristic attention mechanism
CN114463677A (en) * 2022-01-19 2022-05-10 北京工业大学 Safety helmet wearing detection method based on global attention
CN115100235A (en) * 2022-08-18 2022-09-23 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) Target tracking method, system and storage medium
CN115272419A (en) * 2022-09-27 2022-11-01 南昌工程学院 Method and system for tracking aggregation network target based on mixed convolution and self attention
CN116030097A (en) * 2023-02-28 2023-04-28 南昌工程学院 Target tracking method and system based on dual-attention feature fusion network
US20230162522A1 (en) * 2022-07-29 2023-05-25 Nanjing University Of Posts And Telecommunications Person re-identification method of integrating global features and ladder-shaped local features and device thereof
CN116402860A (en) * 2023-04-21 2023-07-07 中国人民解放***箭军工程大学 Unmanned aerial vehicle aerial photographing target tracking method and device with enhanced attention
CN116563337A (en) * 2023-04-11 2023-08-08 武汉大学 Target tracking method based on double-attention mechanism
CN117133035A (en) * 2023-08-25 2023-11-28 华中师范大学 Facial expression recognition method and system and electronic equipment


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
XU LIANG; ZHANG JIANG; ZHANG JING; YANG YAQI: "Robust target tracking algorithm based on VGG network" (基于VGG网络的鲁棒目标跟踪算法), Computer Engineering & Science (计算机工程与科学), no. 08, 15 August 2020 (2020-08-15) *

Also Published As

Publication number Publication date
CN117710688B (en) 2024-06-25

Similar Documents

Publication Publication Date Title
CN109840500B (en) Three-dimensional human body posture information detection method and device
US11783500B2 (en) Unsupervised depth prediction neural networks
CN112712546A (en) Target tracking method based on twin neural network
CN110942512A (en) Indoor scene reconstruction method based on meta-learning
Wang et al. TF-SOD: a novel transformer framework for salient object detection
CN115239765A (en) Infrared image target tracking system and method based on multi-scale deformable attention
Ukwuoma et al. Image inpainting and classification agent training based on reinforcement learning and generative models with attention mechanism
Lin et al. Attention guided network for salient object detection in optical remote sensing images
Glegoła et al. MobileNet family tailored for Raspberry Pi
Lin et al. Efficient and high-quality monocular depth estimation via gated multi-scale network
Zhang et al. Video extrapolation in space and time
Wang et al. EMAT: Efficient feature fusion network for visual tracking via optimized multi-head attention
CN114202454A (en) Graph optimization method, system, computer program product and storage medium
CN109784295A (en) Video stream characteristics recognition methods, device, equipment and storage medium
CN117576149A (en) Single-target tracking method based on attention mechanism
CN117710688B (en) Target tracking method and system based on convolution and attention combination feature extraction
Wang et al. 3D object detection algorithm for panoramic images with multi-scale convolutional neural network
Liu et al. Siamese network with bidirectional feature pyramid for small target tracking
CN115830707A (en) Multi-view human behavior identification method based on hypergraph learning
CN115097935A (en) Hand positioning method and VR equipment
Zheng et al. Multi-task View Synthesis with Neural Radiance Fields
CN114332989A (en) Face detection method and system of multitask cascade convolution neural network
Jokela Person counter using real-time object detection and a small neural network
Chen et al. EnforceNet: Monocular camera localization in large scale indoor sparse lidar point cloud
Lyu et al. High-precision and real-time visual tracking algorithm based on the Siamese network for autonomous driving

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant