CN113177546A - Target detection method based on sparse attention module - Google Patents

Target detection method based on sparse attention module

Info

Publication number
CN113177546A
CN113177546A
Authority
CN
China
Prior art keywords
sparse
input
feature
attention
convolution
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110484922.4A
Other languages
Chinese (zh)
Inventor
Chen Chunlin (陈春霖)
Ling Qiang (凌强)
Li Feng (李峰)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Science and Technology of China USTC
Original Assignee
University of Science and Technology of China USTC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Science and Technology of China USTC filed Critical University of Science and Technology of China USTC
Priority to CN202110484922.4A
Publication of CN113177546A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/20: Image preprocessing
    • G06V10/25: Determination of region of interest [ROI] or a volume of interest [VOI]
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/25: Fusion techniques
    • G06F18/253: Fusion techniques of extracted features
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a target detection method based on a sparse attention module, which comprises the following steps. Step 1: inputting a convolution feature map into the sparse attention module. Step 2: performing a sparse position sampling operation on the feature map input in step 1, searching for the set of positions of the most expressive sparse features. Step 3: applying a convolution transformation to the feature map input in step 1, sampling it with the sparse position set to obtain sparse key-value feature pairs, and then computing the attention matrix between the query features and each key feature. Step 4: computing a weighted sum of the value features according to the attention matrix to obtain the attention-fused features, adding them to the input feature map through a shortcut connection, and outputting the feature map enhanced by the sparse attention module.

Description

Target detection method based on sparse attention module
Technical Field
The invention relates to the fields of digital image processing, target detection and deep learning, and in particular to a target detection method based on a sparse attention module.
Background
Object detection is a fundamental computer vision task, and over the last few years many advanced object detection methods have been built on convolutional networks. Under the local linear weighting operation of a conventional convolutional layer, however, it is difficult to capture global context information effectively.
Some recent work has therefore focused on fusing context information through more flexible network computing architectures. In the prior art, deformable convolution layers dynamically adjust the sampling positions of the convolution kernel; since these sampling positions can reach distant locations in image space, long-range dependencies can be modeled more effectively and context information about the surroundings can be extracted. Other scholars proposed the non-local module to model long-range dependencies; by aggregating context information from any two locations of the input feature map, it successfully applied the self-attention mechanism to visual tasks such as video classification, object detection and keypoint detection. Here, each location is associated with all other locations in the image feature space through a dense attention map, and for a given location the context information is aggregated through a weighted sum over all features. Without additional modification, non-local modules can improve the performance of existing networks in various image tasks (e.g., video classification, object detection and keypoint detection).
Although non-local networks perform excellently, they introduce considerable extra computation and GPU memory footprint, because a non-local operation requires a dense attention map describing the relationship between every pair of locations of the input feature map. For example, given an input feature map with spatial resolution H × W, a non-local operation computes an attention map of size (HW × HW); as the spatial resolution of the input grows, the attention map grows quadratically (doubling H and W quadruples HW and enlarges the attention map sixteen-fold), so the required computation and GPU memory are high. This is particularly acute for object detection: to detect objects of all scales in the input image, the input resolution is typically large, so the convolution feature maps in the network typically have high resolution. In practical applications, a non-local based detection network therefore incurs high computational complexity and a very large GPU memory cost. This memory-unfriendly computing mechanism limits the application of such non-local networks.
Disclosure of Invention
In order to solve the above technical problem, the invention provides a target detection method based on a sparse attention module, which captures long-range dependencies in image space and improves the context information extraction capability of the model. For a given input feature map, the relationship between the query and key elements is modeled by dynamically selecting a set of sparse point locations found by searching for local response peaks in a heatmap. Using the obtained sparse point locations, the sparse attention module models long-range dependencies well and greatly improves target detection performance; moreover, the module is very lightweight, with the extra GPU memory and computation it introduces amounting to less than 2% of those of a conventional non-local module. Such a sparse attention module can be easily inserted into various target detection frameworks, yielding significant improvements in detection results at almost negligible computational and memory overhead.
The sparse attention module improves the expressive power of the detection network's feature extraction and the model's ability to extract context information. The proposed module can be easily inserted into a generic detection framework, yielding a better balance of speed and accuracy.
The technical scheme of the invention is as follows: a target detection method based on a sparse attention module, specifically comprising the following steps:
Step 1: inputting the convolution feature map into the sparse attention module;
Step 2: performing a sparse position sampling operation on the convolution feature map input in step 1, searching for the set of positions of the most expressive sparse features;
Step 3: applying a convolution transformation to the convolution feature map input in step 1, sampling it with the sparse position set to obtain sparse key-value feature pairs, and then computing the attention matrix between the query features and each key feature;
Step 4: computing a weighted sum of the value features according to the attention matrix to obtain the attention-fused features, adding them to the input feature map through a shortcut connection, and outputting the feature map enhanced by the sparse attention module.
The operation of the sparse attention module is mathematically represented as:
Z = X + W_z(Y)
Y = softmax(Q^T s(Q)) s(V)
where Q = θ(X) and V = g(X) are each obtained by passing the input features X through a 1×1 convolution layer, and W_z(·) denotes a 1×1 convolution layer; s(·) ∈ R^(C'×N) is a sparse sampling operation that samples the N most representative features from the given HW feature vectors; Z is the output feature and Y is the attention-fused feature.
Further, the sparse sampling operation of step 2 includes: in the sparse position search block, a channel-wise mean operation is applied to the input features X, generating a feature response heatmap H_p ∈ R^(H×W); H_p is a matrix representing the input feature response. The key locations are then obtained by searching for the local peaks of these responses:
P = { i | x_i = max_{j∈Ω_i} x_j }
where i is the spatial index of a feature, ranging from 1 to HW, and Ω_i is a window centered at position i. If x_i is the maximum over its neighboring pixels, then i is the position of a local peak in the response heatmap. All local peak positions constitute a set P describing the locations of the most valuable features in the input feature map.
Further, step 3 includes: given an input convolution feature map X ∈ R^(C×H×W), where H is the image height, W is the image width, and C is the number of convolution channels, the feature X is first passed through two 1×1 convolution layers θ(·) and g(·) to obtain two feature maps Q and V, thereby reducing the number of convolution channels from C to C'.
Further, step 4 specifically includes: the attention-fused feature is expressed mathematically as:
Y = softmax(Q^T s(Q)) s(V)
where Q = θ(X) and V = g(X) are each obtained by passing the input features X through a 1×1 convolution layer; s(·) ∈ R^(C'×N) is a sparse sampling operation that samples the N most representative features from the given HW feature vectors.
The module can be easily inserted into any existing target detection architecture through a residual connection, without destroying the feature extraction capability of the pre-trained detection network, and can be expressed as:
Z = X + W_z(Y)
where X is the original input feature, Z is the final output feature of the attention module, and W_z(·) denotes a 1×1 convolution layer. When the module is embedded into a detection network, the weight of W_z(·) is initialized to zero in order to preserve the feature extraction performance of the pre-trained base network.
The method further comprises step 5: inserting the sparse attention module into the backbone of a generic target detection network to construct a new detection network.
Advantageous effects
(1) The invention improves the context information extraction capability
It is well known that context information about surrounding objects is very beneficial for object detection, especially for the identification and localization of targets. Information about distant targets in image space can aid recognition of the current target, so such context information can be captured by strengthening the long-range dependencies of image space. The sparse attention module provided by the invention effectively captures long-range dependencies in image space and improves the model's context extraction capability. The module selects the most representative locations for long-range dependency modeling by searching for local peaks in the input feature response heatmap. It can be conveniently inserted into a generic target detection framework and stably improves detection accuracy.
(2) The invention reduces the time consumption, memory occupation and parameter quantity of model processing
The improvement provided by the invention is an optimization of the dense attention module and is very simple and effective. For a given input feature map, the relationship between the query and key elements is modeled by dynamically selecting a set of sparse point locations found by searching for local response peaks in a heatmap. Using the obtained sparse point locations, the proposed sparse attention module models long-range dependencies well and greatly improves target detection performance. The module is very lightweight: the extra GPU memory and computation it introduces are less than 2% of those of a conventional non-local module, and its parameter count is also greatly reduced, giving it high value for practical industrial use.
Drawings
FIG. 1: the sparse attention module;
FIG. 2: the sparse sampling operation;
FIG. 3: application of the sparse attention module in a generic detection network;
FIG. 4: an example of detection results.
Detailed Description
The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only a part of the embodiments of the present invention rather than all of them; all other embodiments obtained by a person of ordinary skill in the art based on these embodiments without creative effort fall within the protection scope of the present invention.
According to an embodiment of the present invention, a target detection method based on a sparse attention module is provided, specifically comprising the following steps:
Step 1: inputting the convolution feature map into the sparse attention module;
Step 2: performing a sparse position sampling operation on the convolution feature map input in step 1, searching for the set of positions of the most expressive sparse features;
Step 3: applying a convolution transformation to the convolution feature map input in step 1, sampling it with the sparse position set to obtain sparse key-value feature pairs, and then computing the attention matrix between the query features and each key feature;
Step 4: computing a weighted sum of the value features according to the attention matrix to obtain the attention-fused features, adding them to the input feature map through a shortcut connection, and outputting the feature map enhanced by the sparse attention module.
According to an embodiment of the present invention, the processing flow of the sparse attention module is shown in FIG. 1. Given an input feature X ∈ R^(C×H×W), where H is the image height, W is the image width, and C is the number of convolution channels, the feature X is first passed through two 1×1 convolution layers θ(·) and g(·) to obtain two feature maps Q and V, thereby reducing the number of convolution channels from C to C'. To accommodate the matrix multiplication of the attention fusion operation, the last two dimensions of Q and V are each flattened, yielding matrices in R^(C'×HW). To improve computational efficiency, in one embodiment of the present invention C' may be chosen as C/r to reduce the number of channels of both features, where r is the reduction rate. To best balance model inference speed and detection accuracy, r = 4 in practice.
The operation of the sparse attention module may be mathematically expressed as:
Z = X + W_z(Y)
Y = softmax(Q^T s(Q)) s(V)
where Q = θ(X) and V = g(X) are each obtained by passing the input features X through a 1×1 convolution layer, and W_z(·) denotes a 1×1 convolution layer; s(·) ∈ R^(C'×N) is a sparse sampling operation that samples the N most representative features from the given HW feature vectors; Z is the output and Y is the attention-fused feature.
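For concreteness, the following is a minimal PyTorch-style sketch of the attention fusion Y = softmax(Q^T s(Q)) s(V). It is an illustrative reading of the formulas above rather than the patented implementation; the class name, the reduction argument and the sparse index tensor idx (assumed to come from the position search described below) are all hypothetical.

import torch
import torch.nn as nn

class SparseAttentionFusion(nn.Module):
    """Sketch of Y = softmax(Q^T s(Q)) s(V); illustrative, not the patented code."""

    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        c_red = channels // reduction                            # C' = C / r, r = 4 in practice
        self.theta = nn.Conv2d(channels, c_red, kernel_size=1)   # theta(.), produces Q
        self.g = nn.Conv2d(channels, c_red, kernel_size=1)       # g(.), produces V

    def forward(self, x: torch.Tensor, idx: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W); idx: (B, N) flat indices of the sparse position set
        b, _, h, w = x.shape
        q = self.theta(x).flatten(2)                             # (B, C', HW)
        v = self.g(x).flatten(2)                                 # (B, C', HW)
        gather_idx = idx.unsqueeze(1).expand(-1, q.size(1), -1)  # (B, C', N)
        k = q.gather(2, gather_idx)                              # s(Q): sampled key features
        sv = v.gather(2, gather_idx)                             # s(V): sampled value features
        attn = torch.softmax(q.transpose(1, 2) @ k, dim=-1)      # (B, HW, N) attention matrix
        y = (attn @ sv.transpose(1, 2)).transpose(1, 2)          # (B, C', HW)
        return y.reshape(b, -1, h, w)                            # attention-fused feature Y

Note that the keys are gathered from Q itself rather than produced by a separate convolution, reflecting the feature sharing mechanism discussed below, and that the attention matrix has size (HW × N) rather than (HW × HW).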
Compared with a conventional dense attention module, the sparse attention module proposed by the present invention has two main differences: (1) it introduces a sparse sampling operation s(·) to sample N elements as the key and value elements, which correspond to K and V in a conventional attention module; (2) it extracts the key elements K from the feature map Q, so that the key elements share the same input feature as the query elements, whereas a conventional attention module extracts a new feature K using a separate convolution transform. This feature sharing mechanism between query and key elements causes almost no degradation in detection performance while greatly reducing the parameters and computational load of the entire module.
The two differences described above make the proposed sparse attention module more lightweight and very economical in GPU memory. It can be easily inserted into any existing target detection architecture through a residual connection, without destroying the feature extraction capability of the pre-trained detection network, and can be expressed as:
Z = X + W_z(Y)
where X is the original input feature and W_z(·) denotes a 1×1 convolution layer. To preserve the feature extraction performance of the pre-trained base network when the module is embedded into a detection network, the weight of W_z(·) is usually initialized to zero.
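For concreteness, a sketch of this residual insertion follows, reusing the SparseAttentionFusion sketch above together with a sparse_positions helper that is sketched after the sampling description below; both names are hypothetical, and this is one plausible reading rather than the patented code.

import torch.nn as nn

class SparseAttentionBlock(nn.Module):
    """Sketch of Z = X + W_z(Y); W_z starts at zero, so the block is an
    identity mapping when first embedded into a pre-trained network."""

    def __init__(self, channels: int, reduction: int = 4, rho: float = 0.01):
        super().__init__()
        self.rho = rho
        self.fusion = SparseAttentionFusion(channels, reduction)
        self.w_z = nn.Conv2d(channels // reduction, channels, kernel_size=1)
        nn.init.zeros_(self.w_z.weight)       # zero init: Z = X at the start of training
        nn.init.zeros_(self.w_z.bias)

    def forward(self, x):
        idx = sparse_positions(x, self.rho)   # sparse position set P_N (sketched below)
        y = self.fusion(x, idx)               # attention-fused feature Y with C' channels
        return x + self.w_z(y)                # Z = X + W_z(Y), back to C channels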
The process of the sparse sampling operation according to one embodiment of the present invention is illustrated in FIG. 2. The input features X ∈ R^(C×H×W) are first fed into a sparse position search block, which searches for the N positions P_N that are most representative of the input features. P_N is then used to sample the features at the corresponding N positions of Q or V.
In the sparse position search block, a channel-wise mean operation is applied to the input features X, generating a feature response heatmap H_p ∈ R^(H×W). H_p is a matrix representing the input feature response. In H_p, positions containing more valuable information respond more strongly, while positions containing less important information produce weaker responses. Positions of the feature response heatmap with very strong responses typically correspond to salient object edges. These edges provide the most valuable cues about object locations, which helps to accurately regress the bounding box of each object. The key locations can then be found by searching for the local peaks of these responses:
P = { i | x_i = max_{j∈Ω_i} x_j }
where i is the spatial index of a feature, ranging from 1 to HW, and Ω_i is a 3×3 window centered at position i. If x_i is the maximum over its neighboring pixels, then i is the position of a local peak in the response heatmap. All local peak positions constitute a set P describing the locations of the most valuable features in the input feature map.
To improve the efficiency of the sparse attention module, the local peaks are sorted by their response strength, and only the N peaks with the strongest responses are kept for sparse sampling. The resulting sparse position set is denoted P_N, with N = ρHW, where ρ is set to 0.01 in our experiments.
By sampling the key and value elements only at the sparse position set P_N, the sparse attention module reduces the attention map from (HW × HW) to (HW × ρHW), making it far more efficient in memory and computation than a conventional non-local module.
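A minimal sketch of this sparse position search, assuming a PyTorch-style implementation (the helper name sparse_positions is hypothetical): the channel-wise mean produces H_p, a 3×3 max-pooling comparison marks the local peaks, and a top-k selection keeps the N = ρHW strongest of them.

import torch
import torch.nn.functional as F

def sparse_positions(x: torch.Tensor, rho: float = 0.01) -> torch.Tensor:
    """Return flat indices of the N = rho*H*W strongest local peaks of H_p."""
    b, c, h, w = x.shape
    heat = x.mean(dim=1, keepdim=True)                 # H_p: channel-wise mean, (B, 1, H, W)
    local_max = F.max_pool2d(heat, kernel_size=3, stride=1, padding=1)
    peak_mask = heat == local_max                      # x_i equals the max of its 3x3 window
    scores = torch.where(peak_mask, heat, torch.full_like(heat, float("-inf")))
    n = max(1, int(rho * h * w))                       # N = rho * H * W strongest peaks
    _, idx = scores.flatten(2).topk(n, dim=-1)         # (B, 1, N) flat spatial indices
    return idx.squeeze(1)                              # P_N as indices into the HW positions

If an image yields fewer than N local peaks, the top-k here would also return some non-peak positions; a practical implementation would clamp N to the number of peaks actually found.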
According to an embodiment of the present invention, a target detection network may also be constructed based on the sparse attention module proposed by the present invention.
The sparse attention module proposed by the present invention is a practical, general-purpose module that can be naturally plugged into existing detection networks. ResNet is currently widely used in various detection frameworks, such as Faster R-CNN and RetinaNet, so we insert the proposed sparse attention block into the residual modules of ResNet for feature enhancement.
As shown in fig. 3, the sparse attention module of the present invention is applied to a general detection network, where SA Block is the sparse attention module proposed in the present invention.
In general, a conventional non-local module is used only before the last residual module in the c4 stage of ResNet, given the constraints of computational load and GPU memory consumption. The sparse attention module provided by the invention, however, is lighter and occupies less memory, so multiple such modules can be freely inserted into the backbone network.
Generally, in the whole detection framework, the sparse attention module is added after the middle 3 × 3 convolution layer of the residual module, as shown in fig. 3.
In particular, the invention inserts sparse attention modules at the c3, c4 and c5 stages of ResNet. When a pre-trained model is used to initialize the base network, an additional sparse attention module may destroy the feature extraction capability of the pre-trained network, preventing full use of the pre-trained weights. Therefore, the sparse attention module should produce zero output at the start of training to effectively preserve the feature extraction capability of the pre-trained network; we thus initialize the weights of its last 1×1 convolution layer to 0.
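As an illustrative sketch of this wiring, assuming torchvision's ResNet-50 (whose layer2 to layer4 correspond to stages c3 to c5) and the SparseAttentionBlock sketched earlier; this is one plausible arrangement, not the patented implementation.

import torch.nn as nn
from torchvision.models import resnet50

def insert_sparse_attention(model: nn.Module) -> nn.Module:
    # append an SA block right after the middle 3x3 convolution of every
    # bottleneck block in stages c3-c5 (torchvision's layer2, layer3, layer4)
    for stage in (model.layer2, model.layer3, model.layer4):
        for block in stage:
            mid_channels = block.conv2.out_channels
            block.conv2 = nn.Sequential(block.conv2, SparseAttentionBlock(mid_channels))
    return model

model = insert_sparse_attention(resnet50(weights="IMAGENET1K_V1"))

Because W_z starts at zero, the modified network reproduces the pre-trained backbone exactly at the first forward pass, matching the zero-output initialization described above.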
According to an embodiment of the present invention, after the detection network is trained, it can be used in practical application scenarios to detect objects of interest in input pictures. An example of detection results is shown in FIG. 4.
Although illustrative embodiments of the present invention have been described above to facilitate understanding by those skilled in the art, it should be understood that the present invention is not limited to the scope of these embodiments. Various changes will be apparent to those skilled in the art, and all inventions utilizing the inventive concepts set forth herein are intended to be protected, provided they do not depart from the spirit and scope of the present invention as defined by the appended claims.

Claims (6)

1. A target detection method based on a sparse attention module, characterized by specifically comprising the following steps:
Step 1: inputting the convolution feature map into the sparse attention module;
Step 2: performing a sparse position sampling operation on the convolution feature map input in step 1, searching for the set of positions of the most expressive sparse features;
Step 3: applying a convolution transformation to the convolution feature map input in step 1, sampling it with the sparse position set to obtain sparse key-value feature pairs, and then computing the attention matrix between the query features and each key feature;
Step 4: computing a weighted sum of the value features according to the attention matrix to obtain the attention-fused features, adding them to the input feature map through a shortcut connection, and outputting the feature map enhanced by the sparse attention module.
2. The sparse attention module based object detection method of claim 1,
wherein the operation of the sparse attention module is mathematically represented as:
Z = X + W_z(Y)
Y = softmax(Q^T s(Q)) s(V)
where Q = θ(X) and V = g(X) are each obtained by passing the input features X through a 1×1 convolution layer, and W_z(·) denotes a 1×1 convolution layer; s(·) ∈ R^(C'×N) is a sparse sampling operation that samples the N most representative features from the given HW feature vectors; Z is the output feature and Y is the attention-fused feature.
3. The sparse attention module based object detection method of claim 1,
wherein the sparse sampling operation of step 2 comprises:
in the sparse position search block, a channel-wise mean operation is applied to the input features X, generating a feature response heatmap H_p ∈ R^(H×W); H_p is a matrix representing the input feature response; the key locations are then obtained by searching for the local peaks of these responses:
P = { i | x_i = max_{j∈Ω_i} x_j }
where i is the spatial index of a feature, ranging from 1 to HW, and Ω_i is a window centered at position i; if x_i is the maximum over its neighboring pixels, then i is the position of a local peak in the response heatmap; all local peak positions constitute a set P describing the locations of the most valuable features in the input feature map.
4. The sparse attention module based object detection method of claim 1,
wherein step 3 further comprises: given an input convolution feature map X ∈ R^(C×H×W), where H is the image height, W is the image width, and C is the number of convolution channels, the feature X is first passed through two 1×1 convolution layers θ(·) and g(·) to obtain two feature maps Q and V, thereby reducing the number of convolution channels from C to C'.
5. The sparse attention module based object detection method of claim 1,
wherein step 4 specifically comprises: the attention-fused feature is expressed mathematically as:
Y = softmax(Q^T s(Q)) s(V)
where Q = θ(X) and V = g(X) are each obtained by passing the input features X through a 1×1 convolution layer; s(·) ∈ R^(C'×N) is a sparse sampling operation that samples the N most representative features from the given HW feature vectors;
the module is inserted into any existing target detection architecture through a residual connection, without destroying the feature extraction capability of the pre-trained detection network, and is expressed as:
Z = X + W_z(Y)
where X is the original input feature, Z is the final output feature of the attention module, and W_z(·) denotes a 1×1 convolution layer; when the module is embedded into a detection network, the weight of W_z(·) is initialized to zero in order to preserve the feature extraction performance of the pre-trained base network.
6. The sparse attention module based object detection method of claim 1, further comprising:
step 5: inserting the sparse attention module into the backbone of a generic target detection network to construct a new detection network.
CN202110484922.4A 2021-04-30 2021-04-30 Target detection method based on sparse attention module Pending CN113177546A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110484922.4A CN113177546A (en) 2021-04-30 2021-04-30 Target detection method based on sparse attention module

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110484922.4A CN113177546A (en) 2021-04-30 2021-04-30 Target detection method based on sparse attention module

Publications (1)

Publication Number Publication Date
CN113177546A (en) 2021-07-27

Family

ID=76925880

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110484922.4A Pending CN113177546A (en) 2021-04-30 2021-04-30 Target detection method based on sparse attention module

Country Status (1)

Country Link
CN (1) CN113177546A (en)

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102567997A (en) * 2012-01-04 2012-07-11 西安电子科技大学 Target detection method based on sparse representation and visual cortex attention mechanism
CN107515895A (en) * 2017-07-14 2017-12-26 中国科学院计算技术研究所 A kind of sensation target search method and system based on target detection
US20200272823A1 (en) * 2017-11-14 2020-08-27 Google Llc Weakly-Supervised Action Localization by Sparse Temporal Pooling Network
CN109492232A (en) * 2018-10-22 2019-03-19 内蒙古工业大学 A kind of illiteracy Chinese machine translation method of the enhancing semantic feature information based on Transformer
CN109977774A (en) * 2019-02-25 2019-07-05 中国科学技术大学 A kind of fast target detection method based on adaptive convolution
CN110738247A (en) * 2019-09-30 2020-01-31 中国科学院大学 fine-grained image classification method based on selective sparse sampling
CN111444913A (en) * 2020-03-22 2020-07-24 华南理工大学 License plate real-time detection method based on edge-guided sparse attention mechanism
CN111833246A (en) * 2020-06-02 2020-10-27 天津大学 Single-frame image super-resolution method based on attention cascade network
CN111931795A (en) * 2020-09-25 2020-11-13 湖南大学 Multi-modal emotion recognition method and system based on subspace sparse feature fusion
CN112464097A (en) * 2020-12-07 2021-03-09 广东工业大学 Multi-auxiliary-domain information fusion cross-domain recommendation method and system

Non-Patent Citations (8)

* Cited by examiner, † Cited by third party
Title
ADITYA CHATTOPADHAY et al.: "Grad-CAM++: Generalized Gradient-based Visual Explanations for Deep Convolutional Networks", 2018 IEEE Winter Conference on Applications of Computer Vision (WACV) *
BIAO ZHANG et al.: "Sparse Attention with Linear Units", arXiv *
REWON CHILD et al.: "Generating Long Sequences with Sparse Transformers", arXiv *
SHUN-YAO SHIH et al.: "Temporal Pattern Attention for Multivariate Time Series Forecasting", arXiv *
XIZHOU ZHU et al.: "Deformable DETR: Deformable Transformers for End-to-End Object Detection", arXiv *
YAO DING et al.: "Selective Sparse Sampling for Fine-Grained Image Recognition", 2019 IEEE/CVF International Conference on Computer Vision (ICCV) *
YUE CAO et al.: "GCNet: Non-Local Networks Meet Squeeze-Excitation Networks and Beyond", 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW) *
ZHANG Ruyi: "Research on video object detection algorithms based on deep learning", China Master's Theses Full-text Database (Information Science and Technology) *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115019169A (en) * 2022-05-31 2022-09-06 海南大学 Single-stage water surface small target detection method and device

Similar Documents

Publication Publication Date Title
CN111639692B (en) Shadow detection method based on attention mechanism
Song et al. Monocular depth estimation using laplacian pyramid-based depth residuals
CN112001914A (en) Depth image completion method and device
Akey Sungheetha Classification of remote sensing image scenes using double feature extraction hybrid deep learning approach
CN115497005A (en) YOLOV4 remote sensing target detection method integrating feature transfer and attention mechanism
Heo et al. Monocular depth estimation using whole strip masking and reliability-based refinement
CN110162657B (en) Image retrieval method and system based on high-level semantic features and color features
CN114549913B (en) Semantic segmentation method and device, computer equipment and storage medium
CN113554032B (en) Remote sensing image segmentation method based on multi-path parallel network of high perception
CN116758130A (en) Monocular depth prediction method based on multipath feature extraction and multi-scale feature fusion
CN110633640A (en) Method for identifying complex scene by optimizing PointNet
CN112149526A (en) Lane line detection method and system based on long-distance information fusion
CN113066089A (en) Real-time image semantic segmentation network based on attention guide mechanism
CN111899203A (en) Real image generation method based on label graph under unsupervised training and storage medium
CN115713632A (en) Feature extraction method and device based on multi-scale attention mechanism
Patil et al. Semantic segmentation of satellite images using modified U-Net
Ni et al. Category-level assignment for cross-domain semantic segmentation in remote sensing images
CN112800932B (en) Method for detecting remarkable ship target in offshore background and electronic equipment
CN112906800B (en) Image group self-adaptive collaborative saliency detection method
CN108154522B (en) Target tracking system
CN113177546A (en) Target detection method based on sparse attention module
Singh et al. Wavelet based histogram of oriented gradients feature descriptors for classification of partially occluded objects
Pang et al. PTRSegNet: A Patch-to-Region Bottom-Up Pyramid Framework for the Semantic Segmentation of Large-Format Remote Sensing Images
Mujtaba et al. Automatic solar panel detection from high-resolution orthoimagery using deep learning segmentation networks
Pal et al. MAML-SR: Self-adaptive super-resolution networks via multi-scale optimized attention-aware meta-learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication (Application publication date: 20210727)