CN113627590A - Attention module and attention mechanism of convolutional neural network and convolutional neural network - Google Patents
- Publication number
- CN113627590A (application number CN202110863925.9A)
- Authority
- CN
- China
- Prior art keywords
- attention
- output
- vector
- branch
- layer
- Prior art date
- Legal status
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Abstract
The application relates to an attention module, an attention mechanism, and a convolutional neural network. The attention module first uses deformable convolution to extract features in the horizontal and vertical directions separately, so that the subsequent encoding can capture the position information of objects. The attention module then captures long-range dependencies along one spatial direction while retaining precise position information along the other, so that information in both the vertical and horizontal directions is preserved. After a series of transformations, attention vectors are obtained and multiplied element-wise, as weighting factors, back onto the original feature vector. In this way spatial attention and channel attention are fused, the problem that existing attention mechanisms cannot operate on space and channels in a unified way is solved, and the accuracy of the convolutional neural network can be improved.
Description
Technical Field
The application relates to the technical field of deep learning, in particular to an attention module and an attention mechanism of a convolutional neural network and the convolutional neural network.
Background
As perception algorithms are industrialized, deep learning (deep neural networks) is gradually evolving toward being both accurate and fast, and end-to-end deployment of neural network algorithms on vehicle-mounted chips is of great significance. More and more methods therefore focus on changing the neural network structure, continually reducing the width and depth of the backbone network, which in turn causes a loss of accuracy. For tasks with high accuracy requirements, such as vehicle detection in autonomous driving scenarios, this loss of accuracy can hardly be tolerated: for safety reasons, the detection rate for all vehicles visible to the naked eye must approach 100%.
At present, the convolutional neural network, as one kind of deep neural network, is widely applied to vision-based target detection and recognition in autonomous driving scenarios, and research shows that better results can be obtained by introducing an attention mechanism into the convolutional neural network.
Currently, attention mechanisms are introduced into convolutional neural networks, and these schemes are roughly classified into the following two categories:
The attention mechanism applied to the spatial dimensions of the feature map: human vision focuses on important regions of an image and ignores unimportant parts. Compared with processing the entire image, finely processing only the image information of a particular region during training significantly reduces the computation and the training and detection time, extracts more information from the specific region, and enhances the generalization ability of the network model.
The attention mechanism applied to the channel dimension of the feature map: the core of a convolutional neural network is the convolution operation, whose kernels extract image features across the spatial and channel dimensions. Applying an attention mechanism to the channel dimension uncovers the internal relationships among channels and can significantly improve the feature-extraction performance of the convolutional neural network.
Among prior-art schemes that introduce an attention mechanism into a convolutional neural network, the patent with application number CN201910769868.0, for example, discloses an SSD object-detection method based on an SE module, which belongs to the second category above. After acquiring the picture or video to be recognized, it replaces the first convolutional layer of the convolutional neural network ResNet18 with a 3×3 convolutional layer, adds an SE module to the first and second residual blocks of ResNet18 to form an SE-ResNet18 network structure, substitutes SE-ResNet18 for the backbone network of the SSD object-detection algorithm to obtain a detection model, trains that model for small-object detection to obtain a trained deep neural network model, and then detects small objects in the picture or video with the trained model to obtain a detection result. In this patent, the interdependencies between channels are efficiently constructed by simply squeezing each 2-dimensional feature map. However, it only re-weights the importance of each channel by modeling the channel relationships and neglects position information, which is important for generating spatially selective attention maps; the regression accuracy of object positions therefore remains deficient.
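The squeeze-and-excitation (SE) channel attention described above can be sketched in a few lines of NumPy. This is a minimal illustration of the squeeze-excite-reweight pattern; the toy shapes, random weights, and reduction ratio are illustrative assumptions, not the configuration of patent CN201910769868.0:

```python
import numpy as np

def se_block(x, w1, w2):
    """Squeeze-and-excitation over a C x H x W feature map.

    Squeeze: global average pool each channel to a scalar.
    Excite: two fully connected layers with a sigmoid gate.
    Reweight: scale each channel of x by its gate value in (0, 1).
    """
    z = x.mean(axis=(1, 2))                  # squeeze: (C,)
    s = np.maximum(w1 @ z, 0.0)              # FC + ReLU: (C/r,)
    g = 1.0 / (1.0 + np.exp(-(w2 @ s)))      # FC + sigmoid gate: (C,)
    return x * g[:, None, None]              # channel-wise reweighting

# Toy shapes: C=8, H=W=4, reduction ratio r=4.
rng = np.random.default_rng(0)
x = rng.standard_normal((8, 4, 4))
w1 = rng.standard_normal((2, 8)) * 0.1       # C -> C/r
w2 = rng.standard_normal((8, 2)) * 0.1       # C/r -> C
y = se_block(x, w1, w2)
assert y.shape == x.shape
```

Note that the gate depends only on per-channel means, which is exactly the loss of position information criticized here: every spatial location in a channel is scaled by the same factor.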
As another example, the patent with application number 202010595050.4 discloses a real-time vehicle detection method for unmanned aerial vehicles based on a convolutional neural network: it first clusters nine anchor boxes, builds a shallow neural network, adds an attention mechanism and an adaptive tensor-selection module, and then trains and tests on an embedded device. The shallow neural network has few parameters, is suitable for running on the Jetson TX2 embedded device of an unmanned aerial vehicle, and meets the requirement of real-time performance. After feature fusion based on the feature pyramid network, the adaptive tensor-selection module lets the network choose the most appropriate detection tensor according to the semantic information of the target, further improving the accuracy of model detection. However, this patent adds a CBAM attention mechanism between the convolutional layers, in which spatial attention and channel attention operate separately from each other; this reduces the spatial correlation of targets and makes further accuracy improvements difficult.
Disclosure of Invention
The embodiments of the present application provide an attention module and an attention mechanism of a convolutional neural network, and a convolutional neural network. They address the problem that existing attention mechanisms cannot operate on space and channels in a unified way, and can improve the accuracy of the convolutional neural network.
In one aspect, an attention module of a convolutional neural network is provided in an embodiment of the present application, including:
an attention vector generation unit configured to feed the feature vectors input by the residual module to the first branch and the second branch; the first branch is configured to perform deformable convolution operation, channel attenuation operation and global pooling operation on the feature vectors in the horizontal direction, and the second branch is configured to perform deformable convolution operation, channel attenuation operation and global pooling operation on the feature vectors in the vertical direction;
the attention vector generation unit is also configured to splice the output of the first branch and the output of the second branch to obtain a spliced vector, and to transform the spliced vector using a convolutional transformation function; and to feed the transformed spliced vector to a fully connected layer, and perform convolution operations on the input of the fully connected layer in the horizontal direction and the vertical direction respectively to obtain the attention vector of the input feature vector in the horizontal direction and the attention vector of the input feature vector in the vertical direction.
Optionally, the first branch includes a first deformable convolution layer, a first convolution layer, and a first global pooling layer, which are connected in sequence;
the output of the first deformable convolution layer is:

$$y(p_x) = \sum_{p_n \in R} w(p_n) \cdot x(p_0 + p_n + \Delta p_x)$$

where $p_0$ denotes each feature point in the input feature vector; $y(p_x)$ denotes the position of $p_0$ in the horizontal direction after the deformable convolution operation; $R$ denotes the convolution kernel and $p_n$ enumerates the positions in $R$; $w(p_n)$ is the kernel weight at $p_n$ and $x(\cdot)$ the input feature; and $\Delta p_x$ denotes the offset in the horizontal direction.
Optionally, the data dimension of the input feature vector is C × H × W; the data dimension of the output of the first deformable convolution layer is C × (W + H);
the convolution kernel size of the first convolution layer is 1x1, and the data dimension of the output of the first convolution layer is C/r x (W + H);
the data dimension of the output of the first global pooling layer is C/r × H × 1.
Optionally, the output of the $c/r$-th channel in the output of the first branch is:

$$z^h_{c/r}(h) = \frac{1}{W} \sum_{0 \le i < W} x_{c/r}(h, i)$$

where $x_{c/r}(h, i)$ denotes the $i$-th feature point in the $c/r$-th channel at height $h$ in the output of the first convolution layer.
Optionally, the second branch includes a second deformable convolution layer, a second convolution layer and a second global pooling layer, which are connected in sequence;
the output of the second deformable convolution layer is:

$$y(p_y) = \sum_{p_n \in R} w(p_n) \cdot x(p_0 + p_n + \Delta p_y)$$

where $p_0$ denotes each feature point in the input feature vector; $y(p_y)$ denotes the position of $p_0$ in the vertical direction after the deformable convolution operation; $R$ denotes the convolution kernel and $p_n$ enumerates the positions in $R$; and $\Delta p_y$ denotes the offset in the vertical direction.
Optionally, the data dimension of the input feature vector is C × H × W; the output of the second deformable convolution layer has a data dimension of C × (W + H);
the convolution kernel size of the second convolution layer is 1x1, and the data dimension of the output of the second convolution layer is C/r x (W + H);
the output of the second global pooling layer has a data dimension of C/r × 1 × W.
Optionally, the output of the $c/r$-th channel in the output of the second branch is:

$$z^w_{c/r}(w) = \frac{1}{H} \sum_{0 \le j < H} x_{c/r}(j, w)$$

where $x_{c/r}(j, w)$ denotes the $j$-th feature point in the $c/r$-th channel at width $w$ in the output of the second convolution layer.
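Per channel, the two pooled outputs above reduce to row means and column means. A quick NumPy check of the two expressions (symbols and shapes are as in the claims; the data and the channel count after attenuation are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
Cr, H, W = 2, 4, 5                       # C/r channels after channel attenuation
x = rng.standard_normal((Cr, H, W))      # stand-in for the 1x1 convolution output

# First branch:  z_h(c, h) = (1/W) * sum_i x(c, h, i)  ->  C/r x H x 1
zh = x.mean(axis=2, keepdims=True)
# Second branch: z_w(c, w) = (1/H) * sum_j x(c, j, w)  ->  C/r x 1 x W
zw = x.mean(axis=1, keepdims=True)

# Element-wise reference for one entry of each expression.
assert np.isclose(zh[1, 2, 0], sum(x[1, 2, i] for i in range(W)) / W)
assert np.isclose(zw[0, 0, 3], sum(x[0, j, 3] for j in range(H)) / H)
assert zh.shape == (Cr, H, 1) and zw.shape == (Cr, 1, W)
```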
Optionally, the system further comprises a weight distribution unit;
and the weight distribution unit is configured to perform weight distribution on the input feature vector based on the attention vector in the horizontal direction and the attention vector in the vertical direction to obtain a weighted feature vector.
In another aspect, an embodiment of the present application provides an attention mechanism of a convolutional neural network, including:
feeding the feature vectors input by the residual module to the first branch and the second branch; the first branch is configured to perform deformable convolution operation, channel attenuation operation and global pooling operation on the feature vectors in the horizontal direction, and the second branch is configured to perform deformable convolution operation, channel attenuation operation and global pooling operation on the feature vectors in the vertical direction;
splicing the output of the first branch and the output of the second branch to obtain a splicing vector;
transforming the splicing vectors by using a convolution transformation function;
feeding the splicing vectors after the transformation processing to a full connection layer;
and performing convolution operation on the input of the full connection layer in the horizontal direction and the vertical direction respectively to obtain the attention vector of the input feature vector in the horizontal direction and the attention vector of the input feature vector in the vertical direction.
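The steps above can be sketched at the shape level in NumPy. This is an illustrative simplification under stated assumptions: plain matrix multiplies stand in for the deformable convolutions and the 1×1 convolutions, all weights are random, and the final broadcast multiplication follows the weight-distribution step; it is not the patented implementation:

```python
import numpy as np

C, H, W, r = 16, 8, 6, 4
rng = np.random.default_rng(1)
x = rng.standard_normal((C, H, W))          # feature vector from the residual module

# Channel attenuation: a 1x1 convolution is a matrix multiply over channels.
w_reduce = rng.standard_normal((C // r, C)) * 0.1
xr = np.einsum('oc,chw->ohw', w_reduce, x)  # C/r x H x W

# First branch: global pooling along the width   -> C/r x H x 1
zh = xr.mean(axis=2, keepdims=True)
# Second branch: global pooling along the height -> C/r x 1 x W
zw = xr.mean(axis=1, keepdims=True)

# Splice the branch outputs along the spatial dimension -> C/r x (H + W)
f = np.concatenate([zh[:, :, 0], zw[:, 0, :]], axis=1)
assert f.shape == (C // r, H + W)

# Transform (1x1 convolution over channels), then split back per direction.
w_t = rng.standard_normal((C // r, C // r)) * 0.1
f = w_t @ f
fh, fw = f[:, :H], f[:, H:]

# Per-direction convolutions restore C channels; ReLU gives the attention vectors.
w_h = rng.standard_normal((C, C // r)) * 0.1
w_w = rng.standard_normal((C, C // r)) * 0.1
gh = np.maximum(w_h @ fh, 0.0)              # C x H, horizontal attention
gw = np.maximum(w_w @ fw, 0.0)              # C x W, vertical attention

# Weight distribution: broadcast-multiply back onto the input features.
y = x * gh[:, :, None] * gw[:, None, :]
assert y.shape == (C, H, W)
```

The sketch makes the fusion explicit: each output element is scaled jointly by a channel-and-row factor and a channel-and-column factor, so spatial and channel attention act together rather than as separate stages.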
In another aspect, embodiments of the present application provide a convolutional neural network, including the attention module provided in the above embodiments.
The attention module, the attention mechanism and the convolutional neural network of the convolutional neural network provided by the embodiment of the application have the following beneficial effects:
the attention module comprises an attention vector generation unit configured to feed the feature vectors input by the residual module to the first branch and the second branch; the first branch is configured to perform deformable convolution operation, channel attenuation operation and global pooling operation on the feature vectors in the horizontal direction, and the second branch is configured to perform deformable convolution operation, channel attenuation operation and global pooling operation on the feature vectors in the vertical direction; the attention vector generation unit is also configured to splice the output of the first branch and the output of the second branch to obtain a spliced vector, and the spliced vector is transformed by using a convolution transformation function; and feeding the spliced vectors subjected to the conversion processing to a full-connection layer, and performing convolution operation on the input of the full-connection layer in the horizontal direction and the vertical direction respectively to obtain the attention vector of the input feature vector in the horizontal direction and the attention vector of the input feature vector in the vertical direction. Therefore, the attention of the space and the attention of the channel can be fused through the design of the double branches, the problem that the operation of the existing attention mechanism is unified on the space and the channel is solved, the characteristic extraction effect can be enhanced, and the precision of the convolutional neural network can be improved.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed in the description of the embodiments are briefly introduced below. The drawings described below are only some embodiments of the present application; those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic diagram of an attention module of a convolutional neural network provided in an embodiment of the present application;
fig. 2(a) is an example of a feature diagram in a horizontal direction provided by an embodiment of the present application;
fig. 2(b) is an example of a feature diagram in a vertical direction provided by an embodiment of the present application;
fig. 3 is a schematic structural diagram of an attention module according to an embodiment of the present disclosure.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
It should be noted that the terms "first," "second," and the like in the description and claims of this application and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used are interchangeable under appropriate circumstances, such that the embodiments of the application described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
In order to facilitate the following description of the present solution, a few basic concepts will be described first.
Deep neural network: one type of neural network belongs to one branch of machine learning.
Feature (feature): a method of representing an image. Conventional methods represent an image with RGB three-channel pixels. In order to better utilize a computer for recognition, redundant information in RGB needs to be filtered out, and more semantic features need to be extracted. Image features contain some salient information in the image, such as contour edges, color, etc.
Convolutional Neural Network (CNN): a feedforward neural network that is simple to train and generalizes well.
And (3) convolution kernel: the convolution kernel is the core of a convolutional neural network, and is generally regarded as an information aggregation that aggregates spatial (spatial) information and channel-wise (channel-wise) information on a local receptive field.
ReLU (Rectified Linear Unit): the rectified linear unit is an activation function commonly used in artificial neural networks, generally a nonlinear function represented by a ramp function and its variants.
Referring to fig. 1, fig. 1 is a schematic diagram of an attention module of a convolutional neural network according to an embodiment of the present disclosure, where the attention module includes an attention vector generation unit 101;
an attention vector generation unit 101 configured to feed the feature vectors input by the residual module to the first branch and the second branch; the first branch is configured to perform deformable convolution operation, channel attenuation operation and global pooling operation on the feature vectors in the horizontal direction, and the second branch is configured to perform deformable convolution operation, channel attenuation operation and global pooling operation on the feature vectors in the vertical direction;
the attention vector generation unit 101 is further configured to perform splicing processing on the output of the first branch and the output of the second branch to obtain a spliced vector, and perform transformation processing on the spliced vector by using a convolution transformation function; and feeding the spliced vectors subjected to the conversion processing to a full-connection layer, and performing convolution operation on the input of the full-connection layer in the horizontal direction and the vertical direction respectively to obtain the attention vector of the input feature vector in the horizontal direction and the attention vector of the input feature vector in the vertical direction.
The attention module of the convolutional neural network provided by the embodiments of the present application uses deformable convolution for feature extraction in the horizontal and vertical directions, which makes it easier for the subsequent encoding to capture the position information of objects. Specifically, the first branch performs deformable convolution and channel attenuation on the feature vector in the horizontal direction to obtain a feature map in the horizontal direction, as shown in fig. 2(a), which is an example of a horizontal feature map provided by an embodiment of the present application; the second branch performs deformable convolution and channel attenuation on the feature vector in the vertical direction to obtain a feature map in the vertical direction, as shown in fig. 2(b), which is an example of a vertical feature map provided by an embodiment of the present application. That is, each branch performs deformable convolution in only one direction, which enhances the effect of feature extraction compared with the conventional approach of performing deformable convolution in both directions simultaneously, and improves the detection and recognition accuracy of the convolutional neural network.
In addition, through the dual-branch design, the horizontal and vertical feature maps are pooled separately: the attention module captures long-range dependencies along one spatial direction while preserving precise position information along the other, so that information in both directions is retained. Attention vectors are then obtained after a series of transformations and multiplied element-wise, as weighting factors, back onto the original feature vector. In this way, spatial attention and channel attention are fused, the problem that existing attention mechanisms cannot operate on space and channels in a unified way is solved, and the accuracy of the convolutional neural network can be improved.
In an alternative embodiment, as shown in fig. 1, the attention module further includes a weight assignment unit 102;
and a weight distribution unit 102 configured to perform weight distribution on the input feature vector based on the attention vector in the horizontal direction and the attention vector in the vertical direction, so as to obtain a weighted feature vector.
In a specific implementation, please refer to fig. 3, which is a schematic structural diagram of an attention module provided in an embodiment of the present application. The first branch comprises a first deformable convolution layer, a first convolution layer and a first global pooling layer connected in sequence; the output of the first deformable convolution layer is:

$$y(p_x) = \sum_{p_n \in R} w(p_n) \cdot x(p_0 + p_n + \Delta p_x)$$

where $p_0$ denotes each feature point in the input feature vector; $y(p_x)$ denotes the position of $p_0$ in the horizontal direction after the deformable convolution operation; $R$ denotes the convolution kernel and $p_n$ enumerates the positions in $R$; and $\Delta p_x$ denotes the offset in the horizontal direction.
Correspondingly, the second branch comprises a second deformable convolution layer, a second convolution layer and a second global pooling layer connected in sequence; the output of the second deformable convolution layer is:

$$y(p_y) = \sum_{p_n \in R} w(p_n) \cdot x(p_0 + p_n + \Delta p_y)$$

where $p_0$ denotes each feature point in the input feature vector; $y(p_y)$ denotes the position of $p_0$ in the vertical direction after the deformable convolution operation; $R$ denotes the convolution kernel and $p_n$ enumerates the positions in $R$; and $\Delta p_y$ denotes the offset in the vertical direction.
Specifically, the hyperparameters of the deformable convolution are set separately for the horizontal and vertical directions, and the convolution kernel size may be set to 3×3. The first branch and the second branch may have substantially the same structure. Taking the first branch as an example, a Network-in-Network operation is applied to the output of the first deformable convolution layer: the kernel size of the first convolution layer cascaded with it may be set to 1×1, and likewise the kernel size of the second convolution layer cascaded with the second deformable convolution layer is set to 1×1; that is, the 1×1 convolutions implement the channel attenuation. In this way, by adding offsets in the x and y directions to the deformable convolution kernels, the deformably convolved feature maps are obtained together with the cascaded 1×1 convolutions. To reduce the complexity and computational overhead of the model, a suitable reduction ratio r (e.g., 32) is typically used to reduce the number of channels of the original features.
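The effect of the horizontal offsets $\Delta p_x$ can be illustrated with a one-dimensional toy sketch: each kernel tap samples the input at a fractionally offset position resolved by linear interpolation, the 1-D analogue of the bilinear sampling used in deformable convolution. The 3-tap kernel, fixed weights, and hand-set offsets here are illustrative assumptions, not the patent's layer:

```python
import numpy as np

def deform_conv1d(x, weights, offsets):
    """1-D deformable convolution with a 3-tap kernel R = {-1, 0, 1}.

    y(p0) = sum_n w(p_n) * x(p0 + p_n + dp), with the fractional sample
    positions resolved by linear interpolation (np.interp), which clamps
    out-of-range positions to the border values.
    """
    taps = np.array([-1.0, 0.0, 1.0])
    grid = np.arange(len(x), dtype=float)
    y = np.zeros(len(x))
    for p0 in range(len(x)):
        pos = p0 + taps + offsets[p0]        # offset per output position
        samples = np.interp(pos, grid, x)    # bilinear -> linear in 1-D
        y[p0] = np.dot(weights, samples)
    return y

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
w = np.array([0.25, 0.5, 0.25])
# Zero offsets reduce to an ordinary convolution (a moving average here),
# so the interior of the linear ramp is reproduced exactly.
y0 = deform_conv1d(x, w, np.zeros(5))
```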
As shown in fig. 3, the data dimension of the feature vector output by the residual block is C × H × W, where C denotes channels, H the height, and W the width. A two-branch deformable convolution operation is performed on this feature vector; the output data dimension of both the first and second deformable convolution layers is C × (W + H). A 1×1 convolution is then applied in each branch, and the output data dimension of the first and second convolution layers is C/r × (W + H). Global pooling is then performed along the H and W directions respectively, i.e. each channel is encoded along the horizontal and vertical coordinates with a pooling kernel of size (H, 1) or (1, W). The data dimension of the output of the first global pooling layer is then C/r × H × 1, and that of the second global pooling layer is C/r × 1 × W. Finally, the output of the $c/r$-th channel in the output of the first branch (i.e., the output of the first global pooling layer) can be represented by the following expression:

$$z^h_{c/r}(h) = \frac{1}{W} \sum_{0 \le i < W} x_{c/r}(h, i)$$

where $x_{c/r}(h, i)$ denotes the $i$-th feature point in the $c/r$-th channel at height $h$ in the output of the first convolution layer;
the output of the $c/r$-th channel in the output of the second branch (i.e., the output of the second global pooling layer) can be represented by the following expression:

$$z^w_{c/r}(w) = \frac{1}{H} \sum_{0 \le j < H} x_{c/r}(j, w)$$

where $x_{c/r}(j, w)$ denotes the $j$-th feature point in the $c/r$-th channel at width $w$ in the output of the second convolution layer.
These two transformations aggregate features along the two spatial directions respectively, yielding a pair of direction-aware feature maps. Together with the preceding deformable convolutions, they capture long-range dependencies along one spatial direction while preserving precise position information along the other, which helps the network locate objects of interest more accurately.
Furthermore, the outputs of the first and second global pooling layers are each connected to a splicing-transformation layer, which splices the outputs of the two branches into a spliced vector and transforms it with a convolutional transformation function. In this way, features in both the horizontal and vertical directions are well obtained and position information can be accurately located. To make use of the generated representations, the encoded information from the first and second branches undergoes a representation transformation. This transformation should be as simple as possible for driving or wearable application scenarios; it should make full use of the captured position information so that regions of interest can be accurately captured; and it should be able to register differences between the horizontal and vertical features in real time.
The result of the splicing conversion layer transformation can be represented by the following expression:
f = F_1([z^h, z^w])
wherein [·, ·] denotes the concatenation (concat) operation along the spatial dimension; F_1 is the 1×1 convolution transformation function; f is the intermediate feature map encoding spatial information in both the horizontal and vertical directions.
Correspondingly, the output of the splicing conversion layer keeps the current channel attenuation coefficient C/r, with data dimension C/r × 1 × (W + H). The output end of the splicing conversion layer is connected to a full connection layer, which fuses and extracts information in the horizontal and vertical directions; the output of the full connection layer also keeps the channel attenuation coefficient C/r, so its data dimension is likewise C/r × 1 × (W + H). The output of the full connection layer is fed to a third convolutional layer and a fourth convolutional layer respectively for convolution, where the third convolutional layer outputs data of dimension C × H × 1 and the fourth convolutional layer outputs data of dimension C × 1 × W. Finally, the attention vector of the input feature vector in the horizontal direction is obtained through the first ReLU layer, and the attention vector in the vertical direction is obtained through the second ReLU layer.
the above-described attention vector in the horizontal direction can be expressed by the following expression:
g^h = σ(F_h(f^h))
the above-described attention vector in the vertical direction can be expressed by the following expression:
g^w = σ(F_w(f^w))
wherein f^h and f^w are the two tensors obtained by decomposing f along the spatial dimension, with f^h ∈ R^(C/r×H) and f^w ∈ R^(C/r×W); F_h and F_w are convolution transformation functions; σ is the ReLU activation function.
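The splice-transform-split step above can be sketched in NumPy. The 1×1 convolutions F_1, F_h and F_w reduce to matrix multiplications when applied to these (channels × length) descriptors, so random matrices W1, Wh and Ww stand in for their weights here; the function name and the placement of the activation are illustrative assumptions, with σ taken as a ReLU following the patent text:

```python
import numpy as np

def attention_vectors(z_h, z_w, W1, Wh, Ww, act=lambda t: np.maximum(t, 0.0)):
    """Sketch of the splice-transform-split step.

    z_h: (Cr, H) pooled horizontal descriptor; z_w: (Cr, W) pooled
    vertical descriptor, with Cr = C/r. W1 (Cr x Cr), Wh and Ww (C x Cr)
    stand in for the 1x1 convolutions F_1, F_h, F_w; `act` plays the
    role of sigma (a ReLU, per the patent text).
    """
    f = W1 @ np.concatenate([z_h, z_w], axis=1)          # (Cr, H + W) intermediate map
    f_h, f_w = f[:, :z_h.shape[1]], f[:, z_h.shape[1]:]  # decompose f along the spatial dim
    g_h = act(Wh @ f_h)                                  # (C, H) horizontal attention vector
    g_w = act(Ww @ f_w)                                  # (C, W) vertical attention vector
    return g_h, g_w
```

Note that the split restores the H and W segments of the spliced axis, so each attention vector is tied to one spatial direction.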
Finally, g^h and g^w can be expanded, and the weighted feature vector is then obtained according to the following formula:
F_output = F_input × g^h × g^w
wherein F_input is the feature vector input by the residual module, and F_output is the weighted feature vector.
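The weighting step can be sketched as follows; NumPy broadcasting implements the "expansion" of g^h and g^w over the spatial axis each one lacks (the function name is illustrative):

```python
import numpy as np

def reweight(f_input, g_h, g_w):
    """Elementwise reweighting F_output = F_input x g^h x g^w.

    f_input: (C, H, W) feature map from the residual module;
    g_h: (C, H) horizontal attention; g_w: (C, W) vertical attention.
    Inserting singleton axes expands each attention vector to (C, H, W)
    via broadcasting before the elementwise product.
    """
    return f_input * g_h[:, :, None] * g_w[:, None, :]
```

Because both attention vectors are broadcast over the full map, each output element is scaled by one horizontal weight and one vertical weight, jointly encoding its position.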
In summary, the attention module of the convolutional neural network provided by the embodiments of the present application optimizes the conventional attention mechanism: it uses deformable convolution to elastically encode channel and spatial information simultaneously and fuses information in the horizontal and vertical directions, so the target position can be located more precisely and target detection accuracy can be improved in application scenarios such as detection, classification, and segmentation.
In another aspect, an embodiment of the present application further provides an attention mechanism of a convolutional neural network, including:
feeding the feature vectors input by the residual module to the first branch and the second branch; the first branch is configured to perform deformable convolution operation, channel attenuation operation and global pooling operation on the feature vectors in the horizontal direction, and the second branch is configured to perform deformable convolution operation, channel attenuation operation and global pooling operation on the feature vectors in the vertical direction;
splicing the output of the first branch and the output of the second branch to obtain a spliced vector;
transforming the spliced vector by using a convolution transformation function;
feeding the spliced vector after the transformation processing to a full connection layer;
and performing convolution operation on the input of the full connection layer in the horizontal direction and the vertical direction respectively to obtain the attention vector of the input feature vector in the horizontal direction and the attention vector of the input feature vector in the vertical direction.
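The steps of the mechanism above can be checked end to end with a shape-level NumPy sketch. Random matrices stand in for all learned weights, the deformable convolutions and channel attenuation of the two branches are abstracted into the (C/r, H, W) input, and the full connection layer is folded into the 1×1 transform; all names here are illustrative assumptions, not the patent's implementation:

```python
import numpy as np

def attention_shapes(C=8, H=6, W=5, r=2):
    """Shape-level walkthrough of the mechanism: directional pooling,
    splicing, 1x1 transform, and per-direction convolution + ReLU.
    Returns the shapes of the two attention vectors."""
    rng = np.random.default_rng(0)
    Cr = C // r                                       # channel attenuation coefficient
    x = rng.standard_normal((Cr, H, W))               # branch output before pooling
    z = np.concatenate([x.mean(axis=2), x.mean(axis=1)], axis=1)      # splice: (Cr, H + W)
    f = np.maximum(rng.standard_normal((Cr, Cr)) @ z, 0.0)            # 1x1 transform stand-in
    g_h = np.maximum(rng.standard_normal((C, Cr)) @ f[:, :H], 0.0)    # horizontal: (C, H)
    g_w = np.maximum(rng.standard_normal((C, Cr)) @ f[:, H:], 0.0)    # vertical: (C, W)
    return g_h.shape, g_w.shape
```

The two returned shapes, (C, H) and (C, W), match the C × H × 1 and C × 1 × W dimensions stated in the description.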
The attention mechanism of the embodiment of the present application is based on the same inventive concept as the attention module embodiment described above.
In addition, the embodiment of the present application further provides a convolutional neural network, which includes the attention module described in the above embodiment. The convolutional neural network can be trained for detection, classification, segmentation and the like, and can show high accuracy when applied to detection and identification of surrounding vehicles in an automatic driving application scene.
The present application provides experimental results based on a variety of different model structures, as shown in Tables 1 and 2 below:
Model | Params (M) | COCO mAP (%) |
---|---|---|
MobileNetV2 | 4.3 | 41.5 |
MobileNetV2+SE | 4.7 | 41.6 |
MobileNetV2+CBAM | 4.7 | 41.6 |
MobileNetV2+CA | 4.3 | 42.8 |
This application | 4.45 | 43.4 |
Table 1. Detection experiments
Model | Params (M) | Top-1 Acc (%) |
---|---|---|
MobileNetV2 | 3.5 | 72.3 |
MobileNetV2+SE | 3.89 | 73.5 |
MobileNetV2+CBAM | 3.89 | 73.6 |
MobileNetV2+CA | 3.95 | 74.3 |
This application | 4.02 | 74.8 |
Table 2. Classification experiments
In the experiments, SE Attention, CBAM (Convolutional Block Attention Module), CA (Coordinate Attention), and the attention module of the embodiment of the present application are each added on top of the lightweight model MobileNetV2, and detection and classification experiments are carried out respectively; in Tables 1 and 2, the second column gives the parameter count and the third column the performance value. The experimental results show that adding the attention module of the embodiment of the present application yields the best results and improves the accuracy of the network model while keeping the parameter count comparable; clearly, the attention module of the embodiments of the present application benefits target detection and classification more than SE, CBAM, and CA.
It can be seen from the above embodiments that the attention module, attention mechanism, and convolutional neural network provided by the present application address the limitation that conventional attention mechanisms operate uniformly over space and channels, and can improve the accuracy of the convolutional neural network.
It should be noted that: the sequence of the embodiments of the present application is only for description, and does not represent the advantages and disadvantages of the embodiments. And specific embodiments thereof have been described above. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.
The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments.
The above description is only exemplary of the present application and should not be taken as limiting the present application, as any modification, equivalent replacement, or improvement made within the spirit and principle of the present application should be included in the protection scope of the present application.
Claims (10)
1. An attention module for a convolutional neural network, comprising:
an attention vector generation unit configured to feed the feature vectors input by the residual module to the first branch and the second branch; wherein the first branch is configured to perform a deformable convolution operation, a channel attenuation operation and a global pooling operation on the feature vector in a horizontal direction, and the second branch is configured to perform the deformable convolution operation, the channel attenuation operation and the global pooling operation on the feature vector in a vertical direction;
the attention vector generation unit is further configured to perform splicing processing on the output of the first branch and the output of the second branch to obtain a spliced vector, and perform transformation processing on the spliced vector by using a convolution transformation function; feeding the spliced vectors after the transformation processing to a full-connection layer, and performing convolution operation on the input of the full-connection layer in the horizontal direction and the vertical direction respectively to obtain the attention vector of the input feature vector in the horizontal direction and the attention vector of the input feature vector in the vertical direction.
2. The attention module of claim 1, wherein the first branch comprises a first deformable convolutional layer, a first convolutional layer, and a first global pooling layer connected in sequence;
the output of the first deformable convolution layer is:
wherein p_0 represents each feature point in the input feature vector; y(p_x) represents the position of p_0 in the horizontal direction after the deformable convolution operation; R represents the convolution kernel, and p_n is an enumeration of the positions listed in R; Δp_x represents the offset in the horizontal direction.
3. The attention module of claim 2, wherein the data dimension of the input feature vector is C × H × W; the data dimension of the output of the first deformable convolution layer is C × (W + H);
the convolution kernel size of the first convolutional layer is 1 × 1, and the data dimension of the output of the first convolutional layer is C/r × (W + H);
the data dimension of the output of the first global pooling layer is C/r × H × 1.
5. The attention module of claim 1, wherein the second branch comprises a second deformable convolutional layer, a second convolutional layer, and a second global pooling layer connected in sequence;
the output of the second deformable convolution layer is:
wherein p_0 represents each feature point in the input feature vector; y(p_y) represents the position of p_0 in the vertical direction after the deformable convolution operation; R represents the convolution kernel, and p_n is an enumeration of the positions listed in R; Δp_y represents the offset in the vertical direction.
6. The attention module of claim 5, wherein the data dimension of the input feature vector is C × H × W; the output of the second deformable convolution layer has a data dimension of C × (W + H);
the convolution kernel size of the second convolutional layer is 1 × 1, and the data dimension of the output of the second convolutional layer is C/r × (W + H);
and the output of the second global pooling layer has a data dimension of C/r multiplied by 1 multiplied by W.
8. The attention module of claim 1, further comprising a weight assignment unit;
the weight distribution unit is configured to perform weight distribution on the input feature vector based on the attention vector in the horizontal direction and the attention vector in the vertical direction to obtain a weighted feature vector.
9. An attention mechanism for a convolutional neural network, comprising:
feeding the feature vectors input by the residual module to the first branch and the second branch; wherein the first branch is configured to perform a deformable convolution operation, a channel attenuation operation and a global pooling operation on the feature vector in a horizontal direction, and the second branch is configured to perform the deformable convolution operation, the channel attenuation operation and the global pooling operation on the feature vector in a vertical direction;
splicing the output of the first branch and the output of the second branch to obtain a spliced vector;
transforming the spliced vector by using a convolution transformation function;
feeding the spliced vector after the transformation processing to a full connection layer;
and performing convolution operation on the input of the full connection layer in the horizontal direction and the vertical direction respectively to obtain the attention vector of the input feature vector in the horizontal direction and the attention vector of the input feature vector in the vertical direction.
10. A convolutional neural network, comprising the attention module of any one of claims 1-8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110863925.9A CN113627590A (en) | 2021-07-29 | 2021-07-29 | Attention module and attention mechanism of convolutional neural network and convolutional neural network |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113627590A true CN113627590A (en) | 2021-11-09 |
Family
ID=78381553
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113627590A (en) |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108595590A (en) * | 2018-04-19 | 2018-09-28 | 中国科学院电子学研究所苏州研究院 | A kind of Chinese Text Categorization based on fusion attention model |
CN108734290A (en) * | 2018-05-16 | 2018-11-02 | 湖北工业大学 | It is a kind of based on the convolutional neural networks construction method of attention mechanism and application |
CN109993220A (en) * | 2019-03-23 | 2019-07-09 | 西安电子科技大学 | Multi-source Remote Sensing Images Classification method based on two-way attention fused neural network |
CN111832620A (en) * | 2020-06-11 | 2020-10-27 | 桂林电子科技大学 | Image emotion classification method based on double-attention multilayer feature fusion |
CN112580782A (en) * | 2020-12-14 | 2021-03-30 | 华东理工大学 | Channel enhancement-based double-attention generation countermeasure network and image generation method |
CN112651973A (en) * | 2020-12-14 | 2021-04-13 | 南京理工大学 | Semantic segmentation method based on cascade of feature pyramid attention and mixed attention |
CN112861978A (en) * | 2021-02-20 | 2021-05-28 | 齐齐哈尔大学 | Multi-branch feature fusion remote sensing scene image classification method based on attention mechanism |
WO2021115159A1 (en) * | 2019-12-09 | 2021-06-17 | 中兴通讯股份有限公司 | Character recognition network model training method, character recognition method, apparatuses, terminal, and computer storage medium therefor |
Non-Patent Citations (3)
Title |
---|
YUE WANG et al.: "Fusing Distinguish Degree Neural Networks for Relational Classification", 2018 11th International Symposium on Computational Intelligence and Design (ISCID), 9 December 2018 (2018-12-09) *
LI SHENGWU; ZHANG XUANDE: "Multi-domain convolutional neural network visual tracking based on self-attention mechanism", Journal of Computer Applications, no. 08, 31 December 2020 (2020-12-31) *
LEI PENGCHENG; LIU CONG; TANG JIANGANG; PENG DUNLU: "Hierarchical feature fusion attention network for image super-resolution reconstruction", Journal of Image and Graphics, no. 09, 16 September 2020 (2020-09-16) *
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication ||
SE01 | Entry into force of request for substantive examination ||