CN113627590A - Attention module and attention mechanism of convolutional neural network and convolutional neural network - Google Patents
- Publication number
- CN113627590A (application number CN202110863925.9A)
- Authority
- CN
- China
- Prior art keywords
- attention
- output
- vector
- branch
- layer
- Prior art date
- Legal status
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Abstract
The application relates to an attention module, an attention mechanism, and a convolutional neural network. The attention module first uses deformable convolution to extract features in the horizontal and vertical directions separately, so that the subsequent encoding can capture the position information of objects. The attention module then captures long-range dependencies along one spatial direction while retaining precise position information along the other, so that information in both the vertical and horizontal directions is preserved. After a series of transformations, attention vectors are obtained and multiplied element-wise, as weighting factors, back onto the original feature vector. In this way spatial attention and channel attention are fused, the problem that existing attention mechanisms cannot operate on space and channels in a unified way is solved, and the accuracy of the convolutional neural network can be improved.
Description
Technical Field
The application relates to the technical field of deep learning, in particular to an attention module and an attention mechanism of a convolutional neural network and the convolutional neural network.
Background
As perception algorithms are industrialized, deep learning (deep neural networks) is gradually evolving toward being both accurate and fast, and end-to-end deployment of neural network algorithms on vehicle-mounted chips is of great significance. More and more methods therefore focus on changing the neural network structure, continually reducing the width and depth of the backbone network, which in turn causes a loss of accuracy. For tasks with high accuracy requirements, such as vehicle detection in autonomous driving scenarios, this loss of accuracy can hardly be tolerated: for safety reasons, the detection rate for all vehicles visible to the naked eye must approach 100%.
At present, the convolutional neural network, as one kind of deep neural network, is widely applied to vision-based target detection and recognition in autonomous driving scenarios, and research shows that better results can be obtained by introducing an attention mechanism into the convolutional neural network.
Currently, attention mechanisms are introduced into convolutional neural networks, and these schemes are roughly classified into the following two categories:
The attention mechanism applied to the spatial dimensions of the feature map: human vision focuses on important regions of an image and ignores unimportant parts. Compared with processing the entire image, finely processing only the image information of a particular region during training significantly reduces the computation and the training and detection time, extracts more information from the specific region, and enhances the generalization ability of the network model.
The attention mechanism applied to the channel dimension of the feature map: the core of a convolutional neural network is the convolution operation, whose kernels extract image features across the spatial and channel dimensions. Applying an attention mechanism to the channel dimension uncovers the internal relationships among channels and can significantly improve the feature-extraction performance of the convolutional neural network.
Among prior-art schemes that introduce an attention mechanism into a convolutional neural network, the patent with application number CN201910769868.0, for example, discloses an SSD object-detection method based on an SE module, which belongs to the second category above. After acquiring the picture or video to be recognized, it replaces the first convolutional layer of the convolutional neural network ResNet18 with a 3×3 convolutional layer, adds an SE module to the first and second residual blocks of ResNet18 to form an SE-ResNet18 network structure, substitutes SE-ResNet18 for the backbone network of the SSD object-detection algorithm to obtain a detection model, trains that model for small-object detection to obtain a trained deep neural network model, and then detects small objects in the picture or video with the trained model to obtain a detection result. In this patent, the interdependencies between channels are efficiently constructed by simply squeezing each 2-dimensional feature map. However, it only re-weights the importance of each channel by modeling the channel relationships and neglects position information, which is important for generating spatially selective attention maps; the regression accuracy of object positions therefore remains deficient.
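The squeeze-and-excitation (SE) channel attention described above can be sketched in a few lines of NumPy. This is a minimal illustration of the squeeze-excite-reweight pattern; the toy shapes, random weights, and reduction ratio are illustrative assumptions, not the configuration of patent CN201910769868.0:

```python
import numpy as np

def se_block(x, w1, w2):
    """Squeeze-and-excitation over a C x H x W feature map.

    Squeeze: global average pool each channel to a scalar.
    Excite: two fully connected layers with a sigmoid gate.
    Reweight: scale each channel of x by its gate value in (0, 1).
    """
    z = x.mean(axis=(1, 2))                  # squeeze: (C,)
    s = np.maximum(w1 @ z, 0.0)              # FC + ReLU: (C/r,)
    g = 1.0 / (1.0 + np.exp(-(w2 @ s)))      # FC + sigmoid gate: (C,)
    return x * g[:, None, None]              # channel-wise reweighting

# Toy shapes: C=8, H=W=4, reduction ratio r=4.
rng = np.random.default_rng(0)
x = rng.standard_normal((8, 4, 4))
w1 = rng.standard_normal((2, 8)) * 0.1       # C -> C/r
w2 = rng.standard_normal((8, 2)) * 0.1       # C/r -> C
y = se_block(x, w1, w2)
assert y.shape == x.shape
```

Note that the gate depends only on per-channel means, which is exactly the loss of position information criticized here: every spatial location in a channel is scaled by the same factor.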
As another example, the patent with application number 202010595050.4 discloses a real-time vehicle detection method for unmanned aerial vehicles based on a convolutional neural network: it first clusters nine anchor boxes, builds a shallow neural network, adds an attention mechanism and an adaptive tensor-selection module, and then trains and tests on an embedded device. The shallow neural network has few parameters, is suitable for running on the Jetson TX2 embedded device of an unmanned aerial vehicle, and meets the requirement of real-time performance. After feature fusion based on the feature pyramid network, the adaptive tensor-selection module lets the network choose the most appropriate detection tensor according to the semantic information of the target, further improving the accuracy of model detection. However, this patent adds a CBAM attention mechanism between the convolutional layers, in which spatial attention and channel attention operate separately from each other; this reduces the spatial correlation of targets and makes further accuracy improvements difficult.
Disclosure of Invention
The embodiments of the present application provide an attention module and an attention mechanism of a convolutional neural network, and a convolutional neural network. They address the problem that existing attention mechanisms cannot operate on space and channels in a unified way, and can improve the accuracy of the convolutional neural network.
In one aspect, an attention module of a convolutional neural network is provided in an embodiment of the present application, including:
an attention vector generation unit configured to feed the feature vectors input by the residual module to the first branch and the second branch; the first branch is configured to perform deformable convolution operation, channel attenuation operation and global pooling operation on the feature vectors in the horizontal direction, and the second branch is configured to perform deformable convolution operation, channel attenuation operation and global pooling operation on the feature vectors in the vertical direction;
the attention vector generation unit is also configured to splice the output of the first branch and the output of the second branch to obtain a spliced vector, and to transform the spliced vector using a convolutional transformation function; and to feed the transformed spliced vector to a fully connected layer, and perform convolution operations on the input of the fully connected layer in the horizontal direction and the vertical direction respectively to obtain the attention vector of the input feature vector in the horizontal direction and the attention vector of the input feature vector in the vertical direction.
Optionally, the first branch includes a first deformable convolution layer, a first convolution layer, and a first global pooling layer, which are connected in sequence;
the output of the first deformable convolution layer is:

$$y(p_x) = \sum_{p_n \in R} w(p_n) \cdot x(p_0 + p_n + \Delta p_x)$$

where $p_0$ denotes each feature point in the input feature vector; $y(p_x)$ denotes the position of $p_0$ in the horizontal direction after the deformable convolution operation; $R$ denotes the convolution kernel and $p_n$ enumerates the positions in $R$; $w(p_n)$ is the kernel weight at $p_n$ and $x(\cdot)$ the input feature; and $\Delta p_x$ denotes the offset in the horizontal direction.
Optionally, the data dimension of the input feature vector is C × H × W; the data dimension of the output of the first deformable convolution layer is C × (W + H);
the convolution kernel size of the first convolution layer is 1x1, and the data dimension of the output of the first convolution layer is C/r x (W + H);
the data dimension of the output of the first global pooling layer is C/r × H × 1.
Optionally, the output of the $c/r$-th channel in the output of the first branch is:

$$z^h_{c/r}(h) = \frac{1}{W} \sum_{0 \le i < W} x_{c/r}(h, i)$$

where $x_{c/r}(h, i)$ denotes the $i$-th feature point in the $c/r$-th channel at height $h$ in the output of the first convolution layer.
Optionally, the second branch includes a second deformable convolution layer, a second convolution layer and a second global pooling layer, which are connected in sequence;
the output of the second deformable convolution layer is:

$$y(p_y) = \sum_{p_n \in R} w(p_n) \cdot x(p_0 + p_n + \Delta p_y)$$

where $p_0$ denotes each feature point in the input feature vector; $y(p_y)$ denotes the position of $p_0$ in the vertical direction after the deformable convolution operation; $R$ denotes the convolution kernel and $p_n$ enumerates the positions in $R$; and $\Delta p_y$ denotes the offset in the vertical direction.
Optionally, the data dimension of the input feature vector is C × H × W; the output of the second deformable convolution layer has a data dimension of C × (W + H);
the convolution kernel size of the second convolution layer is 1x1, and the data dimension of the output of the second convolution layer is C/r x (W + H);
the output of the second global pooling layer has a data dimension of C/r × 1 × W.
Optionally, the output of the $c/r$-th channel in the output of the second branch is:

$$z^w_{c/r}(w) = \frac{1}{H} \sum_{0 \le j < H} x_{c/r}(j, w)$$

where $x_{c/r}(j, w)$ denotes the $j$-th feature point in the $c/r$-th channel at width $w$ in the output of the second convolution layer.
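Per channel, the two pooled outputs above reduce to row means and column means. A quick NumPy check of the two expressions (symbols and shapes are as in the claims; the data and the channel count after attenuation are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
Cr, H, W = 2, 4, 5                       # C/r channels after channel attenuation
x = rng.standard_normal((Cr, H, W))      # stand-in for the 1x1 convolution output

# First branch:  z_h(c, h) = (1/W) * sum_i x(c, h, i)  ->  C/r x H x 1
zh = x.mean(axis=2, keepdims=True)
# Second branch: z_w(c, w) = (1/H) * sum_j x(c, j, w)  ->  C/r x 1 x W
zw = x.mean(axis=1, keepdims=True)

# Element-wise reference for one entry of each expression.
assert np.isclose(zh[1, 2, 0], sum(x[1, 2, i] for i in range(W)) / W)
assert np.isclose(zw[0, 0, 3], sum(x[0, j, 3] for j in range(H)) / H)
assert zh.shape == (Cr, H, 1) and zw.shape == (Cr, 1, W)
```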
Optionally, the system further comprises a weight distribution unit;
and the weight distribution unit is configured to perform weight distribution on the input feature vector based on the attention vector in the horizontal direction and the attention vector in the vertical direction to obtain a weighted feature vector.
In another aspect, an embodiment of the present application provides an attention mechanism of a convolutional neural network, including:
feeding the feature vectors input by the residual module to the first branch and the second branch; the first branch is configured to perform deformable convolution operation, channel attenuation operation and global pooling operation on the feature vectors in the horizontal direction, and the second branch is configured to perform deformable convolution operation, channel attenuation operation and global pooling operation on the feature vectors in the vertical direction;
splicing the output of the first branch and the output of the second branch to obtain a splicing vector;
transforming the splicing vectors by using a convolution transformation function;
feeding the splicing vectors after the transformation processing to a full connection layer;
and performing convolution operation on the input of the full connection layer in the horizontal direction and the vertical direction respectively to obtain the attention vector of the input feature vector in the horizontal direction and the attention vector of the input feature vector in the vertical direction.
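The steps above can be sketched at the shape level in NumPy. This is an illustrative simplification under stated assumptions: plain matrix multiplies stand in for the deformable convolutions and the 1×1 convolutions, all weights are random, and the final broadcast multiplication follows the weight-distribution step; it is not the patented implementation:

```python
import numpy as np

C, H, W, r = 16, 8, 6, 4
rng = np.random.default_rng(1)
x = rng.standard_normal((C, H, W))          # feature vector from the residual module

# Channel attenuation: a 1x1 convolution is a matrix multiply over channels.
w_reduce = rng.standard_normal((C // r, C)) * 0.1
xr = np.einsum('oc,chw->ohw', w_reduce, x)  # C/r x H x W

# First branch: global pooling along the width   -> C/r x H x 1
zh = xr.mean(axis=2, keepdims=True)
# Second branch: global pooling along the height -> C/r x 1 x W
zw = xr.mean(axis=1, keepdims=True)

# Splice the branch outputs along the spatial dimension -> C/r x (H + W)
f = np.concatenate([zh[:, :, 0], zw[:, 0, :]], axis=1)
assert f.shape == (C // r, H + W)

# Transform (1x1 convolution over channels), then split back per direction.
w_t = rng.standard_normal((C // r, C // r)) * 0.1
f = w_t @ f
fh, fw = f[:, :H], f[:, H:]

# Per-direction convolutions restore C channels; ReLU gives the attention vectors.
w_h = rng.standard_normal((C, C // r)) * 0.1
w_w = rng.standard_normal((C, C // r)) * 0.1
gh = np.maximum(w_h @ fh, 0.0)              # C x H, horizontal attention
gw = np.maximum(w_w @ fw, 0.0)              # C x W, vertical attention

# Weight distribution: broadcast-multiply back onto the input features.
y = x * gh[:, :, None] * gw[:, None, :]
assert y.shape == (C, H, W)
```

The sketch makes the fusion explicit: each output element is scaled jointly by a channel-and-row factor and a channel-and-column factor, so spatial and channel attention act together rather than as separate stages.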
In another aspect, embodiments of the present application provide a convolutional neural network, including the attention module provided in the above embodiments.
The attention module, the attention mechanism and the convolutional neural network of the convolutional neural network provided by the embodiment of the application have the following beneficial effects:
the attention module comprises an attention vector generation unit configured to feed the feature vectors input by the residual module to the first branch and the second branch; the first branch is configured to perform deformable convolution operation, channel attenuation operation and global pooling operation on the feature vectors in the horizontal direction, and the second branch is configured to perform deformable convolution operation, channel attenuation operation and global pooling operation on the feature vectors in the vertical direction; the attention vector generation unit is also configured to splice the output of the first branch and the output of the second branch to obtain a spliced vector, and the spliced vector is transformed by using a convolution transformation function; and feeding the spliced vectors subjected to the conversion processing to a full-connection layer, and performing convolution operation on the input of the full-connection layer in the horizontal direction and the vertical direction respectively to obtain the attention vector of the input feature vector in the horizontal direction and the attention vector of the input feature vector in the vertical direction. Therefore, the attention of the space and the attention of the channel can be fused through the design of the double branches, the problem that the operation of the existing attention mechanism is unified on the space and the channel is solved, the characteristic extraction effect can be enhanced, and the precision of the convolutional neural network can be improved.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed in the description of the embodiments are briefly introduced below. The drawings described below are only some embodiments of the present application; those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic diagram of an attention module of a convolutional neural network provided in an embodiment of the present application;
fig. 2(a) is an example of a feature diagram in a horizontal direction provided by an embodiment of the present application;
fig. 2(b) is an example of a feature diagram in a vertical direction provided by an embodiment of the present application;
fig. 3 is a schematic structural diagram of an attention module according to an embodiment of the present disclosure.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
It should be noted that the terms "first," "second," and the like in the description and claims of this application and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used are interchangeable under appropriate circumstances, such that the embodiments of the application described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
In order to facilitate the following description of the present solution, a few basic concepts will be described first.
Deep neural network: one type of neural network belongs to one branch of machine learning.
Feature (feature): a method of representing an image. Conventional methods represent an image with RGB three-channel pixels. In order to better utilize a computer for recognition, redundant information in RGB needs to be filtered out, and more semantic features need to be extracted. Image features contain some salient information in the image, such as contour edges, color, etc.
Convolutional Neural Network (CNN): a feedforward neural network that is simple to train and generalizes well.
And (3) convolution kernel: the convolution kernel is the core of a convolutional neural network, and is generally regarded as an information aggregation that aggregates spatial (spatial) information and channel-wise (channel-wise) information on a local receptive field.
ReLU (Rectified Linear Unit): the rectified linear unit is an activation function commonly used in artificial neural networks, generally a nonlinear function represented by a ramp function and its variants.
Referring to fig. 1, fig. 1 is a schematic diagram of an attention module of a convolutional neural network according to an embodiment of the present disclosure, where the attention module includes an attention vector generation unit 101;
an attention vector generation unit 101 configured to feed the feature vectors input by the residual module to the first branch and the second branch; the first branch is configured to perform deformable convolution operation, channel attenuation operation and global pooling operation on the feature vectors in the horizontal direction, and the second branch is configured to perform deformable convolution operation, channel attenuation operation and global pooling operation on the feature vectors in the vertical direction;
the attention vector generation unit 101 is further configured to perform splicing processing on the output of the first branch and the output of the second branch to obtain a spliced vector, and perform transformation processing on the spliced vector by using a convolution transformation function; and feeding the spliced vectors subjected to the conversion processing to a full-connection layer, and performing convolution operation on the input of the full-connection layer in the horizontal direction and the vertical direction respectively to obtain the attention vector of the input feature vector in the horizontal direction and the attention vector of the input feature vector in the vertical direction.
The attention module of the convolutional neural network provided by the embodiments of the present application uses deformable convolution for feature extraction in the horizontal and vertical directions, which makes it easier for the subsequent encoding to capture the position information of objects. Specifically, the first branch performs deformable convolution and channel attenuation on the feature vector in the horizontal direction to obtain a feature map in the horizontal direction, as shown in fig. 2(a), which is an example of a horizontal feature map provided by an embodiment of the present application; the second branch performs deformable convolution and channel attenuation on the feature vector in the vertical direction to obtain a feature map in the vertical direction, as shown in fig. 2(b), which is an example of a vertical feature map provided by an embodiment of the present application. That is, each branch performs deformable convolution in only one direction, which enhances the effect of feature extraction compared with the conventional approach of performing deformable convolution in both directions simultaneously, and improves the detection and recognition accuracy of the convolutional neural network.
In addition, through the dual-branch design, the horizontal and vertical feature maps are pooled separately: the attention module captures long-range dependencies along one spatial direction while preserving precise position information along the other, so that information in both directions is retained. Attention vectors are then obtained after a series of transformations and multiplied element-wise, as weighting factors, back onto the original feature vector. In this way, spatial attention and channel attention are fused, the problem that existing attention mechanisms cannot operate on space and channels in a unified way is solved, and the accuracy of the convolutional neural network can be improved.
In an alternative embodiment, as shown in fig. 1, the attention module further includes a weight assignment unit 102;
and a weight distribution unit 102 configured to perform weight distribution on the input feature vector based on the attention vector in the horizontal direction and the attention vector in the vertical direction, so as to obtain a weighted feature vector.
In a specific implementation, please refer to fig. 3, which is a schematic structural diagram of an attention module provided in an embodiment of the present application. The first branch comprises a first deformable convolution layer, a first convolution layer and a first global pooling layer connected in sequence; the output of the first deformable convolution layer is:

$$y(p_x) = \sum_{p_n \in R} w(p_n) \cdot x(p_0 + p_n + \Delta p_x)$$

where $p_0$ denotes each feature point in the input feature vector; $y(p_x)$ denotes the position of $p_0$ in the horizontal direction after the deformable convolution operation; $R$ denotes the convolution kernel and $p_n$ enumerates the positions in $R$; and $\Delta p_x$ denotes the offset in the horizontal direction.
Correspondingly, the second branch comprises a second deformable convolution layer, a second convolution layer and a second global pooling layer connected in sequence; the output of the second deformable convolution layer is:

$$y(p_y) = \sum_{p_n \in R} w(p_n) \cdot x(p_0 + p_n + \Delta p_y)$$

where $p_0$ denotes each feature point in the input feature vector; $y(p_y)$ denotes the position of $p_0$ in the vertical direction after the deformable convolution operation; $R$ denotes the convolution kernel and $p_n$ enumerates the positions in $R$; and $\Delta p_y$ denotes the offset in the vertical direction.
Specifically, the hyperparameters of the deformable convolution are set separately for the horizontal and vertical directions, and the convolution kernel size may be set to 3×3. The first branch and the second branch may have substantially the same structure. Taking the first branch as an example, a Network-in-Network operation is applied to the output of the first deformable convolution layer: the kernel size of the first convolution layer cascaded with it may be set to 1×1, and likewise the kernel size of the second convolution layer cascaded with the second deformable convolution layer is set to 1×1; that is, the 1×1 convolutions implement the channel attenuation. In this way, by adding offsets in the x and y directions to the deformable convolution kernels, the deformably convolved feature maps are obtained together with the cascaded 1×1 convolutions. To reduce the complexity and computational overhead of the model, a suitable reduction ratio r (e.g., 32) is typically used to reduce the number of channels of the original features.
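The effect of the horizontal offsets $\Delta p_x$ can be illustrated with a one-dimensional toy sketch: each kernel tap samples the input at a fractionally offset position resolved by linear interpolation, the 1-D analogue of the bilinear sampling used in deformable convolution. The 3-tap kernel, fixed weights, and hand-set offsets here are illustrative assumptions, not the patent's layer:

```python
import numpy as np

def deform_conv1d(x, weights, offsets):
    """1-D deformable convolution with a 3-tap kernel R = {-1, 0, 1}.

    y(p0) = sum_n w(p_n) * x(p0 + p_n + dp), with the fractional sample
    positions resolved by linear interpolation (np.interp), which clamps
    out-of-range positions to the border values.
    """
    taps = np.array([-1.0, 0.0, 1.0])
    grid = np.arange(len(x), dtype=float)
    y = np.zeros(len(x))
    for p0 in range(len(x)):
        pos = p0 + taps + offsets[p0]        # offset per output position
        samples = np.interp(pos, grid, x)    # bilinear -> linear in 1-D
        y[p0] = np.dot(weights, samples)
    return y

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
w = np.array([0.25, 0.5, 0.25])
# Zero offsets reduce to an ordinary convolution (a moving average here),
# so the interior of the linear ramp is reproduced exactly.
y0 = deform_conv1d(x, w, np.zeros(5))
```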
As shown in fig. 3, the data dimension of the feature vector output by the residual block is C × H × W, where C denotes channels, H the height, and W the width. A two-branch deformable convolution operation is performed on this feature vector; the output data dimension of both the first and second deformable convolution layers is C × (W + H). A 1×1 convolution is then applied in each branch, and the output data dimension of the first and second convolution layers is C/r × (W + H). Global pooling is then performed along the H and W directions respectively, i.e. each channel is encoded along the horizontal and vertical coordinates with a pooling kernel of size (H, 1) or (1, W). The data dimension of the output of the first global pooling layer is then C/r × H × 1, and that of the second global pooling layer is C/r × 1 × W. Finally, the output of the $c/r$-th channel in the output of the first branch (i.e., the output of the first global pooling layer) can be represented by the following expression:

$$z^h_{c/r}(h) = \frac{1}{W} \sum_{0 \le i < W} x_{c/r}(h, i)$$

where $x_{c/r}(h, i)$ denotes the $i$-th feature point in the $c/r$-th channel at height $h$ in the output of the first convolution layer;
the output of the $c/r$-th channel in the output of the second branch (i.e., the output of the second global pooling layer) can be represented by the following expression:

$$z^w_{c/r}(w) = \frac{1}{H} \sum_{0 \le j < H} x_{c/r}(j, w)$$

where $x_{c/r}(j, w)$ denotes the $j$-th feature point in the $c/r$-th channel at width $w$ in the output of the second convolution layer.
These two transformations aggregate features along the two spatial directions respectively, yielding a pair of direction-aware feature maps. Together with the preceding deformable convolutions, they capture long-range dependencies along one spatial direction while preserving precise position information along the other, which helps the network locate objects of interest more accurately.
Furthermore, the outputs of the first and second global pooling layers are each connected to a splicing-transformation layer, which splices the outputs of the two branches into a spliced vector and transforms it with a convolutional transformation function. In this way, features in both the horizontal and vertical directions are well obtained and position information can be accurately located. To make use of the generated representations, the encoded information from the first and second branches undergoes a representation transformation. This transformation should be as simple as possible for driving or wearable application scenarios; it should make full use of the captured position information so that regions of interest can be accurately captured; and it should be able to register differences between the horizontal and vertical features in real time.
The result of the splicing conversion layer transformation can be represented by the following expression:
f = F_1([z^h, z^w])
wherein [·, ·] denotes the concatenation (concat) operation along the spatial dimension; F_1 is the 1×1 convolution transformation function; f is the intermediate feature map encoding spatial information in both the horizontal and vertical directions.
Correspondingly, the output of the splicing conversion layer keeps the current channel attenuation coefficient C/r, with data dimension C/r × 1 × (W + H). The output end of the splicing conversion layer is connected to a full connection layer, which fuses and extracts information in the horizontal and vertical directions; the output of the full connection layer also keeps the channel attenuation coefficient C/r, so its data dimension is likewise C/r × 1 × (W + H). The output of the full connection layer is fed to a third convolutional layer and a fourth convolutional layer respectively for convolution, where the third convolutional layer outputs data of dimension C × H × 1 and the fourth convolutional layer outputs data of dimension C × 1 × W. Finally, the attention vector of the input feature vector in the horizontal direction is obtained through the first ReLU layer, and the attention vector in the vertical direction is obtained through the second ReLU layer.
the above-described attention vector in the horizontal direction can be expressed by the following expression:
g^h = σ(F_h(f^h))
the above-described attention vector in the vertical direction can be expressed by the following expression:
g^w = σ(F_w(f^w))
wherein f^h and f^w are the two tensors obtained by decomposing f along the spatial dimension, with f^h ∈ R^(C/r×H) and f^w ∈ R^(C/r×W); F_h and F_w are convolution transformation functions; σ is the ReLU activation function.
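The splice-transform-split step above can be sketched in NumPy. The 1×1 convolutions F_1, F_h and F_w reduce to matrix multiplications when applied to these (channels × length) descriptors, so random matrices W1, Wh and Ww stand in for their weights here; the function name and the placement of the activation are illustrative assumptions, with σ taken as a ReLU following the patent text:

```python
import numpy as np

def attention_vectors(z_h, z_w, W1, Wh, Ww, act=lambda t: np.maximum(t, 0.0)):
    """Sketch of the splice-transform-split step.

    z_h: (Cr, H) pooled horizontal descriptor; z_w: (Cr, W) pooled
    vertical descriptor, with Cr = C/r. W1 (Cr x Cr), Wh and Ww (C x Cr)
    stand in for the 1x1 convolutions F_1, F_h, F_w; `act` plays the
    role of sigma (a ReLU, per the patent text).
    """
    f = W1 @ np.concatenate([z_h, z_w], axis=1)          # (Cr, H + W) intermediate map
    f_h, f_w = f[:, :z_h.shape[1]], f[:, z_h.shape[1]:]  # decompose f along the spatial dim
    g_h = act(Wh @ f_h)                                  # (C, H) horizontal attention vector
    g_w = act(Ww @ f_w)                                  # (C, W) vertical attention vector
    return g_h, g_w
```

Note that the split restores the H and W segments of the spliced axis, so each attention vector is tied to one spatial direction.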
Finally, g^h and g^w can be expanded, and the weighted feature vector is then obtained according to the following formula:
F_output = F_input × g^h × g^w
wherein F_input is the feature vector input by the residual module, and F_output is the weighted feature vector.
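The weighting step can be sketched as follows; NumPy broadcasting implements the "expansion" of g^h and g^w over the spatial axis each one lacks (the function name is illustrative):

```python
import numpy as np

def reweight(f_input, g_h, g_w):
    """Elementwise reweighting F_output = F_input x g^h x g^w.

    f_input: (C, H, W) feature map from the residual module;
    g_h: (C, H) horizontal attention; g_w: (C, W) vertical attention.
    Inserting singleton axes expands each attention vector to (C, H, W)
    via broadcasting before the elementwise product.
    """
    return f_input * g_h[:, :, None] * g_w[:, None, :]
```

Because both attention vectors are broadcast over the full map, each output element is scaled by one horizontal weight and one vertical weight, jointly encoding its position.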
In summary, the attention module of the convolutional neural network provided by the embodiments of the present application optimizes the conventional attention mechanism: it uses deformable convolution to elastically encode channel and spatial information simultaneously and fuses information in the horizontal and vertical directions, so the target position can be located more precisely and target detection accuracy can be improved in application scenarios such as detection, classification, and segmentation.
In another aspect, an embodiment of the present application further provides an attention mechanism of a convolutional neural network, including:
feeding the feature vectors input by the residual module to the first branch and the second branch; the first branch is configured to perform deformable convolution operation, channel attenuation operation and global pooling operation on the feature vectors in the horizontal direction, and the second branch is configured to perform deformable convolution operation, channel attenuation operation and global pooling operation on the feature vectors in the vertical direction;
splicing the output of the first branch and the output of the second branch to obtain a spliced vector;
transforming the spliced vector by using a convolution transformation function;
feeding the spliced vector after the transformation processing to a full connection layer;
and performing convolution operation on the input of the full connection layer in the horizontal direction and the vertical direction respectively to obtain the attention vector of the input feature vector in the horizontal direction and the attention vector of the input feature vector in the vertical direction.
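The steps of the mechanism above can be checked end to end with a shape-level NumPy sketch. Random matrices stand in for all learned weights, the deformable convolutions and channel attenuation of the two branches are abstracted into the (C/r, H, W) input, and the full connection layer is folded into the 1×1 transform; all names here are illustrative assumptions, not the patent's implementation:

```python
import numpy as np

def attention_shapes(C=8, H=6, W=5, r=2):
    """Shape-level walkthrough of the mechanism: directional pooling,
    splicing, 1x1 transform, and per-direction convolution + ReLU.
    Returns the shapes of the two attention vectors."""
    rng = np.random.default_rng(0)
    Cr = C // r                                       # channel attenuation coefficient
    x = rng.standard_normal((Cr, H, W))               # branch output before pooling
    z = np.concatenate([x.mean(axis=2), x.mean(axis=1)], axis=1)      # splice: (Cr, H + W)
    f = np.maximum(rng.standard_normal((Cr, Cr)) @ z, 0.0)            # 1x1 transform stand-in
    g_h = np.maximum(rng.standard_normal((C, Cr)) @ f[:, :H], 0.0)    # horizontal: (C, H)
    g_w = np.maximum(rng.standard_normal((C, Cr)) @ f[:, H:], 0.0)    # vertical: (C, W)
    return g_h.shape, g_w.shape
```

The two returned shapes, (C, H) and (C, W), match the C × H × 1 and C × 1 × W dimensions stated in the description.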
The attention mechanism of the embodiment of the present application is based on the same inventive concept as the attention module embodiment described above.
In addition, the embodiment of the present application further provides a convolutional neural network, which includes the attention module described in the above embodiment. The convolutional neural network can be trained for detection, classification, segmentation and the like, and can show high accuracy when applied to detection and identification of surrounding vehicles in an automatic driving application scene.
The present application provides experimental results based on a variety of different model structures, as shown in Tables 1 and 2 below:
Model | Params (M) | COCO mAP (%) |
---|---|---|
MobileNetV2 | 4.3 | 41.5 |
MobileNetV2+SE | 4.7 | 41.6 |
MobileNetV2+CBAM | 4.7 | 41.6 |
MobileNetV2+CA | 4.3 | 42.8 |
This application | 4.45 | 43.4 |
Table 1. Detection experiments
Model | Params (M) | Top-1 Acc (%) |
---|---|---|
MobileNetV2 | 3.5 | 72.3 |
MobileNetV2+SE | 3.89 | 73.5 |
MobileNetV2+CBAM | 3.89 | 73.6 |
MobileNetV2+CA | 3.95 | 74.3 |
This application | 4.02 | 74.8 |
Table 2. Classification experiments
In the experiments, SE Attention, CBAM (Convolutional Block Attention Module), CA (Coordinate Attention), and the attention module of the embodiment of the present application are each added on top of the lightweight model MobileNetV2, and detection and classification experiments are carried out respectively; in Tables 1 and 2, the second column gives the parameter count and the third column the performance value. The experimental results show that adding the attention module of the embodiment of the present application yields the best results and improves the accuracy of the network model while keeping the parameter count comparable; clearly, the attention module of the embodiments of the present application benefits target detection and classification more than SE, CBAM, and CA.
It can be seen from the above embodiments that the attention module, attention mechanism, and convolutional neural network provided by the present application address the limitation that conventional attention mechanisms operate uniformly over space and channels, and can improve the accuracy of the convolutional neural network.
It should be noted that: the sequence of the embodiments of the present application is only for description, and does not represent the advantages and disadvantages of the embodiments. And specific embodiments thereof have been described above. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.
The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments.
The above description is only exemplary of the present application and should not be taken as limiting the present application, as any modification, equivalent replacement, or improvement made within the spirit and principle of the present application should be included in the protection scope of the present application.
Claims (10)
1. An attention module for a convolutional neural network, comprising:
an attention vector generation unit configured to feed the feature vectors input by the residual module to the first branch and the second branch; wherein the first branch is configured to perform a deformable convolution operation, a channel attenuation operation and a global pooling operation on the feature vector in a horizontal direction, and the second branch is configured to perform the deformable convolution operation, the channel attenuation operation and the global pooling operation on the feature vector in a vertical direction;
the attention vector generation unit is further configured to perform splicing processing on the output of the first branch and the output of the second branch to obtain a spliced vector, and perform transformation processing on the spliced vector by using a convolution transformation function; feeding the spliced vectors after the transformation processing to a full-connection layer, and performing convolution operation on the input of the full-connection layer in the horizontal direction and the vertical direction respectively to obtain the attention vector of the input feature vector in the horizontal direction and the attention vector of the input feature vector in the vertical direction.
2. The attention module of claim 1, wherein the first branch comprises a first deformable convolutional layer, a first convolutional layer, and a first global pooling layer connected in sequence;
the output of the first deformable convolution layer is:
wherein p_0 represents each feature point in the input feature vector; y(p_x) represents the position of p_0 in the horizontal direction after the deformable convolution operation; R represents the convolution kernel, and p_n is an enumeration of the positions listed in R; Δp_x represents the offset in the horizontal direction.
3. The attention module of claim 2, wherein the data dimension of the input feature vector is C × H × W; the data dimension of the output of the first deformable convolution layer is C × (W + H);
the convolution kernel size of the first convolutional layer is 1 × 1, and the data dimension of the output of the first convolutional layer is C/r × (W + H);
the data dimension of the output of the first global pooling layer is C/r × H × 1.
5. The attention module of claim 1, wherein the second branch comprises a second deformable convolutional layer, a second convolutional layer, and a second global pooling layer connected in sequence;
the output of the second deformable convolution layer is:
wherein p_0 represents each feature point in the input feature vector; y(p_y) represents the position of p_0 in the vertical direction after the deformable convolution operation; R represents the convolution kernel, and p_n is an enumeration of the positions listed in R; Δp_y represents the offset in the vertical direction.
6. The attention module of claim 5, wherein the data dimension of the input feature vector is C × H × W; the output of the second deformable convolution layer has a data dimension of C × (W + H);
the convolution kernel size of the second convolutional layer is 1 × 1, and the data dimension of the output of the second convolutional layer is C/r × (W + H);
and the output of the second global pooling layer has a data dimension of C/r multiplied by 1 multiplied by W.
8. The attention module of claim 1, further comprising a weight assignment unit;
the weight distribution unit is configured to perform weight distribution on the input feature vector based on the attention vector in the horizontal direction and the attention vector in the vertical direction to obtain a weighted feature vector.
9. An attention mechanism for a convolutional neural network, comprising:
feeding the feature vectors input by the residual module to the first branch and the second branch; wherein the first branch is configured to perform a deformable convolution operation, a channel attenuation operation and a global pooling operation on the feature vector in a horizontal direction, and the second branch is configured to perform the deformable convolution operation, the channel attenuation operation and the global pooling operation on the feature vector in a vertical direction;
splicing the output of the first branch and the output of the second branch to obtain a spliced vector;
transforming the spliced vector by using a convolution transformation function;
feeding the spliced vector after the transformation processing to a full connection layer;
and performing convolution operation on the input of the full connection layer in the horizontal direction and the vertical direction respectively to obtain the attention vector of the input feature vector in the horizontal direction and the attention vector of the input feature vector in the vertical direction.
10. A convolutional neural network, comprising the attention module of any one of claims 1-8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110863925.9A CN113627590A (en) | 2021-07-29 | 2021-07-29 | Attention module and attention mechanism of convolutional neural network and convolutional neural network |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113627590A true CN113627590A (en) | 2021-11-09 |
Family
ID=78381553
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113627590A (en) |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108595590A (en) * | 2018-04-19 | 2018-09-28 | 中国科学院电子学研究所苏州研究院 | A kind of Chinese Text Categorization based on fusion attention model |
CN108734290A (en) * | 2018-05-16 | 2018-11-02 | 湖北工业大学 | It is a kind of based on the convolutional neural networks construction method of attention mechanism and application |
CN109993220A (en) * | 2019-03-23 | 2019-07-09 | 西安电子科技大学 | Multi-source Remote Sensing Images Classification method based on two-way attention fused neural network |
CN111832620A (en) * | 2020-06-11 | 2020-10-27 | 桂林电子科技大学 | Image emotion classification method based on double-attention multilayer feature fusion |
CN112580782A (en) * | 2020-12-14 | 2021-03-30 | 华东理工大学 | Channel enhancement-based double-attention generation countermeasure network and image generation method |
CN112651973A (en) * | 2020-12-14 | 2021-04-13 | 南京理工大学 | Semantic segmentation method based on cascade of feature pyramid attention and mixed attention |
CN112861978A (en) * | 2021-02-20 | 2021-05-28 | 齐齐哈尔大学 | Multi-branch feature fusion remote sensing scene image classification method based on attention mechanism |
WO2021115159A1 (en) * | 2019-12-09 | 2021-06-17 | 中兴通讯股份有限公司 | Character recognition network model training method, character recognition method, apparatuses, terminal, and computer storage medium therefor |
Non-Patent Citations (3)
Title |
---|
YUE WANG et al.: "Fusing Distinguish Degree Neural Networks for Relational Classification", 2018 11th International Symposium on Computational Intelligence and Design (ISCID), 9 December 2018 (2018-12-09) *
LI SHENGWU; ZHANG XUANDE: "Multi-domain convolutional neural network visual tracking based on self-attention mechanism", Journal of Computer Applications, no. 08, 31 December 2020 (2020-12-31) *
LEI PENGCHENG; LIU CONG; TANG JIANGANG; PENG DUNLU: "Hierarchical feature fusion attention network for image super-resolution reconstruction", Journal of Image and Graphics, no. 09, 16 September 2020 (2020-09-16) *
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication ||
SE01 | Entry into force of request for substantive examination ||