CN113627590A - Attention module and attention mechanism of convolutional neural network and convolutional neural network


Info

Publication number
CN113627590A
Authority
CN
China
Prior art keywords: attention, output, vector, branch, layer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110863925.9A
Other languages
Chinese (zh)
Inventor
李丰军
周剑光
陈志轩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Automotive Innovation Co Ltd
Original Assignee
China Automotive Innovation Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Automotive Innovation Co Ltd filed Critical China Automotive Innovation Co Ltd
Priority to CN202110863925.9A
Publication of CN113627590A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G06N 3/08 Learning methods


Abstract

The application relates to an attention module of a convolutional neural network, an attention mechanism, and a convolutional neural network. The attention module first uses deformable convolution to extract features in the horizontal and vertical directions separately, so that the subsequent encoding can more easily capture the position information of an object. The module then captures long-range dependencies along one spatial direction while retaining accurate position information along the other, so that information in both the vertical and the horizontal direction is preserved. An attention vector is obtained after a series of transformations and multiplied element-wise, as a weighting factor, with the original feature vector. In this way spatial attention and channel attention are fused, the problem that existing attention mechanisms cannot operate on space and channels in a unified manner is solved, and the precision of the convolutional neural network can be improved.

Description

Attention module and attention mechanism of convolutional neural network and convolutional neural network
Technical Field
The application relates to the technical field of deep learning, in particular to an attention module and an attention mechanism of a convolutional neural network and the convolutional neural network.
Background
In the industrialization of perception algorithms, deep learning (deep neural networks) is gradually evolving toward the goal of being both "good and fast", and deploying end-to-end neural network algorithms on a given vehicle-mounted chip is of great significance. More and more methods therefore focus on changing the neural network structure, continuously reducing the width and depth of the whole backbone network, which in turn causes losses of precision. For tasks with high precision requirements, such as vehicle detection in an autonomous driving scenario, such a precision drop can hardly be tolerated: for safety reasons, the detection rate for all vehicles visible to the naked eye must approach 100%.
At present, the convolutional neural network, as one kind of deep neural network, is widely applied to visual-perception-based target detection and recognition in autonomous driving scenarios, and research shows that introducing an attention mechanism into a convolutional neural network yields better results.
At present, schemes that introduce attention mechanisms into convolutional neural networks fall roughly into the following two categories:
the attention mechanism applies to the feature map spatial dimensions: human vision focuses on important areas in the image, ignoring unimportant parts of the image. Compared with the process of processing the whole image information, the image information of a certain area in the image is finely processed in the training process, the calculated amount and the training detection time are obviously reduced, more information of the specific area can be obtained in the aspect of image processing, and the generalization capability of the network model is enhanced.
Applying the attention mechanism to the channel dimension of the feature map: the most important part of a convolutional neural network is the convolution operation, in which a convolution kernel extracts image features in the spatial and channel dimensions. Applying an attention mechanism to the channel dimension uncovers the internal relations among the channels and can significantly improve the feature extraction performance of the convolutional neural network.
Among prior-art schemes that introduce an attention mechanism into a convolutional neural network, the patent with application number CN201910769868.0, for example, discloses an SSD object detection method based on an SE module, which belongs to the second category above. After acquiring a picture or video for object recognition, it replaces the first convolutional layer of the convolutional neural network ResNet18 with a 3×3 convolutional layer, adds an SE module to the first and second residual blocks of ResNet18 to form an SE-ResNet18 network structure, replaces the backbone network in the SSD object detection algorithm with the SE-ResNet18 structure to obtain a detection model, trains the detection model for small object detection, and detects small objects in the picture or video with the trained model to obtain a detection result. In this patent, the interdependencies between channels are efficiently constructed by simply squeezing each 2-dimensional feature map; however, it only re-weights the importance of each channel by modeling channel relationships and neglects position information, which is important for generating spatially selective attention maps. The regression accuracy of the position is therefore still deficient.
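For reference, the squeeze-and-excitation mechanism discussed above can be summarized in a short sketch. The following is a minimal, illustrative PyTorch rendition of a generic SE block — the class name and parameters are assumptions of this sketch, not taken from the cited patent — and it makes the limitation visible: the squeeze step collapses each 2-dimensional feature map to a single scalar, so all position information is lost:

    import torch
    import torch.nn as nn

    class SEBlock(nn.Module):
        """Generic squeeze-and-excitation block (illustrative sketch only)."""
        def __init__(self, channels, reduction=16):
            super().__init__()
            self.fc = nn.Sequential(
                nn.Linear(channels, channels // reduction),
                nn.ReLU(inplace=True),
                nn.Linear(channels // reduction, channels),
                nn.Sigmoid(),
            )

        def forward(self, x):
            n, c, _, _ = x.shape
            s = x.mean(dim=(2, 3))            # squeeze: each 2-D map -> one scalar
            w = self.fc(s).view(n, c, 1, 1)   # excitation: per-channel weights
            return x * w                      # re-weight channels; position info is gone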
As another example, the patent with application number 202010595050.4 discloses a real-time vehicle detection method for unmanned aerial vehicles based on a convolutional neural network, which first clusters out 9 anchor frames, builds a shallow neural network, adds an attention mechanism, and adds an adaptive tensor selection module, then trains and tests on an embedded device. The shallow network it constructs has few parameters, is suitable for running on an embedded device of an unmanned aerial vehicle such as the Jetson TX2, and meets real-time requirements. After feature fusion based on the feature pyramid network, an adaptive tensor selection module is introduced so that the network can select the most appropriate detection tensor according to the semantic information of the target, further improving detection accuracy. However, this patent adds a CBAM attention mechanism between the convolutional layers, in which spatial attention and channel attention operate separately from each other; this reduces the spatial correlation of the targets and makes further accuracy improvements difficult.
Disclosure of Invention
The embodiments of the present application provide an attention module of a convolutional neural network, an attention mechanism, and a convolutional neural network, which solve the problem that existing attention mechanisms cannot operate on space and channels in a unified manner and which can improve the precision of the convolutional neural network.
In one aspect, an attention module of a convolutional neural network is provided in an embodiment of the present application, including:
an attention vector generation unit configured to feed the feature vector input by the residual module to a first branch and a second branch; the first branch is configured to perform a deformable convolution operation, a channel attenuation operation and a global pooling operation on the feature vector in the horizontal direction, and the second branch is configured to perform the same operations in the vertical direction;
the attention vector generation unit is further configured to splice the output of the first branch and the output of the second branch to obtain a spliced vector, and to transform the spliced vector with a convolution transformation function; the transformed spliced vector is fed to a fully connected layer, and convolution operations are performed on the output of the fully connected layer in the horizontal and vertical directions respectively, yielding the attention vector of the input feature vector in the horizontal direction and the attention vector in the vertical direction.
Optionally, the first branch includes a first deformable convolution layer, a first convolution layer, and a first global pooling layer, which are connected in sequence;
the output of the first deformable convolution layer is:
$$y(p_x) = \sum_{p_n \in R} w(p_n) \cdot x(p_0 + p_n + \Delta p_x)$$
where $p_0$ denotes each feature point in the input feature vector; $y(p_x)$ denotes the output at $p_0$ after the deformable convolution operation in the horizontal direction; $R$ denotes the convolution kernel and $p_n$ enumerates the positions listed in $R$; $\Delta p_x$ denotes the offset in the horizontal direction; $w(p_n)$ denotes the convolution weight at $p_n$ and $x(\cdot)$ the input feature map.
Optionally, the data dimension of the input feature vector is C × H × W; the data dimension of the output of the first deformable convolution layer is C × (W + H);
the convolution kernel size of the first convolution layer is 1×1, and the data dimension of the output of the first convolution layer is C/r × (W + H);
the data dimension of the output of the first global pooling layer is C/r × H × 1.
Optionally, the output of the c/r-th channel in the output of the first branch is:
$$z^h_{c/r}(h) = \frac{1}{W} \sum_{0 \le i < W} x_{c/r}(h, i)$$
where $x_{c/r}(h, i)$ denotes the $i$-th feature point, at height $h$, in the $c/r$-th channel of the output of the first convolution layer.
Optionally, the second branch includes a second deformable convolution layer, a second convolution layer and a second global pooling layer, which are connected in sequence;
the output of the second deformable convolution layer is:
$$y(p_y) = \sum_{p_n \in R} w(p_n) \cdot x(p_0 + p_n + \Delta p_y)$$
where $p_0$ denotes each feature point in the input feature vector; $y(p_y)$ denotes the output at $p_0$ after the deformable convolution operation in the vertical direction; $R$ denotes the convolution kernel and $p_n$ enumerates the positions listed in $R$; $\Delta p_y$ denotes the offset in the vertical direction; $w(p_n)$ denotes the convolution weight at $p_n$ and $x(\cdot)$ the input feature map.
Optionally, the data dimension of the input feature vector is C × H × W; the data dimension of the output of the second deformable convolution layer is C × (W + H);
the convolution kernel size of the second convolution layer is 1×1, and the data dimension of the output of the second convolution layer is C/r × (W + H);
the data dimension of the output of the second global pooling layer is C/r × 1 × W.
Optionally, the output of the c/r-th channel in the output of the second branch is:
$$z^w_{c/r}(w) = \frac{1}{H} \sum_{0 \le j < H} x_{c/r}(j, w)$$
where $x_{c/r}(j, w)$ denotes the $j$-th feature point, at width $w$, in the $c/r$-th channel of the output of the second convolution layer.
Optionally, the attention module further comprises a weight distribution unit;
the weight distribution unit is configured to perform weight distribution on the input feature vector based on the attention vector in the horizontal direction and the attention vector in the vertical direction, to obtain a weighted feature vector.
In another aspect, an embodiment of the present application provides an attention mechanism of a convolutional neural network, including:
feeding the feature vector input by the residual module to a first branch and a second branch; the first branch is configured to perform a deformable convolution operation, a channel attenuation operation and a global pooling operation on the feature vector in the horizontal direction, and the second branch is configured to perform the same operations in the vertical direction;
splicing the output of the first branch and the output of the second branch to obtain a spliced vector;
transforming the spliced vector by using a convolution transformation function;
feeding the transformed spliced vector to a fully connected layer;
and performing convolution operations on the output of the fully connected layer in the horizontal direction and the vertical direction respectively to obtain the attention vector of the input feature vector in the horizontal direction and the attention vector of the input feature vector in the vertical direction.
In another aspect, embodiments of the present application provide a convolutional neural network, including the attention module provided in the above embodiments.
The attention module, the attention mechanism and the convolutional neural network of the convolutional neural network provided by the embodiment of the application have the following beneficial effects:
the attention module comprises an attention vector generation unit configured to feed the feature vectors input by the residual module to the first branch and the second branch; the first branch is configured to perform deformable convolution operation, channel attenuation operation and global pooling operation on the feature vectors in the horizontal direction, and the second branch is configured to perform deformable convolution operation, channel attenuation operation and global pooling operation on the feature vectors in the vertical direction; the attention vector generation unit is also configured to splice the output of the first branch and the output of the second branch to obtain a spliced vector, and the spliced vector is transformed by using a convolution transformation function; and feeding the spliced vectors subjected to the conversion processing to a full-connection layer, and performing convolution operation on the input of the full-connection layer in the horizontal direction and the vertical direction respectively to obtain the attention vector of the input feature vector in the horizontal direction and the attention vector of the input feature vector in the vertical direction. Therefore, the attention of the space and the attention of the channel can be fused through the design of the double branches, the problem that the operation of the existing attention mechanism is unified on the space and the channel is solved, the characteristic extraction effect can be enhanced, and the precision of the convolutional neural network can be improved.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed for describing the embodiments are briefly introduced below. It is apparent that the drawings in the following description show only some embodiments of the present application, and that those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic diagram of an attention module of a convolutional neural network provided in an embodiment of the present application;
fig. 2(a) is an example of a feature diagram in a horizontal direction provided by an embodiment of the present application;
fig. 2(b) is an example of a feature diagram in a vertical direction provided by an embodiment of the present application;
fig. 3 is a schematic structural diagram of an attention module according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described below clearly and completely with reference to the accompanying drawings. It is apparent that the described embodiments are only a part of the embodiments of the present application rather than all of them. All other embodiments obtained by a person skilled in the art from the embodiments given herein without creative effort shall fall within the protection scope of the present application.
It should be noted that the terms "first", "second", and the like in the description, claims and drawings of this application are used to distinguish between similar elements and not necessarily to describe a particular sequential or chronological order. It is to be understood that data so used are interchangeable under appropriate circumstances, so that the embodiments described herein can be practiced in sequences other than those illustrated or described. Furthermore, the terms "comprises", "comprising" and "having", and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article or apparatus.
In order to facilitate the following description of the present solution, a few basic concepts will be described first.
Deep neural network: one type of neural network belongs to one branch of machine learning.
Feature: a way of representing an image. Conventional methods represent an image with RGB three-channel pixels. For a computer to perform recognition better, redundant information in RGB needs to be filtered out and more semantic features extracted. Image features contain salient information in the image, such as contour edges and color.
Convolutional Neural Network (CNN): a feedforward neural network that is simple to train and has good generalization capability.
Convolution kernel: the core of a convolutional neural network, generally regarded as an information aggregator that aggregates spatial and channel-wise information over a local receptive field.
ReLU (Rectified Linear Unit): the rectified linear unit is an activation function commonly used in artificial neural networks, generally referring to the nonlinear functions represented by the ramp function and its variants.
Referring to fig. 1, fig. 1 is a schematic diagram of an attention module of a convolutional neural network according to an embodiment of the present disclosure, where the attention module includes an attention vector generation unit 101;
an attention vector generation unit 101 configured to feed the feature vector input by the residual module to a first branch and a second branch; the first branch is configured to perform a deformable convolution operation, a channel attenuation operation and a global pooling operation on the feature vector in the horizontal direction, and the second branch is configured to perform the same operations in the vertical direction;
the attention vector generation unit 101 is further configured to splice the output of the first branch and the output of the second branch to obtain a spliced vector, and to transform the spliced vector with a convolution transformation function; the transformed spliced vector is fed to a fully connected layer, and convolution operations are performed on the output of the fully connected layer in the horizontal and vertical directions respectively, yielding the attention vector of the input feature vector in the horizontal direction and the attention vector in the vertical direction.
The attention module of the convolutional neural network provided by the embodiment of the present application uses deformable convolution for feature extraction in the horizontal and vertical directions, which makes it easier to capture the position information of an object in the subsequent encoding. Specifically, the first branch performs the deformable convolution operation and the channel attenuation operation on the feature vector in the horizontal direction to obtain a horizontal feature map, as shown in fig. 2(a), which is an example of a horizontal feature map provided by an embodiment of the present application; the second branch performs the same operations in the vertical direction to obtain a vertical feature map, as shown in fig. 2(b), which is an example of a vertical feature map provided by an embodiment of the present application. That is, each branch performs deformable convolution in only one direction, which, compared with the conventional mode of performing deformable convolution in both directions at once, enhances the feature extraction effect and improves the detection and recognition accuracy of the convolutional neural network. In addition, through the dual-branch design, the horizontal and vertical feature maps are pooled separately, so the attention module can capture long-range dependencies along one spatial direction while retaining accurate position information along the other, and information in both directions is preserved. The attention vectors obtained after the subsequent transformations are then multiplied back onto the original feature vector as weighting factors. In this way spatial attention and channel attention are fused, the problem that existing attention mechanisms cannot operate on space and channels in a unified manner is solved, and the precision of the convolutional neural network can be improved.
In an alternative embodiment, as shown in fig. 1, the attention module further includes a weight assignment unit 102;
and a weight distribution unit 102 configured to perform weight distribution on the input feature vector based on the attention vector in the horizontal direction and the attention vector in the vertical direction, so as to obtain a weighted feature vector.
In a specific implementation manner, please refer to fig. 3, wherein fig. 3 is a schematic structural diagram of an attention module provided in an embodiment of the present application; the first branch comprises a first deformable convolution layer, a first convolution layer and a first global pooling layer which are connected in sequence; the output of the first deformable convolution layer is:
$$y(p_x) = \sum_{p_n \in R} w(p_n) \cdot x(p_0 + p_n + \Delta p_x)$$
where $p_0$ denotes each feature point in the input feature vector; $y(p_x)$ denotes the output at $p_0$ after the deformable convolution operation in the horizontal direction; $R$ denotes the convolution kernel and $p_n$ enumerates the positions listed in $R$; $\Delta p_x$ denotes the offset in the horizontal direction; $w(p_n)$ denotes the convolution weight at $p_n$ and $x(\cdot)$ the input feature map.
Correspondingly, the second branch comprises a second deformable convolution layer, a second convolution layer and a second global pooling layer which are connected in sequence; the output of the second deformable convolution layer is:
$$y(p_y) = \sum_{p_n \in R} w(p_n) \cdot x(p_0 + p_n + \Delta p_y)$$
where $p_0$ denotes each feature point in the input feature vector; $y(p_y)$ denotes the output at $p_0$ after the deformable convolution operation in the vertical direction; $R$ denotes the convolution kernel and $p_n$ enumerates the positions listed in $R$; $\Delta p_y$ denotes the offset in the vertical direction; $w(p_n)$ denotes the convolution weight at $p_n$ and $x(\cdot)$ the input feature map.
Specifically, the hyperparameters of the deformable convolution are set separately for the horizontal and vertical directions, and the convolution kernel size can be set to 3×3. The first branch and the second branch may have substantially the same structure. Taking the first branch as an example, a Network-in-Network style convolution is applied to the output of the first deformable convolution layer: the kernel size of the first convolution layer cascaded with the first deformable convolution layer may be set to 1×1, and similarly the kernel size of the second convolution layer cascaded with the second deformable convolution layer is also set to 1×1; that is, the 1×1 convolutions implement the channel attenuation. In this way, by adding offsets in the x and y directions to the deformable convolution kernel and pairing it with the cascaded 1×1 convolution, the feature map after deformable convolution is obtained. To reduce model complexity and computational overhead, a suitable reduction ratio r (e.g., 32) is typically used to reduce the number of channels of the original features.
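To make the horizontal-only deformable convolution concrete, the following is a minimal sketch built on torchvision's deform_conv2d; the class name, the offset-predictor design, and the assumption that offsets are interleaved as (dy, dx) pairs are ours, and this is not the patent's reference implementation:

    import torch
    import torch.nn as nn
    from torchvision.ops import deform_conv2d

    class HorizontalDeformConv(nn.Module):
        """3x3 deformable convolution whose sampling offsets are restricted
        to the horizontal (x) direction, as in the patent's first branch."""
        def __init__(self, channels, kernel_size=3):
            super().__init__()
            self.k = kernel_size
            self.weight = nn.Parameter(
                torch.randn(channels, channels, kernel_size, kernel_size) * 0.01)
            # predicts 2 offsets (dy, dx) per kernel position and pixel
            self.offset_conv = nn.Conv2d(
                channels, 2 * kernel_size * kernel_size,
                kernel_size, padding=kernel_size // 2)

        def forward(self, x):
            n, _, h, w = x.shape
            off = self.offset_conv(x).view(n, self.k * self.k, 2, h, w)
            # keep only the horizontal component; zero the vertical one
            off = torch.stack(
                [torch.zeros_like(off[:, :, 0]), off[:, :, 1]], dim=2)
            offset = off.view(n, 2 * self.k * self.k, h, w)
            return deform_conv2d(x, offset, self.weight, padding=self.k // 2)

A vertical-only variant for the second branch would zero the horizontal offset components instead.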
As shown in fig. 3, the data dimension of the feature vector output by the residual block is C × H × W, where C denotes channels, H height, and W width. A two-way deformable convolution operation is performed on the feature vector C × H × W, and the output data dimension of both the first and the second deformable convolution layer is C × (W + H). A 1×1 convolution is then applied on each branch, so the output data dimension of the first and second convolution layers is C/r × (W + H). Global pooling is then performed in the H and W directions respectively, i.e. each channel is encoded along the horizontal and vertical coordinates with a pooling kernel of size (H, 1) or (1, W) respectively. Accordingly, the data dimension of the output of the first global pooling layer is C/r × H × 1, and that of the second global pooling layer is C/r × 1 × W. Finally, the output of the c/r-th channel in the output of the first branch (i.e., the output of the first global pooling layer) can be expressed as:
$$z^h_{c/r}(h) = \frac{1}{W} \sum_{0 \le i < W} x_{c/r}(h, i)$$
where $x_{c/r}(h, i)$ denotes the $i$-th feature point, at height $h$, in the $c/r$-th channel of the output of the first convolution layer;
the output of the c/r-th channel in the output of the second branch (i.e., the output of the second global pooling layer) can be represented by the following expression:
$$z^w_{c/r}(w) = \frac{1}{H} \sum_{0 \le j < H} x_{c/r}(j, w)$$
where $x_{c/r}(j, w)$ denotes the $j$-th feature point, at width $w$, in the $c/r$-th channel of the output of the second convolution layer.
These two transformations aggregate features along the two spatial directions respectively, yielding a pair of direction-aware feature maps. Combined with the preceding deformable convolution, they capture long-range dependencies along one spatial direction while preserving accurate position information along the other, which helps the network locate objects of interest more accurately.
Furthermore, the output ends of the first and second global pooling layers are connected to a splicing transformation layer, which splices the output of the first branch and the output of the second branch into a spliced vector and transforms it with a convolution transformation function. In this way, the features in the horizontal and vertical directions are well captured and the position information can be accurately located. To make use of the generated representations, the present application applies a representation transformation to the encoded information obtained from the two branches. This transformation should be as simple as possible for driving or wearable application scenarios; second, it should make full use of the captured position information so that the region of interest can be accurately captured; finally, it should also be able to reflect differences between the horizontal and vertical features in real time.
The result of the splicing transformation layer can be expressed as:
$$f = F_1([z^h, z^w])$$
where $[\cdot, \cdot]$ denotes the concatenation operation along the spatial dimension; $F_1$ is a 1×1 convolution transformation function; $f$ is the intermediate feature map encoding spatial information in both the horizontal and vertical directions.
Correspondingly, the output of the splicing transformation layer keeps the current C/r channel attenuation coefficient, with data dimension C/r × 1 × (W + H). The output of the splicing transformation layer is connected to the fully connected layer, which fuses and extracts information in the horizontal and vertical directions; its output also keeps the C/r channel attenuation coefficient, so the data dimension output by the fully connected layer is C/r × 1 × (W + H). The outputs of the fully connected layer are fed to a third convolution layer and a fourth convolution layer respectively for convolution, where the data dimension output by the third convolution layer is C × H × 1 and that of the fourth convolution layer is C × 1 × W. Finally, the attention vector of the input feature vector in the horizontal direction is obtained through a first ReLU layer, and the attention vector in the vertical direction through a second ReLU layer;
the above-described attention vector in the horizontal direction can be expressed by the following expression:
gh=σ(Fh(fh))
the above-described attention vector in the vertical direction can be expressed by the following expression:
gw=σ(Fw(fw))
wherein f ishAnd fwDecomposed into 2 along the spatial dimension for fIndividual tensors, fh∈RC/r×H,fw∈RC/r×W;FhAnd FwIs a convolution transformation function; σ is the ReLU activation function.
Finally, $g^h$ and $g^w$ can be expanded, and the weighted feature vector is obtained according to the following formula:
$$F_{output} = F_{input} \times g^h \times g^w$$
where $F_{input}$ is the feature vector input by the residual module and $F_{output}$ is the weighted feature vector.
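Putting the above steps together, a hedged end-to-end sketch of the module's data flow might look as follows; ordinary 3×3 convolutions stand in here for the direction-specific deformable convolutions (see the deformable sketch above), the fully connected fusion is realized as a 1×1 convolution, and all layer names and the mid-channel floor are assumptions of this sketch rather than the patent's reference implementation:

    import torch
    import torch.nn as nn

    class DualBranchCoordAttention(nn.Module):
        """Sketch of the described attention module: two direction-specific
        branches, spliced and transformed into per-direction attention gates."""
        def __init__(self, channels, reduction=32):
            super().__init__()
            mid = max(channels // reduction, 8)            # channel attenuation C -> C/r
            # stand-ins for the horizontal / vertical deformable convolutions
            self.deform_h = nn.Conv2d(channels, channels, 3, padding=1)
            self.deform_w = nn.Conv2d(channels, channels, 3, padding=1)
            self.reduce_h = nn.Conv2d(channels, mid, 1)    # first 1x1 convolution
            self.reduce_w = nn.Conv2d(channels, mid, 1)    # second 1x1 convolution
            self.pool_h = nn.AdaptiveAvgPool2d((None, 1))  # (H, 1) pooling
            self.pool_w = nn.AdaptiveAvgPool2d((1, None))  # (1, W) pooling
            self.transform = nn.Conv2d(mid, mid, 1)        # F1 on the spliced vector
            self.fc = nn.Conv2d(mid, mid, 1)               # fully connected fusion
            self.conv_h = nn.Conv2d(mid, channels, 1)      # third conv -> C x H x 1
            self.conv_w = nn.Conv2d(mid, channels, 1)      # fourth conv -> C x 1 x W
            self.act = nn.ReLU()                           # the patent's ReLU gates

        def forward(self, x):
            n, c, h, w = x.shape
            zh = self.pool_h(self.reduce_h(self.deform_h(x)))   # (N, mid, H, 1)
            zw = self.pool_w(self.reduce_w(self.deform_w(x)))   # (N, mid, 1, W)
            # splice along the spatial dimension: (N, mid, 1, H+W)
            f = torch.cat([zh.permute(0, 1, 3, 2), zw], dim=3)
            f = self.fc(self.transform(f))
            fh, fw = f.split([h, w], dim=3)
            gh = self.act(self.conv_h(fh.permute(0, 1, 3, 2)))  # (N, C, H, 1)
            gw = self.act(self.conv_w(fw))                      # (N, C, 1, W)
            return x * gh * gw        # F_output = F_input x g^h x g^w

For example, DualBranchCoordAttention(64)(torch.randn(1, 64, 32, 32)) returns a re-weighted tensor of the same shape, matching the formula above through broadcasting.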
In summary, the attention module of the convolutional neural network provided by the embodiment of the present application optimizes the conventional attention mechanism pipeline: it elastically encodes channels and space simultaneously with deformable convolution and fuses information from the horizontal and vertical directions, so that target positions are located more precisely, and target detection accuracy can be improved in application scenarios such as detection, classification and segmentation.
In another aspect, an embodiment of the present application further provides an attention mechanism of a convolutional neural network, including:
feeding the feature vector input by the residual module to a first branch and a second branch; the first branch is configured to perform a deformable convolution operation, a channel attenuation operation and a global pooling operation on the feature vector in the horizontal direction, and the second branch is configured to perform the same operations in the vertical direction;
splicing the output of the first branch and the output of the second branch to obtain a spliced vector;
transforming the spliced vector by using a convolution transformation function;
feeding the transformed spliced vector to a fully connected layer;
and performing convolution operations on the output of the fully connected layer in the horizontal direction and the vertical direction respectively to obtain the attention vector of the input feature vector in the horizontal direction and the attention vector of the input feature vector in the vertical direction.
The attention mechanism of this embodiment is based on the same inventive concept as the attention module embodiment above.
In addition, an embodiment of the present application further provides a convolutional neural network comprising the attention module described in the above embodiments. The convolutional neural network can be trained for detection, classification, segmentation and the like, and shows high accuracy when applied to the detection and recognition of surrounding vehicles in autonomous driving scenarios.
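As a usage illustration only — the residual body below is a generic stand-in, and DualBranchCoordAttention refers to the sketch above — the module could be inserted after a residual body, in line with the data flow of fig. 3:

    import torch.nn as nn
    import torch.nn.functional as F

    class AttentionResidualBlock(nn.Module):
        """Residual block whose body output is re-weighted by the attention
        module before the skip connection is added."""
        def __init__(self, channels):
            super().__init__()
            self.body = nn.Sequential(   # generic stand-in residual body
                nn.Conv2d(channels, channels, 3, padding=1),
                nn.BatchNorm2d(channels),
                nn.ReLU(inplace=True),
                nn.Conv2d(channels, channels, 3, padding=1),
                nn.BatchNorm2d(channels),
            )
            self.attn = DualBranchCoordAttention(channels)  # sketched above

        def forward(self, x):
            return F.relu(x + self.attn(self.body(x)))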
The present application provides results of experiments based on a variety of different model structures, as shown in tables 1 and 2 below:
Model              Params (M)   COCO mAP (%)
Mobilenetv2        4.3          41.5
Mobilenetv2+SE     4.7          41.6
Mobilenetv2+CBAM   4.7          41.6
Mobilenetv2+CA     4.3          42.8
This application   4.45         43.4

Table 1: Detection experiments
Model              Params (M)   Top-1 Acc (%)
Mobilenetv2        3.5          72.3
Mobilenetv2+SE     3.89         73.5
Mobilenetv2+CBAM   3.89         73.6
Mobilenetv2+CA     3.95         74.3
This application   4.02         74.8

Table 2: Classification experiments
In the experiments, SE attention, CBAM (Convolutional Block Attention Module), CA (Coordinate Attention) and the attention module of the embodiment of the present application were each added on top of the lightweight model Mobilenetv2, and detection and classification experiments were carried out respectively; the second column of Tables 1 and 2 gives the parameter count and the third column the performance value. The results show that adding the attention module of the embodiment of the present application yields the best results and improves the accuracy of the network model while keeping the parameter count in check; clearly, the attention module of the embodiments of the present application benefits the detection and classification of targets more than SE, CBAM and CA do.
It can be seen from the above embodiments that the attention module, the attention mechanism and the convolutional neural network provided by the present application solve the problem that existing attention mechanisms cannot operate on space and channels in a unified manner, and can improve the precision of the convolutional neural network.
It should be noted that the order of the embodiments of the present application is for description only and does not imply that one embodiment is preferable to another. Specific embodiments have been described above; other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results; in some embodiments, multitasking and parallel processing may also be possible or advantageous.
The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments.
The above description is only exemplary of the present application and should not be taken as limiting the present application, as any modification, equivalent replacement, or improvement made within the spirit and principle of the present application should be included in the protection scope of the present application.

Claims (10)

1. An attention module for a convolutional neural network, comprising:
an attention vector generation unit configured to feed the feature vector input by the residual module to a first branch and a second branch; wherein the first branch is configured to perform a deformable convolution operation, a channel attenuation operation and a global pooling operation on the feature vector in a horizontal direction, and the second branch is configured to perform the deformable convolution operation, the channel attenuation operation and the global pooling operation on the feature vector in a vertical direction;
the attention vector generation unit is further configured to splice the output of the first branch and the output of the second branch to obtain a spliced vector, and to transform the spliced vector by using a convolution transformation function; and to feed the transformed spliced vector to a fully connected layer and perform convolution operations on the output of the fully connected layer in the horizontal direction and the vertical direction respectively, to obtain the attention vector of the input feature vector in the horizontal direction and the attention vector of the input feature vector in the vertical direction.
2. The attention module of claim 1, wherein the first leg comprises a first deformable convolutional layer, a first convolutional layer, and a first global pooling layer connected in sequence;
the output of the first deformable convolution layer is:
$$y(p_x) = \sum_{p_n \in R} w(p_n) \cdot x(p_0 + p_n + \Delta p_x)$$
where $p_0$ denotes each feature point in the input feature vector; $y(p_x)$ denotes the output at $p_0$ after the deformable convolution operation in the horizontal direction; $R$ denotes the convolution kernel and $p_n$ enumerates the positions listed in $R$; $\Delta p_x$ denotes the offset in the horizontal direction; $w(p_n)$ denotes the convolution weight at $p_n$ and $x(\cdot)$ the input feature map.
3. The attention module of claim 2, wherein the data dimension of the input feature vector is C × H × W; the data dimension of the output of the first deformable convolution layer is C × (W + H);
the convolution kernel size of the first convolution layer is 1×1, and the data dimension of the output of the first convolution layer is C/r × (W + H);
the data dimension of the output of the first global pooling layer is C/r × H × 1.
4. The attention module of claim 3, wherein the output of the c/r-th channel in the output of the first branch is:
$$z^h_{c/r}(h) = \frac{1}{W} \sum_{0 \le i < W} x_{c/r}(h, i)$$
where $x_{c/r}(h, i)$ denotes the $i$-th feature point, at height $h$, in the $c/r$-th channel of the output of the first convolution layer.
5. The attention module of claim 1, wherein the second leg comprises a second deformable convolutional layer, a second convolutional layer, and a second global pooling layer connected in sequence;
the output of the second deformable convolution layer is:
$$y(p_y) = \sum_{p_n \in R} w(p_n) \cdot x(p_0 + p_n + \Delta p_y)$$
where $p_0$ denotes each feature point in the input feature vector; $y(p_y)$ denotes the output at $p_0$ after the deformable convolution operation in the vertical direction; $R$ denotes the convolution kernel and $p_n$ enumerates the positions listed in $R$; $\Delta p_y$ denotes the offset in the vertical direction; $w(p_n)$ denotes the convolution weight at $p_n$ and $x(\cdot)$ the input feature map.
6. The attention module of claim 5, wherein the data dimension of the input feature vector is C × H × W; the data dimension of the output of the second deformable convolution layer is C × (W + H);
the convolution kernel size of the second convolution layer is 1×1, and the data dimension of the output of the second convolution layer is C/r × (W + H);
the data dimension of the output of the second global pooling layer is C/r × 1 × W.
7. The attention module of claim 6, wherein the output of the c/r-th channel in the output of the second branch is:
$$z^w_{c/r}(w) = \frac{1}{H} \sum_{0 \le j < H} x_{c/r}(j, w)$$
where $x_{c/r}(j, w)$ denotes the $j$-th feature point, at width $w$, in the $c/r$-th channel of the output of the second convolution layer.
8. The attention module of claim 1, further comprising a weight assignment unit;
the weight distribution unit is configured to perform weight distribution on the input feature vector based on the attention vector in the horizontal direction and the attention vector in the vertical direction to obtain a weighted feature vector.
9. An attention mechanism for a convolutional neural network, comprising:
feeding the feature vector input by the residual module to a first branch and a second branch; wherein the first branch is configured to perform a deformable convolution operation, a channel attenuation operation and a global pooling operation on the feature vector in a horizontal direction, and the second branch is configured to perform the deformable convolution operation, the channel attenuation operation and the global pooling operation on the feature vector in a vertical direction;
splicing the output of the first branch and the output of the second branch to obtain a spliced vector;
transforming the spliced vector by using a convolution transformation function;
feeding the transformed spliced vector to a fully connected layer;
and performing convolution operations on the output of the fully connected layer in the horizontal direction and the vertical direction respectively to obtain the attention vector of the input feature vector in the horizontal direction and the attention vector of the input feature vector in the vertical direction.
10. A convolutional neural network, comprising the attention module of any one of claims 1-8.
CN202110863925.9A 2021-07-29 2021-07-29 Attention module and attention mechanism of convolutional neural network and convolutional neural network Pending CN113627590A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110863925.9A CN113627590A (en) 2021-07-29 2021-07-29 Attention module and attention mechanism of convolutional neural network and convolutional neural network


Publications (1)

Publication Number Publication Date
CN113627590A true CN113627590A (en) 2021-11-09

Family

ID=78381553

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110863925.9A Pending CN113627590A (en) 2021-07-29 2021-07-29 Attention module and attention mechanism of convolutional neural network and convolutional neural network

Country Status (1)

Country Link
CN (1) CN113627590A (en)


Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108595590A (en) * 2018-04-19 2018-09-28 中国科学院电子学研究所苏州研究院 A kind of Chinese Text Categorization based on fusion attention model
CN108734290A (en) * 2018-05-16 2018-11-02 湖北工业大学 It is a kind of based on the convolutional neural networks construction method of attention mechanism and application
CN109993220A (en) * 2019-03-23 2019-07-09 西安电子科技大学 Multi-source Remote Sensing Images Classification method based on two-way attention fused neural network
WO2021115159A1 (en) * 2019-12-09 2021-06-17 中兴通讯股份有限公司 Character recognition network model training method, character recognition method, apparatuses, terminal, and computer storage medium therefor
CN111832620A (en) * 2020-06-11 2020-10-27 桂林电子科技大学 Image emotion classification method based on double-attention multilayer feature fusion
CN112580782A (en) * 2020-12-14 2021-03-30 华东理工大学 Channel enhancement-based double-attention generation countermeasure network and image generation method
CN112651973A (en) * 2020-12-14 2021-04-13 南京理工大学 Semantic segmentation method based on cascade of feature pyramid attention and mixed attention
CN112861978A (en) * 2021-02-20 2021-05-28 齐齐哈尔大学 Multi-branch feature fusion remote sensing scene image classification method based on attention mechanism

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
YUE WANG et al.: "Fusing Distinguish Degree Neural Networks for Relational Classification", 2018 11th International Symposium on Computational Intelligence and Design (ISCID), 9 December 2018 *
LI SHENGWU; ZHANG XUANDE: "Visual tracking with multi-domain convolutional neural networks based on the self-attention mechanism" (基于自注意力机制的多域卷积神经网络的视觉追踪), Computer Applications, no. 08, 31 December 2020 *
LEI PENGCHENG; LIU CONG; TANG JIANGANG; PENG DUNLU: "Hierarchical feature fusion attention network for image super-resolution reconstruction" (分层特征融合注意力网络图像超分辨率重建), Journal of Image and Graphics, no. 09, 16 September 2020 *

Similar Documents

Publication Title
CN111339903B (en) Multi-person human body posture estimation method
CN111325111A (en) Pedestrian re-identification method integrating inverse attention and multi-scale deep supervision
CN112200111A (en) Global and local feature fused occlusion robust pedestrian re-identification method
CN110619638A (en) Multi-mode fusion significance detection method based on convolution block attention module
CN114565655B (en) Depth estimation method and device based on pyramid segmentation attention
CN113239820B (en) Pedestrian attribute identification method and system based on attribute positioning and association
US20230334893A1 (en) Method for optimizing human body posture recognition model, device and computer-readable storage medium
CN112836646A (en) Video pedestrian re-identification method based on channel attention mechanism and application
CN113743544A (en) Cross-modal neural network construction method, pedestrian retrieval method and system
CN114529982A (en) Lightweight human body posture estimation method and system based on stream attention
CN111582154A (en) Pedestrian re-identification method based on multitask skeleton posture division component
CN111242091A (en) Age identification model training method and device and electronic equipment
CN116092185A (en) Depth video behavior recognition method and system based on multi-view feature interaction fusion
CN114332919A (en) Pedestrian detection method and device based on multi-spatial relationship perception and terminal equipment
CN117115616A (en) Real-time low-illumination image target detection method based on convolutional neural network
CN115240121B (en) Joint modeling method and device for enhancing local features of pedestrians
CN113627590A (en) Attention module and attention mechanism of convolutional neural network and convolutional neural network
CN111428612A (en) Pedestrian re-identification method, terminal, device and storage medium
CN116311504A (en) Small sample behavior recognition method, system and equipment
CN115661515A (en) Three-dimensional image classifier and classification method based on hierarchical feature extraction and structure perception
CN115641581A (en) Target detection method, electronic device, and storage medium
CN114663917A (en) Multi-view-angle-based multi-person three-dimensional human body pose estimation method and device
CN112966546A (en) Embedded attitude estimation method based on unmanned aerial vehicle scout image
CN114998653B (en) ViT network-based small sample remote sensing image classification method, medium and equipment
CN117636074B (en) Multi-mode image classification method and system based on feature interaction fusion

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination