CN115222998B - Image classification method - Google Patents

Image classification method


Publication number
CN115222998B
Authority
CN
China
Prior art keywords
attention
layer
feature map
dimension
image
Prior art date
Legal status
Active
Application number
CN202211120458.1A
Other languages
Chinese (zh)
Other versions
CN115222998A (en)
Inventor
颜成钢
殷俊
颜拥
王洪波
胡冀
熊剑平
李亮
郑博仑
林聚财
孔书晗
王亚运
孙垚棋
金恒
朱尊杰
高宇涵
殷海兵
王鸿奎
陈楚翘
刘一秀
李文超
王廷宇
张勇东
张继勇
Current Assignee
Hangzhou Dianzi University
Zhejiang Dahua Technology Co Ltd
Original Assignee
Hangzhou Dianzi University
Zhejiang Dahua Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Hangzhou Dianzi University, Zhejiang Dahua Technology Co Ltd filed Critical Hangzhou Dianzi University
Priority to CN202211120458.1A priority Critical patent/CN115222998B/en
Publication of CN115222998A publication Critical patent/CN115222998A/en
Application granted granted Critical
Publication of CN115222998B publication Critical patent/CN115222998B/en

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 - Arrangements for image or video recognition or understanding
    • G06V 10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/764 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/08 - Learning methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 - Arrangements for image or video recognition or understanding
    • G06V 10/20 - Image preprocessing
    • G06V 10/30 - Noise filtering
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 - Arrangements for image or video recognition or understanding
    • G06V 10/40 - Extraction of image or video features
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 - Arrangements for image or video recognition or understanding
    • G06V 10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 - Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/774 - Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 - Arrangements for image or video recognition or understanding
    • G06V 10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an image classification method. First, a channel-dimension attention network and a spatial-dimension multi-scale self-attention network are constructed. The preprocessed input image is then fed into the channel-dimension attention network to generate an attention feature map based on channel dimensions. This channel-dimension attention feature map is input into the spatial-dimension multi-scale self-attention network, generating a multi-scale self-attention feature map based on the spatial dimensions of the feature map. Finally, the resulting multi-dimensional, multi-scale attention feature map is input into a classifier unit, which converts the vector output by the model into a probability representation, completing image classification. The invention designs a novel multi-scale self-attention mechanism that uses a series of depthwise separable convolution operations to generate a local feature map and a regional feature map highly correlated with the feature information, which not only strengthens the fine-grained feature extraction capability of the self-attention mechanism but also efficiently extracts effective global information.

Description

Image classification method
Technical Field
The invention belongs to the technical field of image classification, and in particular relates to an image classification method based on a self-attention mechanism with multi-dimensional, multi-scale feature representation.
Background
In recent years, transformers have been widely used in the field of NLP by virtue of their powerful context modeling capabilities. Researchers in the field of computer vision also have a jump to try to introduce the core designed in the transform, i.e., the self-attention mechanism, into the visual task. ViT, the first model to introduce a Transformer into the CV domain, first demonstrated that the Self-Attention mechanism in the Transformer was completely relied on to achieve the most advanced performance in image classification. Currently, an image classification method based on a self-attention mechanism has become a mainstream method in current research. However, since ViT inherits the entire architecture of Transformer, and the Transformer was originally proposed in the machine translation task, its design is more suitable for the task in the NLP domain, so there are the following bottlenecks in the development of ViT.
(1) ViT inherits the columnar structure of the Transformer. It takes coarse image patches as input and can only output a low-resolution feature map, which is expensive in both computation and storage. Researchers currently alleviate this problem by introducing feature pyramids. (2) The Transformer models the relationships between tokenized image patches (tokens) as a sequence. In image classification the input is usually a 2D image whose pixels have a strong spatial structure; the ViT approach destroys the structural information of the two-dimensional image and hinders context modeling of feature maps at different scales. Some researchers have attempted to solve this problem by introducing convolution operations, overlapping pooling, zero padding, and similar techniques. (3) The self-attention mechanism operates at global scope. Self-attention computes the response at a position in the sequence by attending to global information and taking a weighted average in the projection space. It does not consider the degree of attention among local fine-grained features and lacks the ability to perceive local feature information. Building on the latest backbone techniques proposed for the first two bottlenecks, researchers have designed various multi-scale Vision Transformers, such as the Pyramid Vision Transformer (Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, and Ling Shao. 2021. Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction Without Convolutions. ICCV 2021, 568-578). Most of them fuse self-attention computations at different scales by creating additional tokens outside the self-attention mechanism, which improves model performance but complicates implementation and deployment.
Beyond these widely noted bottlenecks, we find that the self-attention mechanism compresses the channel information of the feature map and computes attention only in the spatial dimension. This single mode of attention lacks representativeness when expressing the degree of importance between features, and the model may also pick up noise during learning.
Disclosure of Invention
The invention aims to provide an image classification method that addresses these bottlenecks, thereby exploring the application of the self-attention mechanism to image classification tasks.
The technical solution adopted by the invention to solve this technical problem is as follows.
Step 1, constructing a channel dimension attention network and a space dimension multi-scale self-attention network based on a Pyramid Vision Transformer (PVT) architecture.
The Pyramid Vision Transformer (PVT) architecture contains four stages (Stage) in total, and the resolution of the input is gradually reduced by the embedding layer (Patch Embedding). In each stage, a channel-dimension attention network and a spatial-dimension multi-scale self-attention network are constructed respectively.
Step 2: Preprocess the input image, feed it into the channel-dimension attention network, and generate an attention feature map based on channel dimensions.
Step 3: Input the channel-dimension attention feature map into the spatial-dimension multi-scale self-attention network, generating a multi-scale self-attention feature map based on the spatial dimensions of the feature map.
Step 4: Repeat steps 2 and 3 until the fourth stage; input the finally generated multi-dimensional, multi-scale attention feature map into a classifier unit, convert the vector output by the model into a probability representation, and complete image classification.
Further, in step 1, the first layer of the channel-dimension attention network is two parallel pooling layers, comprising max pooling and average pooling. The second layer is a shared parameter layer consisting of a multi-layer perceptron with one hidden layer. The third layer is the element-wise summation of the max-pooled feature map and the average-pooled feature map. The fourth layer is a sigmoid layer. The fifth layer is a softmax layer applied to the original feature map. The sixth layer is the element-wise summation of the two weight matrices output by the fourth and fifth layers. The seventh layer performs matrix multiplication between the weight matrix output by the sixth layer and the original feature map, finally outputting the channel-dimension feature map.
Further, in the spatial-dimension multi-scale self-attention network described in step 1, the first layer is two parallel convolutions with kernels of 7 × 7 and 3 × 3 and strides of 7 and 1 respectively. The second layer is a layer normalization operation. The third layer is a convolution with a 3 × 3 kernel and stride 2. The fourth layer is the self-attention calculation, comprising matrix multiplication, softmax normalization, and multiplication of the weight matrix with the original matrix, finally outputting the spatial-dimension attention feature map.
In step 2, a convolution operation with zero padding is applied to the input image to generate an image embedding vector. To realize the image classification task, a classification vector CLS is prepended to the image embedding vector and used as the input of the channel-dimension attention network; the resulting intermediate feature map is fed into the channel-dimension attention network, generating an attention feature map based on the image channel dimensions.
In step 3, the channel-dimension attention feature map is dimension-reset into a two-dimensional local feature map, which serves as the input of the spatial-dimension multi-scale self-attention network. Exploiting the hierarchical structure of convolution kernels, and through depthwise separable convolutions with different kernel sizes and strides, the self-attention network is split into two routes, local-feature calculation and regional-feature calculation, generating a local feature map and a regional feature map whose semantic features are highly correlated. The local context information serves as the Query, and the regional context information serves as the Key and Value, from which the final spatial-dimension multi-scale self-attention feature map is calculated.
In step 4, the CLS classification vector is updated repeatedly across the four stages, extracting multi-dimensional, multi-scale high-level semantic features from shallow to deep. In the last stage, the final CLS vector is input into the feed-forward neural network layer FNN of the classifier unit to generate a num × 1 vector, where num is the number of image classes in the training set; finally, class probability calculation is completed through the softmax layer of the classifier unit, completing the final classification.
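For illustration, the following PyTorch-style sketch wires the four stages to a classifier head as described in steps 1 to 4. All names, channel widths, and the use of global average pooling in place of the CLS-token bookkeeping are assumptions for exposition, not the patented implementation; the two attention modules are sketched later in the text.

import torch
import torch.nn as nn

class Stage(nn.Module):
    """One PVT-style stage: overlapping patch embedding, then channel
    attention and spatial multi-scale self-attention (steps 2 and 3)."""
    def __init__(self, in_ch, out_ch, s):
        super().__init__()
        # Zero-padded embedding convolution: kernel 2s-1, stride s, padding s-1.
        self.embed = nn.Conv2d(in_ch, out_ch, 2 * s - 1, stride=s, padding=s - 1)
        # Stand-ins; the two attention modules are sketched later in this text.
        self.channel_attn = nn.Identity()
        self.spatial_attn = nn.Identity()

    def forward(self, x):
        x = self.embed(x)          # resolution drops stage by stage
        x = self.channel_attn(x)   # step 2: channel-dimension attention
        x = self.spatial_attn(x)   # step 3: spatial multi-scale self-attention
        return x

# Four stages and a classifier head (step 4). Channel widths are illustrative,
# and global average pooling stands in for the CLS-token bookkeeping.
backbone = nn.Sequential(Stage(3, 64, 4), Stage(64, 128, 2),
                         Stage(128, 256, 2), Stage(256, 512, 2))
head = nn.Linear(512, 1000)

feats = backbone(torch.randn(1, 3, 224, 224)).mean(dim=(2, 3))
probs = torch.softmax(head(feats), dim=-1)   # probability representation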
Compared with the prior art, the beneficial effects of the invention are as follows.
The method introduces channel attention on top of the self-attention mechanism and establishes a multi-dimensional representation learning space. Compared with the traditional self-attention method, it extracts features along both the channel and spatial dimensions of the image more efficiently in the image classification task, so that the model learns more abstract high-level feature representations, and noise disturbance during model learning is reduced.
In the design of the self-attention network, unlike the traditional method of computing self-attention at global scope, the invention designs a novel multi-scale self-attention mechanism that uses a series of depthwise separable convolution operations to generate a local feature map and a regional feature map highly correlated with the feature information, which not only strengthens the fine-grained feature extraction capability of the self-attention mechanism but also efficiently extracts effective global information.
To optimize model training, the method implements an original-feature enhancer in the channel attention through a softmax layer, strengthening effective feature representations in a deep network where channel attention and spatial attention are superimposed, and suppressing the noise generated when the weight matrix tends to 0 through repeated dot-product operations.
Drawings
FIG. 1 is a flow chart of an embodiment of the present invention.
FIG. 2 is a flow chart of data preprocessing according to an embodiment of the present invention.
Fig. 3 is a schematic diagram of a channel dimension attention network structure according to an embodiment of the present invention.
Fig. 4 is a schematic diagram of a spatial dimension self-attention network structure according to an embodiment of the present invention.
Fig. 5 is a comparison graph of attention feature extraction results for a large target object according to an embodiment of the present invention.
Fig. 6 is a comparison graph of attention feature extraction results for a small target object according to an embodiment of the present invention.
Detailed Description
In order to facilitate understanding and implementing the invention by those skilled in the art, the invention is further described below with reference to the accompanying drawings and examples.
Referring to fig. 1, the model is divided into four stages. During model training, the input image first undergoes data preprocessing to obtain an image embedding vector. The image embedding vector then passes through the channel attention calculation and the spatial self-attention calculation in sequence. After all four stages have been computed, the result is sent to the classifier unit for class probability calculation, and the classification result is finally output.
The invention provides an image classification method which specifically comprises the following steps.
Step 1, constructing a channel dimension attention network and a space dimension self-attention network based on a skeleton of a Pyramid Vision Transformer.
As shown in fig. 3, the first layer of the channel-dimension attention network of this embodiment is two parallel pooling layers, comprising max pooling and average pooling. The second layer is a shared parameter layer consisting of a multi-layer perceptron (MLP) with one hidden layer. The third layer is the element-wise summation of the max-pooled feature map and the average-pooled feature map. The fourth layer is a sigmoid layer. The fifth layer is the feature-enhancer softmax layer applied to the original feature map. The sixth layer is the element-wise summation of the two weight matrices output by the fourth and fifth layers. The seventh layer performs matrix multiplication between the weight matrix output by the sixth layer and the original feature map, finally outputting the channel-dimension attention feature map.
As shown in fig. 4, in the spatial-dimension self-attention network of this embodiment, the first layer is two parallel convolutions with kernels of 7 × 7 and 3 × 3 and strides of 7 and 1, generating the regional feature map and the local feature map respectively. The second layer is layer normalization. The third layer is a convolution with a 3 × 3 kernel and stride 2. The fourth layer is the self-attention calculation, comprising matrix multiplication, a softmax calculation, and multiplication of the weight matrix with the original matrix, finally outputting the spatial-dimension self-attention feature map.
Step 2, inputting the preprocessed input image into a channel dimension attention network to generate an attention feature map based on channel dimensions;
as shown in fig. 2, the embodiment performs preprocessing on an input image to obtain an image embedding vector, and the specific implementation thereof includes the following sub-steps.
Step 2.1: For any input image, a one-dimensional image embedding vector x_patch is generated by a convolution operation with zero padding (kernel size set to 2s-1, stride set to s, padding size set to s-1) followed by a flatten (tensor flattening) operation; the classification vector x_cls is concatenated at the very front of x_patch.
Step 2.2: Two-dimensional position coding is performed on the input image to obtain a two-dimensional position-encoding vector x_pos, which is added to the one-dimensional vector produced by the concatenation in step 2.1 to form the final input x of the model:
x = [x_cls || x_patch] + x_pos (1)
where x_cls and x_patch are the classification vector and the image embedding vector, [·||·] denotes concatenation between vectors, and x_pos denotes the position-encoding vector.
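A minimal PyTorch sketch of steps 2.1 and 2.2 follows, under the assumption of a learned position table and illustrative default sizes (s = 4, dim = 64, 224 × 224 input); class and parameter names are ours, not the patent's.

import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Sketch of steps 2.1-2.2: zero-padded convolution (kernel 2s-1, stride s,
    padding s-1), flatten, prepend the classification vector x_cls, add the
    position encoding x_pos (eq. 1)."""
    def __init__(self, in_ch=3, dim=64, s=4, img_size=224):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=2 * s - 1,
                              stride=s, padding=s - 1)
        n = (img_size // s) ** 2                             # number of patches
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))      # x_cls
        self.pos = nn.Parameter(torch.zeros(1, n + 1, dim))  # x_pos

    def forward(self, img):                            # img: (B, 3, H, W)
        x = self.proj(img).flatten(2).transpose(1, 2)  # x_patch: (B, N, dim)
        cls = self.cls.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1)                 # [x_cls || x_patch]
        return x + self.pos                            # eq. (1)

tokens = PatchEmbed()(torch.randn(2, 3, 224, 224))     # -> (2, 3137, 64)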
Referring to fig. 3, in the present embodiment, a feature image x after being preprocessed is input to a channel dimension attention network, so as to generate an attention feature map based on image channel dimensions; the specific implementation thereof comprises the following substeps.
Step 2.3: The preprocessed feature image x ∈ R^(H×W×C) is input simultaneously into an adaptive max pooling layer and an adaptive average pooling layer, outputting two intermediate feature maps x_Avg, x_Max ∈ R^(C×1×1).
Step 2.4: x_Avg and x_Max are respectively input into a shared parameter layer, which consists of a multi-layer perceptron (MLP) with one hidden layer; the hidden layer mainly serves to reduce parameter overhead. The invention sets the parameter reduction ratio r to 16, so that in the MLP the feature maps output by the first fully connected layer are x_Avg(FC1), x_Max(FC1) ∈ R^((C/16)×1×1); the ReLU activation function yields x_Avg(ReLU), x_Max(ReLU) ∈ R^((C/16)×1×1), and a second fully connected layer converts the feature-map dimension back to x_Avg(FC2), x_Max(FC2) ∈ R^(C×1×1). The average-pooled and max-pooled feature maps output by the shared parameter layer are summed element-wise, and a sigmoid layer is finally connected to generate the channel attention weight matrix Mc(x). The calculation process can be summarized as formula (2):
Mc(x) = σ(MLP(AvgPool(x)) + MLP(MaxPool(x))) (2)
Step 2.5: The original feature image x ∈ R^(H×W×C) is input into the softmax enhanced-feature layer, which outputs the weight matrix x_w of the original feature image. The weight matrix x_w is summed element-wise with Mc(x), the summation result is multiplied with the original feature image x, and after dimension conversion the final channel attention map x_1 ∈ R^(H×W×C) is generated. Formula (3) describes the calculation process of this step:
x_1 = (Mc(x) + softmax(x)) · x (3)
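A minimal PyTorch sketch of steps 2.3 to 2.5 follows. The softmax axis of the original-feature enhancer is not pinned down in the text, so softmax over the pooled channel descriptor is one assumed reading, and the routing of the CLS token around this block is likewise omitted as an assumption.

import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Sketch of steps 2.3-2.5 and eqs. (2)-(3) on a (B, C, H, W) map."""
    def __init__(self, channels, r=16):
        super().__init__()
        self.avg = nn.AdaptiveAvgPool2d(1)       # adaptive average pooling
        self.max = nn.AdaptiveMaxPool2d(1)       # adaptive max pooling
        # Shared-parameter MLP with one hidden layer of C/r units (r = 16).
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // r),
            nn.ReLU(),
            nn.Linear(channels // r, channels),
        )

    def forward(self, x):                        # x: (B, C, H, W)
        b, c, _, _ = x.shape
        avg = self.mlp(self.avg(x).view(b, c))   # MLP(AvgPool(x))
        mx = self.mlp(self.max(x).view(b, c))    # MLP(MaxPool(x))
        mc = torch.sigmoid(avg + mx)             # eq. (2): Mc(x)
        # softmax feature enhancer over the channel descriptor (assumed axis)
        enh = torch.softmax(x.view(b, c, -1).mean(-1), dim=1)
        return (mc + enh).view(b, c, 1, 1) * x   # eq. (3): x1 = (Mc(x)+softmax(x))·x

x1 = ChannelAttention(64)(torch.randn(2, 64, 56, 56))  # -> (2, 64, 56, 56)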
Step 3: The channel attention map is input into the spatial-dimension multi-scale self-attention network to generate an attention feature map based on the spatial dimensions of the feature map; the specific implementation comprises the following substeps.
Step 3.1: The channel attention map x_1 ∈ R^(H×W×C) is dimension-reset into a two-dimensional local feature map local_x ∈ R^(H×W×1), which serves as the input of the spatial-dimension multi-scale self-attention network.
Step 3.2: The self-attention network is divided into two feature extraction routes, local features and regional features.
The first route (local features) applies a depthwise separable convolution with kernel size 3 × 3 and stride 1 to the two-dimensional local feature map local_x to generate the Query matrix.
The second route (regional features) uses a depthwise separable convolution with kernel size 7 × 7 and stride 7 to generate a two-dimensional regional feature map Region_x ∈ R^((H/7)×(W/7)×1).
Step 3.3: To make the data uniformly distributed and the model training more stable, the two-dimensional regional feature map Region_x is dimension-reset into a one-dimensional vector and, after a layer normalization operation, reset back into a two-dimensional feature map.
Step 3.4: For the regenerated two-dimensional regional feature map Region_x, a depthwise separable convolution with kernel size 3 × 3 and stride 2 generates the Key matrix and the Value matrix.
Step 3.5: A flatten operation is applied to the Query, Key and Value matrices, flattening each from a two-dimensional matrix to a one-dimensional vector. Attention is then computed in the matrix form of the conventional self-attention mechanism, generating the final spatial-dimension self-attention feature map SA(x_1). The self-attention calculation of the spatial dimension can be described as formula (4):
local_x = Reshape2D(x_1)
Q = Flatten(Conv2d(local_x, k))
regional_x = Reshape2D(LN(Flatten(Conv2d(local_x, k))))
K, V = Flatten(Conv2d(regional_x, k))
SA(x_1) = softmax(Q·K^T / √d)·V (4)
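The following PyTorch sketch assembles steps 3.1 to 3.5. Keeping C channels with depthwise convolutions (rather than resetting to an H × W × 1 map) and using a single attention head are simplifying assumptions of this sketch, not details stated by the patent.

import torch
import torch.nn as nn

class MultiScaleSelfAttention(nn.Module):
    """Sketch of steps 3.1-3.5 and eq. (4): a local route (3x3, stride 1)
    builds the Query; a regional route (7x7, stride 7, then layer norm and
    3x3, stride 2) builds the Key and Value."""
    def __init__(self, dim):
        super().__init__()
        self.q_conv = nn.Conv2d(dim, dim, 3, stride=1, padding=1, groups=dim)
        self.r_conv = nn.Conv2d(dim, dim, 7, stride=7, groups=dim)
        self.norm = nn.LayerNorm(dim)
        self.kv_conv = nn.Conv2d(dim, dim * 2, 3, stride=2, padding=1, groups=dim)
        self.scale = dim ** -0.5

    def forward(self, x):                              # x: (B, C, H, W), H and W divisible by 7
        b, c, h, w = x.shape
        q = self.q_conv(x).flatten(2).transpose(1, 2)  # Query from the local route
        r = self.r_conv(x)                             # regional map, (B, C, H/7, W/7)
        r = self.norm(r.flatten(2).transpose(1, 2))    # layer-normalise the regions
        r = r.transpose(1, 2).reshape(b, c, h // 7, w // 7)
        k, v = self.kv_conv(r).flatten(2).transpose(1, 2).chunk(2, dim=-1)
        attn = torch.softmax(q @ k.transpose(1, 2) * self.scale, dim=-1)
        out = attn @ v                                 # eq. (4): softmax(QK^T/sqrt(d))V
        return out.transpose(1, 2).reshape(b, c, h, w)

sa = MultiScaleSelfAttention(64)(torch.randn(2, 64, 56, 56))  # -> (2, 64, 56, 56)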
Step 4: Steps 2 and 3 are repeated until the fourth stage; the finally generated multi-dimensional, multi-scale attention feature map is converted into a probability representation by the classifier unit, completing image classification.
In this embodiment, the image classification process is realized by training the multi-dimensional, multi-scale self-attention network. During training, the forward propagation proceeds as follows: the CLS classification vector is updated repeatedly at each stage, extracting multi-dimensional, multi-scale image features from shallow to deep. In the last stage, the final CLS vector passes through the feed-forward neural network layer FNN to generate a num × 1 one-dimensional vector, where num is the number of image classes in the training set; the mapped result vector is then normalized with the softmax function to obtain a probability result, which is compared with the label vector of the original image; back-propagation then follows, completing the supervised training of the model.
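A short PyTorch sketch of the classifier unit and training signal follows. Pairing the raw logits with cross-entropy (which applies softmax internally) is standard practice rather than a detail stated in the text, and the sizes are illustrative.

import torch
import torch.nn as nn

num_classes, dim = 1000, 512
fnn = nn.Linear(dim, num_classes)             # feed-forward neural network layer FNN

cls_vec = torch.randn(8, dim)                 # final-stage CLS vectors for a batch of 8
logits = fnn(cls_vec)                         # num-dimensional mapping
probs = torch.softmax(logits, dim=-1)         # probability representation

labels = torch.randint(0, num_classes, (8,))  # ground-truth label vector
loss = nn.CrossEntropyLoss()(logits, labels)  # supervised signal for back-propagation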
In a specific embodiment of the present application, the method is applied to the ImageNet1K dataset and compared with other classical classification learning methods, demonstrating the effectiveness of the method provided by the present application.
(1) Introduction of data sets.
We trained the multi-dimensional, multi-scale self-attention image classification method proposed in this embodiment on the training set of the ImageNet1K dataset and used the highest accuracy on the validation set as the index of model performance. The ImageNet1K dataset contains 1.3 million images in 1000 classes, with 1.28 million training images and 50,000 validation images. We used all the images for training and fine-tuned the model on ImageNet1K.
(2) And (4) setting an experiment.
In our experiments, we applied mixup (mixed-sample augmentation), random horizontal flipping, label smoothing, and random erasing as data augmentation. We used the AdamW optimization algorithm with cosine learning-rate scheduling, training the model for 300 epochs with the weight decay set to 0.01, the initial learning rate set to 0.001, and the momentum set to 0.9. During training we randomly cropped 224 × 224 regions; for evaluation we resized the short side to 256 and took a 224 × 224 center crop. Our model was trained on 4 RTX 3090Ti servers.
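The optimization setup reads as the following PyTorch configuration; interpreting "momentum 0.9" as AdamW's beta1 is an assumption, as is the stand-in model, and the data-loading loop is omitted.

import torch
import torch.nn as nn

model = nn.Linear(512, 1000)  # stand-in for the full network
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3,
                              betas=(0.9, 0.999), weight_decay=0.01)
epochs = 300
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)

for epoch in range(epochs):
    # ... forward/backward passes over the ImageNet1K training set go here ...
    optimizer.step()       # update once gradients for a batch are accumulated
    scheduler.step()       # cosine decay of the learning rate, once per epoch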
(3) And (4) experimental analysis.
In this section, we use parameter counts at the same level as the criterion for model comparison and compare the proposed multi-dimensional, multi-scale self-attention image classification method with two families of methods closely related to ours: representative convolutional-neural-network-based image classification methods (Table 1) and Transformer-based image classification methods (Table 2).
On the ImageNet1K dataset we first compared the proposed method with classification methods based on convolutional neural networks. As shown in Table 1, compared with the methods of the ResNet residual-network family (including ResNet, SEResNet, and SENet), our method is smaller, more efficient, and more accurate. This stems primarily from the attention mechanism in our method, which improves model performance by refining feature maps.
TABLE 1 Performance comparison results with convolutional neural network-based image classification methods
We further compared the multi-dimensional, multi-scale self-attention image classification method with state-of-the-art Vision Transformer-based image classification methods. Our method is consistently better than the baseline methods ViT and PVT in all respects, achieving higher accuracy with fewer parameters and FLOPs. These advantages mainly stem from the fact that our method realizes more abstract high-level feature representations and enhances the fine-grained feature extraction capability.
TABLE 2 Performance comparison results with Transformer-based image classification methods
(4) Attention is drawn to image visualization.
To provide a qualitative analysis of the method, we performed attention-map visualization of the proposed method and of the baseline method PVT using Grad-CAM (a gradient-based visual explanation method for deep networks). Because the feature map output by the last layer of the network carries rich high-level semantics and detailed spatial information, the model weights are fed into the Grad-CAM network to find the gradients of all features mapped at the last layer; the importance of each neuron is then calculated from this gradient information. The weights of both our proposed method and PVT were trained on the ImageNet1K dataset. In this embodiment, four images each of large target objects and small target objects were selected from the ImageNet1K validation set (objects occupying more than one third of the image are classed as large targets, those occupying less than one third as small targets); see figs. 5 and 6, which respectively show the attention-map visualization results of the conventional self-attention method and of the multi-dimensional, multi-scale self-attention method provided by this embodiment.
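As an illustration of the Grad-CAM procedure just described, here is a hand-rolled PyTorch sketch: gradients at the last feature map weight each channel, giving per-neuron importance. ResNet-18 stands in for the trained model, and the layer choice is an assumption.

import torch
import torchvision.models as models

model = models.resnet18(weights=None).eval()
store = {}
layer = model.layer4  # last convolutional stage
layer.register_forward_hook(lambda m, i, o: store.update(feat=o))
layer.register_full_backward_hook(lambda m, gi, go: store.update(grad=go[0]))

img = torch.randn(1, 3, 224, 224)           # stand-in input image
model(img)[0].max().backward()              # gradient of the top-class score

w = store['grad'].mean(dim=(2, 3), keepdim=True)            # neuron importance
cam = torch.relu((w * store['feat']).sum(dim=1)).squeeze(0) # (7, 7) heat map
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)    # normalise to [0, 1]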
In fig. 5, the recognition results of PVT and of the method herein demonstrate that even though the multi-scale self-attention attends to local and regional scopes, it can still capture useful global information. In fig. 6, we observe that for small target objects in complex images, such as the quartz clock (column 2), PVT can be confused by other similar objects in the image. In contrast, our method accurately locates and covers the target object even when the image is complex and contains other, visually similar objects. In addition, we found that each attention map produced by PVT contains noise disturbances. These results further prove that multi-dimensional, multi-scale representation learning effectively reduces noise interference during training and makes better use of the local position information of the target object to aggregate fine-grained features for locating and covering the target region.
Image classification based on the self-attention mechanism is among the most widely studied and applied classification problems today, and fine-grained feature representation of images is one of the research focuses and difficulties in this field. The invention provides an image classification method based on a multi-dimensional, multi-scale self-attention mechanism. First, we learn channel attention as the first representation dimension and multi-scale spatial self-attention as the second; compared with spatial self-attention alone, the model can learn more abstract high-level feature representations. Second, a novel multi-scale spatial self-attention method is proposed that realizes information interaction between local and regional features through convolution. In addition, an original-feature enhancer is introduced into the channel attention, suppressing the noise disturbance that arises when the weight matrix tends to 0 in the deeper layers of the network and optimizing the training process of the model. Compared with traditional image classification methods based on the self-attention mechanism, the method improves the generalization of the model: it strengthens the model's ability to extract fine-grained features, effectively extracts global information from the image, reduces noise disturbance during training, and improves the image classification performance of the model.

Claims (7)

1. An image classification method, characterized by comprising the steps of:
step 1, constructing a channel dimension attention network and a space dimension multi-scale self-attention network based on a PVT architecture;
step 2, inputting the preprocessed input image into a channel dimension attention network to generate an attention feature map based on channel dimensions;
step 3, inputting the attention feature map based on the channel dimension into a space dimension multi-scale self-attention network to generate a multi-scale self-attention feature map based on the space dimension of the feature map;
step 4, repeating the step 2 and the step 3 until a fourth stage in the framework, inputting the finally generated multi-dimensional and multi-scale attention feature map into a classifier unit, converting the vector output by the model into probability representation, and finishing image classification;
the PVT architecture comprises four stages in total, and the resolution of input is gradually reduced through an embedded layer; in each stage, a channel dimension attention network and a space dimension multi-scale self-attention network are respectively constructed;
the channel dimension attention network described in step 1:
the first layer is two parallel pooling layers, including maximum pooling and average pooling;
the second layer is a shared parameter layer and consists of a plurality of layers of perceptrons and a hidden layer;
the third layer is the element summation operation of the maximum pooling characteristic map and the average pooling characteristic map;
the fourth layer is a sigmoid layer;
the fifth layer is a softmax layer for the original feature map;
the sixth layer is to perform element summation operation on the two weight matrixes output by the fourth layer and the fifth layer;
the seventh layer is that the weight matrix output by the sixth layer and the original characteristic diagram are subjected to matrix multiplication operation, and finally, a channel dimension characteristic diagram is output;
the multi-scale self-attention network of spatial dimensions described in step 1:
the first layer is two parallel convolution kernels, the convolution kernels are respectively 7 × 7 and 3 × 3, and the step lengths are respectively 7 and 1;
the second layer is a layer normalization operation;
the third layer is convolution operation with convolution kernel of 3 × 3 and step size of 2;
the fourth layer is self-attention calculation, including matrix multiplication, softmax layer normalization, multiplication of the weight matrix and the original matrix, and finally outputting the spatial dimension attention feature map.
2. The image classification method according to claim 1, characterized in that in step 2 a convolution operation with zero padding is used to generate an image embedding vector for the input image; to realize the image classification task, a classification vector CLS is prepended to the image embedding vector and used as the input of the channel dimension attention network; the intermediate feature map is input into the channel dimension attention network, generating an attention feature map based on the image channel dimensions.
3. The image classification method according to claim 1 or 2, characterized in that in step 3 the attention feature map of the channel dimension is dimension-reset to generate a two-dimensional local feature map, which is used as the input of the multi-scale self-attention network of the spatial dimension; exploiting the hierarchical structure of convolution kernels, and using depthwise separable convolution operations with different kernel sizes and strides, the self-attention network is divided into two routes, local feature calculation and regional feature calculation, generating a local feature map and a regional feature map highly correlated in semantic features; the local context information is taken as the Query and the regional context information as the Key and Value to calculate the final spatial-dimension multi-scale self-attention feature map.
4. The image classification method according to claim 3, characterized in that in step 4 the CLS classification vector is updated repeatedly across the four stages, extracting multi-dimensional, multi-scale high-level semantic features from shallow to deep; in the last stage, the final CLS classification vector is input into the feed-forward neural network layer FNN of the classifier unit to generate a num × 1 vector, where num represents the number of image classes in the training set; finally, class probability calculation and final classification are completed through the softmax layer of the classifier unit.
5. The image classification method according to claim 2, characterized in that in step 2, the input image is preprocessed and then input into a channel dimension attention network, and an attention feature map based on channel dimensions is generated; the specific implementation comprises the following substeps:
step 2.1: for any input image, generating a one-dimensional image embedding vector through convolution operation with zero padding and flatten operation, and splicing the classification vector at the forefront of the image embedding vector;
step 2.2: performing two-dimensional position coding on the input image to obtain a two-dimensional position coding vector, and inserting the two-dimensional position coding vector into the one-dimensional vector generated after final splicing in the step 2.1 as a final input x of the model:
x = [x_cls || x_patch] + x_pos (1)
wherein x_cls and x_patch are the classification vector and the image embedding vector, [·||·] denotes concatenation between vectors, and x_pos denotes the position-encoding vector.
6. The image classification method according to claim 2, wherein in step 2 the preprocessed feature map x is input into the channel dimension attention network to generate an attention feature map based on the image channel dimensions; the specific implementation comprises the following substeps:
step 2.3: the preprocessed feature image x ∈ R^(H×W×C) is input simultaneously into an adaptive max pooling layer and an adaptive average pooling layer, outputting two intermediate feature maps x_Avg, x_Max ∈ R^(C×1×1);
step 2.4: the intermediate feature maps x_Avg, x_Max are respectively input into a shared parameter layer, which consists of a multi-layer perceptron and a hidden layer; the parameter reduction ratio r is set to 16, so that in the multi-layer perceptron the feature maps output by the first fully connected layer are x_Avg(FC1), x_Max(FC1) ∈ R^((C/16)×1×1); the ReLU activation function yields x_Avg(ReLU), x_Max(ReLU) ∈ R^((C/16)×1×1), and a second fully connected layer converts the feature-map dimension to x_Avg(FC2), x_Max(FC2) ∈ R^(C×1×1); the average-pooled and max-pooled feature maps output by the shared parameter layer are summed element-wise, and a sigmoid layer is finally connected to generate the channel attention weight matrix Mc(x), the calculation process being summarized as formula (2):
Mc(x) = σ(MLP(AvgPool(x)) + MLP(MaxPool(x))) (2)
step 2.5: the original feature image x ∈ R^(H×W×C) is input into a softmax enhanced-feature layer, outputting the weight matrix x_w of the original feature image; the weight matrix x_w is summed element-wise with Mc(x), the summation result is multiplied with the original feature image x, and after dimension conversion the final channel attention map x_1 ∈ R^(H×W×C) is generated; the calculation process is as follows:
x_1 = (Mc(x) + softmax(x)) · x (3).
7. The image classification method according to claim 3, characterized in that step 3 comprises the following substeps:
step 3.1: the channel attention map x_1 ∈ R^(H×W×C) is dimension-reset into a two-dimensional local feature map local_x ∈ R^(H×W×1), serving as the input of the multi-scale self-attention network of the spatial dimension;
step 3.2: the self-attention network is divided into two feature extraction routes, local features and regional features;
the first route (local features) applies a depthwise separable convolution with kernel size 3 × 3 and stride 1 to the two-dimensional local feature map local_x ∈ R^(H×W×1) to generate the Query matrix;
the second route (regional features) uses a depthwise separable convolution with kernel size 7 × 7 and stride 7 to generate a two-dimensional regional feature map Region_x ∈ R^((H/7)×(W/7)×1);
step 3.3: the two-dimensional regional feature map Region_x ∈ R^((H/7)×(W/7)×1) is dimension-reset into a one-dimensional vector and, after a layer normalization operation, reset back into a two-dimensional regional feature map;
step 3.4: for the regenerated two-dimensional regional feature map Region_x ∈ R^((H/7)×(W/7)×1), a depthwise separable convolution with kernel size 3 × 3 and stride 2 generates the Key matrix and the Value matrix;
step 3.5: a flatten operation is applied to the Query, Key and Value matrices, flattening each from a two-dimensional matrix to a one-dimensional vector; attention is then computed in the matrix form of the self-attention mechanism, generating the final spatial-dimension self-attention feature map SA(x_1); the self-attention calculation of the spatial dimension is described as formula (4):
local_x = Reshape2D(x_1)
Q = Flatten(Conv2d(local_x, k))
regional_x = Reshape2D(LN(Flatten(Conv2d(local_x, k))))
K, V = Flatten(Conv2d(regional_x, k))
SA(x_1) = softmax(Q·K^T / √d)·V (4).
CN202211120458.1A 2022-09-15 2022-09-15 Image classification method Active CN115222998B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211120458.1A CN115222998B (en) 2022-09-15 2022-09-15 Image classification method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211120458.1A CN115222998B (en) 2022-09-15 2022-09-15 Image classification method

Publications (2)

Publication Number Publication Date
CN115222998A CN115222998A (en) 2022-10-21
CN115222998B (en) 2023-01-03

Family

ID=83617247

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211120458.1A Active CN115222998B (en) 2022-09-15 2022-09-15 Image classification method

Country Status (1)

Country Link
CN (1) CN115222998B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115761250B (en) * 2022-11-21 2023-10-10 北京科技大学 Compound reverse synthesis method and device
CN115844425B (en) * 2022-12-12 2024-05-17 天津大学 DRDS brain electrical signal identification method based on transducer brain region time sequence analysis
CN118015525A (en) * 2024-04-07 2024-05-10 深圳市锐明像素科技有限公司 Method, device, terminal and storage medium for identifying road ponding in image

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113344146A (en) * 2021-08-03 2021-09-03 武汉大学 Image classification method and system based on double attention mechanism and electronic equipment
CN114067107A (en) * 2022-01-13 2022-02-18 中国海洋大学 Multi-scale fine-grained image recognition method and system based on multi-grained attention

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10262237B2 (en) * 2016-12-08 2019-04-16 Intel Corporation Technologies for improved object detection accuracy with multi-scale representation and training
CN113709455B (en) * 2021-09-27 2023-10-24 北京交通大学 Multi-level image compression method using transducer

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113344146A (en) * 2021-08-03 2021-09-03 武汉大学 Image classification method and system based on double attention mechanism and electronic equipment
CN114067107A (en) * 2022-01-13 2022-02-18 中国海洋大学 Multi-scale fine-grained image recognition method and system based on multi-grained attention

Also Published As

Publication number Publication date
CN115222998A (en) 2022-10-21

Similar Documents

Publication Publication Date Title
CN111489358B (en) Three-dimensional point cloud semantic segmentation method based on deep learning
CN110263912B (en) Image question-answering method based on multi-target association depth reasoning
Cheng et al. An analysis of generative adversarial networks and variants for image synthesis on MNIST dataset
CN110490946B (en) Text image generation method based on cross-modal similarity and antagonism network generation
CN115222998B (en) Image classification method
CN112288011B (en) Image matching method based on self-attention deep neural network
CN110929080B (en) Optical remote sensing image retrieval method based on attention and generation countermeasure network
CN112733866A (en) Network construction method for improving text description correctness of controllable image
CN114049381A (en) Twin cross target tracking method fusing multilayer semantic information
CN113159023A (en) Scene text recognition method based on explicit supervision mechanism
Ning et al. Conditional generative adversarial networks based on the principle of homologycontinuity for face aging
CN112597324A (en) Image hash index construction method, system and equipment based on correlation filtering
Wang et al. Urban building extraction from high-resolution remote sensing imagery based on multi-scale recurrent conditional generative adversarial network
CN111899203A (en) Real image generation method based on label graph under unsupervised training and storage medium
CN114973222A (en) Scene text recognition method based on explicit supervision mechanism
Zou et al. Image classification model based on deep learning in internet of things
CN116610778A (en) Bidirectional image-text matching method based on cross-modal global and local attention mechanism
CN116740527A (en) Remote sensing image change detection method combining U-shaped network and self-attention mechanism
Fan et al. A novel sonar target detection and classification algorithm
CN115909036A (en) Local-global adaptive guide enhanced vehicle weight identification method and system
CN114780767A (en) Large-scale image retrieval method and system based on deep convolutional neural network
US20230053618A1 (en) Recurrent unit for generating or processing a sequence of images
Fan et al. Hcpvf: Hierarchical cascaded point-voxel fusion for 3d object detection
CN113850182A (en) Action identification method based on DAMR-3 DNet
CN117033609A (en) Text visual question-answering method, device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant