CN115222998B - Image classification method - Google Patents

Image classification method


Publication number
CN115222998B
Authority
CN
China
Prior art keywords
attention
layer
feature map
dimension
image
Prior art date
Legal status
Active
Application number
CN202211120458.1A
Other languages
Chinese (zh)
Other versions
CN115222998A (en)
Inventor
颜成钢
殷俊
颜拥
王洪波
胡冀
熊剑平
李亮
郑博仑
林聚财
孔书晗
王亚运
孙垚棋
金恒
朱尊杰
高宇涵
殷海兵
王鸿奎
陈楚翘
刘一秀
李文超
王廷宇
张勇东
张继勇
Current Assignee
Hangzhou Dianzi University
Zhejiang Dahua Technology Co Ltd
Original Assignee
Hangzhou Dianzi University
Zhejiang Dahua Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Hangzhou Dianzi University, Zhejiang Dahua Technology Co Ltd filed Critical Hangzhou Dianzi University
Priority to CN202211120458.1A priority Critical patent/CN115222998B/en
Publication of CN115222998A publication Critical patent/CN115222998A/en
Application granted granted Critical
Publication of CN115222998B publication Critical patent/CN115222998B/en

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 - Arrangements for image or video recognition or understanding
    • G06V 10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/764 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/08 - Learning methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 - Arrangements for image or video recognition or understanding
    • G06V 10/20 - Image preprocessing
    • G06V 10/30 - Noise filtering
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 - Arrangements for image or video recognition or understanding
    • G06V 10/40 - Extraction of image or video features
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 - Arrangements for image or video recognition or understanding
    • G06V 10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 - Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/774 - Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 - Arrangements for image or video recognition or understanding
    • G06V 10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an image classification method. First, a channel-dimension attention network and a spatial-dimension multi-scale self-attention network are constructed. The preprocessed input image is then fed into the channel-dimension attention network to generate an attention feature map based on channel dimensions. This channel-dimension attention feature map is input into the spatial-dimension multi-scale self-attention network, generating a multi-scale self-attention feature map based on the spatial dimensions of the feature map. Finally, the resulting multi-dimensional, multi-scale attention feature map is input into a classifier unit, which converts the vector output by the model into a probability representation, completing image classification. The invention designs a novel multi-scale self-attention mechanism that uses a series of depthwise separable convolution operations to generate a local feature map and a regional feature map highly correlated with the feature information, which not only strengthens the fine-grained feature extraction capability of the self-attention mechanism but also efficiently extracts effective global information.

Description

Image classification method
Technical Field
The invention belongs to the technical field of image classification, and in particular relates to an image classification method based on a self-attention mechanism with multi-dimensional, multi-scale feature representation.
Background
In recent years, transformers have been widely used in the field of NLP by virtue of their powerful context modeling capabilities. Researchers in the field of computer vision also have a jump to try to introduce the core designed in the transform, i.e., the self-attention mechanism, into the visual task. ViT, the first model to introduce a Transformer into the CV domain, first demonstrated that the Self-Attention mechanism in the Transformer was completely relied on to achieve the most advanced performance in image classification. Currently, an image classification method based on a self-attention mechanism has become a mainstream method in current research. However, since ViT inherits the entire architecture of Transformer, and the Transformer was originally proposed in the machine translation task, its design is more suitable for the task in the NLP domain, so there are the following bottlenecks in the development of ViT.
(1) ViT inherits the columnar structure of the Transformer. It takes coarse image patches as input and can only output a low-resolution feature map, which is expensive in both computation and storage. Researchers currently alleviate this problem by introducing feature pyramids. (2) The Transformer models the relationships between tokenized image patches (tokens) as a sequence. In image classification the input is usually a 2D image whose pixels have a strong spatial structure; the ViT approach destroys the structural information of the two-dimensional image and hinders context modeling of feature maps at different scales. Some researchers have attempted to solve this problem by introducing convolution operations, overlapping pooling, zero padding, and similar techniques. (3) The self-attention mechanism operates at global scope. Self-attention computes the response at a position in the sequence by attending to global information and taking a weighted average in the projection space. It does not consider the degree of attention among local fine-grained features and lacks the ability to perceive local feature information. Building on the latest backbone techniques proposed for the first two bottlenecks, researchers have designed various multi-scale Vision Transformers, such as the Pyramid Vision Transformer (Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, and Ling Shao. 2021. Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction Without Convolutions. ICCV 2021, 568-578). Most of them fuse self-attention computations at different scales by creating additional tokens outside the self-attention mechanism, which improves model performance but complicates implementation and deployment.
Beyond these widely noted bottlenecks, we find that the self-attention mechanism compresses the channel information of the feature map and computes attention only in the spatial dimension. This single mode of attention lacks representativeness when expressing the degree of importance between features, and the model may also pick up noise during learning.
Disclosure of Invention
The invention aims to provide an image classification method that addresses these bottlenecks, thereby exploring the application of the self-attention mechanism to image classification tasks.
The technical solution adopted by the invention to solve this technical problem is as follows.
Step 1, constructing a channel dimension attention network and a space dimension multi-scale self-attention network based on a Pyramid Vision Transformer (PVT) architecture.
The Pyramid Vision Transformer (PVT) architecture contains four stages (Stage) in total, and the resolution of the input is gradually reduced by the embedding layer (Patch Embedding). In each stage, a channel-dimension attention network and a spatial-dimension multi-scale self-attention network are constructed respectively.
Step 2: Preprocess the input image, feed it into the channel-dimension attention network, and generate an attention feature map based on channel dimensions.
Step 3: Input the channel-dimension attention feature map into the spatial-dimension multi-scale self-attention network, generating a multi-scale self-attention feature map based on the spatial dimensions of the feature map.
Step 4: Repeat steps 2 and 3 until the fourth stage; input the finally generated multi-dimensional, multi-scale attention feature map into a classifier unit, convert the vector output by the model into a probability representation, and complete image classification.
Further, in step 1, the first layer of the channel-dimension attention network is two parallel pooling layers, comprising max pooling and average pooling. The second layer is a shared parameter layer consisting of a multi-layer perceptron with one hidden layer. The third layer is the element-wise summation of the max-pooled feature map and the average-pooled feature map. The fourth layer is a sigmoid layer. The fifth layer is a softmax layer applied to the original feature map. The sixth layer is the element-wise summation of the two weight matrices output by the fourth and fifth layers. The seventh layer performs matrix multiplication between the weight matrix output by the sixth layer and the original feature map, finally outputting the channel-dimension feature map.
Further, in the spatial-dimension multi-scale self-attention network described in step 1, the first layer is two parallel convolutions with kernels of 7 × 7 and 3 × 3 and strides of 7 and 1 respectively. The second layer is a layer normalization operation. The third layer is a convolution with a 3 × 3 kernel and stride 2. The fourth layer is the self-attention calculation, comprising matrix multiplication, softmax normalization, and multiplication of the weight matrix with the original matrix, finally outputting the spatial-dimension attention feature map.
In step 2, a convolution operation with zero padding is applied to the input image to generate an image embedding vector. To realize the image classification task, a classification vector CLS is prepended to the image embedding vector and used as the input of the channel-dimension attention network; the resulting intermediate feature map is fed into the channel-dimension attention network, generating an attention feature map based on the image channel dimensions.
In step 3, the channel-dimension attention feature map is dimension-reset into a two-dimensional local feature map, which serves as the input of the spatial-dimension multi-scale self-attention network. Exploiting the hierarchical structure of convolution kernels, and through depthwise separable convolutions with different kernel sizes and strides, the self-attention network is split into two routes, local-feature calculation and regional-feature calculation, generating a local feature map and a regional feature map whose semantic features are highly correlated. The local context information serves as the Query, and the regional context information serves as the Key and Value, from which the final spatial-dimension multi-scale self-attention feature map is calculated.
In step 4, the CLS classification vector is updated repeatedly across the four stages, extracting multi-dimensional, multi-scale high-level semantic features from shallow to deep. In the last stage, the final CLS vector is input into the feed-forward neural network layer FNN of the classifier unit to generate a num × 1 vector, where num is the number of image classes in the training set; finally, class probability calculation is completed through the softmax layer of the classifier unit, completing the final classification.
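For illustration, the following PyTorch-style sketch wires the four stages to a classifier head as described in steps 1 to 4. All names, channel widths, and the use of global average pooling in place of the CLS-token bookkeeping are assumptions for exposition, not the patented implementation; the two attention modules are sketched later in the text.

import torch
import torch.nn as nn

class Stage(nn.Module):
    """One PVT-style stage: overlapping patch embedding, then channel
    attention and spatial multi-scale self-attention (steps 2 and 3)."""
    def __init__(self, in_ch, out_ch, s):
        super().__init__()
        # Zero-padded embedding convolution: kernel 2s-1, stride s, padding s-1.
        self.embed = nn.Conv2d(in_ch, out_ch, 2 * s - 1, stride=s, padding=s - 1)
        # Stand-ins; the two attention modules are sketched later in this text.
        self.channel_attn = nn.Identity()
        self.spatial_attn = nn.Identity()

    def forward(self, x):
        x = self.embed(x)          # resolution drops stage by stage
        x = self.channel_attn(x)   # step 2: channel-dimension attention
        x = self.spatial_attn(x)   # step 3: spatial multi-scale self-attention
        return x

# Four stages and a classifier head (step 4). Channel widths are illustrative,
# and global average pooling stands in for the CLS-token bookkeeping.
backbone = nn.Sequential(Stage(3, 64, 4), Stage(64, 128, 2),
                         Stage(128, 256, 2), Stage(256, 512, 2))
head = nn.Linear(512, 1000)

feats = backbone(torch.randn(1, 3, 224, 224)).mean(dim=(2, 3))
probs = torch.softmax(head(feats), dim=-1)   # probability representation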
Compared with the prior art, the beneficial effects of the invention are as follows.
The method introduces channel attention on top of the self-attention mechanism and establishes a multi-dimensional representation learning space. Compared with the traditional self-attention method, it extracts features along both the channel and spatial dimensions of the image more efficiently in the image classification task, so that the model learns more abstract high-level feature representations, and noise disturbance during model learning is reduced.
In the design of the self-attention network, unlike the traditional method of computing self-attention at global scope, the invention designs a novel multi-scale self-attention mechanism that uses a series of depthwise separable convolution operations to generate a local feature map and a regional feature map highly correlated with the feature information, which not only strengthens the fine-grained feature extraction capability of the self-attention mechanism but also efficiently extracts effective global information.
To optimize model training, the method implements an original-feature enhancer in the channel attention through a softmax layer, strengthening effective feature representations in a deep network where channel attention and spatial attention are superimposed, and suppressing the noise generated when the weight matrix tends to 0 through repeated dot-product operations.
Drawings
FIG. 1 is a flow chart of an embodiment of the present invention.
FIG. 2 is a flow chart of data preprocessing according to an embodiment of the present invention.
Fig. 3 is a schematic diagram of a channel dimension attention network structure according to an embodiment of the present invention.
Fig. 4 is a schematic diagram of a spatial dimension self-attention network structure according to an embodiment of the present invention.
Fig. 5 is a comparison graph of attention feature extraction results for a large target object according to an embodiment of the present invention.
Fig. 6 is a comparison graph of attention feature extraction results for a small target object according to an embodiment of the present invention.
Detailed Description
In order to facilitate understanding and implementing the invention by those skilled in the art, the invention is further described below with reference to the accompanying drawings and examples.
Referring to fig. 1, the model is divided into four stages. During model training, the input image first undergoes data preprocessing to obtain an image embedding vector. The image embedding vector then passes through the channel attention calculation and the spatial self-attention calculation in sequence. After all four stages have been computed, the result is sent to the classifier unit for class probability calculation, and the classification result is finally output.
The invention provides an image classification method which specifically comprises the following steps.
Step 1, constructing a channel dimension attention network and a space dimension self-attention network based on a skeleton of a Pyramid Vision Transformer.
As shown in fig. 3, the first layer of the channel-dimension attention network of this embodiment is two parallel pooling layers, comprising max pooling and average pooling. The second layer is a shared parameter layer consisting of a multi-layer perceptron (MLP) with one hidden layer. The third layer is the element-wise summation of the max-pooled feature map and the average-pooled feature map. The fourth layer is a sigmoid layer. The fifth layer is the feature-enhancer softmax layer applied to the original feature map. The sixth layer is the element-wise summation of the two weight matrices output by the fourth and fifth layers. The seventh layer performs matrix multiplication between the weight matrix output by the sixth layer and the original feature map, finally outputting the channel-dimension attention feature map.
As shown in fig. 4, in the spatial-dimension self-attention network of this embodiment, the first layer is two parallel convolutions with kernels of 7 × 7 and 3 × 3 and strides of 7 and 1, generating the regional feature map and the local feature map respectively. The second layer is layer normalization. The third layer is a convolution with a 3 × 3 kernel and stride 2. The fourth layer is the self-attention calculation, comprising matrix multiplication, a softmax calculation, and multiplication of the weight matrix with the original matrix, finally outputting the spatial-dimension self-attention feature map.
Step 2, inputting the preprocessed input image into a channel dimension attention network to generate an attention feature map based on channel dimensions;
as shown in fig. 2, the embodiment performs preprocessing on an input image to obtain an image embedding vector, and the specific implementation thereof includes the following sub-steps.
Step 2.1: For any input image, a one-dimensional image embedding vector x_patch is generated by a convolution operation with zero padding (kernel size set to 2s-1, stride set to s, padding size set to s-1) followed by a flatten (tensor flattening) operation; the classification vector x_cls is concatenated at the very front of x_patch.
Step 2.2: Two-dimensional position coding is performed on the input image to obtain a two-dimensional position-encoding vector x_pos, which is added to the one-dimensional vector produced by the concatenation in step 2.1 to form the final input x of the model:
x = [x_cls || x_patch] + x_pos (1)
where x_cls and x_patch are the classification vector and the image embedding vector, [·||·] denotes concatenation between vectors, and x_pos denotes the position-encoding vector.
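A minimal PyTorch sketch of steps 2.1 and 2.2 follows, under the assumption of a learned position table and illustrative default sizes (s = 4, dim = 64, 224 × 224 input); class and parameter names are ours, not the patent's.

import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Sketch of steps 2.1-2.2: zero-padded convolution (kernel 2s-1, stride s,
    padding s-1), flatten, prepend the classification vector x_cls, add the
    position encoding x_pos (eq. 1)."""
    def __init__(self, in_ch=3, dim=64, s=4, img_size=224):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=2 * s - 1,
                              stride=s, padding=s - 1)
        n = (img_size // s) ** 2                             # number of patches
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))      # x_cls
        self.pos = nn.Parameter(torch.zeros(1, n + 1, dim))  # x_pos

    def forward(self, img):                            # img: (B, 3, H, W)
        x = self.proj(img).flatten(2).transpose(1, 2)  # x_patch: (B, N, dim)
        cls = self.cls.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1)                 # [x_cls || x_patch]
        return x + self.pos                            # eq. (1)

tokens = PatchEmbed()(torch.randn(2, 3, 224, 224))     # -> (2, 3137, 64)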
Referring to fig. 3, in the present embodiment, a feature image x after being preprocessed is input to a channel dimension attention network, so as to generate an attention feature map based on image channel dimensions; the specific implementation thereof comprises the following substeps.
Step 2.3: The preprocessed feature image x ∈ R^(H×W×C) is input simultaneously into an adaptive max pooling layer and an adaptive average pooling layer, outputting two intermediate feature maps x_Avg, x_Max ∈ R^(C×1×1).
Step 2.4: x_Avg and x_Max are respectively input into a shared parameter layer, which consists of a multi-layer perceptron (MLP) with one hidden layer; the hidden layer mainly serves to reduce parameter overhead. The invention sets the parameter reduction ratio r to 16, so that in the MLP the feature maps output by the first fully connected layer are x_Avg(FC1), x_Max(FC1) ∈ R^((C/16)×1×1); the ReLU activation function yields x_Avg(ReLU), x_Max(ReLU) ∈ R^((C/16)×1×1), and a second fully connected layer converts the feature-map dimension back to x_Avg(FC2), x_Max(FC2) ∈ R^(C×1×1). The average-pooled and max-pooled feature maps output by the shared parameter layer are summed element-wise, and a sigmoid layer is finally connected to generate the channel attention weight matrix Mc(x). The calculation process can be summarized as formula (2):
Mc(x) = σ(MLP(AvgPool(x)) + MLP(MaxPool(x))) (2)
Step 2.5: The original feature image x ∈ R^(H×W×C) is input into the softmax enhanced-feature layer, which outputs the weight matrix x_w of the original feature image. The weight matrix x_w is summed element-wise with Mc(x), the summation result is multiplied with the original feature image x, and after dimension conversion the final channel attention map x_1 ∈ R^(H×W×C) is generated. Formula (3) describes the calculation process of this step:
x_1 = (Mc(x) + softmax(x)) · x (3)
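A minimal PyTorch sketch of steps 2.3 to 2.5 follows. The softmax axis of the original-feature enhancer is not pinned down in the text, so softmax over the pooled channel descriptor is one assumed reading, and the routing of the CLS token around this block is likewise omitted as an assumption.

import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Sketch of steps 2.3-2.5 and eqs. (2)-(3) on a (B, C, H, W) map."""
    def __init__(self, channels, r=16):
        super().__init__()
        self.avg = nn.AdaptiveAvgPool2d(1)       # adaptive average pooling
        self.max = nn.AdaptiveMaxPool2d(1)       # adaptive max pooling
        # Shared-parameter MLP with one hidden layer of C/r units (r = 16).
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // r),
            nn.ReLU(),
            nn.Linear(channels // r, channels),
        )

    def forward(self, x):                        # x: (B, C, H, W)
        b, c, _, _ = x.shape
        avg = self.mlp(self.avg(x).view(b, c))   # MLP(AvgPool(x))
        mx = self.mlp(self.max(x).view(b, c))    # MLP(MaxPool(x))
        mc = torch.sigmoid(avg + mx)             # eq. (2): Mc(x)
        # softmax feature enhancer over the channel descriptor (assumed axis)
        enh = torch.softmax(x.view(b, c, -1).mean(-1), dim=1)
        return (mc + enh).view(b, c, 1, 1) * x   # eq. (3): x1 = (Mc(x)+softmax(x))·x

x1 = ChannelAttention(64)(torch.randn(2, 64, 56, 56))  # -> (2, 64, 56, 56)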
Step 3: The channel attention map is input into the spatial-dimension multi-scale self-attention network to generate an attention feature map based on the spatial dimensions of the feature map; the specific implementation comprises the following substeps.
Step 3.1: The channel attention map x_1 ∈ R^(H×W×C) is dimension-reset into a two-dimensional local feature map local_x ∈ R^(H×W×1), which serves as the input of the spatial-dimension multi-scale self-attention network.
Step 3.2: The self-attention network is divided into two feature extraction routes, local features and regional features.
The first route (local features) applies a depthwise separable convolution with kernel size 3 × 3 and stride 1 to the two-dimensional local feature map local_x to generate the Query matrix.
The second route (regional features) uses a depthwise separable convolution with kernel size 7 × 7 and stride 7 to generate a two-dimensional regional feature map Region_x ∈ R^((H/7)×(W/7)×1).
Step 3.3: To make the data uniformly distributed and the model training more stable, the two-dimensional regional feature map Region_x is dimension-reset into a one-dimensional vector and, after a layer normalization operation, reset back into a two-dimensional feature map.
Step 3.4: For the regenerated two-dimensional regional feature map Region_x, a depthwise separable convolution with kernel size 3 × 3 and stride 2 generates the Key matrix and the Value matrix.
Step 3.5: A flatten operation is applied to the Query, Key and Value matrices, flattening each from a two-dimensional matrix to a one-dimensional vector. Attention is then computed in the matrix form of the conventional self-attention mechanism, generating the final spatial-dimension self-attention feature map SA(x_1). The self-attention calculation of the spatial dimension can be described as formula (4):
local_x = Reshape2D(x_1)
Q = Flatten(Conv2d(local_x, k))
regional_x = Reshape2D(LN(Flatten(Conv2d(local_x, k))))
K, V = Flatten(Conv2d(regional_x, k))
SA(x_1) = softmax(Q·K^T / √d)·V (4)
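The following PyTorch sketch assembles steps 3.1 to 3.5. Keeping C channels with depthwise convolutions (rather than resetting to an H × W × 1 map) and using a single attention head are simplifying assumptions of this sketch, not details stated by the patent.

import torch
import torch.nn as nn

class MultiScaleSelfAttention(nn.Module):
    """Sketch of steps 3.1-3.5 and eq. (4): a local route (3x3, stride 1)
    builds the Query; a regional route (7x7, stride 7, then layer norm and
    3x3, stride 2) builds the Key and Value."""
    def __init__(self, dim):
        super().__init__()
        self.q_conv = nn.Conv2d(dim, dim, 3, stride=1, padding=1, groups=dim)
        self.r_conv = nn.Conv2d(dim, dim, 7, stride=7, groups=dim)
        self.norm = nn.LayerNorm(dim)
        self.kv_conv = nn.Conv2d(dim, dim * 2, 3, stride=2, padding=1, groups=dim)
        self.scale = dim ** -0.5

    def forward(self, x):                              # x: (B, C, H, W), H and W divisible by 7
        b, c, h, w = x.shape
        q = self.q_conv(x).flatten(2).transpose(1, 2)  # Query from the local route
        r = self.r_conv(x)                             # regional map, (B, C, H/7, W/7)
        r = self.norm(r.flatten(2).transpose(1, 2))    # layer-normalise the regions
        r = r.transpose(1, 2).reshape(b, c, h // 7, w // 7)
        k, v = self.kv_conv(r).flatten(2).transpose(1, 2).chunk(2, dim=-1)
        attn = torch.softmax(q @ k.transpose(1, 2) * self.scale, dim=-1)
        out = attn @ v                                 # eq. (4): softmax(QK^T/sqrt(d))V
        return out.transpose(1, 2).reshape(b, c, h, w)

sa = MultiScaleSelfAttention(64)(torch.randn(2, 64, 56, 56))  # -> (2, 64, 56, 56)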
Step 4: Steps 2 and 3 are repeated until the fourth stage; the finally generated multi-dimensional, multi-scale attention feature map is converted into a probability representation by the classifier unit, completing image classification.
In this embodiment, the image classification process is realized by training the multi-dimensional, multi-scale self-attention network. During training, the forward propagation proceeds as follows: the CLS classification vector is updated repeatedly at each stage, extracting multi-dimensional, multi-scale image features from shallow to deep. In the last stage, the final CLS vector passes through the feed-forward neural network layer FNN to generate a num × 1 one-dimensional vector, where num is the number of image classes in the training set; the mapped result vector is then normalized with the softmax function to obtain a probability result, which is compared with the label vector of the original image; back-propagation then follows, completing the supervised training of the model.
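A short PyTorch sketch of the classifier unit and training signal follows. Pairing the raw logits with cross-entropy (which applies softmax internally) is standard practice rather than a detail stated in the text, and the sizes are illustrative.

import torch
import torch.nn as nn

num_classes, dim = 1000, 512
fnn = nn.Linear(dim, num_classes)             # feed-forward neural network layer FNN

cls_vec = torch.randn(8, dim)                 # final-stage CLS vectors for a batch of 8
logits = fnn(cls_vec)                         # num-dimensional mapping
probs = torch.softmax(logits, dim=-1)         # probability representation

labels = torch.randint(0, num_classes, (8,))  # ground-truth label vector
loss = nn.CrossEntropyLoss()(logits, labels)  # supervised signal for back-propagation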
In a specific embodiment of the present application, the method is applied to the ImageNet1K dataset and compared with other classical classification learning methods, demonstrating the effectiveness of the method provided by the present application.
(1) Introduction of data sets.
We trained the multi-dimensional, multi-scale self-attention image classification method proposed in this embodiment on the training set of the ImageNet1K dataset and used the highest accuracy on the validation set as the index of model performance. The ImageNet1K dataset contains 1.3 million images in 1000 classes, with 1.28 million training images and 50,000 validation images. We used all the images for training and fine-tuned the model on ImageNet1K.
(2) And (4) setting an experiment.
In our experiments, we applied mixup (mixed-sample augmentation), random horizontal flipping, label smoothing, and random erasing as data augmentation. We used the AdamW optimization algorithm with cosine learning-rate scheduling, training the model for 300 epochs with the weight decay set to 0.01, the initial learning rate set to 0.001, and the momentum set to 0.9. During training we randomly cropped 224 × 224 regions; for evaluation we resized the short side to 256 and took a 224 × 224 center crop. Our model was trained on 4 RTX 3090Ti servers.
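The optimization setup reads as the following PyTorch configuration; interpreting "momentum 0.9" as AdamW's beta1 is an assumption, as is the stand-in model, and the data-loading loop is omitted.

import torch
import torch.nn as nn

model = nn.Linear(512, 1000)  # stand-in for the full network
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3,
                              betas=(0.9, 0.999), weight_decay=0.01)
epochs = 300
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)

for epoch in range(epochs):
    # ... forward/backward passes over the ImageNet1K training set go here ...
    optimizer.step()       # update once gradients for a batch are accumulated
    scheduler.step()       # cosine decay of the learning rate, once per epoch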
(3) And (4) experimental analysis.
In this section, we use parameter counts at the same level as the criterion for model comparison and compare the proposed multi-dimensional, multi-scale self-attention image classification method with two families of methods closely related to ours: representative convolutional-neural-network-based image classification methods (Table 1) and Transformer-based image classification methods (Table 2).
On the ImageNet1K dataset we first compared the proposed method with classification methods based on convolutional neural networks. As shown in Table 1, compared with the methods of the ResNet residual-network family (including ResNet, SEResNet, and SENet), our method is smaller, more efficient, and more accurate. This stems primarily from the attention mechanism in our method, which improves model performance by refining feature maps.
TABLE 1 Performance comparison results with convolutional neural network-based image classification methods
We further compared the multi-dimensional, multi-scale self-attention image classification method with state-of-the-art Vision Transformer-based image classification methods. Our method is consistently better than the baseline methods ViT and PVT in all respects, achieving higher accuracy with fewer parameters and FLOPs. These advantages mainly stem from the fact that our method realizes more abstract high-level feature representations and enhances the fine-grained feature extraction capability.
TABLE 2 Performance comparison results with Transformer-based image classification methods
(4) Attention is drawn to image visualization.
To provide a qualitative analysis of the method, we performed attention-map visualization of the proposed method and of the baseline method PVT using Grad-CAM (a gradient-based visual explanation method for deep networks). Because the feature map output by the last layer of the network carries rich high-level semantics and detailed spatial information, the model weights are fed into the Grad-CAM network to find the gradients of all features mapped at the last layer; the importance of each neuron is then calculated from this gradient information. The weights of both our proposed method and PVT were trained on the ImageNet1K dataset. In this embodiment, four images each of large target objects and small target objects were selected from the ImageNet1K validation set (objects occupying more than one third of the image are classed as large targets, those occupying less than one third as small targets); see figs. 5 and 6, which respectively show the attention-map visualization results of the conventional self-attention method and of the multi-dimensional, multi-scale self-attention method provided by this embodiment.
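As an illustration of the Grad-CAM procedure just described, here is a hand-rolled PyTorch sketch: gradients at the last feature map weight each channel, giving per-neuron importance. ResNet-18 stands in for the trained model, and the layer choice is an assumption.

import torch
import torchvision.models as models

model = models.resnet18(weights=None).eval()
store = {}
layer = model.layer4  # last convolutional stage
layer.register_forward_hook(lambda m, i, o: store.update(feat=o))
layer.register_full_backward_hook(lambda m, gi, go: store.update(grad=go[0]))

img = torch.randn(1, 3, 224, 224)           # stand-in input image
model(img)[0].max().backward()              # gradient of the top-class score

w = store['grad'].mean(dim=(2, 3), keepdim=True)            # neuron importance
cam = torch.relu((w * store['feat']).sum(dim=1)).squeeze(0) # (7, 7) heat map
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)    # normalise to [0, 1]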
In fig. 5, the recognition results of PVT and of the method herein demonstrate that even though the multi-scale self-attention attends to local and regional scopes, it can still capture useful global information. In fig. 6, we observe that for small target objects in complex images, such as the quartz clock (column 2), PVT can be confused by other similar objects in the image. In contrast, our method accurately locates and covers the target object even when the image is complex and contains other, visually similar objects. In addition, we found that each attention map produced by PVT contains noise disturbances. These results further prove that multi-dimensional, multi-scale representation learning effectively reduces noise interference during training and makes better use of the local position information of the target object to aggregate fine-grained features for locating and covering the target region.
Image classification based on the self-attention mechanism is among the most widely studied and applied classification problems today, and fine-grained feature representation of images is one of the research focuses and difficulties in this field. The invention provides an image classification method based on a multi-dimensional, multi-scale self-attention mechanism. First, we learn channel attention as the first representation dimension and multi-scale spatial self-attention as the second; compared with spatial self-attention alone, the model can learn more abstract high-level feature representations. Second, a novel multi-scale spatial self-attention method is proposed that realizes information interaction between local and regional features through convolution. In addition, an original-feature enhancer is introduced into the channel attention, suppressing the noise disturbance that arises when the weight matrix tends to 0 in the deeper layers of the network and optimizing the training process of the model. Compared with traditional image classification methods based on the self-attention mechanism, the method improves the generalization of the model: it strengthens the model's ability to extract fine-grained features, effectively extracts global information from the image, reduces noise disturbance during training, and improves the image classification performance of the model.

Claims (7)

1. An image classification method, characterized by comprising the steps of:
step 1, constructing a channel dimension attention network and a space dimension multi-scale self-attention network based on a PVT architecture;
step 2, inputting the preprocessed input image into a channel dimension attention network to generate an attention feature map based on channel dimensions;
step 3, inputting the attention feature map based on the channel dimension into a space dimension multi-scale self-attention network to generate a multi-scale self-attention feature map based on the space dimension of the feature map;
step 4, repeating the step 2 and the step 3 until a fourth stage in the framework, inputting the finally generated multi-dimensional and multi-scale attention feature map into a classifier unit, converting the vector output by the model into probability representation, and finishing image classification;
the PVT architecture comprises four stages in total, and the resolution of input is gradually reduced through an embedded layer; in each stage, a channel dimension attention network and a space dimension multi-scale self-attention network are respectively constructed;
the channel dimension attention network described in step 1:
the first layer is two parallel pooling layers, including maximum pooling and average pooling;
the second layer is a shared parameter layer and consists of a plurality of layers of perceptrons and a hidden layer;
the third layer is the element summation operation of the maximum pooling characteristic map and the average pooling characteristic map;
the fourth layer is a sigmoid layer;
the fifth layer is a softmax layer for the original feature map;
the sixth layer is to perform element summation operation on the two weight matrixes output by the fourth layer and the fifth layer;
the seventh layer is that the weight matrix output by the sixth layer and the original characteristic diagram are subjected to matrix multiplication operation, and finally, a channel dimension characteristic diagram is output;
the multi-scale self-attention network of spatial dimensions described in step 1:
the first layer is two parallel convolution kernels, the convolution kernels are respectively 7 × 7 and 3 × 3, and the step lengths are respectively 7 and 1;
the second layer is a layer normalization operation;
the third layer is convolution operation with convolution kernel of 3 × 3 and step size of 2;
the fourth layer is self-attention calculation, including matrix multiplication, softmax layer normalization, multiplication of the weight matrix and the original matrix, and finally outputting the spatial dimension attention feature map.
2. The image classification method according to claim 1, characterized in that in step 2 a convolution operation with zero padding is used to generate an image embedding vector for the input image; to realize the image classification task, a classification vector CLS is prepended to the image embedding vector and used as the input of the channel dimension attention network; the intermediate feature map is input into the channel dimension attention network, generating an attention feature map based on the image channel dimensions.
3. The image classification method according to claim 1 or 2, characterized in that in step 3 the attention feature map of the channel dimension is dimension-reset to generate a two-dimensional local feature map, which is used as the input of the multi-scale self-attention network of the spatial dimension; exploiting the hierarchical structure of convolution kernels, and using depthwise separable convolution operations with different kernel sizes and strides, the self-attention network is divided into two routes, local feature calculation and regional feature calculation, generating a local feature map and a regional feature map highly correlated in semantic features; the local context information is taken as the Query and the regional context information as the Key and Value to calculate the final spatial-dimension multi-scale self-attention feature map.
4. The image classification method according to claim 3, characterized in that in step 4 the CLS classification vector is updated repeatedly across the four stages, extracting multi-dimensional, multi-scale high-level semantic features from shallow to deep; in the last stage, the final CLS classification vector is input into the feed-forward neural network layer FNN of the classifier unit to generate a num × 1 vector, where num represents the number of image classes in the training set; finally, class probability calculation and final classification are completed through the softmax layer of the classifier unit.
5. The image classification method according to claim 2, characterized in that in step 2, the input image is preprocessed and then input into a channel dimension attention network, and an attention feature map based on channel dimensions is generated; the specific implementation comprises the following substeps:
step 2.1: for any input image, generating a one-dimensional image embedding vector through convolution operation with zero padding and flatten operation, and splicing the classification vector at the forefront of the image embedding vector;
step 2.2: performing two-dimensional position coding on the input image to obtain a two-dimensional position coding vector, and inserting the two-dimensional position coding vector into the one-dimensional vector generated after final splicing in the step 2.1 as a final input x of the model:
x = [x_cls || x_patch] + x_pos (1)
wherein x_cls and x_patch are the classification vector and the image embedding vector, [·||·] denotes concatenation between vectors, and x_pos denotes the position-encoding vector.
6. The image classification method according to claim 2, wherein in step 2 the preprocessed feature map x is input into the channel dimension attention network to generate an attention feature map based on the image channel dimensions; the specific implementation comprises the following substeps:
step 2.3: the preprocessed feature image x ∈ R^(H×W×C) is input simultaneously into an adaptive max pooling layer and an adaptive average pooling layer, outputting two intermediate feature maps x_Avg, x_Max ∈ R^(C×1×1);
step 2.4: the intermediate feature maps x_Avg, x_Max are respectively input into a shared parameter layer, which consists of a multi-layer perceptron and a hidden layer; the parameter reduction ratio r is set to 16, so that in the multi-layer perceptron the feature maps output by the first fully connected layer are x_Avg(FC1), x_Max(FC1) ∈ R^((C/16)×1×1); the ReLU activation function yields x_Avg(ReLU), x_Max(ReLU) ∈ R^((C/16)×1×1), and a second fully connected layer converts the feature-map dimension to x_Avg(FC2), x_Max(FC2) ∈ R^(C×1×1); the average-pooled and max-pooled feature maps output by the shared parameter layer are summed element-wise, and a sigmoid layer is finally connected to generate the channel attention weight matrix Mc(x), the calculation process being summarized as formula (2):
Mc(x) = σ(MLP(AvgPool(x)) + MLP(MaxPool(x))) (2)
step 2.5: the original feature image x ∈ R^(H×W×C) is input into a softmax enhanced-feature layer, outputting the weight matrix x_w of the original feature image; the weight matrix x_w is summed element-wise with Mc(x), the summation result is multiplied with the original feature image x, and after dimension conversion the final channel attention map x_1 ∈ R^(H×W×C) is generated; the calculation process is as follows:
x_1 = (Mc(x) + softmax(x)) · x (3).
7. The image classification method according to claim 3, characterized in that step 3 comprises the following substeps:
step 3.1: the channel attention map x_1 ∈ R^(H×W×C) is dimension-reset into a two-dimensional local feature map local_x ∈ R^(H×W×1), serving as the input of the multi-scale self-attention network of the spatial dimension;
step 3.2: the self-attention network is divided into two feature extraction routes, local features and regional features;
the first route (local features) applies a depthwise separable convolution with kernel size 3 × 3 and stride 1 to the two-dimensional local feature map local_x ∈ R^(H×W×1) to generate the Query matrix;
the second route (regional features) uses a depthwise separable convolution with kernel size 7 × 7 and stride 7 to generate a two-dimensional regional feature map Region_x ∈ R^((H/7)×(W/7)×1);
step 3.3: the two-dimensional regional feature map Region_x ∈ R^((H/7)×(W/7)×1) is dimension-reset into a one-dimensional vector and, after a layer normalization operation, reset back into a two-dimensional regional feature map;
step 3.4: for the regenerated two-dimensional regional feature map Region_x ∈ R^((H/7)×(W/7)×1), a depthwise separable convolution with kernel size 3 × 3 and stride 2 generates the Key matrix and the Value matrix;
step 3.5: a flatten operation is applied to the Query, Key and Value matrices, flattening each from a two-dimensional matrix to a one-dimensional vector; attention is then computed in the matrix form of the self-attention mechanism, generating the final spatial-dimension self-attention feature map SA(x_1); the self-attention calculation of the spatial dimension is described as formula (4):
local_x = Reshape2D(x_1)
Q = Flatten(Conv2d(local_x, k))
regional_x = Reshape2D(LN(Flatten(Conv2d(local_x, k))))
K, V = Flatten(Conv2d(regional_x, k))
SA(x_1) = softmax(Q·K^T / √d)·V (4).
CN202211120458.1A 2022-09-15 2022-09-15 Image classification method Active CN115222998B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211120458.1A CN115222998B (en) 2022-09-15 2022-09-15 Image classification method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211120458.1A CN115222998B (en) 2022-09-15 2022-09-15 Image classification method

Publications (2)

Publication Number Publication Date
CN115222998A CN115222998A (en) 2022-10-21
CN115222998B (en) 2023-01-03

Family

ID=83617247

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211120458.1A Active CN115222998B (en) 2022-09-15 2022-09-15 Image classification method

Country Status (1)

Country Link
CN (1) CN115222998B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115761250B (en) * 2022-11-21 2023-10-10 北京科技大学 Compound reverse synthesis method and device
CN115844425B (en) * 2022-12-12 2024-05-17 天津大学 DRDS brain electrical signal identification method based on transducer brain region time sequence analysis
CN118015525A (en) * 2024-04-07 2024-05-10 深圳市锐明像素科技有限公司 Method, device, terminal and storage medium for identifying road ponding in image

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113344146A (en) * 2021-08-03 2021-09-03 武汉大学 Image classification method and system based on double attention mechanism and electronic equipment
CN114067107A (en) * 2022-01-13 2022-02-18 中国海洋大学 Multi-scale fine-grained image recognition method and system based on multi-grained attention

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10262237B2 (en) * 2016-12-08 2019-04-16 Intel Corporation Technologies for improved object detection accuracy with multi-scale representation and training
CN113709455B (en) * 2021-09-27 2023-10-24 北京交通大学 Multi-level image compression method using transducer

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113344146A (en) * 2021-08-03 2021-09-03 武汉大学 Image classification method and system based on double attention mechanism and electronic equipment
CN114067107A (en) * 2022-01-13 2022-02-18 中国海洋大学 Multi-scale fine-grained image recognition method and system based on multi-grained attention

Also Published As

Publication number Publication date
CN115222998A (en) 2022-10-21

Similar Documents

Publication Publication Date Title
CN111489358B (en) Three-dimensional point cloud semantic segmentation method based on deep learning
CN110263912B (en) Image question-answering method based on multi-target association depth reasoning
Cheng et al. An analysis of generative adversarial networks and variants for image synthesis on MNIST dataset
CN110490946B (en) Text image generation method based on cross-modal similarity and antagonism network generation
CN115222998B (en) Image classification method
CN112288011B (en) Image matching method based on self-attention deep neural network
CN110929080B (en) Optical remote sensing image retrieval method based on attention and generation countermeasure network
CN112733866A (en) Network construction method for improving text description correctness of controllable image
CN114049381A (en) Twin cross target tracking method fusing multilayer semantic information
CN113159023A (en) Scene text recognition method based on explicit supervision mechanism
Ning et al. Conditional generative adversarial networks based on the principle of homologycontinuity for face aging
CN112597324A (en) Image hash index construction method, system and equipment based on correlation filtering
Wang et al. Urban building extraction from high-resolution remote sensing imagery based on multi-scale recurrent conditional generative adversarial network
CN111899203A (en) Real image generation method based on label graph under unsupervised training and storage medium
CN114973222A (en) Scene text recognition method based on explicit supervision mechanism
Zou et al. Image classification model based on deep learning in internet of things
CN116610778A (en) Bidirectional image-text matching method based on cross-modal global and local attention mechanism
CN116740527A (en) Remote sensing image change detection method combining U-shaped network and self-attention mechanism
Fan et al. A novel sonar target detection and classification algorithm
CN115909036A (en) Local-global adaptive guide enhanced vehicle weight identification method and system
CN114780767A (en) Large-scale image retrieval method and system based on deep convolutional neural network
US20230053618A1 (en) Recurrent unit for generating or processing a sequence of images
Fan et al. Hcpvf: Hierarchical cascaded point-voxel fusion for 3d object detection
CN113850182A (en) Action identification method based on DAMR-3 DNet
CN117033609A (en) Text visual question-answering method, device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant