CN113011444B

CN113011444B - Image identification method based on neural network frequency domain attention mechanism

Info

Publication number: CN113011444B
Application number: CN202011504311.3A
Authority: CN
Inventors: 李玺; 秦泽群; 张芃怡
Original assignee: Zhejiang University ZJU
Current assignee: Zhejiang University ZJU
Priority date: 2020-12-18
Filing date: 2020-12-18
Publication date: 2022-05-13
Anticipated expiration: 2040-12-18
Also published as: CN113011444A

Abstract

The invention discloses a frequency domain attention mechanism design method based on a neural network, which is used for image recognition. The method specifically comprises the following steps: acquiring an image recognition data set for training a neural network, and defining an algorithm target; establishing a single frequency domain transformation basis function selection model; establishing a combined frequency domain transformation basis function selection model; establishing a frequency domain attention mechanism based on a neural network; training a prediction model based on the modeling result; and performing image recognition by using the prediction model. By bringing the information of different frequency domains into an attention mechanism, the invention realizes the great improvement of precision on various image identification tasks (image classification, target detection and example segmentation) under the condition of the same calculated amount and complexity, and has good application value.

Description

Image identification method based on neural network frequency domain attention mechanism

Technical Field

The invention belongs to the field of image processing, and particularly relates to an image identification method based on a neural network frequency domain attention mechanism.

Background

In recent years, the attention mechanism of the neural network gradually attracts people due to simple calculation and remarkable effect, and is widely applied to many fields such as computer vision. The mechanism mainly comprises two key steps: the first is how to efficiently extract information from the neural network as input to the attention mechanism; the second is how to design an attention calculation method, get reasonable attention from the input, and improve the learning of the neural network. For the first point, existing methods all use global average pooling operations to efficiently extract information for attention calculations; for the second point, the existing method generally uses a fully-connected network as an attention calculation method, and meanwhile, since the fully-connected network has the calculation complexity of inputting a scale square term, the complexity of the first step is also constrained, so that people must use a global average pooling operation to extract information. Although the global average pooling operation is computationally simple and efficient, it is equivalent to extracting only the lowest frequency portion of the information, while the information of other frequencies is entirely discarded.

Disclosure of Invention

Aiming at the problems in the prior art, the invention provides an image identification method based on a neural network frequency domain attention mechanism, which adopts a design of the neural network frequency domain attention mechanism combined with multi-band information, has the same calculation complexity as the global average pooling operation, and can extract more frequency spectrum information, so that the input of the attention mechanism contains more abundant information, thereby improving the accuracy of the whole network and simultaneously keeping the same calculation amount.

In order to achieve the purpose, the technical scheme of the invention is as follows:

an image identification method based on a neural network frequency domain attention mechanism comprises the following steps:

s1, acquiring an image recognition data set for training a neural network;

s2, establishing an attention basic network by taking ResNet as a backbone;

s3, establishing a single frequency domain transformation basis function selection model based on the attention basic network in the S2;

s4, establishing a combined frequency domain transformation basis function selection model on the basis of S2 and S3;

s5, establishing a frequency domain attention mechanism based on the neural network on the basis of S4 to form a final model;

s6, training the final model in S5 based on the image recognition data set in S1 to obtain an image prediction model;

and S7, inputting the image to be recognized into the image prediction model for image recognition.

Preferably, in step S1, the data set for image recognition includes a group of images

In which I_iK is the number of images in the image group for the ith image;

the algorithm targets are defined as: and acquiring a classification result of each picture.

Further, in step S2, the process of establishing the attention infrastructure network is as follows:

s21, constructing ResNet as a basic backbone network;

s22, adding an attention mechanism on the basis of ResNet to construct an attention base network, wherein X is the same as R^C×H×WFor the output characteristics of a single layer in the ResNet network, wherein C, H, W is the channel number, the height of the characteristic diagram, and the width of the characteristic diagram of the characteristic, respectively, the attention mechanism is to transform the output X of the layer as follows:

att＝sigmoid(fc(f_i))

wherein att ∈ R^CFor the attention vector obtained after transformation, sigmoid () is a sigmoid activation function, fc (-) is a two-layer fully-connected network, f_i∈R^CIs the frequency spectrum of the input data X;

transformed output characteristics of one layer of ResNet network

Comprises the following steps:

wherein

Att, the ith channel of the transformed feature_iIs the ith value, X, of the attention vector_i,:,:The ith channel which is input data X; an attention mechanism is added to each layer in the ResNet network, the output characteristics of the current layer are transformed, and then the transformed characteristics are used

Inputting the feature after attention processing into the next layer of ResNet, and obtaining the attention base network.

Further, in step S3, the process of selecting a model by the single frequency domain transform basis function is as follows:

s31, outputting the characteristic X belonging to R of each layer^C×H×WIs divided into C two-dimensional characteristic graphs x^2d∈R^H×WAnd for each two-dimensional feature map x^2dPerforming discrete cosine transform, wherein the transform process comprises the following steps:

s.t.h∈{0,1,…,H-1},w∈{0,1,…,W-1}

for a two-dimensional feature map x of size H W^2dObtaining H multiplied by W transformed frequency spectrum components; f. of^2d∈R^H×WNamely, obtaining a discrete cosine transform frequency spectrum result;

for transforming the frequency spectrum f by discrete cosine^2dIn [ h, w ]]A value of the location;

s32, aiming at C two-dimensional feature maps x^2dResulting C spectra f^2dEach time f is selected^2dFor X ∈ R, then^C×H×WEach time obtaining an f_i∈R^C(ii) a F is to be measured_iAnd (5) bringing the attention basic network established in the S2, training and testing the performance of the frequency spectrum component as single input, and finally obtaining the performance sequence of all the frequency spectrum components according to the test results of different frequency components.

Further, in step S4, the process of establishing the combined frequency domain transform basis function selection model is as follows:

s41, sorting the performance according to the single frequency spectrum obtained in the step S32 when the single frequency spectrum is used as input, and sequentially taking 1, 2, 4, 8, 16 and 32 frequency spectrum components with the highest performance to form 6 combinations of frequency components with different quantities;

s42 for renA combination of X ∈ R^C×H×WDividing the channel dimension, namely C dimension, according to the number of the frequency components; assuming that the number of frequency domains in a combination is nf, then nf should be able to divide C by X⁰,X¹,…,X^nf-1]For the divided part, the input is divided as follows:

wherein

Represents the first of X

To

A channel; after division, each part is sequentially subjected to frequency spectrum decomposition by using corresponding frequency bands in the frequency component combination according to the method of S32 to obtain [ f⁰,f¹,…,f^nf-1]Each of which

s.t.j belongs to {0,1, …, nf-1 }; and then splicing the frequency spectrum of each part:

f_i＝cat([f⁰,f¹,…,f^nf-1])

where cat (. cndot.) is the splicing function, yielding f_i∈R^C；

S43, f obtained by respectively combining 6 combinations of 1, 2, 4, 8, 16 and 32 spectrum components_i∈R^CSubstituting the data into the attention basic network established in S2, training and testing the model to obtain the performance of each combination;

s44, selecting the combination with the highest performance as the frequency spectrum input f 'of the final model'_i。

Further, in step S5, the process of establishing the frequency domain attention mechanism based on the neural network is as follows:

s51, input Spectrum f 'for the Final model obtained in S44'_iThe following attention mechanism is established and the attention vector is obtained:

att′＝sigmoid(fc(f′_i))

s53, for each channel of the input image or the characteristic X of the basic network in S2, performing attention scale transformation according to the attention vector att' to obtain final output

Wherein

Att 'as the ith channel of the transformed feature'_iIs the ith value, X, of attention vector att_i,:,:And inputting the ith channel of the image or the feature, and establishing a frequency domain attention mechanism of the neural network according to the ith channel to form a final model.

Further, the specific process of step S6 is as follows: based on the image recognition data set in S1, after single spectrum performance ranking obtained by S2 and S3 is used, 1, 2, 4, 8, 16 and 32 frequencies with the highest performance are respectively selected to obtain 6 spectrum combinations, and then the 6 spectrum combinations are substituted into S4 to obtain each combination spectrum performance ranking and obtain the spectrum combination with the highest performance; and substituting the spectrum combination with the highest performance into S5 to serve as an input spectrum of a final model, and performing final model training based on the image recognition data set in S1 to obtain an image recognition prediction model.

Further, step S7 is specifically as follows: and after the prediction model in the step S6 is obtained, inputting the image to be recognized into the prediction model for prediction to obtain an image classification prediction result.

Compared with the existing attention mechanism method, the image identification method based on the neural network frequency domain attention mechanism has the following beneficial effects:

firstly, the image identification method based on the neural network frequency domain attention mechanism defines an attention mechanism based on frequency domain analysis. The original attention mechanism is popularized to the frequency domain, and information noticed by the attention mechanism is more complete due to the complete property of the frequency domain.

Secondly, compared with the original mean value method, the frequency domain analysis method expanded by the image identification method based on the neural network frequency domain attention mechanism has the same parameter amount and calculated amount, and can seamlessly expand the original arbitrary attention mechanism network.

Finally, the invention realizes the great improvement of precision on various image identification tasks (image classification, target detection and example segmentation) under the condition of the same calculation amount and complexity by bringing the information of different frequency domains into an attention mechanism, and has good application value.

Drawings

FIG. 1 is a flowchart of an image recognition method based on a neural network frequency domain attention mechanism.

Detailed Description

The invention will be further elucidated and described with reference to the drawings and the detailed description. The technical characteristics of the embodiments of the invention can be correspondingly combined without mutual conflict.

In a preferred embodiment of the present invention, as shown in fig. 1, there is provided an image recognition method based on a neural network frequency domain attention mechanism, which includes the following steps:

and S1, acquiring an image recognition data set for training the neural network.

In step S1 of the present embodiment, the data set for image recognition includes a group of images

Wherein I_iK is the number of images in the image group for the ith image;

And S2, establishing an attention base network by using ResNet as a backbone.

In step S2 of this embodiment, the specific process is as follows:

s21, constructing ResNet as a basic backbone network;

att＝sigmoid(fc(f_i))

wherein att ∈ R^CFor the attention vector obtained after transformation, sigmoid () is a sigmoid activation function, fc (-) is a two-layer fully-connected network, f_i∈R^CIs the spectrum of the input data X. The method of obtaining the frequency spectrum may be a single frequency domain transform basis function selection model in S3, or may be a combined frequency domain transform basis function selection model in S4.

Transformed output characteristics of one layer of ResNet network

Comprises the following steps:

wherein

As a result of attentionInputting the feature after the force processing into the next layer of ResNet, and obtaining the attention base network.

S3, establishing a single frequency domain transformation basis function selection model based on the attention base network in S2.

In step S3 of this embodiment, the specific process is as follows:

s.t.h∈{0,1,…,H-1},w∈{0,1,…,W-1}

s32, aiming at C two-dimensional feature maps x^2dResulting C spectra f^2dEach time f is selected^2dOf a spectrum (e.g. the first C spectra f)^2dOnly select

Second C frequency spectra f^2dOnly select

) Then for X ∈ R^C×H×WEach time obtaining an f_i∈R^C(ii) a F is to be measured_iAnd (5) bringing the attention basic network established in the S2, training and testing the performance of the frequency spectrum component as single input, and finally obtaining the performance sequence of all the frequency spectrum components according to the test results of different frequency components.

And S4, establishing a combined frequency domain transformation basis function selection model on the basis of S2 and S3.

In step S4 of this embodiment, the specific process is as follows:

s42, for any combination, inputting X epsilon R^C×H×WDividing the channel dimension, namely C dimension, according to the number of the frequency components; assuming that the number of frequency domains in a combination is nf, then nf should be able to divide C by X⁰,X¹,…,X^nf-1]For the divided part, the input is divided as follows:

wherein

Represents the first of X

To

f_i＝cat([f⁰,f¹,…,f^nf-1])

where cat (. cndot.) is the splicing function, yielding f_i∈R^C；

S43, 1, 2,4. F obtained from 6 combinations of 8, 16 and 32 spectrum components_i∈R^CSubstituting the data into the attention basic network established in S2, training and testing the model to obtain the performance of each combination;

And S5, establishing a frequency domain attention mechanism based on the neural network on the basis of the S4, and forming a final model. In step S5 of the present embodiment, the process of establishing the frequency domain attention mechanism based on the neural network is as follows:

att′＝sigmoid(fc(f′_i))

Wherein

Att 'as the ith channel of the transformed feature'_iIs the ith value, X, of attention vector att_i,:,:And inputting the ith channel of the image or the feature, and establishing a frequency domain attention mechanism of the neural network to form a final model.

And S6, training the final model in S5 based on the image recognition data set in S1 to obtain an image prediction model.

In step S6 of the present embodiment, the process of training the prediction model based on the modeling results of S3, S4, and S5 is as follows: based on the image recognition data set in S1, after single spectrum performance ranking obtained in S2 and S3 is used, 1, 2, 4, 8, 16, 32 frequencies with the highest performance are respectively taken to obtain 6 spectrum combinations, and then the 6 spectrum combinations are substituted into S4 to obtain each combination spectrum performance ranking, and a spectrum combination with the highest performance is obtained; and substituting the spectrum combination with the highest performance into S5 to serve as an input spectrum of a final model, and performing final model training based on the image recognition data set in S1 to obtain an image recognition prediction model.

In step S7 of this embodiment, the specific process is as follows: and after the prediction model in the step S6 is obtained, inputting the image to be recognized into the prediction model for prediction to obtain an image classification prediction result.

The methods of S1-S7 are applied to specific data sets to demonstrate the technical effects that can be achieved.

Examples

The implementation method of this embodiment is as described above, and specific steps are not elaborated, and the effect is shown only for case data. The invention is implemented on a data set with truth value labels of two images, which respectively comprises the following steps:

ImageNet dataset [1 ]: the data set contained 1000 classes of natural images, 1281167 training pictures, 50000 verification images, each image labeled to contain a category.

MS COCO data set [2 ]: the data set includes object detection tasks and instance segmentation tasks, including 80 countable object classes and 91 countable object classes. The data set had over 33 million images, 150 object instances.

In this embodiment, classification accuracy comparison is mainly performed on the ImageNet data set, which is Top-1 accuracy and Top-5 accuracy respectively. In addition, the present embodiment compares the parameter quantities Parameters with the calculated quantity FLOPS.

Table 1 comparison of evaluation indexes on ImageNet dataset in this example

On the MS COCO data set, the present embodiment uses the network proposed in the patent as a backbone network, and uses fast RCNN and Mask RCNN to respectively implement an object detection task and an instance segmentation task, where the comparison indexes include an average accuracy AP, an average accuracy AP50 when the threshold is 0.5, and an average accuracy AP75 when the threshold is 0.75.

Table 2 comparison of each index of object detection task on MS COCO data set in this embodiment

Table 3 comparison of each index of example segmentation task on MS COCO dataset in this example

Method	AP	AP50	AP75
				ResNet-50	34.1	55.5	36.2
SENet	35.4	57.4	37.8
				GCNet	35.7	58.4	37.6
ECANet	35.6	58.1	37.7
				The method of the invention	36.2	58.6	38.1

The prior art cited above for comparison with the present invention can be found in the following references:

[1]Deng J,Dong W,Socher R,et al.ImageNet:A large-scale hierarchical image database[C]//IEEE Conference on Computer Vision&Pattern Recognition.IEEE,2009.

[2]Lin T Y,Maire M,Belongie S,et al.Microsoft COCO:Common Objects in Context[C]//European Conference on Computer Vision.Springer International Publishing,2014.

[3]He K,Zhang X,Ren S,et al.Deep Residual Learning for Image Recognition[C]//IEEE Conference on Computer Vision&Pattern Recognition.IEEE Computer Society,2016.

[4]Hu J,Shen L,Albanie S,et al.Squeeze-and-Excitation Networks[J].IEEE Transactions on Pattern Analysis and Machine Intelligence,2017,PP(99).

[5]Wang Q,Wu B,Zhu P,et al.ECA-Net:Efficient Channel Attention for Deep Convolutional Neural Networks[C]//2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition(CVPR).IEEE,2020.

[6]Woo S,Park J,Lee JY,So Kweon I.Cbam:Convolutional block attention module.InProceedings of the European conference on computer vision(ECCV)2018.

[7]Gao Z,Xie J,Wang Q,Li P.Global second-order pooling convolutional networks[C]//2019 IEEE Conference on Computer Vision and Pattern Recognition.IEEE,2019.

[8]Cao Y,Xu J,Lin S,Wei F,Hu H.Gcnet:Non-local networks meet squeeze-excitation networks and beyond[C]//2019 IEEE International Conference on Computer Vision Workshops.IEEE,2019.

[9]Bello I,Zoph B,Le Q,et al.Attention Augmented Convolutional Networks[C]//2019 IEEE/CVF International Conference on Computer Vision(ICCV).IEEE,2020.

[10]Ren S,He K,Girshick R,et al.Faster R-CNN:Towards Real-Time Object Detection with Region Proposal Networks[J].IEEE Transactions on Pattern Analysis&Machine Intelligence,2017,39(6):1137-1149.

[11]He K,Gkioxari G,Dollár P,Girshick R.Mask r-cnn[C]\\2017 IEEE international conference on computer vision.IEEE,2017.

the above-described embodiments are merely preferred embodiments of the present invention, and are not intended to limit the present invention. Various changes and modifications may be made by one of ordinary skill in the pertinent art without departing from the spirit and scope of the present invention. Therefore, the technical scheme obtained by adopting the mode of equivalent replacement or equivalent transformation is within the protection scope of the invention.

Claims

1. An image identification method based on a neural network frequency domain attention mechanism is characterized by comprising the following steps:

s1, acquiring an image recognition data set for training a neural network;

s2, establishing an attention basic network by taking ResNet as a backbone;

s7, inputting the image to be recognized into the image prediction model for image recognition;

in step S1, the data set for image recognition includes a group of images

Wherein I_i″For the ith "image, K is the number of images in the image group;

the algorithm targets are defined as: obtaining a classification result of each picture;

in step S2, the process of establishing the attention-based network is as follows:

s21, constructing ResNet as a basic backbone network;

att＝sigmoid(fc(f_i))

transformed output characteristics of one layer of ResNet network

Comprises the following steps:

wherein

Att, the ith channel of the transformed feature_iIs the ith value, X, of the attention vector_{i，：，：}The ith channel which is input data X; an attention mechanism is added to each layer in the ResNet network, the output characteristics of the current layer are transformed, and then the transformed characteristics are used

Inputting the feature subjected to attention processing into the next layer of ResNet to obtain an attention base network;

in step S3, the process of selecting a model by a single frequency domain transform basis function is as follows:

s32, for C two-dimensional feature maps x^2dResulting C spectra f^2dEach time f is selected^2dFor X ∈ R, then^C×H×WEach time obtaining an f_i∈R^C(ii) a F is to be measured_iBringing into the attention-based network established at S2, training and testing the performance of the spectral components as a single input, according toFinally obtaining the performance sequence of all the frequency spectrum components according to the test results of different frequency components;

in step S4, the process of establishing the combined frequency domain transform basis function selection model is as follows:

s42, inputting X epsilon for any combination^WC×H×WDividing the channel dimension, namely C dimension, according to the number of the frequency components; assuming that the number of frequency domains in a combination is nf, then nf should be able to divide C by X⁰，X¹，…，X^nf-1]For the divided part, the input is divided as follows:

wherein

Represents the first of X

To

A channel; after division, each part is sequentially subjected to frequency spectrum decomposition by using corresponding frequency bands in the frequency component combination according to the method of S32 to obtain [ f⁰，f¹，…，f^nf-1]Each of which

And then splicing the frequency spectrum of each part:

f_i＝cat([f⁰，f¹，…，f^nf-1])

where cat (. cndot.) is the splicing function, yielding f_i∈R^C；

S43, f obtained by 6 combinations of 1, 2, 4, 8, 16 and 32 frequency spectrum components_i∈R^CSubstituting the data into the attention basic network established in S2, training and testing the model to obtain the performance of each combination;

s44, selecting the combination with the highest performance as the frequency spectrum input f 'of the final model'_i；

In step S5, the process of establishing the frequency domain attention mechanism based on the neural network is as follows:

att′＝sigmoid(fc(f′_i))

Wherein

Att 'as the ith channel of the transformed feature'_iIs the ith value, X, of attention vector att_{i，：，：}And inputting the ith channel of the image or the feature, and establishing a frequency domain attention mechanism of the neural network according to the ith channel to form a final model.

2. The image recognition method based on the neural network frequency domain attention mechanism as claimed in claim 1, wherein the step S6 is as follows: based on the image recognition data set in S1, after single spectrum performance ranking obtained in S2 and S3 is used, 1, 2, 4, 8, 16, 32 frequencies with the highest performance are respectively taken to obtain 6 spectrum combinations, and then the 6 spectrum combinations are substituted into S4 to obtain each combination spectrum performance ranking, and a spectrum combination with the highest performance is obtained; and substituting the spectrum combination with the highest performance into S5 to serve as an input spectrum of a final model, and performing final model training based on the image recognition data set in S1 to obtain an image recognition prediction model.

3. The image recognition method based on the neural network frequency domain attention mechanism as claimed in claim 2, wherein the step S7 is as follows: and after the prediction model in the step S6 is obtained, inputting the image to be recognized into the prediction model for prediction to obtain an image classification prediction result.