CN112750082B - Human face super-resolution method and system based on fusion attention mechanism - Google Patents
Face super-resolution method and system based on a fused attention mechanism
- Publication number: CN112750082B (application CN202110081811.9A)
- Authority: CN (China)
- Legal status: Active (an assumption, not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06T3/4053 — Scaling of whole images or parts thereof based on super-resolution, i.e. the output image resolution being higher than the sensor resolution
- G06F18/213 — Feature extraction, e.g. by transforming the feature space; summarisation; mappings, e.g. subspace methods
- G06F18/214 — Generating training patterns; bootstrap methods, e.g. bagging or boosting
- G06F18/253 — Fusion techniques of extracted features
- G06V40/168 — Feature extraction; face representation
Abstract
The invention discloses a face super-resolution method and system based on a fused attention mechanism, belonging to the field of face image super-resolution. The method comprises the following steps: downsampling a high-resolution face image to a target low-resolution face image, performing a blocking operation to separate mutually overlapping image blocks, and extracting shallow features with a shallow feature extractor; fusing the features of pixel, channel, and spatial triple attention modules to enhance the structural details of the reconstructed face; constructing a fused attention network as a deep feature extractor and feeding the shallow facial features into it to obtain deep features, where the fused attention network comprises a plurality of fused attention groups and each group comprises a plurality of fused attention blocks; and upsampling the deep feature map and reconstructing the upsampled facial feature map into the target high-resolution face image. The invention outperforms other recent face image super-resolution algorithms and produces higher-quality high-resolution face images.
Description
Technical Field
The invention belongs to the field of computer vision face super-resolution, and particularly relates to a face super-resolution method and a face super-resolution system based on a fused attention mechanism.
Background
Face super-resolution (face hallucination), a special branch of super-resolution (SR), is a technique for inferring a high-resolution (HR) image from an input low-resolution (LR) face image, and can significantly enhance the detail of the low-resolution face. In real-world surveillance scenes, the distance between the imaging sensor and the face is often too large, resulting in low-resolution face images. Recovering a high-resolution face image via face super-resolution facilitates identification of the target person, and the technique plays an important role in applications such as face detection, face recognition, and face analysis.
In general, face super-resolution resembles generic image restoration, and methods can be divided into three classes according to the prior information used: interpolation-based, reconstruction-based, and learning-based. Interpolation-based methods enlarge the pixel grid of an image and compute the missing pixel values with mathematical formulas based on the surrounding pixels. Reconstruction-based face super-resolution depends on fusing the sub-pixel registration information of multiple LR input images. However, when the magnification factor is too large, the efficiency and performance of interpolation- and reconstruction-based methods degrade sharply. In recent decades, learning-based methods have been widely used for face super-resolution, because they can fully exploit the prior information in training samples, map LR images to HR images, and achieve satisfactory visual quality.
Recently, methods based on convolutional neural networks (CNNs) have improved significantly over traditional SR methods. Dong et al. proposed a deep convolutional network for image super-resolution (Learning a Deep Convolutional Network for Image Super-Resolution) realized with a three-layer CNN. Since then, as deep learning has developed, SR reconstruction performance has improved continuously, and face SR performance with it. Attention mechanisms have been introduced into face SR to focus on facial structure information. Wang et al. proposed a texture attention module (Face Super-Resolution by Learning Multi-view Texture Compensation) to obtain the correspondence between face images and multi-view face images. Song et al. proposed a two-stage face SR method (Learning to Hallucinate Face Images via Component Generation and Enhancement, LCGE) that performs SR separately on five facial component structures and then restores the reconstructed components into the face image, focusing the CNN's attention on local facial information. Zhang et al. proposed a channel attention mechanism (Image Super-Resolution Using Very Deep Residual Channel Attention Networks, RCAN) that adaptively rescales channel-wise features by modeling the interdependencies between channels.
Although the face SR methods above that use attention mechanisms achieve satisfactory results, most consider only a single attention mechanism, which limits the CNN's multi-feature extraction capability and lacks fusion and interaction of facial structure information. How to fully exploit multiple attention features to improve face SR reconstruction performance is therefore an important question.
Disclosure of Invention
In view of the above defects or improvement needs of the prior art, the invention provides a face super-resolution method and system based on a fused attention mechanism, solving the technical problem that existing face super-resolution reconstruction algorithms cannot exploit multiple attention features simultaneously, which limits the reconstruction performance for face images.
In order to achieve the above object, according to one aspect of the present invention, there is provided a face super-resolution method based on a fused attention mechanism, including:
s1: a downsampling module is constructed to downsample the high-resolution face image to a target low-resolution face image;
s2: constructing a shallow feature extractor, performing blocking operation on the target low-resolution face image, separating out mutually overlapped image blocks, and extracting a shallow feature image by using the shallow feature extractor;
s3: constructing a fused attention block, fusing the characteristics of the pixel, the channel and the space triple attention module, generating the fused attention characteristic of the network, and enhancing the structural details of the reconstructed face;
s4: constructing a fused attention network as a deep feature extractor, and inputting shallow facial features into the fused attention network to obtain a deep facial feature map, wherein the fused attention network comprises a plurality of fused attention groups, and each fused attention group comprises a plurality of fused attention blocks;
s5: an up-sampling module is constructed, and up-sampling is carried out on the obtained deep feature map of the human face;
s6: and constructing a face image reconstruction module, and reconstructing the up-sampled face feature image into a high-resolution face image of the target.
In some alternative embodiments, step S2 comprises:
constructing a shallow feature extractor using a convolution layer and extracting a shallow feature map, where the shallow feature map is expressed as: F_0 = f(I_LR), with F_0 denoting the shallow feature map, f a convolution operation, and I_LR the input low-resolution face image.
In some alternative embodiments, step S3 comprises:
constructing a fused attention block consisting of three parallel attentions: pixel attention, channel attention, and spatial attention;
for an input feature map X ∈ R^{H×W×C}, where H is the height of the feature map, W its width, and C the number of channels, X is fed into the three parallel attentions; the different features extracted by the different parallel attentions are fused, and finally one convolution reduces the dimension so that the input and output have consistent dimensions.
In some alternative embodiments, in pixel attention, a convolution first reduces the dimension to cut computation; the module then consists of three parallel branches, where the top and bottom branches each consist of one convolution and an activation function and produce dual pixel-attention features, while the middle branch consists of two convolutions and another activation function and produces residual features. The output features of the three branches are multiplied element-wise, and one final convolution yields the pixel attention feature PA = f(T_1 ⊗ T_2 ⊗ T_3), where T_1, T_2, and T_3 denote the three branch features, ⊗ denotes element-wise multiplication, and f denotes a convolution operation;
in channel attention, the global spatial information of each channel is first converted into a channel descriptor by global average pooling, producing a 1×1×C feature map; this is compressed by downsampling into a 1×1×(C/r) descriptor, where r is the channel scaling coefficient, and restored by upsampling to a 1×1×C feature map. An activation function then yields a 1×1×C descriptor representing the weight of each channel, and finally each channel weight is multiplied with the two-dimensional matrix of the corresponding channel of the original feature map;
in spatial attention, one convolution layer first reduces the channel size, and then a convolution layer and a max pooling layer enlarge the receptive field; a convolution group composed of several convolution layers follows; finally, an upsampling layer recovers the spatial dimension, a convolution recovers the channel dimension, and an activation function produces the final spatial attention feature.
In some alternative embodiments, step S4 comprises:
the fused attention network comprises a plurality of fused attention groups (FAG) and long skip connections (LSC), where each fused attention group further comprises a plurality of fused attention blocks with short skip connections (SSC); the m-th fused attention group is expressed as: F_m = H_m(F_{m-1}) = H_m(H_{m-1}(…H_1(F_0)…)), where H_m denotes the m-th fused attention group and F_{m-1} and F_m are its input and output;
stacking fused attention blocks within each fused attention group, the n-th fused attention block in the m-th fused attention group is expressed as: F_{m,n} = H_{m,n}(F_{m,n-1}) = H_{m,n}(H_{m,n-1}(…H_{m,1}(F_{m-1})…)), where F_{m,n-1} and F_{m,n} are the input and output of the n-th fused attention block in the m-th fused attention group.
In some alternative embodiments, step S5 comprises:
the upsampled features are expressed as: F_UP = H_UP(F_BF), where F_UP and H_UP denote the upsampled features and the upsampling module, respectively.
In some alternative embodiments, step S6 includes:
the face image reconstruction module is expressed as: I_SR = H_Recon(F_UP), where H_Recon denotes a reconstruction module consisting of one convolution and I_SR denotes the target high-resolution face image.
In some alternative embodiments, the loss function L(θ) of the entire network is expressed as: L(θ) = (1/N) Σ_{i=1}^{N} ‖ I_SR^(i) − I_HR^(i) ‖_1, where N denotes the size of the dataset, and I_SR^(i) and I_HR^(i) denote the i-th super-resolution face image and the i-th high-resolution face image in the dataset.
According to another aspect of the present invention, there is provided a face super-resolution system based on a fused attention mechanism, including:
the downsampling module is used for downsampling the high-resolution face image to a target low-resolution face image;
the shallow feature extractor module is used for performing blocking operation on the target low-resolution face image, and extracting a shallow feature image by using the shallow feature extractor after separating out image blocks which are overlapped with each other;
the deep feature extractor module is used for constructing a fused attention block, fusing the features of the pixel, the channel and the space triple attention module, generating the fused attention feature of the network and enhancing the structural details of the reconstructed face; constructing a fused attention network as a deep feature extractor, and inputting shallow facial features into the fused attention network to obtain a deep facial feature map, wherein the fused attention network comprises a plurality of fused attention groups, and each fused attention group comprises a plurality of fused attention blocks;
the up-sampling module is used for up-sampling the obtained deep feature map of the face;
and the face image reconstruction module is used for reconstructing the up-sampled face feature image into a high-resolution face image of the target.
According to another aspect of the present invention there is provided a computer readable storage medium having stored thereon a computer program which when executed by a processor performs the steps of any of the methods described above.
In general, compared with the prior art, the above technical solutions of the invention achieve the following beneficial effects:
The invention provides a face super-resolution method and system based on a fused attention mechanism that fuses pixel, channel, and spatial attention features, so that the network can interact and fuse different attention features, enhancing its feature expression capability. The proposed fused attention network concentrates the network's multiple attention features on the interaction of facial structure information, thereby improving face image reconstruction performance.
Drawings
Fig. 1 is a schematic flow chart of a face super-resolution method based on a fused attention mechanism provided by an embodiment of the invention;
fig. 2 is a schematic diagram of a face super-resolution network structure based on a fused attention mechanism according to an embodiment of the present invention;
fig. 3 is a schematic structural diagram of a face super-resolution system based on a fused attention mechanism according to an embodiment of the present invention;
fig. 4 is a graph showing a comparison of test results according to an embodiment of the present invention.
Detailed Description
The present invention will be described in further detail below with reference to the drawings and embodiments, so that the objects, technical solutions, and advantages of the invention become more apparent. It should be understood that the specific embodiments described herein are for illustration only and are not intended to limit the scope of the invention. In addition, the technical features of the embodiments described below may be combined with each other as long as they do not conflict.
Fig. 1 is a schematic flow chart of a face super-resolution method based on a fused attention mechanism, which is provided by the embodiment of the invention, and includes the following steps:
s1: a downsampling module is constructed to downsample the high-resolution face image to a target low-resolution face image;
wherein, in step S1, the high-resolution face image may be downsampled to the target low-resolution face image using bicubic interpolation.
In the embodiment of the invention, the FFHQ face dataset is used for the training, validation, and test sets: 850 images serve as the training set, 100 images as the validation set, and 50 images as the test set. The images in the dataset are 256×256 pixels. The dataset may be downsampled with a bicubic degradation model with a downsampling factor of 4, so the downsampled low-resolution images are 64×64 pixels.
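The ×4 bicubic downsampling described above can be sketched as follows. This is a sketch using PyTorch's bicubic interpolation; the patent does not specify an implementation library, and slightly different bicubic kernels (e.g. MATLAB's imresize) would give slightly different LR pixels.

```python
import torch
import torch.nn.functional as F

def make_lr(hr: torch.Tensor, scale: int = 4) -> torch.Tensor:
    """Bicubic-downsample an HR batch of shape (N, C, H, W) by `scale`."""
    return F.interpolate(hr, scale_factor=1.0 / scale,
                         mode="bicubic", align_corners=False)

hr = torch.rand(1, 3, 256, 256)  # one 256x256 HR face image
lr = make_lr(hr)                 # -> (1, 3, 64, 64) LR input
```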
S2: constructing a shallow feature extractor, performing blocking operation on the target low-resolution face image, separating out mutually overlapped image blocks, and extracting a shallow feature image by using the shallow feature extractor;
As shown in fig. 2, in an embodiment of the present invention, a 3×3 convolution layer may be used to construct the shallow feature extractor and extract the shallow feature map, expressed as:

F_0 = f_{3×3}(I_LR)

where F_0 denotes the shallow feature map, f_{3×3} denotes a 3×3 convolution, and I_LR denotes the input low-resolution face image.
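The shallow extractor is a single convolution; a minimal sketch follows, in which the 64-channel width is an assumption (the patent does not fix the channel count at this point):

```python
import torch
import torch.nn as nn

# 64 output channels is an assumed width; the patent leaves it unspecified here.
shallow_extractor = nn.Conv2d(3, 64, kernel_size=3, padding=1)  # f_3x3

lr = torch.rand(1, 3, 64, 64)  # I_LR
f0 = shallow_extractor(lr)     # F_0 = f_3x3(I_LR), same spatial size
```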
S3: constructing a fused attention block, fusing the characteristics of the pixel, the channel and the space triple attention module, generating the fused attention characteristic of the network, and enhancing the structural details of the reconstructed face;
in the embodiment of the invention, the fusion attention block consists of three parallel attention points of pixel attention, channel attention and space attention. For input vector X H×W×C H represents the height of the feature map, W represents the width of the feature map, C represents the number of channels in which the feature map is positioned, the feature map is input into three parallel attentions, different features extracted by different parallel attentions are fused, and finally the dimension is reduced by convolution of one 1*1, so that the input and the output are consistent in dimension, and the dimension can be expressed by a formula:
F fusion =f 1×1 (concat(PA,CA,SA))
wherein f 1×1 Representing 1*1 convolution layers, concat represents a fusion operation, (PA, CA, SA) representing pixel attention, channel attention, and spatial attention features, respectively.
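The fusion step can be sketched as below. This is a PyTorch sketch under an assumed channel count; the identity branches stand in for the pixel, channel, and spatial attention modules so the shape contract of concat + 1×1 reduction can be exercised on its own.

```python
import torch
import torch.nn as nn

class FusionAttentionBlock(nn.Module):
    """Concatenate three parallel attention branches along the channel axis
    and reduce back to C channels with a 1x1 conv:
    F_fusion = f_1x1(concat(PA, CA, SA))."""
    def __init__(self, pa: nn.Module, ca: nn.Module, sa: nn.Module,
                 channels: int = 64):
        super().__init__()
        self.pa, self.ca, self.sa = pa, ca, sa
        self.reduce = nn.Conv2d(3 * channels, channels, kernel_size=1)

    def forward(self, x):
        fused = torch.cat([self.pa(x), self.ca(x), self.sa(x)], dim=1)
        return self.reduce(fused)  # output dims match the input

# Identity stand-ins for the three attention branches (hypothetical).
block = FusionAttentionBlock(nn.Identity(), nn.Identity(), nn.Identity(),
                             channels=64)
y = block(torch.rand(1, 64, 32, 32))
```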
In pixel attention, a 1×1 convolution first reduces the dimension to cut computation. The module then consists of three parallel branches: the top and bottom branches each consist of one 3×3 convolution and one Sigmoid activation function and produce dual pixel-attention features, while the middle branch consists of two 3×3 convolutions and a ReLU activation function and produces residual features. The output features of the three branches are multiplied element-wise, and one 3×3 convolution yields the final pixel attention feature, which can be expressed as:

PA = f_{3×3}(T_1 ⊗ T_2 ⊗ T_3)

where T_1, T_2, and T_3 denote the three branch features, ⊗ denotes element-wise multiplication, and f_{3×3} denotes a 3×3 convolution.
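A sketch of this pixel-attention branch follows; the channel counts and the reduction ratio are illustrative assumptions, not values fixed by the patent.

```python
import torch
import torch.nn as nn

class PixelAttention(nn.Module):
    """Sketch: 1x1 reduction, three parallel branches multiplied
    element-wise, then a final 3x3 conv (PA = f_3x3(T1 * T2 * T3))."""
    def __init__(self, channels: int = 64, reduction: int = 4):
        super().__init__()
        mid = channels // reduction
        self.reduce = nn.Conv2d(channels, mid, 1)
        # top / bottom: one 3x3 conv + Sigmoid (dual pixel-attention features)
        self.top = nn.Sequential(nn.Conv2d(mid, mid, 3, padding=1), nn.Sigmoid())
        self.bottom = nn.Sequential(nn.Conv2d(mid, mid, 3, padding=1), nn.Sigmoid())
        # middle: two 3x3 convs with ReLU (residual features)
        self.middle = nn.Sequential(
            nn.Conv2d(mid, mid, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(mid, mid, 3, padding=1))
        self.out = nn.Conv2d(mid, channels, 3, padding=1)

    def forward(self, x):
        t = self.reduce(x)
        return self.out(self.top(t) * self.middle(t) * self.bottom(t))

pa = PixelAttention(channels=64)
y_pa = pa(torch.rand(1, 64, 32, 32))
```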
In channel attention, the global spatial information of each channel is first converted into a channel descriptor by global average pooling, yielding a 1×1×C feature map; this is compressed by downsampling into a 1×1×(C/r) descriptor, where r is the channel scaling coefficient, and then restored by upsampling to a 1×1×C feature map. A Sigmoid activation function finally produces a 1×1×C descriptor representing the weight of each channel, and each channel weight is multiplied with the two-dimensional matrix of the corresponding channel of the original feature map.
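This squeeze-and-excitation style channel attention might be sketched as follows (the 1×1 convolutions implement the C → C/r → C squeeze and restore; r = 16 is an assumed value):

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Global average pool to 1x1xC, squeeze to C/r, restore to C,
    Sigmoid weights, then rescale each channel of the input."""
    def __init__(self, channels: int = 64, r: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(
            nn.Conv2d(channels, channels // r, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // r, channels, 1), nn.Sigmoid())

    def forward(self, x):
        w = self.fc(self.pool(x))  # (N, C, 1, 1) per-channel weights
        return x * w               # broadcast multiply over H x W

ca = ChannelAttention(channels=64, r=16)
y_ca = ca(torch.rand(1, 64, 32, 32))
```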
In spatial attention, a 1×1 convolution layer first reduces the channel size; a convolution layer with stride 2 and a max pooling layer then enlarge the receptive field. A convolution group follows, composed of 3 convolution layers with stride 3 and 7×7 kernels. Finally, an upsampling layer recovers the spatial dimension, a 1×1 convolution recovers the channel dimension, and a Sigmoid activation function produces the final spatial attention feature.
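A sketch of the spatial-attention branch follows. The strides and kernel sizes are softened from the values above so the sketch runs on small feature maps (three stride-3, 7×7 convolutions would shrink a 64×64 map below 1×1); squeeze width and channel count are likewise assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialAttention(nn.Module):
    """Sketch: shrink channels (1x1 conv), enlarge receptive field
    (strided conv + max pool), apply a small conv group, then restore
    spatial size (upsample) and channels (1x1 conv) before a Sigmoid."""
    def __init__(self, channels: int = 64, mid: int = 16):
        super().__init__()
        self.squeeze = nn.Conv2d(channels, mid, 1)
        self.down = nn.Sequential(
            nn.Conv2d(mid, mid, 3, stride=2, padding=1),
            nn.MaxPool2d(kernel_size=2, stride=2))
        self.group = nn.Sequential(
            nn.Conv2d(mid, mid, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(mid, mid, 3, padding=1))
        self.restore = nn.Conv2d(mid, channels, 1)

    def forward(self, x):
        h, w = x.shape[-2:]
        a = self.group(self.down(self.squeeze(x)))
        a = F.interpolate(a, size=(h, w), mode="nearest")  # upsampling layer
        return torch.sigmoid(self.restore(a))              # spatial weights

sa = SpatialAttention(channels=64)
y_sa = sa(torch.rand(1, 64, 32, 32))
```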
In the embodiment of the present invention, the size of the convolution layer and the number of convolution layers may also be other values, which are not limited in uniqueness.
S4: constructing a fused attention network as a deep feature extractor, and inputting shallow facial features into the fused attention network to obtain a deep facial feature map, wherein the fused attention network comprises a plurality of fused attention groups, and each fused attention group comprises a plurality of fused attention blocks;
In the embodiment of the invention, the deep facial feature map is expressed as:

F_BF = H_FAN(F_0)

where F_BF denotes the deep facial feature map and H_FAN denotes the fused attention network.
The fused attention network comprises 10 fused attention groups (Fusion Attention Group, FAG) and a long skip connection (LSC). Each fused attention group in turn contains 10 fused attention blocks with short skip connections (SSC). The m-th fused attention group can be formulated as:

F_m = H_m(F_{m-1}) = H_m(H_{m-1}(…H_1(F_0)…))

where H_m denotes the m-th fused attention group and F_{m-1} and F_m are its input and output. The long skip connection is introduced to stabilize training of the network while allowing residual information to be learned. Fused attention blocks are stacked within each fused attention group, and the n-th fused attention block in the m-th fused attention group can be expressed as:

F_{m,n} = H_{m,n}(F_{m,n-1}) = H_{m,n}(H_{m,n-1}(…H_{m,1}(F_{m-1})…))

where F_{m,n-1} and F_{m,n} are the input and output of the n-th fused attention block in the m-th fused attention group, and H_{m,n} is that block.
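The group/block nesting with short and long skip connections might be sketched as below. The exact placement of the residual additions is an assumption, and a plain convolution stands in for the fused attention block to keep the sketch self-contained.

```python
import torch
import torch.nn as nn

class FusedAttentionGroup(nn.Module):
    """n_blocks stacked blocks plus a short skip connection (SSC)."""
    def __init__(self, block_factory, n_blocks: int = 10):
        super().__init__()
        self.blocks = nn.Sequential(*[block_factory() for _ in range(n_blocks)])

    def forward(self, x):
        return x + self.blocks(x)  # SSC: residual over the block stack

class FusedAttentionNetwork(nn.Module):
    """m_groups fused attention groups plus a long skip connection (LSC)."""
    def __init__(self, block_factory, m_groups: int = 10, n_blocks: int = 10):
        super().__init__()
        self.groups = nn.Sequential(
            *[FusedAttentionGroup(block_factory, n_blocks)
              for _ in range(m_groups)])

    def forward(self, f0):
        return f0 + self.groups(f0)  # LSC: F_BF = F_0 + H_FAN(F_0)

# Trivial stand-in block; the patent uses 10 groups of 10 blocks.
fan = FusedAttentionNetwork(lambda: nn.Conv2d(8, 8, 3, padding=1),
                            m_groups=2, n_blocks=2)
f_bf = fan(torch.rand(1, 8, 16, 16))
```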
S5: an up-sampling module is constructed, and up-sampling is carried out on the obtained deep feature map of the human face;
in the embodiment of the present invention, the up-sampled features are expressed as follows:
F UP =H UP (F BF )
wherein F is UP And H UP Representing the up-sampled features and up-sampling modules, respectively. The upsampling module may be implemented using sub-pixel convolution.
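Sub-pixel (PixelShuffle) upsampling by ×4 can be sketched as two ×2 stages, each expanding the channels by 4 and shuffling them into space; the channel count is an assumption:

```python
import torch
import torch.nn as nn

def make_upsampler(channels: int = 64, scale: int = 4) -> nn.Sequential:
    """x4 sub-pixel upsampler built from two x2 PixelShuffle stages."""
    layers = []
    for _ in range(scale // 2):
        layers += [nn.Conv2d(channels, channels * 4, 3, padding=1),
                   nn.PixelShuffle(2)]
    return nn.Sequential(*layers)

up = make_upsampler(64, scale=4)
f_up = up(torch.rand(1, 64, 64, 64))  # 64x64 features -> 256x256
```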
S6: and constructing a face image reconstruction module, and reconstructing the up-sampled face feature image into a high-resolution face image of the target.
In the embodiment of the invention, the face image reconstruction module is expressed as:

I_SR = H_Recon(F_UP)

where H_Recon denotes a reconstruction module formed by one 3×3 convolution and I_SR denotes the target high-resolution face image.
The loss function L(θ) of the entire network is expressed as:

L(θ) = (1/N) Σ_{i=1}^{N} ‖ I_SR^(i) − I_HR^(i) ‖_1

where N denotes the size of the dataset, and I_SR^(i) and I_HR^(i) denote the i-th super-resolution face image and the i-th high-resolution face image in the dataset.
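The rendered formula in the source is garbled, so the L1 norm used below is an assumption (it is the common choice for RCAN-style SR networks); a minimal sketch:

```python
import torch

def l1_loss(sr: torch.Tensor, hr: torch.Tensor) -> torch.Tensor:
    """Mean absolute error averaged over all N pairs and pixels."""
    return (sr - hr).abs().mean()

# Every pixel differs by exactly 1, so the mean absolute error is 1.
loss = l1_loss(torch.zeros(2, 3, 8, 8), torch.ones(2, 3, 8, 8))
```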
The invention also provides a face super-resolution system based on the fused attention mechanism, used to implement the above face super-resolution method, as shown in fig. 3, comprising:
a downsampling module 101, configured to downsample a high-resolution face image to a target low-resolution face image;
the shallow feature extractor module 102 is configured to perform a blocking operation on the target low-resolution face image, and extract a shallow feature map by using the shallow feature extractor after separating image blocks that overlap each other;
the deep feature extractor module 103 is used for constructing a fused attention block, fusing the features of the pixel, channel and space triple attention module, generating the fused attention feature of the network, and enhancing the structural details of the reconstructed face; constructing a fused attention network as a deep feature extractor, and inputting shallow facial features into the fused attention network to obtain a deep facial feature map, wherein the fused attention network comprises a plurality of fused attention groups, and each fused attention group comprises a plurality of fused attention blocks;
an up-sampling module 104, configured to up-sample the obtained deep feature map of the face;
the face image reconstruction module 105 is configured to reconstruct the up-sampled face feature map into a high-resolution face image of the target.
The invention also provides a computer storage medium storing a computer program executable by a processor, the computer program implementing the above face super-resolution method based on the fused attention mechanism when executed.
The invention finally provides a test embodiment that verifies the algorithm on the FFHQ face dataset: 850 images were used as the training set, 100 images as the validation set, and 50 images as the test set. The HR image size is 256×256 pixels and the downsampling factor is 4, so the LR images (produced with the bicubic degradation model) are 64×64 pixels. Note that all training, validation, and testing are based on the luminance channel in YCbCr color space and use a 4× magnification factor. The SR reconstruction results are evaluated with four indexes: peak signal-to-noise ratio (PSNR), structural similarity (SSIM), feature similarity (FSIM), and visual information fidelity (VIF). The model is trained with the Adam optimizer, β_1 = 0.9, β_2 = 0.999, and ε = 10^{-8}. The initial learning rate is set to 10^{-4} and then halved every 50 epochs. Table 1 shows the comparison results at reconstruction factor 4 under the four evaluation indexes, and fig. 4 compares 4× face image reconstruction results.
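The training configuration described above (Adam with β₁ = 0.9, β₂ = 0.999, ε = 10⁻⁸, initial learning rate 10⁻⁴ halved every 50 epochs) maps onto PyTorch as below; the stand-in model is hypothetical.

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 3, 3, padding=1)  # stand-in for the full network
optimizer = torch.optim.Adam(model.parameters(),
                             lr=1e-4, betas=(0.9, 0.999), eps=1e-8)
# Halve the learning rate every 50 epochs ("halved every 50 cycles").
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=50, gamma=0.5)
```

In a training loop, `scheduler.step()` is called once per epoch after `optimizer.step()`.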
The face SR methods used for comparison are: Bicubic, LCGE, EDGAN, SRFBN, MTC, and RCAN. Bicubic is a classical image interpolation algorithm; LCGE is a classical two-step face SR method; EDGAN is a state-of-the-art deep-learning face SR algorithm using generative adversarial networks (Generative Adversarial Networks, GAN); SRFBN is a state-of-the-art deep-learning face SR algorithm using a feedback network; MTC is a recent face SR method based on multi-view texture compensation; RCAN is a classical SR method based on a deep residual channel attention network. In fig. 4, (a) shows the Bicubic result, (b) the experimental result of the invention, and (c) the original high-resolution image. As can be seen, the invention achieves a very good visual effect.
Table 1 comparison results table of the present invention with six excellent algorithms
Method | Bicubic | LCGE | EDGAN | RCAN | SRFBN | MTC | The invention
---|---|---|---|---|---|---|---
PSNR/dB | 29.81 | 31.12 | 30.87 | 32.67 | 32.42 | 32.01 | 32.85
SSIM | 0.8451 | 0.8668 | 0.8574 | 0.8977 | 0.8944 | 0.8885 | 0.9011
FSIM | 0.8889 | 0.9099 | 0.9231 | 0.9337 | 0.9305 | 0.9281 | 0.9359
VIF | 0.5246 | 0.5563 | 0.5386 | 0.6161 | 0.6077 | 0.5933 | 0.6219
From the experimental results in the table above, it can be seen that the present invention achieves significant advantages over the other six methods on all four indexes.
It should be noted that each step/component described in the present application may be split into more steps/components, or two or more steps/components or part of the operations of the steps/components may be combined into new steps/components, as needed for implementation, to achieve the object of the present invention.
It will be readily appreciated by those skilled in the art that the foregoing description is merely a preferred embodiment of the invention and is not intended to limit the invention, but any modifications, equivalents, improvements or alternatives falling within the spirit and principles of the invention are intended to be included within the scope of the invention.
Claims (8)
1. The human face super-resolution method based on the fusion attention mechanism is characterized by comprising the following steps of:
s1: a downsampling module is constructed to downsample the high-resolution face image to a target low-resolution face image;
s2: constructing a shallow feature extractor, performing blocking operation on the target low-resolution face image, separating out mutually overlapped image blocks, and extracting a shallow feature image by using the shallow feature extractor;
s3: constructing a fused attention block, fusing the characteristics of the pixel, the channel and the space triple attention module, generating the fused attention characteristic of the network, and enhancing the structural details of the reconstructed face;
s4: constructing a fused attention network as a deep feature extractor, and inputting shallow facial features into the fused attention network to obtain a deep facial feature map, wherein the fused attention network comprises a plurality of fused attention groups, and each fused attention group comprises a plurality of fused attention blocks;
s5: an up-sampling module is constructed, and up-sampling is carried out on the obtained deep feature map of the human face;
s6: a face image reconstruction module is constructed, and the face feature image after up-sampling is reconstructed into a high-resolution face image of the target;
the step S3 comprises the following steps:
constructing a fused attention block consisting of three parallel attentions: pixel attention, channel attention and spatial attention, wherein in pixel attention a convolution is first used for dimension reduction to reduce the computational cost, followed by three parallel branches: the uppermost and lowermost branches each consist of a convolution and an activation function and are used for obtaining dual pixel attention features; the middle branch consists of two convolutions and another activation function and is used for obtaining residual features; finally, the output features of the three branches are multiplied element-wise, and the final pixel attention feature is obtained by one convolution: F_PA = f(T_1 ⊙ T_2 ⊙ T_3), where T_1, T_2 and T_3 respectively represent the features of the three branches and f represents a convolution operation;
for an input feature map X of size H×W×C, where H represents the height of the feature map, W its width, and C its number of channels, X is input into the three parallel attentions; the different features extracted by the parallel attentions are fused, and dimension reduction is finally performed through one convolution so that the input and output dimensions are consistent;
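The three-branch pixel attention described above can be sketched in PyTorch as follows. The kernel sizes, the internal reduction ratio, and the sigmoid/ReLU activation choices are illustrative assumptions, since the claim fixes only the branch structure:

```python
import torch
import torch.nn as nn

class PixelAttention(nn.Module):
    """Sketch of the three-branch pixel attention: F_PA = f(T1 * T2 * T3).
    Widths, kernels and activations are assumptions, not from the patent."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        mid = channels // reduction
        self.reduce = nn.Conv2d(channels, mid, 1)  # initial dimension reduction
        # uppermost / lowermost branches: one convolution + one activation each
        self.top = nn.Sequential(nn.Conv2d(mid, mid, 3, padding=1), nn.Sigmoid())
        self.bottom = nn.Sequential(nn.Conv2d(mid, mid, 3, padding=1), nn.Sigmoid())
        # middle branch: two convolutions and another activation (residual features)
        self.middle = nn.Sequential(nn.Conv2d(mid, mid, 3, padding=1),
                                    nn.ReLU(inplace=True),
                                    nn.Conv2d(mid, mid, 3, padding=1))
        self.fuse = nn.Conv2d(mid, channels, 1)    # the final convolution f

    def forward(self, x):
        t = self.reduce(x)
        t1, t2, t3 = self.top(t), self.middle(t), self.bottom(t)
        return self.fuse(t1 * t2 * t3)             # element-wise product, then f
```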
the step S4 includes:
the fused attention network comprises a plurality of fused attention groups (FAG) and a long skip connection (LSC), wherein each fused attention group further comprises a plurality of fused attention blocks with short skip connections (SSC); the mth fused attention group is expressed as: F_m = H_m(F_{m-1}) = H_m(H_{m-1}(···H_1(F_0)···)), where H_m denotes the mth fused attention group and F_{m-1} and F_m are its input and output, respectively;
stacking the fused attention blocks within each fused attention group, the nth fused attention block in the mth fused attention group being expressed as: F_{m,n} = H_{m,n}(F_{m,n-1}) = H_{m,n}(H_{m,n-1}(···H_{m,1}(F_{m-1})···)), where F_{m,n-1} and F_{m,n} are the input and output of the nth fused attention block in the mth fused attention group.
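The group/block stacking with short and long skip connections can be sketched as follows. The trailing convolution before each skip addition and the `block_factory` placeholder (standing in for a complete fused attention block) are assumptions:

```python
import torch
import torch.nn as nn

class FusedAttentionGroup(nn.Module):
    """F_m = H_m(F_{m-1}): n_blocks stacked fused attention blocks
    wrapped by a short skip connection (SSC)."""
    def __init__(self, channels, n_blocks, block_factory):
        super().__init__()
        self.blocks = nn.Sequential(*[block_factory(channels) for _ in range(n_blocks)])
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x):
        return x + self.conv(self.blocks(x))   # SSC

class FusedAttentionNetwork(nn.Module):
    """Deep feature extractor: n_groups fused attention groups
    wrapped by a long skip connection (LSC) from the shallow features F_0."""
    def __init__(self, channels, n_groups, n_blocks, block_factory):
        super().__init__()
        self.groups = nn.Sequential(*[FusedAttentionGroup(channels, n_blocks, block_factory)
                                      for _ in range(n_groups)])
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, f0):
        return f0 + self.conv(self.groups(f0))  # LSC
```

A plain convolution can stand in for `block_factory` when testing shapes, e.g. `lambda c: nn.Conv2d(c, c, 3, padding=1)`.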
2. The method according to claim 1, wherein step S2 comprises:
constructing a shallow feature extractor using a convolution layer, and extracting a shallow feature map, expressed as: F_0 = f(I_LR), where F_0 represents the shallow feature map, f represents a convolution operation, and I_LR represents the input low-resolution face image.
3. The method according to claim 2, wherein in channel attention, the global spatial information of each channel is first converted into a channel descriptor by global average pooling, yielding a 1×1×C feature map; this is compressed by downsampling into a 1×1×(C/r) feature map and then restored to 1×1×C by upsampling; an activation function then yields a 1×1×C descriptor representing the weight of each channel; finally, the weight of each channel is multiplied by the two-dimensional matrix of the corresponding channel of the original feature map, r being the channel scaling coefficient;
In spatial attention, the channel size is first reduced by one convolution layer, after which the receptive field is enlarged by a convolution layer and a max pooling layer; a convolution group consisting of a plurality of convolution layers follows; finally, an upsampling layer restores the spatial dimension, a convolution restores the channel dimension, and an activation function yields the final spatial attention feature.
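A minimal sketch of the channel and spatial attention branches described in this claim; the hidden widths, kernel sizes and nearest-neighbour upsampling are assumptions not fixed by the claim:

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Global average pool -> 1x1xC, squeeze to C/r, expand back to C,
    sigmoid weights, then rescale each channel of the input."""
    def __init__(self, channels, r=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels // r, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // r, channels, 1), nn.Sigmoid())

    def forward(self, x):
        return x * self.body(self.pool(x))

class SpatialAttention(nn.Module):
    """Channel squeeze, receptive-field growth via conv + max pooling,
    a small conv group, then spatial/channel restoration and a sigmoid mask."""
    def __init__(self, channels, mid=16):
        super().__init__()
        self.down = nn.Sequential(
            nn.Conv2d(channels, mid, 1),                         # reduce channel size
            nn.Conv2d(mid, mid, 3, padding=1), nn.MaxPool2d(2),  # enlarge receptive field
            nn.Conv2d(mid, mid, 3, padding=1),                   # convolution group
            nn.Conv2d(mid, mid, 3, padding=1))
    # upsample restores the spatial dimension; 1x1 conv restores the channel dimension
        self.up = nn.Sequential(
            nn.Upsample(scale_factor=2, mode="nearest"),
            nn.Conv2d(mid, channels, 1), nn.Sigmoid())

    def forward(self, x):
        return x * self.up(self.down(x))
```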
4. A method according to claim 3, wherein step S5 comprises:
the upsampled features are expressed as: F_UP = H_UP(F_BF), where F_UP and H_UP represent the upsampled features and the upsampling module, respectively.
5. The method according to claim 4, wherein step S6 comprises:
the face image reconstruction module is expressed as: I_SR = H_Recon(F_UP), where H_Recon is a reconstruction module consisting of one convolution and I_SR is the target high-resolution face image.
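The upsampling and reconstruction modules H_UP and H_Recon can be sketched as follows. The patent does not name the upsampling operator, so sub-pixel convolution (PixelShuffle) is assumed here, with the scale restricted to powers of two:

```python
import torch
import torch.nn as nn

class Upsampler(nn.Module):
    """H_UP: sub-pixel upsampling via stacked x2 PixelShuffle stages
    (an assumption; scale must be a power of two)."""
    def __init__(self, channels, scale=4):
        super().__init__()
        layers = []
        for _ in range(int(scale).bit_length() - 1):   # scale=4 -> two x2 stages
            layers += [nn.Conv2d(channels, 4 * channels, 3, padding=1),
                       nn.PixelShuffle(2)]
        self.body = nn.Sequential(*layers)

    def forward(self, f_bf):
        return self.body(f_bf)                         # F_UP = H_UP(F_BF)

class Reconstructor(nn.Module):
    """H_Recon: a single convolution mapping features to the SR image
    (one output channel for the luminance-only setting of the embodiment)."""
    def __init__(self, channels, out_channels=1):
        super().__init__()
        self.conv = nn.Conv2d(channels, out_channels, 3, padding=1)

    def forward(self, f_up):
        return self.conv(f_up)                         # I_SR = H_Recon(F_UP)
```

With a 64×64 LR input and scale 4, the reconstructed output is 256×256, matching the embodiment.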
7. A fused attention mechanism-based face super-resolution system, comprising:
the downsampling module is used for downsampling the high-resolution face image to a target low-resolution face image;
the shallow feature extractor module is used for performing blocking operation on the target low-resolution face image, and extracting a shallow feature image by using the shallow feature extractor after separating out image blocks which are overlapped with each other;
the deep feature extractor module is used for constructing a fused attention block, fusing the features of the pixel, the channel and the space triple attention module, generating the fused attention feature of the network and enhancing the structural details of the reconstructed face; constructing a fused attention network as a deep feature extractor, and inputting shallow facial features into the fused attention network to obtain a deep facial feature map, wherein the fused attention network comprises a plurality of fused attention groups, and each fused attention group comprises a plurality of fused attention blocks;
the up-sampling module is used for up-sampling the obtained deep feature map of the face; the face image reconstruction module is used for reconstructing the face feature image after up-sampling into a high-resolution face image of the target;
the deep feature extractor module is specifically configured to perform the following operations:
constructing a fused attention block consisting of three parallel attentions: pixel attention, channel attention and spatial attention, wherein in pixel attention a convolution is first used for dimension reduction to reduce the computational cost, followed by three parallel branches: the uppermost and lowermost branches each consist of a convolution and an activation function and are used for obtaining dual pixel attention features; the middle branch consists of two convolutions and another activation function and is used for obtaining residual features; finally, the output features of the three branches are multiplied element-wise, and the final pixel attention feature is obtained by one convolution: F_PA = f(T_1 ⊙ T_2 ⊙ T_3), where T_1, T_2 and T_3 respectively represent the features of the three branches and f represents a convolution operation; for an input feature map X of size H×W×C, where H represents the height of the feature map, W its width, and C its number of channels, X is input into the three parallel attentions; the different features extracted by the parallel attentions are fused, and dimension reduction is finally performed through one convolution so that the input and output dimensions are consistent; the fused attention network comprises a plurality of fused attention groups (FAG) and a long skip connection (LSC), wherein each fused attention group further comprises a plurality of fused attention blocks with short skip connections (SSC); the mth fused attention group is expressed as: F_m = H_m(F_{m-1}) = H_m(H_{m-1}(···H_1(F_0)···)), where H_m denotes the mth fused attention group and F_{m-1} and F_m are its input and output, respectively; the fused attention blocks are stacked within each fused attention group, and the nth fused attention block in the mth fused attention group is expressed as: F_{m,n} = H_{m,n}(F_{m,n-1}) = H_{m,n}(H_{m,n-1}(···H_{m,1}(F_{m-1})···)), where F_{m,n-1} and F_{m,n} are the input and output of the nth fused attention block in the mth fused attention group.
8. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method according to any one of claims 1 to 6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110081811.9A CN112750082B (en) | 2021-01-21 | 2021-01-21 | Human face super-resolution method and system based on fusion attention mechanism |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112750082A CN112750082A (en) | 2021-05-04 |
CN112750082B true CN112750082B (en) | 2023-05-16 |
Family
ID=75652773
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110081811.9A Active CN112750082B (en) | 2021-01-21 | 2021-01-21 | Human face super-resolution method and system based on fusion attention mechanism |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112750082B (en) |
Families Citing this family (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113269685A (en) * | 2021-05-12 | 2021-08-17 | 南通大学 | Image defogging method integrating multi-attention machine system |
CN113255530B (en) * | 2021-05-31 | 2024-03-29 | 合肥工业大学 | Attention-based multichannel data fusion network architecture and data processing method |
CN113256496B (en) * | 2021-06-11 | 2021-09-21 | 四川省人工智能研究院(宜宾) | Lightweight progressive feature fusion image super-resolution system and method |
CN113658040A (en) * | 2021-07-14 | 2021-11-16 | 西安理工大学 | Face super-resolution method based on prior information and attention fusion mechanism |
CN113379667B (en) * | 2021-07-16 | 2023-03-24 | 浙江大华技术股份有限公司 | Face image generation method, device, equipment and medium |
CN113642415B (en) * | 2021-07-19 | 2024-06-04 | 南京南瑞信息通信科技有限公司 | Face feature expression method and face recognition method |
CN113361493B (en) * | 2021-07-21 | 2022-05-20 | 天津大学 | Facial expression recognition method robust to different image resolutions |
CN113658047A (en) * | 2021-08-18 | 2021-11-16 | 北京石油化工学院 | Crystal image super-resolution reconstruction method |
CN113806561A (en) * | 2021-10-11 | 2021-12-17 | 中国人民解放军国防科技大学 | Knowledge graph fact complementing method based on entity attributes |
CN114418853B (en) * | 2022-01-21 | 2022-09-20 | 杭州碧游信息技术有限公司 | Image super-resolution optimization method, medium and equipment based on similar image retrieval |
CN114529450B (en) * | 2022-01-25 | 2023-04-25 | 华南理工大学 | Face image super-resolution method based on improved depth iteration cooperative network |
CN115358932B (en) * | 2022-10-24 | 2023-03-24 | 山东大学 | Multi-scale feature fusion face super-resolution reconstruction method and system |
CN116311479B (en) * | 2023-05-16 | 2023-07-21 | 四川轻化工大学 | Face recognition method, system and storage medium for unlocking automobile |
CN117061790B (en) * | 2023-10-12 | 2024-01-30 | 深圳云天畅想信息科技有限公司 | Streaming media video frame rendering method and device and storage medium |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111275643A (en) * | 2020-01-20 | 2020-06-12 | 西南科技大学 | True noise blind denoising network model and method based on channel and space attention |
CN111833246A (en) * | 2020-06-02 | 2020-10-27 | 天津大学 | Single-frame image super-resolution method based on attention cascade network |
CN111915592A (en) * | 2020-08-04 | 2020-11-10 | 西安电子科技大学 | Remote sensing image cloud detection method based on deep learning |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9965705B2 (en) * | 2015-11-03 | 2018-05-08 | Baidu Usa Llc | Systems and methods for attention-based configurable convolutional neural networks (ABC-CNN) for visual question answering |
KR20190113119A (en) * | 2018-03-27 | 2019-10-08 | 삼성전자주식회사 | Method of calculating attention for convolutional neural network |
CN111192200A (en) * | 2020-01-02 | 2020-05-22 | 南京邮电大学 | Image super-resolution reconstruction method based on fusion attention mechanism residual error network |
CN112070670B (en) * | 2020-09-03 | 2022-05-10 | 武汉工程大学 | Face super-resolution method and system of global-local separation attention mechanism |
Non-Patent Citations (1)
Title |
---|
Yanting Hu, Jie Li, Yuanfei Huang, and Xinbo Gao. Channel-Wise and Spatial Feature Modulation Network for Single Image Super-Resolution. https://arxiv.org/abs/1809.11130, 2018 (full text).
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||