CN114881858A - Lightweight binocular image super-resolution method based on multi-attention mechanism fusion - Google Patents

Lightweight binocular image super-resolution method based on multi-attention mechanism fusion

Info

Publication number
CN114881858A
Authority
CN
China
Prior art keywords
resolution
fusion
attention
super
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210538803.7A
Other languages
Chinese (zh)
Inventor
裴文江
冯程晨
夏亦犁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southeast University
Original Assignee
Southeast University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southeast University filed Critical Southeast University
Priority to CN202210538803.7A priority Critical patent/CN114881858A/en
Publication of CN114881858A publication Critical patent/CN114881858A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 3/00 Geometric image transformations in the plane of the image
    • G06T 3/40 Scaling of whole images or parts thereof, e.g. expanding or contracting
    • G06T 3/4053 Scaling of whole images or parts thereof, e.g. expanding or contracting based on super-resolution, i.e. the output image resolution being higher than the sensor resolution
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G06F 18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/047 Probabilistic or stochastic networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/048 Activation functions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 3/00 Geometric image transformations in the plane of the image
    • G06T 3/40 Scaling of whole images or parts thereof, e.g. expanding or contracting
    • G06T 3/4046 Scaling of whole images or parts thereof, e.g. expanding or contracting using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20081 Training; Learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20084 Artificial neural networks [ANN]

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Probability & Statistics with Applications (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses a lightweight binocular image super-resolution method based on multi-attention mechanism fusion, which mainly addresses the difficulty of balancing model performance and computational efficiency in the binocular image super-resolution task. First, a modified binarization feature fusion framework is introduced to fuse the multi-level image features extracted under the channel attention and spatial attention mechanisms; second, the global parallax information of the binocular image is extracted through a dual-channel attention mechanism, and a pyramid sampling mechanism is introduced to reduce the computation of the module. Experiments show that the invention achieves a substantial improvement in super-resolution performance with fewer parameters and confirms the portability of lightweight networks to the binocular image super-resolution task.

Description

Lightweight binocular image super-resolution method based on multi-attention mechanism fusion
Technical Field
The invention relates to a lightweight binocular image super-resolution method based on multi-attention mechanism fusion, and belongs to the technical field of image processing.
Background
Binocular vision is inspired by bionics: because the positions of the left and right human eyes differ, the difference between the scenes they see forms a three-dimensional spatial perception of the scene. Binocular stereo vision imitates this visual perception mechanism with a binocular camera, constructing a general binocular stereoscopic visual perception system similar to the human eyes.
Unlike single-image super-resolution, binocular super-resolution is essentially a multi-input, multi-output process: a low-resolution left view and a low-resolution right view are input, and the corresponding high-resolution binocular images must be reconstructed. If the binocular images are regarded as adjacent frames of a video, the task reduces to video super-resolution with two frames; however, the interaction between binocular images is expressed by parallax, which differs from the small motion offsets between video frames, so the research methods diverge considerably. Meanwhile, the second-view image can provide extra information for a single image, but if the task is treated as reference-based single-image super-resolution, the information provided is limited to the low-resolution scene and contributes little to reconstructing high-level features. Therefore, binocular image super-resolution not only exploits the correlated information between images on the basis of single-image super-resolution, but also adds a parallax compensation mechanism to the traditional multi-image super-resolution task.
In recent years, lightweight network designs have been proposed for computer vision tasks such as image classification and semantic segmentation, including the classic MobileNet and Xception structures, and many of them have also been introduced into image super-resolution with good results. With the addition of attention mechanisms, network performance improves further, and reducing the parameter count of efficient attention mechanisms has become one of the important research topics in this direction. For the binocular image super-resolution task, a lightweight model must, on the one hand, keep the feature extraction backbone efficient and, on the other hand, reduce the parameter overhead of the binocular feature matching stage while preserving performance as far as possible.
Disclosure of Invention
The technical problem is as follows: aiming at the problem that model performance and computational efficiency are difficult to balance in the binocular super-resolution task, the invention aims to provide a lightweight binocular image super-resolution method based on multi-attention mechanism fusion, and seeks a lightweight design approach suited to binocular image super-resolution networks by examining existing lightweight models for single-image super-resolution.
The technical scheme is as follows: aiming at the problem that model performance and computational efficiency are difficult to balance in the binocular super-resolution task, the invention provides a lightweight binocular image super-resolution method based on multi-attention mechanism fusion, built on a study of lightweight single-image super-resolution networks. The method comprises the following specific steps:
step 1: building a network model
The low-resolution left-view and right-view images are taken as network input, and super-resolution processing is performed on the left view to obtain a high-resolution left-view image; the network model comprises three sub-modules: a feature extraction module, a parallax attention extraction module and a feature reconstruction module.
First, a low-resolution binocular image pair I_L^LR and I_R^LR is input, and the shallow features of the left-view and right-view images are extracted by a 3×3 convolutional layer:

F_L^0 = H_sfe(I_L^LR),  F_R^0 = H_sfe(I_R^LR)

where H_sfe denotes the weight-shared 3×3 convolutional layer, and F_L^0 and F_R^0 denote the shallow features of the left view and the right view extracted from the low-resolution binocular image pair. These features are then fed into m weight-shared feature fusion groups to further extract deeper features:

F_L^m = H_FFG^m(H_FFG^(m-1)(…H_FFG^1(F_L^0))),  F_R^m = H_FFG^m(H_FFG^(m-1)(…H_FFG^1(F_R^0)))

where H_FFG^m denotes the m-th feature fusion group and, similarly, H_FFG^(m-1) and H_FFG^1 denote the (m-1)-th and the 1st feature fusion groups; F_L^m and F_R^m are the deeper feature tensors output after the shallow features pass through the m feature fusion groups.
Then, after the independent features of the low-resolution image pair have been extracted, binocular features are matched by a parallax attention module based on a multi-scale pyramid sampling mechanism, and the parallax-fused feature tensor of the left view is output:

F_LR = H_DCPAM(F_L^m, F_R^m)

where H_DCPAM denotes the dual-channel parallax attention module and F_LR denotes the feature tensor obtained from the parallax attention module. Subsequently, the preceding features are further extracted and fused by n feature fusion groups:

F_L^n = H_FFG^n(H_FFG^(n-1)(…H_FFG^1(F_LR)))

Similar to the feature extraction stage, H_FFG^n denotes the n-th feature fusion group of the feature reconstruction stage, H_FFG^(n-1) and H_FFG^1 denote the (n-1)-th and the 1st feature fusion groups, and F_L^n denotes the fused feature tensor of the left view obtained after the n feature fusion groups.
Finally, the bicubically upsampled left-view image is added pixel by pixel to the preceding output features to obtain the final left-view super-resolution reconstruction result:

I_L^SR = H_ps([λ1·H_5(F_L^n), λ2·H_3(F_L^n)]) + H_up(I_L^LR)

where H_5 and H_3 denote 5×5 and 3×3 convolutions respectively, H_ps denotes the pixel reconstruction layer, H_up denotes the bicubic upsampling operation, λ1 and λ2 are trainable scalar parameters, [·,·] denotes channel-wise cascading, and I_L^SR is the final super-resolved high-resolution left view.
Step 2: constructing a binocular image data set and setting training parameters for network training: the data set images are divided into a training set, a verification set and a test set, training parameters are set, and the network model is trained on the training set to obtain a trained network model.
Step 3: the binocular images to be processed are input into the trained network model, and binocular image super-resolution reconstruction is performed.
Wherein,
the feature fusion block in the step 1 takes a multi-attention fusion module as a basic module, integrates multi-level features extracted by channel attention, space attention and cavity convolution, and takes a corrected binarization feature fusion structure as a basic framework for building.
The parallax attention module in step 1 is a dual-channel attention module intended to extract local epipolar features and global parallax information.
In step 1, the convolution kernel sizes of the levels of the multi-scale pyramid sampling mechanism are [12, 15, 18, 21].
In step 1, m = 2 and n = 2, and the left-view and right-view branches use the same number of feature fusion groups in the feature extraction stage.
In step 1, a high-resolution right-view image can be reconstructed in the same way by exchanging the low-resolution left-view and right-view images.
In step 2, the super-resolution loss is used as the loss function when training the network model.
In the training of the network model in step 2, the Adam optimizer is used, the learning rate is initialized to 0.00002 and halved every 30 iterations during training, the batch size is set to 4, and the network converges after 120 training iterations.
In step 2, the network environment for training the network model is built on PyTorch 1.8 using an NVIDIA RTX 3090 Ti GPU.
Advantageous effects: by adopting the above technical scheme, the invention has the following advantages and beneficial effects compared with the prior art:
1. The disclosed method integrates the features of all levels extracted by the multi-attention mechanism using a modified binarization feature fusion framework, and improves super-resolution reconstruction performance while reducing model parameters and computation.
2. A dual-channel parallax attention mechanism is introduced, realizing feature extraction both along epipolar lines and over the global parallax, and providing effective compensation of the interactive information between the left and right views.
Drawings
FIG. 1 is a diagram of the overall network architecture;
FIG. 2(a) is a schematic diagram of the feature fusion group structure, FIG. 2(b) is a schematic diagram of the multi-attention fusion module, and FIG. 2(c) is a schematic diagram of the dual-channel attention module;
FIG. 3 compares the results of the present invention with the prior art on a low-resolution binocular image;
FIG. 4 compares the operating efficiency of the present invention with that of the prior art.
Detailed Description
The invention is explained in detail below with reference to the accompanying drawings and embodiments; the binocular image super-resolution method based on multi-attention mechanism fusion provided by the invention specifically comprises the following steps:
step 1: building a network model
As shown in FIG. 1, the network design of the present invention includes three sub-steps: feature extraction, parallax attention extraction, and feature reconstruction. In the feature extraction stage, the different hierarchical features of a single view extracted by the multi-attention mechanism are fused by a parameter-shared modified binarization feature fusion framework; in the parallax attention extraction stage, a dual-channel parallax attention mechanism (DCPAM) and a pyramid sampling mechanism are introduced to fuse local and global information between the two views; and in the feature reconstruction stage, the feature fusion groups of the feature extraction module are continued, and the high-resolution left-view image is reconstructed through an automatic parameter-weighting unit.
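For orientation only, the following is a minimal PyTorch sketch of this three-stage data flow. It is not the patented implementation: FeatureFusionGroupStub and DCPAMStub are simplified stand-ins for the richer modules described in steps 1.1-1.3, and all class and variable names are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Simplified stand-ins so the skeleton runs; the real blocks are far richer.
class FeatureFusionGroupStub(nn.Module):
    def __init__(self, c):
        super().__init__()
        self.body = nn.Sequential(nn.Conv2d(c, c, 3, padding=1), nn.ReLU(inplace=True),
                                  nn.Conv2d(c, c, 3, padding=1))
    def forward(self, x):
        return x + self.body(x)                        # residual stand-in for a feature fusion group

class DCPAMStub(nn.Module):
    def __init__(self, c):
        super().__init__()
        self.fuse = nn.Conv2d(2 * c, c, 1)
    def forward(self, f_l, f_r):
        return self.fuse(torch.cat([f_l, f_r], dim=1))  # stand-in for dual-channel parallax attention

class StereoSRSkeleton(nn.Module):
    """Data flow only: shared 3x3 shallow conv -> m shared feature fusion groups
    -> parallax attention -> n feature fusion groups -> upsampling head + bicubic skip."""
    def __init__(self, c=64, m=2, n=2, scale=4):
        super().__init__()
        self.scale = scale
        self.sfe = nn.Conv2d(3, c, 3, padding=1)        # H_sfe, weights shared by both views
        self.extract = nn.Sequential(*[FeatureFusionGroupStub(c) for _ in range(m)])
        self.dcpam = DCPAMStub(c)                       # H_DCPAM
        self.rec = nn.Sequential(*[FeatureFusionGroupStub(c) for _ in range(n)])
        self.head = nn.Sequential(nn.Conv2d(c, 3 * scale * scale, 3, padding=1),
                                  nn.PixelShuffle(scale))   # H_ps

    def forward(self, i_l_lr, i_r_lr):
        f_l = self.extract(self.sfe(i_l_lr))            # F_L^m
        f_r = self.extract(self.sfe(i_r_lr))            # F_R^m (same weights)
        f_lr = self.dcpam(f_l, f_r)                     # F_LR
        sr = self.head(self.rec(f_lr))                  # reconstructed residual from F_L^n
        up = F.interpolate(i_l_lr, scale_factor=self.scale, mode='bicubic', align_corners=False)
        return sr + up                                  # I_L^SR

# e.g. StereoSRSkeleton()(torch.rand(1, 3, 30, 90), torch.rand(1, 3, 30, 90))
```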
Step 1.1: feature extraction
First, a low-resolution binocular image pair I_L^LR and I_R^LR is input, and the shallow features of the left-view and right-view images are extracted by a 3×3 convolutional layer:

F_L^0 = H_sfe(I_L^LR),  F_R^0 = H_sfe(I_R^LR)

where H_sfe denotes the weight-shared 3×3 convolutional layer, and F_L^0 and F_R^0 denote the shallow features of the left and right views extracted from the low-resolution binocular image pair. These features are then fed into m weight-shared feature fusion groups to further extract deeper features:

F_L^m = H_FFG^m(H_FFG^(m-1)(…H_FFG^1(F_L^0))),  F_R^m = H_FFG^m(H_FFG^(m-1)(…H_FFG^1(F_R^0)))

where H_FFG^m denotes the m-th feature fusion group, H_FFG^(m-1) and H_FFG^1 denote the (m-1)-th and the 1st feature fusion groups, and F_L^m and F_R^m are the deeper feature tensors output after the shallow features pass through the m feature fusion groups.
Specifically, as shown in FIG. 2(a), the feature fusion group is constructed with the multi-attention fusion module as its basic module and the modified binarization fusion framework as its feature fusion framework.
As shown in FIG. 2(b), the main branch of the multi-attention fusion module performs spatial feature extraction: the input features are first processed by stacked 3×3 convolutional layers, and a 1×1 convolution reduces the number of channels to 1/4. To enlarge the receptive field, the spatial dimensions are reduced by a 7×7 convolution with stride 2; after a pooling layer, two dilated convolution operations with dilation rates 1 and 2 are applied. After upsampling back to the original input feature dimensions, the spatial feature tensors of the left-view and right-view images are output through a 1×1 convolution and a Sigmoid normalization operation.
The auxiliary branch of the multi-attention fusion module adopts an efficient channel attention mechanism consisting of a 1×1 pointwise convolution and a depthwise convolution, yielding the channel feature tensors of the left-view and right-view images. The spatial and channel feature tensors are combined by batched matrix multiplication and merged into the output of the multi-attention fusion module, and the outputs of successive modules are cascaded within the modified binarization fusion framework.
The modified binarization fusion framework connects the currently generated output feature tensor with the input feature tensor of the next level and is constructed in a layer-by-layer recursive manner. A channel recombination module is also added: a 1×1 convolution reduces the number of channels to match that of the input feature tensor, and the result is added to the output feature tensor pixel by pixel.
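A hedged PyTorch sketch of one possible reading of the multi-attention fusion module described in the three preceding paragraphs follows. The exact layer ordering and the way the spatial and channel branches are combined (a simple element-wise re-weighting here, rather than batched matrix multiplication and the recursive fusion framework) are assumptions made purely for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiAttentionFusionSketch(nn.Module):
    """Main branch: channel reduction to C/4, stride-2 7x7 conv + pooling,
    dilated 3x3 convs (rates 1 and 2), upsampling, 1x1 conv + sigmoid spatial map.
    Auxiliary branch: depthwise + pointwise conv producing per-channel weights."""
    def __init__(self, c=64):
        super().__init__()
        r = c // 4
        self.reduce = nn.Sequential(nn.Conv2d(c, c, 3, padding=1), nn.ReLU(inplace=True),
                                    nn.Conv2d(c, r, 1))                        # 1x1 conv -> C/4 channels
        self.down = nn.Sequential(nn.Conv2d(r, r, 7, stride=2, padding=3),     # enlarge receptive field
                                  nn.MaxPool2d(2))
        self.dilated = nn.Sequential(nn.Conv2d(r, r, 3, padding=1, dilation=1), nn.ReLU(inplace=True),
                                     nn.Conv2d(r, r, 3, padding=2, dilation=2), nn.ReLU(inplace=True))
        self.spatial_out = nn.Conv2d(r, 1, 1)                                  # 1x1 conv before sigmoid
        self.channel = nn.Sequential(nn.Conv2d(c, c, 3, padding=1, groups=c),  # depthwise conv
                                     nn.Conv2d(c, c, 1),                       # pointwise 1x1 conv
                                     nn.AdaptiveAvgPool2d(1),
                                     nn.Sigmoid())

    def forward(self, x):
        h, w = x.shape[-2:]
        s = self.dilated(self.down(self.reduce(x)))
        s = F.interpolate(s, size=(h, w), mode='bilinear', align_corners=False)
        spatial = torch.sigmoid(self.spatial_out(s))   # H x W spatial attention map
        channel = self.channel(x)                      # C x 1 x 1 channel attention weights
        return x + x * spatial * channel               # re-weighted features with a residual connection
```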
Step 1.2: parallax attention extraction
After the independent features of the low-resolution image pair have been extracted, binocular features are matched by a parallax attention module based on a multi-scale pyramid sampling mechanism, and the parallax-fused feature tensor of the left view is output:

F_LR = H_DCPAM(F_L^m, F_R^m)

where H_DCPAM denotes the dual-channel parallax attention module and F_LR denotes the feature tensor obtained from the parallax attention module.
Specifically, as shown in FIG. 2(c), the DCPAM comprises a Local Parallax Attention Module (LPAM) on the left branch and a Global Parallax Attention Module (GPAM) on the right branch. In the LPAM, the left-view feature tensor F_L^m and the right-view feature tensor F_R^m extracted in step 1.1 are each passed through a 1×1 convolution and a transition residual block to produce the feature tensors K_l and Q_l, which are reshaped by merging their channel dimensions to obtain k_l and q_l. The matrix of contribution degrees (correlation matrix) M_{R→L} of the right view to the left view along epipolar lines is obtained by batched matrix multiplication of q_l with the transposed k_l^T, followed by a softmax operation:

M_{R→L} = softmax(q_l ⊗ k_l^T)

where ⊗ denotes batched matrix multiplication. The LPAM then generates a right-view mask V_l by a single convolution; it is reshaped into v_l and multiplied with M_{R→L} by batched matrix multiplication to generate the parallax compensation tensor F_{R→L} based on the right view:

F_{R→L} = M_{R→L} ⊗ v_l

The final local parallax attention feature tensor of the left view, F_L^LPAM, is formed by cascading three parts: the contribution matrix M_{R→L}, the right-view parallax compensation tensor F_{R→L}, and the deeper left-view feature tensor F_L^m:

F_L^LPAM = [M_{R→L}, F_{R→L}, F_L^m]
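The row-wise (epipolar) attention above can be sketched in PyTorch as below. This is an illustrative approximation under stated assumptions: 1×1 convolutions stand in for the transition residual blocks, and the contribution matrix itself is not cascaded into the output.

```python
import torch
import torch.nn as nn

class LocalParallaxAttentionSketch(nn.Module):
    """For each image row, attention over horizontal positions is computed between
    left-view queries and right-view keys by batched matrix multiplication + softmax,
    then used to warp right-view features into the left view."""
    def __init__(self, c=64):
        super().__init__()
        self.q = nn.Conv2d(c, c, 1)   # Q_l from left features
        self.k = nn.Conv2d(c, c, 1)   # K_l from right features
        self.v = nn.Conv2d(c, c, 1)   # right-view mask / value V_l

    def forward(self, f_left, f_right):
        b, c, h, w = f_left.shape
        # reshape to (B*H, W, C): one attention problem per epipolar line
        q = self.q(f_left).permute(0, 2, 3, 1).reshape(b * h, w, c)
        k = self.k(f_right).permute(0, 2, 3, 1).reshape(b * h, w, c)
        v = self.v(f_right).permute(0, 2, 3, 1).reshape(b * h, w, c)
        m = torch.softmax(torch.bmm(q, k.transpose(1, 2)), dim=-1)   # M_{R->L}: (B*H, W, W)
        comp = torch.bmm(m, v)                                       # F_{R->L}: warped right features
        comp = comp.reshape(b, h, w, c).permute(0, 3, 1, 2)
        return torch.cat([comp, f_left], dim=1)                      # cascade with left features (2C channels)
```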
the generation mode of the global parallax attention feature tensor of the right-channel GPAM is similar to that of the LPAM, and in order to reduce the huge calculation amount of matrix multiplication, a pyramid sampling mechanism is adopted for data compression. First, Q is obtained by referring to the preceding stage feature tensor extraction method in the LPAM g 、K g And V g All sizes are
Figure BDA00036475339200000419
Figure BDA00036475339200000420
Providing feature information of the right view, Q g And K g Is stretched into
Figure BDA00036475339200000421
Form to obtain q g And k g Through k g K after transposition g T And q is g Similar matrix of global feature is generated through softmax operation after multiplication of batched matrixes
Figure BDA00036475339200000422
Figure BDA00036475339200000423
Based on this, introduction of [12,15,18,21 ]]The pyramid sampling mechanism under the arrangement of convolution kernels with different sizes converts K into g Is compressed into
Figure BDA00036475339200000424
Figure BDA00036475339200000425
Is modified into
Figure BDA00036475339200000426
The global feature aggregation is composed of three parts, a mask G and a similarity matrix M g By dot productAnd V g Postpooling remodeling v g Is transferred v g T Matrix multiplication is carried out, and a left view global parallax attention characteristic tensor is output
Figure BDA00036475339200000427
Figure BDA0003647533920000051
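A sketch of the global branch with pyramid-compressed keys and values is given below. Adaptive average pooling to bins of size [12, 15, 18, 21] is used here as an assumed realisation of the pyramid sampling mechanism, and the mask G is omitted; the class and variable names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GlobalParallaxAttentionSketch(nn.Module):
    """Full-image attention from the right view to the left view; keys and values
    are compressed by a pooling pyramid so the similarity matrix is (H*W) x S
    instead of (H*W) x (H*W), with S the total number of pooled positions."""
    def __init__(self, c=64, bins=(12, 15, 18, 21)):
        super().__init__()
        self.q = nn.Conv2d(c, c, 1)   # Q_g from left features
        self.k = nn.Conv2d(c, c, 1)   # K_g from right features
        self.v = nn.Conv2d(c, c, 1)   # V_g from right features
        self.bins = bins

    def _pyramid(self, x):
        b, c = x.shape[:2]
        pooled = [F.adaptive_avg_pool2d(x, s).reshape(b, c, -1) for s in self.bins]
        return torch.cat(pooled, dim=2)                              # (B, C, S)

    def forward(self, f_left, f_right):
        b, c, h, w = f_left.shape
        q = self.q(f_left).reshape(b, c, h * w).transpose(1, 2)      # q_g: (B, HW, C)
        k = self._pyramid(self.k(f_right))                           # compressed k_g: (B, C, S)
        v = self._pyramid(self.v(f_right)).transpose(1, 2)           # compressed v_g: (B, S, C)
        m = torch.softmax(torch.bmm(q, k), dim=-1)                   # similarity matrix M_g: (B, HW, S)
        out = torch.bmm(m, v).transpose(1, 2).reshape(b, c, h, w)    # aggregated global features
        return out + f_left                                          # residual fusion with left features
```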
The dual-channel parallax attention tensor finally extracted by the two branches is obtained by adding the local and global parallax attention feature tensors pixel by pixel:

F_LR = F_L^LPAM + F_L^GPAM
step 1.3: feature reconstruction
The preceding features are further extracted and fused by n feature fusion groups:

F_L^n = H_FFG^n(H_FFG^(n-1)(…H_FFG^1(F_LR)))

Similar to the feature extraction stage, H_FFG^n denotes the n-th feature fusion group of the feature reconstruction stage, H_FFG^(n-1) and H_FFG^1 denote the (n-1)-th and the 1st feature fusion groups, and F_L^n denotes the fused feature tensor of the left-view image obtained after the n feature fusion groups. Finally, the fused feature tensor extracted by the n feature fusion groups is sent to the pixel recombination module, to which a unit for automatic parameter weighting is added: different feature weights are assigned to the 3×3 and 5×5 convolutional layers respectively, the automatically weighted feature tensors are cascaded, and the result is added pixel by pixel to the bicubically upsampled original left view to obtain the final high-resolution left-view image:

I_L^SR = H_ps([λ1·H_5(F_L^n), λ2·H_3(F_L^n)]) + H_up(I_L^LR)

where H_5 and H_3 denote the 5×5 and 3×3 convolutions respectively, H_ps denotes the pixel reconstruction layer, H_up denotes the bicubic upsampling operation, λ1 and λ2 are trainable scalar parameters, [·,·] denotes channel-wise cascading, and I_L^SR is the final super-resolved high-resolution left view.
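A possible realisation of this reconstruction stage is sketched below. Whether the λ-weighted 3×3/5×5 outputs are summed or cascaded, and whether they act before or after pixel shuffling, is not fully pinned down above, so the composition shown (shuffle first, then summed weighted branches) is an assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ReconstructionHeadSketch(nn.Module):
    """Pixel-shuffle upsampling followed by parallel 5x5 / 3x3 convolutions weighted
    by the trainable scalars lambda_1 and lambda_2, plus a bicubic skip connection."""
    def __init__(self, c=64, scale=4):
        super().__init__()
        self.scale = scale
        self.pre = nn.Conv2d(c, 3 * scale * scale, 3, padding=1)
        self.ps = nn.PixelShuffle(scale)               # H_ps
        self.conv5 = nn.Conv2d(3, 3, 5, padding=2)     # H_5
        self.conv3 = nn.Conv2d(3, 3, 3, padding=1)     # H_3
        self.lam1 = nn.Parameter(torch.ones(1))        # lambda_1
        self.lam2 = nn.Parameter(torch.ones(1))        # lambda_2

    def forward(self, f_left, i_left_lr):
        x = self.ps(self.pre(f_left))                                  # upsampled feature map
        out = self.lam1 * self.conv5(x) + self.lam2 * self.conv3(x)    # automatically weighted branches
        up = F.interpolate(i_left_lr, scale_factor=self.scale,
                           mode='bicubic', align_corners=False)        # H_up(I_L^LR)
        return out + up                                                # I_L^SR
```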
Step 2: constructing a binocular image data set, dividing images of the data set into a training set, a verification set and a test set, setting training parameters to train a network model, and performing network training on the training set to obtain the trained network model;
step 2.1: loss function setting
To measure the overall pixel-wise difference between the super-resolved image and the ground-truth image, the super-resolution loss is computed as the mean squared error between the left-view result I_L^SR reconstructed by the network and the ground-truth high-resolution left-view image I_L^HR:

L_SR = || I_L^SR − I_L^HR ||_2^2
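In PyTorch this loss reduces to a mean-squared-error term between the super-resolved and ground-truth left views, e.g.:

```python
import torch.nn.functional as F

def sr_loss(sr_left, hr_left):
    """Mean squared error between I_L^SR and the ground-truth I_L^HR."""
    return F.mse_loss(sr_left, hr_left)
```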
step 2.2: training and setting:
the number m of feature fusion groups at each stage in the network is 2, n is 2, Adam is used by a training optimizer, the learning rate is initialized to 0.00002, the learning rate is optimized in the training process, each 30 iterations is reduced to half of the original learning rate, the batch size is set to 4, and 120 iterations are trained and then gradually converged. A network environment was built based on the pytorch1.8 using the Nvidia RTX3090Ti GPU.
Step 3: the binocular images to be processed are input into the trained model, and binocular image super-resolution reconstruction is performed.
FIG. 3 shows that, compared with the super-resolution reconstruction results of existing binocular image super-resolution techniques (bicubic upsampling, StereoSR and PASSRnet), the reconstruction by the invention clearly shows the shape of the figure, and the background and the figure's body are distinguished distinctly without the influence of light spots and similar artifacts, indicating a good super-resolution result.
FIG. 4 compares the present invention with existing super-resolution reconstruction techniques (SRCNN, VDSR, CARN, StereoSR, PASSRnet) in terms of computational efficiency; the present invention has a shorter inference time for the same number of test samples, achieving an effective unification of performance and computational efficiency.

Claims (9)

1. A lightweight binocular image super-resolution method based on multi-attention mechanism fusion, characterized by comprising the following steps:
step 1: building a network model
the low-resolution left-view and right-view images are taken as network input, and super-resolution processing is performed on the left view to obtain a high-resolution left-view image; the network model comprises three sub-modules: a feature extraction module, a parallax attention extraction module and a feature reconstruction module;
first, a low-resolution binocular image pair I_L^LR and I_R^LR is input, and the shallow features of the left-view and right-view images are extracted by a 3×3 convolutional layer:

F_L^0 = H_sfe(I_L^LR),  F_R^0 = H_sfe(I_R^LR)

wherein H_sfe denotes the weight-shared 3×3 convolutional layer, and F_L^0 and F_R^0 denote the shallow features of the left view and the right view extracted from the low-resolution binocular image pair; these features are then fed into m weight-shared feature fusion groups to further extract deeper features:

F_L^m = H_FFG^m(H_FFG^(m-1)(…H_FFG^1(F_L^0))),  F_R^m = H_FFG^m(H_FFG^(m-1)(…H_FFG^1(F_R^0)))

wherein H_FFG^m denotes the m-th feature fusion group and, similarly, H_FFG^(m-1) and H_FFG^1 denote the (m-1)-th and the 1st feature fusion groups; F_L^m and F_R^m are the deeper feature tensors output after the shallow features pass through the m feature fusion groups;
then, after the independent features of the low-resolution image pair have been extracted, binocular features are matched by a parallax attention module based on a multi-scale pyramid sampling mechanism, and the parallax-fused feature tensor of the left view is output:

F_LR = H_DCPAM(F_L^m, F_R^m)

wherein H_DCPAM denotes the dual-channel parallax attention module and F_LR denotes the feature tensor obtained from the parallax attention module;
subsequently, the preceding features are further extracted and fused by n feature fusion groups:

F_L^n = H_FFG^n(H_FFG^(n-1)(…H_FFG^1(F_LR)))

similar to the feature extraction stage, H_FFG^n denotes the n-th feature fusion group of the feature reconstruction stage, H_FFG^(n-1) and H_FFG^1 denote the (n-1)-th and the 1st feature fusion groups, and F_L^n denotes the fused feature tensor of the left view obtained after the n feature fusion groups;
and finally, the bicubically upsampled left-view image is added pixel by pixel to the preceding output features to obtain the final left-view super-resolution reconstruction result:

I_L^SR = H_ps([λ1·H_5(F_L^n), λ2·H_3(F_L^n)]) + H_up(I_L^LR)

wherein H_5 and H_3 denote 5×5 and 3×3 convolutions respectively, H_ps denotes the pixel reconstruction layer, H_up denotes the bicubic upsampling operation, λ1 and λ2 are trainable scalar parameters, [·,·] denotes channel-wise cascading, and I_L^SR is the final super-resolved high-resolution left view;
step 2: constructing a binocular image data set and setting training parameters for network training: dividing the data set images into a training set, a verification set and a test set, setting training parameters, and training the network model on the training set to obtain a trained network model;
and step 3: inputting the binocular images to be processed into the trained network model and performing binocular image super-resolution reconstruction.
2. The lightweight binocular image super-resolution method based on multi-attention mechanism fusion of claim 1, wherein the feature fusion group of step 1 is constructed with the multi-attention fusion module as its basic module, integrates the multi-level features extracted by channel attention, spatial attention and dilated convolution, and takes the modified binarization feature fusion structure as its basic framework.
3. The lightweight binocular image super-resolution method based on multi-attention mechanism fusion according to claim 1, wherein the parallax attention module in step 1 is a dual-channel attention module intended to extract local epipolar features and global parallax information.
4. The lightweight binocular image super-resolution method based on multi-attention mechanism fusion according to claim 1, wherein the convolution kernel sizes of the levels of the multi-scale pyramid sampling mechanism in step 1 are [12, 15, 18, 21].
5. The lightweight binocular image super-resolution method based on multi-attention mechanism fusion according to claim 1, wherein in step 1, m = 2 and n = 2, and the left-view and right-view branches use the same number of feature fusion groups in the feature extraction stage.
6. The lightweight binocular image super-resolution method based on multi-attention mechanism fusion according to claim 1, wherein a high-resolution right-view image can be reconstructed in the same way by exchanging the low-resolution left-view and right-view images of step 1.
7. The lightweight binocular image super-resolution method based on multi-attention mechanism fusion according to claim 1, wherein the super-resolution loss is used as the loss function in the training of the network model in step 2.
8. The lightweight binocular image super-resolution method based on multi-attention mechanism fusion according to claim 1, wherein in the training of the network model in step 2, the Adam optimizer is used, the learning rate is initialized to 0.00002 and halved every 30 iterations during training, the batch size is set to 4, and the network converges after 120 training iterations.
9. The lightweight binocular image super-resolution method based on multi-attention mechanism fusion according to claim 1, wherein in the training of the network model in step 2, the network environment is built on PyTorch 1.8 using an NVIDIA RTX 3090 Ti GPU.
CN202210538803.7A 2022-05-17 2022-05-17 Lightweight binocular image super-resolution method based on multi-attention mechanism fusion Pending CN114881858A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210538803.7A CN114881858A (en) 2022-05-17 2022-05-17 Lightweight binocular image super-resolution method based on multi-attention mechanism fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210538803.7A CN114881858A (en) 2022-05-17 2022-05-17 Lightweight binocular image super-resolution method based on multi-attention mechanism fusion

Publications (1)

Publication Number Publication Date
CN114881858A true CN114881858A (en) 2022-08-09

Family

ID=82675306

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210538803.7A CN114881858A (en) Lightweight binocular image super-resolution method based on multi-attention mechanism fusion

Country Status (1)

Country Link
CN (1) CN114881858A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117710215A (en) * 2024-01-09 2024-03-15 西南科技大学 Binocular image super-resolution method based on polar line windowing attention
CN117710215B (en) * 2024-01-09 2024-06-04 西南科技大学 Binocular image super-resolution method based on polar line windowing attention

Similar Documents

Publication Publication Date Title
CN111652966B (en) Three-dimensional reconstruction method and device based on multiple visual angles of unmanned aerial vehicle
CN111275618B (en) Depth map super-resolution reconstruction network construction method based on double-branch perception
CN111739082B (en) Stereo vision unsupervised depth estimation method based on convolutional neural network
CN112991350B (en) RGB-T image semantic segmentation method based on modal difference reduction
CN112767253B (en) Multi-scale feature fusion binocular image super-resolution reconstruction method
CN114581560B (en) Multi-scale neural network infrared image colorization method based on attention mechanism
CN110930500A (en) Dynamic hair modeling method based on single-view video
CN108924528B (en) Binocular stylized real-time rendering method based on deep learning
CN112785502B (en) Light field image super-resolution method of hybrid camera based on texture migration
CN113096239B (en) Three-dimensional point cloud reconstruction method based on deep learning
Lin et al. Steformer: Efficient stereo image super-resolution with transformer
CN112819951A (en) Three-dimensional human body reconstruction method with shielding function based on depth map restoration
CN113538243A (en) Super-resolution image reconstruction method based on multi-parallax attention module combination
CN116486074A (en) Medical image segmentation method based on local and global context information coding
CN114881858A (en) Lightweight binocular image super-resolution method based on multi-attention mechanism fusion
CN112184555B (en) Stereo image super-resolution reconstruction method based on deep interactive learning
CN114359039A (en) Knowledge distillation-based image super-resolution method
CN116934972A (en) Three-dimensional human body reconstruction method based on double-flow network
CN117036171A (en) Blueprint separable residual balanced distillation super-resolution reconstruction model and blueprint separable residual balanced distillation super-resolution reconstruction method for single image
CN116912727A (en) Video human behavior recognition method based on space-time characteristic enhancement network
CN116957057A (en) Multi-view information interaction-based light field image super-resolution network generation method
CN116309072A (en) Binocular image super-resolution method for feature channel separation and fusion
CN116703719A (en) Face super-resolution reconstruction device and method based on face 3D priori information
CN116152060A (en) Double-feature fusion guided depth image super-resolution reconstruction method
Zhang et al. Unsupervised learning of depth estimation based on attention model from monocular images

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination