Detailed Description
According to one or more embodiments, as shown in fig. 1, a light field multi-view image super-resolution method based on multi-scale fusion features includes the following steps:
A1, constructing a training set of high-resolution and low-resolution image pairs from light field camera multi-view images or light field camera array images (multi-view images distributed in an N × N array);
A2, constructing a multilayer feature extraction network that maps the N × N light field multi-view image array to an N × N array of light field multi-view feature images;
A3, stacking the feature images and constructing a multilayer convolutional network for feature fusion and enhancement, obtaining 4D light field structural features that can be used to reconstruct light field multi-view images;
A4, constructing an up-sampling module to obtain the nonlinear mapping from the 4D light field structural features to the high-resolution N × N light field multi-view images;
A5, constructing a loss function based on the multi-scale feature fusion network, training, and fine-tuning the network parameters;
and A6, inputting a low-resolution N × N light field multi-view image into the trained network to obtain the high-resolution N × N light field multi-view image.
According to one or more embodiments, the specific process of constructing the training set of high-resolution and low-resolution image pairs from the light field camera multi-view images or light field camera array images (multi-view images distributed in an N × N array) in step A1 is as follows:
Step A1.1, first, the multi-view image $G_{HR}$, distributed in an N × N array, is down-sampled by a factor of 2 using bicubic interpolation, obtaining the low-resolution N × N light field multi-view image $G_{LR}$;
Step A1.2, then, each view of the low-resolution light field multi-view image $G_{LR}$ is cut into patches with a spatial size of M × M pixels using a stride of K pixels, and the high-resolution light field multi-view image $G_{HR}$ is correspondingly cut into patches of 2M × 2M pixels;
Step A1.3, normalization and regularization are applied to the two sets of light field multi-view patches so that each pixel value lies in the range [0, 1], forming the input data and ground-truth data of the deep learning network model in this embodiment.
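The following is a minimal PyTorch-style sketch of the patch-pair construction described in steps A1.1 to A1.3. The function name make_patch_pairs is an illustrative assumption, not the patent's own code; the defaults M = 64 and K = 32 follow the 64 × 64 / stride-32 setting reported in the implementation example below.

```python
import torch
import torch.nn.functional as F

def make_patch_pairs(hr_views, M=64, K=32):
    """hr_views: tensor of shape (N, N, H, W), single-channel views with values in [0, 255]."""
    hr = hr_views.float() / 255.0                       # step A1.3: normalize to [0, 1]
    N, _, H, W = hr.shape
    # step A1.1: 2x bicubic down-sampling of every view
    lr = F.interpolate(hr.reshape(N * N, 1, H, W), scale_factor=0.5,
                       mode='bicubic', align_corners=False).clamp(0, 1)
    lr = lr.reshape(N, N, H // 2, W // 2)
    pairs = []
    # step A1.2: cut LR views into M x M patches with stride K; HR into 2M x 2M patches
    for y in range(0, H // 2 - M + 1, K):
        for x in range(0, W // 2 - M + 1, K):
            pairs.append((lr[:, :, y:y + M, x:x + M],
                          hr[:, :, 2 * y:2 * y + 2 * M, 2 * x:2 * x + 2 * M]))
    return pairs
```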
According to one or more embodiments, as shown in fig. 2, the specific process of constructing the multilayer feature extraction network from the N × N light field multi-view image array to the N × N light field multi-view feature images in step A2 is as follows:
Step A2.1, each view of the low-resolution light field multi-view images is passed through 1 conventional convolution and 1 residual block (ResB) to extract low-level features;
Step A2.2, multi-scale feature extraction and feature fusion are performed on the extracted low-level features by a residual atrous spatial pyramid pooling block (ResASPP) and a residual block (ResB) that alternate twice, obtaining the mid-level features of each light field multi-view image.
The ResASPP block is formed by cascading 3 ASPP blocks with the same structural parameters and adding the result as a residual to the upstream input. As shown in fig. 3, an atrous spatial pyramid pooling block (ASPP) performs multi-scale feature extraction on its upstream input using parallel atrous (dilated) convolutions with different dilation rates; in each ASPP block, feature extraction is first performed on the upstream input by 3 atrous convolutions with dilation rates d = 1, 4, and 8, respectively, and the resulting multi-scale features are then fused by a 1 × 1 convolution.
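A minimal PyTorch sketch of the ASPP and ResASPP blocks described above is given below. The channel count of 64 and the LeakyReLU activation inside the ASPP branches are assumptions made for illustration; only the dilation rates (1, 4, 8), the 1 × 1 fusion convolution, and the three-block residual cascade follow the text.

```python
import torch
import torch.nn as nn

class ASPP(nn.Module):
    """Three parallel dilated 3x3 convolutions (d = 1, 4, 8) fused by a 1x1 convolution."""
    def __init__(self, channels=64):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(channels, channels, 3, padding=d, dilation=d) for d in (1, 4, 8)
        ])
        self.fuse = nn.Conv2d(3 * channels, channels, 1)
        self.act = nn.LeakyReLU(0.1, inplace=True)

    def forward(self, x):
        feats = [self.act(branch(x)) for branch in self.branches]
        return self.fuse(torch.cat(feats, dim=1))

class ResASPP(nn.Module):
    """Cascade of 3 identical ASPP blocks with a residual connection to the upstream input."""
    def __init__(self, channels=64):
        super().__init__()
        self.body = nn.Sequential(ASPP(channels), ASPP(channels), ASPP(channels))

    def forward(self, x):
        return x + self.body(x)
```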
According to one or more embodiments, the specific process of stacking the feature images and constructing the multilayer convolutional network for feature fusion and enhancement in step A3, so as to obtain the 4D light field structural features that can be used for reconstructing the light field multi-view images, is as follows:
Step A3.1, each view of the multi-scale feature map array $Q_0 \in \mathbb{R}^{NH \times NW \times C}$ is stacked along the channel dimension in order from top left to bottom right, where H and W denote the number of rows and columns of pixels in a single multi-view image, respectively; N denotes the number of multi-view images in a single direction, the total number being N × N; and C denotes the number of channels of the image. This yields the feature map $Q \in \mathbb{R}^{H \times W \times (N \times N \times C)}$, as sketched below.
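As a hedged illustration of step A3.1, the sketch below stacks the N × N view features along the channel dimension in top-left to bottom-right order; the function name stack_views and the use of PyTorch tensor slicing are assumptions made for illustration.

```python
import torch

def stack_views(q0, N):
    """q0: tensor of shape (N*H, N*W, C) holding the N x N array of per-view feature maps."""
    NH, NW, C = q0.shape
    H, W = NH // N, NW // N
    views = [q0[i * H:(i + 1) * H, j * W:(j + 1) * W, :]   # (H, W, C) feature map of one view
             for i in range(N) for j in range(N)]          # top-left to bottom-right order
    return torch.cat(views, dim=-1)                        # stacked feature map (H, W, N*N*C)
```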
Step A3.2, the stacked feature map $Q \in \mathbb{R}^{H \times W \times (N \times N \times C)}$ is sent as input to the global feature fusion module. First, feature re-extraction is performed on the stacked multi-scale features by 3 conventional convolutions, and feature fusion is then performed by 1 residual block;
Step A3.3, the features then enter the fusion block for feature enhancement. By extracting the angular features of the 4D light field, the fusion block accumulates additional texture detail onto the original features. The enhanced features are sent to 4 cascaded residual blocks for full feature fusion, finally generating the 4D light field structural features that can be used for super-resolution reconstruction of the light field images.
The fusion block is used for performing feature fusion and enhancement on the extracted multi-scale features, and adopts a network structure shown in fig. 4. The central perspective image can generate other peripheral perspective images through a certain warping transformation, and vice versa. The process of generating the surrounding view from the center view can be described mathematically as:
$$G_{s',t'} = M_{st \to s't'} \cdot W_{st \to s't'} \cdot G_{s,t} + N_{st \to s't'}$$
where $G_{s,t}$ denotes the central-view image, $G_{s',t'}$ denotes the other peripheral-view images, $W_{st \to s't'}$ is the warping matrix, $N_{st \to s't'}$ is the error term between the view generated by the warping transformation and the original multi-view image $G_{s',t'}$, and $M_{st \to s't'}$ is the mask matrix used to remove the influence of the occlusion problem described above.
As shown in fig. 4, each peripheral-view feature $Q_{s',t'}$ in the N × N feature map array can generate a central-view feature $Q'_{s,t}$ through the warping transformation $W_{s't' \to st}$, as shown by the feature block labeled ①. Likewise, the central-view feature $Q_{s,t}$ can generate the peripheral-view features $Q'_{s',t'}$ through the warping transformation $W_{st \to s't'}$, as shown by the feature block labeled ② in fig. 4. The foregoing process can be expressed as:
$$Q'_{s,t} = W_{s't' \to st} \otimes Q_{s',t'}, \qquad Q'_{s',t'} = W_{st \to s't'} \otimes Q_{s,t}$$
where $\otimes$ denotes batch matrix multiplication. The module then applies mask processing to feature blocks ① and ② respectively, so as to handle the occlusion problem between different views. The mask matrix is obtained by taking the absolute value of the error term between the generated view and the original view; the larger the absolute value, the more likely the region is an occluded region. Specifically:
$$M_{s't' \to st} = \begin{cases} 0, & \left| Q'_{s,t} - Q_{s,t} \right| > T \\ 1, & \text{otherwise} \end{cases}$$
where $T = 0.9 \times \max\left( \left\| Q'_{s,t} - Q_{s,t} \right\|_1 \right)$ is an empirical threshold set in the algorithm, and the mask matrix $M_{st \to s't'}$ is obtained similarly to $M_{s't' \to st}$. The occluded regions in feature blocks ① and ② are then filtered out:
$$\bar{Q}'_{s,t} = M_{s't' \to st} \odot Q'_{s,t}, \qquad \bar{Q}'_{s',t'} = M_{st \to s't'} \odot Q'_{s',t'}$$
where $\bar{Q}'_{s,t}$ and $\bar{Q}'_{s',t'}$ are the feature blocks obtained after mask processing and $\odot$ denotes element-wise multiplication. In this process, $n = N \times N - 1$ central-view feature maps are generated, so they are normalized (averaged) to obtain the feature map labeled ③ in fig. 4:
$$\hat{Q}_{s,t} = \frac{1}{N \times N - 1} \sum_{k=1}^{N \times N - 1} \bar{Q}'^{\,k}_{s,t}$$
where k is the index of the views other than the central view in the N × N feature map array, arranged from top left to bottom right, and $\bar{Q}'^{\,k}_{s,t}$ denotes the kth masked central-view feature map generated from the corresponding peripheral view. Further, the feature map labeled ③ replaces the feature map at the central position, yielding the globally fused feature block ④. Feature block ④ is then added to the original input multi-scale features to achieve feature enhancement, finally obtaining the fused and enhanced feature block ⑤.
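The sketch below illustrates, under stated assumptions, the mask processing step of the fusion block: the mask is set to 0 where the absolute warping error exceeds the threshold T = 0.9 × max error and to 1 elsewhere, and the masked central-view maps are then averaged to form the feature map labeled ③. The binary-mask rule, tensor shapes, and function names are illustrative assumptions rather than the patent's own implementation.

```python
import torch

def occlusion_mask(q_warped, q_orig):
    """q_warped: central-view feature generated by warping one peripheral view.
    q_orig: original central-view feature. Both have shape (C, H, W)."""
    err = (q_warped - q_orig).abs()          # absolute value of the error term
    T = 0.9 * err.max()                      # empirical threshold from the text
    return (err <= T).float()                # 1 = kept, 0 = treated as occluded

def fuse_central_views(warped_views, q_central):
    """warped_views: (N*N-1, C, H, W) central-view features generated from the
    peripheral views; q_central: (C, H, W) original central-view feature."""
    masked = torch.stack([occlusion_mask(w, q_central) * w for w in warped_views])
    return masked.mean(dim=0)                # feature map labeled (3): average of masked maps
```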
According to one or more embodiments, the specific process of constructing the up-sampling module in step A4 to obtain the nonlinear mapping from the 4D light field structural features to the high-resolution N × N light field multi-view images is as follows:
Step A4.1, using sub-pixel convolution, r² feature maps, each with C channels, are first generated from the input feature map with C channels;
Step A4.2, the resulting feature map with r² × C channels is then periodically rearranged, generating a high-resolution feature map whose resolution is r times that of the input;
Step A4.3, the high-resolution feature map is sent to 1 conventional convolutional layer for feature fusion, finally generating the super-resolved light field multi-view image array.
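A minimal PyTorch sketch of the sub-pixel up-sampling module of step A4 follows. The channel count of 64 and the single-channel output are illustrative assumptions; r = 2 matches the 2× super-resolution of this embodiment.

```python
import torch
import torch.nn as nn

class SubPixelUpsampler(nn.Module):
    def __init__(self, channels=64, r=2, out_channels=1):
        super().__init__()
        self.expand = nn.Conv2d(channels, channels * r * r, 3, padding=1)  # C -> r^2 * C channels
        self.shuffle = nn.PixelShuffle(r)                                  # rearrange to r x resolution
        self.fuse = nn.Conv2d(channels, out_channels, 3, padding=1)        # final conventional convolution

    def forward(self, x):
        # x: (B, C, H, W) 4D light field structural features -> (B, out_channels, r*H, r*W)
        return self.fuse(self.shuffle(self.expand(x)))
```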
According to one or more embodiments, the specific process of constructing the loss function based on the multi-scale feature fusion network, training, and fine-tuning the network parameters in step A5 is as follows:
In the training process, the super-resolved light field multi-view images are compared one by one with the actual high-resolution light field multi-view images to compute the loss, where u and v denote the position of a multi-view image in the N × N array in the horizontal and vertical directions, respectively, and s and t denote the position of a pixel in the x-axis and y-axis directions of the image, respectively. The network adopts a leaky rectified linear unit (Leaky ReLU) with a leakage factor of 0.1 as the activation function, so as to avoid neurons that stop transmitting information during training.
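Since the exact form of the loss is not reproduced here, the sketch below assumes, purely for illustration, a per-pixel L1 comparison between the super-resolved and ground-truth views over all N × N view positions (u, v) and pixel positions (s, t).

```python
import torch
import torch.nn.functional as F

def lf_reconstruction_loss(sr_views, hr_views):
    """sr_views, hr_views: tensors of shape (B, N, N, C, H, W); indices (u, v)
    select the view and (s, t) select the pixel, as defined in the text.
    The choice of the L1 norm is an assumption, not stated by the patent."""
    return F.l1_loss(sr_views, hr_views)
```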
Step A6 inputs the low-resolution N × N light field multi-view image into the trained network to obtain the high-resolution N × N light field multi-view image.
The invention is discussed in terms of one or more embodiments implementing the method.
Training was performed using the Heidelberg University (Germany) light field dataset and the Stanford Lytro Illum light field camera dataset, using 5 × 5 light field multi-view images, and the training data was sliced into 64 × 64 pixel low-resolution patches and 128 × 128 pixel high-resolution patches with a stride of 32 pixels. Data augmentation was performed by randomly flipping the images horizontally and vertically. The constructed neural network was trained in the PyTorch framework; the model was optimized with the Adam method, and the weights of each convolutional layer were initialized with the Xavier method. The initial learning rate of the model was set to 2 × 10⁻⁴ and multiplied by 0.5 every 20 epochs, and training was stopped after 80 epochs.
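A hedged sketch of this training configuration is given below; the model and the data pipeline are assumed to exist elsewhere, and only the Xavier initialization, Adam optimizer, learning-rate schedule, and random flips follow the description above.

```python
import torch
import torch.nn as nn

def init_weights(m):
    if isinstance(m, nn.Conv2d):
        nn.init.xavier_uniform_(m.weight)      # Xavier initialization of each conv layer
        if m.bias is not None:
            nn.init.zeros_(m.bias)

def configure_training(model):
    model.apply(init_weights)
    optimizer = torch.optim.Adam(model.parameters(), lr=2e-4)                        # initial lr 2e-4
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.5)  # x0.5 every 20 epochs
    return optimizer, scheduler               # train for 80 epochs, stepping the scheduler once per epoch

def augment(lr_patch, hr_patch):
    # random horizontal / vertical flips applied identically to the LR and HR patches
    if torch.rand(1) < 0.5:
        lr_patch, hr_patch = lr_patch.flip(-1), hr_patch.flip(-1)
    if torch.rand(1) < 0.5:
        lr_patch, hr_patch = lr_patch.flip(-2), hr_patch.flip(-2)
    return lr_patch, hr_patch
```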
The trained network was comparatively evaluated on a synthetic data set and a real data set, respectively.
Fig. 5 shows a comparison table of bicubic interpolation and the method of the present invention under the two evaluation indexes PSNR and SSIM on three different images from the synthetic data set.
Fig. 6 shows a comparison table of bicubic interpolation and the method of the present invention under the two evaluation indexes PSNR and SSIM on three different images from the real data set.
The higher the PSNR and SSIM values, the better the super-resolution result. The results of these implementation examples show that the method achieves a clear super-resolution improvement.
It should be understood that, in the embodiments of the present invention, the term "and/or" merely describes an association relation between associated objects and indicates that three relations may exist. For example, A and/or B may represent: A exists alone, A and B exist simultaneously, or B exists alone. In addition, the character "/" herein generally indicates that the associated objects before and after it are in an "or" relationship.
Those of ordinary skill in the art will appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein may be implemented in electronic hardware, computer software, or a combination of the two, and that the components and steps of the examples have been described above in general functional terms in order to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends on the particular application and the design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may also be an electric, mechanical or other form of connection.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
While the invention has been described with reference to specific embodiments, the invention is not limited thereto, and various equivalent modifications and substitutions can be easily made by those skilled in the art within the technical scope of the invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.