CN113409216A - Image restoration method based on frequency band self-adaptive restoration model - Google Patents

Image restoration method based on frequency band self-adaptive restoration model

Info

Publication number
CN113409216A
Authority
CN
China
Prior art keywords
image
frequency band
sub-band
Prior art date
Legal status
Pending
Application number
CN202110707809.8A
Other languages
Chinese (zh)
Inventor
王瑾 (Wang Jin)
王琛 (Wang Chen)
朱青 (Zhu Qing)
Current Assignee
Beijing University of Technology
Original Assignee
Beijing University of Technology
Priority date
Filing date
Publication date
Application filed by Beijing University of Technology
Priority to CN202110707809.8A
Publication of CN113409216A


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00: Image enhancement or restoration
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00: Computing arrangements using knowledge-based models
    • G06N5/04: Inference or reasoning models
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00: Indexing scheme for image analysis or image enhancement
    • G06T2207/20: Special algorithmic details
    • G06T2207/20048: Transform domain processing
    • G06T2207/20064: Wavelet transform [DWT]


Abstract

The invention relates to an image restoration method based on a frequency band self-adaptive restoration model, which solves the problem that the prior art cannot simultaneously reconstruct a reasonable structure and fine texture for a damaged image. On the basis of a generative adversarial network model, the invention uses an encoder-decoder network structure as the baseline network of the generator. To extract deep information from known regions, the damaged image features are decomposed into low-frequency and high-frequency sub-bands by the discrete wavelet transform (DWT), and features are then extracted from the multi-frequency sub-bands by a convolutional network. Finally, each sub-band image is reconstructed by the inverse wavelet transform, and the missing-region information is completed to generate the restored image. Processing low-frequency information separately from high-frequency information makes it possible to handle the structure and texture information of the image in a targeted manner. Images generated by this method have a clearer structure and finer texture, and the repaired images are more realistic, with no obvious boundary.

Description

Image restoration method based on frequency band self-adaptive restoration model
Technical field:
The invention relates to the field of computer image processing, and in particular to an image restoration method based on a frequency band self-adaptive restoration model.
Background art:
Image inpainting is a basic task in multimedia applications and computer vision; its goal is to generate a plausible global semantic structure and local detail textures for missing regions, and ultimately to produce visually realistic results. It is widely applied in multimedia fields such as image editing, restoration and synthesis. Conventional patch-based image inpainting methods search for the best-matching image blocks in the known region and copy them into the missing region. Such traditional methods handle stationary textures well, but their effect on complex or non-repetitive structures such as human faces is limited, and they are not suited to capturing high-level semantic information.
In recent years, learning-based approaches have modeled image inpainting as a conditional generation problem. Pathak et al. first trained a deep neural network with an adversarial loss function to predict missing regions, which helps capture the edges and global structure of large missing regions. Ishikawa et al. improved on this by combining global and local adversarial loss functions to produce finer textures. By extracting and propagating deep features through convolutional neural networks, these methods better overcome the shortcomings of traditional image restoration algorithms and obtain visually realistic and plausible restoration results. However, because these methods treat the structure and texture information of the input image identically, over-smoothed edges or distorted textures often appear.
To address this problem, Liu et al. proposed a two-stage network that recovers the coarse structure of the missing region in the first stage and uses the reconstruction from the first stage to generate the final result in the second stage. However, the second-stage network depends heavily on the correctness of the structure reconstructed by the first-stage network, and two-stage training also brings an additional computational burden. Meanwhile, the data distributions of the low-frequency and high-frequency features of an input image are completely different; if feature distributions at different frequencies are computed indiscriminately, the reconstruction of structure or the generation of texture may be misled.
In summary, existing image restoration algorithms often cannot reconstruct a reasonable structure and fine texture at the same time, which is a significant limitation.
Disclosure of Invention
In order to solve the problem that the prior art cannot simultaneously reconstruct a reasonable structure and fine texture for a damaged image, the invention provides an image restoration method based on a frequency band self-adaptive restoration model.
On the basis of a generative adversarial network model, the image restoration method based on the frequency band self-adaptive restoration model uses an encoder-decoder network structure as the baseline network of the generator. To extract deep information from known regions, the features are decomposed into low-frequency and high-frequency sub-bands by the discrete wavelet transform (DWT), and features are then extracted from the multi-frequency sub-bands by a convolutional network. Finally, each sub-band image is reconstructed by the inverse wavelet transform, and the missing-region information is completed to generate the restored image. Processing low-frequency information separately from high-frequency information makes it possible to handle the structure and texture information of the image in a targeted manner.
The method is described below in terms of both the training stage and the actual measurement stage:
The training phase comprises the following steps:
Step one: decomposing the damaged image into low-frequency and high-frequency sub-bands by the discrete wavelet transform (DWT);
Step two: inputting the low-frequency sub-band obtained in step one directly into a low-frequency feature encoder, splicing the high-frequency sub-bands in the channel direction and inputting them into a high-frequency feature encoder, and finally concatenating the features extracted by the high-frequency and low-frequency encoders in the channel direction to obtain a multi-frequency sub-band feature representation of the image;
Step three: inputting the multi-frequency sub-band feature representation extracted by the encoders in step two into a decoder to obtain the multi-frequency sub-band feature representation of the generated image, and finally obtaining the repaired image by the inverse discrete wavelet transform;
Step four: inputting the image generated in step three into a discriminator network and iteratively adjusting the generator parameters through the overall loss function until the loss function converges and the generator parameters reach their optimal values, at which point training stops;
The specific operation of each step is as follows:
The specific operation of step one is as follows:
For a damaged image x, the image is decomposed into multi-frequency sub-bands by the discrete wavelet transform. The invention uses the Haar wavelet transform with 4 convolution filters, comprising the low-frequency filter f_LL and 3 high-pass filters f_HL, f_LH and f_HH. With a convolution step size of 2, the image is iteratively decomposed into 4 multi-frequency sub-band images x_sub = {x_LL, x_HL, x_LH, x_HH}:
x_sub = {(f_LL ⊛ x)↓2, (f_HL ⊛ x)↓2, (f_LH ⊛ x)↓2, (f_HH ⊛ x)↓2}
where ⊛ denotes the convolution operation, ↓2 denotes the standard down-sampling operator with a factor of 2, and {·} denotes the collection operation.
Low-frequency sub-band: x_L = x_LL; high-frequency sub-bands: x_H = {x_HL, x_LH, x_HH}.
The specific operation of step two is as follows:
Because the low-frequency sub-band varies slowly, the model extracts low-frequency sub-band features with an encoder consisting of a first convolution block, a second convolution block, a first residual convolution block, a second residual convolution block, a third convolution block and a fourth residual convolution block. In contrast, the high-frequency sub-bands vary quickly, so the model extracts their features with an encoder consisting of a first, second, third, fourth, fifth and sixth residual convolution block. Adopting different encoder structures facilitates feature extraction from the high-frequency and low-frequency sub-bands.
The specific operation of step three is as follows:
The decoder consists of six serial residual convolution blocks with embedded attention modules, with an attention module embedded behind each residual convolution block. For each residual convolution block with an embedded attention module, let x′_sub denote the output of the residual convolution block; the attention module works as follows:
First, an attention matrix M is generated:
M = σ(k_2(k_1(x′_sub)))    (1)
where σ denotes the sigmoid activation function and k_i (i = 1, 2) denote different convolution operations with a filter size of 1×1;
Then, the weighted feature representation x″_sub is obtained by element-wise multiplication of the attention matrix M with the output x′_sub of the residual convolution block. The overall attention model is expressed as follows:
x″_sub = M ⊙ x′_sub    (2)
where ⊙ denotes multiplication of corresponding elements.
Step four: constructing the overall loss function.
The training process of the entire network can be written as:
I_out = IDWT[G_gen(x_L, x_H)]    (3)
where I_out denotes the restoration result of the network, IDWT[·] denotes the inverse discrete wavelet transform, x_L and x_H denote the low- and high-frequency sub-bands of the damaged image, and G_gen denotes the generator network.
First, the reconstruction loss function is defined as the L1 distance between the prediction result I_out and the real image I_gt:
L_r = ‖I_out - I_gt‖_1    (4)
To bring the features of the real image and the reconstructed image closer in the discriminator, we use an adversarial constraint; the adversarial loss function L_adv is defined as follows:
L_adv = min_G max_D E_{I_gt~p_data(I_gt)}[log D(I_gt)] + E_{I_out~p_rec(I_out)}[log(1 - D(I_out))]    (5)
where E denotes the expected value over the corresponding distribution, p_data(I_gt) denotes the real sample distribution, p_rec(I_out) denotes the generated image distribution, D denotes the discriminator, and G is G_gen.
Finally, in order to keep the content and style of the generated image consistent with the real image, a VGG-16 network pre-trained on ImageNet is used to extract a high-level feature space; the texture loss L_t is defined as follows:
L_t = ‖Gram{Φ(I_out)} - Gram{Φ(I_gt)}‖_2 + λ‖Φ(I_out) - Φ(I_gt)‖_2    (6)
where Φ denotes the feature representation extracted by the ImageNet-pre-trained VGG-16 network, and Gram denotes the Gram matrix operation.
Taking the reconstruction loss, adversarial loss and texture loss together, the overall loss function of the generator network is defined as:
L_total = λ_r·L_r + λ_adv·L_adv + λ_t·L_t    (7)
where λ_r, λ_adv and λ_t denote the weight coefficients.
In the actual measurement stage, the repaired image is obtained with the trained generator network; the steps are as follows:
Step one: decomposing the damaged image into low-frequency and high-frequency sub-bands by the discrete wavelet transform (DWT);
Step two: inputting the low-frequency sub-band obtained in step one directly into the low-frequency feature encoder, splicing the high-frequency sub-bands in the channel direction and inputting them into the high-frequency feature encoder, and finally concatenating the features extracted by the high-frequency and low-frequency encoders in the channel direction to obtain the multi-frequency sub-band feature representation of the image;
Step three: inputting the multi-frequency sub-band feature representation extracted by the encoders in step two into the decoder, obtaining the multi-frequency sub-band feature representation of the generated image through the residual convolution operations of several embedded attention modules, and finally obtaining the repaired image, which has strong realism and clear texture, by the inverse discrete wavelet transform.
Compared with the prior art, the method decomposes the features into high-frequency and low-frequency sub-bands through frequency-division processing, and extracts and transmits deep information in the high-frequency and low-frequency sub-band regions separately. Processing low-frequency and high-frequency information separately makes it possible to handle the structure and texture information of the image in a targeted manner, with the following beneficial effects: the model can synthesize a clear image structure as well as generate fine textures in the missing region, and it clearly outperforms state-of-the-art methods.
Description of the drawings:
FIG. 1 is a technical framework diagram of the frequency band self-adaptive restoration model;
FIG. 2 is an example of restoration results on a data set;
FIG. 3 is a comparison of visual results with different algorithms;
the specific implementation mode is as follows:
in order to more clearly describe the technical contents of the present invention, the following is further described with reference to specific examples:
In the technical framework of the image restoration method based on the frequency band self-adaptive restoration model shown in FIG. 1, the damaged image is first divided into high-frequency and low-frequency sub-bands by the discrete wavelet transform, and different convolution blocks are used to extract deep features from the high-frequency and low-frequency sub-bands respectively. The extracted high-frequency and low-frequency features are then spliced and input into the decoder. In the decoder, a structure combining residual blocks and convolution blocks is used for feature restoration, and an attention mechanism further extracts and transmits key features to obtain the final multi-frequency representation of the restoration result; the repaired image is obtained after the inverse wavelet transform.
The method is described below in terms of both the training stage and the actual measurement stage:
The training phase comprises the following steps:
Step one: decomposing the damaged image into low-frequency and high-frequency sub-bands by the discrete wavelet transform (DWT);
Step two: inputting the low-frequency sub-band obtained in step one directly into a low-frequency feature encoder, splicing the high-frequency sub-bands in the channel direction and inputting them into a high-frequency feature encoder, and finally concatenating the features extracted by the high-frequency and low-frequency encoders in the channel direction to obtain a multi-frequency sub-band feature representation of the image;
Step three: inputting the multi-frequency sub-band feature representation extracted by the encoders in step two into a decoder to obtain the multi-frequency sub-band feature representation of the generated image, and finally obtaining the repaired image by the inverse discrete wavelet transform;
Step four: inputting the image generated in step three into a discriminator network and iteratively adjusting the generator parameters through the overall loss function until the loss function converges and the generator parameters reach their optimal values, at which point training stops;
The specific operation of each step is as follows:
The specific operation of step one is as follows:
For a damaged image x, the image is decomposed into multi-frequency sub-bands by the discrete wavelet transform. The invention uses the Haar wavelet transform with 4 convolution filters, comprising the low-frequency filter f_LL and 3 high-pass filters f_HL, f_LH and f_HH. With a convolution step size of 2, the image is iteratively decomposed into 4 multi-frequency sub-band images x_sub = {x_LL, x_HL, x_LH, x_HH}:
x_sub = {(f_LL ⊛ x)↓2, (f_HL ⊛ x)↓2, (f_LH ⊛ x)↓2, (f_HH ⊛ x)↓2}
where ⊛ denotes the convolution operation, ↓2 denotes the standard down-sampling operator with a factor of 2, and {·} denotes the collection operation.
Low-frequency sub-band: x_L = x_LL; high-frequency sub-bands: x_H = {x_HL, x_LH, x_HH}.
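For illustration only, the following is a minimal sketch in PyTorch (an assumed implementation environment; the patent itself prescribes no code) of the Haar decomposition above and its inverse. The 2×2 kernel values and the 1/2 normalization are standard Haar conventions assumed here, under which the same kernels serve for both analysis and exact reconstruction; the exact filter values shown in the original figures are not reproduced in this text.

```python
import torch
import torch.nn.functional as F

# Standard 2x2 Haar kernels (one common sign convention, 1/2 normalization).
f_LL = 0.5 * torch.tensor([[ 1.,  1.], [ 1.,  1.]])
f_HL = 0.5 * torch.tensor([[-1.,  1.], [-1.,  1.]])  # horizontal detail
f_LH = 0.5 * torch.tensor([[-1., -1.], [ 1.,  1.]])  # vertical detail
f_HH = 0.5 * torch.tensor([[ 1., -1.], [-1.,  1.]])  # diagonal detail

def haar_dwt(x):
    """Decompose x of shape (B, C, H, W) into the sub-bands x_LL, x_HL, x_LH, x_HH."""
    b, c, h, w = x.shape
    kernels = torch.stack([f_LL, f_HL, f_LH, f_HH])       # (4, 2, 2)
    weight = kernels.repeat(c, 1, 1).unsqueeze(1).to(x)   # (4C, 1, 2, 2), depthwise
    out = F.conv2d(x, weight, stride=2, groups=c)         # stride-2 convolution
    out = out.view(b, c, 4, h // 2, w // 2)
    return out[:, :, 0], out[:, :, 1], out[:, :, 2], out[:, :, 3]

def haar_idwt(x_LL, x_HL, x_LH, x_HH):
    """Inverse transform; with the 1/2-normalized kernels above this is exact."""
    b, c, h, w = x_LL.shape
    coeffs = torch.stack([x_LL, x_HL, x_LH, x_HH], dim=2).reshape(b, 4 * c, h, w)
    kernels = torch.stack([f_LL, f_HL, f_LH, f_HH])
    weight = kernels.repeat(c, 1, 1).unsqueeze(1).to(x_LL)
    return F.conv_transpose2d(coeffs, weight, stride=2, groups=c)

# Example: a 256x256 RGB image decomposes into four 128x128 sub-band images.
img = torch.rand(1, 3, 256, 256)
subbands = haar_dwt(img)
assert torch.allclose(haar_idwt(*subbands), img, atol=1e-5)
```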
The specific operation of step two is as follows:
Because the low-frequency sub-band varies slowly, the model extracts low-frequency sub-band features with an encoder consisting of a first convolution block, a second convolution block, a first residual convolution block, a second residual convolution block, a third convolution block and a fourth residual convolution block. In contrast, the high-frequency sub-bands vary quickly, so the model extracts their features with an encoder consisting of a first, second, third, fourth, fifth and sixth residual convolution block. Adopting different encoder structures facilitates feature extraction from the high-frequency and low-frequency sub-bands.
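As an illustration of the two encoder branches, the following hedged sketch (PyTorch) is consistent with the block ordering recited above; the channel width (64), the 3×3 kernel size, the ReLU activations and the initial channel-lifting convolution of the high-frequency branch are assumptions not specified in the patent.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """Residual convolution block: two 3x3 convolutions with a skip connection."""
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1))
    def forward(self, x):
        return torch.relu(x + self.body(x))

# Low-frequency branch: convolution blocks interleaved with residual blocks,
# in the order recited above (conv, conv, res, res, conv, res).
low_freq_encoder = nn.Sequential(
    nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(inplace=True),   # first convolution block
    nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(inplace=True),  # second convolution block
    ResBlock(64),                                            # first residual block
    ResBlock(64),                                            # second residual block
    nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(inplace=True),  # third convolution block
    ResBlock(64))                                            # fourth (residual) block

# High-frequency branch: six residual convolution blocks in series; the three
# high-frequency sub-bands of an RGB image give 3 x 3 = 9 input channels.
high_freq_encoder = nn.Sequential(
    nn.Conv2d(9, 64, 3, padding=1),          # channel lift (an assumption)
    *[ResBlock(64) for _ in range(6)])
```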
The specific operation of step three is as follows:
The decoder consists of six serial residual convolution blocks with embedded attention modules, with an attention module embedded behind each residual convolution block. For each residual convolution block with an embedded attention module, let x′_sub denote the output of the residual convolution block; the attention module works as follows:
First, an attention matrix M is generated:
M = σ(k_2(k_1(x′_sub)))    (1)
where σ denotes the sigmoid activation function and k_i (i = 1, 2) denote different convolution operations with a filter size of 1×1;
Then, the weighted feature representation x″_sub is obtained by element-wise multiplication of the attention matrix M with the output x′_sub of the residual convolution block. The overall attention model is expressed as follows:
x″_sub = M ⊙ x′_sub    (2)
where ⊙ denotes multiplication of corresponding elements.
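The following minimal sketch (PyTorch) shows the attention module under the assumptions above; in particular, the composition M = σ(k_2(k_1(·))) and equal input/output channel widths are assumptions consistent with the stated definitions of σ, k_1 and k_2.

```python
import torch
import torch.nn as nn

class EmbeddedAttention(nn.Module):
    """Attention module embedded behind a residual convolution block."""
    def __init__(self, ch):
        super().__init__()
        self.k1 = nn.Conv2d(ch, ch, kernel_size=1)  # first 1x1 convolution k_1
        self.k2 = nn.Conv2d(ch, ch, kernel_size=1)  # second 1x1 convolution k_2

    def forward(self, x_sub):
        m = torch.sigmoid(self.k2(self.k1(x_sub)))  # attention matrix M, Eq. (1)
        return m * x_sub                            # x''_sub = M (*) x'_sub, Eq. (2)
```

Six such residual-block/attention pairs in series would form the decoder trunk described above.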
Step four: constructing the overall loss function.
The training process of the entire network can be written as:
I_out = IDWT[G_gen(x_L, x_H)]    (3)
where I_out denotes the restoration result of the network, IDWT[·] denotes the inverse discrete wavelet transform, x_L and x_H denote the low- and high-frequency sub-bands of the damaged image, and G_gen denotes the generator network.
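Composing the sketches above (haar_dwt, haar_idwt and the two encoder branches), the generator forward pass of Eq. (3) could be assembled as follows; `decoder` is assumed to be any module that returns the four generated sub-bands.

```python
import torch

def generator_forward(x_damaged, low_freq_encoder, high_freq_encoder, decoder):
    """Sketch of Eq. (3): I_out = IDWT[G_gen(x_L, x_H)]."""
    x_LL, x_HL, x_LH, x_HH = haar_dwt(x_damaged)          # step one: DWT
    x_H = torch.cat([x_HL, x_LH, x_HH], dim=1)            # splice high-frequency sub-bands
    feats = torch.cat([low_freq_encoder(x_LL),            # step two: two encoder branches,
                       high_freq_encoder(x_H)], dim=1)    # concatenated channel-wise
    y_LL, y_HL, y_LH, y_HH = decoder(feats)               # step three: multi-frequency output
    return haar_idwt(y_LL, y_HL, y_LH, y_HH)              # I_out via inverse DWT
```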
First, the reconstruction loss function is defined as the L1 distance between the prediction result I_out and the real image I_gt:
L_r = ‖I_out - I_gt‖_1    (4)
To bring the features of the real image and the reconstructed image closer in the discriminator, we use an adversarial constraint; the adversarial loss function L_adv is defined as follows:
L_adv = min_G max_D E_{I_gt~p_data(I_gt)}[log D(I_gt)] + E_{I_out~p_rec(I_out)}[log(1 - D(I_out))]    (5)
where E denotes the expected value over the corresponding distribution, p_data(I_gt) denotes the real sample distribution, p_rec(I_out) denotes the generated image distribution, D denotes the discriminator, and G is G_gen.
Finally, in order to keep the content and style of the generated image consistent with the real image, a VGG-16 network pre-trained on ImageNet is used to extract a high-level feature space; the texture loss L_t is defined as follows:
L_t = ‖Gram{Φ(I_out)} - Gram{Φ(I_gt)}‖_2 + λ‖Φ(I_out) - Φ(I_gt)‖_2    (6)
where Φ denotes the feature representation extracted by the ImageNet-pre-trained VGG-16 network, and Gram denotes the Gram matrix operation.
Taking the reconstruction loss, adversarial loss and texture loss together, the overall loss function of the generator network is defined as:
L_total = λ_r·L_r + λ_adv·L_adv + λ_t·L_t    (7)
where λ_r, λ_adv and λ_t denote the weight coefficients.
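The generator-side objective of Eqs. (4)-(7) can be sketched as follows (PyTorch). The squared-error form of the norms in Eq. (6), the non-saturating BCE form of the generator's adversarial term, the `vgg_features` extractor and all weight values are assumptions; the patent does not disclose the λ values.

```python
import torch
import torch.nn.functional as F

def gram(phi):
    """Gram matrix of a feature map of shape (B, C, H, W)."""
    b, c, h, w = phi.shape
    f = phi.view(b, c, h * w)
    return f @ f.transpose(1, 2) / (c * h * w)

def generator_loss(i_out, i_gt, d_out_fake, vgg_features,
                   lam=1.0, lam_r=1.0, lam_adv=0.1, lam_t=1.0):
    """d_out_fake: discriminator logits for I_out; vgg_features: pre-trained extractor."""
    l_r = F.l1_loss(i_out, i_gt)                               # Eq. (4): L1 distance
    l_adv = F.binary_cross_entropy_with_logits(                # Eq. (5), generator side
        d_out_fake, torch.ones_like(d_out_fake))
    phi_out, phi_gt = vgg_features(i_out), vgg_features(i_gt)  # VGG-16 feature space
    l_t = (F.mse_loss(gram(phi_out), gram(phi_gt))             # Eq. (6): texture loss
           + lam * F.mse_loss(phi_out, phi_gt))
    return lam_r * l_r + lam_adv * l_adv + lam_t * l_t         # Eq. (7): weighted sum
```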
In the actual measurement stage, the repaired image is obtained with the trained generator network; the steps are as follows:
Step one: decomposing the damaged image into low-frequency and high-frequency sub-bands by the discrete wavelet transform (DWT);
Step two: inputting the low-frequency sub-band obtained in step one directly into the low-frequency feature encoder, splicing the high-frequency sub-bands in the channel direction and inputting them into the high-frequency feature encoder, and finally concatenating the features extracted by the high-frequency and low-frequency encoders in the channel direction to obtain the multi-frequency sub-band feature representation of the image;
Step three: inputting the multi-frequency sub-band feature representation extracted by the encoders in step two into the decoder, obtaining the multi-frequency sub-band feature representation of the generated image through the residual convolution operations of several embedded attention modules, and finally obtaining the repaired image, which has strong realism and clear texture, by the inverse discrete wavelet transform.
An exemplary result of the present invention is shown in FIG. 2, where (a) is the input image, (b) is the restoration result of the present invention, and (c) is the Ground Truth image.
Image quality evaluation:
FIG. 3 compares the visual results of the present invention with Context Encoder, Contextual Attention, and PICNet on central-region image inpainting. Context Encoder (CE) produces warped structures and blurred results, and Contextual Attention (CA) also exhibits warped structures. PICNet aims to produce a wide variety of realistic images, but sometimes produces repetitive and structurally distorted ones. The present invention handles these problems better and produces more intuitive and realistic results than these methods. We also performed quantitative comparisons using common evaluation indices; Table 1 shows that our method achieves the best performance.
Table 1: Objective quality comparison of different algorithms (the table appears as an image in the original publication)

Claims (5)

1. An image restoration method based on a frequency band self-adaptive restoration model, characterized in that, on the basis of a generative adversarial network model, an encoder-decoder network structure is used as the baseline network of the generator, and the image restoration method comprises a training phase and a prediction restoration phase;
The training phase comprises the following steps:
Step one: decomposing the damaged image into low-frequency and high-frequency sub-bands by the discrete wavelet transform (DWT);
Step two: inputting the low-frequency sub-band obtained in step one directly into a low-frequency feature encoder, splicing the high-frequency sub-bands in the channel direction and inputting them into a high-frequency feature encoder, and finally concatenating the features extracted by the high-frequency and low-frequency encoders in the channel direction to obtain a multi-frequency sub-band feature representation of the image;
Step three: inputting the multi-frequency sub-band feature representation extracted by the encoders in step two into a decoder to obtain the multi-frequency sub-band feature representation of the generated image, and finally obtaining the repaired image, i.e. the prediction result I_out, by the inverse discrete wavelet transform;
Step four: inputting the image generated in step three into a discriminator network and iteratively adjusting the generator parameters through the overall loss function until the loss function converges and the generator parameters reach their optimal values, at which point training stops;
The prediction restoration phase comprises the following steps:
Step one: decomposing the damaged image into low-frequency and high-frequency sub-bands by the discrete wavelet transform (DWT);
Step two: inputting the low-frequency sub-band obtained in step one directly into the low-frequency feature encoder, splicing the high-frequency sub-bands in the channel direction and inputting them into the high-frequency feature encoder, and finally concatenating the features extracted by the high-frequency and low-frequency encoders in the channel direction to obtain the multi-frequency sub-band feature representation of the image;
Step three: inputting the multi-frequency sub-band feature representation extracted by the encoders in step two into the decoder, obtaining the multi-frequency sub-band feature representation of the generated image through the residual convolution operations of several embedded attention modules, and finally obtaining the repaired image, which has strong realism and clear texture, by the inverse discrete wavelet transform.
2. The image restoration method based on the band adaptive restoration model according to claim 1, wherein:
The specific operation of step one of the training phase is as follows:
For a damaged image x, the image is decomposed into multi-frequency sub-bands by the discrete wavelet transform, specifically the Haar wavelet transform, using 4 convolution filters comprising the low-frequency filter f_LL and 3 high-pass filters f_HL, f_LH and f_HH; with a convolution step size of 2, the image is iteratively decomposed into 4 multi-frequency sub-band images x_sub = {x_LL, x_HL, x_LH, x_HH}:
x_sub = {(f_LL ⊛ x)↓2, (f_HL ⊛ x)↓2, (f_LH ⊛ x)↓2, (f_HH ⊛ x)↓2}
where ⊛ denotes the convolution operation, ↓2 denotes the standard down-sampling operator with a factor of 2, and {·} denotes the collection operation;
Low-frequency sub-band: x_L = x_LL; high-frequency sub-bands: x_H = {x_HL, x_LH, x_HH}.
3. The image restoration method based on the band adaptive restoration model according to claim 1, wherein:
The low-frequency feature encoder in step two of the training phase consists, in order, of a first convolution block, a second convolution block, a first residual convolution block, a second residual convolution block, a third convolution block and a fourth residual convolution block; the high-frequency feature encoder consists, in order, of a first, second, third, fourth, fifth and sixth residual convolution block; adopting different encoder structures facilitates the extraction of high-frequency and low-frequency sub-band features.
4. The image restoration method based on the band adaptive restoration model according to claim 1, wherein:
The decoder consists of six serial residual convolution blocks with embedded attention modules, with an attention module embedded behind each residual convolution block; for each residual convolution block with an embedded attention module, x′_sub denotes the output of the residual convolution block, and the attention module works as follows:
First, an attention matrix M is generated:
M = σ(k_2(k_1(x′_sub)))    (1)
where σ denotes the sigmoid activation function and k_i (i = 1, 2) denote different convolution operations with a filter size of 1×1;
Then, the weighted feature representation x″_sub is obtained by element-wise multiplication of the attention matrix M with the output x′_sub of the residual convolution block; the overall attention model is expressed as follows:
x″_sub = M ⊙ x′_sub    (2)
where ⊙ denotes multiplication of corresponding elements.
5. The image restoration method based on the band adaptive restoration model according to claim 1, wherein:
The overall loss function in step four is constructed as follows:
First, the reconstruction loss function is constructed, defined as the L1 distance between the prediction result I_out and the real image I_gt:
L_r = ‖I_out - I_gt‖_1    (4)
where
I_out = IDWT[G_gen(x_L, x_H)]    (3)
where I_out denotes the restoration result of the network, IDWT[·] denotes the inverse discrete wavelet transform, x_L and x_H denote the low- and high-frequency sub-bands of the damaged image, and G_gen denotes the generator network;
The adversarial loss function L_adv is constructed to bring the features of the real image and the reconstructed image closer in the discriminator, and is defined as follows:
L_adv = min_G max_D E_{I_gt~p_data(I_gt)}[log D(I_gt)] + E_{I_out~p_rec(I_out)}[log(1 - D(I_out))]    (5)
where E denotes the expected value over the corresponding distribution, p_data(I_gt) denotes the real sample distribution, p_rec(I_out) denotes the generated image distribution, D denotes the discriminator, and G is G_gen;
The texture loss L_t is constructed to keep the content and style of the generated image consistent with the real image, and is defined as follows:
L_t = ‖Gram{Φ(I_out)} - Gram{Φ(I_gt)}‖_2 + λ‖Φ(I_out) - Φ(I_gt)‖_2    (6)
where Φ denotes the high-level feature space extracted by a VGG-16 network pre-trained on ImageNet, and Gram denotes the Gram matrix operation;
Finally, the overall loss function of the generator network is constructed as:
L_total = λ_r·L_r + λ_adv·L_adv + λ_t·L_t    (7)
where λ_r, λ_adv and λ_t denote the weight coefficients.
CN202110707809.8A (priority date 2021-06-24, filing date 2021-06-24) Image restoration method based on frequency band self-adaptive restoration model, published as CN113409216A (pending)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110707809.8A CN113409216A (en) 2021-06-24 2021-06-24 Image restoration method based on frequency band self-adaptive restoration model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110707809.8A CN113409216A (en) 2021-06-24 2021-06-24 Image restoration method based on frequency band self-adaptive restoration model

Publications (1)

Publication Number Publication Date
CN113409216A 2021-09-17

Family

ID=77683159

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110707809.8A Pending CN113409216A (en) 2021-06-24 2021-06-24 Image restoration method based on frequency band self-adaptive restoration model

Country Status (1)

Country Link
CN (1) CN113409216A (en)



Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210183022A1 (en) * 2018-11-29 2021-06-17 Tencent Technology (Shenzhen) Company Limited Image inpainting method and apparatus, computer device, and storage medium
CN110443775A (en) * 2019-06-20 2019-11-12 吉林大学 Wavelet transform domain multi-focus image fusing method based on convolutional neural networks
CN111047541A (en) * 2019-12-30 2020-04-21 北京工业大学 Image restoration method based on wavelet transformation attention model
CN111275640A (en) * 2020-01-17 2020-06-12 天津大学 Image enhancement method for fusing two-dimensional discrete wavelet transform and generating countermeasure network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
XIAO FENG et al.: "Generative adversarial network image inpainting combined with a perceptual attention mechanism", Journal of Xi'an Technological University (《西安工业大学学报》), vol. 41, no. 2, 25 April 2021 (2021-04-25), pages 198-205 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113808123A (en) * 2021-09-27 2021-12-17 杭州跨视科技有限公司 Machine vision-based dynamic detection method for liquid medicine bag
CN113808123B (en) * 2021-09-27 2024-03-29 杭州跨视科技有限公司 Dynamic detection method for liquid medicine bag based on machine vision
WO2023197219A1 (en) * 2022-04-13 2023-10-19 Guangdong Oppo Mobile Telecommunications Corp., Ltd. Cnn-based post-processing filter for video compression with multi-scale feature representation


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination