CN113962893A - Face image restoration method based on multi-scale local self-attention generative adversarial network


Info

Publication number
CN113962893A
Authority
CN
China
Prior art keywords
attention
image
channel
network
module
Prior art date
Legal status
Pending
Application number
CN202111253713.5A
Other languages
Chinese (zh)
Inventor
梁美彦 (Liang Meiyan)
Current Assignee
Shanxi University
Original Assignee
Shanxi University
Priority date
Filing date
Publication date
Application filed by Shanxi University
Priority to CN202111253713.5A
Publication of CN113962893A
Legal status: Pending

Classifications

    • G06T 5/00: Image enhancement or restoration
    • G06F 18/213: Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G06F 18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06N 3/045: Combinations of networks
    • G06N 3/08: Learning methods
    • G06T 7/0012: Biomedical image inspection
    • G06T 2207/10004: Still image; Photographic image
    • G06T 2207/20081: Training; Learning
    • G06T 2207/20084: Artificial neural networks [ANN]
    • G06T 2207/30201: Face


Abstract

The invention relates to a face image restoration method based on a multi-scale local self-attention generative adversarial network, which comprises the following steps: acquiring face images with missing regions and the corresponding masks, and preprocessing them; constructing a multi-scale local self-attention generative adversarial network, and training and modeling it on a defective face image dataset to obtain a face restoration model; and restoring the defective face image under test with the trained multi-scale local self-attention generative adversarial network model. By adding a multi-scale structure and a dual-channel local self-attention module to the generator network, the invention effectively addresses the unstable training, low restoration accuracy and efficiency, lack of symmetry, and mode collapse that generative adversarial networks exhibit on the face restoration problem, and provides an efficient, accurate, and stable method for face restoration.

Description

Face image restoration method based on multi-scale local self-attention generative adversarial network
Technical Field
The invention belongs to the technical field of computer-based face image restoration, and particularly relates to a face image restoration method based on a multi-scale local self-attention generative adversarial network.
Background
Image restoration reconstructs the damaged regions of an image by technical means so that the restored regions are consistent with the surrounding features and the restored image carries the same semantic content as the original. The classic face image restoration algorithms are mainly diffusion-based algorithms and image-block (patch) matching algorithms. These classical algorithms are built on mathematical and physical models, so they require the input image to contain information similar to that of the missing region, such as similar pixels, structures, or image blocks, and they cannot generate new content. When a large area of the image is missing, they cannot reconstruct it effectively.
Image reconstruction methods based on deep generative adversarial networks restore a missing image by learning the distribution of the input images. Compared with traditional reconstruction methods, they can capture high-level semantic information of the image without requiring similar pixels or image blocks in the damaged image, and can generate missing regions that share the semantics of the original image, thereby restoring the image effectively. Such methods can therefore not only repair small missing regions but also reconstruct large missing areas from semantic content, making them an effective approach to face restoration.
At present, the main networks that perform image restoration with generative adversarial networks are the Context Encoder (CE) proposed by Pathak et al. and the Globally and Locally Consistent Image Completion network (GLCIC) proposed by Iizuka et al. Both architectures reconstruct images from semantic information, using reconstruction loss and adversarial loss to guide the generation process. However, the context encoder focuses mainly on the missing region and pastes the original image back over the non-missing region, which can leave a visible repair boundary between the missing and non-missing regions and degrade the integrity of the generated image. The GLCIC network controls the image generation process with a global discriminator and a local discriminator, which removes the repair boundary produced by CE; however, it does not focus on the content of the missing region itself, so the generated image tends to be blurred there.
Disclosure of Invention
The invention aims to solve the technical problems that existing GAN-based image restoration methods do not attend closely to the missing region and produce blurred content there, and provides a face image restoration method based on a multi-scale local self-attention generative adversarial network. The method adopts a multi-scale local self-attention module in the generator and focuses on the information of the missing region, so that high-precision face restoration can be addressed in a targeted manner; adding multi-scale image information also makes the training process more efficient and stable.
In order to solve the above technical problems, the invention adopts the following technical scheme:
a facial image restoration method based on a multi-scale local self-attention generation countermeasure network comprises the following steps:
Step one: acquire an original face image x and a corresponding binary defect mask M; construct a defective face image dataset {x_M | x_M = M ⊙ x} and the corresponding original image dataset {x}, and divide the defective face image dataset into a training set and a test set according to a preset ratio, where ⊙ denotes element-wise multiplication;
Step two: construct a multi-scale local self-attention generative adversarial network consisting of a generator network and a discriminator network, and embed a dual-channel local self-attention module at different scales of the generator network, where the dual-channel local self-attention module comprises a cross-attention channel and a spatial self-attention channel connected in parallel;
Step three: set the network model hyperparameters and train the multi-scale local self-attention generative adversarial network on the defective face image training set, using an Adam optimizer for the generator network and stochastic gradient descent (SGD) for the discriminator network, and optimizing the sum of several loss functions during adversarial training, to obtain the multi-scale local self-attention generative adversarial face restoration model;
Step four: test the trained model on the defective face image test set, and evaluate its restoration performance with the peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) indexes.
Further, the generator network comprises an encoder module, a semantic feature repair module, and a decoder module, and the encoder and decoder modules are essentially symmetrical in structure;
The encoder module comprises several encoding feature extraction modules, each consisting of an encoding convolution layer, batch normalization, and a Leaky ReLU activation function. Each encoding convolution layer extracts features from the input defective image with convolution kernels of size k, scanning stride s, and padding of p pixels; batch normalization follows each convolution, and the result is activated by the nonlinear Leaky ReLU function. As the number of convolution layers increases, the extracted features gradually evolve from low-level color and texture features to high-level abstract features based on image semantics; the encoding operations compress the input defective image into feature maps of different scales;
The semantic feature repair module comprises several feature restoration modules, each consisting of a dilated convolution layer, batch normalization, and a Leaky ReLU activation function. Each dilated convolution layer uses 3 × 3 dilated convolution kernels, and the dilation rate of the t-th layer is 2^t, where t = 1, 2, …, T₀; the module performs semantic feature extraction and face image restoration on the compressed feature maps;
The decoder module consists of several decoding feature mapping modules, dual-channel local self-attention modules at m scales, several upsampling modules, and a nonlinear image equalization module. Each decoding feature mapping module consists of a decoding convolution layer, batch normalization, and a Leaky ReLU activation function; each upsampling module consists of a deconvolution layer, batch normalization, and a Leaky ReLU activation function; the nonlinear image equalization module consists of a decoding convolution layer followed by a Tanh activation function; and a dual-channel local self-attention module is placed before each upsampling module. The concrete connection is as follows: the first decoding feature mapping module is connected to the second decoding feature mapping module to extract the feature map of the corresponding scale; the second decoding feature mapping module is connected to the m-th-scale dual-channel local self-attention module, which repairs the missing information by focusing on the difference between the known and missing regions of the image; the m-th-scale dual-channel local self-attention module is connected to the 1st upsampling module, which upsamples the image through a deconvolution operation followed by batch normalization and Leaky ReLU activation; the 1st upsampling module is connected through the third decoding feature mapping module to the (m+1)-th-scale dual-channel local self-attention module, whose function is to focus once more, at the (m+1)-th scale, on the difference between the known and missing regions of the upsampled feature map so as to repair and adjust the missing information, thereby repairing the feature map at multiple scales; the (m+1)-th-scale dual-channel local self-attention module is connected to the 2nd upsampling module, and the 2nd upsampling module is connected through the fourth decoding feature mapping module to the nonlinear image equalization module, which converts the result into a three-channel RGB image, thus realizing effective reconstruction of the image.
Furthermore, the discriminator network comprises several feature discrimination modules, each consisting of a discrimination convolution layer, batch normalization, and a Leaky ReLU activation function. Each discrimination convolution layer extracts and compresses features of the reconstructed image with convolution kernels of size k' and scanning stride s'; a probability value is finally output to judge the restoration quality of the generated image, and the number of channels of the feature map output by each discrimination convolution layer of the discriminator network is at least double that of the corresponding convolution layer of the generator network.
Further, the feature map obtained before each dual-channel local self-attention module in the decoder is convolved to produce an RGB image of the corresponding scale, which is compared with the real image of the same scale through an L₂ reconstruction loss during restoration; comparing against real images at every scale progressively controls the generation of the face image and makes the training process more stable.
Further, the cross-attention channel of the dual-channel local self-attention module repairs the defective region of the image by focusing attention between the missing and non-missing regions, specifically:
(I) the input of each channel of the dual-channel local self-attention module is the feature map F before each deconvolution layer in the decoder, of size M₁ × M₂ × C, where M₁, M₂, and C are the height in pixels, the width in pixels, and the number of channels of the feature map F, respectively;
(II) the feature map F is divided into a defective region and a non-defective region according to the mask, the defective region being defined as the foreground F_f and the non-defective region as the background F_b;
(III) the foreground F_f and the background F_b are reshaped into one-dimensional vectors of size P_f × C and P_b' × C, where P_f = m₁ × m₂ and P_b' = (M₁ × M₂) − (m₁ × m₂); m₁ and m₂ are the height and width in pixels of the foreground F_f, and P_f and P_b' are the numbers of foreground and background pixels;
(IV) in the cross-attention channel, one-dimensional convolutions are applied to the reshaped foreground F_f and background F_b to obtain the transformation feature Q of the foreground F_f and the two transformation features K and V of the background F_b: Q = W_q F_f, K = W_k F_b, and V = W_v F_b, where W_q, W_k, and W_v are the feature transformation matrices of the cross-attention channel and are learnable parameters of the network;
(V) the element E_ij of the attention map E of the cross-attention channel can be expressed as:

E_ij = exp(Q_i K_j^T) / Σ_j exp(Q_i K_j^T)

where the subscripts i and j index the elements of the corresponding quantities and the superscript T denotes transposition;
(VI) the output of the cross-attention channel is:

Y_f = β₁ · pad(V^T E^T)

where β₁ is the weight assignment parameter of the cross-attention channel and a learnable parameter of the network, pad(·) denotes the zero-padding operation, V^T is the transpose of the background transformation feature V in the cross-attention channel, and E^T is the transpose of the attention map E in the cross-attention channel.
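As an illustration, the cross-attention channel could be realized along the following lines in PyTorch; the class name, the softmax normalization of E, the realization of W_q, W_k, and W_v as 1 × 1 convolutions, and the mask convention (M = 0 marks the missing region) are assumptions of this sketch, not the patent's reference implementation.

```python
import torch
import torch.nn as nn


class CrossAttentionChannel(nn.Module):
    """Sketch: the foreground (missing region) queries attend over the
    background (known region) keys and values."""

    def __init__(self, channels: int):
        super().__init__()
        # W_q, W_k, W_v realized as pointwise 1-D convolutions (an assumption).
        self.w_q = nn.Conv1d(channels, channels, kernel_size=1)
        self.w_k = nn.Conv1d(channels, channels, kernel_size=1)
        self.w_v = nn.Conv1d(channels, channels, kernel_size=1)
        self.beta = nn.Parameter(torch.zeros(1))   # weight assignment parameter beta_1

    def forward(self, feat: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # feat: (B, C, H, W); mask: (B, 1, H, W) binary defect mask, 0 = missing
        b, c, h, w = feat.shape
        flat = feat.view(b, c, h * w)
        fg = mask.view(b, 1, h * w) < 0.5           # foreground = missing pixels
        out = torch.zeros_like(flat)
        for i in range(b):                          # per sample, since P_f varies
            f_f = flat[i:i + 1, :, fg[i, 0]]        # (1, C, P_f)
            f_b = flat[i:i + 1, :, ~fg[i, 0]]       # (1, C, P_b')
            q, k, v = self.w_q(f_f), self.w_k(f_b), self.w_v(f_b)
            attn = torch.softmax(q.transpose(1, 2) @ k, dim=-1)  # E: (1, P_f, P_b')
            y_f = v @ attn.transpose(1, 2)          # (1, C, P_f)
            # pad(.): scatter the attended foreground back into the full map
            out[i:i + 1, :, fg[i, 0]] = self.beta * y_f
        return out.view(b, c, h, w)
```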
Further, the spatial self-attention channel focuses attention within the missing region and captures the internal relations of facial image features to restore the image, specifically:
(i) the spatial self-attention channel reshapes the foreground F_f into a one-dimensional vector of size P_f × C and applies three one-dimensional convolutions to obtain the three transformation features Q', K', and V' of the foreground F_f: Q' = W'_q F_f, K' = W'_k F_f, and V' = W'_v F_f, where W'_q, W'_k, and W'_v are the feature transformation matrices of the spatial self-attention channel and are learnable parameters of the network;
(ii) the element E'_ij of the attention map E' of the spatial self-attention channel can be expressed as:

E'_ij = exp(Q'_i K'_j^T) / Σ_j exp(Q'_i K'_j^T)

where the subscripts i and j index the elements of the corresponding quantities and the superscript T denotes transposition;
(iii) the output of the spatial self-attention channel is:

Y'_f = β₂ · pad(V'^T E'^T)

where β₂ is the weight assignment parameter of the spatial self-attention channel and a learnable parameter of the network, pad(·) denotes the zero-padding operation, V'^T is the transpose of the foreground transformation feature V' in the spatial self-attention channel, and E'^T is the transpose of the attention map E' in the spatial self-attention channel;
(iv) the feature maps of the cross-attention channel and the spatial self-attention channel are fused to obtain the refined feature map Y, expressed as Y = conv(Y_f + Y'_f), where conv(·) denotes a 1 × 1 convolution operation.
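Building on the cross-attention sketch above, the spatial self-attention channel and the 1 × 1 fusion convolution might be sketched as follows, under the same assumptions:

```python
import torch
import torch.nn as nn


class SpatialSelfAttentionChannel(nn.Module):
    """Sketch: the foreground attends to itself to capture the internal
    relations of facial features inside the missing region."""

    def __init__(self, channels: int):
        super().__init__()
        self.w_q = nn.Conv1d(channels, channels, kernel_size=1)
        self.w_k = nn.Conv1d(channels, channels, kernel_size=1)
        self.w_v = nn.Conv1d(channels, channels, kernel_size=1)
        self.beta = nn.Parameter(torch.zeros(1))   # beta_2

    def forward(self, feat, mask):
        b, c, h, w = feat.shape
        flat = feat.view(b, c, h * w)
        fg = mask.view(b, 1, h * w) < 0.5
        out = torch.zeros_like(flat)
        for i in range(b):
            f_f = flat[i:i + 1, :, fg[i, 0]]        # (1, C, P_f)
            q, k, v = self.w_q(f_f), self.w_k(f_f), self.w_v(f_f)
            attn = torch.softmax(q.transpose(1, 2) @ k, dim=-1)  # E': (1, P_f, P_f)
            out[i:i + 1, :, fg[i, 0]] = self.beta * (v @ attn.transpose(1, 2))
        return out.view(b, c, h, w)


class DualChannelLocalSelfAttention(nn.Module):
    """Parallel cross-attention and spatial self-attention fused by a
    1 x 1 convolution: Y = conv(Y_f + Y'_f)."""

    def __init__(self, channels: int):
        super().__init__()
        self.cross = CrossAttentionChannel(channels)
        self.spatial = SpatialSelfAttentionChannel(channels)
        self.fuse = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, feat, mask):
        return self.fuse(self.cross(feat, mask) + self.spatial(feat, mask))
```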
Further, the loss function includes: the multi-scale reconstruction loss L_m, the adversarial loss of the reconstructed image L_adv, the perceptual loss L_perceptual, the style loss L_style, and the total variation loss L_TV, specifically:
The multi-scale reconstruction loss is defined as:

L_m = Σ_{i=1}^{m} λ_i ‖S_i − T_i‖₂

where x_M = M ⊙ x denotes the defective image, x the original image, M the binary mask, G(·) the generated image, S_i the RGB output image at the i-th scale extracted from the decoder, T_i the real image at the i-th scale, and λ_i the weight of each scale;
The adversarial loss L_adv of the reconstructed image is derived from the cost function of the adversarial training,

min_G max_D E_x[log D(x)] + E_{x_M}[log(1 − D(G(x_M)))]

which after transformation gives the adversarial loss of the reconstructed image as:

L_adv = E_{x_M}[log(1 − D(G(x_M)))]
The perceptual loss is expressed as:

L_perceptual = Σ_{j=1}^{N} (1 / (H_j W_j C_j)) ‖φ_j(x̂) − φ_j(x)‖₂
The style loss is expressed as:

L_style = Σ_{j=1}^{N} ‖G_j^φ(x̂) − G_j^φ(x)‖₂
The total variation loss is expressed as:

L_TV = Σ_{h,w,c} ( |x̂_{h+1,w,c} − x̂_{h,w,c}| + |x̂_{h,w+1,c} − x̂_{h,w,c}| )
The total loss is: L = α₁L_m + α₂L_adv + α₃L_perceptual + α₄L_style + α₅L_TV
where x̂ and x denote the restored face image and the real face image, φ denotes the VGG-16 network, φ_j(x̂) and φ_j(x) denote the layer-j feature maps of the restored and real images extracted by the VGG-16 network, H_j, W_j, and C_j are the height, width, and number of channels of the layer-j feature map, N is the number of layers in the VGG-16 feature extractor, D(·) denotes the discriminator's judgment of the bracketed image, E_x(·) denotes the expectation over the distribution, G_j^φ(·) denotes the Gram matrix of the VGG-16 layer-j feature map, ‖·‖₂ denotes the L₂ norm, x̂_{h,w,c} denotes the pixel value of the restored RGB face image at height h, width w, and channel c, and {α₁, …, α₅} are the weights of the individual losses in the total loss function.
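For illustration, the total loss could be assembled as below; the torchvision VGG-16 layer slices, the squared-L2 feature distances, and the helper names are assumptions of this sketch, with the α and λ weights left as placeholders:

```python
import torch
import torch.nn.functional as F
from torchvision.models import vgg16


class VGGFeatures(torch.nn.Module):
    """Frozen VGG-16 slices for the perceptual and style losses
    (the choice of N = 3 layers, up to relu1_2/relu2_2/relu3_3, is assumed)."""

    def __init__(self):
        super().__init__()
        feats = vgg16(weights="IMAGENET1K_V1").features.eval()
        self.slices = torch.nn.ModuleList([feats[:4], feats[4:9], feats[9:16]])
        for p in self.parameters():
            p.requires_grad = False

    def forward(self, x):
        outs = []
        for s in self.slices:
            x = s(x)
            outs.append(x)
        return outs


def gram(f):
    # Gram matrix of a feature map, normalized by its size.
    b, c, h, w = f.shape
    f = f.view(b, c, h * w)
    return f @ f.transpose(1, 2) / (c * h * w)


def total_loss(restored, real, scale_outs, scale_reals, d_fake, vgg,
               alphas=(1.0, 1.0, 1.0, 1.0, 1.0), lambdas=(0.4, 0.6, 0.8)):
    # L_m: L2 reconstruction at each decoder scale, S_i against T_i.
    l_m = sum(l * F.mse_loss(s, t)
              for l, s, t in zip(lambdas, scale_outs, scale_reals))
    # L_adv: generator term E[log(1 - D(G(x_M)))], the saturating form above.
    l_adv = torch.log(1.0 - d_fake + 1e-8).mean()
    # L_perceptual and L_style on VGG-16 features.
    f_res, f_real = vgg(restored), vgg(real)
    l_per = sum(F.mse_loss(a, b) for a, b in zip(f_res, f_real))
    l_sty = sum(F.mse_loss(gram(a), gram(b)) for a, b in zip(f_res, f_real))
    # L_TV: total variation over the restored image.
    l_tv = (restored[:, :, 1:, :] - restored[:, :, :-1, :]).abs().mean() + \
           (restored[:, :, :, 1:] - restored[:, :, :, :-1]).abs().mean()
    a1, a2, a3, a4, a5 = alphas
    return a1 * l_m + a2 * l_adv + a3 * l_per + a4 * l_sty + a5 * l_tv
```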
Further, by optimizing the sum of the loss functions, the parameters θ_d, θ_g = argmin(L) of the multi-scale local self-attention generative adversarial network are obtained, yielding the restored face image x̂ = G(x_M; θ_g), where θ_d and θ_g are the parameters of the discriminator and generator networks.
The invention has the beneficial effects that:
1. By adding the dual-channel local self-attention module to the generator, the network captures the internal relations of facial image features through attention between the missing and non-missing regions and self-attention within the missing region, which improves learning efficiency, enables restoration of fine facial details, and provides an effective route to high-precision reconstruction of missing face images;
2. The method adds the multi-scale local self-attention mechanism at every scale of the image generation process and progressively controls the generation of the face image, so that the dual-channel local self-attention module takes effect at each scale and the training process is more stable;
3. The generator adopts skip connections, which strengthens the expression and repair of high-level semantic information of the image and avoids mode collapse;
4. The invention adopts a 'high-capacity' discriminator network, where 'high capacity' means that the number of channels of the feature map output by each discrimination convolution layer is at least double that of the corresponding convolution layer of the generator network. The high-capacity discriminator examines a large number of feature maps of the generated image, so small differences between the restored and original images are discriminated effectively and the precision of the restored image is improved.
Drawings
FIG. 1 is a schematic diagram of the overall framework of the invention for restoring defective face images;
FIG. 2 is a schematic diagram of the operation of the self-attention mechanism of the invention;
FIG. 3 is a schematic diagram of test results of the invention for restoring defective face images.
Detailed Description
The invention is described in detail below with reference to the figures and examples.
In this embodiment, a face image restoration method based on a multi-scale local self-attention generative adversarial network includes:
Step one: acquire original face images x and corresponding binary defect masks M; construct the defective face image dataset {x_M | x_M = M ⊙ x}, where ⊙ denotes element-wise multiplication, and the corresponding original image dataset {x}; preprocess the acquired defective face images, i.e., uniformly resize them to N₀ × N₀ pixels, where N₀ = 128 is the number of pixels in the width and height dimensions, and normalize them before they enter the network; divide the preprocessed face image data into a training set and a test set in the ratio 10:1. In this embodiment, a dataset of 22,000 distinct face images is divided 10:1, giving 20k training images and 2k test images;
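A minimal preprocessing and split sketch, assuming torchvision-style loading, a hypothetical data directory, and a [-1, 1] normalization range:

```python
from torch.utils.data import random_split
from torchvision import transforms
from torchvision.datasets import ImageFolder

N0 = 128  # images are uniformly resized to N0 x N0

preprocess = transforms.Compose([
    transforms.Resize((N0, N0)),
    transforms.ToTensor(),                       # scales pixels to [0, 1]
    transforms.Normalize([0.5] * 3, [0.5] * 3),  # normalization range assumed
])

# ImageFolder expects class subdirectories; the path is hypothetical.
dataset = ImageFolder("data/faces", transform=preprocess)
n_test = len(dataset) // 11                      # 10 : 1 train/test split
train_set, test_set = random_split(dataset, [len(dataset) - n_test, n_test])
# e.g. 22,000 images -> 20k training and 2k test images, as in this embodiment
```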
Step two: construct the multi-scale local self-attention generative adversarial network, consisting of a generator network and a discriminator network, and embed a dual-channel local self-attention module at different scales of the generator network; the dual-channel local self-attention module comprises a cross-attention channel and a spatial self-attention channel connected in parallel, as shown in FIG. 1. The generator network comprises an encoder module, a semantic feature repair module, and a decoder module, and the encoder and decoder modules are essentially symmetrical in structure;
The encoder module comprises 6 encoding feature extraction modules, each consisting of an encoding convolution layer, batch normalization, and a Leaky ReLU activation function. The convolution kernel size k, stride s, and feature-map padding p of the encoding convolution layers are {k, s, p} = {(5,1,1); (3,2,1); (3,1,1); (3,2,1); (3,1,1); (3,1,1)}, generating feature maps of sizes 128 × 128 × 64, 64 × 64 × 128, 64 × 64 × 128, 32 × 32 × 256, 32 × 32 × 256, and 32 × 32 × 256, respectively. Batch normalization follows each encoding convolution layer, and the result is activated by a Leaky ReLU function with slope 0.2. As the number of convolution layers increases, the extracted features gradually evolve from low-level color and texture features to high-level abstract features based on image semantics; the encoding operations compress the input defective image into feature maps of different scales;
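The encoder could be sketched as follows; note that reproducing the stated 128 × 128 first-layer output with a 5 × 5 kernel requires padding 2 rather than the listed 1, which is assumed here:

```python
import torch.nn as nn

def enc_block(c_in, c_out, k, s, p):
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=k, stride=s, padding=p),
        nn.BatchNorm2d(c_out),
        nn.LeakyReLU(0.2),
    )

encoder = nn.Sequential(           # input: 3 x 128 x 128
    enc_block(3, 64, 5, 1, 2),     # -> 64  x 128 x 128 (padding 2 assumed)
    enc_block(64, 128, 3, 2, 1),   # -> 128 x 64 x 64
    enc_block(128, 128, 3, 1, 1),  # -> 128 x 64 x 64
    enc_block(128, 256, 3, 2, 1),  # -> 256 x 32 x 32
    enc_block(256, 256, 3, 1, 1),  # -> 256 x 32 x 32
    enc_block(256, 256, 3, 1, 1),  # -> 256 x 32 x 32
)
```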
The semantic feature repair module comprises 4 feature restoration modules, each consisting of a dilated convolution layer, batch normalization, and a Leaky ReLU activation function. Each dilated convolution layer uses 3 × 3 dilated convolution kernels with dilation rates of 2, 4, 8, and 16, respectively, and performs semantic feature extraction and face image restoration on the compressed feature maps;
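A sketch of the four dilated-convolution modules, assuming the 256-channel width of the encoder output and padding equal to the dilation rate so the 32 × 32 size is preserved:

```python
import torch.nn as nn

semantic_repair = nn.Sequential(*[
    nn.Sequential(
        # padding = dilation keeps the 32 x 32 spatial size with 3 x 3 kernels
        nn.Conv2d(256, 256, kernel_size=3, dilation=d, padding=d),
        nn.BatchNorm2d(256),
        nn.LeakyReLU(0.2),
    )
    for d in (2, 4, 8, 16)
])
```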
The decoder consists of 4 decoding feature mapping modules, dual-channel local self-attention modules at 2 scales, 2 upsampling modules, and 1 nonlinear image equalization module. Each decoding feature mapping module consists of a decoding convolution layer, batch normalization, and a Leaky ReLU activation function; each upsampling module consists of a deconvolution layer, batch normalization, and a Leaky ReLU activation function; the nonlinear image equalization module consists of a decoding convolution layer followed by a Tanh activation function; and a dual-channel local self-attention module is placed before each upsampling module. The concrete connection is as follows: the first decoding feature mapping module, with 3 × 3 convolution kernels, is connected to the second decoding feature mapping module to extract feature maps of the corresponding scale; the second decoding feature mapping module is connected to the first-scale dual-channel local self-attention module, which repairs the missing information of the 32 × 32 feature map by focusing on the difference between the known and missing regions of the image; the first-scale dual-channel local self-attention module is connected to the 1st upsampling module, which upsamples the image to 64 × 64 through a deconvolution with 4 × 4 kernels; the 1st upsampling module is connected through the third decoding feature mapping module, with 3 × 3 kernels, to the second-scale dual-channel local self-attention module, which focuses once more, at the second scale, on the difference between the known and missing regions of the upsampled feature map to repair and adjust the missing information of the 64 × 64 feature map, thereby repairing the feature map at multiple scales. The second-scale dual-channel local self-attention module is connected to the 2nd upsampling module, which upsamples the image again through a deconvolution with 4 × 4 kernels, restoring it to the original 128 × 128 size; the 2nd upsampling module is connected through the fourth decoding feature mapping module, with 3 × 3 kernels, to the nonlinear image equalization module, which converts the result into a three-channel RGB image, thus realizing effective reconstruction of the image.
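One upsampling module and the nonlinear image equalization module might be sketched as follows; the 64-channel input to the final layer is an assumption:

```python
import torch.nn as nn

def up_block(c_in, c_out):
    # 4 x 4 deconvolution with stride 2 doubles the spatial size: 32 -> 64 -> 128
    return nn.Sequential(
        nn.ConvTranspose2d(c_in, c_out, kernel_size=4, stride=2, padding=1),
        nn.BatchNorm2d(c_out),
        nn.LeakyReLU(0.2),
    )

to_rgb = nn.Sequential(                 # nonlinear image equalization module
    nn.Conv2d(64, 3, kernel_size=3, padding=1),
    nn.Tanh(),                          # three-channel RGB output in [-1, 1]
)
```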
The feature map obtained before each dual-channel local self-attention module in the decoder is convolved to produce an RGB image of the corresponding scale, which is compared with the real image of the same scale through an L₂ reconstruction loss during restoration; comparing against real images at every scale progressively controls the generation of the face image and makes the training process more stable.
The discriminator network comprises 6 feature discrimination modules, each consisting of a discrimination convolution layer, batch normalization, and a Leaky ReLU activation function. The first 5 discrimination convolution layers use 4 × 4 kernels with a 2 × 2 stride, and the numbers of channels of the generated feature maps, 128, 128, 256, 512, and 1024, are roughly 2-4 times those of the corresponding generator layers, giving 'high-capacity' feature maps. The output tensor after the 5th convolution is of size 4 × 4 × 1024; features are extracted from it once more with a 4 × 4 kernel and activated by a Sigmoid function, outputting a 1 × 1 × 1 probability value that represents how real the input image is. Batch normalization is added after the convolution layers of both the generator and discriminator networks to normalize the convolved feature maps and accelerate network convergence.
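The discriminator network of the embodiment could be sketched as follows, assuming a 128 × 128 input and the channel sequence stated above:

```python
import torch.nn as nn

def disc_block(c_in, c_out):
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=4, stride=2, padding=1),
        nn.BatchNorm2d(c_out),
        nn.LeakyReLU(0.2),
    )

discriminator = nn.Sequential(          # input: 3 x 128 x 128
    disc_block(3, 128),                 # -> 128 x 64 x 64
    disc_block(128, 128),               # -> 128 x 32 x 32
    disc_block(128, 256),               # -> 256 x 16 x 16
    disc_block(256, 512),               # -> 512 x 8 x 8
    disc_block(512, 1024),              # -> 1024 x 4 x 4
    nn.Conv2d(1024, 1, kernel_size=4),  # -> 1 x 1 x 1
    nn.Sigmoid(),                       # probability that the input image is real
)
```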
In this embodiment, a dual-channel local self-attention module is added before each deconvolution layer of the decoder module, as shown in FIG. 2. The dual-channel local self-attention module comprises a cross-attention channel and a spatial self-attention channel connected in parallel, and the generated feature map passes through both channels; the network model restores the image along two dimensions, the feature information of the known region and the self-attention within the unknown region, thereby achieving high-precision, high-efficiency reconstruction of the missing region of the face image.
The cross-attention channel restores the image by focusing attention between the missing and non-missing regions, specifically:
(I) the input of each channel of the dual-channel local self-attention module is the feature map F before each deconvolution layer in the decoder, of size M₁ × M₂ × C, where M₁, M₂, and C are the height in pixels, the width in pixels, and the number of channels of the feature map F, respectively;
(II) the feature map F is divided into a defective region and a non-defective region according to the mask, the defective region being defined as the foreground F_f and the non-defective region as the background F_b;
(III) the foreground F_f and the background F_b are reshaped into one-dimensional vectors of size P_f × C and P_b' × C, where P_f = m₁ × m₂ and P_b' = (M₁ × M₂) − (m₁ × m₂); m₁ and m₂ are the height and width in pixels of the foreground F_f, P_f and P_b' are the numbers of foreground and background pixels, and C is the number of channels;
(IV) in the cross-attention channel, one-dimensional convolutions are applied to the reshaped foreground F_f and background F_b to obtain the transformation feature Q of the foreground F_f and the two transformation features K and V of the background F_b: Q = W_q F_f, K = W_k F_b, and V = W_v F_b, where W_q, W_k, and W_v are the feature transformation matrices of the cross-attention channel and are learnable parameters of the network;
(V) the element E_ij of the attention map E of the cross-attention channel can be expressed as:

E_ij = exp(Q_i K_j^T) / Σ_j exp(Q_i K_j^T)

where the subscripts i and j index the elements of the corresponding quantities and the superscript T denotes transposition;
(VI) the output of the cross-attention channel is:

Y_f = β₁ · pad(V^T E^T)

where β₁ is the weight assignment parameter of the cross-attention channel and a learnable parameter of the network, pad(·) denotes the zero-padding operation, V^T is the transpose of the background transformation feature V in the cross-attention channel, and E^T is the transpose of the attention map E in the cross-attention channel.
The spatial self-attention channel focuses attention within the missing region and captures the internal relations of facial image features to restore the face image, specifically:
(i) the spatial self-attention channel reshapes the foreground F_f into a one-dimensional vector of size P_f × C and applies three one-dimensional convolutions to obtain the three transformation features Q', K', and V' of the foreground F_f: Q' = W'_q F_f, K' = W'_k F_f, and V' = W'_v F_f, where W'_q, W'_k, and W'_v are the feature transformation matrices of the spatial self-attention channel and are learnable parameters of the network;
(ii) the element E'_ij of the attention map E' of the spatial self-attention channel can be expressed as:

E'_ij = exp(Q'_i K'_j^T) / Σ_j exp(Q'_i K'_j^T)

where the subscripts i and j index the elements of the corresponding quantities and the superscript T denotes transposition;
(iii) the output of the spatial self-attention channel is:

Y'_f = β₂ · pad(V'^T E'^T)

where β₂ is the weight assignment parameter of the spatial self-attention channel and a learnable parameter of the network, pad(·) denotes the zero-padding operation, V'^T is the transpose of the foreground transformation feature V' in the spatial self-attention channel, and E'^T is the transpose of the attention map E' in the spatial self-attention channel;
(iv) the feature maps of the cross-attention channel and the spatial self-attention channel are fused to obtain the refined feature map Y, expressed as Y = conv(Y_f + Y'_f), where conv(·) denotes a 1 × 1 convolution operation.
Step three: set the network model hyperparameters, which include the initial learning rate γ, the optimization algorithms of the discriminator and generator networks, the batch size, and the number of iterations (epochs), with values γ = 0.001, batch size = 64, and epoch = 200. Train the multi-scale local self-attention generative adversarial network on the defective face image training set, using an Adam optimizer for the generator network and stochastic gradient descent (SGD) for the discriminator network. During adversarial training, the sum of several loss functions is optimized to obtain the network parameters θ_d, θ_g = argmin(L) and thus the restored face image x̂ = G(x_M; θ_g), where θ_d and θ_g are the parameters of the discriminator and generator networks; this yields the multi-scale local self-attention generative adversarial face restoration model;
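A minimal training-loop sketch under the embodiment's hyperparameters (γ = 0.001, batch size 64, 200 epochs), with Adam driving the generator and SGD the discriminator; Generator, Discriminator, train_loader, and scale_targets are hypothetical placeholders, and total_loss and VGGFeatures refer to the loss sketch given earlier:

```python
import torch

gen, disc = Generator(), Discriminator()        # hypothetical model classes
vgg = VGGFeatures()                             # from the loss sketch above
opt_g = torch.optim.Adam(gen.parameters(), lr=1e-3)   # Adam for the generator
opt_d = torch.optim.SGD(disc.parameters(), lr=1e-3)   # SGD for the discriminator
bce = torch.nn.BCELoss()

for epoch in range(200):
    for x, mask in train_loader:                # assumed DataLoader, batch size 64
        x_m = mask * x                          # defective image x_M = M ⊙ x
        restored, scale_outs = gen(x_m, mask)   # restored image + per-scale RGB outputs

        # Discriminator step (SGD): push real images to 1, restored images to 0.
        opt_d.zero_grad()
        p_real, p_fake = disc(x), disc(restored.detach())
        d_loss = bce(p_real, torch.ones_like(p_real)) + \
                 bce(p_fake, torch.zeros_like(p_fake))
        d_loss.backward()
        opt_d.step()

        # Generator step (Adam): minimize the weighted sum of the five losses.
        # scale_targets: hypothetical helper yielding the per-scale targets T_i.
        opt_g.zero_grad()
        g_loss = total_loss(restored, x, scale_outs, scale_targets(x),
                            disc(restored), vgg,
                            alphas=(100.0, 10.0, 1.0, 1.0, 1.0))
        g_loss.backward()
        opt_g.step()
```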
The loss function includes: the multi-scale reconstruction loss L_m, the adversarial loss of the reconstructed image L_adv, the perceptual loss L_perceptual, the style loss L_style, and the total variation loss L_TV, specifically:
The multi-scale reconstruction loss is defined as:

L_m = Σ_{i=1}^{m} λ_i ‖S_i − T_i‖₂

where x_M = M ⊙ x denotes the defective image, x the original image, M the binary mask, G(·) the generated image, S_i the RGB output image at the i-th scale extracted from the decoder, T_i the real image at the i-th scale, and λ_i the weight of each scale; in this embodiment the total number of scales is m = 3 and the corresponding weights are 0.4, 0.6, and 0.8;
The adversarial loss L_adv of the reconstructed image is derived from the cost function of the adversarial training,

min_G max_D E_x[log D(x)] + E_{x_M}[log(1 − D(G(x_M)))]

which after transformation gives the adversarial loss of the reconstructed image as:

L_adv = E_{x_M}[log(1 − D(G(x_M)))]
The perceptual loss is expressed as:

L_perceptual = Σ_{j=1}^{N} (1 / (H_j W_j C_j)) ‖φ_j(x̂) − φ_j(x)‖₂
The style loss is expressed as:

L_style = Σ_{j=1}^{N} ‖G_j^φ(x̂) − G_j^φ(x)‖₂
The total variation loss is expressed as:

L_TV = Σ_{h,w,c} ( |x̂_{h+1,w,c} − x̂_{h,w,c}| + |x̂_{h,w+1,c} − x̂_{h,w,c}| )
The total loss is: L = α₁L_m + α₂L_adv + α₃L_perceptual + α₄L_style + α₅L_TV
where x̂ and x denote the restored face image and the real face image, φ denotes the VGG-16 network, φ_j(x̂) and φ_j(x) denote the layer-j feature maps of the restored and real images extracted by the VGG-16 network, H_j, W_j, and C_j are the height, width, and number of channels of the layer-j feature map, N is the number of layers in the VGG-16 feature extractor, D(·) denotes the discriminator's judgment of the bracketed image, E_x(·) denotes the expectation over the distribution, G_j^φ(·) denotes the Gram matrix of the VGG-16 layer-j feature map, ‖·‖₂ denotes the L₂ norm, x̂_{h,w,c} denotes the pixel value of the restored RGB face image at height h, width w, and channel c, and {α₁, …, α₅} are the weights of the individual losses in the total loss function. In this embodiment, {α₁, …, α₅} are set to 100, 10, 1, 1, and 1.
Step four: test the trained multi-scale local self-attention generative adversarial face restoration model on the defective face image test set, and evaluate its restoration performance with the peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) indexes.
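Evaluation could use the scikit-image reference implementations of these two metrics, assuming 8-bit RGB arrays:

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate(restored: np.ndarray, original: np.ndarray):
    """restored/original: H x W x 3 uint8 arrays of the repaired and real images."""
    psnr = peak_signal_noise_ratio(original, restored, data_range=255)
    ssim = structural_similarity(original, restored, channel_axis=2, data_range=255)
    return psnr, ssim
```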
FIG. 3 shows the restoration results of the multi-scale local self-attention generative adversarial network model on the 2k-image face defect test set, where the peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) reach 25.39 and 0.87, respectively. The method not only improves network learning efficiency but also restores fine facial details, demonstrating its excellent performance in repairing defective face images.

Claims (8)

1. A face image restoration method based on a multi-scale local self-attention generative adversarial network, characterized by comprising the following steps:
step one: acquire an original face image x and a corresponding binary defect mask M; construct a defective face image dataset {x_M | x_M = M ⊙ x} and the corresponding original image dataset {x}, and divide the defective face image dataset into a training set and a test set according to a preset ratio, where ⊙ denotes element-wise multiplication;
step two: construct a multi-scale local self-attention generative adversarial network consisting of a generator network and a discriminator network, and embed a dual-channel local self-attention module at different scales of the generator network, where the dual-channel local self-attention module comprises a cross-attention channel and a spatial self-attention channel connected in parallel;
step three: set the network model hyperparameters and train the multi-scale local self-attention generative adversarial network on the defective face image training set, using an Adam optimizer for the generator network and stochastic gradient descent (SGD) for the discriminator network, and optimizing the sum of several loss functions during adversarial training, to obtain the multi-scale local self-attention generative adversarial face restoration model;
step four: test the trained model on the defective face image test set, and evaluate its restoration performance with the peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) indexes.
2. The face image restoration method based on a multi-scale local self-attention generative adversarial network according to claim 1, characterized in that: the generator network comprises an encoder module, a semantic feature repair module, and a decoder module, and the encoder and decoder modules are essentially symmetrical in structure;
the encoder module comprises several encoding feature extraction modules, each consisting of an encoding convolution layer, batch normalization, and a Leaky ReLU activation function; each encoding convolution layer extracts features from the input defective image with convolution kernels of size k, scanning stride s, and padding of p pixels, batch normalization follows each convolution, and the result is activated by the nonlinear Leaky ReLU function; as the number of convolution layers increases, the extracted features gradually evolve from low-level color and texture features to high-level abstract features based on image semantics, and the encoding operations compress the input defective image into feature maps of different scales;
the semantic feature repair module comprises several feature restoration modules, each consisting of a dilated convolution layer, batch normalization, and a Leaky ReLU activation function; each dilated convolution layer uses 3 × 3 dilated convolution kernels, and the dilation rate of the t-th layer is 2^t, where t = 1, 2, …, T₀; the module performs semantic feature extraction and face image restoration on the compressed feature maps;
the decoder module consists of several decoding feature mapping modules, dual-channel local self-attention modules at m scales, several upsampling modules, and a nonlinear image equalization module; each decoding feature mapping module consists of a decoding convolution layer, batch normalization, and a Leaky ReLU activation function; each upsampling module consists of a deconvolution layer, batch normalization, and a Leaky ReLU activation function; the nonlinear image equalization module consists of a decoding convolution layer followed by a Tanh activation function; a dual-channel local self-attention module is placed before each upsampling module; the concrete connection is as follows: the first decoding feature mapping module is connected to the second decoding feature mapping module to extract the feature map of the corresponding scale; the second decoding feature mapping module is connected to the m-th-scale dual-channel local self-attention module, which repairs the missing information by focusing on the difference between the known and missing regions of the image; the m-th-scale dual-channel local self-attention module is connected to the 1st upsampling module, which upsamples the image through a deconvolution operation followed by batch normalization and Leaky ReLU activation; the 1st upsampling module is connected through the third decoding feature mapping module to the (m+1)-th-scale dual-channel local self-attention module, whose function is to focus once more, at the (m+1)-th scale, on the difference between the known and missing regions of the upsampled feature map so as to repair and adjust the missing information, thereby repairing the feature map at multiple scales; the (m+1)-th-scale dual-channel local self-attention module is connected to the 2nd upsampling module, and the 2nd upsampling module is connected through the fourth decoding feature mapping module to the nonlinear image equalization module, which converts the result into a three-channel RGB image, thus realizing effective reconstruction of the image.
3. The face image restoration method based on a multi-scale local self-attention generative adversarial network according to claim 1, characterized in that: the discriminator network comprises several feature discrimination modules, each consisting of a discrimination convolution layer, batch normalization, and a Leaky ReLU activation function; each discrimination convolution layer extracts and compresses features of the reconstructed image with convolution kernels of size k' and scanning stride s'; a probability value is finally output to judge the restoration quality of the generated image, and the number of channels of the feature map output by each discrimination convolution layer of the discriminator network is at least double that of the corresponding convolution layer of the generator network.
4. The face image restoration method based on a multi-scale local self-attention generative adversarial network according to claim 2, characterized in that: the feature map obtained before each dual-channel local self-attention module in the decoder is convolved to produce an RGB image of the corresponding scale, which is compared with the real image of the same scale through an L₂ reconstruction loss during restoration; comparing against real images at every scale progressively controls the generation of the face image and makes the training process more stable.
5. The face image restoration method based on a multi-scale local self-attention generative adversarial network according to claim 1, characterized in that: the cross-attention channel of the dual-channel local self-attention module repairs the defective region of the image by focusing attention between the missing and non-missing regions, specifically:
(I) the input of each channel of the dual-channel local self-attention module is the feature map F before each deconvolution layer in the decoder, of size M₁ × M₂ × C, where M₁, M₂, and C are the height in pixels, the width in pixels, and the number of channels of the feature map F, respectively;
(II) the feature map F is divided into a defective region and a non-defective region according to the mask, the defective region being defined as the foreground F_f and the non-defective region as the background F_b;
(III) the foreground F_f and the background F_b are reshaped into one-dimensional vectors of size P_f × C and P_b' × C, where P_f = m₁ × m₂ and P_b' = (M₁ × M₂) − (m₁ × m₂); m₁ and m₂ are the height and width in pixels of the foreground F_f, and P_f and P_b' are the numbers of foreground and background pixels;
(IV) in the cross-attention channel, one-dimensional convolutions are applied to the reshaped foreground F_f and background F_b to obtain the transformation feature Q of the foreground F_f and the two transformation features K and V of the background F_b: Q = W_q F_f, K = W_k F_b, and V = W_v F_b, where W_q, W_k, and W_v are the feature transformation matrices of the cross-attention channel and are learnable parameters of the network;
(V) the element E_ij of the attention map E of the cross-attention channel can be expressed as:

E_ij = exp(Q_i K_j^T) / Σ_j exp(Q_i K_j^T)

where the subscripts i and j index the elements of the corresponding quantities and the superscript T denotes transposition;
(VI) the output of the cross-attention channel is:

Y_f = β₁ · pad(V^T E^T)

where β₁ is the weight assignment parameter of the cross-attention channel and a learnable parameter of the network, pad(·) denotes the zero-padding operation, V^T is the transpose of the background transformation feature V in the cross-attention channel, and E^T is the transpose of the attention map E in the cross-attention channel.
6. The face image restoration method based on a multi-scale local self-attention generative adversarial network according to claim 1, characterized in that: the spatial self-attention channel focuses attention within the missing region and captures the internal relations of facial image features to restore the image, specifically:
(i) the spatial self-attention channel reshapes the foreground F_f into a one-dimensional vector of size P_f × C and applies three one-dimensional convolutions to obtain the three transformation features Q', K', and V' of the foreground F_f: Q' = W'_q F_f, K' = W'_k F_f, and V' = W'_v F_f, where W'_q, W'_k, and W'_v are the feature transformation matrices of the spatial self-attention channel and are learnable parameters of the network;
(ii) the element E'_ij of the attention map E' of the spatial self-attention channel can be expressed as:

E'_ij = exp(Q'_i K'_j^T) / Σ_j exp(Q'_i K'_j^T)

where the subscripts i and j index the elements of the corresponding quantities and the superscript T denotes transposition;
(iii) the output of the spatial self-attention channel is:

Y'_f = β₂ · pad(V'^T E'^T)

where β₂ is the weight assignment parameter of the spatial self-attention channel and a learnable parameter of the network, pad(·) denotes the zero-padding operation, V'^T is the transpose of the foreground transformation feature V' in the spatial self-attention channel, and E'^T is the transpose of the attention map E' in the spatial self-attention channel;
(iv) the feature maps of the cross-attention channel and the spatial self-attention channel are fused to obtain the refined feature map Y, expressed as Y = conv(Y_f + Y'_f), where conv(·) denotes a 1 × 1 convolution operation.
7. The facial image restoration method based on the multi-scale local self-attention generation countermeasure network as claimed in claim 1, characterized in that: the loss function includes: multi-scale reconstruction loss function LmReconstructed image contrast loss function LadvThe perceptual loss function LperceptualStyle loss function LstyleAnd total variation loss function LTV(ii) a The method specifically comprises the following steps:
the multi-scale reconstruction loss function is defined as:
Figure FDA0003323197540000043
wherein: x is the number ofMX denotes the original image, M denotes a binary mask, x ☉ xMRepresenting a defect image, G (-) representing a generated image, SiRepresenting the RGB output image of the ith scale extracted from the decoder, TiRepresenting the true image of the same image at the ith scale, λiIs the weight of each scale;
loss-fighting function L for reconstructed imagesadvIs derived from a cost function in the confrontation training
Figure FDA0003323197540000044
And transforming to obtain the expression of the resistance loss function of the reconstructed image as follows:
Figure FDA0003323197540000051
the perceptual loss function expression is:
Figure FDA0003323197540000052
the style loss function expression is:
Figure FDA0003323197540000053
the total variation loss function expression is as follows:
Figure FDA0003323197540000054
the total loss function expression is: l ═ alpha1Lm2Ladv3Lperceptual4Lstyle5LTV
where $\hat{x}$ and $x$ denote the restored face image and the real face image respectively, $\phi$ denotes the VGG-16 network, $\phi_j(\hat{x})$ and $\phi_j(x)$ denote the $j$-th-layer feature maps of the restored image and the real image extracted by the VGG-16 network, $H_j$, $W_j$ and $C_j$ denote the height, width and number of channels of the feature map extracted by the VGG-16 network at layer $j$, $N$ is the number of layers in the VGG-16 feature extractor, $D(\cdot)$ denotes the discriminator's judgment of the image in parentheses, $E_x(\cdot)$ denotes the expectation of a distribution function, $G_j^{\phi}(\cdot)$ denotes the Gram matrix of the $j$-th-layer VGG-16 feature map, $\lVert \cdot \rVert_2$ denotes the $L_2$ norm, $\hat{x}_{h,w,c}$ denotes the pixel value of the restored RGB face image at height $h$, width $w$ and channel $c$, and $\{\alpha_1, \ldots, \alpha_5\}$ denote the weight of each loss in the total loss function.
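Under these definitions, the perceptual, style and total-variation terms can be sketched as below; passing the VGG-16 feature maps in as lists is an interface assumption, not something fixed by the claim.

```python
import torch

def gram_matrix(feat: torch.Tensor) -> torch.Tensor:
    # Gram matrix of a feature map, normalized by H_j * W_j * C_j.
    b, c, h, w = feat.shape
    f = feat.flatten(2)  # (B, C_j, H_j * W_j)
    return torch.bmm(f, f.transpose(1, 2)) / (c * h * w)

def perceptual_and_style_loss(feats_restored, feats_real):
    """feats_*: lists of VGG-16 feature maps phi_j(x_hat) and phi_j(x)."""
    l_perc, l_style = 0.0, 0.0
    for fr, fx in zip(feats_restored, feats_real):
        _, c, h, w = fr.shape
        l_perc += torch.norm(fr - fx, p=2) / (c * h * w)
        l_style += torch.norm(gram_matrix(fr) - gram_matrix(fx), p=2)
    return l_perc, l_style

def total_variation_loss(x_hat: torch.Tensor) -> torch.Tensor:
    # Penalizes differences between neighboring pixels along height and width.
    dh = (x_hat[:, :, 1:, :] - x_hat[:, :, :-1, :]).abs().sum()
    dw = (x_hat[:, :, :, 1:] - x_hat[:, :, :, :-1]).abs().sum()
    return dh + dw
```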
8. The face image restoration method based on the multi-scale local self-attention generation countermeasure network as claimed in claim 7, characterized in that the parameters of the multi-scale local self-attention generation countermeasure network are obtained by optimizing the total loss function, $\theta_d, \theta_g = \arg\min(L)$, and the restored face image is then obtained as $\hat{x} = G(x \odot M;\, \theta_g)$; where $\theta_d$ and $\theta_g$ are the parameters of the discriminator network and the generator network respectively.
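A minimal sketch of the alternating optimization that claim 8 describes, reusing the hypothetical loss functions from the sketches above; `generator`, `discriminator`, `vgg_features`, `loader` and the weights `alphas` are assumed to exist, and the optimizer choice is illustrative.

```python
import torch
import torch.nn.functional as F

g_opt = torch.optim.Adam(generator.parameters(), lr=2e-4)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=2e-4)

for x, mask in loader:                  # x: real face image, mask: binary M
    x_m = x * mask                      # defect image x ⊙ M
    x_hat = generator(x_m)              # restored face image G(x ⊙ M)

    # Discriminator step: drive theta_d toward the max_D side of the cost.
    d_loss = discriminator_loss(discriminator(x), discriminator(x_hat.detach()))
    d_opt.zero_grad()
    d_loss.backward()
    d_opt.step()

    # Generator step: drive theta_g toward argmin of the total loss L.
    # A single-scale L1 term stands in for L_m here; the multi-scale version
    # would sum over the decoder's per-scale outputs as sketched earlier.
    l_adv = generator_adversarial_loss(discriminator(x_hat))
    l_perc, l_style = perceptual_and_style_loss(vgg_features(x_hat), vgg_features(x))
    loss = (alphas[0] * F.l1_loss(x_hat, x)
            + alphas[1] * l_adv
            + alphas[2] * l_perc
            + alphas[3] * l_style
            + alphas[4] * total_variation_loss(x_hat))
    g_opt.zero_grad()
    loss.backward()
    g_opt.step()
```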
CN202111253713.5A 2021-10-27 2021-10-27 Face image restoration method based on multi-scale local self-attention generation countermeasure network Pending CN113962893A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111253713.5A CN113962893A (en) 2021-10-27 2021-10-27 Face image restoration method based on multi-scale local self-attention generation countermeasure network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111253713.5A CN113962893A (en) 2021-10-27 2021-10-27 Face image restoration method based on multi-scale local self-attention generation countermeasure network

Publications (1)

Publication Number Publication Date
CN113962893A true CN113962893A (en) 2022-01-21

Family

ID=79467506

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111253713.5A Pending CN113962893A (en) 2021-10-27 2021-10-27 Face image restoration method based on multi-scale local self-attention generation countermeasure network

Country Status (1)

Country Link
CN (1) CN113962893A (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110689499A (en) * 2019-09-27 2020-01-14 北京工业大学 Face image restoration method based on dense expansion convolution self-coding countermeasure network
CN113112411A (en) * 2020-01-13 2021-07-13 南京信息工程大学 Human face image semantic restoration method based on multi-scale feature fusion
CN111275638A (en) * 2020-01-16 2020-06-12 湖南大学 Face restoration method for generating confrontation network based on multi-channel attention selection
CN112184582A (en) * 2020-09-28 2021-01-05 中科人工智能创新技术研究院(青岛)有限公司 Attention mechanism-based image completion method and device

Cited By (27)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114386531B (en) * 2022-01-25 2023-02-14 山东力聚机器人科技股份有限公司 Image identification method and device based on double-stage attention
CN114386531A (en) * 2022-01-25 2022-04-22 山东力聚机器人科技股份有限公司 Image identification method and device based on double-stage attention
CN114494499A (en) * 2022-01-26 2022-05-13 电子科技大学 Sketch coloring method based on attention mechanism
CN114862699A (en) * 2022-04-14 2022-08-05 中国科学院自动化研究所 Face repairing method, device and storage medium based on generation countermeasure network
CN114693577A (en) * 2022-04-20 2022-07-01 合肥工业大学 Infrared polarization image fusion method based on Transformer
CN114693577B (en) * 2022-04-20 2023-08-11 合肥工业大学 Infrared polarized image fusion method based on Transformer
CN114581343A (en) * 2022-05-05 2022-06-03 南京大学 Image restoration method and device, electronic equipment and storage medium
CN114782291B (en) * 2022-06-23 2022-09-06 中国科学院自动化研究所 Training method and device of image generator, electronic equipment and readable storage medium
CN114782291A (en) * 2022-06-23 2022-07-22 中国科学院自动化研究所 Training method and device of image generator, electronic equipment and readable storage medium
CN115358954B (en) * 2022-10-21 2022-12-23 电子科技大学 Attention-guided feature compression method
CN115358954A (en) * 2022-10-21 2022-11-18 电子科技大学 Attention-guided feature compression method
CN115471901A (en) * 2022-11-03 2022-12-13 山东大学 Multi-pose face frontization method and system based on generation of confrontation network
CN115984106A (en) * 2022-12-12 2023-04-18 武汉大学 Line scanning image super-resolution method based on bilateral generation countermeasure network
CN115984106B (en) * 2022-12-12 2024-04-02 武汉大学 Line scanning image super-resolution method based on bilateral generation countermeasure network
CN116051936A (en) * 2023-03-23 2023-05-02 中国海洋大学 Chlorophyll concentration ordered complement method based on space-time separation external attention
CN116051936B (en) * 2023-03-23 2023-06-20 中国海洋大学 Chlorophyll concentration ordered complement method based on space-time separation external attention
CN116071275B (en) * 2023-03-29 2023-06-09 天津大学 Face image restoration method based on online knowledge distillation and pretraining priori
CN116071275A (en) * 2023-03-29 2023-05-05 天津大学 Face image restoration method based on online knowledge distillation and pretraining priori
CN117611753A (en) * 2024-01-23 2024-02-27 吉林大学 Facial shaping and repairing auxiliary system and method based on artificial intelligent reconstruction technology
CN117611753B (en) * 2024-01-23 2024-03-22 吉林大学 Facial shaping and repairing auxiliary system and method based on artificial intelligent reconstruction technology
CN117974508A (en) * 2024-03-28 2024-05-03 南昌航空大学 Iris image restoration method for irregular occlusion based on generation countermeasure network
CN117974508B (en) * 2024-03-28 2024-06-07 南昌航空大学 Iris image restoration method for irregular occlusion based on generation countermeasure network
CN117974832A (en) * 2024-04-01 2024-05-03 南昌航空大学 Multi-modal liver medical image expansion algorithm based on generation countermeasure network
CN117974832B (en) * 2024-04-01 2024-06-07 南昌航空大学 Multi-modal liver medical image expansion algorithm based on generation countermeasure network
CN117994173A (en) * 2024-04-07 2024-05-07 腾讯科技(深圳)有限公司 Repair network training method, image processing method, device and electronic equipment
CN117994173B (en) * 2024-04-07 2024-06-11 腾讯科技(深圳)有限公司 Repair network training method, image processing method, device and electronic equipment
CN118036701A (en) * 2024-04-10 2024-05-14 南昌工程学院 Ultraviolet image-based insulator corona discharge data enhancement method and system

Similar Documents

Publication Publication Date Title
CN113962893A (en) Face image restoration method based on multi-scale local self-attention generation countermeasure network
US11450066B2 (en) 3D reconstruction method based on deep learning
CN113240580B (en) Lightweight image super-resolution reconstruction method based on multi-dimensional knowledge distillation
CN112819910B (en) Hyperspectral image reconstruction method based on double-ghost attention machine mechanism network
CN113673590B (en) Rain removing method, system and medium based on multi-scale hourglass dense connection network
CN113989129A (en) Image restoration method based on gating and context attention mechanism
CN115018727A (en) Multi-scale image restoration method, storage medium and terminal
CN114445292A (en) Multi-stage progressive underwater image enhancement method
CN111833261A (en) Image super-resolution restoration method for generating countermeasure network based on attention
CN113112416A (en) Semantic-guided face image restoration method
CN117274760A (en) Infrared and visible light image fusion method based on multi-scale mixed converter
CN114638768B (en) Image rain removing method, system and equipment based on dynamic association learning network
CN114266957A (en) Hyperspectral image super-resolution restoration method based on multi-degradation mode data augmentation
CN116797461A (en) Binocular image super-resolution reconstruction method based on multistage attention-strengthening mechanism
Cherian et al. A Novel AlphaSRGAN for Underwater Image Super Resolution.
CN112686822B (en) Image completion method based on stack generation countermeasure network
CN113628143A (en) Weighted fusion image defogging method and device based on multi-scale convolution
CN112634168A (en) Image restoration method combined with edge information
CN116703750A (en) Image defogging method and system based on edge attention and multi-order differential loss
CN114862699B (en) Face repairing method, device and storage medium based on generation countermeasure network
CN116188272A (en) Two-stage depth network image super-resolution reconstruction method suitable for multiple fuzzy cores
CN115660979A (en) Attention mechanism-based double-discriminator image restoration method
CN115861108A (en) Image restoration method based on wavelet self-attention generation countermeasure network
CN115100091A (en) Conversion method and device for converting SAR image into optical image
CN114331894A (en) Face image restoration method based on potential feature reconstruction and mask perception

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination