CN112184582A - Attention mechanism-based image completion method and device - Google Patents

Attention mechanism-based image completion method and device

Info

Publication number
CN112184582A
CN112184582A
Authority
CN
China
Prior art keywords
image
loss function
completion
attention
binary mask
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011038187.6A
Other languages
Chinese (zh)
Other versions
CN112184582B (en)
Inventor
赫然
马鑫
侯峦轩
黄怀波
王海滨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Cas Artificial Intelligence Research Qingdao Co ltd
Original Assignee
Cas Artificial Intelligence Research Qingdao Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Cas Artificial Intelligence Research Qingdao Co ltd filed Critical Cas Artificial Intelligence Research Qingdao Co ltd
Priority to CN202011038187.6A priority Critical patent/CN112184582B/en
Publication of CN112184582A publication Critical patent/CN112184582A/en
Application granted granted Critical
Publication of CN112184582B publication Critical patent/CN112184582B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00Image enhancement or restoration
    • G06T5/77Retouching; Inpainting; Scratch removal
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/30Subject of image; Context of image processing
    • G06T2207/30196Human being; Person
    • G06T2207/30201Face

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to an attention mechanism-based image completion method and device, belonging to the technical field of computer image processing. The method comprises the following steps: step S1, preprocessing database image data, synthesizing damaged images with binary masks, and taking each damaged image and its corresponding binary mask as the input of a network model; step S2, obtaining, through training, a generative adversarial network model capable of image completion; and step S3, using the trained generative adversarial network model to perform completion processing on test data. The invention provides an attention mechanism-based generative adversarial network model for the image completion problem. The binary mask serves as additional guiding information and is combined with the input image during training, so that the completion results of the model contain rich detail information while preserving structural continuity.

Description

Attention mechanism-based image completion method and device
Technical Field
The disclosure belongs to the technical field of computer image processing, and particularly relates to an image completion method and device based on an attention mechanism.
Background
The statements herein merely provide background related to the present disclosure and may not necessarily constitute prior art.
Image inpainting refers to generating substitute content for the missing regions of a given damaged image so that the repaired image is visually realistic and semantically reasonable. Image completion can also support other applications such as image editing: when distracting scene elements such as people or objects (which are often unavoidable) appear in an image, a user can remove the unwanted elements and fill the resulting blank areas with visually and semantically reasonable content.
The inventors have found that, with the continuous development of science and technology, user demands in different fields, including film and advertising animation production and online games, are rising accordingly, and realistic image restoration technology is important for a good user experience. Against this background, developing an image completion method based on an attention mechanism, so that the repaired image is visually vivid and semantically reasonable, is of great significance.
Disclosure of Invention
Aiming at the technical problems in the prior art, the disclosure provides an attention mechanism-based image completion method and device.
At least one embodiment of the present disclosure provides an image completion method based on an attention mechanism, including the following steps:
step S1, preprocessing the database image data, synthesizing a damaged image by using a binary mask, and taking the damaged image and the corresponding binary mask as the input of a network model;
step S2: training on the input data to obtain a generative adversarial network model capable of image completion;
step S3: using the trained generative adversarial network model to perform completion processing on the test data.
Further, the database face images and natural images are of consistent size after the preprocessing in step S1; in the image completion task, a damaged image and its corresponding binary mask are combined as the input, and the undamaged image is used as the real image label.
Further, the process of obtaining the generative adversarial network model in step S2 includes:
step S21: initializing network weight parameters in the image completion task, wherein the loss function of the generator is L_total and the loss function of the discriminator is L_D;
Step S22: inputting the damaged image and the binary mask into the generator network G for the image completion task, inputting the generated completed image and the target image into the discriminator network D, and performing iterative training until both the generator loss L_total and the discriminator loss L_D decrease and become stable;
step S23: training the expression generation and removal tasks simultaneously until none of the loss functions decreases further, thereby obtaining the final generative adversarial network model.
Further, the damaged image is denoted as x, the generated image as ŷ, the target image as y, and the binary mask as M.
Further, the output value of the partial convolution layer in the attention mechanism depends only on the undamaged area, which is described mathematically as follows:

F' = W^T (F ⊙ M) · ( sum(1) / sum(M) ) + b,  if sum(M) > 0;   F' = 0,  otherwise

where ⊙ denotes pixel-level multiplication, 1 denotes a matrix whose elements are all 1 and whose shape is the same as M, W denotes the parameters of the convolution layer, F denotes the output feature map of the previous convolution layer, b denotes the bias of the convolution layer, and M denotes the corresponding binary mask; the factor sum(1)/sum(M) can be regarded as a scaling factor that adjusts the weight of the known region.

After each partial convolution, the binary mask M also needs to be updated, which is described mathematically as follows:

m' = 1,  if sum(M) > 0;   m' = 0,  otherwise

that is, if the convolution layer can obtain an output from at least one valid input, the corresponding position in the binary mask is marked as 1.
Further, the dual attention fusion module in the attention mechanism fuses the known regions and the generated regions together, as follows. First, channel-level statistics are obtained:

z_c = H_GP(f_c) = ( 1 / (H × W) ) Σ_{i=1..H} Σ_{j=1..W} f_c(i, j)

where z_c is the c-th element of z, H_GP denotes the global pooling layer, and f_c denotes the c-th channel of the feature map F.

Then, the dependency between channels is obtained:

ω = f( W_U δ( W_D z ) )

where f and δ denote the sigmoid function and the ReLU activation function, respectively, and W_U and W_D are parameters of convolution layers. The obtained channel-dimension information ω is used to adjust the weights of the input:

f̂_c = ω_c · f_c

where ω_c and f_c denote the scaling factor and the feature map channel, respectively.

Second, an attention map α is obtained by:

α = f( A( [x̂, x'] ) )

where x' is a version of the damaged image x at a different scale, A is a learnable transformation function composed of several convolutions, and x̂ denotes the generated content at the corresponding scale; x̂ and x' are first concatenated and then fed into the convolution layers, and f is a sigmoid function that turns the result into an attention map.

Finally, the final image completion result ŷ is obtained by combining x̂ and x' under the attention map α, where ⊙ and B denote the Hadamard product and the combining function, respectively.
Further, the loss function is divided into a structure loss function and a texture loss function:

L_struct^k = λ_rec L_rec^k + λ_per L_per^k

L_text^k = λ_style L_style^k + λ_tv L_tv^k + λ_adv L_adv

where k indicates that the loss is computed at the k-th layer of the decoder, L_struct denotes the structure loss function, L_text denotes the texture loss function, L_rec denotes the L1 norm between images, L_per denotes the perceptual loss function, L_style denotes the style loss function, L_tv denotes the total variation loss function, and L_adv denotes the adversarial loss function; λ_rec, λ_per, λ_style, λ_tv and λ_adv are weighting factors.
At least one embodiment of the present disclosure provides an attention mechanism-based image completion device, comprising:
a data processing module configured to preprocess database image data, synthesize damaged images with binary masks, and take each damaged image and its corresponding binary mask as the input of a network model;
a model generation module configured to obtain, through training, a generative adversarial network model capable of image completion; and
an image completion module configured to use the trained generative adversarial network model to perform completion processing on test data.
Further, the data processing module is configured such that the preprocessed database face images and natural images are of consistent size; in the image completion task, a damaged image and its corresponding binary mask are combined as the input, and the undamaged image is used as the real image label.
Further, the model generation module is configured to: initialize network weight parameters for the image completion task, wherein the loss function of the generator is L_total and the loss function of the discriminator is L_D; input the damaged image and the binary mask into the generator network G for the image completion task, input the generated completed image and the target image into the discriminator network D, and perform iterative training until both the generator loss L_total and the discriminator loss L_D decrease and become stable; and train the expression generation and removal tasks simultaneously until none of the loss functions decreases further, thereby obtaining the final generative adversarial network model.
The beneficial effects of this disclosure are as follows:
(1) In order to improve the generation quality of images (including rich texture details and structural continuity) in the image completion task, an attention mechanism-based image completion method is provided. Through the partial convolution layer, the generative adversarial network can exploit the prior information of the binary mask, improving the quality of the generated image. The dual attention fusion module forms a multi-scale decoder that can progressively generate high-resolution images.
(2) The image completion method introduces a reconstruction loss function, a style loss function, a total variation loss function and an adversarial loss function as constraints at both the image level and the feature level, improving the robustness and accuracy of the network.
Drawings
The accompanying drawings, which are included to provide a further understanding of the disclosure, illustrate embodiments of the disclosure and, together with the description, serve to explain the disclosure without limiting it.
FIG. 1 is a flow chart of an attention-based image completion method provided by an embodiment of the present disclosure;
FIG. 2 is a flow diagram of a dual attention module provided by embodiments of the present disclosure;
FIG. 3 is a diagram illustrating the effect of image completion on a public data set provided by an embodiment of the present disclosure.
Detailed Description
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure belongs.
The embodiment of the disclosure provides an image completion method based on an attention mechanism, which comprises the following steps:
step S1, preprocessing the database image data, synthesizing the damaged image by using a binary mask, and taking the damaged image and the corresponding binary mask as the input of the network model;
specifically, a binary mask map is first generated offline using a binary mask algorithm. For the face image, normalizing the image according to the positions of the two eyes and cutting the image to be 256 × 256 with a uniform size; for natural images, the image size is first enlarged to 350 × 350, and then the enlarged image is randomly cropped to a uniform size of 256 × 256. And randomly selecting an off-line generated binary mask image, and multiplying the binary mask image by the damaged image to obtain the damaged image.
Further, in step S1 the preprocessed database face images and natural images have a consistent size; in the subsequent image completion task, the damaged image and its corresponding binary mask are combined as the input, and the undamaged image is used as the real image label.
Step S2: training the attention mechanism-based generative adversarial network model with the training input data so as to perform the image completion task.
It should be noted that, in this step, in order to enlarge the sample size of the input data and improve the generalization capability of the network, this embodiment may employ data augmentation operations, including random flipping, to increase the amount of training data.
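A small sketch of such an augmentation step, assuming arrays shaped H × W (× C) as produced above; the 50% flip probability is an illustrative choice, not a value given in the text:

```python
import numpy as np

def augment(damaged, mask, target, rng=None):
    """Random horizontal flip applied consistently to the damaged image,
    the binary mask and the ground-truth label."""
    rng = rng or np.random.default_rng()
    if rng.random() < 0.5:
        damaged, mask, target = damaged[:, ::-1], mask[:, ::-1], target[:, ::-1]
    return damaged.copy(), mask.copy(), target.copy()
```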
Specifically, the step S2 includes:
step S21: initializing network weight parameters in the image completion task, wherein the loss function of the generator is L_total and the loss function of the discriminator is L_D;
Step S22: inputting the damaged image and the binary mask into the generator network G for the image completion task, inputting the generated completed image and the target image into the discriminator network D, and performing iterative training until both the generator loss L_total and the discriminator loss L_D decrease and become stable;
step S23: training the expression generation and removal tasks simultaneously until none of the loss functions decreases further, thereby obtaining the final generative adversarial network model.
Further, assuming the damaged image is denoted as x, the generated image as ŷ, the target image as y, and the binary mask as M, the output value of the partial convolution layer in the above attention mechanism depends only on the undamaged area, which is described mathematically as follows:

F' = W^T (F ⊙ M) · ( sum(1) / sum(M) ) + b,  if sum(M) > 0;   F' = 0,  otherwise

where ⊙ denotes pixel-level multiplication, 1 denotes a matrix whose elements are all 1 and whose shape is the same as M, W denotes the parameters of the convolution layer, F denotes the output feature map of the previous convolution layer, b denotes the bias of the convolution layer, and M denotes the corresponding binary mask; the factor sum(1)/sum(M) can be regarded as a scaling factor that adjusts the weight of the known region.

After each partial convolution, this embodiment also updates the binary mask M, which is described mathematically as follows:

m' = 1,  if sum(M) > 0;   m' = 0,  otherwise

that is, if the convolution layer can obtain an output from at least one valid input, the corresponding position in the binary mask is marked as 1.
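The following PyTorch sketch shows one way to realize such a partial convolution layer consistent with the two formulas above; the class name, the bias handling and the default hyper-parameters are assumptions for illustration rather than the patent's exact implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PartialConv2d(nn.Module):
    """Convolution whose output depends only on valid (undamaged) inputs:
    the feature map is masked before convolution, the result is rescaled by
    sum(1)/sum(M) per window, and the mask is updated so that a position
    becomes valid once it has seen at least one valid input."""

    def __init__(self, in_ch, out_ch, kernel_size, stride=1, padding=0):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size, stride, padding, bias=True)
        # Fixed all-ones kernel used only to count valid pixels per window.
        self.register_buffer("mask_kernel",
                             torch.ones(1, 1, kernel_size, kernel_size))
        self.stride, self.padding = stride, padding

    def forward(self, feat, mask):
        # feat: (N, C, H, W); mask: (N, 1, H, W) with 1 = known, 0 = missing.
        with torch.no_grad():
            valid = F.conv2d(mask, self.mask_kernel,
                             stride=self.stride, padding=self.padding)
        out = self.conv(feat * mask)                              # W^T (F ⊙ M) + b
        scale = self.mask_kernel.numel() / valid.clamp(min=1.0)   # sum(1) / sum(M)
        bias = self.conv.bias.view(1, -1, 1, 1)
        out = (out - bias) * scale + bias            # rescale the convolution term only
        new_mask = (valid > 0).float()               # mask update: m' = 1 if sum(M) > 0
        return out * new_mask, new_mask
```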
Further, in step S2, the dual attention fusion module in the attention mechanism fuses the known regions and the generated regions together, as follows.

The channel-level statistics are obtained first:

z_c = H_GP(f_c) = ( 1 / (H × W) ) Σ_{i=1..H} Σ_{j=1..W} f_c(i, j)

where z_c is the c-th element of z, H_GP denotes the global pooling layer, and f_c denotes the c-th channel of the feature map F.

Then, the dependency between channels is obtained:

ω = f( W_U δ( W_D z ) )

where f and δ denote the sigmoid function and the ReLU activation function, respectively, and W_U and W_D are parameters of convolution layers. The obtained channel-dimension information ω is used to adjust the weights of the input:

f̂_c = ω_c · f_c

where ω_c and f_c denote the scaling factor and the feature map channel, respectively.

Second, an attention map α is obtained by:

α = f( A( [x̂, x'] ) )

where x' is a version of the damaged image x at a different scale, A is a learnable transformation function composed of several convolutions, and x̂ denotes the generated content at the corresponding scale; x̂ and x' are first concatenated and then fed into the convolution layers, and f is a sigmoid function that turns the result into an attention map.

Finally, the final image completion result ŷ is obtained by combining x̂ and x' under the attention map α, where ⊙ and B denote the Hadamard product and the combining function, respectively.
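The sketch below illustrates the two attention branches in PyTorch. The channel-attention reduction ratio, the depth of the learnable transform A and, in particular, the convex combination standing in for the unspecified combining function B are assumptions for illustration only:

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Channel branch: global pooling (H_GP), then W_D -> ReLU -> W_U -> sigmoid,
    then per-channel rescaling of the feature map (f_hat_c = omega_c * f_c)."""

    def __init__(self, channels, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)                 # channel statistics z
        self.fc = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1),  # W_D
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),  # W_U
            nn.Sigmoid())                                   # f

    def forward(self, feat):
        return feat * self.fc(self.pool(feat))              # omega ⊙ F

class DualAttentionFusion(nn.Module):
    """Spatial branch: alpha = sigmoid(A([x_hat, x_prime])); the generated content
    x_hat and the rescaled damaged input x_prime are then fused. The convex
    combination below stands in for the combining function B, which the text
    does not spell out."""

    def __init__(self, img_ch=3, hidden=32):
        super().__init__()
        self.A = nn.Sequential(                              # learnable transform A
            nn.Conv2d(2 * img_ch, hidden, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, img_ch, 3, padding=1))

    def forward(self, x_hat, x_prime):
        alpha = torch.sigmoid(self.A(torch.cat([x_hat, x_prime], dim=1)))
        return alpha * x_prime + (1.0 - alpha) * x_hat       # illustrative fusion
```

In this reading, the channel attention acts on intermediate decoder feature maps, while the fusion module acts at each output scale of the multi-scale decoder.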
Therefore, this embodiment provides a method of wider applicability for the image completion problem. Through the partial convolution layer, the method can use the prior information in the binary mask to complete the damaged image more accurately; in addition, the dual attention fusion module can gradually increase the resolution of the generated image, continually producing rich detail information.
Further, the objective function in the image completion task in this embodiment is divided into a structure loss function and a texture loss function, expressed as follows:

L_struct^k = λ_rec L_rec^k + λ_per L_per^k

L_text^k = λ_style L_style^k + λ_tv L_tv^k + λ_adv L_adv

where k indicates that the loss is computed at the k-th layer of the decoder, L_struct denotes the structure loss function, L_text denotes the texture loss function, L_rec denotes the L1 norm between images, L_per denotes the perceptual loss function, L_style denotes the style loss function, L_tv denotes the total variation loss function, and L_adv denotes the adversarial loss function; λ_rec, λ_per, λ_style, λ_tv and λ_adv are weighting factors.
The reconstruction loss function in the structure loss function is expressed as:

L_rec^k = ‖ ŷ^k − y^k ‖_1

where ‖·‖_1 denotes the L1 norm, ŷ^k denotes the completion result output at the k-th decoder layer, y^k denotes the target image scaled to the corresponding resolution, and cat denotes the concatenation (linking) operation.
The perceptual loss function in the texture loss function is expressed as:

L_per = Σ_i ( 1 / N_i ) ‖ φ_i(ŷ) − φ_i(y) ‖_1

where φ is the pre-trained VGG-16 network, φ_i is the output feature map of its i-th pooling layer, and N_i is the number of elements in φ_i. The pool-1, pool-2 and pool-3 layers of VGG-16 are used in the present disclosure.
The style loss function in the texture loss function is expressed as:

L_style = Σ_i ( 1 / (C_i × C_i) ) ‖ G( φ_i(ŷ) ) − G( φ_i(y) ) ‖_1

where G(·) denotes the Gram matrix of a feature map and C_i denotes the number of channels of the feature map output by the i-th layer of the pre-trained VGG-16 model.
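A sketch of the perceptual and style terms using the pool-1 to pool-3 features of a pre-trained VGG-16 as described above; the L1 distance on features, the Gram-matrix normalization and the torchvision weights API (which depends on the installed version) are assumptions that may differ from the patent's exact formulation:

```python
import torch
import torch.nn as nn
from torchvision import models

class VGGFeatures(nn.Module):
    """Feature maps after pool-1, pool-2 and pool-3 of a frozen VGG-16."""

    def __init__(self):
        super().__init__()
        vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1).features.eval()
        for p in vgg.parameters():
            p.requires_grad_(False)
        # Slices end at the first three MaxPool2d layers of torchvision's vgg16.
        self.blocks = nn.ModuleList([vgg[:5], vgg[5:10], vgg[10:17]])

    def forward(self, x):
        feats = []
        for block in self.blocks:
            x = block(x)
            feats.append(x)
        return feats

def gram(phi):
    n, c, h, w = phi.shape
    f = phi.view(n, c, h * w)
    return f @ f.transpose(1, 2) / (c * h * w)   # normalized Gram matrix

def perceptual_and_style_loss(generated, target, extractor):
    l_per, l_style = 0.0, 0.0
    for p_g, p_t in zip(extractor(generated), extractor(target)):
        l_per += nn.functional.l1_loss(p_g, p_t)                 # perceptual term
        l_style += nn.functional.l1_loss(gram(p_g), gram(p_t))   # style term
    return l_per, l_style
```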
The total variation loss function in the texture loss function is expressed as:

L_tv = Σ_{(i,j)∈Ω} ( ‖ ŷ^{i,j+1} − ŷ^{i,j} ‖_1 + ‖ ŷ^{i+1,j} − ŷ^{i,j} ‖_1 )

where Ω denotes the damaged region in the image. The total variation loss function is a smoothness penalty term defined on a one-pixel dilation of the missing region.
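A compact sketch of such a smoothness penalty restricted to the damaged region Ω; the one-pixel neighbourhood and the normalization are illustrative choices:

```python
import torch

def tv_loss(image, region_mask):
    """Total variation over the damaged region (region_mask: 1 inside Ω, 0 outside),
    using absolute differences between horizontally and vertically adjacent pixels."""
    m = region_mask[..., :-1, :-1]
    dh = (image[..., :-1, 1:] - image[..., :-1, :-1]).abs() * m
    dv = (image[..., 1:, :-1] - image[..., :-1, :-1]).abs() * m
    return (dh.sum() + dv.sum()) / m.sum().clamp(min=1.0)
```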
The adversarial loss function in the texture loss function is expressed as:

L_adv = E[ D(ŷ) ] − E[ D(y) ] + λ E[ ( ‖ ∇_ȳ D(ȳ) ‖_2 − 1 )^2 ]

where D denotes the discriminator and ȳ is obtained by random interpolation between samples of ŷ and y. In the present disclosure, λ is set to 10.
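The random interpolation between generated and real samples together with λ = 10 matches a WGAN-GP style objective; the sketch below implements that reading and should be taken as an interpretation, not the patent's exact loss:

```python
import torch

def gradient_penalty(discriminator, real, fake, lam=10.0):
    """Gradient penalty computed on random interpolations between real and generated images."""
    eps = torch.rand(real.size(0), 1, 1, 1, device=real.device)
    mixed = (eps * real + (1.0 - eps) * fake).requires_grad_(True)
    grad, = torch.autograd.grad(discriminator(mixed).sum(), mixed, create_graph=True)
    return lam * ((grad.flatten(1).norm(2, dim=1) - 1.0) ** 2).mean()

def discriminator_loss(discriminator, real, fake):
    fake = fake.detach()
    return (discriminator(fake).mean() - discriminator(real).mean()
            + gradient_penalty(discriminator, real, fake))

def generator_adv_loss(discriminator, fake):
    # Generator side of L_adv: raise the discriminator score of completed images.
    return -discriminator(fake).mean()
```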
The total loss function of this embodiment is defined as:

L_total = Σ_{k∈P} L_struct^k + Σ_{k∈Q} L_text^k

where P and Q are the sets of decoder layers selected for the structure loss and the texture loss, respectively.
The attention mechanism-based generative adversarial network mainly performs the image completion task, and its final goal is for the L_total loss function to be minimized and stabilized.
The attention mechanism-based generative adversarial network is trained as follows:
step S21: initializing the weight parameters of the network, wherein λ_rec, λ_per, λ_style, λ_tv and λ_adv are set to 6, 0.1, 240, 0.1 and 0.001 respectively, the batch size is 32, the learning rate is 10^-4, and P and Q are {1, 2, 3, 4, 5, 6} and {1, 2, 3}, respectively.
step S22: inputting the damaged image and the binary mask into the generator G for image completion; the generated completed image and the real target image are input into the discriminator D, and iterations are carried out in sequence until the total network loss function L_total decreases and becomes stable.
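A minimal training-loop sketch under these settings; the Adam optimizer, the generator call signature and the loss-function callables are assumptions not specified in the text:

```python
import torch

def train(generator, discriminator, loader, total_loss_fn, adv_loss_fn,
          epochs=100, lr=1e-4, device="cuda"):
    """Alternate discriminator and generator updates until L_total and L_D
    stop decreasing; the learning rate follows the value given above."""
    opt_g = torch.optim.Adam(generator.parameters(), lr=lr)
    opt_d = torch.optim.Adam(discriminator.parameters(), lr=lr)
    for _ in range(epochs):
        for damaged, mask, target in loader:       # batch size set by the DataLoader
            damaged, mask, target = (t.to(device) for t in (damaged, mask, target))
            completed = generator(damaged, mask)

            # Discriminator step: real target vs. generated completion.
            opt_d.zero_grad()
            adv_loss_fn(discriminator, target, completed).backward()
            opt_d.step()

            # Generator step: structure + texture losses (L_total).
            opt_g.zero_grad()
            total_loss_fn(completed, target, mask, discriminator).backward()
            opt_g.step()
```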
It should be noted that, in the embodiment of the present disclosure, an encoder is used to extract features from the input data, a decoder decodes the obtained latent code into an image, and the dual attention fusion module outputs the final completed image. In this example, the encoder and decoder each consist of 8 convolution layers; the filter sizes of the convolution layers in the encoder are 7, 5, 3, 3, 3, 3, 3 and 3, and the filters of the convolution layers in the decoder are all of size 3. Feature maps are upsampled using conventional methods. The number of convolution layers and the number and size of filters in each layer can be selected according to actual requirements. The discriminator adopts a convolutional neural network structure that takes the real image pair and the generated completed image pair as input, and its output uses a patch-based adversarial loss function to judge real from fake.
Exploiting the strong nonlinear fitting capability of the attention mechanism-based generative adversarial network, the embodiment of the present disclosure lets the partial convolution layers use the prior information in the binary mask for the image completion task. Secondly, the embodiment of the present disclosure provides a dual attention fusion module that forms a multi-scale decoder, which can gradually increase the texture detail in the generated image. In particular, under the constraint of the applied loss functions, the network is encouraged to produce high-quality images. Thus, a model capable of image completion can be trained with the network shown in FIG. 1. In the testing stage, the binary mask and the damaged image are likewise used as the input of the model to obtain the generated image completion result, as shown in FIG. 3.
Step S3: performing completion processing on the test data by using the trained attention mechanism-based generative adversarial network model.
To illustrate the specific implementation of the disclosed embodiment in detail and to verify the validity of the disclosed method, the method proposed in this embodiment is applied to four public databases (one face database and three natural image databases): CelebA-HQ, ImageNet, Places2 and Paris Street View. CelebA-HQ contains 30,000 high-quality face images. Places2 contains 365 scene categories with more than 8,000,000 images in total. Paris Street View contains 15,000 Paris street view images. ImageNet is a large-scale data set with more than 14 million images. For Places2, Paris Street View and ImageNet, the original validation and test sets are used in the present disclosure. For CelebA-HQ, 28,000 images are randomly selected for training and the remaining images are used for testing. 60,000 binary mask maps are generated offline with the binary mask algorithm; 55,000 of them are randomly selected for training and the remaining 5,000 are used for testing (the binary masks are used to generate the damaged images). Using the attention mechanism-based generative adversarial network and the objective function designed in the present disclosure, the damaged image and its corresponding binary mask are taken as input, and the deep neural network is trained through the adversarial game between the generator and the discriminator with gradient back-propagation. The weights of the different tasks are adjusted continuously during training until the network finally converges, yielding the image completion model.
In order to test the effectiveness of the model, the image completion operation is performed with the test set data, and the visualization results are shown in FIG. 3. This embodiment effectively demonstrates that the proposed method can generate high-quality images.
Finally, the above embodiments are only intended to illustrate, rather than limit, the technical solutions of the present disclosure. Although the present disclosure has been described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications or equivalent substitutions may be made to the technical solutions of the present disclosure without departing from the spirit and scope of those solutions, and all such modifications should be covered by the claims of the present disclosure.
Although the present disclosure has been described with reference to specific embodiments, it should be understood that the scope of the present disclosure is not limited thereto, and those skilled in the art will appreciate that various modifications and changes can be made without departing from the spirit and scope of the present disclosure.

Claims (10)

1. An image completion method based on an attention mechanism is characterized by comprising the following processes:
step S1: preprocessing the database image data, synthesizing a damaged image by using a binary mask, and taking the damaged image and the corresponding binary mask as input data of a network model;
step S2: training on the input data to obtain a generative adversarial network model capable of image completion;
step S3: using the trained generative adversarial network model to perform completion processing on the test data.
2. The attention mechanism-based image completion method according to claim 1, wherein the database face images and natural images are of consistent size after the preprocessing in step S1; in the image completion task, a damaged image and its corresponding binary mask are combined as the input, and the undamaged image is used as the real image label.
3. The attention-based image completion method according to claim 1, wherein the step S2 includes:
step S21: initializing network weight parameters in the image completion task, wherein the loss function of the generator is L_total and the loss function of the discriminator is L_D;
Step S22: inputting the damaged image and the binary mask into the generator network G for the image completion task, inputting the generated completed image and the target image into the discriminator network D, and performing iterative training until both the generator loss L_total and the discriminator loss L_D decrease and become stable;
step S23: training the expression generation and removal tasks simultaneously until none of the loss functions decreases further, thereby obtaining the final generative adversarial network model.
4. The attention mechanism-based image completion method according to claim 3, wherein the damaged image is denoted as x, the generated image as ŷ, the target image as y, and the binary mask as M.
5. The attention mechanism-based image completion method according to claim 3, wherein the output value of the partial convolution layer in the attention mechanism depends only on the undamaged area, which is described mathematically as follows:

F' = W^T (F ⊙ M) · ( sum(1) / sum(M) ) + b,  if sum(M) > 0;   F' = 0,  otherwise

where ⊙ denotes pixel-level multiplication, 1 denotes a matrix whose elements are all 1 and whose shape is the same as M, W denotes the parameters of the convolution layer, F denotes the output feature map of the previous convolution layer, b denotes the bias of the convolution layer, and M denotes the corresponding binary mask; the factor sum(1)/sum(M) can be regarded as a scaling factor that adjusts the weight of the known region;

after each partial convolution, the binary mask M also needs to be updated, which is described mathematically as follows:

m' = 1,  if sum(M) > 0;   m' = 0,  otherwise

that is, if the convolution layer can obtain an output from at least one valid input, the corresponding position in the binary mask is marked as 1.
6. The attention mechanism-based image completion method according to claim 3, wherein the dual attention fusion module in the attention mechanism fuses the known regions and the generated regions together, comprising: first, obtaining channel-level statistics:

z_c = H_GP(f_c) = ( 1 / (H × W) ) Σ_{i=1..H} Σ_{j=1..W} f_c(i, j)

where z_c is the c-th element of z, H_GP denotes the global pooling layer, and f_c denotes the c-th channel of the feature map F;

then, obtaining the dependency between channels:

ω = f( W_U δ( W_D z ) )

where f and δ denote the sigmoid function and the ReLU activation function, respectively, and W_U and W_D are parameters of convolution layers; the obtained channel-dimension information ω is used to adjust the weights of the input:

f̂_c = ω_c · f_c

where ω_c and f_c denote the scaling factor and the feature map channel, respectively;

second, obtaining an attention map α by:

α = f( A( [x̂, x'] ) )

where x' is a version of the damaged image x at a different scale, A is a learnable transformation function composed of several convolutions, and x̂ denotes the generated content at the corresponding scale; x̂ and x' are first concatenated and then fed into the convolution layers, and f is a sigmoid function that turns the result into an attention map;

finally, obtaining the final image completion result ŷ by combining x̂ and x' under the attention map α, where ⊙ and B denote the Hadamard product and the combining function, respectively.
7. The attention mechanism-based image completion method according to claim 3, wherein the loss function is divided into a structure loss function and a texture loss function:

L_struct^k = λ_rec L_rec^k + λ_per L_per^k

L_text^k = λ_style L_style^k + λ_tv L_tv^k + λ_adv L_adv

where k indicates that the loss is computed at the k-th layer of the decoder, L_struct denotes the structure loss function, L_text denotes the texture loss function, L_rec denotes the L1 norm between images, L_per denotes the perceptual loss function, L_style denotes the style loss function, L_tv denotes the total variation loss function, and L_adv denotes the adversarial loss function; λ_rec, λ_per, λ_style, λ_tv and λ_adv are weighting factors.
8. An attention mechanism-based image completion device, characterized by comprising:
a data processing module configured to preprocess database image data, synthesize damaged images with binary masks, and take each damaged image and its corresponding binary mask as the input of a network model;
a model generation module configured to obtain, through training, a generative adversarial network model capable of image completion; and
an image completion module configured to use the trained generative adversarial network model to perform completion processing on test data.
9. The attention mechanism-based image completion device according to claim 8, wherein the data processing module is configured such that the preprocessed database face images and natural images are of consistent size; in the image completion task, a damaged image and its corresponding binary mask are combined as the input, and the undamaged image is used as the real image label.
10. The attention mechanism-based image completion device according to claim 8, wherein the model generation module is configured to: initialize network weight parameters for the image completion task, wherein the loss function of the generator is L_total and the loss function of the discriminator is L_D; input the damaged image and the binary mask into the generator network G for the image completion task, input the generated completed image and the target image into the discriminator network D, and perform iterative training until both the generator loss L_total and the discriminator loss L_D decrease and become stable; and train the expression generation and removal tasks simultaneously until none of the loss functions decreases further, thereby obtaining the final generative adversarial network model.
CN202011038187.6A 2020-09-28 2020-09-28 Attention mechanism-based image completion method and device Active CN112184582B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011038187.6A CN112184582B (en) 2020-09-28 2020-09-28 Attention mechanism-based image completion method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011038187.6A CN112184582B (en) 2020-09-28 2020-09-28 Attention mechanism-based image completion method and device

Publications (2)

Publication Number Publication Date
CN112184582A true CN112184582A (en) 2021-01-05
CN112184582B CN112184582B (en) 2022-08-19

Family

ID=73944421

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011038187.6A Active CN112184582B (en) 2020-09-28 2020-09-28 Attention mechanism-based image completion method and device

Country Status (1)

Country Link
CN (1) CN112184582B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112884673A (en) * 2021-03-11 2021-06-01 西安建筑科技大学 Reconstruction method for missing information between coffin chamber mural blocks of improved loss function SinGAN
CN113129234A (en) * 2021-04-20 2021-07-16 河南科技学院 Incomplete image fine repairing method based on intra-field and extra-field feature fusion
CN113221757A (en) * 2021-05-14 2021-08-06 上海交通大学 Method, terminal and medium for improving accuracy rate of pedestrian attribute identification
CN113962893A (en) * 2021-10-27 2022-01-21 山西大学 Face image restoration method based on multi-scale local self-attention generation countermeasure network

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180075581A1 (en) * 2016-09-15 2018-03-15 Twitter, Inc. Super resolution using a generative adversarial network
US20190236759A1 (en) * 2018-01-29 2019-08-01 National Tsing Hua University Method of image completion
CN110288537A (en) * 2019-05-20 2019-09-27 湖南大学 Facial image complementing method based on the depth production confrontation network from attention
CN110458939A (en) * 2019-07-24 2019-11-15 大连理工大学 The indoor scene modeling method generated based on visual angle
CN111127346A (en) * 2019-12-08 2020-05-08 复旦大学 Multi-level image restoration method based on partial-to-integral attention mechanism

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180075581A1 (en) * 2016-09-15 2018-03-15 Twitter, Inc. Super resolution using a generative adversarial network
US20190236759A1 (en) * 2018-01-29 2019-08-01 National Tsing Hua University Method of image completion
CN110288537A (en) * 2019-05-20 2019-09-27 湖南大学 Facial image complementing method based on the depth production confrontation network from attention
CN110458939A (en) * 2019-07-24 2019-11-15 大连理工大学 The indoor scene modeling method generated based on visual angle
CN111127346A (en) * 2019-12-08 2020-05-08 复旦大学 Multi-level image restoration method based on partial-to-integral attention mechanism

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
HAO TANG ET AL.: "Local Class-Specific and Global Image-Level Generative Adversarial Networks for Semantic-Guided Scene Generation", 《2020 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR)》 *
JIANLOU SI ET AL.: "Dual Attention Matching Network for Context-Aware Feature Sequence Based Person Re-identification", 《2018 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION》 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112884673A (en) * 2021-03-11 2021-06-01 西安建筑科技大学 Reconstruction method for missing information between coffin chamber mural blocks of improved loss function SinGAN
CN113129234A (en) * 2021-04-20 2021-07-16 河南科技学院 Incomplete image fine repairing method based on intra-field and extra-field feature fusion
CN113129234B (en) * 2021-04-20 2022-11-01 河南科技学院 Incomplete image fine restoration method based on intra-field and extra-field feature fusion
CN113221757A (en) * 2021-05-14 2021-08-06 上海交通大学 Method, terminal and medium for improving accuracy rate of pedestrian attribute identification
CN113221757B (en) * 2021-05-14 2022-09-02 上海交通大学 Method, terminal and medium for improving accuracy rate of pedestrian attribute identification
CN113962893A (en) * 2021-10-27 2022-01-21 山西大学 Face image restoration method based on multi-scale local self-attention generation countermeasure network

Also Published As

Publication number Publication date
CN112184582B (en) 2022-08-19

Similar Documents

Publication Publication Date Title
CN112184582B (en) Attention mechanism-based image completion method and device
CN112686817B (en) Image completion method based on uncertainty estimation
CN111160085A (en) Human body image key point posture estimation method
CN112686816A (en) Image completion method based on content attention mechanism and mask code prior
CN109903236B (en) Face image restoration method and device based on VAE-GAN and similar block search
CN111815523A (en) Image restoration method based on generation countermeasure network
CN110728219A (en) 3D face generation method based on multi-column multi-scale graph convolution neural network
CN111242841A (en) Image background style migration method based on semantic segmentation and deep learning
CN113989129A (en) Image restoration method based on gating and context attention mechanism
CN112102303A (en) Semantic image analogy method for generating countermeasure network based on single image
Li et al. Learning efficient gans for image translation via differentiable masks and co-attention distillation
CN109447897B (en) Real scene image synthesis method and system
CN111986075A (en) Style migration method for target edge clarification
CN110097615B (en) Stylized and de-stylized artistic word editing method and system
CN112149563A (en) Method and system for estimating postures of key points of attention mechanism human body image
CN112801914A (en) Two-stage image restoration method based on texture structure perception
CN112949553A (en) Face image restoration method based on self-attention cascade generation countermeasure network
CN117788629B (en) Image generation method, device and storage medium with style personalization
Wang et al. Diverse image inpainting with normalizing flow
CN114092354A (en) Face image restoration method based on generation countermeasure network
CN113160081A (en) Depth face image restoration method based on perception deblurring
CN117611428A (en) Fashion character image style conversion method
Sanjay et al. Early Renaissance Art Generation Using Deep Convolutional Generative Adversarial Networks
CN113111906B (en) Method for generating confrontation network model based on condition of single pair image training
CN114140317A (en) Image animation method based on cascade generation confrontation network

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant