CN112184582B - Attention mechanism-based image completion method and device - Google Patents

Attention mechanism-based image completion method and device

Info

Publication number
CN112184582B
Authority
CN
China
Prior art keywords
image
loss function
binary mask
completion
attention
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011038187.6A
Other languages
Chinese (zh)
Other versions
CN112184582A (en
Inventor
赫然 (He Ran)
马鑫 (Ma Xin)
侯峦轩 (Hou Luanxuan)
黄怀波 (Huang Huaibo)
王海滨 (Wang Haibin)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Cas Artificial Intelligence Research Qingdao Co ltd
Original Assignee
Cas Artificial Intelligence Research Qingdao Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Cas Artificial Intelligence Research Qingdao Co ltd filed Critical Cas Artificial Intelligence Research Qingdao Co ltd
Priority to CN202011038187.6A priority Critical patent/CN112184582B/en
Publication of CN112184582A publication Critical patent/CN112184582A/en
Application granted granted Critical
Publication of CN112184582B publication Critical patent/CN112184582B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 5/00 Image enhancement or restoration
    • G06T 5/77 Retouching; Inpainting; Scratch removal
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G06N 3/08 Learning methods
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20081 Training; Learning
    • G06T 2207/20084 Artificial neural networks [ANN]
    • G06T 2207/30 Subject of image; Context of image processing
    • G06T 2207/30196 Human being; Person
    • G06T 2207/30201 Face

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to an image completion method and device based on an attention mechanism, belonging to the technical field of computer image processing. The method comprises the following steps: step S1, preprocessing the database image data, synthesizing a damaged image by using a binary mask, and taking the damaged image and the corresponding binary mask as the input of a network model; step S2, obtaining, through training, a generative adversarial network model capable of image completion; and step S3, using the trained generative adversarial network model to perform completion processing on the test data. The invention provides a generative adversarial network model based on an attention mechanism for the image completion problem. The binary mask is used as additional guiding information and is combined with the input image during training, so that the model can produce completion results that contain rich detail information while maintaining structural continuity.

Description

Attention mechanism-based image completion method and device
Technical Field
The disclosure belongs to the technical field of computer image processing, and particularly relates to an image completion method and device based on an attention mechanism.
Background
The statements herein merely provide background information related to the present disclosure and may not necessarily constitute prior art.
Image inpainting refers to generating substitute content for the missing regions of a given damaged image so that the repaired image is visually realistic and semantically reasonable. The image completion task can also serve other applications such as image editing: when an image contains scene elements that distract human attention, such as people or objects (which are often unavoidable), it allows a user to remove the unwanted elements and fill the resulting blank areas with visually and semantically plausible content.
The inventor finds that, with the continuous development of science and technology, user demands in different fields, including film and advertisement animation production and online games, have risen accordingly, and realistic image restoration technology is of great significance to a good user experience. Against this background, developing an image completion method based on an attention mechanism that makes the repaired image visually realistic and semantically reasonable is of great importance.
Disclosure of Invention
Aiming at the technical problems in the prior art, the invention provides an image completion method and device based on an attention mechanism.
At least one embodiment of the present disclosure provides an image completion method based on an attention mechanism, including the following steps:
step S1, preprocessing the database image data, synthesizing a damaged image by using a binary mask, and taking the damaged image and the corresponding binary mask as the input of a network model;
step S2, training on the input data to obtain a generative adversarial network model capable of image completion;
and step S3, using the trained generative adversarial network model to perform completion processing on the test data.
Further, the database face image and the natural image after the preprocessing in the step S1 are consistent in size; in the image completion task, a damaged image and a corresponding binary mask are combined as input, and an undamaged image is used as a real image label.
Further, the process of obtaining the generative adversarial network model in step S2 includes:
Step S21: initializing the network weight parameters for the image completion task, where the loss function of the generator is L_total and the loss function of the discriminator is L_D;
Step S22: inputting the damaged image and the binary mask into the generator network G for the image completion task, inputting the generated completed image and the target image into the discriminator network D, and performing iterative training until both the generator loss L_total and the discriminator loss L_D decrease and become stable;
Step S23: training the expression generation and removal tasks simultaneously until none of the loss functions decreases further, thereby obtaining the final generative adversarial network model.
Further, the damaged image is denoted as x, the generated image as x̂, the target image as y, and the binary mask as M.
Further, the output value of the local convolution layer in the attention mechanism depends only on the undamaged region, which is mathematically described as follows:

F' = W^T (F ⊙ M) · (sum(1)/sum(M)) + b,  if sum(M) > 0
F' = 0,                                  otherwise

where ⊙ denotes pixel-level (element-wise) multiplication and 1 denotes a matrix whose elements are all 1 and whose shape is the same as M. W denotes the parameters of the convolutional layer, F the output feature map of the previous convolutional layer, b the bias of the convolutional layer, and M the corresponding binary mask map; sum(1)/sum(M) is a scaling factor that adjusts the weight of the known region.
The binary mask map M also needs to be updated after the local convolution is performed, which is mathematically described as follows:

M' = 1, if sum(M) > 0;  M' = 0, otherwise

That is, if the convolutional layer can produce an output from at least one valid input, the corresponding position in the binary mask is marked as 1.
Further, a dual attention fusion module in the attention mechanism fuses the known regions and the generated regions together, as follows. First, the channel-level statistics are obtained:

z_c = H_GP(f_c) = (1/(H·W)) Σ_{i=1..H} Σ_{j=1..W} f_c(i, j)

where z_c is the value of the c-th dimension of z, H_GP denotes the global pooling layer, and f_c denotes the c-th channel of the feature map F;
then, the dependencies between channels are obtained:

ω = f(W_U · δ(W_D · z))

where f and δ denote the sigmoid and ReLU activation functions, respectively, and W_U and W_D are parameters of convolutional layers. The obtained channel-wise information ω is used to adjust the weight of the input:

f̂_c = ω_c · f_c

where ω_c and f_c denote the scaling factor and the feature map of the c-th channel, respectively;
next, the attention map α is obtained by:

α = f(A(cat(x̂, x')))

where x' is the damaged image x rescaled to the current scale, A is a learnable transformation composed of several convolutional layers, x̂ and x' are first concatenated (cat) and then fed into the convolutional layers, and f is the sigmoid function that turns the response into the attention map α;
finally, the image completion result x̃ is obtained:

x̃ = B(α ⊙ x', (1 - α) ⊙ x̂)

where ⊙ and B denote the Hadamard product and a combining function, respectively.
Further, the loss function is divided into a structural loss and a texture loss:

L_struct^k = λ_rec · L_rec^k
L_text^k = λ_per · L_per^k + λ_style · L_style^k + λ_tv · L_tv^k + λ_adv · L_adv^k

where the superscript k indicates that the loss is computed at the k-th layer of the decoder. L_struct denotes the structural loss, L_text the texture loss, L_rec the L_1 norm between images, L_per the perceptual loss, L_style the style loss, L_tv the total variation loss, and L_adv the adversarial loss; λ_rec, λ_per, λ_style, λ_tv and λ_adv are weighting factors.
At least one embodiment of the present disclosure provides an image completion apparatus based on an attention mechanism, the apparatus including:
a data processing module, configured to preprocess the database image data, synthesize a damaged image by using a binary mask, and combine the damaged image and the corresponding binary mask as the input of the network model;
a model generation module, configured to obtain, through training, a generative adversarial network model capable of image completion;
an image completion module, configured to use the trained generative adversarial network model to perform completion processing on the test data.
Further, the data processing module is configured so that the database face images and natural images have a consistent size after preprocessing; in the image completion task, the damaged image and the corresponding binary mask are combined as the input, and the undamaged image is used as the real image label.
Further, the model generation module is configured to: initialize the network weight parameters for the image completion task, where the loss function of the generator is L_total and the loss function of the discriminator is L_D; input the damaged image and the binary mask into the generator network G for the image completion task, input the generated completed image and the target image into the discriminator network D, and perform iterative training until both the generator loss L_total and the discriminator loss L_D decrease and become stable; and train the expression generation and removal tasks simultaneously until none of the loss functions decreases further, thereby obtaining the final generative adversarial network model.
The beneficial effects of this disclosure are as follows:
(1) In order to improve the generation quality of images in the image completion task (including rich texture details and structural continuity), an image completion method based on the attention mechanism is provided. Through the local convolution layers, the generative adversarial network can exploit the prior information in the binary mask, which improves the quality of the generated images. The dual attention fusion modules form a multi-scale decoder that can progressively generate high-resolution images.
(2) The image completion method introduces a reconstruction loss, a style loss, a total variation loss and an adversarial loss as constraints at both the image level and the feature level, which improves the robustness and accuracy of the network.
Drawings
The accompanying drawings, which are included to provide a further understanding of the disclosure, illustrate embodiments of the disclosure and together with the description serve to explain the disclosure and are not to limit the disclosure.
FIG. 1 is a flowchart of the attention-based image completion method provided by the embodiment of the present disclosure;
FIG. 2 is a flowchart of the dual attention fusion module provided by the embodiment of the present disclosure;
FIG. 3 shows image completion results on public data sets according to the embodiment of the present disclosure.
Detailed Description
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure belongs.
The embodiment of the disclosure provides an attention mechanism-based image completion method, which comprises the following steps:
step S1, preprocessing the database image data, synthesizing the damaged image by using a binary mask, and taking the damaged image and the corresponding binary mask as the input of the network model;
specifically, a binary mask map is first generated offline using a binary mask algorithm. For the face image, normalizing the image according to the positions of the two eyes and cutting the image to be 256 × 256 with a uniform size; for natural images, the image size is first enlarged to 350 × 350, and then the enlarged image is randomly cropped to a uniform size 256 × 256. And randomly selecting an off-line generated binary mask image, and multiplying the off-line generated binary mask image by the damaged image to obtain the damaged image.
Further, in step S1, the preprocessed database face images and natural images have a consistent size; meanwhile, in the subsequent image completion task, the damaged image and the corresponding binary mask are combined as the input, and the undamaged image is used as the real image label.
Step S2: training the attention-based generative adversarial network model with the training input data so as to accomplish the image completion task.
It should be noted that, in this step, in order to enlarge the sample size of the input data and improve the generalization ability of the network, this embodiment may employ data augmentation operations, including random flipping, so as to increase the amount of training data.
Specifically, step S2 includes:
Step S21: initializing the network weight parameters for the image completion task, where the loss function of the generator is L_total and the loss function of the discriminator is L_D;
Step S22: inputting the damaged image and the binary mask into the generator network G for the image completion task, inputting the generated completed image and the target image into the discriminator network D, and performing iterative training until both the generator loss L_total and the discriminator loss L_D decrease and become stable;
Step S23: training the expression generation and removal tasks simultaneously until none of the loss functions decreases further, thereby obtaining the final generative adversarial network model.
Further, assuming that the damaged image is denoted as x, the generated image as x̂, the target image as y, and the binary mask as M, the output value of the local convolution layer in the above attention mechanism depends only on the undamaged region, which is mathematically described as follows:

F' = W^T (F ⊙ M) · (sum(1)/sum(M)) + b,  if sum(M) > 0
F' = 0,                                  otherwise

where ⊙ denotes pixel-level (element-wise) multiplication and 1 denotes a matrix whose elements are all 1 and whose shape is the same as M. W denotes the parameters of the convolutional layer, F the output feature map of the previous convolutional layer, b the bias of the convolutional layer, and M the corresponding binary mask map; sum(1)/sum(M) is a scaling factor that adjusts the weight of the known region.
In this embodiment the binary mask map M also needs to be updated after the local convolution is performed, which is mathematically described as follows:

M' = 1, if sum(M) > 0;  M' = 0, otherwise

That is, if the convolutional layer can produce an output from at least one valid input, the corresponding position in the binary mask is marked as 1.
Further, in step S2, the dual attention fusion module in the attention mechanism fuses the known regions and the generated regions together, as follows.
First, the channel-level statistics are obtained:

z_c = H_GP(f_c) = (1/(H·W)) Σ_{i=1..H} Σ_{j=1..W} f_c(i, j)

where z_c is the value of the c-th dimension of z, H_GP denotes the global pooling layer, and f_c denotes the c-th channel of the feature map F.
Then, the dependencies between channels are obtained:

ω = f(W_U · δ(W_D · z))

where f and δ denote the sigmoid and ReLU activation functions, respectively, and W_U and W_D are parameters of convolutional layers. The obtained channel-wise information ω is used to adjust the weight of the input:

f̂_c = ω_c · f_c

where ω_c and f_c denote the scaling factor and the feature map of the c-th channel, respectively.
Next, the attention map α is obtained by:

α = f(A(cat(x̂, x')))

where x' is the damaged image x rescaled to the current scale, A is a learnable transformation composed of several convolutional layers, x̂ and x' are first concatenated (cat) and then fed into the convolutional layers, and f is the sigmoid function that turns the response into the attention map α.
Finally, the image completion result x̃ is obtained:

x̃ = B(α ⊙ x', (1 - α) ⊙ x̂)

where ⊙ and B denote the Hadamard product and a combining function, respectively.
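The sketch below illustrates the attention map α and one possible fusion step. The text does not spell out the combining function B, so a simple α-weighted blend of the rescaled input x' and the generated content x̂ is assumed here; the layer sizes are likewise illustrative.

    import torch
    import torch.nn as nn

    class AttentionFusion(nn.Module):
        # α = sigmoid(A(cat(x_hat, x_scaled))); the blend below stands in for the combining function B.
        def __init__(self, channels=3):
            super().__init__()
            self.A = nn.Sequential(                           # learnable transformation A (assumed depth/width)
                nn.Conv2d(2 * channels, 32, 3, padding=1), nn.ReLU(inplace=True),
                nn.Conv2d(32, 1, 3, padding=1),
            )
            self.sigmoid = nn.Sigmoid()                       # f

        def forward(self, x_hat, x_scaled):
            alpha = self.sigmoid(self.A(torch.cat([x_hat, x_scaled], dim=1)))   # attention map α
            # Assumed fusion: keep rescaled-input content where α is high, generated content elsewhere.
            return alpha * x_scaled + (1.0 - alpha) * x_hat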
Therefore, this embodiment provides a more broadly applicable solution to the image completion problem. Through the local convolution layers, the method can use the prior information of the binary mask to complete the damaged image more accurately; in addition, the dual attention fusion module allows the resolution of the generated image to be increased progressively, so that rich detail information is generated continuously.
Further, the objective function of the image completion task in this embodiment is divided into a structural loss and a texture loss, which are expressed as follows:

L_struct^k = λ_rec · L_rec^k
L_text^k = λ_per · L_per^k + λ_style · L_style^k + λ_tv · L_tv^k + λ_adv · L_adv^k

where the superscript k indicates that the loss is computed at the k-th layer of the decoder. L_struct denotes the structural loss, L_text the texture loss, L_rec the L_1 norm between images, L_per the perceptual loss, L_style the style loss, L_tv the total variation loss, and L_adv the adversarial loss; λ_rec, λ_per, λ_style, λ_tv and λ_adv are weighting factors.
The reconstruction loss in the structural loss is expressed as:

L_rec = || x̂ - y ||_1

where ||·||_1 denotes the L_1 norm and cat denotes the concatenation (linking) operation.
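As a minimal sketch, the reconstruction term can be computed as a mean absolute error between the generated and target images; the function name is an illustrative assumption.

    import torch

    def reconstruction_loss(pred, target):
        # L_rec: L1 distance between the generated image and the target image.
        return torch.mean(torch.abs(pred - target))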
The perceptual loss in the texture loss is expressed as:

L_per = Σ_i || φ_i(x̂) - φ_i(y) ||_1

where φ is the pre-trained VGG-16 network and φ_i is the feature map output by its i-th pooling layer. The pool-1, pool-2 and pool-3 layers of VGG-16 are used in the present invention.
The style loss in the texture loss is expressed as:

L_style = Σ_i (1/C_i²) · || φ_i(x̂)^T φ_i(x̂) - φ_i(y)^T φ_i(y) ||_1

where C_i is the number of channels of the feature map output by the i-th layer of the pre-trained VGG-16 model, and the products φ_i(·)^T φ_i(·) are the corresponding Gram matrices.
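A sketch of the perceptual and style terms built on the pool-1/2/3 features of a pre-trained VGG-16 is shown below (PyTorch/torchvision assumed); the Gram-matrix normalization and the class name are assumptions, not the exact formulation of the patent.

    import torch
    import torch.nn as nn
    import torchvision.models as models

    class VGGPerceptualStyleLoss(nn.Module):
        def __init__(self):
            super().__init__()
            vgg = models.vgg16(pretrained=True).features.eval()
            for p in vgg.parameters():
                p.requires_grad_(False)
            self.vgg = vgg
            self.pool_ids = {4, 9, 16}            # indices of pool-1, pool-2, pool-3 in vgg16.features

        def _features(self, x):
            feats = []
            for i, layer in enumerate(self.vgg):
                x = layer(x)
                if i in self.pool_ids:
                    feats.append(x)
            return feats

        @staticmethod
        def _gram(f):
            b, c, h, w = f.shape
            f = f.view(b, c, h * w)
            return torch.bmm(f, f.transpose(1, 2)) / (c * h * w)   # normalization is an assumption

        def forward(self, pred, target):
            fp, ft = self._features(pred), self._features(target)
            l_per = sum(torch.mean(torch.abs(a - b)) for a, b in zip(fp, ft))      # perceptual term
            l_style = sum(torch.mean(torch.abs(self._gram(a) - self._gram(b)))     # style (Gram) term
                          for a, b in zip(fp, ft))
            return l_per, l_style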
The total variation loss in the texture loss is expressed as:

L_tv = Σ_{(i,j)∈Ω} || x̂^{i,j+1} - x̂^{i,j} ||_1 + Σ_{(i,j)∈Ω} || x̂^{i+1,j} - x̂^{i,j} ||_1

where Ω denotes the damaged region of the image. The total variation loss is a smoothness penalty defined on the one-pixel dilation of the missing region.
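A sketch of this smoothness penalty restricted to the missing region is given below; the mask convention (1 inside the hole) is an assumption.

    import torch

    def total_variation_loss(pred, hole_mask):
        # Differences between horizontally and vertically adjacent pixels.
        dh = torch.abs(pred[:, :, :, 1:] - pred[:, :, :, :-1])
        dv = torch.abs(pred[:, :, 1:, :] - pred[:, :, :-1, :])
        # Keep only pixel pairs that touch the missing region Ω (hole_mask = 1 inside the hole).
        mh = torch.clamp(hole_mask[:, :, :, 1:] + hole_mask[:, :, :, :-1], max=1.0)
        mv = torch.clamp(hole_mask[:, :, 1:, :] + hole_mask[:, :, :-1, :], max=1.0)
        return (dh * mh).mean() + (dv * mv).mean()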
The adversarial loss in the texture loss is expressed as:

L_adv = E[D(x̂)] - E[D(y)] + λ · E[ ( || ∇_{y'} D(y') ||_2 - 1 )² ]

where D denotes the discriminator and y' is obtained by randomly interpolating between a generated sample x̂ and a real sample y. In the present invention, λ is set to 10.
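The gradient-penalty term can be sketched as follows, using the usual random interpolation between real and generated samples; treating y' this way is an assumption about how the interpolated sample is formed.

    import torch

    def gradient_penalty(discriminator, real, fake, weight=10.0):
        eps = torch.rand(real.size(0), 1, 1, 1, device=real.device)
        y_prime = (eps * real + (1.0 - eps) * fake).requires_grad_(True)   # interpolated sample y'
        d_out = discriminator(y_prime)
        grads = torch.autograd.grad(outputs=d_out.sum(), inputs=y_prime, create_graph=True)[0]
        grad_norm = grads.view(grads.size(0), -1).norm(2, dim=1)
        return weight * ((grad_norm - 1.0) ** 2).mean()                    # λ (||∇ D(y')|| - 1)^2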
The total loss function of this embodiment is defined as:

L_total = Σ_{p∈P} L_struct^p + Σ_{q∈Q} L_text^q

where P and Q are the sets of decoder layers at which the structural and texture losses are computed.
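Combining the per-layer terms is then a simple sum over the selected decoder layers; the dictionary-based interface below is an illustrative assumption.

    def total_loss(struct_losses, texture_losses, P=(1, 2, 3, 4, 5, 6), Q=(1, 2, 3)):
        # L_total = sum of structural losses over layers in P plus texture losses over layers in Q.
        return sum(struct_losses[p] for p in P) + sum(texture_losses[q] for q in Q)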
The attention-based generative adversarial network mainly accomplishes the image completion task, and its final goal is to make the loss function L_total decrease and become stable.
The attention-based generative adversarial network is trained as follows:
Step S21: initializing the weight parameters of the network, where λ_rec, λ_per, λ_style, λ_tv and λ_adv are set to 6, 0.1, 240, 0.1 and 0.001, respectively, the batch size is 32, the learning rate is 10^-4, and P and Q are {1, 2, 3, 4, 5, 6} and {1, 2, 3}, respectively.
Step S22: inputting the combined damaged image and binary mask into the generator G for image completion, inputting the generated completed image and the real target image into the discriminator D, and iterating in turn until the total network loss L_total decreases and becomes stable.
It should be noted that, in the embodiment of the present disclosure, an encoder is used to extract features from the input data, a decoder decodes the obtained latent code into an image, and the dual attention fusion module outputs the final completed image. In this example, the encoder and the decoder each consist of 8 convolutional layers. The filter sizes of the convolutional layers in the encoder are 7, 5, 3, 3, 3, 3, 3 and 3, respectively; the filter sizes of the convolutional layers in the decoder are all 3. In this example, the feature maps are upsampled using conventional methods. The number of convolutional layers and the number and size of the filters in each convolutional layer can be selected and set according to the actual situation. The discriminator adopts a convolutional neural network structure that takes the real images and the generated completed images as input, and its output uses a patch-based adversarial loss to judge whether they are real or fake.
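A minimal alternating-training sketch consistent with steps S21 and S22 is shown below. The names generator, discriminator, loader and compute_l_total are placeholders assumed to be defined elsewhere, and gradient_penalty is the helper sketched above; none of these names come from the patent.

    import torch

    opt_g = torch.optim.Adam(generator.parameters(), lr=1e-4)
    opt_d = torch.optim.Adam(discriminator.parameters(), lr=1e-4)

    for damaged, mask, target in loader:                 # batch size 32 in the described setup
        completed = generator(damaged, mask)

        # Discriminator step: real vs. completed images plus the gradient penalty.
        d_loss = discriminator(completed.detach()).mean() - discriminator(target).mean()
        d_loss = d_loss + gradient_penalty(discriminator, target, completed.detach())
        opt_d.zero_grad(); d_loss.backward(); opt_d.step()

        # Generator step: weighted structural + texture objective (L_total).
        g_loss = compute_l_total(completed, target, mask, discriminator)
        opt_g.zero_grad(); g_loss.backward(); opt_g.step()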
The embodiment of the present disclosure uses the strong nonlinear fitting ability of the attention-based generative adversarial network for the image completion task: the local convolution layers exploit the prior information in the binary mask map. Second, the embodiment of the present disclosure provides a dual attention fusion module, which forms a multi-scale decoder that gradually increases the texture detail in the generated image. In particular, under the constraint of the applied loss functions, the network is able to produce high-quality images. Thus, a model capable of image completion can be trained with the network shown in FIG. 1. In the testing stage, the binary mask and the damaged image are likewise used as the input of the model, and the generated image completion result is obtained, as shown in FIG. 3.
Step S3: performing completion processing on the test data by using the trained attention-based generative adversarial network model.
To illustrate the specific implementation of the disclosed embodiment in detail and to verify the validity of the disclosed method, the method proposed in this embodiment is applied to four public databases (one face database and three natural-image databases): CelebA-HQ, ImageNet, Places2 and Paris StreetView. CelebA-HQ contains 30,000 high-quality face images. Places2 contains 365 scene categories, with more than 8,000,000 images in total. Paris StreetView contains 15,000 street-view images of Paris. ImageNet is a large data set with more than 14,000,000 images. For Places2, Paris StreetView and ImageNet, the original validation and test sets are used in the present invention. For CelebA-HQ, 28,000 images are randomly selected for training and the remaining images are used for testing. 60,000 binary mask maps are generated offline using a binary mask algorithm; 55,000 of them are randomly selected for training and the remaining 5,000 are used for testing (the binary mask maps are used to generate the damaged images). Using the attention-based generative adversarial network and the objective function designed in the present invention, the damaged image and the corresponding binary mask map are taken as input, and the deep neural network is trained through the adversarial game between the generator and the discriminator with gradient back-propagation. The weights of the different tasks are adjusted continuously during training until the network finally converges, yielding the image completion model.
To test the effectiveness of the model, the image completion operation is performed on the test set data, and the visualization results are shown in FIG. 3. This embodiment effectively demonstrates that the proposed method can generate high-quality images.
Finally, it should be noted that the above embodiments are only used for illustrating the technical solutions of the present disclosure and not for limiting, and although the present disclosure is described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications or equivalent substitutions can be made on the technical solutions of the present disclosure without departing from the spirit and scope of the technical solutions, which should be covered by the claims of the present disclosure.
Although the present disclosure has been described with reference to specific embodiments, it should be understood that the scope of the present disclosure is not limited thereto, and those skilled in the art will appreciate that various modifications and changes can be made without departing from the spirit and scope of the present disclosure.

Claims (8)

1. An image completion method based on an attention mechanism, characterized by comprising the following steps:
step S1, preprocessing the database image data, synthesizing a damaged image by using a binary mask, and combining the damaged image and the corresponding binary mask as the input data of a network model;
step S2, training on the input data to obtain a generative adversarial network model capable of image completion;
step S3, using the trained generative adversarial network model to perform completion processing on the test data;
wherein step S2 includes:
step S21: initializing network weight parameters for the image completion task, where the loss function of the generator is L_total and the loss function of the discriminator is L_D;
step S22: inputting the damaged image and the binary mask into the generator network G for the image completion task, inputting the generated completed image and the target image into the discriminator network D, and performing iterative training until both the generator loss L_total and the discriminator loss L_D decrease and become stable;
step S23: training the expression generation and removal tasks simultaneously until none of the loss functions decreases further, thereby obtaining the final generative adversarial network model;
wherein a dual attention fusion module in the attention mechanism fuses the known regions and the generated regions together, as follows: first, the channel-level statistics are obtained:

z_c = H_GP(f_c) = (1/(H·W)) Σ_{i=1..H} Σ_{j=1..W} f_c(i, j)

wherein z_c is the value of the c-th dimension of z, H_GP denotes the global pooling layer, and f_c denotes the c-th channel of the feature map F;
then, the dependencies between channels are obtained:

ω = f(W_U · δ(W_D · z))

wherein f and δ denote the sigmoid and ReLU activation functions, respectively, and W_U and W_D are parameters of convolutional layers; the obtained channel-wise information ω is used to adjust the weight of the input:

f̂_c = ω_c · f_c

wherein ω_c and f_c denote the scaling factor and the feature map of the c-th channel, respectively;
next, the attention map α is obtained by:

α = f(A(cat(x̂, x')))

wherein x' is the damaged image x rescaled to the current scale, A is a learnable transformation composed of several convolutional layers, x̂ and x' are first concatenated and then fed into the convolutional layers, and f is the sigmoid function that turns the response into the attention map α;
finally, the image completion result x̃ is obtained:

x̃ = B(α ⊙ x', (1 - α) ⊙ x̂)

wherein ⊙ and B denote the Hadamard product and a combining function, respectively.
2. The attention-based image completion method according to claim 1, wherein the database face image is identical in size to the natural image after the preprocessing in step S1; in the image completion task, a damaged image and a corresponding binary mask are combined as input, and an undamaged image is used as a real image label.
3. The attention-based image completion method according to claim 1, wherein the damaged image is denoted as x, the generated image as x̂, the target image as y, and the binary mask as M.
4. The attention-based image completion method according to claim 1, wherein the output value of the local convolution layer in the attention mechanism depends only on the undamaged region, which is mathematically described as follows:

F' = W^T (F ⊙ M) · (sum(1)/sum(M)) + b,  if sum(M) > 0
F' = 0,                                  otherwise

wherein ⊙ denotes pixel-level multiplication, 1 denotes a matrix whose elements are all 1 and whose shape is the same as M, W denotes the parameters of the convolutional layer, F the output feature map of the previous convolutional layer, b the bias of the convolutional layer, and M the corresponding binary mask map; sum(1)/sum(M) is a scaling factor that adjusts the weight of the known region;
the binary mask map M also needs to be updated after the local convolution is performed, which is mathematically described as follows:

M' = 1, if sum(M) > 0;  M' = 0, otherwise

that is, if the convolutional layer can produce an output from at least one valid input, the corresponding position in the binary mask is marked as 1.
5. The attention-based image completion method according to claim 1, wherein the loss function L_total of the generator is divided into a structural loss and a texture loss:

L_struct^k = λ_rec · L_rec^k
L_text^k = λ_per · L_per^k + λ_style · L_style^k + λ_tv · L_tv^k + λ_adv · L_adv^k

wherein the superscript k indicates that the loss is computed at the k-th layer of the decoder, L_struct denotes the structural loss, L_text the texture loss, L_rec the L_1 norm between images, L_per the perceptual loss, L_style the style loss, L_tv the total variation loss, L_adv the adversarial loss, and λ_rec, λ_per, λ_style, λ_tv and λ_adv are weighting factors.
6. An image completion apparatus based on an attention mechanism, comprising:
a data processing module, configured to preprocess the database image data, synthesize a damaged image by using a binary mask, and combine the damaged image and the corresponding binary mask as the input of a network model;
a model generation module, configured to obtain, through training, a generative adversarial network model capable of image completion;
an image completion module, configured to use the trained generative adversarial network model to perform completion processing on the test data;
wherein the generative adversarial network model is obtained through the following steps:
step S21: initializing network weight parameters for the image completion task, where the loss function of the generator is L_total and the loss function of the discriminator is L_D;
step S22: inputting the damaged image and the binary mask into the generator network G for the image completion task, inputting the generated completed image and the target image into the discriminator network D, and performing iterative training until both the generator loss L_total and the discriminator loss L_D decrease and become stable;
step S23: training the expression generation and removal tasks simultaneously until none of the loss functions decreases further, thereby obtaining the final generative adversarial network model;
wherein a dual attention fusion module in the attention mechanism fuses the known regions and the generated regions together, as follows: first, the channel-level statistics are obtained:

z_c = H_GP(f_c) = (1/(H·W)) Σ_{i=1..H} Σ_{j=1..W} f_c(i, j)

wherein z_c is the value of the c-th dimension of z, H_GP denotes the global pooling layer, and f_c denotes the c-th channel of the feature map F;
then, the dependencies between channels are obtained:

ω = f(W_U · δ(W_D · z))

wherein f and δ denote the sigmoid and ReLU activation functions, respectively, and W_U and W_D are parameters of convolutional layers; the obtained channel-wise information ω is used to adjust the weight of the input:

f̂_c = ω_c · f_c

wherein ω_c and f_c denote the scaling factor and the feature map of the c-th channel, respectively;
next, the attention map α is obtained by:

α = f(A(cat(x̂, x')))

wherein x' is the damaged image x rescaled to the current scale, A is a learnable transformation composed of several convolutional layers, x̂ and x' are first concatenated and then fed into the convolutional layers, and f is the sigmoid function that turns the response into the attention map α;
finally, the image completion result x̃ is obtained:

x̃ = B(α ⊙ x', (1 - α) ⊙ x̂)

wherein ⊙ and B denote the Hadamard product and a combining function, respectively.
7. The attention-based image completion apparatus according to claim 6, wherein the data processing module is configured to pre-process the database face image and the natural image to have the same size; in the image completion task, a damaged image and a corresponding binary mask are combined to be used as input, and an undamaged image is used as a real image label.
8. The attention-based image completion apparatus according to claim 6, wherein the model generation module is configured to: initialize network weight parameters for the image completion task, where the loss function of the generator is L_total and the loss function of the discriminator is L_D; input the damaged image and the binary mask into the generator network G for the image completion task, input the generated completed image and the target image into the discriminator network D, and perform iterative training until both the generator loss L_total and the discriminator loss L_D decrease and become stable; and train the expression generation and removal tasks simultaneously until none of the loss functions decreases further, thereby obtaining the final generative adversarial network model.
CN202011038187.6A 2020-09-28 2020-09-28 Attention mechanism-based image completion method and device Active CN112184582B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011038187.6A CN112184582B (en) 2020-09-28 2020-09-28 Attention mechanism-based image completion method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011038187.6A CN112184582B (en) 2020-09-28 2020-09-28 Attention mechanism-based image completion method and device

Publications (2)

Publication Number Publication Date
CN112184582A CN112184582A (en) 2021-01-05
CN112184582B true CN112184582B (en) 2022-08-19

Family

ID=73944421

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011038187.6A Active CN112184582B (en) 2020-09-28 2020-09-28 Attention mechanism-based image completion method and device

Country Status (1)

Country Link
CN (1) CN112184582B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112884673A (en) * 2021-03-11 2021-06-01 西安建筑科技大学 Reconstruction method for missing information between coffin chamber mural blocks of improved loss function SinGAN
CN113129234B (en) * 2021-04-20 2022-11-01 河南科技学院 Incomplete image fine restoration method based on intra-field and extra-field feature fusion
CN113221757B (en) * 2021-05-14 2022-09-02 上海交通大学 Method, terminal and medium for improving accuracy rate of pedestrian attribute identification
CN113962893B (en) * 2021-10-27 2024-07-09 山西大学 Face image restoration method based on multiscale local self-attention generation countermeasure network

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110288537A (en) * 2019-05-20 2019-09-27 湖南大学 Facial image complementing method based on the depth production confrontation network from attention
CN110458939A (en) * 2019-07-24 2019-11-15 大连理工大学 The indoor scene modeling method generated based on visual angle
CN111127346A (en) * 2019-12-08 2020-05-08 复旦大学 Multi-level image restoration method based on partial-to-integral attention mechanism

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018053340A1 (en) * 2016-09-15 2018-03-22 Twitter, Inc. Super resolution using a generative adversarial network
TWI682359B (en) * 2018-01-29 2020-01-11 國立清華大學 Image completion method

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110288537A (en) * 2019-05-20 2019-09-27 湖南大学 Facial image complementing method based on the depth production confrontation network from attention
CN110458939A (en) * 2019-07-24 2019-11-15 大连理工大学 The indoor scene modeling method generated based on visual angle
CN111127346A (en) * 2019-12-08 2020-05-08 复旦大学 Multi-level image restoration method based on partial-to-integral attention mechanism

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Dual Attention Matching Network for Context-Aware Feature Sequence Based Person Re-identification;Jianlou Si et al.;《2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition》;20181217;第5363-5372页 *
Local Class-Specific and Global Image-Level Generative Adversarial Networks for Semantic-Guided Scene Generation;Hao Tang et al.;《2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)》;20200805;第7867-7876页 *

Also Published As

Publication number Publication date
CN112184582A (en) 2021-01-05

Similar Documents

Publication Publication Date Title
CN112184582B (en) Attention mechanism-based image completion method and device
CN111681252B (en) Medical image automatic segmentation method based on multipath attention fusion
CN112686817B (en) Image completion method based on uncertainty estimation
CN112686816A (en) Image completion method based on content attention mechanism and mask code prior
CN111815523A (en) Image restoration method based on generation countermeasure network
CN112818764B (en) Low-resolution image facial expression recognition method based on feature reconstruction model
CN110728219A (en) 3D face generation method based on multi-column multi-scale graph convolution neural network
CN111861945B (en) Text-guided image restoration method and system
CN111986075B (en) Style migration method for target edge clarification
CN111242841A (en) Image background style migration method based on semantic segmentation and deep learning
CN111915522A (en) Image restoration method based on attention mechanism
CN109447897B (en) Real scene image synthesis method and system
CN112801914A (en) Two-stage image restoration method based on texture structure perception
CN116704079B (en) Image generation method, device, equipment and storage medium
CN117788629B (en) Image generation method, device and storage medium with style personalization
CN116777764A (en) Diffusion model-based cloud and mist removing method and system for optical remote sensing image
CN114821050A (en) Named image segmentation method based on transformer
CN110415261B (en) Expression animation conversion method and system for regional training
CN109829857B (en) Method and device for correcting inclined image based on generation countermeasure network
CN111368734A (en) Micro expression recognition method based on normal expression assistance
CN110599495A (en) Image segmentation method based on semantic information mining
CN110782503B (en) Face image synthesis method and device based on two-branch depth correlation network
CN117611428A (en) Fashion character image style conversion method
Yu et al. MagConv: Mask-guided convolution for image inpainting
CN114331894A (en) Face image restoration method based on potential feature reconstruction and mask perception

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant