CN114897738A - Image blind restoration method based on semantic inconsistency detection - Google Patents
Image blind restoration method based on semantic inconsistency detection
- Publication number
- CN114897738A (application number CN202210574618.3A / CN202210574618A)
- Authority
- CN
- China
- Prior art keywords
- image
- damaged
- region
- prediction
- residual
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06T 5/00 Image enhancement or restoration → G06T 5/77 Retouching; inpainting; scratch removal
- G06T 5/00 Image enhancement or restoration → G06T 5/70 Denoising; smoothing
- G06T 3/40 Scaling of whole images or parts thereof → G06T 3/4038 Image mosaicing
- G06N 3/04 Architecture, e.g. interconnection topology → G06N 3/045 Combinations of networks
- G06N 3/08 Learning methods → G06N 3/084 Backpropagation, e.g. using gradient descent
- G06T 2200/32 Indexing scheme involving image mosaicing
- G06T 2207/20081 Training; learning
- G06T 2207/20084 Artificial neural networks [ANN]
Abstract
The invention discloses an image blind restoration method based on semantic inconsistency detection, comprising the following steps: an image with noise pollution is preprocessed and used as input; a mask prediction network built from ring residual blocks amplifies the semantic difference between the contaminated region and the background and coarsely locates the degraded region in the polluted image; a mask refinement network then exploits the texture similarity among regions of different classes to obtain a fine prediction mask; the damaged image and the prediction mask are fed together into an image restoration network, which, guided by the confidence encoded in the mask, iteratively uses information from the valid region to complete the content of the damaged region; meanwhile, context attention aggregation modules at different scales improve structural consistency; finally the multi-scale features are fused and decoded back into an image, achieving blind repair of the degraded image. The method accurately detects noise pollution in real damaged images and meets the requirement of robust restoration across various kinds of degradation.
Description
Technical Field
The invention belongs to the field of computer graphics and image processing, and relates to an image blind restoration method based on semantic inconsistency detection.
Background
With the development of computer and multimedia technology, digital images have become important information carriers. Over time and under adverse storage conditions, photographs may suffer various kinds of degradation, such as ink contamination, creases and tears, mildew, and fading; in addition, mishaps at capture time, such as glare, unwanted objects entering the frame, or stains on the camera lens, can likewise spoil a picture. These degradations greatly impair the expression of image content. Image restoration techniques, which recover image content and improve image quality, have therefore developed rapidly in recent years and are widely used in image editing, object removal, biomedical image processing, criminal investigation, and other fields. Years of development have produced many important research results; widely used processing tools such as the Photoshop repair tool apply traditional restoration methods, exploiting the redundancy of image information to fill the damaged region with pixels from known regions. Such methods can produce good patches for scene images with repetitive textures, but cannot generate new content because they lack an understanding of image semantics.
Image restoration has become a major research focus in computer vision, and in recent years researchers have introduced deep learning methods into the field. Although these models can infer missing content from the provided valid pixels, they all assume that blank content in an image marks the damaged region, and they explicitly require a binary mask for calibration. Such methods can train a model to infer the content of the missing region; however, the degradation mode and location of a damaged image in real life are often unknown, and it is difficult to provide in advance an accurate mask delimiting the region to be repaired, which greatly limits the adoption of these methods in real scenes. How to identify and repair the damaged content given only the damaged image itself has therefore become an urgent open problem.
Disclosure of Invention
In order to overcome the above defects, the invention provides an image blind restoration method based on semantic inconsistency detection, comprising the following specific steps:
S1, inputting a damaged image I_m comprising a clean pixel region and a contaminated pixel region;
S2, constructing a mask prediction network from multiple layers of residual blocks and generating a single-channel coarse prediction soft mask that locates the damaged region;
S3, feeding the coarse prediction mask obtained in S2 together with the damaged image into a mask refinement network, which improves the prediction accuracy of regions such as boundaries through reinforcement learning and yields a fine damaged-region prediction mask;
S4, inputting the fine prediction mask obtained in S3 as prior information, together with the damaged image, into a shared encoder, which extracts features from the valid pixels under the guidance of the mask and propagates them to the damaged region;
S5, inputting the deep feature map extracted by the encoder into multi-task parallel decoding branches, which infer the content of the missing region through multiple convolution blocks and use context information to ensure global semantic consistency;
S6, fusing the features extracted by the different branches in S5 and decoding them through a decoder network to recover an image;
S7, using the fine prediction mask from S3 to cut out the pixels at the damaged positions in the S6 result, splicing them with the valid pixels of the damaged image, and outputting the final repaired image.
The technical scheme of the invention is characterized by comprising the following steps:
With respect to step S1, the invention first defines the damaged image: instead of simply using blank pixels to represent the region to be repaired as in prior work, the invention considers that a damaged image consists of clean valid pixels plus degraded and contaminated pixels of various types. Because no data set dedicated to blind-repair research currently exists, batch training data is first synthesized along these lines for model training, with the mathematical expression:
I_m = I_gt ⊙ (1 − M) + N ⊙ M (1)
In formula (1), I_m denotes the spliced damaged image, I_gt a completely clean image, N the contaminating noise content, M the binarization mask, and ⊙ element-wise multiplication. To improve the robustness of the method, N simulates graffiti, creases, text occlusion, randomly cropped content from other images, and so on, and is spliced onto I_gt to generate a damaged image I_m containing multiple types of contamination and degradation.
Preferably, in step S1, in order to blend the pollution noise more naturally into the original image, a Gaussian smoothing operation is applied:
I = I_m * G_σ (2)
In formula (2), I denotes the smoothed damaged image, I_m the directly spliced damaged image, and G_σ a two-dimensional Gaussian kernel with standard deviation σ; * denotes convolution.
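The synthesis in formulas (1) and (2) can be sketched in NumPy. This is a minimal illustration under the definitions above, not the patented implementation; the function names and the small 8 × 8 arrays are purely illustrative, and the separable blur stands in for the two-dimensional Gaussian kernel G_σ:

```python
import numpy as np

def compose_damaged(i_gt, noise, mask):
    """Formula (1): splice noise into the clean image where mask == 1."""
    return i_gt * (1 - mask) + noise * mask

def gaussian_kernel(sigma, radius=2):
    x = np.arange(-radius, radius + 1, dtype=float)
    k = np.exp(-x**2 / (2 * sigma**2))
    return k / k.sum()

def smooth(img, sigma=1.0):
    """Formula (2): separable 2-D Gaussian blur, edges padded by reflection."""
    k = gaussian_kernel(sigma)
    pad = len(k) // 2
    p = np.pad(img, pad, mode="reflect")
    # horizontal pass, then vertical pass
    rows = np.apply_along_axis(lambda r: np.convolve(r, k, mode="valid"), 1, p)
    return np.apply_along_axis(lambda c: np.convolve(c, k, mode="valid"), 0, rows)

i_gt = np.ones((8, 8))                      # clean image I_gt
noise = np.zeros((8, 8))                    # contaminating content N
mask = np.zeros((8, 8)); mask[2:5, 2:5] = 1 # binarization mask M
i_m = smooth(compose_damaged(i_gt, noise, mask), sigma=1.0)
```

After smoothing, the hole is no longer a hard-edged patch but blends gradually into the surrounding clean pixels, which is the stated purpose of formula (2).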
For step S2, the invention uses an improved ring residual convolution block as the feature extractor, locating the damaged region by amplifying the difference between the valid pixel region and the contaminated region and comparing the intrinsic properties of different image regions. The ring residual block comprises three stages; it is designed after the recall-and-consolidation mechanism of the human brain and realized through residual propagation and feedback in a CNN. The first stage is forward residual propagation, which alleviates gradient degradation in deeper networks by recalling the input feature information:
y_f = F(x, {W_i}) + W_s * x (3)
In formula (3), x denotes the input feature map and y_f the propagated output feature. F(x, {W_i}) represents the learned residual mapping, whose structure comprises two convolution layers and an ELU activation function, and W_s is a 1 × 1 convolution. Residual propagation resembles the memory mechanism of the human brain: earlier knowledge may be forgotten as the model learns new knowledge, so a recall mechanism is needed to evoke those fading memories.
To further enhance the difference between the corrupted and valid content attributes, the second stage integrates the input feature information using residual feedback. A simple gating mechanism learns the nonlinear relations between distinguishable feature channels, avoiding the diffusion of feature information; a response value produced by an activation function is superimposed on the input feature, amplifying the difference in intrinsic image attributes between the noise region and the valid region:
y_b = (s(G(y_f)) + 1) * x (4)
In formula (4), x is the input feature map, y_b the residual feedback feature, G(·) a linear mapping, and s the activation function, here the sigmoid function. Unlike the recall mechanism simulated by residual propagation, residual feedback simulates the process by which the human brain consolidates knowledge into a new understanding of the features. The third stage repeats the operation of the first stage, propagating the residual over the new features so that the amplified feature differences are learned further. Two forward residual propagations combined with one backward residual feedback form the ring residual structure.
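The propagate–feedback–propagate cycle of formulas (3) and (4) can be sketched as a toy NumPy class. This is a hedged illustration only: the convolutions of F and the 1 × 1 shortcut W_s are modelled as per-channel linear maps, which keeps the three-stage ring structure visible without reproducing the actual network:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def elu(z, alpha=1.0):
    return np.where(z > 0, z, alpha * (np.exp(z) - 1.0))

class RingResidualBlock:
    """Toy ring residual block: propagate -> feedback -> propagate.
    Convolutions are stood in for by per-channel linear maps (illustrative)."""
    def __init__(self, channels, seed=0):
        rng = np.random.default_rng(seed)
        self.w1 = rng.normal(0, 0.1, (channels, channels))  # first conv of F
        self.w2 = rng.normal(0, 0.1, (channels, channels))  # second conv of F
        self.ws = np.eye(channels)                          # W_s shortcut
        self.g = rng.normal(0, 0.1, (channels, channels))   # linear map G

    def propagate(self, x):
        # formula (3): y_f = F(x, {W_i}) + W_s * x
        return elu(x @ self.w1) @ self.w2 + x @ self.ws

    def feedback(self, x, y_f):
        # formula (4): y_b = (s(G(y_f)) + 1) * x  — gate lies in (1, 2)
        return (sigmoid(y_f @ self.g) + 1.0) * x

    def forward(self, x):
        y_f = self.propagate(x)        # stage 1: recall
        y_b = self.feedback(x, y_f)    # stage 2: consolidate
        return self.propagate(y_b)     # stage 3: recall again

block = RingResidualBlock(channels=4)
x = np.random.default_rng(1).normal(size=(16, 4))  # 16 'pixels', 4 channels
y = block.forward(x)
```

Note the gate value (s(·) + 1) always lies strictly between 1 and 2, so the feedback stage can only amplify the input features, never suppress them to zero.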
For step S3, the invention introduces an attention mechanism to refine the coarse prediction mask, improving the recognition result at details such as contours by attending to similar textures across the whole image. In particular, if a region predicted as damaged with low confidence shares a similar texture with a high-confidence region, the low-confidence prediction should be revised. To this end, key features of the damaged content must be extracted from the high-confidence region to serve as the global visual features of that class. The method computes cosine similarity over the coarse prediction mask as a new bias and uses Softmax to suppress the score map of the predicted region; regions whose scores remain high after suppression can be regarded as sufficiently distinctive, so key features are extracted there as the global features of the damaged region:
X = CosSim(x'_sem), X ∈ R^{c×c} (5)
In formula (5), CosSim(·) denotes a modified cosine-similarity function, x'_sem the prediction weight matrix, and i and j the prediction classes, which divide into damaged and undamaged regions. X_{i,j} is the cosine similarity between pixels of different prediction classes, and x'^i_sem is the i-th channel of x'_sem, indicating for each pixel the prediction result belonging to class i. The closer X_{i,j} is to 1, the more similar the activation results of x'^i_sem and x'^j_sem, and the less trustworthy the prediction at that location. By setting the bias of same-class pixels to 0 and the bias of different-class pixels to the similarity score X_{i,j}, the regions that still keep a high activation value after classification are the key features; the whole process is called key-feature pooling.
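The channel-wise cosine similarity of formula (5) can be illustrated in NumPy. A hedged sketch under the definitions above: the two six-pixel score vectors are invented purely to show that overlapping class predictions yield a higher off-diagonal similarity, i.e. less trustworthy locations:

```python
import numpy as np

def cos_sim_matrix(x_sem):
    """Formula (5) sketch: pairwise cosine similarity between the c channel
    maps of a (c, h*w) prediction weight matrix, giving a (c, c) matrix."""
    n = x_sem / (np.linalg.norm(x_sem, axis=1, keepdims=True) + 1e-8)
    return n @ n.T

# two prediction channels over 6 pixels: damaged vs. background scores
damaged    = np.array([0.9, 0.8, 0.1, 0.1, 0.7, 0.2])
background = np.array([0.1, 0.2, 0.9, 0.9, 0.3, 0.8])
X = cos_sim_matrix(np.stack([damaged, background]))
```

Here X[0, 1] is small because the two channels mostly disagree; if a pixel scored highly in both channels, X[0, 1] would approach 1, flagging the prediction there as unreliable.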
Preferably, in step S3, the prediction weight matrix x'_sem and the feature map x_f are combined in a weighted sum to obtain the key feature v_k:
v_k = Σ_i x'^i_sem ⊙ x_f (6)
where i indexes the prediction classes. The merged key feature v_k serves as the Key and the feature x_f as the Query; highlighting the regions similar to v_k yields an attention map, which is convolved with the original features to predict the final refined prediction mask.
For step S4, the invention introduces a gated convolution mechanism to improve the residual convolution blocks: damaged regions are identified by learning, and the valid pixel content of the image is selected dynamically so that the convolution result depends only on valid pixels. This replaces the conventional residual convolution structure for feature extraction and integration over the valid region. The output of the gated convolution is calculated as:
Gating_{y,x} = Σ Σ W_g · I
Feature_{y,x} = Σ Σ W_f · I (7)
O_{y,x} = φ(Feature_{y,x}) ⊙ σ(Gating_{y,x})
In formula (7), I denotes the input feature, W_g and W_f two different convolution kernels, φ the LeakyReLU activation function, and σ the sigmoid function, which restricts all gating values to [0, 1] to indicate the importance of each local region; ⊙ denotes element-wise multiplication, and O_{y,x} is the output feature under the soft gating weights.
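Formula (7) can be sketched with 1 × 1 kernels, which reduce the two convolutions to matrix products while keeping the content/gating split intact. A minimal illustration, not the network's actual layers:

```python
import numpy as np

def leaky_relu(z, slope=0.2):
    return np.where(z > 0, z, slope * z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_conv_1x1(feat, w_f, w_g):
    """Formula (7) with 1x1 kernels: O = phi(Feature) ⊙ sigma(Gating).
    feat: (h, w, c_in); w_f, w_g: (c_in, c_out)."""
    feature = feat @ w_f           # content branch (W_f)
    gating = feat @ w_g            # soft-mask branch (W_g), squashed to (0, 1)
    return leaky_relu(feature) * sigmoid(gating)

rng = np.random.default_rng(0)
feat = rng.normal(size=(8, 8, 3))
w_f = rng.normal(size=(3, 2))
w_g = rng.normal(size=(3, 2))
out = gated_conv_1x1(feat, w_f, w_g)
```

Because the sigmoid gate lies in (0, 1), every output activation is strictly attenuated relative to the ungated content branch — this is exactly how the gate can learn to silence contaminated pixels.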
Preferably, in step S4, to avoid error accumulation in the prediction mask affecting the image restoration result, a new Probabilistic Context Normalization (PCN) performs statistical-information transfer at the end of the improved residual block, propagating statistics such as the mean and variance of the valid pixel region to the damaged region so that the feature distributions of the inner and outer regions remain consistent:
X̄ = β · T(X_P, X_Q) ⊙ H + X ⊙ (1 − H) + (1 − β) · X ⊙ H (8)
In formula (8), X denotes the output of the last convolution layer in the gated residual block, H the prediction mask downsampled to the same size as X, and β a learnable channel attention weight; T(·,·) denotes the information transfer, specifically:
T(X_P, X_Q) = σ(X_Q) · (X_P − μ(X_P)) / σ(X_P) + μ(X_Q) (9)
In formula (9), X_P and X_Q denote the contaminated region and the valid pixel region respectively, μ(·) the regional mean, and σ(·) the regional variance. For an image, the feature mean relates to global semantics while the variance relates to local texture.
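The transfer in formula (9) — renormalizing the contaminated region to the statistics of the valid region — can be sketched directly in NumPy. A hedged illustration (the full PCN with its learnable β is not reproduced; only the μ/σ transfer is shown):

```python
import numpy as np

def transfer_stats(x, mask, eps=1e-6):
    """Formula (9) sketch: give the contaminated region X_P (mask == True)
    the mean and standard deviation of the valid region X_Q."""
    p, q = x[mask], x[~mask]
    transferred = (p - p.mean()) / (p.std() + eps) * q.std() + q.mean()
    out = x.copy()
    out[mask] = transferred
    return out

rng = np.random.default_rng(0)
x = rng.normal(loc=0.0, scale=1.0, size=(32, 32))
mask = np.zeros((32, 32), dtype=bool)
mask[8:20, 8:20] = True
x[mask] = x[mask] * 5.0 + 3.0   # corrupted region has shifted statistics
y = transfer_stats(x, mask)
```

After the transfer, the masked region matches the valid region's mean and variance while the valid pixels are untouched — the distribution-consistency property the text describes.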
For step S5, the invention obtains image context information with a multi-scale context attention aggregation branch. Context similarity is computed as the cosine similarity between patches inside and outside the missing region: for each patch of the region to be completed, the most similar content in the valid region is found and given a higher reference weight, so that the completed content stays consistent with its context in both semantics and texture. The similarity metric is:
s̃_{i,j} = ⟨ p_i / ||p_i||, p_j / ||p_j|| ⟩ (10)
In formula (10), p_i and p_j denote feature patches of the valid and missing regions respectively. The attention score of each patch is then obtained through a softmax function:
s_{i,j} = exp(s̃_{i,j}) / Σ_{k=1}^{N} exp(s̃_{k,j}) (11)
where N denotes the number of patches into which the valid region is divided. Through this calculation, each patch in the missing region finds the region of the valid pixels most worth attending to and assigns it a higher reference weight during feature fusion.
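Formulas (10) and (11) over flattened patches can be sketched as follows. A minimal NumPy illustration with invented toy patches; the second "missing" patch being a near-copy of valid patch 3 shows how the softmax concentrates attention on the best match:

```python
import numpy as np

def attention_scores(valid_patches, missing_patches):
    """Formulas (10)-(11) sketch: cosine similarity between flattened
    patches, then softmax over the N valid patches per missing patch."""
    v = valid_patches / (np.linalg.norm(valid_patches, axis=1, keepdims=True) + 1e-8)
    m = missing_patches / (np.linalg.norm(missing_patches, axis=1, keepdims=True) + 1e-8)
    sim = m @ v.T                                   # (n_missing, N) cosine scores
    e = np.exp(sim - sim.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)         # each row sums to 1

rng = np.random.default_rng(0)
valid = rng.normal(size=(10, 9))    # N = 10 valid 3x3 patches, flattened
missing = np.vstack([
    valid[3] + 0.01 * rng.normal(size=9),  # near-copy of valid patch 3
    rng.normal(size=9),                     # unrelated patch
])
s = attention_scores(valid, missing)
```

The first missing patch puts its largest attention weight on valid patch 3, which is exactly the "most worth attending to" region the completion branch copies from.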
Preferably, in step S5, to reduce computation and increase inference speed, the inter-patch attention similarity scores are propagated by context-information transfer. Specifically, the similarity scores are computed once on the deep feature map of size 32 × 32, and the attention scores are then propagated by context attention transfer to shallower layers of different scales for feature weighting:
p̂^l_j = Σ_{i=1}^{N} s_{i,j} · p^l_i (12)
In formula (12), l indexes the different shallow network layers, p̂^l_j denotes the missing-region patch at a given scale, p^l_i the valid-region patch of corresponding size, s_{i,j} the attention score, and N the number of patches in the background. Since the feature-map size varies across the hierarchy, the patch size must vary accordingly: the mapping region is enlarged by comparing the current feature-map size with the attention score map; for example, every four neighbouring pixels in a feature map of size 128 × 128 share one attention score value. Through this attention-score sharing, the model not only achieves better global semantic consistency but also markedly improves storage and computation efficiency.
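The score-sharing step can be sketched as a nearest-neighbour upsampling of the 32 × 32 score map: each deep-layer score is simply repeated over the block of shallow-layer pixels it maps to. A hedged illustration of the sharing scheme only, not the full attention-transfer module:

```python
import numpy as np

def share_attention(scores, target_size):
    """Sketch of context attention transfer: a score computed on the
    32x32 deep feature map is shared by the whole block of pixels it
    maps to at a larger scale (nearest-neighbour repetition)."""
    factor = target_size // scores.shape[0]
    return np.repeat(np.repeat(scores, factor, axis=0), factor, axis=1)

scores = np.arange(32 * 32, dtype=float).reshape(32, 32)  # per-patch scores
shared = share_attention(scores, 128)                     # weight a 128x128 layer
```

Because the scores are computed once and merely repeated, the shallow layers get their attention weights at the cost of an index lookup rather than a fresh similarity pass.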
The image blind restoration method based on semantic inconsistency detection solves two problems the prior art cannot: repairing damaged images with multiple degradation modes in real scenes, and operating without a calibration mask, which is difficult to obtain directly. It has the following advantages:
(1) Compared with existing repair methods, the invention designs an end-to-end network model that needs no mask calibrating the damaged region: it automatically identifies the polluted and damaged regions in the image and produces semantically consistent, visually complete repairs, handling various damage modes in real images with robustness and fidelity.
(2) The method readily extends to other image-processing research fields, such as object removal, highlight removal, and image deraining and dehazing, and has good transferability and applicability.
Drawings
FIG. 1 is a flow chart of blind image restoration based on semantic inconsistency detection according to the present invention.
FIG. 2 is a schematic diagram of a prediction mask refinement module according to the present invention.
FIG. 3 is a diagram illustrating a structure of a probabilistic context aggregation convolutional block according to the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and specific embodiments.
FIG. 1 shows the flow chart of the image blind restoration method based on semantic inconsistency detection of the present invention; the method includes:
S1, data preprocessing: a damaged image I_m with noise pollution is read, uniformly resized to 256 × 256, normalized, and then fed into the network model. In the training stage, damaged images are synthesized by simulating the various degradation modes of real scenes, and an additional Gaussian smoothing operation makes the images more realistic and natural.
S2, coarse prediction of the damaged region: the processed degraded image is fed into a coarse mask prediction network built from six layers of ring residual blocks. The overall structure is an encoder-decoder network that integrates image context information by convolution to learn the intrinsic image attributes; the alternating ring structure of residual propagation and residual feedback amplifies the difference between the valid pixel region and the damaged region and generates a single-channel coarse damaged-region prediction mask. When computing the loss in the training stage, since each position only needs to be judged as belonging to the valid region or the damaged region, binary cross-entropy is used as the loss function:
L_mask = −T Σ_p log M̂_p − Σ_q log(1 − M̂_q) (13)
In formula (13), T is an adaptive weight, M̂ the predicted soft mask, p ∈ {p | M_p = 1} the real damaged region, and q ∈ {q | M_q = 0} the real valid region.
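The weighted binary cross-entropy of formula (13) can be sketched in NumPy. A hedged illustration under the symbol definitions above; the normalization by the total pixel count and the toy 4 × 4 masks are assumptions for the example:

```python
import numpy as np

def weighted_bce(pred, gt, t=1.0, eps=1e-7):
    """Formula (13) sketch: binary cross-entropy over the predicted soft
    mask, with adaptive weight t on the real damaged pixels (gt == 1)."""
    pred = np.clip(pred, eps, 1 - eps)     # avoid log(0)
    damaged = gt == 1
    loss_p = -t * np.log(pred[damaged])    # damaged pixels p
    loss_q = -np.log(1 - pred[~damaged])   # valid pixels q
    return (loss_p.sum() + loss_q.sum()) / gt.size

gt = np.zeros((4, 4)); gt[1:3, 1:3] = 1       # ground-truth mask M
good = np.where(gt == 1, 0.9, 0.1)            # confident, mostly correct
bad = np.where(gt == 1, 0.1, 0.9)             # confidently wrong
```

A confident correct prediction scores a far lower loss than a confidently wrong one, and a perfect mask drives the loss toward zero.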
S3, refining the prediction mask: the coarse prediction mask generated in S2 and the damaged image are fed into the mask refinement network. As shown in fig. 2, image features are first extracted by a simple encoder, the cosine similarity between pixels predicted as different classes is computed, and a softmax function limits the values to [0, 1]; the closer a value is to 1, the less reliable the prediction class of that area. Key features of the high-confidence damaged region are screened out as the Key, and the overall image feature serves as the Query; traversing it in the query fashion of the attention mechanism yields the global attention weights. Finally the updated feature information is integrated by deconvolution and restored to image resolution, giving a refined prediction mask with clearer and more accurate detail contours.
S4, content feature extraction: the damaged image is fed into the encoder; to avoid the effect of error accumulation in the prediction mask, the refined prediction mask is simultaneously scaled to the size of each feature map and fed into every encoder layer to guide the extraction of valid-pixel information and its transmission to the damaged region. The encoder consists of four layers of the newly designed gated residual convolution blocks, whose structure is shown in fig. 3: the outputs of two task-specific standard convolution layers are multiplied element by element, one followed by a LeakyReLU function and the other by a sigmoid function, so that a soft mask is learned and updated automatically from the input and the convolution is restricted to the valid pixel region. In addition, probabilistic context normalization replaces batch normalization, transferring the image statistics and keeping the feature distributions inside and outside the mask consistent.
S5, inferring the missing content: the invention proposes a multi-task parallel framework with two parallel decoding branches for feature reasoning and content propagation. As shown in fig. 1, the upper branch is formed by multiple dilated convolution layers with dilation rates of 2, 4 and 8, whose differing rates enlarge the receptive field to capture multi-scale context information; the lower branch uses the multi-scale context attention integration module, computing attention scores between patches on the 32 × 32 deep feature map and weighting the shallower layers of different scales through the context attention transfer module, thereby ensuring global structural and semantic consistency of the features.
S6, feature decoding and image restoration: the feature maps extracted by the different branches in S5 are concatenated along the channel dimension and fed into the decoder network. The decoder is structurally symmetric to the encoder, alternately fusing four layers of gated residual convolution blocks with upsampling, and finally a 3 × 3 plain convolution layer restores the predicted repaired image.
S7, outputting the final repair result: to make the result cleaner, the prediction mask selects the valid content of the input image and the content of the prediction result for splicing, and after smoothing, a clean repair result with complete structure and consistent semantics is output.
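The final splicing in S7 is the mask-guided composite of input and prediction. A minimal NumPy sketch (the smoothing pass is omitted, and the 4 × 4 arrays are illustrative):

```python
import numpy as np

def composite_output(damaged, restored, mask):
    """S7 sketch: keep valid pixels from the input image, take the network
    prediction only inside the predicted damaged region (mask == 1)."""
    return damaged * (1 - mask) + restored * mask

damaged = np.full((4, 4), 0.8); damaged[1:3, 1:3] = 0.0  # hole at the centre
restored = np.full((4, 4), 0.5)                          # network output
mask = np.zeros((4, 4)); mask[1:3, 1:3] = 1.0            # refined mask
final = composite_output(damaged, restored, mask)
```

Only the hole is filled from the prediction; every valid pixel of the input survives unchanged, which is why the output stays sharp outside the repaired region.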
In conclusion, the image blind restoration method based on semantic inconsistency detection is suited to repairing genuinely damaged images from real life: it needs no additional binarization mask marking the damaged regions, achieves high-quality restoration of degraded images through an end-to-end network, guarantees visual integrity and structural plausibility of the result, robustly handles diverse image degradations and contaminations across real scenes, and has broad application value.
While the present invention has been described in detail with reference to the preferred embodiments, it should be understood that the above description should not be taken as limiting the invention. Various modifications and alterations to this invention will become apparent to those skilled in the art upon reading the foregoing description. Accordingly, the scope of the invention should be determined from the following claims.
Claims (2)
1. An image blind restoration method based on semantic inconsistency detection is characterized by comprising the following specific steps:
S1, inputting a damaged image I_m comprising a clean pixel region and a damaged pixel region;
S2, constructing a mask prediction network from multiple layers of residual blocks and generating a single-channel coarse prediction soft mask for locating the damaged region;
S3, feeding the coarse prediction mask obtained in S2 together with the damaged image into a mask refinement network, which improves the prediction accuracy of regions such as boundaries through reinforcement learning and yields a fine damaged-region prediction mask;
S4, inputting the fine prediction mask obtained in S3 as prior information and the damaged image into a shared encoder, extracting the characteristics of effective pixels according to the guidance of the mask and transmitting the characteristics to a damaged area;
s5, inputting the deep characteristic diagram extracted by the encoder network into a multi-task parallel decoding branch, speculating the content of the missing area through a plurality of layers of rolling blocks, and ensuring the global semantic consistency by using context information;
s6, fusing the features extracted from different branches in S5, decoding by a decoder network, and recovering into an image;
and S7, utilizing the fine prediction mask in S3, cutting out pixels at the position of the damaged area in the S6 result, splicing the pixels with effective pixels in the damaged image, and outputting a final repaired image.
2. The image blind restoration method based on semantic inconsistency detection according to claim 1, wherein for step S1 the damaged image is first defined. Unlike existing research, which simply uses blank pixels to represent the region to be restored, the present invention considers that a damaged image should consist of clean valid pixels and degraded or contaminated pixels of various types. Because no dataset dedicated to blind restoration research currently exists, batch training data are first synthesized along this line for model training, with the mathematical expression:

I_m = I_gt ⊙ (1 − M) + N ⊙ M (1)

In formula (1), I_m denotes the stitched damaged image, I_gt a completely clean image, N the contaminating noise content, M the binarization mask, and ⊙ element-wise multiplication. To improve the robustness of the method, N simulates graffiti, creases, text occlusion, randomly cropped content from other images, and the like, which is stitched onto I_gt to generate a damaged image I_m containing multiple types of contamination and degradation.
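A minimal numpy sketch of the composition described for formula (1): contamination content is pasted into a clean image wherever the binary mask is set. The function name and toy data are illustrative, not from the patent.

```python
import numpy as np

rng = np.random.default_rng(0)

def synthesize_damaged(clean, noise, mask):
    """I_m = I_gt * (1 - M) + N * M, as described for formula (1).

    clean: ground-truth image I_gt; noise: contamination content N
    (graffiti, text, patches from other images, ...); mask: M, with
    1 marking damaged pixels.
    """
    m = mask.astype(np.float32)
    return clean * (1.0 - m) + noise * m

clean = np.full((4, 4), 0.5, dtype=np.float32)
noise = rng.random((4, 4)).astype(np.float32)
mask = np.zeros((4, 4), dtype=np.float32)
mask[1:3, 1:3] = 1.0   # a small square of contamination
damaged = synthesize_damaged(clean, noise, mask)
```

Outside the masked square the result equals the clean image; inside it equals the noise content.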
Preferably, in step S1, to blend the contaminating noise into the original image more naturally, the present invention applies Gaussian smoothing:

I = I_m * G_σ (2)

In formula (2), I denotes the smoothed damaged image, I_m the directly stitched damaged image, and G_σ a two-dimensional Gaussian kernel with standard deviation σ; * denotes convolution.
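A minimal sketch of the smoothing in formula (2), implemented as a separable Gaussian convolution in plain numpy. The truncation radius of 3σ and the edge padding are assumptions for the sketch, not details fixed by the patent.

```python
import numpy as np

def gaussian_kernel1d(sigma, radius):
    """Normalized 1-D Gaussian kernel of width 2*radius + 1."""
    x = np.arange(-radius, radius + 1, dtype=np.float64)
    k = np.exp(-0.5 * (x / sigma) ** 2)
    return k / k.sum()

def smooth(image, sigma=1.0):
    """Separable 2-D Gaussian blur (formula (2)), edge-padded."""
    radius = int(3 * sigma)
    k = gaussian_kernel1d(sigma, radius)
    padded = np.pad(image, radius, mode="edge")
    # filter rows, then columns; "valid" mode undoes the padding
    tmp = np.apply_along_axis(lambda r: np.convolve(r, k, mode="valid"), 1, padded)
    return np.apply_along_axis(lambda c: np.convolve(c, k, mode="valid"), 0, tmp)

blurred = smooth(np.ones((5, 5)), sigma=1.0)
```

Because the kernel is normalized, a constant image passes through unchanged, which is a quick sanity check on the implementation.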
For step S2, the present invention uses an improved ring residual convolution block as the feature extractor, locating the damaged region by enlarging the difference between the valid pixel region and the contaminated region and comparing the intrinsic properties of different regions of the image. The ring residual block comprises three stages; its design borrows the recall-and-consolidation mechanism of the human brain, realized through the propagation and feedback of residuals in a CNN. The first stage is forward residual propagation, which alleviates gradient degradation in deeper networks by recalling the input feature information, defined as:

y_f = F(x, {W_i}) + W_s * x (3)

In formula (3), x denotes the input feature map and y_f the output of residual propagation. F(x, {W_i}) denotes the learned residual mapping, whose structure comprises two convolution layers and an ELU activation function, and W_s is a 1 × 1 convolution. Residual propagation resembles the memory mechanism of the human brain: earlier knowledge may be forgotten as the model learns more new knowledge, so a recall mechanism is needed to help evoke those previously blurred memories.
To further enhance the difference between the attributes of damaged and valid content, the second stage integrates the input feature information using residual feedback. A simple gating mechanism learns the nonlinear relations between distinguishable feature channels, preventing the feature information from diffusing; a response value produced by an activation function is superimposed on the input feature, amplifying the difference in intrinsic image attributes between the noise region and the valid region:

y_b = (s(G(y_f)) + 1) * x (4)

In formula (4), x is the residual mapping feature, y_b the residual feedback feature, G(·) a linear mapping, and s the activation function, here a sigmoid. Unlike the recall mechanism simulated by residual propagation, residual feedback resembles the human brain's process of consolidating knowledge, forming a new understanding of the features. The third stage repeats the operation of the first stage, performing residual propagation on the new features so that the amplified feature differences are learned further. Two forward residual propagations combined with one backward residual feedback form the ring residual structure.
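The three-stage ring (propagate, feed back, propagate again) can be sketched as follows. The learned layers F(x, {W_i}) and G(·) are passed in as plain callables, and W_s is taken as the identity shortcut; all names are stand-ins for the patent's learned convolutions, not its actual implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def ring_residual_block(x, residual_fn, gate_fn):
    """Ring residual structure: propagate -> feedback -> propagate.

    residual_fn plays the role of F(x, {W_i}) (two convs + ELU in the
    patent) and gate_fn the linear mapping G of formula (4); both are
    illustrative stand-ins for learned layers.  W_s is the identity.
    """
    y_f = residual_fn(x) + x                   # formula (3): recall
    y_b = (sigmoid(gate_fn(y_f)) + 1.0) * x    # formula (4): consolidate
    return residual_fn(y_b) + y_b              # second propagation

# With zero residual and zero gate logits, the gate contributes
# sigmoid(0) + 1 = 1.5, so the block scales its input by 1.5.
out = ring_residual_block(np.ones(3), lambda t: 0 * t, lambda t: 0 * t)
```

The gate in formula (4) always outputs a factor in (1, 2), so residual feedback can only amplify, never suppress, the input feature, which matches the stated goal of enlarging region differences.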
For step S3, the present invention introduces an attention mechanism to refine the rough prediction result, attending to similar textures over the whole image and improving recognition at details such as contours. In particular, if a region predicted as damaged with low confidence shares a similar texture with a high-confidence region, the low-confidence prediction should be revised. To this end, key features of the damaged content must be extracted from the high-confidence region to serve as the global visual features of that class. The method computes cosine similarity on the rough prediction mask as a new bias and reduces the score map of the prediction region with Softmax; regions whose scores remain high after the reduction can be regarded as regions with sufficiently distinctive features, so key features can be extracted from them as the global features of the damaged region. The calculation formula is:

X_{i,j} = CosSim(x'^i_sem, x'^j_sem), X ∈ R^{c×c} (5)

In formula (5), CosSim(·) denotes the improved cosine-similarity function, x'_sem the prediction weight matrix, and i and j the prediction categories, which divide into damaged and undamaged regions. X_{i,j} denotes the cosine similarity between the activations of different prediction categories, where x'^i_sem, the i-th channel of x'_sem, indicates for each pixel the prediction result belonging to category i. The closer X_{i,j} is to 1, the more similar the activation results of x'^i_sem and x'^j_sem, and the less trustworthy the prediction at that location. By setting the bias of same-category pixels to 0 and the bias of different-category pixels to the similarity score X_{i,j}, the regions that still maintain high activation values under classification are the key features; the whole process is called key-feature pooling.
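One possible reading of the key-feature pooling step is sketched below: channel-wise cosine similarity supplies a bias that is subtracted from the class scores before a softmax, so only pixels with distinctive activations keep high scores. The exact construction of the penalty term is an assumption for illustration, not the patent's precise procedure.

```python
import numpy as np

def key_feature_pool(x_sem):
    """Suppress ambiguous predictions via a cosine-similarity bias.

    x_sem: (c, n) per-class activation maps flattened over n pixels.
    ASSUMPTION: the inter-class similarity of formula (5) is summed
    into a per-class penalty; the patent may combine it differently.
    """
    norm = x_sem / (np.linalg.norm(x_sem, axis=1, keepdims=True) + 1e-8)
    cos = norm @ norm.T                        # formula (5): X in R^{c x c}
    bias = cos - np.diag(np.diag(cos))         # same class -> 0
    penalty = bias.sum(axis=1, keepdims=True)  # per-class ambiguity
    logits = x_sem - penalty                   # reduce the score map
    e = np.exp(logits - logits.max(axis=0, keepdims=True))
    return e / e.sum(axis=0, keepdims=True)    # softmax over classes

# Two classes, three pixels; pixel 2 activates both classes equally.
x_sem = np.array([[2.0, 0.0, 1.0],
                  [0.0, 2.0, 1.0]])
scores = key_feature_pool(x_sem)
```

Pixels whose activation is distinctive for one class (the first two columns) keep a clearly dominant score after pooling; the ambiguous third pixel does not.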
Preferably, in step S3, the present invention computes a weighted sum of the prediction weight matrix x'_sem and the feature map x_f to obtain the key feature v_k:

v_k = Σ_i x'^i_sem ⊙ x_f (6)

where i denotes the prediction category. The merged key feature v_k is taken as the Key and the feature x_f as the Query; regions similar to the key feature v_k are highlighted to obtain an attention map, which is convolved with the original image to predict the final refined prediction mask.
For step S4, the present invention introduces a gated convolution mechanism to improve the residual convolution blocks, learning to identify damaged regions and dynamically selecting the valid pixel content of the image, so that the convolution result depends only on valid pixels; it replaces the conventional residual convolution structure for feature extraction and integration over valid regions. The output of the gated convolution is calculated as:

O_{y,x} = φ(W_f · I) ⊙ σ(W_g · I) (7)

In formula (7), I denotes the input feature, W_g and W_f two different convolution kernels, φ the LeakyReLU activation function, and σ the sigmoid function, which restricts all gating values to [0, 1] to indicate the importance of each local region; ⊙ denotes element-wise multiplication, and O_{y,x} the output feature weighted by the soft gate.
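A minimal sketch of the gated-convolution output in formula (7). To keep it short, the two "convolutions" are 1 × 1 channel-mixing matrix multiplications; a real gated convolution uses spatial kernels, and all names here are illustrative.

```python
import numpy as np

def leaky_relu(x, alpha=0.2):
    return np.where(x > 0, x, alpha * x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_conv(features, w_feat, w_gate):
    """O = LeakyReLU(W_f . I) * sigmoid(W_g . I)  (formula (7)).

    features: (c, h, w); w_feat, w_gate: (c_out, c).  The sigmoid gate
    soft-selects which spatial positions contribute to the output.
    """
    c, h, w = features.shape
    flat = features.reshape(c, -1)
    feat = leaky_relu(w_feat @ flat)
    gate = sigmoid(w_gate @ flat)      # soft mask in [0, 1]
    return (feat * gate).reshape(-1, h, w)

features = np.ones((1, 2, 2))
w_feat = np.array([[1.0]])
out_open = gated_conv(features, w_feat, np.array([[100.0]]))    # gate ~ 1
out_closed = gated_conv(features, w_feat, np.array([[-100.0]])) # gate ~ 0
```

Driving the gate logits strongly positive passes the feature through essentially unchanged; driving them strongly negative suppresses it, which is how the block ignores damaged pixels.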
Preferably, in step S4, to avoid errors accumulated in the prediction mask affecting the image restoration result, the invention applies a new Probability Context Normalization (PCN) at the end of the improved residual block to transfer statistical information, propagating statistics such as the mean and variance of the valid pixel region to the damaged region so that the feature distributions inside and outside the region stay consistent:

PCN(X) = [β · T(X_P, X_Q) + (1 − β) · X_P] ⊙ H + X ⊙ (1 − H) (8)

In formula (8), X denotes the output of the last convolution layer in the gated residual block, H the prediction mask downsampled to the same size as X, and β a learnable channel attention weight; T(·, ·) denotes the information transfer, defined as:

T(X_P, X_Q) = σ(X_Q) · (X_P − μ(X_P)) / σ(X_P) + μ(X_Q) (9)

In formula (9), X_P and X_Q denote the contaminated region and the valid pixel region respectively, μ(·) the regional mean, and σ(·) the regional standard deviation. For an image, the feature mean relates to the global semantics, while the variance relates to the local texture features.
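A sketch of the statistic transfer: the damaged region's features are renormalized with the valid region's mean and standard deviation, then blended back under the mask. β is a learnable channel weight in the network; here it is a fixed scalar, and all function names are illustrative.

```python
import numpy as np

def transfer_stats(x_damaged, x_valid, eps=1e-6):
    """Formula (9): give the damaged-region features the valid
    region's mean and standard deviation."""
    mu_p, std_p = x_damaged.mean(), x_damaged.std()
    mu_q, std_q = x_valid.mean(), x_valid.std()
    return std_q * (x_damaged - mu_p) / (std_p + eps) + mu_q

def pcn(x, mask, beta=0.5):
    """Probability Context Normalization sketch (formula (8)).

    Blends transferred statistics into the damaged region (mask == 1)
    with weight beta; valid pixels pass through unchanged.
    ASSUMPTION: beta is a scalar here; the patent learns it per channel.
    """
    out = x.copy()
    p, q = x[mask == 1], x[mask == 0]
    out[mask == 1] = beta * transfer_stats(p, q) + (1 - beta) * p
    return out

rng = np.random.default_rng(1)
x = rng.normal(size=(8, 8))
mask = np.zeros((8, 8))
mask[2:5, 2:5] = 1
out = pcn(x, mask, beta=1.0)
```

With β = 1 the damaged region's mean matches the valid region's mean exactly, which is the "consistent distribution inside and outside the region" property the patent aims for.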
For step S5, the present invention obtains image context information with a multi-scale context attention aggregation branch. Context similarity is measured by the cosine similarity between patches inside and outside the missing region: for each patch of the region to be completed, the most similar content in the valid region is found and given a higher reference weight, so that the completed content stays consistent with its context in both semantics and texture. The similarity metric is:

sim(p_i, p_j) = ⟨ p_i / ‖p_i‖, p_j / ‖p_j‖ ⟩ (10)

In formula (10), p_i and p_j denote feature patches of the valid and missing regions respectively. The attention score of each patch is then obtained through a softmax function:

s_{i,j} = exp(sim(p_i, p_j)) / Σ_{k=1}^{N} exp(sim(p_k, p_j)) (11)

In formula (11), N denotes the number of patches into which the valid region is divided. Through this calculation, each patch in the missing region finds the region of valid pixels most worth attending to, and a higher reference weight is assigned during feature fusion.
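The scoring in formulas (10) and (11) can be sketched as follows: normalize the flattened patches, take inner products, and softmax over the valid patches for each missing patch. Names and toy data are illustrative.

```python
import numpy as np

def attention_scores(valid_patches, missing_patches):
    """Cosine similarity (formula (10)) softmaxed over valid patches
    (formula (11)).

    valid_patches: (N, d), missing_patches: (M, d) flattened patches.
    Returns an (M, N) score matrix whose rows sum to 1.
    """
    v = valid_patches / (np.linalg.norm(valid_patches, axis=1, keepdims=True) + 1e-8)
    m = missing_patches / (np.linalg.norm(missing_patches, axis=1, keepdims=True) + 1e-8)
    sim = m @ v.T                               # formula (10)
    e = np.exp(sim - sim.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)     # formula (11)

valid = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
missing = np.array([[2.0, 0.0]])   # same direction as valid patch 0
s = attention_scores(valid, missing)
```

Because cosine similarity ignores magnitude, the missing patch `[2, 0]` attends most strongly to the valid patch `[1, 0]`, its co-directional match.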
Preferably, in step S5, to reduce the amount of computation and increase inference speed, the invention propagates the inter-patch attention similarity scores by context information transfer. Specifically, the similarity scores are computed once on the deep feature map of size 32 × 32 and then propagated by context attention transfer to shallower layers of different scales for feature weighting:

p̂^l_j = Σ_{i=1}^{N} s_{i,j} · p^l_i (12)

In formula (12), l denotes the different shallow network layers, p^l_j the missing-region patch at a given scale, p^l_i the valid-region patch of corresponding size, s_{i,j} the attention score, and N the number of patches in the background. Since the feature-map size varies across the hierarchy, the patch size must also vary accordingly; concretely, the mapping region is enlarged by comparing the current feature size with the attention score map, so that, for example, every four neighboring pixels in a 128 × 128 feature map share an attention score value. Through this score-sharing scheme, the model's inference result not only achieves better global semantic consistency but also markedly improves storage and computation efficiency.
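The score-sharing step amounts to a nearest-neighbour upsampling of the coarse attention map, so that a block of neighbouring pixels at a finer scale reuses one coarse score value. A minimal sketch, with an illustrative function name:

```python
import numpy as np

def share_attention_scores(scores, factor):
    """Propagate a coarse attention-score map to a finer level by
    nearest-neighbour repetition (context attention transfer sketch):
    each `factor x factor` block at the fine scale reuses one coarse
    score value, so the similarity is computed only once."""
    return np.repeat(np.repeat(scores, factor, axis=0), factor, axis=1)

coarse = np.array([[0.1, 0.9],
                   [0.7, 0.3]])
fine = share_attention_scores(coarse, 2)  # 4x4 map of shared scores
```

Scaling a 32 × 32 score map up to a 128 × 128 feature map would use `factor=4` in the same way; only one similarity computation is ever performed.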
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210574618.3A CN114897738A (en) | 2022-05-25 | 2022-05-25 | Image blind restoration method based on semantic inconsistency detection |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114897738A true CN114897738A (en) | 2022-08-12 |
Family
ID=82725567
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210574618.3A Pending CN114897738A (en) | 2022-05-25 | 2022-05-25 | Image blind restoration method based on semantic inconsistency detection |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114897738A (en) |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110942439A (en) * | 2019-12-05 | 2020-03-31 | 北京华恒盛世科技有限公司 | Image restoration and enhancement method based on satellite picture defects |
CN110942439B (en) * | 2019-12-05 | 2023-09-19 | 北京华恒盛世科技有限公司 | Image restoration and enhancement method based on satellite picture defects |
US20230130772A1 (en) * | 2021-10-22 | 2023-04-27 | Suresoft Technologies Inc. | Method for Selecting the Last Patch from Among a Plurality Patches for Same Location and the Last Patch Selection Module |
US11822915B2 (en) * | 2021-10-22 | 2023-11-21 | Suresoft Technologies Inc. | Method for selecting the last patch from among a plurality patches for same location and the last patch selection module |
CN116705642A (en) * | 2023-08-02 | 2023-09-05 | 西安邮电大学 | Method and system for detecting silver plating defect of semiconductor lead frame and electronic equipment |
CN116705642B (en) * | 2023-08-02 | 2024-01-19 | 西安邮电大学 | Method and system for detecting silver plating defect of semiconductor lead frame and electronic equipment |
CN117376632A (en) * | 2023-12-06 | 2024-01-09 | 中国信息通信研究院 | Data recovery method and system based on intelligent depth synthesis |
CN117376632B (en) * | 2023-12-06 | 2024-02-06 | 中国信息通信研究院 | Data recovery method and system based on intelligent depth synthesis |
Legal Events

Date | Code | Title | Description
---|---|---|---
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||